A survey of data modeling

Submitted by Larry on 19 August 2025 - 12:30pm

There are many different ways of modeling data. They all have their place, and all have places where they are a poor fit.

The spectrum of options below is defined mainly by the degree to which they differentiate between read and write models, and correspondingly how powerful-but-also-complex they are. "Model" in this case usually corresponds to a class, or a class with one or more composed classes.

Varieties of data modeling

Arbitrary SQL

In this case, there is no formal data definition beyond the SQL (or other database) schema. The application just runs arbitrary SQL queries, both read and write, wherever it sees fit.

In a slightly better variant, SQL queries are all confined to selected objects that act as an API to the rest of the application. Arbitrary code does not call SQL, but it can call a method on this object that will call SQL.

The SQL could be hand-crafted, use a query builder of one kind or another, or a little of each.
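
As an illustration of the "confined" variant, here is a minimal sketch (TypeScript here and in the sketches that follow, with a hypothetical db.query() client standing in for whatever driver is actually in use):

```typescript
// Hypothetical driver interface; any client with a query() method fits.
interface DbClient {
  query(sql: string, params?: unknown[]): Promise<Record<string, unknown>[]>;
}

// All product SQL lives here. The rest of the application calls these
// methods and never writes SQL itself.
class ProductGateway {
  constructor(private db: DbClient) {}

  findCheaperThan(maxPrice: number): Promise<Record<string, unknown>[]> {
    return this.db.query(
      'SELECT id, name, price FROM products WHERE price < $1',
      [maxPrice],
    );
  }

  async rename(id: number, name: string): Promise<void> {
    await this.db.query(
      'UPDATE products SET name = $1 WHERE id = $2',
      [name, id],
    );
  }
}
```

Even that thin layer gives you one place to look when a query misbehaves, which is most of its value.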

This approach may work at a very small scale, where building something more formal isn't worth the effort. However, the tipping point where it is worth the effort comes very, very early.

CRUD

The most widely used approach is known as "Create Read Update Delete" (CRUD). Those are the four standard operations. In this case, the system models a series of data objects called Entities. While technically Entities do not need to correspond 1:1 to a particular database table, in practice that is often the case. An entity could also have dependent tables, the details of which are mostly hidden.

CRUD is usually managed by an ORM, or Object-Relational Mapper. The ORM attempts to hide all SQL logic from the user, providing a consistent interface pattern. A user Reads (loads) an Entity by ID, possibly Updates it (edits some value), and then saves it back to the database. The user only interacts with the Entity object.

There are two main variants of ORM: Active Record, in which the Entity object has direct access to the database connection to load and save itself, and Data Mapper, in which the Entity is ignorant of its storage and a separate service (a mapper, or repository, or various other names) is responsible for the loading and saving. Active Record is often easier to implement from scratch, so it is popular with RAD-oriented tools (like Ruby on Rails or Laravel). It is, however, a vastly inferior design as it severely hinders testing, encapsulation, and more advanced cases. The effort to set up a Data Mapper is almost always worth it, as the effort is not substantially higher for a skilled developer.
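
The difference is easiest to see side by side; a sketch, with illustrative names and the same hypothetical db client as before:

```typescript
interface DbClient {
  query(sql: string, params?: unknown[]): Promise<Record<string, unknown>[]>;
}

// Active Record: the Entity knows about the database and saves itself.
class ActiveRecordProduct {
  constructor(private db: DbClient, public id: number, public name: string) {}

  async save(): Promise<void> {
    await this.db.query(
      'UPDATE products SET name = $1 WHERE id = $2',
      [this.name, this.id],
    );
  }
}

// Data Mapper: the Entity is plain data, ignorant of its storage...
class Product {
  constructor(public id: number, public name: string) {}
}

// ...and a separate repository does all loading and saving, so the
// Entity can be constructed and tested without a database in sight.
class ProductRepository {
  constructor(private db: DbClient) {}

  async find(id: number): Promise<Product> {
    const rows = await this.db.query(
      'SELECT id, name FROM products WHERE id = $1',
      [id],
    );
    return new Product(Number(rows[0].id), String(rows[0].name));
  }

  async save(product: Product): Promise<void> {
    await this.db.query(
      'UPDATE products SET name = $1 WHERE id = $2',
      [product.name, product.id],
    );
  }
}
```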

CRUD falls down in three key areas:

  • It assumes the "read model" and "write model" are the same. While in simple cases that may be true, they often have different validation requirements. For example, "last updated" or "last login time" are likely fields that are not needed on the write model, as the system manages them directly; they are either absent or optional. On the read model, however, we expect them to be always present. That difference cannot be easily captured in a single unified object. (Some workarounds do exist, but they are workarounds only.)
  • Relationships. A key value of SQL is data being "relational." That is, Entity A may have a "contains" or "uses" or "is parent of" relationship with Entity B, or with another Entity A. In SQL, this is almost always captured using a foreign key, and many-to-many relationships are captured with an extra join table. Mapping that into objects is often difficult, especially for complex data where an Entity spans multiple tables. It can also lead to severe performance problems, especially the "SELECT N+1 Problem," in which a series of Entity A objects are loaded, then as they are used each one lazily loads its related Entity B, resulting in "N+1" queries (see the sketch after this list).
  • Listing and Querying. SQL is very good at building searches across arbitrary data fields. That's what it was built for. Object models frequently are not. They are fine for straight read/write operations, but less so for "find all products that cost at least $100 bought by a customer over the age of 50 in the last 3 months." That generally requires either dropping down to manual SQL to get a list of entity IDs, then doing a bulk-read on them, or a complex query builder syntax (either method calls or a custom string syntax) that translates high-level object relationships into low-level SQL ones. Tools like Hibernate (Java) or Doctrine ORM (PHP) take the latter approach, which is one reason they are so large and complex.
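
To make the N+1 problem concrete, here is a sketch with a hypothetical lazy-loading repository; listing 100 orders and printing each customer's name issues 101 queries:

```typescript
interface Customer {
  id: number;
  name: string;
}

interface Order {
  id: number;
  // Lazy relation: runs "SELECT ... FROM customers WHERE id = ?" on first call.
  customer(): Promise<Customer>;
}

interface OrderRepository {
  // One query: "SELECT ... FROM orders ORDER BY created DESC LIMIT ?"
  findRecent(limit: number): Promise<Order[]>;
}

async function listRecentOrders(repo: OrderRepository): Promise<void> {
  const orders = await repo.findRecent(100); // query #1
  for (const order of orders) {
    const customer = await order.customer(); // queries #2 through #101
    console.log(order.id, customer.name);
  }
  // Total: 101 round trips, where a single JOIN would have returned it all.
}
```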

An ORM in concept also does not offer any native way to create compound views, showing a subset of fields from 3 different related entities, for example. Some ORMs provide a mechanism of some sort, but rarely are they as capable or efficient as just writing SQL.

The impedance mismatch between object models and relational models has been called "The Vietnam of Computer Science" (Ted Neward's phrase): the deeper you commit to bridging it, the worse the quagmire gets. Simple ORMs are straightforward to build, but have an upper bound on complexity before they become too unwieldy.

CRAP

There is a variant of CRUD known as Create Read Archive Purge (CRAP), which does not get anywhere near as much use as it should. In this approach, each Entity is not updated in place when modified. Instead, an entirely new copy of the Entity is stored in the database, along with some version identifier. That gives each Entity a history of its state over time, with a built-in ability to review that history and revert to an earlier state.

No Entity is deleted; if an entity needs to be deleted, a new version of it is saved that has a "deleted" flag set to true. Any SQL that interacts with the Entity must then be written to exclude older versions and deleted versions, unless specifically instructed not to.
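
A sketch of what that bookkeeping looks like, assuming a hypothetical pages table with id, revision, and deleted columns:

```typescript
interface DbClient {
  query(sql: string, params?: unknown[]): Promise<Record<string, unknown>[]>;
}

class PageGateway {
  constructor(private db: DbClient) {}

  // Every "current state" read must exclude superseded revisions
  // and soft-deleted entities.
  loadCurrent(id: number): Promise<Record<string, unknown>[]> {
    return this.db.query(
      `SELECT id, revision, title, body
         FROM pages p
        WHERE p.id = $1
          AND p.deleted = false
          AND p.revision = (SELECT MAX(revision) FROM pages WHERE id = p.id)`,
      [id],
    );
  }

  // "Deleting" just writes a new revision with the deleted flag set.
  async delete(id: number): Promise<void> {
    await this.db.query(
      `INSERT INTO pages (id, revision, deleted, title, body)
       SELECT id, revision + 1, true, title, body
         FROM pages p
        WHERE p.id = $1
          AND p.revision = (SELECT MAX(revision) FROM pages WHERE id = p.id)`,
      [id],
    );
  }
}
```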

If the historical data of a given Entity is no longer valuable, or is not valuable after a period of time, a separate Purge command can remove old revisions, including removing deleted entities entirely. The time frame for such purges and whether they can be user-triggered varies with the implementation.

The advantage is, of course, the history and rollback ability. It's also relatively easy to extend it to include forward revisions, which are revisions that will become the active revision at some point in the future (either upon editorial approval or some time trigger).

The downside is the extra tracking required, which means every bit of SQL that interacts with a CRAP Entity needs to be aware of its CRAPpiness. Writing arbitrary custom SQL becomes more problematic in this case, as a query that forgets to account for old revisions or deleted entities could return unexpected data. That is especially true with more complex relationships. It also raises questions like "should Entity A getting a new revision cause Entity B to get a new revision, too? Should Entity A point to Entity B, or to a specific revision of Entity B?" All possible answers to those questions are valid in some situations but not others. There may also be performance considerations if there are many revisions of many Entities, although that is a solvable problem with smart database design.

Nonetheless, I would argue CRAP is still superior to CRUD in most editorial-centric environments (news websites, company sites, etc.).

Projections

An extension available to both CRUD and CRAP is Projections. Usually Projections are discussed in the context of CQRS or Event Sourcing (see below), but there's no requirement that they only be used there.

A Projection is the fancy name for stored data that is derived from other stored data. When the primary data is updated, an automated process causes the projection to be updated as well. That automation could be in application logic or SQL triggers/stored procedures; I would even consider an SQL View (either virtual or materialized) to be a form of Projection.

Projections are useful when you want the read version of the data structured very differently than the write version, or want it presented in some way that is expensive to compute on-the-fly.

For example, if you want a list of all sales people, their weekly sales numbers, the percentage change from last week, ordered by sales numbers, that could be expensive to compute on the fly. It could also be complex, if that data has to be derived from individual sale records and those sale records are spread across multiple tables, and sales team information is similarly well normalized across multiple tables. Instead, either on a schedule or whenever a sale record is updated, some process can compute that data (either the whole table or just update the one record it needs to) and save it to a sales_leaderboard table. Viewing that information is then a super simple, super fast single-table SELECT query.
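
A sketch of that flow, with hypothetical table and column names and PostgreSQL-flavored SQL:

```typescript
interface DbClient {
  query(sql: string, params?: unknown[]): Promise<Record<string, unknown>[]>;
}

// Recompute the leaderboard projection from the primary sales data.
// Run on a schedule or whenever a sale record changes; because the
// table is derived, it can be wiped and rebuilt at any time.
async function rebuildSalesLeaderboard(db: DbClient): Promise<void> {
  await db.query('DELETE FROM sales_leaderboard');
  await db.query(
    `INSERT INTO sales_leaderboard (person_id, week_total, prior_week_total)
     SELECT person_id,
            SUM(amount) FILTER (WHERE sold_at >= now() - interval '7 days'),
            SUM(amount) FILTER (WHERE sold_at >= now() - interval '14 days'
                                  AND sold_at <  now() - interval '7 days')
       FROM sales
      GROUP BY person_id`,
  );
}

// Reading the leaderboard is then a trivial single-table query.
function readLeaderboard(db: DbClient): Promise<Record<string, unknown>[]> {
  return db.query(
    'SELECT * FROM sales_leaderboard ORDER BY week_total DESC',
  );
}
```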

If that table ever becomes corrupted or out of date, or we just want to change its structure, the data can just be wiped and rebuilt from the existing primary data. Projections are always expendable. If not, they're not Projections.

A system can use as many or as few Projections as needed, built in a variety of ways. As usual, there's more than one way to feed a cat. If heavily used, Projections form essentially the entire read model. There's no need to read Entities from the primary data, except for update purposes.

Technically, any search index (Elasticsearch, Solr, Meilisearch, etc.) is a Projection. There is no requirement that the Projection even be in SQL, just that it is expendable, rebuildable data in a form that is optimized for how it's going to be read.

CQRS

The next level in read/write separation is Command Query Responsibility Segregation (CQRS). CQRS works from the assumption that the read and write models are always separate.

Often, though not always, the write models are structured as command objects rather than as an Entity per se. That could be low-level (UpdateProduct command with the fields to change) or high-level (ApprovePost with a post ID).
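
A sketch of both flavors of command, with illustrative names:

```typescript
// Low-level command: carries only the fields to change.
interface UpdateProduct {
  kind: 'UpdateProduct';
  productId: number;
  changes: { name?: string; price?: number };
}

// High-level command: expresses an intent, not a set of field edits.
interface ApprovePost {
  kind: 'ApprovePost';
  postId: number;
  approverId: number;
}

type Command = UpdateProduct | ApprovePost;

// A handler interprets each command. How it writes (one table,
// several tables, an event store) is entirely its own business.
async function handle(command: Command): Promise<void> {
  switch (command.kind) {
    case 'UpdateProduct':
      // e.g. UPDATE products SET ... WHERE id = command.productId
      break;
    case 'ApprovePost':
      // May touch the posts, workflow, and notification tables at once.
      break;
  }
}
```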

The read models could be structured in an Entity-like way, but do not have to be. CQRS does not require using Projections, though they do fit well.

The advantage of CQRS is, of course, the flexibility that comes with having fully independent read and write models. That allows using the type system to enforce write invariants while having completely separate immutable read models. It also allows separating both reads and writes from the underlying Entity definitions; a single update command may impact multiple entities, and a read/lookup can easily span entities.

The downside of CQRS is the added complexity that keeping track of separate read and write models entails. It requires great care to ensure you don't end up with a disjointed mess. Martin Fowler recommends only using it within one Bounded Context rather than the system as a whole (though he does not go into detail about what that means). If the read and write models are "close enough," CRUD with an occasional Projection may have less conceptual overhead to manage.

Event Sourcing

The most aggressive separation between read and write models is Event Sourcing. In Event Sourcing, there is no stored current-state model at all. The primary data that gets written is a history of "Events" that have happened; the entire data store is just that series of event objects, with some indexing support.

When loading an object (or "Aggregate" in Event Sourcing speak), the relevant Events are loaded from the store and a "current status" object is built on-the-fly and returned. In practice, in a well-designed system this process can be surprisingly fast. The Event stream also acts as a built-in log of all actions taken, ever.
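
A minimal sketch of that reconstitution, using an illustrative bank-account Aggregate:

```typescript
// The only stored data: a stream of things that happened.
type AccountEvent =
  | { kind: 'Opened'; owner: string }
  | { kind: 'Deposited'; amount: number }
  | { kind: 'Withdrawn'; amount: number };

interface Account {
  owner: string;
  balance: number;
}

// "Loading" the Aggregate is a fold over its event history.
function loadAccount(events: AccountEvent[]): Account {
  return events.reduce<Account>(
    (state, event) => {
      switch (event.kind) {
        case 'Opened':
          return { owner: event.owner, balance: 0 };
        case 'Deposited':
          return { ...state, balance: state.balance + event.amount };
        case 'Withdrawn':
          return { ...state, balance: state.balance - event.amount };
      }
    },
    { owner: '', balance: 0 },
  );
}

const account = loadAccount([
  { kind: 'Opened', owner: 'alice' },
  { kind: 'Deposited', amount: 100 },
  { kind: 'Withdrawn', amount: 30 },
]);
console.log(account); // { owner: 'alice', balance: 70 }
```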

Event Sourcing also leans very heavily on Projections. Projections can represent the current state of the system as of the most recent event, in whatever form is desired. Storing an Event can trigger a handler that updates Projections, sends emails, enqueues jobs, or anything else.

Importantly, events can be replayed. Creating a new Projection, for example, requires only writing the routine that builds it and then rerunning the entire Event stream through it. If the Projection's structure changes, migrating its database table is simple: delete the old one, create the new one, rerun the Event stream. Every database table, search index, etc. except the Event stream itself is disposable and can be thrown out and recreated at will.
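
Continuing the account sketch above, a brand-new Projection is just another pass over the same streams; building it for the first time and rebuilding it after a structure change are the same operation:

```typescript
// Same event type as in the previous sketch.
type AccountEvent =
  | { kind: 'Opened'; owner: string }
  | { kind: 'Deposited'; amount: number }
  | { kind: 'Withdrawn'; amount: number };

// A new projection: total deposits per account owner. To (re)build it,
// wipe the target storage and replay every event through this routine.
function projectDepositTotals(
  streams: Map<string, AccountEvent[]>, // owner -> full event history
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const [owner, events] of streams) {
    totals.set(
      owner,
      events.reduce(
        (sum, e) => (e.kind === 'Deposited' ? sum + e.amount : sum),
        0,
      ),
    );
  }
  return totals;
}
```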

The downside is that Event Sourcing, like CQRS, requires careful planning. It's a very different mental model, and not a good fit for all situations. Banking is the classic example of where it fits, and where a history of actions taken is the most important data. A typical editorial CMS, however, would be a generally poor fit for Event Sourcing, as most of what it's doing is very CRUD-ish. Nearly all events would be some variation on PostUpdated.

Depending on the complexity of the data, building reasonable Projections could be a challenge, and reconstituting Entities/Aggregates from the Event stream may be anywhere from trivial to quite involved.

General advice

(This section is, of course, quite subjective.)

All of these models have their trade-offs. For most standard applications, I would argue that CRUD-with-Projections is the least-bad approach. The ecosystem and known best practices are well established. Edge cases where the read and write models need to differ can often be handled as one-offs, if the system is designed with that in mind. That edges it into CQRS space in limited areas, which is helpful, though risky if it becomes a slippery slope.

Even in a CRUD-based approach, it's possible to have slightly different objects for read and write. If the language supports it, the read objects can be immutable, while the write objects are mutable aside from select key fields (primary key, last-updated timestamp, etc.), which may even be omitted. The line between this split-CRUD approach and CQRS is somewhat fuzzy, though, so be mindful that you don't over-engineer CRUD when you should just use CQRS.
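
In a language with the right features, that split can be encoded directly in the types; a sketch with illustrative field names (note that TypeScript's readonly is compile-time only):

```typescript
// Read model: fully populated and immutable.
interface ProductView {
  readonly id: number;
  readonly name: string;
  readonly price: number;
  readonly lastUpdated: Date; // always present on read
}

// Write model: mutable business fields only. System-managed fields
// (id, lastUpdated) are simply absent, so they cannot be set by hand.
interface ProductEdit {
  name: string;
  price: number;
}
```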

For workflow-heavy applications (like change-approval, or scheduled publishing, etc.), CRAP is likely worth the effort. The ability to have forward and backward revisions greatly simplifies many workflow approaches, and provides a nice audit trail.

Regardless of the approach chosen, it is virtually always worth the effort to define formal, well-typed data objects in your application to represent the models. Using anonymous objects, hashes, or arrays (depending on the language) is almost always going to cause maintenance issues sooner rather than later. Even if using CQRS, or just queries that bypass a CRUD ORM, every set of records read from the database should be mapped into a well-defined, typed object. That is inherently self-documenting, eliminates (or at least highlights as needing attention) many edge cases, provides a common, central place for in-memory handling of those edge cases (e.g., null handling), and so forth.
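
A sketch of that mapping, with an illustrative Customer model:

```typescript
// A raw row from the driver: stringly-typed and easy to misuse.
type Row = Record<string, unknown>;

class Customer {
  constructor(
    public readonly id: number,
    public readonly name: string,
    // Edge cases are handled once, here, not at every call site.
    public readonly email: string | null,
  ) {}
}

// One central place turns rows into well-typed objects.
function customerFromRow(row: Row): Customer {
  return new Customer(
    Number(row.id),
    String(row.name),
    row.email == null ? null : String(row.email),
  );
}
```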

Additionally, any database interaction should be confined to select, dedicated services that have exclusive responsibility for interacting with the database and turning results into proper model objects. This is true regardless of the model used.

If doing CRUD or CQRS, it may be tempting to optimize updates to touch only the individual fields that need updating, rather than writing an entire Entity at once, unnecessary fields included. I would argue that, in most cases, this is a waste of effort. Modern SQL databases are quite fast and almost certainly smarter than you are when it comes to performance. If you are using a well-established ORM that already does that, it's fine, but if rolling your own, the effort involved is rarely worth it. At that point, you're halfway to merging CRUD and CQRS commands anyway.