ORMs vs. Query Builders: Database portability

There has been some discussion in recent days regarding Object-Relational Mappers (ORMs), Drupal, and why the latter doesn't use the former. There are, actually, many reasons for that, and for why Drupal doesn't do more with the Active Record pattern.

Rather than tuck such discussion away in an issue queue, I figured it better to document a bit more widely.

On ORMs in Drupal

I highly recommend that people read The Vietnam of Computer Science. It does an extremely good job of laying out why OO<->SQL mapping is a fundamentally intractable problem. We can do some of it, in some cases, but it very quickly becomes very ugly. The only viable solution is to rely on SQL as the primary system and treat the object structure as a secondary add-on, or to treat the object structure as the primary system and the data persistence system as an unstructured data store.

Because of its stateless nature, PHP doesn't lend itself well to the latter method. Anything done in-memory needs to be completely reinitialized on every page request. That means you need to move the entire data storage and searching (never forget searching/querying/slicing/dicing!) into a separate persistent storage engine akin to an SQL database. Such systems do exist, but have three important problems where our usage is concerned:

  1. They're still relatively immature and have not stabilized, and most have not proven themselves capable of scaling both very large and very small.
  2. They are not at all standardized between vendors and implementations, meaning we would need to pick one and hope that it would be the winning strategy later on.
  3. They are not universally available, and universal availability is a key goal for Drupal.

In contrast, it is difficult to find a web host that doesn't offer MySQL. PostgreSQL is available on many, and anyone running their own server can install it. With PHP 5.2, it takes effort to not have SQLite available. While there are big differences between SQL implementations, they are far more standardized than object databases are, especially if you include the new wave of "naive" databases. The available tools to work with them (both PHP-based tools like PDO and user-space tools like desktop GUIs) are mature, robust, and widely available. SQL is also a very widely available skill set. You can pick it up anywhere (to varying degrees of competency, of course), and any CS program in the world worth its salt includes at least one class in some SQL dialect.

So for now and for the foreseeable future, favoring relational data over object data for PHP, and Drupal in particular, seems the better strategy than gutting and rewriting Drupal entirely for CouchDB, Bigtable, Oracle-specific object stores, or whatever else.

Given that a relational database is still the most sensible primary storage system for Drupal to use, we should make it as easy as possible to leverage, but no easier, lest we lose power in the process. That fits with the term Rasmus Lerdorf coined for PHP's architecture: "Leaky Abstraction". We don't want to prevent people from accessing the data store directly, which means giving them SQL access so they can do things we didn't think of when creating the API. That precludes creating our own heavy ORM, as that's yet another Drupal-specific API people would need to learn.

While yes, it may be possible to create a Drupal-specific data access system that is easier to learn than SQL, there would be about 10 people who understood it and a few hundred who knew how to use it. Compare that to the hundreds of thousands of people who already know SQL well enough to hit the ground running in Drupal. I'd like to use more conventional and standard approaches, patterns, and techniques in Drupal, not fewer.

There's also the challenge of creating such an abstraction in the first place. Currently, the trend in Drupal is toward a unified API and structure for primary data entities (nodes, users, etc.) that forces data into a very specific structure: fields. Think "CCK everywhere". Now look at the hundreds of hours that have gone into CCK and Fields. Now look at how messy the code for CCK is in Drupal 6. Now look at how the API breaks in so many places already when you do unconventional things. Now look at how there are already massive performance issues that haven't been resolved yet, except with the cop-out of "uh, use caching". And the people working on this problem are really really smart people. It's really hard to do well, and that's just for a small subset of the larger ORM problem. Multiply all of those issues by 10 and you approach the challenge that is making Drupal all-ORM without someone else solving the problem for us already.

We did briefly consider Doctrine early in the DBTNG development life cycle. (More specifically, some of the Doctrine folks tried for a few weeks to sell us on just using Doctrine.) We decided against it on the grounds that Doctrine was way too big for us to include (I think someone said the code base was larger than core, but I'm not sure if that's true), it was yet another custom string-based pseudo-language that people would have to learn (whereas basic SQL is a widespread skill), and it didn't give us the flexibility to deal with non-primary-data entities, whereas SQL allows all kinds of unexpected flexibility. DBTNG, by contrast, is very thin and flexible, and, for someone who already knows SQL, almost trivial to pick up even with the query builders.
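For a sense of that thinness, here is roughly what the two styles look like side by side, in Drupal 7 syntax (a sketch only; the table and field names are illustrative):

```php
<?php
// Raw SQL with named placeholders; {node} is expanded to the real,
// possibly prefixed, table name.
$result = db_query('SELECT nid, title FROM {node} WHERE status = :status',
  array(':status' => 1));

// The same query through the dynamic query builder.
$result = db_select('node', 'n')
  ->fields('n', array('nid', 'title'))
  ->condition('status', 1)
  ->execute();
```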

Every abstraction layer adds a performance cost, too. "Just add more abstraction" is not always a viable answer, as there are costs to any abstraction.

The DBTNG approach

When I started working on Drupal 7's new database layer, back before it was called "DBTNG", my goal was to make writing secure, portable SQL easier, not easy :-), and provide a lot of new features for SQL writers. Although SQL is far more standardized than object databases are, it's still an iffy standard at best. Imagine trying to write a definitive grammar parser for English, in all its variants: British, Scottish, New Zealander, Outback Australian, plus the dozen or so distinct North American versions. (Really, can anyone understand what someone from Brooklyn is saying? :-) ) That's what writing a truly vendor-agnostic SQL abstraction layer is like, except that you can't rely on the human brain's ability to guess what is meant by context. SQL servers don't do fuzzy parsing. Doing that at runtime in interpreted code like PHP is a recipe for performance disaster.

PDO takes care of a lot of the abstraction for us, by providing a layer, in compiled C code, where differences in prepared statements are normalized and abstracted. Of course, even that is imperfect at times as we see from the need to provide alternate handling for BLOB and LOB data types in PostgreSQL and Oracle, or provide our own user-space buffering for SQLite.
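As a minimal sketch of what PDO gives us: the same prepared-statement code runs unchanged across drivers, and only the DSN changes (an in-memory SQLite database is used here purely for illustration).

```php
<?php
// PDO normalizes named-placeholder prepared statements across drivers,
// in compiled C code, so this would work unchanged against MySQL or
// PostgreSQL by swapping the DSN string.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT)');

$insert = $db->prepare('INSERT INTO node (nid, title) VALUES (:nid, :title)');
$insert->execute(array(':nid' => 1, ':title' => 'Hello'));

$select = $db->prepare('SELECT title FROM node WHERE nid = :nid');
$select->execute(array(':nid' => 1));
echo $select->fetchColumn(); // prints "Hello"
```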

The only viable solution that does not involve runtime parsing and re-creation of SQL strings (which as noted above is slow, error-prone, and extremely hard to do) is to require developers to provide a pre-parsed data structure that can be customized for each database. That's what the query builders are for, as they describe the SQL behavior in structured terms that are then fairly easily compiled to each target database. In all modesty, I think we did a far better job there than I ever expected us to on the first try. (OK, there were several iterations, but this approach has been fairly stable for over a year now.) But even those introduce overhead. SelectQuery-based queries are more verbose to write and slower to execute than just providing an SQL string, because they have to be built up (a dozen or more method calls) and then compiled before being executed. Sometimes that is a very good trade-off; other times it is not.
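To make the "pre-parsed data structure" idea concrete, here is a toy sketch, emphatically not DBTNG itself, of a query held as structure until compile time so each driver can emit its own dialect (the class, driver names, and the LIMIT/TOP difference are all just illustrative):

```php
<?php
// Toy illustration only: the query lives as a structure, and only
// compile() turns it into a driver-specific SQL string.
class ToySelect {
    private $table;
    private $fields = array();
    private $conditions = array();
    private $range = NULL;

    public function __construct($table) { $this->table = $table; }
    public function fields(array $fields) { $this->fields = $fields; return $this; }
    public function condition($field, $placeholder) {
        $this->conditions[] = "$field = :$placeholder";
        return $this;
    }
    public function range($count) { $this->range = $count; return $this; }

    // Compile the structure for a given target database.
    public function compile($driver) {
        $fields = implode(', ', $this->fields);
        $top = ($driver === 'sqlsrv' && $this->range) ? "TOP {$this->range} " : '';
        $sql = "SELECT {$top}{$fields} FROM {$this->table}";
        if ($this->conditions) {
            $sql .= ' WHERE ' . implode(' AND ', $this->conditions);
        }
        if ($this->range && $driver !== 'sqlsrv') {
            $sql .= " LIMIT {$this->range}";
        }
        return $sql;
    }
}

$query = new ToySelect('node');
$query->fields(array('nid', 'title'))->condition('status', 'status')->range(10);
echo $query->compile('mysql') . "\n";
// SELECT nid, title FROM node WHERE status = :status LIMIT 10
echo $query->compile('sqlsrv') . "\n";
// SELECT TOP 10 nid, title FROM node WHERE status = :status
```

Note that all the cost is in the build-up and compile steps; that is exactly the overhead described above.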

Of course, that means there are many use cases for which we do not have a solution. Not all queries are valid in all SQL dialects. The only ways to ensure that such queries work cross-database are:

  1. Parse all query strings and mutate them into DB-specific SQL. As discussed above, this is simply infeasible due to both performance concerns and the sheer effort involved. Some databases have concepts that simply do not exist in other databases.
  2. Require every single database query to go through a query builder so that we have a pre-parsed data structure to work with. We do this already for modifier queries, but that's pushing it. Select queries are much more common, and the performance cost there is much higher. And even then, some mappings would be non-trivial. Plus, writing a built Select query is indisputably more work for the developer than just writing a literal query in the majority of cases. (Those cases where it's not? Well, use the query builder. It's a worthwhile trade-off there.)
  3. Require module developers to write portable SQL where possible. The only performance cost here is on the developer, not at runtime. Most queries are, in fact, reasonably portable. The cost to the developer here, given that PDO handles prepared statements for us, is fairly low. It's only an issue on more complex queries.

The only option that doesn't make Drupal Way Too Slow(tm) is #3: Write portable SQL where possible and submit bug reports where it hasn't been done. That's the approach that Drupal has taken, since doing more would be cost-prohibitive.

If someone manages to come up with a brilliant solution that lets us do #1 without serious performance degradation, I'm all for it. However, I don't see that happening until Drupal 8, if ever. Database vendors need to get their act together and better standardize their SQL dialects.

Some problems we simply cannot solve on our own.

Comments

That fits with the Rasmus

That fits with the Rasmus Lerdorf-coined term for PHP's architecture of "Leaky Abstraction".
Nitpicking, I know, but I think that quote belongs to Joel Spolsky.

Could be

I first heard the term from Rasmus; it's certainly possible that he got it from somewhere else. It looks like Joel is talking about something else, though. Joel is talking about "all abstractions leak, deal". Rasmus' big deal is/was "make your abstractions leaky so that you don't code yourself into a corner". That's why I say we still want/need to allow direct SQL access in Drupal and have an SQL database that is well-normalized, to ensure that we always have the ability to fall back to raw SQL if node_load() just doesn't work for what we're trying to do. (I actually do that rather often myself, for performance reasons.)
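The kind of raw-SQL fallback I mean looks something like this, in Drupal 7 syntax (field names illustrative):

```php
<?php
// Listing published titles directly, instead of node_load()ing every
// node and discarding most of what gets loaded.
$titles = db_query(
  'SELECT nid, title FROM {node} WHERE status = :status',
  array(':status' => 1)
)->fetchAllKeyed();
```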

ORM

The only viable solution is to rely on SQL as the primary system and treat the object structure as a secondary add-on, or to treat the object structure as the primary system and the data persistence system as an unstructured data store.

I think these are two extremes, and Object-Relational Mapping is meant to fall somewhere in the middle. If the DB is allowed to evolve separately from the domain layer, then you don't have to sacrifice your database schema or your rich domain layer. That is the primary goal of an ORM, not abstracting away the DB.

Great write-up

Hi Larry, thanks for this great and detailed write-up.

What would add perspective to this piece is to talk about what other web application frameworks have done.

http://en.wikipedia.org/wiki/List_of_web_application_frameworks

Kieran

Volunteer?

Thank you, Kieran, for volunteering to research how other projects have handled the data question! I look forward to your report.

Active Record vs Data Mapper

I think one problem is that many people's experience with ORM is limited to Active Record implementations. Unfortunately, the more complex and properly normalized your data model is, the less likely Active Record will do what you need.

Data Mapper type ORMs (most notably Hibernate for Java, and SQLAlchemy for Python) offer a great bridge between OO and SQL that can handle just about anything you throw at them, including many-to-many relationships, inheritance chains, and even seamless database sharding. These more advanced ORMs also offer the ability to use hand-written SQL whenever needed, so you can fine-tune queries that need better performance.

I don't know of any established Data Mapper packages for PHP, but at least one is in the works: http://phpdatamapper.com/

Hibernate and SQLAlchemy also offer a full set of DB independent SQL generation functions, but I find that DB independence is a myth and usually not needed anyway (How many projects switch their database?). Most moderately complex databases will have triggers, functions, and sometimes custom data types, and there won't be abstraction layers that can deal with those differences anytime soon.

Nice to hear SQLAlchemy mentioned

I'm a big fan of SQLAlchemy and a little underwhelmed by the endless Active Record implementations out there. SQLAlchemy in particular isn't trying to hide SQL from the developer so much as add some powerful OO tools on top of it; the main one just happens to be an optional ORM (based on the Data Mapper and Unit of Work patterns). It still provides ways of using SQL query builders or even plain old SQL queries if desired.

But I agree that the SQLAlchemy and Hibernate approaches probably wouldn't work well in the stateless world of PHP, at least not without some extra persistence (e.g. long-running in-memory caches) available, and that gets difficult to set up for the average Drupal site.

To me it sounds like Drupal 7 has taken the right approach, and I had a feeling from somewhere that Larry is already familiar with how Hibernate works (corrections welcome of course).

Closures...

Quite liking the idea of using closures for isolating SQL from the rest of the application.

Then the problem is similar to language translation; it's just a matter of picking the right closure for the right database.

What about pre-compiled SQL?

I bet you could achieve something like the #1 option if you provide some kind of pre-compiled mechanism.

What I mean by that is, provide a way to define the SQL, and then have a step when you are setting up Drupal (or a module) which checks which type of database you are using, then compiles all the SQL to actual strings. Thus, you get the speed of direct SQL, but with the flexibility of supporting any number of databases you want.

There might be pain if you switch databases for an existing system, but that could be minimized if the way the SQL is compiled could be re-done with ease.

Just a thought....
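A rough sketch of what that install-time compilation step might look like, with invented names and a single illustrative dialect difference:

```php
<?php
// Hypothetical sketch: every query is defined abstractly once, compiled
// to a literal string for the active driver at install time, and only
// looked up by name at runtime.
function compile_queries($driver) {
    $definitions = array(
        'recent_nodes' => array(
            'base' => 'SELECT nid, title FROM node ORDER BY created DESC',
            'limit' => 10,
        ),
    );
    $compiled = array();
    foreach ($definitions as $name => $def) {
        if ($driver === 'sqlsrv') {
            // SQL Server has no LIMIT clause; rewrite to SELECT TOP n.
            $compiled[$name] = preg_replace('/^SELECT /', "SELECT TOP {$def['limit']} ", $def['base']);
        }
        else {
            $compiled[$name] = $def['base'] . " LIMIT {$def['limit']}";
        }
    }
    // In practice this array would be written to a cache or file at
    // install time, so runtime pays only for an array lookup.
    return $compiled;
}

$queries = compile_queries('mysql');
echo $queries['recent_nodes'] . "\n";
// SELECT nid, title FROM node ORDER BY created DESC LIMIT 10
```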

Option #1

I actually ended up going the route of option #1 with Flourish. I used a combination of parsing strings out of the SQL queries and then regular expressions to manipulate the SQL. Obviously I can't support every feature across all databases, but I have a pretty solid set documented on the Flourish SQL page. Currently MySQL, PostgreSQL, SQLite, Oracle and SQL Server are all supported. I'm hoping to add DB2 support in the future.

Part of the reason I wanted to provide translation was because so many developers are fluent in SQL, and while I did build an ORM on top of the SQL translation, there are plenty of situations where raw SQL is a better fit. I haven't done a whole lot of performance profiling, however with non-naive usage of SQL, there really shouldn't be too many calls per script execution, and the regular expressions should be faster than the actual queries for all but the most trivial of queries.

At some point in the future I'd like to do some work with seeing how I can cache the translated SQL statements to provide additional performance benefits. Flourish encourages a sprintf-style syntax for data escaping, which should allow for fairly simple addition of prepared statements (coming soon) and caching the translated SQL before actual values are bound.
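This is not Flourish's actual code, but the general string-rewriting approach can be sketched as a function like this (the LIMIT-to-TOP rewrite is just one illustrative case):

```php
<?php
// Illustrative only: rewrite a trailing "LIMIT n" into SQL Server's
// "SELECT TOP n" form using regular expressions.
function translate_limit_to_top($sql) {
    if (preg_match('/\sLIMIT\s+(\d+)\s*$/i', $sql, $m)) {
        $sql = preg_replace('/\sLIMIT\s+\d+\s*$/i', '', $sql);
        $sql = preg_replace('/^(\s*SELECT)\b/i', '$1 TOP ' . $m[1], $sql);
    }
    return $sql;
}

echo translate_limit_to_top('SELECT id FROM users LIMIT 5');
// SELECT TOP 5 id FROM users
```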

Good reply...

Hi Crell,
Very good reply. After reading it, I support this "notable quote":

http://www.garfieldtech.com/quotes/118
:)
Andrea