Abstracting databases

There has been some discussion recently, in IRC, issue queues, and blogs, about the Drupal 7 database API and its impact on supporting different database engines. While I am still trying to avoid large amounts of public distraction, especially when we're supposed to be trying to get a Drupal 6 beta 1 out the door, I feel it's important to get a few points cleared up lest they lead to confusion later.

Before we begin, in the interests of disclosure, my own database experience is about 95% MySQL and 5% MS SQL. I last touched Oracle back in grad school, very very briefly, and have never worked with DB2. So, there's my bias.

I am also quite in favor of getting Drupal working on more database backends. That is not the main focus of my work with PDO (richer API potential is), but it is a feature that I wan to enable if at all possible.

Background

First off, we need to understand that there is no one type of database abstraction layer. The best treatment of this subject I've seen is over on Xaprb's blog (geez, people, you can't use your real name on a tech blog?), where he lays out four different types of database layers. In short:

Access libraries
The mysql_, mysqli_, pgsql_, etc. functions all fall into this category. All they do is provide language bindings, really, and are not really an abstraction layer.
Common API libraries
This level provides a common language API to multiple databases, but doesn't abstract database operations (SQL statements) themselves. PDO lives here.
Portable SQL libraries
This is the first layer that provides any sort of SQL abstraction. How much abstraction varies widely. Examples here include PEAR::DB, PEAR::MDB2, ADOdb, and Drupal's existing database functions.
Object-Relational Mappers
I actually disagree with limiting this layer to just ORMs, as Active Record libraries and any other "write logic, not SQL" library falls into this category. The basic idea here is that you do not write SQL. You call ->save() on an object and the SQL just sorta happens for you in whatever way it needs to. Propel, Doctrine, and strict Rails frameworks would all qualify here.

I highly recommend reading Xaprb's blog, as it goes into much more detail about the pros and cons of each method.

Drupal's current database API is a mixture of levels 2 and 3, written using strictly level 1 APIs. That was all well and good in 2000, but not in 2008 (when Drupal 7 is expected to ship).

Goals

The goal of the new API is to take an existing and well-tested level 2 API, PDO, and implement a mixture of levels 3 and 4 on top of it. That provides two main advantages, in order of importance (to me, at least):

  1. Richer, more powerful APIs that make getting real work done easier.
  2. A higher level of abstraction under which to offer more varied database backends (level 1).

The reason we can do that is because the level 1->level 2 heavy lifting is done for us. That lets us focus on higher levels. And make no mistake: those higher levels are important. The whole point of an API is to make doing common or ugly things simple and consistent. There is nothing fundamentally wrong with abstracting out common SQL patterns to non-SQL syntax, if it makes the programmer's job easier. Sometimes it makes sense, sometimes it doesn't. True, it is not "the job of the database abstraction layer" to abstract away SQL itself... if you are talking about level 2 database layers. If you're talking about level 3 or level 4, then abstracting away SQL itself, at least to some extent, is the whole point.

Challenges

Each of our stated goals has its own challenges, of course.

First, we want to make sure we build a rich API that makes simple and moderately complex things easy while making hard things possible and hopefully easier. We don't want to exclude certain hard edge cases from the realm of possibility, because you know that as soon as you do you'll find a reason you need that obscure functionality and will be kicking yourself for making it impossible. We also don't want to force people to learn a completely obscure and proprietary data access layer lingo the way that, say, Doctrine does. (Nothing against Doctrine per se, it just wouldn't work for Drupal.) That means we need to abstract far enough to make common tasks easier but not so far that non-common tasks are impossible.

For supporting more databases, there are a whole host of problems. For one, the SQL standard itself is ill-supported by, well, everything. Even common tasks like inserting data or creating tables are not common across all databases. CREATE TABLE syntax is subtly different in every database. Some need extra treatment for values that aren't trivially small (Large OBjects, or LOBs), and that treatment is very different on different databases. In order to support widely different SQL syntax, we need to have some syntax that module developers can use that will work on all of them without having to switch() on the database type and run different code in each case.

That means we can do one of three things:

  1. Support a lowest-common-denominator only.
  2. Invent our own SQL-ish syntax that each database backend can mutate as it needs.
  3. Provide a non-SQL layer for developers that can be translated into database-specific SQL on the fly (level 4).

Method 2 we dismiss out of hand. SQL is a serialized data format for structured (relational) data. It is serialized by PHP code (in our case) to a string, and then deserialized by the SQL server, which has a very big and complex algorithm for doing so that is way better than anything we could write in PHP. If you need to parse and modify a serialized data structure, you have a design flaw. It means you serialized the data too early. It's also slow and error-prone. Let us not speak of it again.

Method 1 stands in the way of supporting more databases because different databases aren't direct subsets of each other's functionality. Oracle is not a strict subset of DB2's features, DB2 is not a strict subset of PostgreSQL, etc. That means we'd be limited to functionality that existed in every supported database and existed the same way. If we take this argument to an extreme, that leaves out INSERT queries entirely since they work differently for Oracle, DB2, and MySQL. Even Oracle admits that "Platform-neutral" SQL is not a good idea. Clearly that is not an option.

That leaves Method 3: Provide a non-SQL layer on top of the SQL drivers that works on data structures rather than raw SQL, and let each database backend build the SQL for us. That is, actually, exactly what Drupal's existing database API and nearly all other level 3 and 4 abstraction layers do. It's just a matter of degrees between db_query_range() and an Active Record's objects. It also is the perfect place to implement that richer API we want, too.

"But wait!" I hear you cry. "SQL is already an abstraction layer on top of relational databases." That is an absolutely wonderful theory. In practice, though, SQL is not standardized enough to make writing non-trivial queries that works on an arbitrary databases possible. Blame database vendors, blame the SQL standards committees, blame Elvis, it doesn't matter. SQL is not, in a practical sense, sufficiently standardized to be treated as a standard.

Simplicity

An important factor in designing that abstraction layer is that it has to be simple. It has to be simple enough for developers to use it without more effort than it would take to simply write SQL for each database directly. A key attribute of a good interface design (API, computer, or physical) is that it makes doing the "right thing" easy.

Now, let's consider that the vast majority of Drupal sites right now run on MySQL. PostgreSQL is nominally supported, but most contrib modules are never tested or vetted against PostgreSQL. Many don't even include PostgreSQL install queries, because the developers are MySQL people and have no concept of PostgreSQL's syntax differences. As Edison Wong (who has been working very hard to get Oracle support into Drupal in recent months) points out, we have special BLOB handling wrappers for PostgreSQL because PostgreSQL requires BLOB data types to be escaped differently than strings. How many modules actually do that, however? I would wager most contrib module authors, by number, don't even know that PostgreSQL requires different handling for certain data types. They aren't scared by it because they don't even know about it.

That was one of the many goals behind the Schema API in Drupal 6, as I understand it. There is now a non-SQL abstraction for schema management (table creation, modification, and deletion) that has database-specific code to translate that non-SQL code (nested arrays, of course) into database-specific SQL. That buys developers both database portability and richer data introspection. That means, and this is very important:

MySQL-centric developers write PostgreSQL-compatible Data Definition code by default without having to know about PostgreSQL's idiosyncracies.

That's the "right way" to handle cross-database compatibility. Abstract it just enough that the developer writing the code doesn't have to know the ins and outs of a half-dozen different database systems. We have an army of MySQL-familiar developers. Getting them to learn PostgreSQL and Oracle and DB2 as well is a losing battle. Getting them to use APIs that are familiar to them but support those databases automatically is a better strategy.

For Oracle, the main idiosyncrasy is LOB handling. The concept of LOBs does not exist in the mind of a MySQL-based developer. From their point of view, the requirement to specially escape LOBs or issue multiple statements to handle them correctly is a bug, not a feature. The lowest common denominator here is Oracle (and DB2), not MySQL. Dragging Drupal down to the lowest common denominator will only raise the barrier to new developers (most of whom will have a MySQL background if they have an SQL background at all), alienate existing developers, and hurt the project in the long run.

(I freely admit that there may be very good technical reasons why Oracle and other "Enterprisey" databases require the user to jump through extra syntactic hoops for LOB data. I really don't know. But to a MySQL developer, which is virtually our entire labor pool right now, it's pointless busy work.)

So how do we get MySQL developers to write Oracle or DB2 code without having to deal with Oracle or DB2? Just as with PostgreSQL, we give them an API that does it for them and provides added functional benefit. To wit, db_insert($table, $fields). At the module developer level, all they do is provide the system with a table and fields to insert. At an abstract level, that's all an INSERT statement is. SQL is just a serialization of it, and in the case of INSERT a very bad one if you're doing prepared statements because you have to repeat every field three times: once in the field list, once as a placeholder, and once as a value. If you have long queries with lots of fields, or god forbid a dynamic set of fields to insert, the code gets horribly ugly. Every Drupal developer can handle a single associative array, however. For MySQL, that array gets translated into an actual prepared statement query like this:

<?php

function insert($table, $fields) {
 
$insert_fields = array_keys($fields);
 
$insert_values = array_values($fields);

 
$placeholders = array_fill(0, count($insert_values), '?');

  return
$connection->query('INSERT INTO {' . $table . '} (' . implode(',', $insert_fields) . ') VALUES (' . implode(',', $placeholders) . ')', $insert_values);
}
?>

For Oracle, it would look something more like this (untested, since I don't actually use Oracle, but this is the general idea):

<?php
function insert($table, $fields, $delay = FALSE) {
 
$insert_fields = array_keys($fields);
 
$insert_values = array_values($fields);

 
$schema = schema_get_schema($table);

 
$placeholders = array_fill(0, count($insert_values), '?');

 
$sql = 'INSERT INTO {' . $table . '} (' . implode(',', $insert_fields) . ') VALUES (' . implode(',', $placeholders) . ')';
 
$statement = $connection->prepare($sql);

 
$i = 1;
  foreach (
$fields as $field => $value) {
   
$type = in_array($schema['fields'][$field]['type'], array('text', 'blog')) ? PDO::PARAM_LOB : NULL;
   
$statement->bindParam($i++, $value, $type);
  }

 
$statement->beginTransaction();
 
$statement->execute();
 
$statement->commit();

  return
$statement;
}
?>

Now, all of those module developers who cut their teeth on MySQL and think a LOB is a slow throw of a baseball get a nice, structured way to execute an insert statement, that handles type-safe escaping and prepared statements for them without any effort on their part, and we can translate it for MySQL, Oracle, or any other database we want. The developer doesn't even have to provide field type data to indicate which fields should be LOBs, because the Schema API provides that information for us. More powerful API, multi-database support for free, single interface. Everybody wins!

Conclusion

As always, this post ended up far longer than I intended it to. The story of my life. :-) However, I hope it will put to rest the question of "favoring one database". We should favor developers, and make their life as easy as possible. The hard stuff, which includes BLOB an LOB handling, is not the developers' job. It's the abstraction layer's job. That's what it means to "abstract" the problem in the first place.

Happy coding!

Comments

The list looks malformed to

The list looks malformed to me, so I don't know what #2 you're talking about. Looking at the HTML, it's a mixture of empty <li> and <P>...

1. Support a lowest-common-denominator only.
2.
3. Invent our own SQL-ish syntax that each database backend can mutate as it needs.

Provide a non-SQL layer for developers that can be translated into database-specific SQL on the fly (level 4).
4.

Bah!

That's what I get for writing HTML after midnight. *sigh* Code is fixed now. Thanks.

Meaningless for any abstraction if only working with MySQL

I would like to comment about this point by point:

  1. Supporting the lowest common denominator with as rich of functionality as it can is always the main concern of build a GOOD database abstract layer. Sorry that I need to use ADOdb as reference once again: it support for totally 19 different databases, without losing much functionalities, and widely used among many open source (and close source) projects. It is "proved as successful by reality". On the other hand, as we are now supporting MySQL and PgSQL (nominally) only, don't you think we should reference to a GOOD example, and even be closer to their progress before brain storming for some impracticable idea?
  2. How can this db_insert() serve for different requirement? As we all know that that is many possible syntax provide by SQL for INSERT statement, abstract it with a "over simplified" API will greatly reduce the flexibility of programming
  3. Our API is targeting for developers, but not end users. It is fine if a module developer don't understand the variation between MySQL/PgSQL/Oracle/DB2/etc, as this is the duty of database abstract layer developer's (as you, and as me). But that is no sense to say that: our module developers are able to misunderstand about our abstraction layer. We ABSTRACT it as we NEED IT based on compatibility concern, so developers SHOULD FOLLOW IT with NO QUESTION. E.g. as we provide db_query_range() for our module developer, that is their fault if they don't use it, but not the fault of our abstraction
  4. LOBs is not a bug, but a historical problem. Most enterprise-scale relational database start their business for more than 25 years, a day that don't even have PC with 486DX. During that period, both storage and computing power of main frame computer are far lesser than that of today's mobile phone. A separated LOBs handling with limited feature is a must for such cases. Please don't compare that with our young competitor: MySQL is just started from 1995, the days with GB size harddisk storage. It is young enough for it to dueling with GB size string storage: TEXT and LONGTEXT. Maybe some days we will have TB-scale storage with new database system coming with TB scale supporting, and you may also say MYSQL as buggy ;p
  5. Please isolate our database abstract layer from schema API, even though they are now put together. I don't seems current schema API as ultimated model: XML and XML Schema Definition (XSD) may even be a better options, as it is an industrial standard when compare with PHP array/object modeling. It is no point to highly integrate db_insert() with schema API, and check its status every times during data insertion. On the other hand, what will happened if schema is defined on-the-fly, as like as using CCK? The "right thing" that we should due with, is collect enough information from programmer, and provide enough flexibility for their development; but not trying to abstract something that should be the duty of programmer, on the other hand limit their creativity, and highly integrated with something it need not to integrate with :)
  6. I think you don't need to guess about how PDO + Oracle dueling with LOBs problem: they provide a very good example:
    <?php
    $db
    = new PDO('oci:', 'scott', 'tiger');
    $stmt = $db->prepare("insert into images (id, contenttype, imagedata) " .
     
    "VALUES (?, ?, EMPTY_BLOB()) RETURNING imagedata INTO ?");
    $id = get_new_id(); // some function to allocate a new ID

    // assume that we are running as part of a file upload form
    // You can find more information in the PHP documentation

    $fp = fopen($_FILES['file']['tmp_name'], 'rb');

    $stmt->bindParam(1, $id);
    $stmt->bindParam(2, $_FILES['file']['type']);
    $stmt->bindParam(3, $fp, PDO::PARAM_LOB);

    $stmt->beginTransaction();
    $stmt->execute();
    $stmt->commit();
    ?>

    Please don't miss out the needs of "EMPTY_BLOB()", "INSERT INTO ... RETURNING ... INTO ..." style, and "PDO::PARAM_LOB" :)
  7. Handling LOBs value correctly IS the duty of a developer, or better says as "the duty of a professional and skillful developer". A developer should always takes the duty of database schema design and its implementation. They should understand the reality, choose a suitable data type, and working with it correctly. This shouldn't be the duty of abstract layer developer. Don't take away their duty ;p

Don't just focusing and working with MySQL only, as we are talking about database abstraction. We will need to understand the differences between different database, their PROS and CONS, before implementing a suitable and powerful ABSTRACTION. It is no point to call myself as "database abstract layer developer" if I just know about MySQL, or PgSQL, or Oracle, or DB2, or any ONE AND THE ONLY ONE database system. It is always welcome to join Oracle driver (http://drupal.org/node/165788), DB2 driver (http://drupal.org/node/165788) and MSSQL driver (http://drupal.org/node/74308) development, as it is a good chance for enrich the understanding between different database. Don't waste your time, come and join the development on today :)

Take a look at Gallery 2

I'm pretty happy with the storage abstraction layer of Gallery 2. There are sure things that are less than ideal, but it's pretty good.

I'm mentioning it because it seems to address most of your needs:
- Using adodb to abstract PHP's DB access API
- API to abstract some DB features that differ often and aren't covered by adodb
- ORM with load/save methods
- API to add/get/update/delete table records
- API for free-form SQL statements
- Parameter-binding everywhere
- XML/annotation based definition of the schema / table structure / upgrade (ALTER) code
- Transparent data-normalization when fetching data from / putting data into the DB, including BLOBs
- Support for transactional and non-transactional sessions
- Forcing proper collation (unicode)
- Provides a locking backend

Currently we support MySQL, PostgreSQL, Oracle, DB2, and MSSQL. SQLite has just been added.

Not sure what we'll do when dropping PHP 4 support next year. I guess we'll use PDO for all databases.
And I guess that will also mean that we'll replace adodb with a lighter solution, merging some features from adodb (sequence abstraction, metacolums, metatables) into our DB specific storage classes.

What's missing are optional params to tell the API whether the queries need to go to a master or can go to a slave DB instance.
On the other hand, you can tell it what operations need to be executed within the transaction and which must be outside of the transaction.

Unfortunately, there's not much documentation.
There are the API docs:
http://gallery.menalto.com/apidoc/GalleryCore/Classes/GalleryStorage.html
And some tutorials:
- http://codex.gallery2.org/Gallery2:Alter_Table_Tutorial
- http://codex.gallery2.org/Category:Gallery_2:Development
- http://www.combined-minds.net/tutorials/8/Gallery_2_Module_Development:_...

Example interactions:

$item->save(); // calls Entity::save() which calls $storage->saveEntity($this)

GalleryCoreApi::addMapEntry('SomeTableName', array('someField' => false, 'otherField' => 'foo'));
// It knows that 'SomeTableName::someField is a boolean and will store the value as data-type that the underlying storage understands

// Free-form SQL
$query = 'SELECT [SomeTable::someField], [SomeTable::otherField], [OtherTable::foo]
                FROM [SomeTable], [OtherTable]
                WHERE [SomeTable::id] = [OtherTable::id]
                    AND  [SomeTable::someField] = ?';
$searchResults = $storage->search($query, array($dataToBind), $someQueryOptions);
// Now iterate over the search results

// Iterate over results from getMapEntry
$searchResults = GalleryCoreApi::getMapEntry('SomeTableName', array('someField' => false, 'otherField' => 'foo'));

How a good example :)

wow i love some of these features, would we able to implement them in D7?

  1. XML/annotation based definition of the schema / table structure / upgrade (ALTER) code: how beauty :)
  2. API for free-form SQL statements
  3. API to add/get/update/delete table records: seems something similar as Crell propose in D7; BTW, would we able to add this support by abstracting on top of free-form SQL statements? i guess we are able to do so, without totally revamp existing implementation. handy method usually abstract more, and so trade off with performance, right?
  4. Transparent data-normalization when fetching data from / putting data into the DB, including BLOBs: this is something that D6 LOBs patch (http://drupal.org/node/147947) proposing for :)
  5. Currently we support MySQL, PostgreSQL, Oracle, DB2, and MSSQL. SQLite has just been added: this is something that Drupal is now missing, how pity :(

replacing ADOdb maybe a good idea if we are not talking about backward compatibility, e.g. drop PHP4 supporting. BTW, i guess ADOdb is always a good example for us to cloning with :)

on the other hand, seems those master/slave handling should belongs to database level, but not database abstraction level? most enterprise databases (says oracle) provide a complete solution for serve this needs, and this is also the filed of most database vendors fighting for. should we take away their duty? i guess not ;p