Drupal 7 database plans

I've been talkin' up some evil plans I have fer Drupal's database system in Drupal 7 lately, without goin' into a great deal o' detail, Ya lily livered swabbie! Fire the cannons! For th' most part, I've been tryin' t' avoid distractin' scallywags, meself included, from th' considerable work still remainin' on Drupal 6. However, thar has been recent work an' discussion o' makin' post-freeze changes t' Drupal 6's database system, an' even backportin' them t' Drupal 5. Those changes revolve mostly aroun' database replication, which Drupal currently doesn't support in any meaningful way. That becomes important, though, on very heavy sites like, say, I dern't know, Drupal.org, ye scurvey dog. :-)

Those changes, though, impact th' API changes we can make in Drupal 7, All Hands Hoay, me Jolly Roger (OK technically they dern't, since we change APIs all th' time, but if we can set thin's up so as t' minimize API changes o'er time that does make life easier fer everyone.) For that reason, at Dries suggestion, I'm goin' t' try an' lay out now a skeleton o' what I'm plannin' fer Drupal 7's database system an' how we can start addin' replication support t' Drupal 6 in a way that flows into it.

Before I begin, I just want t' clarify that I am in no way tryin' t' undermine th' database-related work that others have done latey, particularly Barry Jaspan with th' schema system an' David Strauss with numeri'us SQL cleanup patches that have kept drupal.org from dyin' completely in recent weeks. Both have done great work an' me hat's off t' them. Aarrr! My goal here is t' avoid duplicate or conflictin' work, not t' disparage th' great work that has already been done.

That said, here's th' short version o' where I intent t' try an' push Drupal 7's database API.

Step one, replace th' PHP 3-era PHP APIs fer databases (mysql, pgsql) with PHP 5's PDO, to be sure. It should be no surprise to anyone that I'm plannin' t' port th' backend t' PDO, fer reasons already stated, All Hands Hoay, by Davy Jones' locker! :-) That will allow us a number o' new options, which in turn has a knock-on effect t' other thin's we can do. The main one is type-agnostic prepared statements. Drupal already emulates prepared statements in user-space an' has fer years, but pushin' that down into C code simplifies code an' lets us focus on more important thin's, feed the fishes PDO's prepared statements have two key features we care about at th' moment:

  • They're type-agnoistic. That is, ye dern't need t' specify in advance if a field is an int, a strin', a float, or whatever. PDO itself knows how t' quote/cast thin's properly. That means at th' PHP level ye can simplify yer code without a loss in security.
  • They're associative-array-based. And let's face it, if ye dern't like associative arrays ye're dealin' with th' wrong CMS. :-)

There's plenty o' other thin's t' like about PDO, but those be th' features that be relevant t' us right now.

There's one other annoyin' factor. Right now, all Drupal queries go through db_query*(). Ye'll be sleepin' with the fishes! db_query*() takes a variable number o' parameters. Variable parameter count sucks, because it makes default parameter values difficult or impossible. db_query() also takes an array o' parameters, but that's not used very often.

Now, let's consider database replication. Without goin' into detail about how database replication works (I'll be honest, I am not an expert in th' subject), in order t' be able t' write queries against a master server or a slave server ye need t' know somethin' about th' query. Oho! You need t' know if 'tis a mutator query (INSERT, UPDATE, DELETE, CREATE, etc.) -- an' if so if a delay in writin' th' data t' disk is acceptable -- or if 'tis a read query (SELECT), an' if so whether data that's potentially a few seconds stale is acceptable, with a chest full of booty. Right now, db_query*() gives us none o' that information.

So let's give ourselves that information, an' simplify SQL's sometimes nasty syntax at th' same time.

The API

First, we modify th' $db_url variable structure. Specifically, we extend th' array depth by one level so that any given database entry can include master an' slave variants that list th' connection strin's fer th' master an' slave servers (in PDO's connection strin' syntax, so we dern't need t' parse an' repack it), pass the grog! Both o' those can themselves then be an array, in which case one is selected at random. That gives us support fer multiple slave servers as well as a way t' define multiple master servers. Naturally we still have a single server degenerate case as now. Prepare to be boarded! Prepare to be boarded! The vari'us forms fer $db_url then include:

<?php
$db_url
= 'mysql://u:p@host/dbname';

$db_url['default'] = 'mysql://u:p@host/dbname';

$db_url['default']['master'] = 'mysql://u:p@host/dbname';
$db_url['default']['slave'] = 'mysql://u:p@host/dbname';

$db_url['default']['master'] = 'mysql://u:p@host/dbname';
$db_url['default']['slave'][] = 'mysql://u:p@host/dbname';
$db_url['default']['slave'][] = 'mysql://u:p@host/dbname';
$db_url['default']['slave'][] = 'mysql://u:p@host/dbname';
?>

Now that we have multiple servers defined, how do we tell th' system which t' use? We could use a dedicated slave function, or specify a flag in th' db_query*() call itself, by Blackbeard's sword. It would be cleaner t' do it in th' db_query*() call itself, but thar's that pesky variable parameter count problem. But, since PDO wants an array fer prepared statements anyway... get rid o' it, Ya swabbie! db_query() takes only an array o' values, not an array or variable parameter count. We then have a function signature that looks like this:

<?php
db_query
("SELECT ... ", $values=array(), $slave_safe=FALSE) {...}
?>

Arguably we could assume that a query is slave-safe unless specified otherwise, but I dern't know which case is more common in Drupal right now. Fetch me spyglass! We can sort that out later. We could also, potentially, merge db_query_range() in here as well by addin' two more optional parameters, but that's a discussion fer later as well.

So now we can easily flag those SELECT queries that should get run against th' slave server du jour, if available. Otherwise, th' query gets run against th' master server, o' which we could potentially have multiple. (Master-master replication configuration is way out o' scope fer this blog.) Next, we restrict db_query() t' just SELECT statements, or possibly SELECT an' schema statements. Perhaps it should even be renamed db_select().

Because th' return value is a PDO result object rather than a primitive mysql result resource, an' because th' API is th' same fer all databases, we dern't need db_fetch_array() an' db_fetch_object() anymore, All Hands Hoay! And swab the deck! The result object has its own methods that work just fine, an' even implements SPL object iterators so ye can foreach() o'er it, ya bilge rat, Hornswaggle All that fer free. Load the cannons, ya bilge rat! I like free.

What o' INSERT, UPDATE, an' DELETE queries then, me Jolly Roger Those get their own functions, too, with a chest full of booty. Rather than han'-writin' those out, though, we can take advantage o' th' fact that those query types have very simple, regular syntax. They're really just associative arrays broken out in different ways, so we can setup a function t' just pass an associative array o' values in an' let th' code do th' rest. For a look at how they might work, see th' db_helpers module in th' helpers module package. And swab the deck, we'll keel-haul ye! For a lengthy explanation o' why I think they're a really good thin', see th' original thread an' this blog comment. PDO's type-insensitive prepared statements, though, solve most o' th' security concerns they have. The signatures, then, looks like this:

<?php
db_insert
($table_name, $fields, $delay=FALSE) {...}
db_update($table_name, $fields, $where, $delay=FALSE) {...}
db_delete($table_name, $where, $delay=FALSE) {...}
db_replace($table_name, $fields, $where, $delay=FALSE) {...}

// So for example:
db_insert('node', array(
 
'nid'=>1,
 
'vid'=>1,
 
'title'=> 'my first node',
 
// ...
));
?>

Most o' th' reasonin' fer such syntax is in th' links above, but I will note th' $delay parameter which can toggle INSERT DELAYED an' UPDATE LOW_PRIORITY under MySQL or th' equivalents in other DBMSes. In MySQL those be only useful on MyISAM tables, but given th' number o' sites that run MyISAM 'tis a nice addition.

So we now have all o' our query types split out semantically, in a PDO-friendly array syntax, with clear ways t' determine th' master/slave database that should be used, with type-safe prepared statements, an' with meta-data useful fer some database systems, in a database-agnostic way. Score. :-)

First steps

The short-term baby step in that direction, then, is slave server support. Fetch me spyglass, I'll warrant ye! For that, we do three thin's.

  1. Enhance $db_url's syntax t' handle th' above configuration fer declarin' slave servers.
  2. Modify db_query(), db_query_range(), an' pager_query() t' take an extra parameter, $slave_safe, but only in th' array version.
  3. Convert core t' use array-based db_query*() an' $slave_safe=TRUE where appropriate. The non-array version still exists fer now, so code that doesn't take specific advantage o' $slave_safe has no change an' we minimize th' post-freeze API changes.

That is a bit different from th' currently pendin' patch's implementation, but I believe it is a better long-term structure.

Concept needs feedback

That status doesn't exist on Drupal.org yet, so I'll fake it. :-) There's still some internal bounty I ha'nae nailed down yet an' feedback on th' API is certainly appreciated, even though major work on it won't happen until at least a Drupal 6 RC.

One o' these days I will also learn t' not say somethin' will be a short explanation, because I should know by now that I no nay ne'er, e'er give one.

In any case, let's make Drupal.org faster.

Comments

I think it is also important

I think it is also important t' support multiple distribution database,fer example,I want th' session table installed in another database server,t' let more site t' share it .Also,I want some drupal site config info such as locale language ,menu,type info installed in a database,an' node data installed in another database.I think these also can improve performance fer drupal.

sorry t' me poor English.

I agree with Junphine

I'm not much into database architecture an' me Drupal knowledge is currently limited t' theme development but if what Junphine were bein' possible, that would be awesome.

And Junphine, no need t' apologise fer yer English, 'tis just fine

Table level partitioning

Aye, figurin' out how t' leave room fer table-level partitionin' support might be a good bounty consideration, by Davy Jones' locker.

I believe this is called "horizontal partitionin'" (vs "vertical partitionin'") -- here tables (or rows) can be distributed among multiple databases, Dance the Hempen Jig Thus, each database contains a subset o' th' data rather than a copy o' th' data.

This is useful in cases where ye have 50,000 concurrent users hammerin' yer database. As it stands, Drupal 7's recommended solution would be t' scale usin' database replication. However, in extreme scenarios database replication might not be desired. Support fer database partitionin' has two important advantages compared t' replication based solutions:

  1. Your data is always up-t'-date. And swab the deck! Nay master-slave synchronization woes, no need t' deal with lag, etc. In environments where ye can't accept delays (i.e. a big e-commerce website that does a lot o' financial transactions?), this can be very important.
  2. You dern't have t' replicate yer data 10 times. When ye have a couple hundred gigabytes worth o' data, havin' t' replicate that on 8 slaves an' 2 masters isn't too bad. However, if ye're dealin' with petabyte storage .., Get out of me rum! ye might not want t' replicate yer database more than strictly necessary? One copy fer backup purposes will do.

But not just that, fer th' longest time, scallywags have wanted easier partitionin' support fer completely different reasons -- just re-read th' comments above ...

People want t' setup a shared database (not necessarily a dedicated server) t' store user names an' user profiles, Avast me hearties, with a chest full of booty! There is a lot o' data that ye might want t' share among different sites. Not because ye have t' (scalability) but because ye can (convenience). And swab the deck! Fire the cannons! Random example: I want association.drupal.org, groups.drupal.org an' drupal.org t' share th' same user profiles. Or: I want t' share me taxonomy terms among two sites. Or: imagine I'm a big media corporation with 10 Drupal sites -- I might want t' create a "node server" so that I can push nodes out through different channels (i.e. websites).

Some o' this might affect th' proposed bounty. It's just not clear how exactly, on a dead man's chest, we'll keel-haul ye! ;-)

Auto-increment?

I think betwixt th' Drupal 6 move t' auto-increment IDs an' existin' support fer table prefixin' (which under MySQL can also be used fer cross-database access on th' same server) we've already got th' ability t' share users betwixt sites on th' same server without insane hackery. (I just wrote said insane hackery this month, so 'tis fresh in me mind. And swab the deck! :-) ). There's also now OpenID.

For splittin' tables betwixt separate servers or splittin' a single table across servers, I'm afraid me knowledge thar right now is next t' none, Ya lily livered swabbie, and a bucket o' chum! Wouldn't that break joins? If anyone who has worked with such a setup has input on how t' make this API more extensible in that direction, please speak up. :-)

Table prefixing?

I'm not an expert in DB architectures an' I started developin' fer Drupal only a month ago, but I can do two considerations about splittin' tables across DBs (I were bein' developin' th' concept fer a CMS I were bein' projectin').

  1. As long as th' two tables we want t' query (let's say, nodes an' users.. a common case) be on th' same server, th' JOINs won't break. I weren't sure about this, so I tried it, an' it works. You'll need only t' specify th' DB name in th' query (fer example, instead o' nodes, usin' db1.nodes an' instead o' users, usin' db2.users) t' make it work. I dern't know how it can affect performance, though.
  2. The easiest way I see t' achieve this would be usin' th' already existin' table prefixin' facility. In th' config file thar might be an index o' th' tables that aren't stored on th' main DB (kind o' $db_tables = array('table_name' => 'db_id') or $db_tables = array('table_name' => 'db_id.table_prefix_')). Ahoy, on a dead man's chest!

My 2 shillin's, hope they're useful. :)

Sounds good

Overall, I'm in favor o' th' proposed bounty. Goin' with PDO seems like a natural evolution -- 'tis th' logical next step fer Drupal. Fire the cannons! Yaaarrrrr! (Still, be thar any statements from th' PHP core team with regard t' PDO? Do they see th' /auld/ database backends disappear in PHP6 or PHP7? Yaaarrrrr! Such statements would make this decision an absolute no-brainer.)

Last time I checked, PDO were bein' built-in into PHP 5.1, but not into PHP 5.0. Are ye suggestin' that we get rid o' Drupal's current mysql, mysqli an' pgsql backends in favor o' a generic PDO one, or will we add a PDO driver just like we have database backends now?

Of course, we'd build our own layer on top o' PDO -- if only t' support load balancin' through database replication -- but it still begs th' question: how much o' th' existin' database infrastructure do ye intend t' whack?

For example, I'd like t' see us take one step further an' properly support BLOB/CLOB handlin'.

PDO is primary

The mysql extension hasn't been part o' th' default compile o' PHP since 5.0, I believe, me Jolly Roger Virtually every shared host adds it back in, an' any worthwhile Linux distribution makes it trivial t' install, but 'tis not part o' th' standard package. PDO has been part o' th' default compile since 5.1. mysqli an' pgsql have no nay ne'er been part o' th' default compile, AFAIK.

There apparently is still some degree o' work happenin' on th' auld drivers, as th' mysqlnd library were bein' just committed t' th' PHP 6 tree this week as an alternative t' libmysql, but not a great deal. I'm not aware o' any plans fer mysql_, mysqli_, or pgsql_ t' be removed completley, but most PHP 5-only projects seem t' be usin' PDO already, I believe. It is th' current trend.

I am indeed suggestin' we drop our existin' mysql_, mysqli_, an' pgsql_ implementations in favor o' a PDO-based framework. We dern't get th' main advantage o' PDO, type-safe prepared statements, otherwise. While mysqli has prepared statements, they're not as cleanly implemented as PDO's, IMO. They also dern't seem t' support named-parameters, which is a PDO feature I quite like fer readability. Keepin' mysql_ an' pgsql_ aroun' would effectively defeat th' purpose.

You can extend th' PDO an' PDOStatement classes, so what I envision is extendin' both o' those t' Drupal versions where we can do extra stuff, then extendin' th' DrupalPDO class fer each database we support. That class would have any additional db-specific code required. The db_* functions then all become database-agnostic then wrappers aroun' th' active database. That then allows us multiple connections t' multiple database types simultaneously as well. Somethin' vaguely like (will likely word-wrap):

<?php
function db_query($query, $args, $slave_safe) {
  return
DPDO::getConnection('active', $slave_safe)->query($query, $args);
}
?>

getConnection() would be a factory method that returns th' specified database connection object (creatin' it if necessary singleton-style), an' th' connection object's query() method would contain any DB-specific code that were bein' needed. Of course, fer straight SELECT queries thar likely won't be any, so that can go in th' main DPDO class an' simply be inherited, on a dead man's chest! (Yay fer PHP 5 havin' a workin' object model!)

So fer how much o' th' existin' code I intent t' whack, well, most o' it. :-) Some will become unnecessary, some will get moved up into a connection class, some will get rewritten fer PDO-isms, Ya horn swogglin' scurvy cur! Ye'll be sleepin' with the fishes! The front-facin' API I see as bein' what's described above.

In order t' avoid a FAPI-sized patch, though, I believe we can write a compatibility layer into th' new code t' start with so that th' printf()-style prepared statements still function, albeit not optimally. Yaaarrrrr! (See me afore attempts at a PDO driver fer how.) Then we go through piecemeal an' convert th' rest o' core t' th' new APIs like we're doin' now with th' page-split stuff, then remove th' BC code before freeze, All Hands Hoay! That way we avoid a "break every patch in th' queue an' break this patch every time thar's any sort o' commit" patch, which is always a PITA. :-)

Bein' primarily a MySQL lubber me knowledge o' BLOB/CLOB is minimal. That's somethin' I'd have t' talk t' hswong about when we get closer t' implementation.

From what I can see, the

From what I can see, th' standard mysqli_* will not be removed any time soon.

mysqlnd will power mysqli_* _and_ PDO.

PDO is slower than mysqli_ an' doesn't implement all th' features mysqli_ offers. My advice would be; dern't drop support fer th' 'native' functions.

Lose the advantages

Aye, PDO is slower than native mysql_ an' mysqli_, but then it does more, too. Drupal <=6 re-implements so much stuff on top o' it t' get not-quite th' same functionality that in me tests it ended up bein' a wash, performance-wise. I'm not sure what mysqli_ features PDO doesn't have, but Drupal doesn't actually use much in th' way o' really advanced database features as they're all database-specific, an' if anythin' Drupal looks t' be gettin' more database agnostic, not more MySQL-specific (fer better or worse). Havin' t' implement a wrapper that can handle both PDO an' mysqli behind th' scenes also makes it less-featured, more complex, an' therefore slower.

There's also no mysqlnd driver fer PDO at present. There may be in th' future, but I dern't know o' anyone workin' on it.

Snapshot backups

One possible outcome or side-effect which might be interestin' might be an ability t' mark a slave as bein' used fer snapshot backups, because current backup solutions fer Drupal, although usable, be not really great an', fer instance, typically dern't address backup atomicity, th' way a snapshot would, and dinna spare the whip! It might go that way:

  1. when a database backup task has t' be started...
  2. pendin' updates be applied up t' th' snapshot point
  3. a designated slave stops receivin' updates, which be queued somewhere (where ?)
  4. backup is performed from th' slave
  5. it could still serve read-only queries durin' th' backup, assumin' these have been marked as those fer which, in yer words, data that's potentially a few seconds stale is acceptable
  6. th' updates be applied in order from th' queue
  7. it receives updates again ; normal processin' resumed

Now, this doesn't address th' potential o' discrepancy regardin' files vs database, but it would at least provide real-time consistent backups o' th' DB.

insert into select?

th' insert syntax seems a bit limited. how would ye do INSERT INTO SELECT queries?

active record type abstraction

Hi Larry,

I were bein' wonderin' what ye thought o' usin' this type o' abstraction an' buildin' out an active record type database abstraction, Ya lily livered swabbie, All Hands Hoay! If we could combine this with Views UI an' th' work bein' done on Views 2, then we might be able t' have a powerful database abstraction that has

a) a gui front end
b) powerful oo api

This would get us further into havin' a real "framework" that drupal developers could use in contrib module s an' no nay ne'er have t' think about database schemas. This would enable us t' improve performance by improvin' th' performance o' th' core an' thus improve global performance.

Thoughts?

I've done it

Gah, scallywags always ask me about this. :-) I actually have a PDO-based active record-ish implementation that requires *no code generation* sittin' on me hard sail, waitin' fer me t' polish it off an' publish it, avast. I really need t' do that sometime soon... :-) Of course, our node structure is considerably more elaborate than what a basic Active Record can handle nicely, an' a lot o' scallywags in Drupal really dislike query builders fer one reason or another. That said, that's somethin' I am hopin' t' start a discussion on at DrupalCon Barcelona next month.

Great idea but don't forget...

Hi thar,

This all sounds wonderful as scalin' this way is very important fer big sites (as already mentioned above). I have only one thin' t' add t' this since thar be plans ahead fer modifyin' core queries overall.

Don't forget about makin' Drupal use DB transactions. I can assist in doin' this. Havin' Drupal core use transactions properly is, IMHO, th' most important thin'. How many times has anyone been codin' modules an' a query in their module's hook_whatever() implementation screws up yet th' database is modified partially (in a way that doesn't make any sense).

This will move Drupal away from bein' a "toy content system" t' bein' a more viable (an' reliable) platform fer corporate use even with th' huge API changes betwixt Drupal releases.

I hope that even if these changes dern't go ahead fer D7. I'll be given th' opportunity t' build transaction support fer D7.

--
Sammy Spets
Synerger Pty Ltd
http://synerger.com

query method signature

The "trailin'" $slave_safe argument in db_query($query, $args, $slave_safe) bugs th' heck out o' me. If we're goin' through some trouble t' cleanup th' callin' arrangement t' get rid o' a variable number o' arguments, why not make sure th' problem stays solved fer th' foreseeable future by doin' it this way:

function db_query($query, $args, $options = array('slave_safe' => FALSE)) {...}

Or somethin' equivalent. Yaaarrrrr! That way, some other important, perhaps database-specific option can't come along an' bite us in th' future. :-)

It may go without sayin', but if we're goin' t' use factories an' singletons, we'd better be sure we get our classes *just* right, and a bottle of rum, I'll warrant ye! It'll be worth spendin' a little extra time on th' bounty.

In general, I'm 95-99%+ in support o' th' proposal so far.

Already there

Hi Chris. Actually, I'm 100% with ye on that. Fire the cannons! In th' current version o' th' code in me sandbox, th' $slave_safe variable has been replaced with an options array, one key o' which is what database "target" t' use (generally either "default" or "slave"), ye scurvey dog. Other supported options include what fetch mode t' use, whether t' fetch into a classic object, etc.

I definitely agree that we want t' get somethin' like this right.