RFC: Drupal pluggable system handlers

Submitted by Larry on 17 June 2008 - 12:07am

Recently I've been talking up various ideas for pluggable subsystems in Drupal in IRC and the other usual haunts. Ideas have been percolating in my head, but so far I have been remiss in actually writing them down. Yesterday, however, I had an epiphany to solve the primary issue I was trying to work out, so I present a hopefully workable RFC (for real, not IETF version) for pluggable subsystems in Drupal.

I am posting this over to Planet PHP as well to invite commentary from those who aren't already embedded in the Drupal mindset. :-)

Background and definitions

The first important question is what exactly a "pluggable system" means. After all, Drupal already has a modular extensible system: Hooks. Why do we need another one?

The problem is there are many ways to make an extensible system; some are better suited to certain types of extension than others. Drupal's Hook mechanism can be described, as webchick so eloquently puts it, as "Hey, I'm about to do X. Who wants to do something with/about it?" That is, hooks are a procedural implementation of the Observer pattern, with passive registration rather than active registration.
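
To make "passive registration" concrete, here is a minimal sketch of dispatch by naming convention. It is not Drupal's actual module_invoke_all(); example_invoke_all(), the module names, and the "greet" hook are all invented for illustration.

<?php
// Hypothetical modules "implement" a hook simply by defining a function
// named {module}_{hook}. No explicit registration call is needed.
function alpha_greet($name) {
  return "alpha says hi to $name";
}
function beta_greet($name) {
  return "beta says hi to $name";
}

// Illustrative stand-in for a hook dispatcher: ask every known module
// whether it has something to say about $hook, and collect the answers.
function example_invoke_all(array $modules, $hook) {
  // Everything after the first two arguments is passed on to the hook.
  $args = array_slice(func_get_args(), 2);
  $results = array();
  foreach ($modules as $module) {
    $function = $module . '_' . $hook;
    if (function_exists($function)) {
      $results[$module] = call_user_func_array($function, $args);
    }
  }
  return $results;
}

// "gamma" does not implement the hook and is silently skipped.
$results = example_invoke_all(array('alpha', 'beta', 'gamma'), 'greet', 'Larry');
?>

The caller never names the observers; whoever defines a matching function gets called. That is what makes the mechanism cheap, and also why it cannot answer "who is responsible for X?".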

It's not perfect (nothing is), but Hooks, properly implemented, end up being an extremely powerful-yet-cheap extension mechanism. Because Drupal uses a lot of bare data structures, we are able to use Hooks not only for traditional Observer behavior but also for Inversion of Control, allowing the core system to simply act as a router and mapper, letting modules do all of the hard work. The recent growth in "registry-style" hooks (hook_menu() and hook_theme() in Drupal 6, for instance) is a great example of that.

However, there is another important type of extensibility that Drupal does not currently handle well at all. If I may borrow webchick's eloquent style, I would describe it as "Hey, I need X done. Who is going to take care of it for me?" While Drupal does include that sort of logic in a few places, most notably the menu handler system, it is not implemented in an extensible, consistent manner. Nevertheless, it is a pattern that Drupal uses often, or rather needs to use. Consider swappable caching systems, swappable session handling, swappable password hashing (coming in Drupal 7), swappable image libraries... And there are many more that we could have if we had a good way to implement them: think output renderers, user management, user display name generation, the list goes on. Right now, though, we don't.

I debated what to call these pluggable systems, but have for now settled on "Handlers". Earl Miles pointed out to me that what is described above is very similar to the concept of Handlers in Views 2 and Panels 2, and the implementation I describe below is, while not the same as the "ofchaos suite", in some ways inspired by it.

Handlers

At a 10,000 foot perspective, a Handler is defined as:

1. A self-contained piece of code that
2. is called explicitly by some other piece of code to
3. handle some particular set of related operations and
4. can be "swapped out" for another implementation of the same interface with no changes to the calling code and ideally
5. multiple implementations can exist in the same request in parallel.

That's great. We've just described the idea behind any function or subroutine. :-) If it's a multi-faceted subroutine, we've just defined an Object. The only tricky part is requirement #5, that multiple implementations can co-exist. Not all pluggable systems need such functionality, but many do.

That is the primary failure of Drupal's current de facto mechanism, conditional includes. Currently, Drupal handles multiple implementations of the cache, password, database, and session systems by specifying via a hard-coded variable in settings.php which of a number of files to include, each of which defines the same set of functions with alternate implementations. While that works, it is, quite simply, sloppy. It does not permit multiple simultaneous implementations; it requires hard-coding paths (albeit relative paths) in a config file; there is no cohesion between related pieces of functionality (cache_get() and cache_set() for instance) other than function name prefixes; it is difficult or impossible to configure via an admin interface; and duplicate names break the shiny new code registry.

Another key problem with the conditional include method is code duplication. The Drupal 6 database layer is an excellent example. The ext/mysql and ext/mysqli drivers share around 50% of the exact same code. Duplicating that code between them is obviously a bad idea. The alternative, which we do, is to have both drivers include yet another file, database.mysql-common.inc, which includes the overlap. That doesn't work at all if we want to have handlers added by modules rather than hard-coded into the includes directory. We need something better.

While it would be possible to layer functions (as described in my earlier article), that requires each function implementation to duplicate the same pass-through code. If you need to call a lot of routines on a given subsystem, the extra function calls can get expensive. It also doesn't solve the shared code problem. It is, overall, a poor solution.

However, the requirements described above map almost exactly to an extremely common OOP pattern: The Factory pattern. In the simplest sense, a Factory is a routine you call to return an object that matches a given interface, but the exact implementation is determined by some encapsulated logic, however simple or complex it needs to be. The canonical example here is a shipping system; the system calls a factory object with a product to be shipped and its destination. The factory looks up (from somewhere) the closest warehouse to the user's location that has the product, determines which shipping partner (FedEx, UPS, USPS, etc.) would be cheapest given those two locations, creates a shipping object for that partner, and returns it. All the caller knows is that it has an object that conforms to the Shipper interface. New shipping partners can be added by just adding a new class to the system and poof.

In PHP, the factory doesn't even have to be an object. If the factory logic is simple enough, it can be a simple function that returns an object.
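
As a rough sketch of that idea, here is the shipping example reduced to a single factory function. Every name in it (Shipper, FedExShipper, UpsShipper, shipper_for()) is hypothetical, and the rate table is made up; a real factory would consult warehouses and live rates instead.

<?php
interface Shipper {
  public function name();
}
class FedExShipper implements Shipper {
  public function name() { return 'FedEx'; }
}
class UpsShipper implements Shipper {
  public function name() { return 'UPS'; }
}

// The factory: a plain function that hides which class gets instantiated.
function shipper_for($destination) {
  // Invented quotes per destination, keyed by shipper class name.
  $rates = array(
    'domestic' => array('FedExShipper' => 12, 'UpsShipper' => 9),
    'overseas' => array('FedExShipper' => 30, 'UpsShipper' => 41),
  );
  if (!isset($rates[$destination])) {
    throw new InvalidArgumentException("No shipper serves $destination");
  }
  // Pick the class with the lowest quote and instantiate it by name.
  $quotes = $rates[$destination];
  $class = array_search(min($quotes), $quotes);
  return new $class();
}

// The caller only knows it received something that implements Shipper.
$shipper = shipper_for('overseas');
?>

Adding a third partner would mean adding one class and one row of rates; nothing that calls shipper_for() has to change.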

Self-contained: A sidebar

I want to highlight two key words in requirement #1 above: Self-contained. A self-contained system has a number of advantages; it is easier to debug, easier to develop, easier to test, and easier to unit test. The up-side is that it pushes all interaction with the outside world to a very narrow pinhole, its defined interface, and only interacts with the rest of the world passively as it gets configured by others calling it. The downside is that it pushes all interaction with the outside world to a very narrow pinhole, its defined interface, and only interacts with the rest of the world passively as it gets configured by others calling it.

Drupal currently is very much not self-contained. That is one of the things that makes it fast, because it can take shortcuts, but also one of the things that makes unit testing it extremely hard. In the interest of "embracing testing", therefore, I propose that we go ahead and standardize on a Handler object having no contact with the outside world except for its specific domain, save through well-defined methods. That means no variable_get()s, for instance. All of those should go in the factory, so that we can completely configure a Handler implementation in isolation and therefore unit test it to death more easily.

Early access

The main challenge implementing such a system in Drupal poses is pre-database initialization. A few pluggable systems, such as the cache system, sometimes need to initialize before the database does. The database is our primary storage mechanism, where we would store information on, say, which handler to use for a given system. Ideally we'd also want to rely on the registry to lazy-load just the Handler implementations we need, which if they are classes it can do automagically but only if we have a working database. This is a problem.

However, let us consider why we need stuff to happen pre-database. Well, there is the database layer itself. That can't rely on this mechanism, period, but that's a special case. For the rest, particularly the cache system, the issue is that we need to be able to use non-database caching. We want a site to be able to, say, use Memcache for page caching, and be able to serve cached anonymous pages without ever hitting the database. Why avoid the database? Well, all things being relative, database access is very slow. Having to connect to a MySQL database on another server just to do a single lookup to see what file to parse to get the CacheMemcache class is rather wasteful.

Fortunately, not all databases are slow. In particular, PHP 5 includes an integrated copy of SQLite. SQLite uses the PDO interface, which is what I am pushing with all my might to move Drupal to ASAP. It is also fast on read operations, because the "connection" cost is just a file stat call. Write operations are slow, though, for the same reason. The lookup tables for handler configuration should be very static. So if we move those to an SQLite database, we can still "use the database" for our configuration without "using the database". Neato.

The new database API includes support for exactly that sort of trickery. Master/slave replication is handled through "targets"; that is, a query can be specified to try to use one particular database connection (say, a slave server) but silently fall back to a default (the master server) if the selected target isn't found. The exact same setup can be used for selected system tables, such as the registry, system table, and handler-lookups. Simply set those queries to run against a "system" target if possible, and then have an automatic way to replicate selected tables from one target to another when they change. If you don't have SQLite available, then everything runs through the main database and you don't get as much of a benefit from Memcache cache implementations. If you do, you can skip the main database more often.

To be fair, there is still a performance hit for the cost of simply loading and parsing the core database code. That is not a negligible number of nanoseconds. However, any other mechanism I have devised involves a lot of manual hacking about in settings.php. The elegance and simplicity we gain from being able to use "a database" is, I believe, worth the extra code load, especially if we can further optimize the database code to load faster (we can) and make use of class autoloading to reduce the overall code weight of Drupal in general.

The implementation

Enough talking, on with the code!

I am going to use the cache system as an example, mostly because it is a well-understood and fairly simple API. This is also in very much draft form, but I hope to have enough of the concept down that the API makes sense. Let's start off with a basic interface needed by all Handlers:

<?php
interface HandlerInterface {
  public function setProperty($var, $val);
}
?>

All handlers must implement this interface. At a basic level all it does is define a way to set environment properties on a handler object. It is called from within the factory with the result of variable_get() et al instead of putting those calls inside the handler, which gives us better encapsulation.
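
A minimal sketch of what that looks like in practice, assuming a made-up handler (ExampleGreeter) and a made-up factory (example_greeter()). The $settings array stands in for the values a real factory would fetch from the variable system.

<?php
interface HandlerInterface {
  public function setProperty($var, $val);
}

// Hypothetical handler: it only ever sees what it was handed.
class ExampleGreeter implements HandlerInterface {
  protected $properties = array();

  public function setProperty($var, $val) {
    $this->properties[$var] = $val;
  }

  public function greet($name) {
    // Reads its own property store, never a global or the variable system.
    $greeting = isset($this->properties['greeting']) ? $this->properties['greeting'] : 'Hello';
    return $greeting . ', ' . $name . '!';
  }
}

// Stand-in for the factory: all configuration lookups happen here.
function example_greeter(array $settings) {
  $handler = new ExampleGreeter();
  foreach ($settings as $var => $val) {
    $handler->setProperty($var, $val);
  }
  return $handler;
}

$greeter = example_greeter(array('greeting' => 'Howdy'));
$message = $greeter->greet('webchick');
?>

Because the handler never reaches outward, a test can build one with whatever properties it likes and no Drupal bootstrap at all.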

We also need a way to define both things that can be handled and things that can handle them. For this, we use two registry hooks:

<?php
function hook_slot_info() {
  return array(
    'cache' => array(
      'targets' => array('default', 'block', 'filter', 'page'),
      'interface' => 'CacheInterface',
      'factory' => 'cache',
      'properties' => array('thingA', 'thingB'),
    ),
  );
}
function hook_handler_info() {
  return array(
    'cache' => array(
      'database' => array(
        // Translate on load, not define, like hook_menu().
        'name' => 'Database',
        'class' => 'CacheDatabase',
      ),
      'memcache' => array(
        'name' => 'Memcache',
        'class' => 'CacheMemcache',
      ),
      'mock' => array(
        'name' => 'Mock caching, does nothing',
        'class' => 'CacheMock',
      ),
    ),
  );
}
?>

There's actually a great deal going on in these few simple lines. First, we define two concepts: a slot, which is a "thing into which a handler gets plugged", and a handler, which is "an object that gets used by a slot". The term "slot" is borrowed from Qt, and is probably misused here. I welcome suggestions on better names, provided it doesn't become a bike shed thread. :-) A slot has a unique ID (cache), an explicit Interface (CacheInterface) that all implementations must implement, a list of the properties that it is expected to have (I couldn't think of any for the cache system to use, so the above is just for illustration), a factory function that will be used for accessing the appropriate handler, and zero or more targets.

A target here behaves in a similar fashion to the database system. It allows for multiple simultaneous implementations. If specified, there must always be a target called "default" plus some number of additional targets. In this case, the page cache, block cache, filter cache, etc. can all be specified as separate targets. Each target can then have its own handler hooked up, so we can use, say, memcache for page caching but an SQLite database connection for filter caching. If a target doesn't have a handler defined for it, it uses the handler for the default target. If a given slot doesn't define any targets, then there is only ever one target, default.
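
The fallback rule can be sketched in a few lines. The resolver below is purely illustrative and is not one of the functions proposed in this RFC; the assignment array stands in for what would really be a database lookup.

<?php
// Hypothetical lookup data: which handler class serves which cache target.
$example_assignments = array(
  'default' => 'CacheDatabase',
  'page' => 'CacheMemcache',
);

// Returns the class assigned to $target, falling back to 'default' when
// the target has no handler of its own.
function example_class_for_target(array $assignments, $target = 'default') {
  if (isset($assignments[$target])) {
    return $assignments[$target];
  }
  if (isset($assignments['default'])) {
    return $assignments['default'];
  }
  throw new RuntimeException('No default handler is configured.');
}

// 'page' has its own handler; 'filter' does not and uses the default.
$page_class = example_class_for_target($example_assignments, 'page');
$filter_class = example_class_for_target($example_assignments, 'filter');
?>

Here $page_class ends up as 'CacheMemcache' and $filter_class as 'CacheDatabase'.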

We then define the handlers. Each handler is defined as bound to a specific slot (the first array key), and defines the class for that handler. That class must implement the interface defined in the slot_info hook, which in turn must extend HandlerInterface. We don't care where the handler class or the interface are defined in code; the registry will take care of that for us.

And of course, both hooks have a corresponding alter hook so that other modules can do whatever they need; usually that will mean adding targets (such as views caching).

<?php
function views_slot_info_alter(&$slots) {
  $slots['cache']['targets'][] = 'views';
}
?>

Both hooks are saved to the database in dedicated tables, not in the cache tables. There are two reasons for that. One, if we needed to use the cache to get to the slot/handler info then we couldn't use handlers for the cache. Two, memory. Having to load and de-serialize those arrays is not cheap, particularly on memory. The menu system is a much better model here.

We then have our cache Interface and our cache implementations, like so:

<?php
interface CacheInterface extends HandlerInterface {
  public function get($cid);
  public function set($cid, $data, $expire = CACHE_PERMANENT, $headers = NULL);
  public function clear($cid = NULL, $table = NULL, $wildcard = FALSE);
}
class CacheDatabase implements CacheInterface {
  protected $properties = array();
  public function setProperty($var, $val) { /* ... */ }
  public function get($cid) { /* ... */ }
  public function set($cid, $data, $expire = CACHE_PERMANENT, $headers = NULL) { /* ... */ }
  public function clear($cid = NULL, $table = NULL, $wildcard = FALSE) { /* ... */ }
}
?>

Again, these can live virtually anywhere, although presumably provided by a module, because the registry will be able to find them and load them when needed. However, it means fewer things in /includes and therefore more things that can be put into modules where they belong.

Finally, there is the factory function. All it does is create and return singletons for the appropriate target. In the interest of simplicity, we require all factories to have the same function signature. We also have a "factory factory" for indirect access.

A what? A factory factory is a factory that returns factories, of course! If that doesn't make any sense, think of it as module_invoke but for handlers. In fact, we'll even name it the same way.

<?php
// The "factory factory": look up the factory function registered for a slot,
// lazy-load it if necessary, and ask it for the handler for $target.
function handler_invoke($slot, $target = 'default') {
  $function = get_factory_for($slot);
  if (drupal_function_exists($function)) {
    return $function($target);
  }
  return NULL;
}
// The factory for the cache slot. Keeps one handler object per target.
function cache($target = 'default') {
  static $targets = array();
  if (empty($targets[$target])) {
    $class = get_class_for_target($target);
    $driver = new $class();
    // If there were any properties, this would make more sense.
    $driver->setProperty('thingA', variable_get('thingA', 'stuff'));
    $driver->setProperty('thingB', variable_get('thingB', 'morestuff'));
    $targets[$target] = $driver;
  }
  return $targets[$target];
}
?>

A real implementation would include more error checking, of course, but you get the idea. The two magic functions listed above, get_factory_for() and get_class_for_target(), still have to be figured out. They could not use the cache or variable systems, only the database directly. That should be fine, however, and a reasonably expedient implementation could, I have no doubt, be written.

We could then call the cache system in one of two ways:

<?php
handler_invoke('cache', 'page')->get($cid);
cache('page')->get($cid);
?>

The latter is more self-documenting and easier to read, but the former will do a lazy-load for us. If the factory function is expected to already be loaded, go ahead and use the direct version. If not, use the indirect to be sure. Because PHP 5 passes objects around by handle, so that they always behave as if they are passed by reference, everything still works.

Explanation and discussion

Note that while here we're implementing a singleton for each target, that is not strictly required by the interface. If it made sense to, we could recreate the object each time. In that case we're using it as a more traditional factory, but that's fine. Hooks pull double duty quite well (hook_nodeapi vs. registry hooks), so handlers can, too.

It is also important to note that a slot is defined only by an interface; no parent class is imposed on handlers. That leaves the concrete implementations free to extend whatever class makes sense. Say, all of the Cache implementations could share some code from a parent CacheGeneral abstract class; or if we are defining an interface that's very SOAP-like, and want to extend the PHP-native SoapClient class and tack on the extra interface, we can.
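
Here is one way the shared-parent case might look. CacheGeneral is the name used above; its contents, the prefixed() helper, the 'prefix' property, and ExampleCacheArray are assumptions made for this sketch. The interfaces are repeated, and CACHE_PERMANENT defined, only so the snippet runs on its own.

<?php
if (!defined('CACHE_PERMANENT')) {
  define('CACHE_PERMANENT', 0);
}
interface HandlerInterface {
  public function setProperty($var, $val);
}
interface CacheInterface extends HandlerInterface {
  public function get($cid);
  public function set($cid, $data, $expire = CACHE_PERMANENT, $headers = NULL);
  public function clear($cid = NULL, $table = NULL, $wildcard = FALSE);
}

// Shared plumbing lives in the parent; the slot only knows the interface.
abstract class CacheGeneral implements CacheInterface {
  protected $properties = array();
  public function setProperty($var, $val) {
    $this->properties[$var] = $val;
  }
  // Hypothetical helper shared by all children: namespace a cache ID.
  protected function prefixed($cid) {
    $prefix = isset($this->properties['prefix']) ? $this->properties['prefix'] : '';
    return $prefix . $cid;
  }
}

// Hypothetical concrete handler that keeps entries in a PHP array.
class ExampleCacheArray extends CacheGeneral {
  protected $data = array();
  public function get($cid) {
    $key = $this->prefixed($cid);
    // Assumption: FALSE signals a cache miss.
    return isset($this->data[$key]) ? $this->data[$key] : FALSE;
  }
  public function set($cid, $data, $expire = CACHE_PERMANENT, $headers = NULL) {
    // Expiry and headers are ignored in this sketch.
    $this->data[$this->prefixed($cid)] = $data;
  }
  public function clear($cid = NULL, $table = NULL, $wildcard = FALSE) {
    if ($cid === NULL) {
      $this->data = array();
    }
    else {
      unset($this->data[$this->prefixed($cid)]);
    }
  }
}

$cache = new ExampleCacheArray();
$cache->setProperty('prefix', 'site1:');
$cache->set('foo', 'bar');
$value = $cache->get('foo');
?>

Code that type-hints on CacheInterface neither knows nor cares that a CacheGeneral sits in between.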

Also note that because PHP is weakly typed, it is true that there is no reason why we must use interfaces at all. They are simply syntactic sugar. I like sugar. :-) They act as a form of syntactic self-documentation. They also ease development because you know, syntactically, at the compiler level, what you need to implement. If you don't, PHP itself will yell at you before you introduce bugs.

Because properties are assigned rather than pulled in internally, we have control over the environment of the handler. There will be only a single set of unit tests needed for each slot, and writing a new handler is as simple as implementing the interface and then banging on it until it passes all of the already-existing tests.
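
As a rough illustration of "one set of tests per slot", the sketch below runs the same checks against any object that implements CacheInterface. The checks themselves, the function example_cache_slot_checks(), the class ExampleCacheMemory, and the rule that a miss returns FALSE are all assumptions for the example, not part of the proposal. The interfaces are repeated so the snippet is self-contained.

<?php
if (!defined('CACHE_PERMANENT')) {
  define('CACHE_PERMANENT', 0);
}
interface HandlerInterface {
  public function setProperty($var, $val);
}
interface CacheInterface extends HandlerInterface {
  public function get($cid);
  public function set($cid, $data, $expire = CACHE_PERMANENT, $headers = NULL);
  public function clear($cid = NULL, $table = NULL, $wildcard = FALSE);
}

// Hypothetical handler under test: stores entries in a PHP array.
class ExampleCacheMemory implements CacheInterface {
  protected $properties = array();
  protected $data = array();
  public function setProperty($var, $val) {
    $this->properties[$var] = $val;
  }
  public function get($cid) {
    return array_key_exists($cid, $this->data) ? $this->data[$cid] : FALSE;
  }
  public function set($cid, $data, $expire = CACHE_PERMANENT, $headers = NULL) {
    $this->data[$cid] = $data;
  }
  public function clear($cid = NULL, $table = NULL, $wildcard = FALSE) {
    if ($cid === NULL) {
      $this->data = array();
    }
    else {
      unset($this->data[$cid]);
    }
  }
}

// One set of checks for the whole slot. Returns a list of failure messages;
// an empty list means the handler passed.
function example_cache_slot_checks(CacheInterface $cache) {
  $failures = array();
  if ($cache->get('missing') !== FALSE) {
    $failures[] = 'a miss should return FALSE';
  }
  $cache->set('answer', 42);
  if ($cache->get('answer') !== 42) {
    $failures[] = 'set() followed by get() should round-trip';
  }
  $cache->clear('answer');
  if ($cache->get('answer') !== FALSE) {
    $failures[] = 'clear() should remove the entry';
  }
  return $failures;
}

$failures = example_cache_slot_checks(new ExampleCacheMemory());
?>

A new handler author would pass an instance of their own class to the same function and keep banging on it until the list comes back empty.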

There is also a performance benefit to using objects here. Consider an image handling system. You could very easily be calling 20-30 operations on a given implementation. While you could make a fresh call for each:

<?php
image('default')->drawLine();
image('default')->drawCircle();
image('default')->scaleBy(2);
// ...
?>

That's a lot of extra function calls. Instead, you can simply grab the object once and save yourself a lot of function calls and redundant target definition:

<?php
$image = image('default');
$image->drawLine();
$image->drawCircle();
$image->scaleBy(2);
?>

In fact, when using a non-singleton handler that would be preferred, since each call to image() would give you a new object anyway.

Some systems may be small enough that they don't really need a full class; just a function will do. In those cases, the added overhead of a single-method class is tiny compared to the simplicity gained by not having to deal with both function and class options. I'm not even convinced that it would be noticeable at all.

Implementation

For implementation, I would recommend building the above structure and implementing it in just one subsystem, the password handling. That's a system that is only needed if the database is active anyway, and is small enough that conversion is easy as a proof of concept.

After the main system is in, we can convert other systems as they seem logical to do so, in parallel. Presumably by the time we get to things like the cache system, the new database layer will have landed and we'll have added an SQLite driver and a table replicator, so we can leverage a non-database database for those lookups transparently. And all will be right with the world.

Request for Comments

I now don my flame-retardant suit and throw the above architectural proposal out to the Drupal community for consideration. (And any PHP architectural gurus, too!)

Brilliant stuff!

I always appreciate your in-depth discussions of design patterns in Drupal, where do you find the time to pump out so many?!? The Larry Garfield pattern is an Overloaded Observer Factory (sorry).

But on to the topic, I had a brief discussion with chx on a train in barca, where I was talking about wanting a module_invoke_all "kill switch". This is not really a good design pattern, but I sometimes run into an issue where I want a given module to let other modules run, but then to also be able to decide that I don't want any modules to take this hook after I've got it. This of course requires doing the weight juggling dance which I hate in system, but sometimes it makes sense. How would a case like this be handled?

In my case it was NAT and nodeapi I believe. I had a complicated requirement where I wanted NAT on a vocab, but not to have it fire under some circumstances. What this meant was that I had to actually set a variable called $node->no_nat, and hack nat to not run when present. It would have been nicer to just lower NAT's weight and then instruct the module_invoke_all handler to either just skip that one, or stop altogether and return.

Neato

This kind of stuff would make a lot of sense for Version Control API backends too... only that "don't use variable_get/set()" won't run well on those because there's considerable amounts of VCS specific configuration. But I probably misunderstood this anyways, and maybe you meant something along the lines of "don't use variable_get/set() for data that is used in the same way by each handler".

Anyways, great stuff, I love reading your articles on design and abstraction :)

That's what properties are for

The goal of avoiding variable_get()/variable_set() (as well as globals, callouts to other handlers, and anything else that makes the system less testable) is to avoid any interaction with the outside world that is not happening through a narrow, easily-controllable, easily-testable pinhole.

Version Control API is also a great match for this setup, I agree. If it has a lot of configuration, then all of that configuration happens in the factory. You'd just have a lot more than 2 properties, set them up in the factory with setProperty(), and then internally just reference those values instead of re-calling variable_get(). That makes it easier to unit-test new Version Control API backends, because the object is more loosely coupled.

Ok, got it

The difference is not that I've got a lot more properties, but that those properties differ between the various backends. For example, the CVS backend needs a list of CVS modules in the repository whereas the SVN backend wants to know where the trunk, branches and tags directories are.

So my issue was that you were defining properties per slot, that is, "one configuration to rule them all". That's not quite feasible for Version Control API backends, and the factory can't set up stuff that it doesn't know about as it only handles the generic parts.

If it's just for testability, I guess that issue could be resolved by splitting the configuration and "runtime" parts, where one could have a variable_*() based, admin-form configurable property object on the one hand and a hard-coded test property object on the other one. Inversion of control would then insert either of those ($svn_backend->setProperty('svn_specific_settings', $hardcoded_svn_specific_settings)).

P.S.: One of these days, we should think of a policy for C style lower-case variable names vs. camel casing, because that slowly grows more annoying as the amount of object oriented code in Drupal increases.

Interesting

P.S.: One of these days, we should think of a policy for C style lower-case variable names vs. camel casing, because that slowly grows more annoying as the amount of object oriented code in Drupal increases.

For Drupal 7 and on, there is a standard: Follow PHP's lead. PHP language standard, as implemented by the engine, is function_name(), ClassName::methodName(). Drupal should follow that. There will be camel case classes and methods used in Drupal if we make use of any of PHP's native classes such as SPL, so embrace the language and go with it.

As to your other point, hm, that is a tricky question. On the one hand, I am not suggesting we make it impossible for handlers to call variable_get(), just a convention that we don't do so, since it hurts testability and modularity. And sometimes we won't be able to fully decouple a subsystem; the CacheDatabase handler, for instance, will rather need access to the database. :-)

On the other, I'd hate to drop the idea of fully self-contained handlers so easily. You do point out a valid use case. I am not entirely sure I follow your proposal, however. The only semi-automated way I can see to make that work is for each handler definition to also define additional properties that only it needs, and then rely on the factory to populate those values out of the variable table. That could be factored out to a utility function, or worked into a method somehow. The problem is that we're then binding such properties to the variable system, which is post-database only. I don't know if we have any pre-database systems that would need custom configuration of that sort. Probably not but it's still worth noting.

Can anyone think of a cleaner solution?

indeed

yes, i think this makes good sense. multiple implementations at once is a terrific goal. sqlite should be very helpful as you say.

@jacob - now that we have a code registry it should be easy to add weight for each hook implementation. we intend to follow the before/after model chx showed at http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/chx/weight…. patches welcome.

Larry, this is an excellent

Larry, this is an excellent write-up and sounds very interesting. I agree with pretty much everything you said here: It gives a lot more flexibility to e.g. the cache system (e.g. different handlers for different targets), and, especially, standardization - combining about 5 similar-yet-different APIs into one clear structure is definitely the Right Way (tm). The general layout you propose appears very clean to me.
I will try to find some free time to actually participate in this code-wise. I would definitely be interested.

A Thought-provoking Proposal

I like the proposal, and am enthusiastic about a few things.

First, I think the factory pattern is the right choice for this sort of subsystem. One of the things that most often drives me crazy with PHP applications is the hoops developers go through to *not* use the factory pattern. In part, I attribute this to the "un-codish" way that factories tend to look in PHP compared to, say, Java, Python, or Ruby.

Second, I am very happy to see the suggestion that interfaces be used. You mentioned traceability as one good reason for using interfaces. But another reason is the contract nature of an interface. Even in a weakly typed language, the presence of the interface goes a long way toward guaranteeing that the library does what it is supposed to.

Again, this "contract" thing underscores the very heart of your proposal. Modules implementing hooks are not "promising" to do anything. But a handler is. That is, as I understand it, one of the key differences (if not THE key difference).

I'm a little worried about the lazy load vs. non-lazy version you presented above. While having a cache() function might indeed be more self-documenting, it introduces an ambiguity into the code (is cache initialized?) while simultaneously introducing an inefficiency (requiring that the cache be initialized even if it's not needed).

Lazy loading has two advantages: (1) the developer doesn't have to ask whether the load has already happened (which means less boilerplate code), and (2) it is clearly more efficient -- especially when using a pattern like Singleton (or some of the related wrapper-style patterns).

That said, I don't see why cache() could not just be a convenience function for wrapping the handler_invoke() method.

(I might just be misunderstanding the difference between the two -- I'm having a hard time imagining how exactly the two mystery functions will work -- especially get_class_for_target().)

Wow... this is the most exciting post I've seen in a long time. Since working with the Drupal 6 mail system I've been thinking about this. But I'd come nowhere near a solution like this.

Backward

I think you're getting the twin factory functions backward, based on this:

That said, I don't see why cache() could not just be a convenience function for wrapping the handler_invoke() method.

handler_invoke() is a convenience wrapper for cache(), not vice versa. The factory function for different slots (I still need a better name there...) may be very different. We can't generalize that into one function to rule them all. So the primary mechanism for accessing a given slot's registered handler is via the direct factory function.

However!

It may be the case that the factory function is not already loaded. If so, it can trivially be loaded with drupal_function_exists().

However!

You may want to not go that route, and keep drupal_function_exists() out of your business logic code. If so, you can use handler_invoke() which is a very simple wrapper for it. Use whichever makes sense in your case; both have the exact same effect.

Does that make more sense?

And yes, the mail system is another good target for this architecture. :-)

Handlers are Very Welcome

As you noted, hooks are very procedural and are inherently limited. It's good to implement handlers and delegates for a variety of reasons. They are OO and support interfaces and polymorphism. It's loads more flexible as you can call them at will from different places in your code. I imagine this will make programming easier in situations where you need delegates and events.