Handlers in core: Concept Needs Review

Submitted by Larry on 14 December 2008 - 7:04pm

Some time ago, I posted an RFC for pluggable "system handlers". It generated a fair bit of feedback, nearly all of it positive. That was followed up with a presentation in Szeged, which generated even more positive feedback.

So what's happened since then? Well, a fair bit. There's working code, but there are still some key gotchas to sort out. That gives us a couple of options for how to proceed, for which I would like feedback, particularly from core developers and maintainers. (Dries, webchick, this means you! :-) )

Overview

First, a brief overview of, conceptually, how Handlers have evolved since my original RFC.

Slot: A slot (I still am not 100% on this name) is a system or subsystem which we want to be easily swappable. "Cache" is a slot, since we want a pluggable cache system. Other such systems include session handling, password generation, image manipulation, email sending, and potentially file storage, path caching, HTTP requests, user name generation, and various others.
Handler: A Handler is a class that implements the interface for a given slot. That is, a Cache Handler is one particular implementation of the cache system and can be transparently swapped out for another Cache Handler. A Handler communicates with the outside world only through an Environment object.
Environment object: Jacob Petsovits noted a key problem with the original RFC that setting properties on a handler was far too limiting, as different handlers might care about different variables and such. The solution that I developed is to pass in to each handler an "environment object", that is, a front-end to accessing the rest of Drupal. All variable_get() calls become $this->env->variableGet(), for example. While that does add more layers of indirection, it does give us the optimal combination of power (handlers can still do anything) and testability. It's also an extremely common pattern in the OO world. See the Szeged presentation linked above for more details on the how and why. (Note: This is different than the "Context" system used by Panels.)
Targets: Here's where it gets complicated. :-) One of the features that I really wanted to include in Handlers is multiple routing. That is, different handlers can be responsible for the same slot depending on the conditions of the action in question. For example, we could wire up page caching to the database but the smaller menu cache to memcache. Or for file storage we could map image files to the local file store and video files to a CDN, but only if they're larger than 1 MB. That gives us a great deal of flexibility in how we configure a Drupal site, and is based on the same logic as database targets in DBTNG. Kudos to perennial source of inspiration Jeff Eaton for a late-night chat at Ogilvie Transportation Center in Chicago that made me realize how this could and should work. Each slot defines its own targets, which can be multi-dimensional.
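
To make the slot / handler / environment relationship a bit more concrete, here is a minimal sketch. Every name in it (EnvironmentInterface, ArrayEnvironment, CacheHandlerInterface, ArrayCacheHandler, the cache_prefix variable) is made up for illustration and is not taken from the handler module. Only the $this->env->variableGet() call shape comes from the description above.

<?php
// Hypothetical sketch; none of these names come from the handler module.

// The environment object: a handler's only route to the rest of Drupal.
interface EnvironmentInterface {
  public function variableGet($name, $default = NULL);
}

// A stand-in environment backed by a plain array rather than variable_get().
class ArrayEnvironment implements EnvironmentInterface {
  protected $variables;

  public function __construct(array $variables = array()) {
    $this->variables = $variables;
  }

  public function variableGet($name, $default = NULL) {
    return array_key_exists($name, $this->variables) ? $this->variables[$name] : $default;
  }
}

// The slot: the interface that every cache handler implements.
interface CacheHandlerInterface {
  public function get($cid);
  public function set($cid, $data);
}

// One handler for the cache slot. It keeps data in memory and reads its
// configuration through the environment object only.
class ArrayCacheHandler implements CacheHandlerInterface {
  protected $env;
  protected $data = array();

  public function __construct(EnvironmentInterface $env) {
    $this->env = $env;
  }

  // Builds the storage key from an (assumed) cache_prefix variable.
  protected function key($cid) {
    return $this->env->variableGet('cache_prefix', '') . $cid;
  }

  // Returns the cached value, or FALSE on a miss (a choice made for this sketch).
  public function get($cid) {
    $key = $this->key($cid);
    return array_key_exists($key, $this->data) ? $this->data[$key] : FALSE;
  }

  public function set($cid, $data) {
    $this->data[$this->key($cid)] = $data;
  }
}

$env = new ArrayEnvironment(array('cache_prefix' => 'site1:'));
$cache = new ArrayCacheHandler($env);
$cache->set('menu', array('home', 'about'));
print_r($cache->get('menu'));
?>

Swapping in a different environment (or a different handler class behind the same interface) requires no change to calling code, which is the whole point of the extra indirection.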

The code in the 2.x branch of the handler module in contrib contains all of the above code, mostly working. However, I've hit a snag with the targets system. Originally in Szeged I had a system that broke all target mappings if anyone changed the mapping parameters. After an all-night brainstorming session with Jeff and Kyle Cunningham at Drupal Camp Chicago (which was awesome), we devised an improved target mapping system that is inspired by the D6 menu system's materialized path logic. That's the code that can be found in the 2.x branch of the handler module now.

How about core?

Sadly, while the multi-routing works, configuring the multi-routing does not. It's just too complex when you allow each slot to define arbitrary targets. Drat. At this point, however, I've decided that too many people who liked the idea are waiting on handlers to do cool stuff (myself included), and perusing the Drupal 7 core issue queue, it is becoming more and more evident that we need something like this in core as soon as possible. Just a few of the issues that seem to cry out for handlers include:

Abstract SimpleTest browser into its own object: How about a curl-based implementation, a raw socket one, a simple string-based one... That way we can have a simple one in core that works everywhere and a curl-based one that simpletest can require.
Abstract session handling to an object: We want swappable session backends, right?
Option to disable IP logging: For sites that need heavy anonymity, we have to trade off the flood control. How do we balance that? Let the site admin decide by swapping out the flood engine.
Pluggable password hashing framework: Peter Wolanin is already embracing this approach, in very-simplified form, for the new pluggable password hashing in D7.
Adaptive path caching: It's really hard to figure out a single path alias caching system that will work for all sites, and even harder to properly test a new approach in the wild. Make that swappable, ship 2 with core (the one-at-a-time method and a load-them-all method) and let new systems be developed independently, then we can move them into core once they've proven themselves in the wild.
Swappable mail systems: One for production, one for development that doesn't actually send, one for silly Windows/PHP servers that do it differently, all configured via radio button.

That's not counting the obvious cases like the cache system, for which any function-based implementation breaks the registry. I'm sure I've forgotten others, too.

Options

I see several possible ways forward at this point. I would like input from all and sundry on which we should take, but especially from the core maintainers who, you know, have final say on these things. :-)

Continue in contrib. Do nothing in core for now, try to get Handlers 2 sorted out in contrib, then revisit the question again in a year for Drupal 8. Given the number of important threads above where we really need a system like this, I don't like this option.
Fast-track it. Pour brain power into solving the remaining issues with Handlers 2, then move that into core. Nice as this would be, I'm not entirely sure that throwing brains at the problem will fix it in a timely manner. Plus, many of our big brains are otherwise occupied with important matters, such as Fields in Core.
Drop multi-routing. Cut back the functionality to just have a single active handler per slot. No multi-routing, no multi-dimensional routing. The entire cache system uses a single handler, but we can then still easily swap out the cache system, or anything else using handlers. We can then revisit the multi-routing problem another time, possibly in contrib during the D8 dev cycle.
No formal handlers, just a pattern. This is what the password system is looking to do right now, with the pending RTBC patch. Rather than a separate index for handlers and explicit hooks, it just uses the variable system to store which password engine is active and loads a class. There's an interface, but no common interface for any handlers. There's also no environment object.

Personally I favor option #3. Most subsystems don't actually need multi-routing, and fewer still need multi-dimensional routing. The code needed for #3 already exists in the handlers module, and can be moved into core fairly easily (or as easy as it is to get anything new into core). It also maintains the explicit definition of slots and handlers and, most importantly, the environment object. I can't over-state how important that extra layer of indirection is toward making it easier to develop and test new systems for Drupal. The more loosely-coupled Drupal's subsystems are, the better for everyone.

While option #4 is the path of least resistance, I don't think it's the best alternative. We lose any sort of standardization, self-documentation, and much of the encapsulation (from the environment object) by going that route. It also doesn't give us any natural upgrade path to "full handlers" once all the multi-routing weirdness is sorted out. By introducing the slot/handler/environment structure now, we can add targets back into the mix later once they've matured some more, possibly even via contrib.

We are also then dependent on the variable system for our configuration, typically, which means on the database. Ideally we want the cache to be able to initialize and run without hitting the database so that we can have an entirely memcache-served site. That, however, does require that we have a different way of getting to the handler configuration, as well as at least part of the registry available without the database. (settings.php is ideal for that.) Both are solvable if we have a separate system.
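
As a rough illustration of option #3 with configuration that never touches the database, a lookup along these lines would be enough. The $conf['handlers'] key, the HandlerLookup class, and the two stub cache classes are assumptions invented for this sketch; in a real site the array would come from settings.php.

<?php
// Hypothetical sketch of one active handler per slot, configured from an
// array such as settings.php could provide. No database is involved.

class DatabaseCacheStub {
  public function backend() {
    return 'database';
  }
}

class MemcacheCacheStub {
  public function backend() {
    return 'memcache';
  }
}

class HandlerLookup {
  protected $conf;
  protected $defaults;
  protected $instances = array();

  public function __construct(array $conf, array $defaults) {
    $this->conf = $conf;
    $this->defaults = $defaults;
  }

  // Returns the single active handler for a slot, creating it on first use.
  public function get($slot) {
    if (!isset($this->instances[$slot])) {
      if (isset($this->conf['handlers'][$slot])) {
        $class = $this->conf['handlers'][$slot];
      }
      elseif (isset($this->defaults[$slot])) {
        $class = $this->defaults[$slot];
      }
      else {
        throw new InvalidArgumentException("Unknown slot: $slot");
      }
      $this->instances[$slot] = new $class();
    }
    return $this->instances[$slot];
  }
}

// What a site admin might put in settings.php (assumed key name).
$conf = array('handlers' => array('cache' => 'MemcacheCacheStub'));
// What core would ship as defaults.
$defaults = array('cache' => 'DatabaseCacheStub');

$lookup = new HandlerLookup($conf, $defaults);
print $lookup->get('cache')->backend() . "\n";
?>

With an empty $conf the same call falls back to the default class, so a site only needs to mention the slots it actually overrides.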

Request for comments

So, there we are. How should we proceed? Who is willing to help with any of the above approaches if we go with them? Is this all just a pipe dream, or can we get some traction to allow new Drupal subsystems to be developed and improved faster, with better unit testing, and more admin configuration all in one package?

And what the heck do we call slots other than "slot"? :-)

alternatives to "slot"

just a stream of ideas here:

system
modus
pattern
scheme
policy
structure
entity
complex (noun)
economy
organism
regimen
strategy
theory

Regarding "slot" - just have

Regarding "slot" - just have a look at how many times you wrote "subsystem", which is, IMHO, the proper and self-explanatory term. Calling it differently feels like overengineering, especially since the term subsystem is already used by many Drupalers. "Component" would be an alternative, but isn't as descriptive as subsystem. Also project issue components for the Drupal project are already called "xyz system" (e.g. "theme system").

The rest is rather above my head, but I'd vote for option #5, basically a slight variation of #2:

Just do it, and defer Drupal 7 until it's done. :)

"subsystem" +1

I agree with sun: the most logical name would be "subsystem". It's the first thing that crossed my mind. It makes most sense because "subsystem" implies that it's part of a larger whole (here Drupal itself), is self-contained (the "system" part) and is replaceable (the "sub" part).

Also, "slot" sucks, because in Qt (the C++ toolkit in which KDE is written), each slot can be connected to multiple "signals". There's no "signal/slot" concept here at all, so that might confuse Qt developers who also happen to be Drupal developers (as far as I know, only snufkin (Balazs Dianiska) and myself).

So, "subsystem" +1 :)

I don't have the time to think about this more deeply, but I wholeheartedly agree with your reasoning.

Hm

I've seen jpetso on Planet KDE, too, so I think there's a few others floating around. :-) I did get the name slot from Qt, but I agree it's a poor match to what they mean by slot.

I suppose I could deal with hook_subsystem(), but that only covers half of what Handlers could be used for. Something like username generation or password hashing is rather small to warrant the name subsystem, but can use the exact same mechanism. (That's the reason handlers are so potentially powerful.)

Slots

If I've understood correctly, then your 'slots' are just drivers? Just like how in database abstraction layers you have drivers for MySQL, MSSQL, Oracle etc. and just like how in some payment abstraction layers you have drivers for PayPal, 2Checkout, Google Checkout etc.

Driver == handler

"Driver" in this case is closer to the Handler than to the slot. The slot is the dohicky for which the Handler is the driver. So we have the database system, which has a MySQL driver (handler), Postgres driver (handler), etc. The database can't use the same architecture as the generic system since the generic system needs to sit atop the database, but the conceptual structure is very very similar to the way DBTNG works.

#3

At this point I'm a fan of #3. It seems like a good place to shoot for in drupal 7. We can expand on it in drupal 8.

Going towards systems like this makes sense. A lot of drupal sites use theme hacks to achieve the same type of thing, which is a hack to do through the theme system.

Another case is to look at preprocessing. Different people want to do js preprocessing different ways. Some have hacked core to add compressing in. Some have developed modules like the sf cache module.

At the very least we need a consistent pattern (see #4).

Since I'm planning on writing the js preprocessing pluggable subsystem I'll throw some time at this patch.

Handlers

Swappable Mail System: http://drupal.org/node/331180

I also think we should apply the same thing to cache_set and cache_get. Not quite a full class, but allow them to swap functionality, so things like memcache and advanced cache could swap out the functionality.

Already there, but broken

I'll add that link to the original article, thanks. :-)

We can already swap out the cache system with alternate include files that declare the same functions. That's the problem. That mechanism fundamentally breaks the registry because then the same function is defined twice. The Handlers/class approach is intended to fix that problem in a clean and standardized fashion. It would become cache()->get() and cache()->set().
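
A minimal sketch of what that could look like, assuming a hypothetical CacheBackendInterface and an in-memory ArrayCacheBackend (neither is real core code). The point is that an alternate backend is a differently named class behind the same interface, so no function is ever declared twice:

<?php
// Hypothetical sketch; the interface and class names are invented.

interface CacheBackendInterface {
  public function get($cid);
  public function set($cid, $data);
}

class ArrayCacheBackend implements CacheBackendInterface {
  protected $items = array();

  // Returns the cached value, or FALSE on a miss (a choice made for this sketch).
  public function get($cid) {
    return array_key_exists($cid, $this->items) ? $this->items[$cid] : FALSE;
  }

  public function set($cid, $data) {
    $this->items[$cid] = $data;
  }
}

// Factory function: the only cache-related function that gets declared.
// The class is hard-coded here; a real version would look up the active one.
function cache() {
  static $backend = NULL;
  if ($backend === NULL) {
    $backend = new ArrayCacheBackend();
  }
  return $backend;
}

cache()->set('greeting', 'hello');
print cache()->get('greeting') . "\n";
?>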

Search is waiting for something like this too

I got partway towards refactoring core search to accept various handlers here: http://drupal.org/node/282192. I implemented it for a single handler at a time, though, and to keep existing functionality we would need to support multiple.

Multiple

Well, multiple targets add a fair bit of complexity to the issue. I had them in the original design, then dropped that in favor of multi-dimensional targeting because I saw very quickly that we would need that for some use case. That of course ran into a considerable amount of complexity that I've not yet fully solved. I agree that search is another very good use case, especially with the push toward external search services like Solr, Sphinx, etc.

We could potentially have a core version that supports just one and only one target dimension, a la DBTNG. I'd really want to implement it in a forward-compatible way, though, and I don't know if that would keep us from being able to do it properly and in a reasonable amount of time.

Bah. :-)

Standardize

I think we should standardize the way handlers are, well, handled, much like what's in your Handler module. Instead of having something like cache()->get() and cache()->set(), where cache() and mail() become factory functions that each decide how their own handlers are managed, what if we took one more step back and had a standard factory function:

<?php
// Retrieve the cache system using the default cache handler.
$cache = system('cache');
$cache->get(); $cache->set();
// Retrieve the mail system using the default mail handler.
$mail = system('mail');
$mail->send(); // etc
// Explicitly retrieve the cache system that implements memcache.
$cache = system('cache', 'memcache');
$cache->get(); $cache->set();
?>

Along with this system() factory function, which statically maintains the instances of the handlers, we have a new hook_system(), which retrieves which handlers are available from each module.

<?php
// System module hook_system... We are defining a new system.
function system_system() {
  return array(
    'cache' => array( // A cache system
      'pdo' => 'SystemCache', // class SystemCache to use PDO
    ),
  );
}
// Memcache module hook_system
function memcache_system() {
  return array(
    'cache' => array( // Cache systems
      'memcache' => 'Memcache', // class Memcache
    ),
  );
}
// SMTP module hook_system
function smtp_system() {
  return array(
    'mail' => array( // Any mail systems.
      'smtp' => 'SmtpMailSystem', // class SmtpMailSystem
    ),
  );
}
?>

I really think we should consolidate how these handlers are managed, and how the targets are routed.
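
For what it's worth, here is a rough sketch of how such a factory could consume those hook_system() arrays. One caveat: PHP already ships a built-in system() function (it runs shell commands), so a user-defined function cannot take that name; the sketch calls it subsystem() instead. The two example*_system() functions stand in for real hook implementations, array_merge_recursive() stands in for module_invoke_all(), and "the first registered handler is the default" is purely an assumption.

<?php
// Hypothetical sketch; class and function names are invented.

class ExamplePdoCache {
  public function backend() {
    return 'pdo';
  }
}

class ExampleMemcacheCache {
  public function backend() {
    return 'memcache';
  }
}

// Stand-ins for two modules' hook_system() implementations.
function examplecore_system() {
  return array('cache' => array('pdo' => 'ExamplePdoCache'));
}

function examplememcache_system() {
  return array('cache' => array('memcache' => 'ExampleMemcacheCache'));
}

// Generic factory that statically keeps one instance per slot and handler key.
function subsystem($slot, $key = NULL) {
  static $instances = array();
  // Stand-in for collecting every module's hook_system() result.
  $registry = array_merge_recursive(examplecore_system(), examplememcache_system());
  if (empty($registry[$slot])) {
    throw new InvalidArgumentException("No handlers registered for: $slot");
  }
  if ($key === NULL) {
    // Assumption: the first registered handler is the default.
    $keys = array_keys($registry[$slot]);
    $key = $keys[0];
  }
  if (!isset($registry[$slot][$key])) {
    throw new InvalidArgumentException("Unknown handler: $slot/$key");
  }
  if (!isset($instances[$slot][$key])) {
    $class = $registry[$slot][$key];
    $instances[$slot][$key] = new $class();
  }
  return $instances[$slot][$key];
}

print subsystem('cache')->backend() . "\n";
print subsystem('cache', 'memcache')->backend() . "\n";
?>

Note that if two modules registered the same handler key, array_merge_recursive() would turn that entry into an array of class names, so a real implementation would need a proper conflict rule.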

Extra Step

Doing something like:

<?php
  $cache = system('cache');
  $cache->get(); $cache->set();
?>

seems like there is an extra step. I have some code that works like this now and it's an annoyance to have to get the object when I need it to use it. Something better would be:

<?php
  system('cache')->get();
  system('cache')->set();
?>

But, even that doesn't do it for me.

It might be good to have a generic system like this and have specific use cases like cache which have their own factory function. It may even be a wrapper around something like system(). Thoughts?

Factory factory

My original RFC covered this as well, with a factory factory called handler_invoke() or as I ended up implementing it just handler(). The idea is that you can call the factory function for that subsystem, or the generic handler factory which will subcall to the factory function after doing a drupal_function_exists() to ensure that it's loaded. So the following would be equivalent:

<?php
handler('cache')->get();
cache()->get();
?>

It's NOT that cache() wraps handler(). handler() wraps cache(). See this comment in the earlier thread. :-) Within a handler, however, you'd call $this->env->handler('cache'), so that you have a fully abstracted path to that other subsystem.
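
Roughly, and only as a sketch: the list of known slots and the stub classes below are assumptions, and PHP's plain function_exists() stands in for drupal_function_exists(), which would also take care of loading the file.

<?php
// Hypothetical sketch; StubCache, StubEnvironment and $known_slots are invented.

class StubCache {
  protected $items = array();

  public function get($cid) {
    return array_key_exists($cid, $this->items) ? $this->items[$cid] : FALSE;
  }

  public function set($cid, $data) {
    $this->items[$cid] = $data;
  }
}

// Per-subsystem factory function.
function cache() {
  static $instance = NULL;
  if ($instance === NULL) {
    $instance = new StubCache();
  }
  return $instance;
}

// Generic factory: checks that the slot's factory function is available,
// then calls it. handler() wraps cache(), not the other way around.
function handler($slot) {
  // Assumption: slots are declared somewhere, so arbitrary function names
  // cannot be invoked through this lookup.
  $known_slots = array('cache');
  if (!in_array($slot, $known_slots, TRUE) || !function_exists($slot)) {
    throw new InvalidArgumentException("Unknown slot: $slot");
  }
  return $slot();
}

// Inside a handler, the same lookup goes through the environment object.
class StubEnvironment {
  public function handler($slot) {
    return handler($slot);
  }
}

handler('cache')->set('answer', 42);
print cache()->get('answer') . "\n";
?>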

So everything in Rob's post above is already taken care of. :-)

Thoughts

So in principle, I support this initiative 110%.

I spend a lot of my time professionally standing up in front of a room full of new Drupal developers and explaining to them how Drupal works. It always pains me to say things like, "This is how such and such works. Oh, EXCEPT..." The fact that all of these various subsystems each have their own unique swappable interfaces is the quintessential example of this, and standardizing these is a worthy goal.

The multiple routing targets thing is interesting (once you explained it to me 40 times ;)), but I agree with the desire to punt this until later and get the basic framework for standardizing these things into Drupal 7 in the meantime, which still buys us quite a lot.

However, in order to evaluate this further, I need to spend some time studying the code syntax, as I'm a "visual" person in that respect. A diagram of some kind wouldn't hurt either. :D But you have at least one core committer's thumbs-up for the general idea. We'll have to wait and see what Dries says though, since he *definitely* is the one who needs to take a look at this, given his more in-depth understanding of object-orientation and of Drupal's underlying subsystems.

Just some food for thought, though.

At the time I responded to this, there were 14 comments. Of these, 6 were talking about naming conventions around slots vs. subsystems vs. X, and 5 more were about other places where subsystems might be needed and general agreement on approach. Only *3* of these comments were critiquing/discussing the deeper fundamentals of the proposal. This says something to me. I think this indicates that this concept is still hard for people to grasp, and so they are focusing instead on bits that make more sense to them, since that is a place they can more easily help out. I know that's what I'm doing right now, at least. ;)

We need to consider with this and any other major architectural proposals that a huge portion of our user-base (I would estimate 80%) are at the hobbyist, self-taught, "know enough to be dangerous" level of PHP, and are *not* hardcore geeks with years of academic study in computer science. We're already asking a lot of these folks in D7 with DBTNG and the registry system. For every new thing we throw at people we need to compensate with N "killer features" to justify people taking the plunge.

So basically, let's be very careful when designing the syntax for these systems that we allow people to re-use knowledge as much as possible. $env brings a lot of advantages, but it's a barrier to people who know PHP and are used to using straight-up $_GET and $_POST: their knowledge is no longer transferable and they need to learn a "Drupalism." Eaton suggested retaining wrapper functions for frequently-used subsystems (e.g., cache_get() and cache_set() vs. cache()->get() and cache()->set()). I don't know the precise answer to these, but basically we should really stop and consider the DX impact of these changes and how we can help ease those hobbyists into "real world" development.

Further explanation

Well, the natural diagram for this would be UML, but if you're uncomfortable with OO and "real world" development then UML may not help you all that much. :-)

I share your concern for keeping the developer learning curve low. That's why for Handlers I'm proposing a fairly simple, very "shallow" (as in, no deep inheritance trees) approach. Arguably, hooks as Drupal implements them are more unconventional and weird than the factory pattern in OO. The way factory functions work is also very close to the DBTNG approach to db_query() (which internally is already using a factory, singleton, and all the same stuff as here), so it's only half a new concept to learn if people are already learning DBTNG.

I also disagree that one needs to be a "hardcore geek with years of academic study in computer science" in order to understand OO. Some uses of OO, sure. Some OO systems are a complete nightmare in that regard. :-) But in general I am more optimistic than you about the ability of the "average developer" to pick up a class, interface, and function that returns an object. Especially given the number of times we hear "Drupal isn't really modular because it has no classes", there's probably a fair number of people out there who don't know how to deal with Drupal because it's NOT OO, too.

$env is a bit stranger at first glance, I agree. However, consider the flipside. How often have you tried to repurpose some existing code that happens to rely on arg() or $_GET, and find that you can't because it's assuming that it's being called on one very specific page in one very specific circumstance. It happens to me far too often. :-) Using $env for handlers is the exact same logic as never using arg() inside a page handler but relying on more robust menu callback arguments instead. (I recall that being a major push in Drupal 5, and the Drupal 6 menu system goes a step further still.)
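
A small sketch of that argument. PagerHandler, RequestEnvironmentInterface, and queryParameter() are invented for the example and are not part of the proposal's actual API.

<?php
// Hypothetical sketch; all names here are invented for illustration.

interface RequestEnvironmentInterface {
  public function queryParameter($name, $default = NULL);
}

// Production flavor: answers from the real request.
class GlobalsEnvironment implements RequestEnvironmentInterface {
  public function queryParameter($name, $default = NULL) {
    return isset($_GET[$name]) ? $_GET[$name] : $default;
  }
}

// Test flavor: answers from an array, so no page request is needed.
class FakeEnvironment implements RequestEnvironmentInterface {
  protected $query;

  public function __construct(array $query = array()) {
    $this->query = $query;
  }

  public function queryParameter($name, $default = NULL) {
    return array_key_exists($name, $this->query) ? $this->query[$name] : $default;
  }
}

// The handler never touches $_GET directly, so it makes no assumption
// about which page or circumstance it is being called from.
class PagerHandler {
  protected $env;

  public function __construct(RequestEnvironmentInterface $env) {
    $this->env = $env;
  }

  // Returns the requested page number, never below zero.
  public function currentPage() {
    return max(0, (int) $this->env->queryParameter('page', 0));
  }
}

$pager = new PagerHandler(new FakeEnvironment(array('page' => '3')));
print $pager->currentPage() . "\n";
?>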

Too, I want to be clear that this is not a "let's make Drupal OO" proposal, even if it may sound like it from the discussion. It's a "let's offer another extension mechanism to Drupal in addition to hook, and use OO syntax for it because it's really really easy to do in OO". I predict that only a very small portion of the code in Drupal is going to be handler objects for quite some time, and most of those will be out of the way of most developers. In fact, looking at the possible core use cases listed above the only one that the average module developer is going to run into that often is the cache system and maybe sessions, depending on how that ends up being implemented. Mail possibly, again depending on how it gets implemented.

The vast majority of developers will never interact directly with pluggable path caching engines, simpletest browsers, password hashing mechanisms, IP logging strategies, and so forth. The only people that would are those that want to swap out those subsystems, which is something that in core you cannot do now to begin with (other than manually swapping out include files, which we all agree barely counts), so requiring those people to have basic OO knowledge is not particularly onerous.

Once there is a standardized mechanism in core, however, contrib authors can leverage the same mechanism themselves to make their modules more extensible. I've already spoken to several that really want the ability to offer handlers for their own modules. How far will it go? Well, like any other feature that's up to contrib authors to decide for their use case. I won't even try to predict that. :-)

I don't really see a value to function cache_get() { return cache()->get(); } aside from backward compatibility. The code isn't any more complicated, and chained methods are something that we're going to have to cope with for DBTNG anyway so it's not really a "new" idiom. And if people really want, the two line version still makes perfectly good conceptual sense. "Get me the caching subsystem! OK, caching subsystem, set this!"

If that's a trade-off we have to make in order to get handlers into core, though, I'm willing to live with that for now and see how contrib responds.

This says something to me. I

This says something to me. I think this indicates that this concept is still hard for people to grasp, and so they are focusing instead on bits that make more sense to them,

On the contrary, I think there's a basic agreement in the developer community that a mechanism like this is necessary for the very reasons you mention. This way of implementing it is relatively uncontroversial and doesn't rely heavily on language-specific magic; the tricky details are the configuration stuff and how to frame the concept for developers (i.e., the naming -- naming the 'things' is as important as naming the functions IMO)

Naming, and strategies

One possibility is "Hardpoints" -- on a military aircraft, it's a fixed point on the plane's frame where something can be mounted for a mission. It's a little nerdy, but it feels like a better conceptual analogue than the more abstract 'slots' or the overused 'subsystem'.

The important part, IMO, is to go with option #3: multi-routing as part of the configuration is unnecessary for a first iteration. The example of CacheRouter module in contrib demonstrates that creative developers can come up with 'wrapper' handlers to solve complex needs if it's necessary. The tricky part is getting the consistent mechanism in place to handle the core functionality.
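
To illustrate the 'wrapper' handler idea, here is a sketch of the general technique. It is not CacheRouter's actual code, and BinCacheInterface, ArrayBinCache, and RoutingCache are hypothetical names. Core only ever sees one active cache handler, and that handler fans out to others internally.

<?php
// Hypothetical sketch of a wrapper handler that routes by cache bin.

interface BinCacheInterface {
  public function get($bin, $cid);
  public function set($bin, $cid, $data);
}

// Simple in-memory backend, used twice below to play two different stores.
class ArrayBinCache implements BinCacheInterface {
  protected $items = array();

  public function get($bin, $cid) {
    $hit = isset($this->items[$bin]) && array_key_exists($cid, $this->items[$bin]);
    return $hit ? $this->items[$bin][$cid] : FALSE;
  }

  public function set($bin, $cid, $data) {
    $this->items[$bin][$cid] = $data;
  }
}

// The single active handler: delegates each call to a per-bin backend,
// falling back to a default backend for bins with no explicit route.
class RoutingCache implements BinCacheInterface {
  protected $routes;
  protected $fallback;

  public function __construct(array $routes, BinCacheInterface $fallback) {
    $this->routes = $routes;
    $this->fallback = $fallback;
  }

  protected function backendFor($bin) {
    return isset($this->routes[$bin]) ? $this->routes[$bin] : $this->fallback;
  }

  public function get($bin, $cid) {
    return $this->backendFor($bin)->get($bin, $cid);
  }

  public function set($bin, $cid, $data) {
    $this->backendFor($bin)->set($bin, $cid, $data);
  }
}

$database = new ArrayBinCache();  // pretend this is the database store
$memcache = new ArrayBinCache();  // pretend this is memcache
$cache = new RoutingCache(array('cache_menu' => $memcache), $database);

$cache->set('cache_menu', 'links', 'menu data');
$cache->set('cache_page', 'front', 'page data');
print $cache->get('cache_menu', 'links') . "\n";
?>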

My only real concern is the slow but steady syntax creep that we're forcing new developers to master. As single-purpose functions like "cache_set" and "db_query" and so on are replaced by bits like $this->env->foo->bar_baz() we push people towards one of two options: maintaining more 'bits' in their head when they want to do something simple, or giving up and treating each operation in Drupal as a magic incantation, the words of which are beyond their ken.

I'm not terribly worried about it in one case, but it feels like with D7 we're pushing in that direction everywhere in Drupal. The functionality we're getting is definitely good, but accessibility for hobbyists is critical: it's helped fuel Drupal's growth and it's where the long tail of its community will always lie.

My Thoughts

I really like the concept of multi-routing, and I think that there are some definite real world use cases for it. That being said, I think for now the multi-routing should be dropped. While possible, it adds a lot of complication implementation-wise. It's more important to get these concepts into core than to have every last feature at the ready.

I strongly favor this last

I strongly favor this last solution:

4. No formal handlers, just a pattern. This is what the password system is looking to do right now, with the pending RTBC patch. Rather than a separate index for handlers and explicit hooks, it just uses the variable system to store which password engine is active and loads a class. There's an interface, but no common interface for any handlers. There's also no environment object.

There is really very little actual code to share between handler implementations. The actual routing logic is far different between, say, the mail sending backend and the cache storage. I predict that your one-size-fits-all, overcomplicated, "multi-routing" approach will prove to be good intentions but useless code.

Let's promote the pattern. We don't need the code.

Environmental object = monad

For functional programming enthusiasts, the "environmental object" that Larry describes above is basically a monad. As he emphasizes, they are extremely testable. I support liberal use of them anywhere they're practical.