PHP Group By with Arrays

By far the most common idiom when using SQL from a web application (PHP or otherwise) is simply listing records. The standard logic looks something like this (give or take real templating):

<?php
$result
= mysql_query("SELECT tid, name, size, color FROM things ORDER BY name, size");
print
"<table>\n";
print
"<tr><th>Name</th> <th>Size</th> <th>Color</th></tr>\n";
while (
$record = mysql_fetch_object($result)) {
  print
"<tr>\n";
  print
"<td><a href='viewthing.php?tid={$record->tid}'>{$record->name}</a></td>\n";
  print
"<td>{$record->size}</td>\n";
  print
"<td>{$record->color}</td>\n";
  print
"</tr>\n";
}
print
"</table>\n";
?>

That's all well and good, but in practice can be quite limiting. Why? Because you can't then group records, that is, display not one but several tables, one for each color. SQL, of course, offers a GROUP BY clause. That doesn't do what we want, however. GROUP BY is an aggregate clause, and is used for creating totals and summaries of records. We want to cluster records by a field that is not the ordering field, or a value that is calculated off of the record itself.

I've generally used two different methods for PHP-side grouping, one of them much cleaner and more flexible at the cost of a little performance.

Method 0

I'm including this method as the worst way, for completeness. This can often be the first idea that comes to mind but it's generally poor because it has a variable number of queries.

Let's use our previous example where we have a table containing things that have a name, size, and color, and of course an ID. For readablity, we'll also introduce a new function that renders a given record (as a table row or a list item or whatever) so that we can focus on the logic around it.

<?php
function render($record) {
 
$output = "<tr>\n";
 
$output .= "<td><a href='viewthing.php?tid={$record->tid}'>{$record->name}</a></td>\n";
 
$output .= "<td>{$record->size}</td>\n";
 
$output .= "<td>{$record->color}</td>\n";
 
$output .= "</tr>\n";
  return
$output;
}
?>

Now, here's the lazy method, which you really shouldn't do:

<?php
$colors
= mysql_query("SELECT DISTINCT color from things ORDER BY color");
while (
$color = mysql_fetch_object($colors)) {
 
$result = mysql_query("SELECT * FROM things WHERE color='{$color->color}'ORDER BY name, size");
  print
"<table>\n";
  print
"<caption>{$color->color}</caption>\n";
  print
"<th>Name</th> <th>Size</th> <th>Color</th>\n";
  while (
$record = mysql_fetch_object($result)) {
    print
render($record);
  }
  print
"</table>\n";
}
?>

See why this method is so bad? For n groups, you're always executing n+1 queries. That's an unnecessary performance hit. It also means that if you're summarizing a write-heavy table (say, a logging table) you could easily have your data change between queries from another simultaneous user. That makes any summary reports suspect. If you know you have only a small, fixed number of groups (say, there's only 3 colors in the system, ever) then you can sometimes get away with this method, but it's still not a good idea.

Method 1

The somewhat more performant method involves queuing records and testing edge conditions.

Now, let's build our color-based table. What we'll do is print the start of each list, then when we detect that we've changed our key value we'll print our ending and start over.

<?php
$result
= mysql_query("SELECT tid, name, size, color FROM things ORDER BY color, name, size");
$last_color = '';
$is_first = TRUE;
$output = '';
while (
$record = mysql_fetch_object($result)) {
  if (
$last_color != $record->color) {
    if (!
$is_first) {
      print
"</table>\n";
    }
   
$is_first = FALSE;
    print
"<table>\n";
    print
"<caption>{$last_color}</caption>\n";
    print
"<th>Name</th> <th>Size</th> <th>Color</th>\n";
    print
$output;
   
$output = '';
   
$last_color = $record->color;
  }
 
$output .= render($record);
}
print
"</table>\n";
print
"<table>\n";
print
"<caption>{$last_color}</caption>\n";
print
"<th>Name</th> <th>Size</th> <th>Color</th>\n";
print
$output;
print
"</table>\n";
?>

Take a moment to read that over and see what's going on. We're building up each record until we see that our key value has changed. Then we flush it out, conditionally closing the previous table unless it's the first table in which case the previous table doesn't need to be closed because there isn't one. That last sentence should have made you blink with that many conditionals. And then let us not forget the last iteration, which we need to flush out separately.

While a bit better than n+1 queries, since we're running only a single query, the PHP for it is rather hairy. The actual display code is duplicated, which means changing anything means changing it twice. If the logic is at all more complicated than just a print, then that could be quite a challenge. It's also harder to read; we have two extra control variables and two extra conditionals. Yuck.

Method 2

So what's better? Pre-process the data. PHP makes nested arrays fast and flexible, so let's use them.

<?php
$result
= mysql_query("SELECT tid, name, size, color FROM things ORDER BY color, name, size");
$set = array();
while (
$record = mysql_fetch_object($result)) {
 
$set[$record->color][] = $record;
}
foreach (
$set as $color => $records) {
  print
"<table>\n";
  print
"<caption>{$color}</caption>\n";
  print
"<th>Name</th> <th>Size</th> <th>Color</th>\n";
  foreach (
$records as $record) {
    print
render($record);
  }
  print
"</table>\n";
}
?>

That's certainly much shorter. It's also easier to follow. In part one, we cluster our records however we want. Then we simply foreach() over each one. Output code exists only once, but we stil get the same grouping effect. It's also very extensible. Suppose we now wanted to first group records by the first letter of their name, then by color. All we have to do is add an extra level of nesting.

<?php
$result
= mysql_query("SELECT tid, name, size, color FROM things ORDER BY color, name, size");
$set = array();
while (
$record = mysql_fetch_object($result)) {
 
$set[$record->name[0]][$record->color][] = $record;
}
ksort($set);
foreach (
$set as $letter => $colors) {
  print
"<h2>{$letter}</h2>\n";
  foreach (
$colors as $color => $record) {
    print
"<table>\n";
    print
"<caption>{$color}</caption>\n";
    print
"<tr><th>Name</th> <th>Size</th> <th>Color</th></tr>\n";
    foreach (
$records as $record) {
      print
render($record);
    }
    print
"</table>\n";
  }
}
?>

Note that we're also adding a ksort() line in there. Remember that we're fetching records by color, then name. That means we could get back [blue, shirt], [green, pants], which the while loop will then turn into an array with keys [s, p], which is backwards. The ksort() will properly order the top array while maintaining the key/value associations for the lower arrays.

You only need to add in the extra sorting step if the value you're grouping on is a value you want to display and you want the final lists further down to order by something else. In actual practice, it's quite rare that you really need to do that so you can simply reorder the ORDER BY clause to follow your groupings, then let your SQL engine do the work for you.

But what about the performance cost of that extra iteration? If you're dealing with thousands of records, that could get expensive. While we can't eliminate that hit completely, we can potentially eliminate a different loop instead.

PHP, by default, will actually buffers the results from many types of SQL databases, including MySQL. That is, when you run mysql_query(), what you get back is not a direct connection to the SQL database; PHP issues a query, gets back a reference to a result set, iterates through the entire result set and builds up a PHP resource that is a list of records, then closes the record set on the database server. That makes the mysql_query call a bit slower, but subsequent row fetches are faster since there's no trip to the SQL database. It also means that even if you have multiple result sets available in PHP, there's only one real result set ever open on the SQL server at a time. Many database engines won't allow an individual connection to have more than one result set open at a time, so PHP's buffering allows you as a developer to prentend that you can have as many record sets open as you want.

If we know that we're not going to have to issue another query from within any of the foreach() loops, however, then we can tell PHP to not loop internally since we're going to do it ourselves. In MySQL, that's done using mysql_unbuffered_query():

<?php
$result
= mysql_unbuffered_query("SELECT tid, name, size, color FROM things ORDER BY color, name, size");
$set = array();
while (
$record = mysql_fetch_object($result)) {
 
$set[$record->color][] = $record;
}
mysql_free_result($result);
foreach (
$set as $color => $records) {
  print
"<table>\n";
  print
"<caption>{$color}</caption>\n";
  print
"<tr><th>Name</th> <th>Size</th> <th>Color</th></tr>\n";
  foreach (
$records as $record) {
    print
render($record);
  }
  print
"</table>\n";
}
?>

Note that we are explicitly freeing the $result object when we're done with it. While PHP is nice about cleaning up unused memory when it goes out of scope, we need to be more polite to the MySQL server. By removing some of PHP's intermediary magic we can get a nice performance boost at the cost of features we're not going to be using here anyway. Score.

Happy coding!

Comments

CMS

hey larry, for a content management system, i would recommend Stellent. It's not open source, however. I use it at work and its great.

Tcl IDE

one more question, do you know of an IDE that supports Tcl? Tcl is ridiculously similar to C/C++. I need it for work.

Never used TCL

Well there's a name I've not seen in a while. :-)

I've never worked in TCL, and don't know of any IDEs for it. There could be an Eclipse plugin, possibly; I've not looked.

Also, I already work with a half-dozen CMSes at work. :-) For anything outside of work I focus on Drupal, because it's open source, well built, and extremely flexible. I'm not interested in using non-open source systems unless I have to. Watch this space for an upcoming Drupal site launch that I'm rather proud of.

haha, somehow i found ur

haha, somehow i found ur domain out there in cyberspace. I remember it from when we were roommates. I was like, Larry Garfield?? I know him! Anyway, I think ur blog is very interesting and insightful. I'll definitely keep reading/commenting. We gotta get gianfranco up on here too and have a little roomie reunion, haha.

Thanks

Thanks for sharing your knowledge. It is appreciated.

Thanks a lot for this very

Thanks a lot for this very useful code !

Great code tutorial. At first

Great code tutorial. At first the render() function made no sense. Ijust plugged in my own code there. The array grouping works well.

Help with you tutorial

Hi I like your tutorial and I am trying to apply it but got into some issues that I hope you could help me solve.
Here is my PHP syntax below which list all brands/manufacturers from a SQL table alphabetically.
What I'd like to do is group the records with a heading A for brand name starting with A, B for brand names starting with B....

// Get manufacturers info
global $db;

if (!MANUFACTURERS_ALL_EMPTY_SHOW) {
$manufacturers_query = "select distinct m.manufacturers_id, m.manufacturers_name, m.manufacturers_image, mi.manufacturers_url
from (" . TABLE_MANUFACTURERS . " m
left join " . TABLE_MANUFACTURERS_INFO . " mi on m.manufacturers_id = mi.manufacturers_id)
left join " . TABLE_PRODUCTS . " p on m.manufacturers_id = p.manufacturers_id
where p.manufacturers_id = m.manufacturers_id
and (p.products_status = 1
and p.products_quantity > 0)
order by m.manufacturers_name";
} else {
$manufacturers_query = "select m.manufacturers_id, m.manufacturers_name, m.manufacturers_image, mi.manufacturers_url
from " . TABLE_MANUFACTURERS . " m
left join " . TABLE_MANUFACTURERS_INFO . " mi on m.manufacturers_id = mi.manufacturers_id
order by manufacturers_name";
}

$manufacturers = $db->Execute($manufacturers_query);

$manu_content = '';
$manu_row=0;

while (!$manufacturers->EOF) {
if ($manu_row==MANUFACTURERS_ALL_COLUMNS) {
$manu_content .= '
';
$manu_row=0;
}
$manu_content .= '

';
$manu_row++;
$manufacturers->MoveNext();
}

I'd appreciate any help.

Thanks,

Off topic

Um, really, my blog is not the place to get code help. That's what forums and IRC are for. I am around those often. I don't do tech support from my blog.

THANKS + 1 question

Thank you so much for the very useful explanation. I was in quite a quandary for a few days.

I do have one question. What if I need to strip slashes out of the output. Nothing seems to work for me. Normally, I just do:

  $description    = stripslashes($list["description"]);

But of course, that isn't working... neither is
  $description     = stripslashes({$record->description});

So what to do? I need to get the slashys out of my output.

Any help will be so appreciated!

Lori

Don't

To be honest, I can't recall the last time I had to deal with stripslashes(). If you're storing your data in the database properly, it should be a non-issue. stripslashes() and addslashes() is a sledgehammer approach to SQL injection prevention. It's far better to use PDO and proper prepared statements, and then do more context-appropriate filtering on your content than tools designed to work around the abomination that is PHP magic quotes. :-)

thanks for tutorial

Thanks for the tutorial. I found just what I needed!

All the best.

I spent 3 hours tonight

I spent 3 hours tonight trying to figure out how to do this! Finally I stumbled across your post...thanks :)

Thanks

Thanks, that helped a lot re-organizing my array.
One loop & five lines of code instead of two nested loops and several conditions.