Just how insular is the PHP community?

Submitted by Larry on 24 August 2015 - 3:21pm

Periodically, there is a complaint that PHP conferences are just "the same old faces". That the PHP community is insular and is just a good ol' boys club, elitist, and so forth.

It's not the first community I've been part of that has had such accusations made against it, so rather than engage in such debates I figured, let's do what any good scientist would do: Look at the data!

Update 2015-08-25: The Joind.in folks have given me permission to release the source code. See link inline. I also updated the report to include a breakdown by continent.

Joind.in me and rule the Internet!

The first step, of course, is getting data to analyze. While we don't have perfect data for all PHP events around the world, we have a reasonably good proxy. Joind.in has become the de facto session review site for most of the PHP community over the past few years, and has even branched out beyond the PHP world more recently. That allows us to get a very detailed picture of the PHP event ecosystem. We will use that as our data source.

There are still numerous caveats to that data.

* Joind.in's data only goes back to late 2008, and there have been conferences since long before that. For the purpose of this analysis, then, I am looking only at events in Joind.in from 1 January 2010 onward; that gives about a year's worth of data to "prime" the list of existing speakers.
* Not all events are listed with Joind.in. While most of the general-PHP world uses it, not all PHP communities use it. Nearly all Drupal events are missing, for instance. That means that some established speakers look like newbies to Joind.in, and speakers who frequently attend events not listed here are under-reported. For instance, Joind.in thinks my first presentation was at Symfony Live Paris in 2012, and knows nothing about the dozen or more presentations I've given at DrupalCons since then. (Seriously, I've given way more than 23 sessions in the last 5 years!)
* Joind.in doesn't differentiate between different sizes or types of event. It includes 3000-person conferences and 20-person user group meetups. For the purpose of this analysis I defined a "conference" as any event Joind.in knew about that had 5 or more presentations listed. A different cutoff may produce slightly different data.
* Joind.in allows a session to have more than one speaker. However, the overwhelming majority of sessions have only a single speaker, and trying to account for multiple speakers would have made my job a lot harder. Therefore, I am only tracking a single speaker per session. If Joind.in lists multiple speakers I am only counting the first one listed. I don't believe that greatly impacts the overall conclusions, but it may impact certain speakers' rankings.
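To make those rules concrete, here is a minimal Python sketch of how a first-time-speaker count could be computed under them. It is not the code used for the report: the function name, the tuple layout, and the choice to let speakers from every event (including small ones and pre-2010 ones) count as "already seen" are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

MIN_TALKS = 5                   # "conference" cutoff described above
REPORT_FROM = date(2010, 1, 1)  # earlier events only prime the list of known speakers


def first_timer_report(talks):
    """talks: iterable of (event_name, event_date, [speaker, ...]), one entry per session."""
    events = defaultdict(list)
    for event_name, event_date, speakers in talks:
        if speakers:
            # Only the first listed speaker of a session is counted.
            events[(event_date, event_name)].append(speakers[0])

    seen = set()   # everyone who has spoken at an earlier event
    report = []
    for (event_date, event_name), speaker_list in sorted(events.items()):
        unique = set(speaker_list)
        if len(speaker_list) >= MIN_TALKS and event_date >= REPORT_FROM:
            new = unique - seen
            report.append({
                "event": event_name,
                "sessions": len(speaker_list),
                "speakers": len(unique),
                "first_timers": len(new),
                "percent_new": round(100 * len(new) / len(unique), 1),
            })
        # Assumption: every event, large or small, adds to the known-speaker set.
        seen |= unique
    return report
```

Events are processed in date order, so a speaker is "new" only at the first event where they appear. Whether small meetups should prime the known-speaker set is a judgment call; dropping that last line's effect for small events would raise the first-timer numbers somewhat.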

Also, while downloading the data and loading it into MySQL there was some character set corruption (despite everything being UTF-8 as far as I know). That means a few speaker and event names have been garbled, particularly those in Asian scripts. According to the Joind.in folks, that's probably bad data in their database to begin with, so there's not much I can do about it. My apologies to those affected.

The toolchain

Joind.in offers a very nice JSON API that allowed me to download essentially their entire dataset for local analysis. To download the data I used Guzzle, PHP's leading HTTP client, and Doctrine DBAL, the de facto standard, for writing to the database. (Raw PDO would likely have worked too, but Doctrine has some nice schema management tools and I wanted an excuse to play with Doctrine more.)
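For readers who want the general shape of such an import without reading the PHP source, here is a rough Python sketch of the same idea: walk a paginated JSON listing and write the rows to a local database. It swaps in the standard library and SQLite for Guzzle, Doctrine, and MySQL. The response layout (an "events" list plus a "meta" object with a "next_page" URL) and the field names are assumptions about the API, not something confirmed here.

```python
import json
import sqlite3
from urllib.request import urlopen


def fetch_json(url):
    """Download one page of the API and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_events(start_url, fetch=fetch_json):
    """Yield every event, following the (assumed) meta.next_page links."""
    url = start_url
    while url:
        page = fetch(url)
        events = page.get("events", [])
        if not events:
            break  # an empty page ends the walk even if a next link is present
        yield from events
        url = page.get("meta", {}).get("next_page")


def store_events(conn, events):
    """Write a few (assumed) event fields into a local table; returns the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(name TEXT, start_date TEXT, tz_continent TEXT, tz_place TEXT)"
    )
    rows = [
        (e.get("name"), e.get("start_date"), e.get("tz_continent"), e.get("tz_place"))
        for e in events
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Passing the fetch function in as a parameter keeps the paging logic testable without touching the network, which also makes it easier to be polite to the API while experimenting.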

The import code itself follows a mostly-functional style, with a little procedural thrown in for lack of interest in refining it any further. :-)

The full source is available on GitHub. Pull requests for more reports welcome, but please don't abuse the API. :-)

Enough talking, where's the data!

I've collected the reported data on a separate page to make it easier to read, which is attached to this post. Go ahead and have a look. I'll be here when you get back.

Conclusion

I want to call out especially the summary line at the bottom of the first table:

Average total sessions: 31.6
Average number of speakers: 24.8
Average number of first-time speakers: 13.1
Average percent first-time speakers: 50.6

That is, across the entire spectrum of Joind.in's available data, events average half of their speakers as first-timers. Half.
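A note on the arithmetic: dividing the two averages gives 13.1 / 24.8, or about 52.8%, not 50.6%. That is what you would expect if the last figure is the average of each event's own percentage rather than a ratio of the two averages, which is my reading of the table and an assumption here. The Python sketch below uses made-up numbers to show how the two calculations can differ.

```python
def summarize(events):
    """events: list of (speakers, first_timers) pairs, one per event."""
    n = len(events)
    avg_speakers = sum(s for s, _ in events) / n
    avg_new = sum(f for _, f in events) / n
    # Average of each event's own percentage: every event weighs the same.
    mean_of_percents = sum(100 * f / s for s, f in events) / n
    # Ratio of the two averages: large events weigh more.
    ratio_of_averages = 100 * avg_new / avg_speakers
    return mean_of_percents, ratio_of_averages


# Made-up data: one large event with few newcomers, one small event with many.
made_up = [(40, 10), (10, 8)]
mean_of_percents, ratio_of_averages = summarize(made_up)

# The same ratio-of-averages calculation applied to the published summary figures.
published_ratio = round(100 * 13.1 / 24.8, 1)
```

With the made-up data the per-event average is 52.5% while the ratio of averages is 36%. The gap in the real data is much smaller, so either reading supports the "about half" conclusion.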

Of course, there's plenty of variability in that, with events ranging anywhere from 0% to 100%. However, most are at least in double digits, even up through and including 2015.

That also appears to be about the same across regions (as defined by the timezone code the event used). The one exception is Asia, which I suspect is due to being newer to Joind.in so its data is skewed.
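As an illustration of grouping by timezone, the Python sketch below assumes each event carries a tz-database style name such as "Europe/Amsterdam" and takes the part before the slash as the region. The function names and the sample numbers are made up for the example.

```python
from collections import defaultdict


def continent_of(timezone):
    """'Europe/Amsterdam' -> 'Europe'; names without a slash become 'Unknown'."""
    return timezone.split("/", 1)[0] if "/" in timezone else "Unknown"


def average_by_continent(events):
    """events: iterable of (timezone, percent_new) pairs, one per event."""
    buckets = defaultdict(list)
    for timezone, percent_new in events:
        buckets[continent_of(timezone)].append(percent_new)
    return {region: round(sum(v) / len(v), 1) for region, v in buckets.items()}


sample = [
    ("Europe/Amsterdam", 50.0),
    ("Europe/London", 40.0),
    ("America/Chicago", 30.0),
    ("UTC", 10.0),
]
by_region = average_by_continent(sample)
```

One limitation of this approach is that the "America" prefix covers both North and South America, so it cannot separate the US from the rest of the hemisphere without looking at the city part as well.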

So to the claims that PHP conferences just select the "same tired old speakers, year after year", I would say that is patently false and we have the data to prove it. Are there other issues with session/speaker selection, diversity, and so forth? Quite possibly. But claims that there's not a diversity of names, period, are provably untrue. Even if the data is skewed a bit because of the sampling process or recent non-PHP additions, there's still a huge churn.

Also, everyone loves Derick. :-) In fact, Europeans seem to dominate the top-speakers list. There is only one American in the top 10, Matthew Weier O'Phinney. So much for the loudmouth American stereotype.

More analysis

Any other data I should crunch? See a bug in my analysis? Let me know! I'm happy to crunch additional reports out of the data as long as it's not too difficult to add. I will also add Errata to this post if anything turns out to be totally wrong. :-)

Great Analysis

As a former owner & organizer of tek, one of our internal goals was ~25% first-time conference speakers, so I'm glad to see the analysis shows that goal held (while I was involved). :)

Some other questions/comments:
* It appears that many of the events with a high % of new speakers were in Europe. What do the %'s look like when you consider US and Europe separately?
* How many unique presentations are there? How many times does a regular give the same sessions?
* Have you considered doing similar analysis with the Lanyrd PHP data - http://lanyrd.com/topics/php/ ? It has a wider base outside the pure-PHP world.
* Do the regulars have the high reviews to merit being a regular?

Answers

1) I revised the reports to show separate totals for different regions. Check the attached file again. It looks like it's actually pretty consistent by region. (Asia is high, but I suspect that's because the conferences themselves are new to Joind.in.)

2) Hard to say how many unique presentations there are. A reasonable approximation might be to check uniqueness on session titles, but some sessions change titles over time. The code is now available so feel free to send in PRs with that data if you think you can get good info out of it. I'd be curious to see it myself. :-)
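For anyone tempted to try, a title-based approximation could look something like the Python sketch below. The normalization rules (ignore case, punctuation, and extra whitespace) are my own guess at a reasonable starting point, and as noted, renamed sessions will still be counted as different talks.

```python
import re
from collections import Counter


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    lowered = title.casefold()
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(cleaned.split())


def repeat_counts(sessions):
    """sessions: iterable of (speaker, title); counts how often each talk was given."""
    return Counter((speaker, normalize_title(title)) for speaker, title in sessions)


# Made-up sample data.
sample = [
    ("ann", "Profiling PHP Applications!"),
    ("ann", "  profiling PHP applications "),
    ("ann", "Queues 101"),
    ("bob", "Queues 101"),
]
counts = repeat_counts(sample)
unique_presentations = len(counts)
```

Counting per speaker keeps two different people's talks with the same generic title apart, at the cost of treating a co-presented talk as belonging only to whichever speaker is listed.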

3) I've not looked at Lanyrd, as I don't think I've ever been to an event that used it. It would be interesting to pull in data from both services and combine it somehow, but I will leave that for others.

4) I have not computed that data. Also, that's rather subjective. :-) (Lorna also indicated it might be rather dangerous to try and compute for risk of getting into ad hominem discussions, so I'm reluctant to do so.)

EU vs US

One thing that may need to be taken into account is that in the EU they LOVE joind.in. Every event uses it; in fact, many user groups use it.

However, in the US, probably 1/3rd of the conferences I speak at (maybe more in the end) don't use it, for various reasons. (Sadly)

Anyway, just a point that may help to explain some of the stat bias there.

UG meets included?

Including user group meetups will skew the results beyond use imho...

What's the result for the top 10 biggest conferences like?

Filtered

As noted, I only included events with 5 or more talks. That should exclude most user group meetups. If a user group has 5 or more talks at the same meeting, I'd say it's a very small conference rather than a very large user group. :-)

Not quite

If you look at your results, you'll find there are lots of meetups in there that create one 'event' per year, for example "2015 Madison PHP Meetings".

Anyway, it's a small enough number to not 'really' matter, but, just thought you should know.

What if

What happens if we pare the data down to the handful of conferences people would recognize? Those are the ones your employer is most likely to send you to. How does that skew it?

Define?

There's no automated way I know of to define "major" or "well-known" conferences. Possibly we could filter by "large" conferences, based on the number of sessions. (Eg, make the cutoff 40 instead of 5, or something.) I'm trying to keep subjective judgement out of the analysis if at all possible.
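Trying a different cutoff is mechanically simple once the per-event numbers exist. The Python sketch below filters per-event rows by session count before averaging; the rows are made up for illustration, and a faithful rerun would need the real table.

```python
def average_percent_new(rows, min_sessions=5):
    """rows: iterable of (sessions, percent_new); averages events at or above the cutoff."""
    kept = [percent for sessions, percent in rows if sessions >= min_sessions]
    return round(sum(kept) / len(kept), 1) if kept else None


# Made-up per-event rows: (number of sessions, percent first-time speakers).
rows = [(6, 80.0), (12, 60.0), (45, 30.0), (60, 20.0)]

small_cutoff = average_percent_new(rows, min_sessions=5)
large_cutoff = average_percent_new(rows, min_sessions=40)
```

Note that this only changes which events are reported. Whether a speaker counts as "new" would still be decided against the full event history, which seems like the fairer comparison.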

Catch-22

First of all, I like the idea of filtering to some minimum ... but I also think the catch-22 you may be running into is that people's 'gut feeling' about "same speakers everywhere" ... isn't a non-subjective judgement. It's based upon (IMO) a very specific set of conferences. Ask enough of us and we'd probably be able to give almost identical responses. Yeah, it would be subjective. But it better matches what people are saying, and would be interesting at least.

But *shrug* Thanks so much for your work.

Look at the big table

That's why I've included the full table, not just the aggregates. :-) So for example, phpTek (in its various punctuations) has:

2010: 25%
2011: 15%
2012: 27%
2013: 24%
2014: 18%
2015: 14%

So, it varies a bit but there's at least about 15% of speakers at Tek who are new to the stage (or at least stages that use Joind.in). That's below the overall average, but still means that more than 1 in 10 speakers, if not more than 1 in 5, are "new" each year. I'd say that's pretty good, although of course others may disagree.
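For the curious, the quick arithmetic behind that summary, as a few lines of Python over the six figures listed above:

```python
tek = {2010: 25, 2011: 15, 2012: 27, 2013: 24, 2014: 18, 2015: 14}

mean_percent = sum(tek.values()) / len(tek)                    # average across the six years
lowest, highest = min(tek.values()), max(tek.values())        # range of the yearly figures
years_above_1_in_10 = sum(1 for p in tek.values() if p > 10)  # years with more than 10% new
years_above_1_in_5 = sum(1 for p in tek.values() if p > 20)   # years with more than 20% new
```

Every one of the six years clears the 1-in-10 mark and half of them clear 1-in-5, with an average of 20.5% and a low of 14% in 2015.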

That other 75%? Yeah, that's all people who have spoken before. Is 75% established speakers and 25% new people a good ratio?

I'd say yes, but again, others may disagree.

Ok, Some Filters

Ok. So I had to pull over and boot up my tablet. So I type this from the side of the highway in the middle of Dallas. If I hadn't, I would have forgotten or dwelled on it way too long. I feel like you have been around enough to be able to define what I originally meant with my data filters, but in the interest of collaboration, here is what I was thinking. Don't take that as me being angsty, I have sincere interest in the value of this data and its interpretation.

1) Conferences with more than 200 attendees. Reason: to show that we can in fact define which conferences out of that data people have most likely heard of. A high new speaker ratio would show that overall, the community speaker circuit is in good health.

2) Conferences that have ticket prices above 300 USD, to show we can in fact define which conferences are most likely to require an attendee to get help from their job to attend. $300 may not be a lot of money to some people, but it is to the hobby or startup developer. A high new speaker ratio would show that conferences where the dollars are coming from corporations that are not sponsors also have a healthy speaker circuit.

3) Conferences that pay for their speakers' travel and lodging. A high new speaker ratio in this category would demonstrate that conference organizers are not just importing their good old boys club for their yearly drankoff lulz fest.

Good Ideas

You have some good ideas Bob ... though let me provide some counterpoints:

1) Attendee count is a good 'big conference' idea. However, it doesn't necessarily match 'awareness'. Some of the big, well-known conferences can be smaller in attendance than people feel they are. Also, I think the 'circuit' idea often counts the various smaller community conferences as part of it.

2) Ticket price is also an awkward thing. While I like the idea as you've proposed it, I don't think it matches the feeling people are talking about. Since also ...

3) I'd argue that it's not a case (at times) of an 'old boys club'. The conferences that pay for speakers are often conferences that are run for profit, and those need to bring in "names that sell tickets". In those cases, you may find more 'repeat' speakers, but not because they are doing a 'drank lull fest'. It's just because they are making sure that their conference has the name-draw to pull in enterprise attendees who need to convince their boss. Compare that with the smaller community conferences that are being run on a lower budget and trying to break even, where people will attend regardless.

2 cents.

Comment Titles Are Awkward

I think the ticket price and speaker packages are useful metrics nevertheless. Low new speaker ratios in filters 2 and 3 would/could mean that conferences are erring on the side of caution to meet sales quotas. Innovation requires new blood with new points of view. If the new speaker ratio is low in these categories, then that may explain why "Enterprise" code often means "big and terrible", as these big ticket players aren't getting injected with new things.

No data

It would be interesting to look at other ways to define "conference", or filter by whether they pay for speaker travel, etc. Unfortunately I don't have that data available via Joind.in, so that would require a fair bit of additional manual research. If you want to look through the table and pick out those you consider "major" conferences by name and check their numbers, that could be interesting to see. There was no automated way for me to do that other than session count, so that's all I did.

That said, I don't think "not newbies" is what makes "Enterprise" code "big and terrible". Rather, "Enterprise" usually means "has 500,000 configuration options so that I don't need to hire a developer to customize it, because I don't want to pay developers", which inevitably results in "you need a platform developer to understand those 500,000 options". That has nothing to do with new blood at conferences; it's just large businesses being penny-wise and pound-foolish. That's an entirely separate debate for another time. :-)

More Title

If the numbers support that these corporate level dev conferences are getting fresh blood, then yes Enterprise is bad purely because it is bad.

But if the numbers support that these conferences are not getting new blood, I think it is naive to discount that as a factor. If the same dogma is being preached every year, that justifies big business not having to adjust their strategy to keep up with the actual market, that justifies no need for agility. Basically, they could be paying for what they *want* to hear instead of what they *need* to hear.

Re myself on Need and Want

for what they *want* to hear and not what they *need* to hear...

... in turn forcing everyone else to hear what they want to hear, too.