Your weapon of choice part 2: aggregation

I’ve previously talked about looking at Boris Bike data in a very first-hand way, with few conceptual filters, and from there went on to some basic spatial processing, in the form of aggregating route numbers.

Let’s now go to the opposite extreme and take (explicit) space out of it altogether. Using the start and end times, we can create a histogram of journey times:

This is aggregated over all days, hours and minutes, so in some sense it represents a rather crude average, but let’s go with it. The points are real data; the coloured lines are attempts to fit the data with mathematical functions. Let’s not dwell on that too much for the time being.
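For anyone who wants to try this at home, here is a minimal sketch of how a journey-time histogram could be built from start and end times. The timestamp format, the sample records and the function names are all my own assumptions for illustration; the real data files may be laid out differently.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records: (start, end) timestamps. Not real data.
journeys = [
    ("2012-01-09 08:00:00", "2012-01-09 08:07:30"),
    ("2012-01-09 08:10:00", "2012-01-09 08:22:00"),
    ("2012-01-09 17:30:00", "2012-01-09 17:37:10"),
    ("2012-01-14 12:00:00", "2012-01-14 12:31:00"),
]

def duration_minutes(start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Journey time in minutes between two timestamp strings (assumed format)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60.0

def histogram(durations, bin_width=1.0):
    """Count journeys per bin; each key is the lower edge of a bin, in minutes."""
    counts = Counter(int(d // bin_width) * bin_width for d in durations)
    return dict(sorted(counts.items()))

durations = [duration_minutes(start, end) for start, end in journeys]
for lower_edge, count in histogram(durations).items():
    print(f"{lower_edge:5.1f} min: {count}")
```

Note that this throws away everything except the duration: which stations, which day and which hour all disappear, which is exactly the kind of aggregation described above.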

What do we learn from this? Well, we learn that the most popular journey time is about 9 minutes – more precisely, the modal journey time is 7 1/2 minutes and the median time is just under 12 minutes (half of the journeys took more than 12 minutes, and half less than 12 minutes). Journeys shorter than the mode are not popular and drop off sharply. Longer journeys are also less popular, but the graph is asymmetric: journeys of 15 minutes are more likely than journeys of 5 minutes. In fact, there are small numbers of journeys even beyond half an hour, the point at which the scheme starts charging (additional) money; surprisingly, there isn’t a sharp cut-off at half an hour. We might expect users to be very mindful of that 30 minute mark – perhaps some are, but the majority are not. Nevertheless, over 50% of journeys take between 7 and 18 minutes, and 90% are between 3 1/2 and 29 minutes. You can tell this by constructing a cumulative frequency curve and simply reading off the time values at 25% and 75%, and at 5% and 95%, respectively.
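The reading-off procedure is simple enough to sketch in code. This is only an illustration: the durations below are made up, the function names are mine, and the "first time at which the curve reaches the fraction" rule is one of several reasonable conventions for reading a value off a cumulative curve.

```python
def cumulative_frequency(durations):
    """Return (time, fraction of journeys taking <= time) pairs, sorted by time."""
    ordered = sorted(durations)
    n = len(ordered)
    return [(t, (i + 1) / n) for i, t in enumerate(ordered)]

def read_off(curve, fraction):
    """Smallest journey time at which the cumulative fraction reaches `fraction`."""
    for t, cumulative in curve:
        if cumulative >= fraction:
            return t
    return curve[-1][0]

# Hypothetical journey times in minutes. Not real data.
sample = [2, 4, 5, 7, 8, 9, 11, 12, 14, 16, 18, 20, 25, 29, 33, 40, 6, 10, 13, 22]
curve = cumulative_frequency(sample)

middle_half = (read_off(curve, 0.25), read_off(curve, 0.75))
median = read_off(curve, 0.50)
print("middle 50% of journeys:", middle_half, "median:", median)
```

The same two function calls with 0.05 and 0.95 give the range covering the middle 90% of journeys.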

This data is also very useful if we want to build a model for these journeys. Most spatial models tend to assume that people travel to whatever is closest to them. In this case that’s not quite true: if people are right next to their destination, they’re less likely to Boris Bike it (presumably they would just walk). Technically, you can encode this as a cost function: how much (effort, money or time) it costs the user to travel a certain distance from their origin. The fits to the first graph are an attempt to retrospectively choose a cost function which reasonably mimics this sort of behaviour.
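To make the idea concrete, here is a sketch of one candidate shape. I am assuming a log-normal curve purely for illustration (it is not necessarily one of the functions fitted in the graph), and pinning it to the modal and median journey times quoted above. For a log-normal, the median is exp(mu) and the mode is exp(mu - sigma^2), so the two numbers fix both parameters. The function names are hypothetical.

```python
import math

def lognormal_params(mode, median):
    """Solve for (mu, sigma) from median = exp(mu) and mode = exp(mu - sigma**2)."""
    if not 0 < mode < median:
        raise ValueError("a log-normal needs 0 < mode < median")
    mu = math.log(median)
    sigma = math.sqrt(math.log(median / mode))
    return mu, sigma

def journey_weight(t, mu, sigma):
    """Log-normal density at journey time t (minutes); zero for t <= 0."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2 * math.pi))

# Mode of 7 1/2 minutes and a median of roughly 12 minutes, as quoted in the text.
mu, sigma = lognormal_params(mode=7.5, median=12.0)
for minutes in (1, 3, 7.5, 12, 30):
    print(f"{minutes:5.1f} min: {journey_weight(minutes, mu, sigma):.4f}")
```

The useful property is the one described above: the weight is small for very short journeys, peaks at the mode, and then tails off slowly rather than symmetrically. Whether this particular curve fits the real data well is a separate question that only a proper fit can answer.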

We’ve confined our analysis so far to time rather than space (distance). The problem is, to model space we’d have to start making assumptions: do we use straight line distance, or “distance along the road”? If the latter, how do we make sure it’s accurate, or at least reasonable?
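The straight line option, at least, is easy to compute. A minimal sketch using the standard haversine formula follows, assuming we have latitude and longitude for each docking station (the function name and the use of a mean Earth radius are my choices here). "Distance along the road" would need a routing engine and a road network, which is where the harder assumptions come in.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def straight_line_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Two hypothetical docking station positions in central London. Not real stations.
print(round(straight_line_km(51.5100, -0.1300, 51.5200, -0.1000), 2), "km")
```

A straight line will always underestimate the distance actually cycled, so any speeds derived from it are lower bounds.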

The final bit of analysis we could do on this data is to slice it by time (time of day, weekday vs weekend, time of year, etc.) to see whether people cycle for longer in the mornings or evenings, for example. Of course the issue again is “are people cycling faster, or for shorter distances?”. So we will need to bring space back in to understand these habits.
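Slicing is straightforward once each journey carries its start time. The sketch below groups journeys into weekday or weekend and morning or afternoon/evening, then takes the median duration per group. The choice of slices, the noon boundary and the sample journeys are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def slice_label(start):
    """Label a journey by weekday/weekend and by half of the day (split at noon)."""
    day = "weekend" if start.weekday() >= 5 else "weekday"
    part = "morning" if start.hour < 12 else "afternoon/evening"
    return day, part

def median_duration_by_slice(journeys):
    """Median journey time in minutes for each (day type, part of day) slice."""
    groups = defaultdict(list)
    for start, end in journeys:
        groups[slice_label(start)].append((end - start).total_seconds() / 60.0)
    return {label: median(values) for label, values in groups.items()}

# Hypothetical journeys as (start, end) datetimes. Not real data.
journeys = [
    (datetime(2012, 1, 9, 8, 0), datetime(2012, 1, 9, 8, 10)),     # Monday morning
    (datetime(2012, 1, 9, 8, 30), datetime(2012, 1, 9, 8, 50)),    # Monday morning
    (datetime(2012, 1, 14, 15, 0), datetime(2012, 1, 14, 15, 25)), # Saturday afternoon
]
for label, minutes in median_duration_by_slice(journeys).items():
    print(label, minutes)
```

Each slice holds only a fraction of the journeys, so it is worth checking how many journeys sit behind each median before reading too much into the differences.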

Taking space (and, to some degree, time) out of the analysis has allowed us to use a large amount of data and observe trends across the whole dataset. Slicing this data by time or place will give a more situational analysis, but will make each subset smaller, which tends to weaken the robustness of the results. There is a series of intermediate approaches between the spatial visualizations of my previous blog post and this aggregated analysis; I will talk about other approaches in forthcoming blog posts.
