Big Social Data and Invasive Species

Spot the invasive species

I just read Emma Uprichard’s excellent piece on big data in social sciences, and I’d recommend doing the same. She argues persuasively that Big Data is not the panacea that will solve social ills, and drills into some specific concerns that social scientists might want to think about as the hype machine grinds into gear. There are a few points in there that I want to address and reinforce.

I’ll start by saying that I don’t think there are consistent definitions of Big Data, and I’m ok with that. Big Data is something I’ve always seen defined functionally (“there sure is a lot of data”) and not structurally; for example, data with large dimension (information about lots of characteristics of some population), large scale (lots of members of a population) or high rate (“live” or frequently sampled data) could all generate massive datasets. I don’t think any one of these is a necessary or sufficient condition, and I’ve seen arguments in the past which elide some of these features, which can be problematic because each poses different problems. I thought the comparison with qualitative data was especially illuminating. Here, you may have a small number of subjects but a very rich “data” set around them (questionnaires, recorded interviews, and so on). This represents a very high-dimensional dataset around a smallish population. You don’t have to be a reductionist, of course – it may not be the best analysis method to try to convert interview data to purely quantitative data and do a regression. But if you want to, you can find big data all over. The UK census is pretty big.

This leads onto the question of expertise. Why are physical scientists/computer scientists/engineers doing this work? Because they have the technical skills. I have no doubt that in a generation, social scientists will graduate with the technical chops to do the machine learning, databasing, visualisation and so on themselves (in fact, we train some of those people, at least at Masters level). In the meantime, transplanted physicists become naturalised in their social soil. I’m not terribly keen to be identified as a positivist invasive species – surely this route into social sciences is as valid as any other? Couldn’t we instead communicate to undergrad physical/computer scientists the value of their skills in social sciences, and encourage them to take on some of the ideas of these disciplines? So many physicists and engineers end up in the most dismal of the social sciences when they go and get finance jobs following graduation – wouldn’t it be good to snag some of those? The fact that so many people with this high-consensus training choose to cross over suggests that there is an appetite amongst hard scientists to work in these areas (I mean in academia – apparently the financial sector offers remuneration to entice physical scientists so they may not be as tempted). I’m not sure this needs to be viewed with such suspicion, even if you disagree on approach and methodology.

I don’t see the “methodological genocide” occurring that Dr Uprichard fears. Big Data self-evidently doesn’t have All The Answers. No one method can. Big Data’s not even a method, really. And there are plenty of important questions big data doesn’t ask, or effect change in response to. The article seems to be suggesting that sociologists need to be ready to argue back. Is that something sociologists are good at? I hadn’t noticed.*

There are some other bits of the article that I think are as true of little data as of big data. I wasn’t sure whether this was the point, but it’s certainly one worth making. Big data opens new questions and fills in detail for some older ones, but (like all data) it doesn’t predict; models and theories do that, and this is hard in social sciences, even with whizzy agent-based models and suchlike. Compressing and reducing the data does tend to regress towards the mean – but as hinted, that also allows those who aren’t in the “mainstream” to be spotted. Often it’s these behaviours which are more interesting. The ethics of how and why this is done absolutely do need to be explored – but potentially, identifying the majority allows you to chuck out that data and see interesting outliers. There are a lot of interesting quantitative techniques out there.
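To make the “chuck out the majority” idea concrete, here’s a minimal sketch of one such technique, not something taken from the article. It scores each value by its distance from the median, scaled by the median absolute deviation (a “modified z-score”), and sets aside anything unusually far out. The function name, the made-up numbers, the 0.6745 scaling constant and the 3.5 cut-off are all my assumptions; the last two are a common convention rather than anything principled for a given dataset.

```python
from statistics import median


def split_majority_and_outliers(values, threshold=3.5):
    """Split values into (majority, outliers) using a modified z-score.

    The threshold of 3.5 and the 0.6745 constant are conventional
    defaults, assumed here for illustration only.
    """
    centre = median(values)
    # Median absolute deviation: a spread measure that the outliers
    # themselves barely influence, unlike the standard deviation.
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        # At least half the values are identical, so the score is
        # undefined; treat everything as "majority" in this sketch.
        return list(values), []

    majority, outliers = [], []
    for v in values:
        score = 0.6745 * abs(v - centre) / mad
        if score > threshold:
            outliers.append(v)
        else:
            majority.append(v)
    return majority, outliers


# Made-up numbers, purely for illustration (say, hours online per day).
hours_online = [2, 3, 3, 4, 2, 3, 5, 4, 3, 19]
majority, outliers = split_majority_and_outliers(hours_online)
print("majority:", majority)
print("outliers:", outliers)
```

The code only tells you who sits outside the “mainstream”; whether they should be singled out at all, and what that means, is exactly the kind of question the code can’t answer.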

Data, and models, are an imperfect representation of the world. To Tukey’s “no data set is large enough to provide complete information about how it should be analysed”, we might add “no data set is large enough to describe the world we’re examining”. Data is filtered by experimental design, theoretical question, and increasingly, by the data that’s available. Data, models and analysis always need context and interpretation, to identify patterns, results and meaningless anomalies. Ironically, this is something good (natural) scientists and engineers do a lot of, too. But as the article pointed out, physical scientists aren’t used to the atoms changing their behaviour in response to their experiment**, or needing to persuade government that the results of their experiments require a change in policy**, or thinking about whether a study is ethical in the first place**. Big data won’t obliterate people interpreting things, but it might mean some of those people have (gasp) an engineering degree, or a social sciences degree that has a lot of things that make it look like a turn-of-the-century stats or compsci degree. I’m actually rather hopeful about areas like big data, because they will allow people like me to learn a lot from sociologists. I think in asking “How and why is big data useful, and for whom?”, we will need all the expertise we can get.

*In the sweep of low-consensus subjects, I would have thought of sociology as an example par excellence.

**Well, they are, sort of.
