Steaming Crossness

After my wander up the Greenway last year, it was exciting to finally see Crossness pumping station in action this weekend. Bazalgette’s sewers were/are gravity-fed, so by the time south London’s sewage reaches the Thames at distant Crossness, it’s some thirty feet underground, and needed to be pumped up to surface level before it could be discharged into the Thames. Fear not, those plucky Victorians waited until the tide was going out; in the meantime, it was stored in a giant sewage reservoir onsite.

While the lake of sewage has been replaced by a field of solar panels*, much of the original building and mechanisms remain, restored by volunteers over a number of years. And yesterday was steaming day, so we got to see one of four giant and colourful beam engines in action.

The building itself is home to some beautiful Victorian ironwork. From April next year, they will be open much more regularly, but until then, there is a list of open days on their website.


*I’ll leave you to insert your preferred glib comparison, or more nuanced insight about bountiful resources and centralised infrastructure, here

Data Visualisation for Public Engagement at #scicomm14

UCL sustainability research around energy (credit: Martin Zaltz Austwick and Charlotte Johnson 2014)

I’m excited to be chairing a session on Data Visualisation for Public Engagement at the British Science Association’s annual Science Communication conference, which is in sunny Guildford this year. It’s not until May, but when you keen scicommers, academics, science journalists, students, museums people and scicurious freelancers sign up, you’ll need to tell the nice people that you want to come to our session and not one of the equally awesome other ones, so I thought I’d get in ahead of time.

Data visualisation (aka “datavis”) is in the news constantly. The British Library are currently running an exhibition of scientific visualisation, books about visualisation and infographics sell by the truckload, and broadsheets and tabloids alike are running data journalism and visualisation blogs. What does this mean for public engagement with research, and science in particular? I’ve put together this session because I want to understand these issues. I’m a lecturer in spatial analysis and visualisation – which means I teach students (mainly from an architecture or geography background) techniques for visualising “human” data (like demographics, transport, twitter data, research funding data) and models (networks, agent-based, cellular automata, neural nets). I think datavis is already having a massive impact in social sciences, but I’m a physicist at heart, and I am really curious about how this all works in the natural sciences.

To this end, I’ve put together what I think is a really exciting panel. Damien George is the most focussed on communicating research outputs from natural science – not only that, but his efforts to map the research landscape in physics articulate what research is, for both publics and practitioners. Andrew Steele has done great work visualising government science spending with his Scienceogram, and continues to find ways to communicate and challenge science policy via datavis. Artemis Skarlartidou has worked with communities in mapping potential sites for nuclear waste disposal, and has particular expertise around building trust through visualisation. Together, I want to explore what I think are key questions about datavis – what can it articulate that other ways of communicating cannot? How can it be used for meaningful engagement? Who can use these tools? What opportunities are we missing? And what are the limits of these techniques?

But of course, it won’t just be the panel doing all the talking. Each panellist will discuss datavis in general, and visualisations they’ve worked on, for about ten minutes each, leaving a generous 45 minutes for a decent discussion – technical, ethical, practical, or otherwise. Because datavis is fairly current, I’m expecting a lot of interesting views in the room – but we don’t require attendees to be experts, so even if you don’t know the right end of a visualisation from the wrong one, come along to question, debate and see what the fuss is about.

The session runs from 3.30 on Thursday May 1st – I hope to see you there.

If you’re a newcomer, I recently wrote a post recommending some introductory books, as well as one which has my thoughts about which languages you might consider using if you want to get into the nitty-gritty programming side of things.

My favourite episode of Global Lab


The Magic of Podcasting

CASA’s homegrown podcast, The Global Lab, is shortly to relaunch with a new team of interviewers appearing alongside the wizened faces (/voices) of Steve Gray, Hannah Fry and Claire Ross (and me, of course). As part of this relaunch, we’re also getting our back catalogue onto Soundcloud, linking to all of those interviews over the last two and a bit years, and getting the original team to give a shout out to their favourite episode as it goes up. Because it’s March, I started to use the #marchOfGlobalLab hashtag, which quickly turned into #theImplacableMarchOfGlobalLab in my mind, thanks to its connotations of an army of podcasters and interviewees.

What have I learned from Global Lab since Steve and I started it in 2011? Well, I arguably already knew a fair bit about podcasts, at least if my share of two Sony awards is to be taken seriously*. I suppose I learned things I already knew, namely that the most important and valuable thing about any endeavour is usually people, and people in demanding jobs with other priorities will struggle with time-consuming things like podcasting. But I’ve also thought about ways to help with that, so I’ll share some of those here for other podcasters.

So, firstly, lower the barrier to entry. The original show format contained a “news section”, which was basically a chat between two hosts, followed by a short interview, then a brief outro. The news section might take half an hour to prepare, an hour to record and two hours to edit. Half a day’s work every fortnight was too much, so we ditched it. Also, a light edit is ok in the right circumstances. Our interviews used to run for 30-60 minutes, edited down to 15-20 minutes (which takes at least two hours unless you’re very fast/experienced) or unedited, which is too long for a casual audience (IMHO). Now we record 20-25 minute interviews and edit very sparingly.

That means interviews have to be well done. Although I’m not an expert interviewer, I love interviewing people. It is fascinating, it’s a real art, and I think I’ve massively improved at it since my first attempts. Making inexperienced interviewees feel at ease is important, and usually the best way to do that is to be better at interviewing – for example, knowing when to interrupt and interject, because then it will feel more like a chat and less like a monologue, and when not to, because it can be offputting. It’s important to know that the interviewer isn’t there to look like an authority on the topic. The interviewer is the voice of the audience, so if I know (or am busy showing I know) too much, I may not ask important questions at the points where the audience is getting lost. I tend to think that the audience aren’t tuning in to hear my personality, but for a show like Global Lab, we have different guests each episode, and the interviewers are the glue that binds things together, so we need to have a little personality. Hopefully not a deplorable excess.

From the perspective of bringing people in as part of the team, I’ve increasingly tried to make the tech easy. Our original workflow led to a really strong web frontend, but the process was a bit complex and not readily transferable. So we’ve reduced edit expectations and are experimenting with a Soundcloud feed. Sometimes the off-the-shelf option is the best. Also, if you have a team, use the team to train each other in the tech, technique and workflow – they will improve by teaching, and the learning process is fresh in their minds when they train someone else.

My current thinking is that getting any ongoing outreach or engagement activity rolling is in great part about finding enthusiastic people and lowering barriers to them starting and continuing, so that it becomes a small bit of their research life that they look forward to! Having a group of people who are keen really helps, as they can support one another. I hope that this normalisation of public engagement, outreach and dissemination as part of the research process will have long-term impacts. I guess we will have to check with the Global Lab team and see what they say a little bit down the line.

I’ve carefully avoided divulging my favourite Global Lab episode to date – possibly the Sounds of Science panel I participated in, but that’s not a proper Global Lab episode, just me talking about microphones and the sound of a shuttle taking off**. I honestly don’t have a favourite interviewee, and it would be a bit unfair to pick if I did. Maybe Nicholas Peroni’s social life of bats, or Jason Dittmer’s nationalist superheroes. Now if only I’d got James Kneale to talk about H P Lovecraft…

*if you haven’t heard of the Sony Awards and therefore struggle to take them, or me, remotely seriously – you are reluctantly forgiven

**it is really good

The Functional Art and other stories

Force-directed graph of whisky flavours (using d3.js)

I recently recapped some of the datavis languages, and some books I’ve found useful for getting started with them. I didn’t talk about the more conversational/popsci end of things, so I thought I’d mention some of those here. The previous post would be useful for people with some programming chops, or masters- and PhD-level students; the books here should be accessible to most, or useful as context for undergrads. This is nowhere near comprehensive; I’ll add more in additional blogposts as I go along.

First, the venerable classics:
David McCandless’s Information is Beautiful is the coffee table book of the genre, selling millions of copies and spawning an awards series. McCandless’s work is well-liked by many, but not universally so – and some people just don’t like infographics much. Edward Tufte is incredibly influential, The Visual Display of Quantitative Information being his most read. (I think) I’ve said before that I don’t tend to agree with everything he has to say – his redesign of the scatter graph just isn’t going to catch on – and his “data/ink ratio” heuristic has the danger of leading to visuals so information-rich that no one knows what’s going on (although I don’t believe that’s what he intended). He is a critical thinker, though, and a talented designer in his own right, and rightly cited in any discussion of information design. Beautiful Visualisation, edited by Julie Steele and Noah Illinsky, is a good showcase of visual design, but given it’s an edited collection, has less of an authorial position than Tufte.

Datavis for data journalism seems to be a growing literary genre – The Data Journalism Handbook (various authors) is a good roundup of data journalism case studies, focussed on journalism as much as data. Simon Rogers (formerly of the Guardian datablog, now at Twitter) released Facts are Sacred last year, which sits somewhere between McCandless and the above. It has some quite nice case studies, but the image quality is not always what it should be for a vis book. Incidentally, CASA podcast The Global Lab interviewed John Burn-Murdoch (formerly of the Guardian datablog, with FT interactive at time of writing), who gives a very good overview of what data journalism is and how to get into it.

I was very impressed with Alberto Cairo‘s The Functional Art; Cairo is an experienced data journalist and visual designer, and in this 2013 book, weaves questions of journalistic practice into some quite detailed exploration of the principles of visual design and perception. He illustrates this using personal case studies and interviews with key practitioners. To my mind, this is one of the best recent books on the subject – a disadvantage for researchers or scientists is that it tends to focus on news media and infographics more than datavis. But it marries journalism and graphic design wonderfully, even offering balanced critique of Tufte, who people often seem reluctant to criticise by virtue of his stature.

Nathan Yau‘s FlowingData website is a must-read for datavis, and his technical book on the subject is great. Last year’s Data Points is his foray into a more popular style, and is well-designed and full of great example visuals and discussions of datavis and some general design principles. To my mind, the book’s long arc is less compelling than his individual examples, but this is something I find with many vis books I read (/see?), so it might just be my preference. Certainly it’s well designed, but while the print quality is generally good, some of the more detailed images suffer a bit from the smaller format. But at least it’s small enough to read on the tube. One in the eye* for Tufte and McCandless.

Although it’s not a book (yet!), wtfviz.net is worth a look – it finds a terrible visualisation, says pithy and sarcastic things about it, and moves on to the next. It is a fun antidote to serious journalists talking seriously about the importance of telling stories and serious graphic designers eulogising about hand-drawn pie charts, and a good palate cleanser if you’re working your way through these.

Finally, if you’re reading this at time of publishing and you’re London-based, the British Library’s Beautiful Science exhibition on visualising science is open now, and runs until May 26th (2014). The “science” they cover is a pleasingly broad church; they have John Snow’s cholera map, which while being epidemiological is also considered the birth of GIS, and there’s biology and climatology and all sorts. The BL is right next to King’s Cross and Euston, so if you’re visiting the old smoke, it’s definitely worth spending 20 minutes looking at exhibits drawn from their GARGANTUAN archives alongside more recent digital creations.

Happy visualating!

*see what I did there?

Datavisualisation programming: a recap

About a year ago, I wrote this post which rounded up some useful books showcasing and providing techniques for datavis. I should say that I’m primarily a programmatic visualator (i.e. I tend not to deal with the GUI-style visualisation platforms, like Gephi or GIS packages, for example). Looking at the new books on datavis reveals a more fractured landscape, which came as a bit of a surprise to me. Two years ago, Processing was still the most powerful language for datavis, at least as far as I was concerned, and I thought it would only be a matter of time before R and Python (which I saw as its main alternatives) would create packages to generate animations and interactive visuals at the same level of elegance and sophistication. What I have seen instead is Processing going in a slightly different direction, JavaScript doing some amazing work, and R and Python not moving forward in those directions as much as I expected. They have done some interesting stuff in other directions, though.

Rather than just a simple update of the literature I covered last time, I thought it might be more interesting to compare and contrast these programming languages and talk about where I see them in the datavis landscape. Usual caveats apply: this is my limited and personal take on things, and people may be aware of libraries or techniques that I’ve missed out on with these broad brush strokes. I think you can probably do absolutely anything with any of these languages in theory – I’m more interested in what people use them for in practice. Let me know about those in the comments, or on twitter (@sociablePhysics).

Processing
Tl;dr – Processing is a language for creating images and animations, and looks lovely. It is less well suited to analysis, and not really native to the web. It’s moderately easy to learn.

Processing is a language based on Java, which uses various utility features to make Java programming easier and nicer and less of a syntactic spaghetti. I like Processing as a teaching language, because I think it is fairly approachable, but requires you to know what you are doing. It has types, does object orientation well, uses curly brackets, and has all the stuff a programming language should have. Once you’ve been programming Processing for a while, you’re programming Java. And Java is powerful and fast. If you’re into agent-based models, NetLogo and Repast are Java-based, so you’re doing it already. Processing has a structure that lends itself to interaction and animation; it’s built to do that. Processing.js and Processing for Android mean that you can create an app that can run on desktop, mobile or web with one (slightly modified, or at least carefully created) bit of code, and that has the potential to be pretty awesome. Finally, Processing looks great.

But. Processing has done some stuff that makes making video harder, and that’s a real shame. On interactive stuff, people don’t really use Java applets for the web; they use JavaScript (js). Processing is fast, but Processing.js is not especially fast, compared to libraries optimised for JavaScript. Also, Java is not the natural language for scientific computing, so if you want to get beyond vis to modelling or analysis, it can be harder work than it should be. Likewise I’ve always found the map libraries for Java to be large and unwieldy (with the exception of CASA alumnus Jon Reades’ MapThing), so GIS is not readily served here.

Reas, C. & Fry, B., 2007. Processing: A Programming Handbook for Visual Designers and Artists, The MIT Press.
Shiffman, D., 2008. Learning Processing: A Beginner’s Guide to Programming Images, Animation, and Interaction, Morgan Kaufmann.

These books provide a good intro to Processing for the beginner.

Fry, B., 2007. Visualizing Data: Exploring and Explaining Data with the Processing Environment 1st ed., O’Reilly Media.

The Fry book here is a bit dusty (it’s a few years old, now), but is the main book I’ve seen on using Processing for datavis.

Shiffman, D., 2012. The Nature of Code

Daniel Shiffman’s The Nature of Code covers a broad range of techniques related to complexity/biology/physics in Processing. It covers a lot of the approaches I take to programming in both the ASAV and AAC courses I teach and tutor on.

Within my centre, I’m a habitual Processing user, and Camillo Vargas-Ruis and Ed Manley are equally keen on Java.

d3.js
Tl;dr – d3 is pure web datavis. It’s very web friendly, easy to cut and paste but more complex to actually understand. It does web datavis amazingly well, but not much else.

d3 is a strong contender on the visualisation front to rival Processing. Author Mike Bostock (now at the NYT) has used the mantra of Data-Driven Documents (hence “d3”), creating a js library which explicitly binds data to visual (svg) objects in the browser, with lots and lots of smooth, well-optimised libraries for creating graphs, bubble charts, force graphs, pie charts, and maps. It is very diverse, fast and web-ready.

But. Maybe it’s just because I’m used to Java, but js is WEIRD. Dynamic typing, callback functions, anonymous functions – they are all kinda kooky. I think d3 is pretty easy to use for cut and paste programmers, but I find some of the things it does very odd, when you get under the hood. d3 is a visualisation language, and it’s very good at it; but I can’t imagine doing anything properly analytical with it. And programming d3 is still programming; I would characterise js as one of the harder languages to use.

Murray, Scott. Interactive Data Visualization for the Web: [an Introduction to Designing with D3]. Sebastopol, CA: O’Reilly, 2013.

This is a nice introduction to d3, and even gives some tips for those new to js. It doesn’t cover much more than the basics, but I found it a very good way in. I don’t have any recommendations for general js books, but there are plenty out there; Codecademy was quite useful for me here.

Within CASA, Rob Levy, Panos Mavros, Elio Marchione and Robin Edwards are d3 users.

Python
Tl;dr – Python does all sorts. It’s very easy to learn and use, and very flexible and powerful. But it’s not super web-friendly, and it’s not all that pretty.

Python is pretty much designed to be nice to use and learn. The syntax is way easier than any of these other languages for anything you might actually want to do. The new Python notebook lets you write narrative around your code in a nicely presented format, yet another reason why it’s a good language to learn if you’re new to programming. For scientific computing, there are tons of packages and a nice friendly user community, making Python very powerful if you want to get into modelling and analysis.
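To give a flavour of that friendliness, here’s a minimal sketch of the kind of exploratory analysis pandas makes easy – the CSV and its column names are made up for illustration, so treat it as a sketch rather than a recipe:

```python
import pandas as pd

# Hypothetical file: one row per journey, with origin and duration columns
trips = pd.read_csv("journeys.csv")

# A few lines of pandas gets you a long way:
# count and average duration by origin, longest trips first
summary = (trips.groupby("origin")["duration_mins"]
                .agg(["count", "mean"])
                .sort_values("mean", ascending=False))
print(summary.head(10))
```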

But. All that dynamic typing and whitespace can make you a sloppy programmer if you don’t know better; and I shudder at the thought of debugging large programs which use meaningful whitespace. Python outputs aren’t all that pretty. They’re fine, and it’s ok for mapping and graphing, but I haven’t seen Python produce anything really innovative and beautiful. I’m not sure I’d know where to start if I wanted to build something interactive with Python, which doesn’t mean it’s impossible – here’s something new, but I’ve not had a look yet.

McKinney, Wes. Python for Data Analysis: [agile Tools for Real-World Data]. Sebastopol, CA: O’Reilly, 2013.

This covers pandas, one of the user-friendly data manipulation packages for Python. The book builds on an IPython approach, which is a nice, friendly, literate programming environment. For educational users, Enthought Canopy is a good IPython environment.

At CASA, Steph Hugel uses Canopy to teach on the BASc data course, and Python use is pretty widespread – Hannah Fry and lots of others on the Enfolding project are keen. Python is pretty ubiquitous in scientific computing.

R
Tl;dr – R is a powerful statistical programming language. It’s great for maps, but not very flexible or web-friendly.

R is my least favourite programming language, but even I have to admit how powerful it is, and how much it’s improved in the last few years. It’s not wildly dissimilar from Python in the things it’s used for, but it started life as primarily a statistical language. It is really powerful at this; for almost any statistical analysis you might wish to do, from k-means clustering to Support Vector Machines, there will exist a package that does it. RStudio is a nice (MatLab-like) environment (IDE) and has notebook outputs for that “literate programming” vibe*. If you know what you’re doing, you can make really nice maps, too.

But. R is kinda funny-lookin. It’s only recently started using equals signs for assignment. I don’t really know why that is. R makes it very easy to do very complex stuff, and I suspect that’s a double edged sword. R isn’t particularly suited to interactive visualisation or animation, although as with all these languages, it is possible, and apparently there are ways of pushing R outputs into d3. I don’t think R is a particularly flexible language, but as with all of these, I imagine that people are figuring out clever ways to do all sorts of stuff with it.

Yau, N., 2011. Visualize This: The Flowing Data Guide to Design, Visualization, and Statistics, John Wiley & Sons.

Nathan Yau’s Flowing Data is a wonderful website of beautiful visualisations – and this book is full of great vis and great advice. He also released Data Points last year, but I’ve only just ordered it from my local bookshop, so I haven’t had a chance to read it.

At CASA, James Cheshire is the resident R ninja, and produces a lot of beautiful maps and sharp analyses with R. A lot of geographers, like Adam Dennett, like it too.

There are other specific libraries and software packages that people use, but these are probably the most common. Actually, asking around the office, there’s a pretty even spread – as well as a fair few people using C++ and MatLab, which I’ve not even mentioned†. And in the wider world, Ruby and PHP seem to be popular, although I’ve never written a line of either language. Increasingly, though, a smorgasbord approach might be the best way forward if you want to learn the skills to visualise, analyse, model and share interactive web visuals.

So there you go; if you’re interested in studying these techniques, our MRes ASAV covers a lot of Processing, some R and a little bit of Python in Adam Dennett’s GIS modules (as well as a range of 3D visualisation techniques taught by centre director Andy Hudson-Smith). If you’re already at UCL, some of these modules can be taken as options, and if you’re studying UCL’s BASc, the second year “Digital Literacy and Data Visualisation” uses Python for data analysis and some simple visualisation. Get in touch if you’d like to know more.

*I know that there are a bunch of neckbeards that won’t do any programming unless it’s on the command line with some rubbish text editor, but I like living in the 21st Century. Colour coding! Debugging! Dropdown menus! Oooooh. If you must, Sublime Text is a lovely text editor.

†MatLab is commercial and mainly used by physical scientists; I actually think it’s very good, but if you’re new to programming, I’d recommend Python or R instead. I don’t know much about C++ except it’s vaguely similar to Java, but I think it interfaces with hardware a bit better than Java.

Academic New Year’s Resolutions 2014

I feel like I’ve come a long way in the last year; written papers, applied for grants, read more, directed a course, improved my modules, created new ones, reflected a lot on my teaching – all the stuff a Proper Academic is supposed to do. And while there are still plenty of ways in which I have things to learn, there are lots of areas in which I feel I’ve understood the basics, oriented myself and got the lay of the land.

So be a bit braver is my first resolution. In many ways I now have a comfort zone I can go outside; two or three years ago that probably wasn’t the case! It’s necessary to have a little base camp to return to sometimes, or you freeze to death. It’s nice to feel I have something like that. But now isn’t the time to sit around eating beans over an open fire and belching. Not being afraid to fail is the flip side of this. After all, it would be silly to be afraid of something I appear to be so good at.

Using reading as a source of inspiration is my next resolution. A lot of academic writing is terrible, especially with the way that academic papers are incentivised to do a set of things unrelated to clear communication†. In 2013, I started transforming reading from a chore to a reflex. In 2014, I want to make it a pleasure. While I doubt I’ll ever do that with the intricacies of module proposal paperwork, I’d like to do the same for other aspects of my work. I’m a fan of the maxim that “every action is an opportunity for creativity“, but it takes a lot of energy to live up to that. It’s good enough, I think, to pick a few of those opportunities and make the most of them.

TEDx LSE (in March 2013) was an interesting event for me to take part in; and the theme I kept seeing was “connections”: whether in Helen Arney’s “use everything” – take everything you do and put it in the pot, or Ellie Saltmarshe’s talk in praise of the generalist. I’d like to form and continue to find new connections, whether it’s overlapping teaching and research, working with external partners on student projects or running new projects over the course of my public engagement fellowship. And make sure there are fun, creative things happening outside of my academic life. A lot of these things work their way into making me a better academic one way or another, and certainly contribute to my being a better and happier human being.

On the subject of connections, I like collaboration; in the last year, I’ve had the chance to collaborate on exciting work with people I like, respect and trust – academia has given me the privilege of doing that, and I’d like more, please. Taking more risks is easier when there are people to catch you, and be caught by you. I’d like to read more by people I don’t agree with*, and I’d like to find time to blog more regularly, because it helps me to organise my thoughts about things I’ve read and incentivises setting aside time for proper critical and comparative reading and reflection.

And that’s it. Doubt I’ll get all that done by next January. If you’d like to tweet me yours, I’m @sociablephysics.

——–

†oh, was it REF year? I hadn’t noticed.

*partly for the very specific reason that I want to write about the vision of architecture in The Fountainhead and how that relates to a normal person’s conception of the built environment

Communication on the web for smart men 101

I’ve seen a lot of very unhelpful comments lately, by men, on blogs, by women – usually ones women have written about sexism or some aspect of the way women are treated in particular high-skill industries (tech, science, journalism, or academia). It’s been acutely embarrassing to read so many dismissive, rude, point-missing and point-scoring discussions instigated by seemingly intelligent people. I expect some of them are just misogynist bullies and trolls, and know exactly what they are doing – but I’m prepared to give some the benefit of the doubt and say that I think there are men who may not realise that they are being unhelpful or dismissive or irrelevant or childish or hectoring or bullying. I’m mentioning this on what is nominally an academic blog because a lot of this seems to be from men who are middle-class and educated and sciencey, which superficially describes me pretty well, so perhaps I feel I recognise where some of their behaviour is coming from, which is all the more frustrating. That education and relative level of comfort doesn’t correlate with thoughtfulness, it would seem.

I’m not trying to be patronising, but there seem to be some ways in which they would get a lot more out of internet discussions by thinking about the way they interact. Here are ten I thought of for starters:

1) Don’t be a troll. If it helps, think of it this way: there is no such thing as a troll. There are bullies, there are people who tease other people to get a rise, there are people who are trying to play devil’s advocate or use humour to start an argument, there are misogynists angrily objecting to women calling out bad behaviour, and so on. Sometimes these overlap, but I don’t think any of these are very fun, useful, or worthwhile. The internet is not a contact sport and I think everyone has a worthwhile time when these stereotypes are given as wide a berth as possible. Know when you’re being playful and when you’re being a bore. TIP: people will tell you.

2) Your humour is not a get out of jail free card. Neither is “irony”. When did “I was using HUMOR” become synonymous with “don’t object to anything I just said”? Phrasing something “as if it’s a joke” doesn’t give you carte blanche to say whatever you want, consequence-free. NB: the only consequence might be that everyone thinks you’re an idiot, or a disagreeable person, and ignores you; there will be no actual Thought Police kicking down your door, just a bunch of people wondering why you’re being unpleasant and not wishing to continue the conversation with you in it.

3) You’re probably not that funny. At least not to most people. That’s ok, humour is subjective, most people don’t find the things I say very funny either. I don’t mean you should stop having a sense of humour, just be aware that if it doesn’t come across to someone it might not be their fault.

Newsflash: people use humour to say some very unpleasant things. Often this is to communicate something unpleasant in a palatable way. So, Doc, what you’re saying is, “Don’t buy green bananas”. Sometimes the teller thinks these unpleasant things are bad (cf. sexism, racism, homophobia…), but sometimes they’re saying these things because they believe or like what they express (cf. sexism, racism, homophobia…) and humour lets them pretend that they don’t really mean it if they get called on it. If there’s any ambiguity, it might be worth rephrasing what you’ve written in case everyone does think you want women to be chained to the cooker. There are people in the world who sincerely believe that, and being a Nice, Educated Chap doesn’t mean you’re automatically Not One of Those People. Indeed, it’s not unusual for Nice, Educated Chaps to do and say things that don’t seem very nice or educated at all.

4) Don’t derail. You might find it terribly interesting to raise the issue of what Evolutionary Psychology tells us about women having supple digits for manipulating dish scourers*, but it may not be what other commenters want to talk about. How about respecting that, and if people don’t seem interested in your tangent, take it somewhere else? You could write your own blog, and have the chat there if you like, what’s wrong with doing that? If people are interested, they can join in. In the same way that, when you comment, you’re taking part in a discussion on a topic that the blogger has chosen.

Similarly,

5) It’s not all about science. “Science” does not make every social situation easy to understand if we model it as a perfect sphere in a vacuum. You may feel like you’ve got a lever which moves the argument wonderfully, but simplifying a situation by ignoring important factors and claiming you’re being “scientific” or “rational” or “logical” can make people feel like you’re disregarding or diminishing their comments without adding much else to the mix. Maybe you have a great, simplifying insight, or maybe those details you’ve left out, or not thought about, are pretty crucial.

While you’re being “scientific”, consider the evidence. All the evidence**. Ok, more of the evidence. Including the evidence being presented to you by the blogger, and other commenters, even if they’re not blokes and they’re not using the same approach as you or agreeing with you. You’ve already thought about which evidence is important, now think about what evidence other people think is important and why. Maybe you could ask them (nicely) if you don’t get why. It’s not their job to educate you, but you can always ask (nicely).

If you don’t actually think science is helping and you are just using it as a cheap point-scoring tactic, please stop, it’s so boring. No one likes a Sophist. The goal of a conversation is not to win points and level up†. It is not a boss fight.

6) Don’t make it about you. This is a very man-specific thing that women have pointed out to me again and again. It goes a bit like this:

Woman: [X] community has some crappy behaviour towards women
Man [from Community X]: Well, I don’t do that
Woman: I wasn’t addressing it to you – it’s a wider issue that women need to be aware of
Man: Yeah but I’m not doing it so why are you accusing me
Woman: I wasn’t, but it’s something that needs to change
Man: I don’t do that – why won’t you admit it?
And so on

Don’t be that man. It’s unpleasant to hear that members of a community you’re part of are doing something awful. That will produce some cognitive dissonance – maybe you’ll think “oh I know all those guys, they wouldn’t behave that way, it must have been misconstrued or fabricated”; well, consider the possibility that some people do behave in that way. It doesn’t take many people doing something horrible to have a disproportionate effect; it doesn’t mean everyone is behaving that way; and neither does that mean it should not be taken seriously. And if somebody says something unpleasant happened, there is every possibility that they aren’t lying or misconstruing and it is true. Bear in mind that unpleasant people can be clever, and if they have a lot of practice at doing whatever bad thing it is they enjoy doing, they are often quite skillful in leaving enough ambiguity in their behaviour that it can make even those directly affected question how they should be feeling. Usually the answer is far from positive.

Anyway – if you’re asking yourself “Is this person writing about me?” or “Does this apply to my behaviour?” – either the answer is yes, and you need to think about changing your behaviour, or the answer is no and you need to think about others’ behaviour.

So,

7) Don’t expect automatically to be listened to or taken seriously. Everyone has the right to an opinion, and everyone has a right to ignore your opinion if they don’t think it’s helpful or especially well-informed. If you go to someone’s blog and comment ignorantly, divisively or tangentially on the subject, don’t expect anyone to care. If you’re not respectful of the person writing, why should they be respectful of you? If you think you’re making a valid point which is being ignored, never mind. Worse things happen at sea. And to women and minorities every day. Withdraw gracefully. Not grumpily. You might want to chat to these people another day, even if they seem ill-disposed to do so today.

More generally,

8) Do be sincere. Please don’t treat someone’s discussion of an issue that upsets and impacts them as an opportunity to put on your Clever Hat and show off your knowledge of logical fallacies. (NB. Being sincere is not the same thing as being humourless).

9) Be forgiving. The internet is written in ink, and people make errors, whether factual, typographical, tonal or otherwise. Actually calling someone an idiot or otherwise being rude or patronising doesn’t give them anywhere to go if they do change their mind about their views. I’ve seen people be swayed by good, compassionate argument. People so often argue against things they know to be true – the cognitive dissonance of recognising a truth and not wanting to deal with the consequences of accepting that truth is quite a motivator. Learn to recognise it in yourself as well as seeing it in others.

10) “Oh but that’s how people act on the internet” is not an excuse. Sure, we behave differently in different contexts, I would never call Ayn Rand an idiot to her face*** (as I imply repeatedly on the world wide web), but that doesn’t mean we should expect people to behave cruelly, dismissively and rudely as a matter of course. Don’t do it, or excuse it. Lead by example.

Finally, I apologise again to readers who find this patronising or simplistic. If you find it either of those things, hopefully you’re going around not doing any of these things. This really is Internet 101 as far as I’m concerned, but I’ve seen so much that doesn’t manage to meet even these basic standards. I’ve done more than one of these things in the past – I’ve certainly called people idiots when I shouldn’t have, but they were bigger boys and they called me worse back so that’s ok. I doubt that the men who are being bullies on the internet will pay me much mind, but for those men who care about more than showing off – and I think that’s a lot of men reading – just chill out a bit and listen. You can do so much better and have much more interesting conversations and learn interesting and valuable things.

I’m guessing this won’t entirely solve the Internet, but here’s to optimism.

*obviously this is nonsense, to be clear, I just made some nonsense up

**actually, you won’t be able to do that, but while you’re gathering All The Evidence In The World, we will get some peace and quiet

***she’s dead

†Unless you’re a character from an Ayn Rand novel. Then life is a debate you’ve conclusively won. Well done Dagny, you’re a Level 3 pain in the neck.

Big Social Data and Invasive Species

Spot the invasive species

I just read Emma Uprichard’s excellent piece on big data in social sciences; I’d recommend doing the same. She argues persuasively that Big Data is not the panacea that will solve social ills, and drills into some specific concerns that social scientists might want to think about as the hype machine grinds into gear. There were a few points in there I wanted to address and reinforce.

I’ll start by saying that I don’t think there are consistent definitions of Big Data, and I’m ok with that. Big Data is something I’ve always seen defined functionally (“there sure is a lot of data”) and not structurally; for example, data with large dimension (information about lots of characteristics of some population), large scale (lots of members of a population) or high rate (“live” or frequently sampled data) could all generate massive datasets. I don’t think any one of these is a necessary or sufficient condition, and I’ve seen arguments in the past which elide some of these features, which can be problematic because each poses different problems. I thought the comparison with qualitative data was especially illuminating. Here, you may have a small number of subjects but a very rich “data” set around them (questionnaires, recorded interviews, and so on). This represents a very high-dimensional dataset around a smallish population. You don’t have to be a reductionist, of course – it may not be the best analysis method to try to convert interview data to purely quantitative data and do a regression. But if you want to, you can find big data all over. The UK census is pretty big.

This leads onto the question of expertise. Why are physical scientists/computer scientists/engineers doing this work? Because they have the technical skills. I have no doubt that in a generation, social scientists will graduate with the technical chops to do the machine learning, databasing, visualisation and so on to do it themselves (in fact, we train some of those people, at least at Masters level). In the meantime, transplanted physicists become naturalised in their social soil. I’m not terribly keen to be identified as a positivist invasive species – surely this route into social sciences is as valid as any other? Couldn’t we instead communicate to undergrad physical/computer scientists the value of their skills in social sciences, and encourage them to take on some of the ideas of these disciplines? So many physicists and engineers end up in the most dismal of the social sciences when they go and get finance jobs following graduation – wouldn’t it be good to snag some of those? The fact that so many people with this high-consensus training choose to cross over suggests that there is an appetite amongst hard scientists to work in these areas (I mean in academia – apparently the financial sector offers remuneration to entice physical scientists, so they may not be as tempted). I’m not sure this needs to be viewed with such suspicion, even if you disagree on approach and methodology.

I don’t see the “methodological genocide” occurring that Dr Uprichard fears. Big Data self-evidently doesn’t have All The Answers. No one method can. Big Data’s not even a method, really. And there are plenty of important questions big data doesn’t ask, or effect change in response to. The article seems to be suggesting that sociologists need to be ready to argue back. Is that something sociologists are good at? I hadn’t noticed.*

There are some other bits of the article that I think are as true of little data as of big data. I wasn’t sure whether this was the point, but it’s certainly one worth making. Big data opens new questions and fills in detail for some older ones, but (like all data) it doesn’t predict; models and theories do that, and this is hard in social sciences, even with whizzy agent-based models and suchlike. Compressing and reducing the data does tend to regress towards the mean – but as hinted, that also allows those who aren’t in the “mainstream” to be spotted. Often it’s these behaviours which are more interesting. The ethics of how and why this is done absolutely does need to be explored – but potentially, identifying the majority allows you to chuck out that data and see interesting outliers. There are a lot of interesting quantitative techniques out there.
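To make that last idea concrete, here’s a toy sketch (entirely my own illustration, not anything from the article): flag whatever sits far from the bulk of the data, then look at that rather than the average.

```python
import numpy as np

rng = np.random.default_rng(0)
# A hypothetical behavioural measure for 10,000 people, plus three unusual cases
values = np.concatenate([rng.normal(50, 5, 10000), [95, 3, 120]])

# Crude z-score rule: anything more than 4 standard deviations from the mean
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 4])  # the interesting minority the mean would hide
```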

Data, and models, are an imperfect representation of the world. To Tukey’s “no data set is large enough to provide complete information about how it should be analysed”, we might add “no data set is large enough to describe the world we’re examining”. Data is filtered by experimental design, theoretical question, and increasingly, by the data that’s available. Data, models and analysis always need context and interpretation, to identify patterns, results and meaningless anomalies. Ironically, this is something good (natural) scientists and engineers do a lot of, too. But as the article pointed out, physical scientists aren’t used to the atoms changing their behaviour in response to their experiment**, or needing to persuade government that the results of their experiments require a change in policy**, or thinking about whether a study is ethical in the first place**. Big data won’t obliterate people interpreting things, but it might mean some of those people have (gasp) an engineering degree, or a social sciences degree that has a lot of things that make it look like a turn of the century stats or compsci degree. I’m actually rather hopeful about areas like big data, because they will allow people like me to learn a lot from sociologists. I think in asking “How and why is big data useful, and for whom?”, we will need all the expertise we can get.

*in the sweep of low-consensus subjects, I would have thought sociology an example par excellence.

**well, they are, sort of

50 Shades of Science

Being a jackass of n>1 trade is the lot of the roaming physicist who has a glimmer of self-awareness, and one regularly finds oneself exploring texts and topics in a way that undergraduate students of that discipline would find shallow. Last year, Karl Popper ruined my summer when I tried to read his Logic of Scientific Discovery. I’ve previously enjoyed The Open Society and its Enemies, but I didn’t think Logic was very good, and nor did I feel I understood it enough to really get anything out of the experience. I thought that might be it for me and philosophy of science. Luckily, a conversation with Oliver Marsh from the UCL STS department earlier this year at a Science Showoff prodded me to have a go at reading Thomas Kuhn’s The Structure of Scientific Revolutions*. “Kuhn loves science”, Mr Marsh enthused, “he even used to be a theoretical physicist!”. I was sold.

The Structure of Scientific Revolutions is pretty good. For a start, it’s readable for nonspecialists like me, and accessible. I would say that (like Popper) he makes his point fairly early on and then spends way too long parameterising and paradigmising it, but this is intended to be something like an extended paper rather than a popular science history book; it’s not entirely aimed at idiots like me. His central focus is on the paradigm – now an idea so ubiquitous that it’s pretty much common parlance, but he uses it in various ways which I will choose to sum up as “a bunch of shared ideas and values that people working in a discipline hold about the way the world works and what is interesting to study”. In his conception, “normal science” proceeds when this set of shared ideas is held, used, prodded, tested, and sometimes, found to be wanting – which precipitates a tumultuous “paradigm shift”. This description of what scientists spend their time doing, informed by, one presumes, his experience, as well as historical literature, rang true to me**.

Kuhn describes the practice of normal science as like a jigsaw puzzle – people using their ingenuity to get all of the pieces in place, and creating something at once satisfying and beautiful. Maybe it’s a jigsaw of a really awesome painting. You might be really good at jigsaws but not a very good painter. Actually, the metaphor that struck me as even more apposite was that of fanfiction – the phenomenon of people using their favourite characters from books and films, and writing new stories around them (I don’t think this was as popular when Kuhn was writing, or maybe he’d have thought of it himself). Where else do smart minds take other people’s ideas and tropes, and then exert their creativity to test them to destruction on the rocks of plausibility? Fanfic writers do some pretty… risqué stuff with their favourite characters, but they are constrained by the canon that they are inspired by and hold so dear. Edward and Bella outgrow the roles written for them, and you have Dorian Gray or David Gray or whatever he’s called, and bingo! Paradigm shift.

Ok, scientists test the canon of their subject’s mainstream on esoteric experiments and theories, to destruction on the rocks of nature. And in science, the source material is really good***, and a lot of the fanfic is really good. But laboratory physicists are exactly the sort of people I expect to write fanfic. Or, to return to Kuhn’s original formulation, would you rather paint a picture or do a jigsaw puzzle? Maybe Kuhn’s conception is a false dichotomy; constraints breed creativity, and even artists have limits set by their media. Even within paradigms, there are shades of grey.

As a (relatively) newly-minted UCL lecturer, I’ve been working to complete my teaching qualification at the Institute of Education – a “Professional Certificate” in higher and professional education. It’s given me a lot of opportunities to reflect on my Brownian academic motion, and how this informs where I am now as a teacher, but also a researcher (short version: physics undergrad, solid state physics/materials science doctorate, 4 years as a medical physicist, and now 3 as a social physicist). One of the more interesting elements for me was one that came out in passing during a session led by Holly Smith – the idea of “high-consensus” and “low-consensus” subjects.

The definition is sort of what you’d imagine: high-consensus subjects agree on a lot of stuff and low-consensus ones don’t. Holly pointed me to one of the original papers on the subject – an article from 1973 in which researcher Anthony Biglan carried out a study at the University of Illinois, and in parallel, a smaller liberal arts college.

Distributing cards with subject names on, he asked academics to rank each subject in terms of how “close” it is to their subject. He then used a bit of clever maths magic called “Multidimensional Scaling”, or MDS. MDS is clever in that it takes a series of distances (in this case, the “distances” between subjects) and assumes that this is due to points representing each subject “floating around” in some abstract space. Using the distances between them, it “triangulates” where these subjects were in the space.

To think of a more concrete example, imagine if you don’t know where the London tube stations are but you know the distances between stations. MDS would let you work out the configuration of stations in space based on just those distances†. In Biglan’s case, he was reconstructing positions in a sort of abstract space, and rather than two, he used three dimensions, which he found to correspond to the distinctions “hard/soft”, “pure/applied” and “life/non-life”. In this classification Physics would be “hard/pure/non-life”; Management might be “soft/applied/life” and so on. But as in space, there is potentially a continuum of values, not just binary yes/no answers.
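For the curious, here’s a minimal sketch of that kind of reconstruction using scikit-learn’s MDS – with pairwise distances I’ve invented, and in two dimensions rather than Biglan’s three, so nothing like his actual data:

```python
import numpy as np
from sklearn.manifold import MDS

subjects = ["Physics", "Chemistry", "Sociology", "Management"]
# Invented pairwise "distances" between subjects (0 = identical)
D = np.array([[0.0, 0.2, 0.8, 0.9],
              [0.2, 0.0, 0.7, 0.6],
              [0.8, 0.7, 0.0, 0.4],
              [0.9, 0.6, 0.4, 0.0]])

# Recover 2D positions whose pairwise distances best match D
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
for name, (x, y) in zip(subjects, coords):
    print(f"{name:>10}: ({x:+.2f}, {y:+.2f})")
```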

It’s interesting that these dimensions pop out. In the follow-on literature I’ve read, people talk about “hard/soft” distinctions, or even “paradigmatic” subjects (after Kuhn), but my favourite is “high/low consensus” – in other words, high consensus subjects are those in which there is a lot of agreement about the methods we use, the technology we employ, and/or a shared body of facts – which does sound very Kuhn-like. I like this description because it’s quite person-focussed (“how much do we agree?”), it sums up the essential features, echoes the way these subjects are taught (or have been taught), and doesn’t have the weird macho baggage of studying a “hard science” vs a “soft subject”.

Ultimately, I’m interested in this distinction because I’ve moved from some very high-consensus areas to some lower-consensus ones; and because adapting my thinking and research and teaching styles hasn’t necessarily been that easy. What my PhD taught me, though, was that even in a high-consensus subject, creativity, critical thinking, and the ability to deal with uncertainty are necessary to be a dynamic researcher. Thinking about ways to make high-consensus subjects more low-consensus is interesting to me, because it’s arguably that kind of thinking that opens up academia to public participation and engagement, and opens up teaching to be a more valuable process for the teacher. There are plenty who would say it makes it a more valuable process for the learner too. But taking teaching and research from a high-consensus world into a low-consensus one can be challenging, even for serial interdisciplinarians like me.

*I know it’s 50 years old. I am behind on my reading list.

**Of course, I should always be a bit wary of reading too many things that simply remind me of things I’d forgotten I believed

***better than Twilight, even

†actually, you’d also need the absolute positions of two stations to get the absolute location and correct n/s/e/w orientation right, but everything else would be ok.

Biglan, Anthony. “The Characteristics of Subject Matter in Different Academic Areas.” Journal of Applied Psychology 57, no. 3 (1973): 195–203. doi:10.1037/h0034701.
Kuhn, Thomas S. The Structure of Scientific Revolutions. Chicago: The University of Chicago Press, 1962.

Bike webs

The paper we published in PLOS ONE last week covers bike share schemes from five cities: the large ones in London and Washington, DC, and smaller ones in Boston, Minneapolis, and Denver. It might seem odd that we’ve chosen these cities, but they were chosen for the availability of data more than for any other reason. A number of cities provide or tacitly allow access to real-time feeds of their bike data, and this makes it possible to see how many bikes and spaces there are in their docking stations. To do analysis on travel patterns, you need journey data – and few cities offer that. Other people have looked at Paris and Lyon, for example, but this was under private arrangements with those schemes. All of the data we use has been published openly on the Internet: for TFL data, you do technically need a developer login, but that’s free. The only exception to this is Denver, who shared their data with us thanks to the kindness of their hearts and Ollie’s notoriety in the cycling community.

The visualisations we produced were a first step, but we wanted to understand a little more about what makes these schemes tick. We created some summary graphs to show how far people travel, and how long for, and then we took the source-destination journey frequencies and converted that to a matrix (well, a big source-destination lookup table), and then to a network. The webs you see below represent the start and end points of bike journeys, with the weight of each “strand” determined by how many bikes had taken that journey.
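If you’re curious what that conversion looks like in practice, here’s a toy sketch (invented station names and journeys – the real pipeline is rather longer than this):

```python
import networkx as nx
from collections import Counter

# Hypothetical journey records: (origin station, destination station)
journeys = [("Waterloo", "Bank"), ("Waterloo", "Bank"),
            ("Waterloo", "London Eye"), ("Hyde Park N", "Hyde Park S"),
            ("Hyde Park S", "Hyde Park N")]

# Count source-destination frequencies, then build a weighted, directed network
flows = Counter(journeys)
G = nx.DiGraph()
for (src, dst), count in flows.items():
    G.add_edge(src, dst, weight=count)

print(G["Waterloo"]["Bank"]["weight"])  # 2 journeys along this "strand"
```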

London Weekday Journeys

There are established techniques in network analysis which detect “communities” within networks. If you imagine the network of your Facebook friends, you may well be distantly connected to Kevin Bacon or whoever, but there will likely be big interconnected groups around people you work with, know from school, play sports with, your family, and so on. Within these groups everyone knows each other, but you may be the only common link between your work and your family, for example†. Spatial systems may have similar communities, but in this case a community represents a part of the city that tends to keep to itself more than it connects with the rest of the city. In London, at lunchtimes and weekends, Hyde Park is like this – people cycle around within this region more than they enter or leave it.
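In code, community detection might look something like this sketch – networkx’s modularity-based detector on a toy weighted network; the method used in the paper differs in its details:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy weighted network: two tight triangles joined by one weak link
G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 5), ("B", "C", 4), ("A", "C", 6),  # first cluster
    ("X", "Y", 5), ("Y", "Z", 4), ("X", "Z", 6),  # second cluster
    ("C", "X", 1),                                # weak bridge between them
])

communities = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in communities])  # [['A', 'B', 'C'], ['X', 'Y', 'Z']]
```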

With spatial systems, testing whether these communities are real or significant is a bit different. You might expect things which are close together and busy to interact a lot, because if there are a lot of people leaving Waterloo, say, and lots of people in London want to go to the London Eye, say, those two facts combined with how close these two locations are pretty much guarantees that lots of people will go from Waterloo to the London Eye. On the other hand, fewer people will cycle from Waterloo to Regent’s Park, because it’s much further, and not many people will go from Waterloo to Elephant and Castle, because although they are close, Elephant is not as popular as the London Eye. Well, I suppose it depends who you ask, but for the cyclists we looked at, it’s certainly true.

This is all a long-winded way of saying that it’s very easy to create an analysis that tells you that busy routes occur between popular things that are close to one another, but if you’re expecting this you can see which routes are used more than you’d expect just based on popularity and proximity. For example, in London, routes from Waterloo to the financial district are busier than you’d expect, presumably because people who work in the financial district live in parts of the UK which are served by trains which arrive at Waterloo. It’s key that these factors have nothing to do with our null model, namely: routes are used in proportion to the popularity of the start and end locations and the inverse* of the distance between them.

This approach was developed by Paul Expert and others in a paper on a Belgian telecommunications network; by downplaying this spatial component, they were able to detect communities which clustered on the basis of which language was spoken – and not just on how close together and populous different towns and cities were. I’ve taken a slightly different tack, which is to show the residuals – what remains when you subtract this null model from the real data. It’s mathematically the same approach, but this visualisation highlights flows above and beyond the proximity/popularity model (here, blue flows are bigger than you’d expect and red flows are smaller than you’d expect).
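As a bare-bones sketch of that residual idea (my own simplification – popularity times an inverse-distance deterrence, where the paper fits a semi-empirical function; the numbers are invented):

```python
import numpy as np

# Toy observed flows between three stations, and the distances between them
observed = np.array([[0, 90, 10],
                     [80, 0, 30],
                     [15, 25, 0]], dtype=float)
distance = np.array([[1, 1, 4],
                     [1, 1, 4],
                     [4, 4, 1]], dtype=float)

out_pop = observed.sum(axis=1)  # how busy each origin is
in_pop = observed.sum(axis=0)   # how popular each destination is

# Null model: flow ~ origin popularity * destination popularity / distance,
# rescaled so the total expected flow matches the total observed flow
expected = np.outer(out_pop, in_pop) / distance
np.fill_diagonal(expected, 0)
expected *= observed.sum() / expected.sum()

# Positive residual = busier than popularity and proximity alone would predict
print(np.round(observed - expected, 1))
```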

London weekday residuals

If you apply community detection to this, it partitions London into a set of communities as below (in reading this, it’s worth noting that Group 3 might just be “everything else” – i.e. not a meaningful cluster). You can hopefully see Hyde Park in the west, the City of London in the east, and a couple of other clusters with a less obvious meaning.

Communities based on spatial residuals

I usually get asked two questions about the work we’ve done on bikeshares. The first is “What did you discover?”, and, being honest, I think the main drawback of the paper for me is that there’s no “shazam” moment. It’s raised a lot of questions for me, around whether we can get more insight by looking at the time-varying networks. The rush hour network will look very different from the weekend network, or by season – what you see above is an amalgamation of months of data from different times of day. Secondly, I would like to do more work on whether the communities we detect are robust (as in, how much do they change based on the time of year, day, etc and how much of that is a methodological issue vs how much these communities actually change). In terms of results, the things I learned about the spatial and temporal uses of the scheme in London seemed pretty intuitive to me. Commuter stations are popular, rush hour is busy, and Hyde Park is more popular at the weekend and lunchtime – most of the results simply support and quantify these things. For other cities, I wish I had more insight into their geographies, and while a group of friends and colleagues helped to give me a little insight, I still feel I lacked some of the qualitative context.

The second question I’m usually asked is whether this is useful for operational planning – moving bikes around, and suchlike. At the moment, I think the answer is no, because to fully account for user choice you need to deal with those situations where users can’t get a bike, or find a space, and that requires something a bit more like an agent-based model. Something where the consequence of a full or empty dock can propagate through the network in a causal and time-sensitive way. If TFL or any other bikeshare schemes are interested in data visualisations or analyses, we’re very happy to work with external partners, and find ways to model and develop it further.

The full citation for the paper is:

Zaltz Austwick, M. et al., 2013. The Structure of Spatial Networks and Communities in Bicycle Sharing Systems. PLoS ONE, 8(9), p.e74685.
you can find it here:
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0074685

And Ollie O’Brien and other co-authors will be at ECCS in Barcelona next week presenting a poster on this work – please do say hello. If you have any questions or comments, feel free to go below the line, or tweet me @sociablephysics.

†unless you work for News International or a cosy family business of another kind

*technically, it’s more complex than a simple inverse relationship, and a semi-empirical function based on stand proximity is used for this paper. Other options include journey duration or routed distance, but those have their own drawbacks. You’ll have to read the paper.