Research grants yield Twitter followers?
I went to a nice little event on Monday evening, organised by the data.ac.uk guys at the University of Southampton. Effectively the aim was to bring together people with ideas and people with skills to build some little tools, applications and visualisations using data gathered about institutions under the .ac.uk second-level domain (largely UK universities, plus entities related to research or other academic pursuits).
I personally had a really rewarding time for a number of reasons. A handful of our first year undergraduates (from a class I teach called “Space Cadets“) attended, contributed and seemed to enjoy themselves. I also learned about the research topic of one of our new PhD students (Johanna) and was impressed at how pertinent and insightful her topic is, and at how she was using this event as a way to gather preliminary data. I also got to catch up with some friends I haven’t seen for a while, like Marja and Colin. And all this while almost hacking something together!
Full disclosure, we didn’t quite finish what we aimed to do within the time, but I managed to pull it together in the pub afterwards 🙂
So what did we do? Well we started off looking at the data on Gateway to Research, as we were going to see if we could link it to news stories on university RSS feeds (do universities publish many stories about their research?). Organically, this evolved into looking at their Twitter feed instead (as the data.ac.uk Observatory already scrapes the Twitter account from homepages). As a simple goal, we wondered if there’s any observable link between number of Twitter followers and number of research grants granted.
By the end of the session we’d just about extracted all the relevant data (name of university, Twitter account, followers and number of research grants – 4 bits of data from 4 independent data sources) and displayed it as a list. We were somewhat hampered by my poor decision to attempt this in Javascript, as the Same Origin policy made it impossible to fetch data from the live APIs via AJAX (why make your data available in JSON and then not allow me to access it in Javascript, I say)*. However, a quick rewrite into PHP, which fetches the data server-side where that policy doesn’t apply, got us back on track.
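As a rough illustration of stitching the four sources together, here is a minimal Python sketch (our actual code was PHP). The function name, the dictionary shapes and the choice of join keys (provider id, Twitter handle, university name) are all assumptions made for the example, not the real structure of the data.ac.uk, Observatory or GtR responses.

```python
def join_sources(providers, twitter_handles, follower_counts, grant_counts):
    """Combine four lookups into one row per university.

    All shapes are hypothetical:
      providers:        provider id -> university name
      twitter_handles:  provider id -> Twitter handle
      follower_counts:  Twitter handle -> number of followers
      grant_counts:     university name -> number of grants
    """
    rows = []
    for provider_id, name in providers.items():
        handle = twitter_handles.get(provider_id)
        if handle is None:
            continue  # no Twitter account found for this provider
        followers = follower_counts.get(handle)
        grants = grant_counts.get(name)
        if followers is None or grants is None:
            continue  # skip rows with a missing value rather than guess
        rows.append({
            "name": name,
            "twitter": handle,
            "followers": followers,
            "grants": grants,
        })
    return rows
```

Dropping incomplete rows is a design choice for the sketch; joining on university name in particular is fragile, since the same institution may be spelled differently in different sources.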
As I said, we weren’t quite done, as I wanted to visualise this data somehow, as well as fix a few bugs. In the pub, I tried to make use of the (unfortunately deprecated) Google Image Chart API, but it was capping at some weird values. To resolve this, I output the data as CSV, imported it into Google Sheets and generated the graph manually (hack events require cutting some corners and thinking on your feet!). This is what we got:
This is the number of research grants a university has had funded against number of Twitter followers on the first Twitter account on their homepage. It’s on a log scale.
The grants vs followers data in a Google spreadsheet, in case you want to look.
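For anyone wanting to repeat the CSV step, this is a small Python sketch of dumping rows to CSV text ready for import into a spreadsheet. The column names and the `rows_to_csv` helper are assumptions for illustration; they are not taken from our PHP code.

```python
import csv
import io

# Hypothetical column names, chosen for this example only.
FIELDS = ["name", "twitter", "followers", "grants"]

def rows_to_csv(rows):
    """Serialise a list of dicts (one per university) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Using the `csv` module rather than joining strings by hand means that a university name containing a comma gets quoted correctly.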
What does it tell us? Well, it suggests that the more successful research universities also have more people listening to them on social media. Is this what we would have guessed anyway? It’s easy to say yes in hindsight, but it’s nice to have some numbers to support it. Of course, I’ve not yet run the correlation to see if this is a significant relationship; that’ll come with a bit more time.
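When I do get round to the correlation, one plausible approach is a Pearson correlation on the log-transformed values, since the graph is on a log scale. The sketch below is only that, a sketch: the `log_pearson` name is made up, and dropping zero counts (which have no logarithm) is an assumption about how to handle them.

```python
import math

def log_pearson(pairs):
    """Pearson correlation of log10(grants) against log10(followers).

    `pairs` is a list of (grants, followers) tuples. Pairs containing a
    zero or negative count are dropped, because log10 is undefined there.
    """
    usable = [(g, f) for g, f in pairs if g > 0 and f > 0]
    if len(usable) < 2:
        raise ValueError("need at least two usable pairs")
    xs = [math.log10(g) for g, _ in usable]
    ys = [math.log10(f) for _, f in usable]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)
```

The coefficient alone says nothing about significance; for a p-value, something like `scipy.stats.pearsonr` applied to the same log-transformed lists would be the obvious next step.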
Perhaps more importantly, it has helped us identify some quirks in the data and the nuances of how to handle it. For example, the Observatory will record all Twitter handles referenced on the homepage. If there’s a widget displaying a Twitter feed on the homepage, it will include all accounts @replied and retweeted. It also stores the date of an observation as the name of a property in an object, and object properties are hard to sort, so it’s difficult to get the latest observation (clearly this requires a smidgen of preprocessing). We spotted these by delving into a couple of the outliers and, interestingly, cleaning up the data moved them closer to the centre of the cluster of points.
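The smidgen of preprocessing for the date-keyed observations could look something like this in Python. The `YYYY-MM-DD` key format and the shape of each observation are assumptions for the example; the real Observatory JSON may well use a different date format, in which case `date_format` would need changing.

```python
from datetime import datetime

def latest_observation(observations, date_format="%Y-%m-%d"):
    """Return (date_key, observation) for the most recent observation.

    `observations` is a dict whose keys are date strings, mirroring the
    'date as property name' layout. Returns None for an empty dict.
    """
    if not observations:
        return None
    # Parse each key into a real date so the comparison is chronological
    # rather than alphabetical.
    latest_key = max(
        observations,
        key=lambda key: datetime.strptime(key, date_format),
    )
    return latest_key, observations[latest_key]
```

Parsing the keys matters for formats such as `05/01/2014`, where plain string comparison would not give chronological order.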
To conclude, the event was a great success. I think the 2-hour hack might be the perfect format for exploratory data hacks. It’s demoralising to spend a day or three hacking and have nothing to show for it; spending an hour or three and having a result (even a small one) is massively rewarding. I hope to tidy up this code, check the details of the data (especially what grants GtR includes) and do some stats on it. We’ve observed there’s some link (though no inference about the cause of that link) between research funding and social media popularity of universities. I became a bit more confident in hacking with data within a time constraint, and had fun doing it!
Resources
- Google spreadsheet of data and graph
- github of code
- learning providers CSV from learning-provider.data.ac.uk
- twitter accounts JSON from observatory.data.ac.uk
- example of GtR JSON
- trick for getting Twitter followers without authenticating