122,379 tweets about climate change
May 22, 2017 · 7 minute read · R · Climate Change
I’ve spent a significant chunk of my career analyzing media content to understand coverage of different environmental issues, ranging from climate change to drought to endangered species. This is a somewhat laborious process in which you carefully read and code hundreds of articles, looking for different frames, etc.
But there’s a universe of automated text analysis tools out there that might be really helpful for getting a broad overview of content, even if they can’t replace human-led analysis. I decided to try it out by analyzing a bunch of tweets about climate change. I based a lot of this initial analysis on David Robinson and Julia Silge’s Text Mining with R, which is a nice introduction to text analysis using R and the tidyverse tools.
Downloading tweets
The first step is to download the tweets. The rtweet package makes this pretty easy. Check out the package repository for useful details. I also put my download script on GitHub, so you could just crib from that.
rtweet uses Twitter’s APIs to download tweets for a given search term. Thanks to Twitter’s API limits, rtweet can download about 18,000 tweets every 15 minutes, but you can set it to try again after a 15-minute break by using the retryonratelimit option.
I thought I would be clever and download 1 million tweets about climate change. Here’s the code I used:
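library(rtweet)

# Search for either phrase, skip retweets, and keep retrying after each rate limit
tweets <- search_tweets(q = '"climate change" OR "global warming"',
                        n = 1000000,
                        include_rts = FALSE,
                        retryonratelimit = TRUE)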
It turns out that the API that rtweet calls limits you to tweets from about the last 10 days, so I got a “measly” 122,379 tweets. People tweet about climate change a lot.
Before we proceed, a caveat: these tweets are not the full discussion about climate change on Twitter. They are limited to tweets in English that contained the search terms “climate change” or “global warming” and occurred in my search window, which was about 5/8/2017 to 5/18/2017. They aren’t representative of anything beyond that, but they’re still pretty interesting.
Basic stats
First, let’s look at the tweets as a whole. As I mentioned above, there were 122,379 tweets. Those tweets received an average of 4.6 favorites and 2.3 retweets. The most favorited and retweeted tweet was this joke about climate change and the Texas weather:
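For anyone following along, here is a minimal sketch of how summary numbers like these might be computed with dplyr. The toy data frame stands in for the real download, and the column names (favorite_count, retweet_count, screen_name) are assumptions about what the download returns, so check them against your own data.

```r
library(dplyr)

# Toy stand-in for the downloaded tweets. The column names are assumptions,
# not necessarily what rtweet returns in your version.
tweets <- tibble(
  screen_name    = c("a", "b", "c", "d"),
  text           = c("one", "two", "three", "four"),
  favorite_count = c(10, 0, 2, 4),
  retweet_count  = c(5, 1, 0, 2)
)

# Total number of tweets plus average favorites and retweets
tweet_stats <- tweets %>%
  summarise(
    n_tweets      = n(),
    mean_favorite = mean(favorite_count),
    mean_retweet  = mean(retweet_count)
  )

# The single most favorited tweet
top_tweet <- tweets %>%
  slice_max(favorite_count, n = 1, with_ties = FALSE)
```

Swapping favorite_count for retweet_count in the last step gives the most retweeted tweet instead.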
Literally every state knows this struggle it's called global warming https://t.co/ShWf1q5jDp
— Mia-Simone Green (@miasimoneg_) May 11, 2017
As is often the case on Twitter, the second-most favorited and retweeted Tweet about climate change was an uncredited rip-off:
Literally every state knows this struggle it's called global warming https://t.co/07hzXOZI4R
— Dory (@Dory) May 11, 2017
Looking for patterns in the tweets, you can see that the number of tweets per day trended downward over the period, especially if you account for the fact that people tweet less on the weekend. This might be a sign that the tweeting was driven by some news event(s):
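A daily count like the one described here could be sketched as follows. This is an illustration on made-up timestamps, and it assumes the tweets carry a POSIXct timestamp column called created_at.

```r
library(dplyr)

# Made-up timestamps; created_at is an assumed column name
tweets <- tibble(
  created_at = as.POSIXct(
    c("2017-05-09 08:00:00", "2017-05-09 17:30:00",
      "2017-05-10 12:00:00", "2017-05-13 09:15:00"),
    tz = "UTC"
  )
)

daily_counts <- tweets %>%
  mutate(
    day     = as.Date(created_at, tz = "UTC"),
    # %u is the ISO weekday number: 6 = Saturday, 7 = Sunday
    weekend = format(day, "%u") %in% c("6", "7")
  ) %>%
  count(day, weekend)
```

Plotting n against day (for example with ggplot2) and marking the weekend days makes it easier to separate the weekly rhythm from a genuine decline.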
Word frequencies
Since my goal is to try to get a rough sense of what 122,379 tweets have to say, I thought looking at word frequencies would be a good place to start. Here are the top 10 words mentioned in the tweets, after removing common words like “the” and “and”; the words “climate”, “change”, “global”, and “warming”; and numbers:
Given that these tweets were from May, 2017, I was pretty surprised to see that Obama was the most commonly mentioned word! Comparing tweets that mention Obama to tweets that mention Trump might be an interesting future exercise, but for now I’ll say this: Twitter is a strange place.
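The word counts above follow the general tidytext recipe from Text Mining with R. Here is a rough sketch of that recipe on three invented tweets; the data and the search_terms helper table are illustrative, not my actual script.

```r
library(dplyr)
library(tidytext)

# Invented tweets for illustration
tweets <- tibble(
  id   = 1:3,
  text = c("Obama talks climate change in Milan",
           "Global warming and the 14 car convoy",
           "Obama and climate change again")
)

# The search terms appear in every tweet, so they carry no information
search_terms <- tibble(word = c("climate", "change", "global", "warming"))

word_counts <- tweets %>%
  unnest_tokens(word, text) %>%             # one lowercase word per row
  anti_join(stop_words, by = "word") %>%    # drop common words ("the", "and", ...)
  anti_join(search_terms, by = "word") %>%  # drop the search terms themselves
  filter(!grepl("^[0-9]+$", word)) %>%      # drop tokens that are only digits
  count(word, sort = TRUE)
```

Taking the first 10 rows of word_counts gives a top-10 list, and the first 200 rows could feed a wordcloud.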
Another way to visualize word frequencies is through wordclouds. They are pretty and interesting, though maybe not the most analytically rigorous graphs around. Here’s a wordcloud with the top 200 words used in the tweets, again excluding common words, climate/change/global/warming, and numbers:
Now we’re starting to get a more detailed picture of the language that people used when they tweeted about climate change.
We can do the same thing with hashtags. Here are the top 15 hashtags used in the tweets, with nothing filtered out:
And here are the top 200 hashtags in a wordcloud:
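One way to pull hashtags out of the tweet text is a regular expression, sketched below on invented tweets. The pattern "#\\w+" is a simplification of Twitter's real hashtag rules, and lowercasing first is my own choice so that #ClimateChange and #climatechange count as the same tag.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Invented tweets for illustration
tweets <- tibble(
  id   = 1:3,
  text = c("Act now #ClimateChange #climate",
           "Hot again today #climatechange",
           "No hashtags in this one")
)

hashtag_counts <- tweets %>%
  # Lowercase, then collect every "#word" match into a list column
  mutate(hashtag = str_extract_all(str_to_lower(text), "#\\w+")) %>%
  # One row per hashtag; tweets without hashtags drop out
  unnest(hashtag) %>%
  count(hashtag, sort = TRUE)
```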
Of course, looking at a single word only tells you so much about a body of text. Looking at two words (“bigrams”) at a time can give a bit more context. Here are the top 10 bigrams, again after removing common words, “climate change”, and “global warming”. I didn’t remove numbers this time because if a number is being used as part of a phrase, it might be interesting:
Some of these words struck me as a bit strange, so I googled “14 car convoy” to see if I could make heads or tails of it. It turns out that Barack Obama recently gave a speech in Milan that touched on climate change. Apparently, people on Twitter enjoyed pointing out the irony of the fact that Obama (supposedly…this is the internet, after all) took a private jet and a 14-car convoy to give a speech on climate change. Ha ha. Given that this happened on May 9, I suspect it might have been one of the news events driving the tweets early in the week, but I’ll leave a detailed analysis of that for another day.
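Bigram counts like these can be sketched with the same tidytext approach, again following Text Mining with R. The tweets below are invented, and the exact filtering order is an assumption rather than a copy of my script.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Invented tweets for illustration
tweets <- tibble(
  id   = 1:3,
  text = c("Obama took a private jet to Milan",
           "A private jet and a 14 car convoy",
           "14 car convoy for a climate change speech")
)

bigram_counts <- tweets %>%
  # Split each tweet into overlapping two-word sequences
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Separate the words so each one can be checked against the stop list
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # Glue the pair back together and drop the search phrases
  unite(bigram, word1, word2, sep = " ") %>%
  filter(!bigram %in% c("climate change", "global warming")) %>%
  count(bigram, sort = TRUE)
```

Numbers are deliberately left in, which is why a phrase like “14 car” can surface.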
Interestingly, by analyzing two words at a time instead of one, I was able to get a much better understanding of what was going on in this body of tweets. However, what was going on beyond Obama’s speech? Let’s look at the top 20 bigrams to see if we can figure it out:
There’s some noise in here that doesn’t make much sense by itself (I’ll save trigram analysis for another time!). However, Rex Tillerson’s signing of the Fairbanks Declaration makes an appearance, as does Gloria Steinem, who gave an interview in which she said “if we had not been systematically forcing women to have children they don’t want or can’t care for over the 500 years of patriarchy, we wouldn’t have the climate problems that we have. That’s the fundamental cause of climate change.”
Honestly, I hadn’t heard about the Obama speech or the Steinem quote before this exercise and most of the google hits related to them were on websites that were somewhat (to be polite) “fringe”. This is interesting and/or harrowing: if tweets set the agenda for public discussion of climate change, what is the effect of the (perhaps outsized) presence of fringe media? Do the general public see those tweets or are the tweets limited to an echo chamber? That’s a research question for another time…
You can also visualize the bigrams using a network diagram to see the relationship between many of the bigrams at once. Here’s a diagram with all of the bigrams that occurred more than 400 times in the tweets. The arrows show which words preceded which:
If nothing else, this yields some interesting google search terms. For example, “dramatic venice sculpture” led me to a story about a, well, dramatic sculpture in Venice that’s supposed to call attention to sea-level rise. “Scientists build miniature worlds” led me to an interesting story about scientists using miniature ecosystems to simulate the impacts of climate change.
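The network behind a diagram like this can be built by treating each bigram as a directed edge from its first word to its second. Here is a hedged sketch using igraph, with made-up counts and the same 400-occurrence cutoff; the word1/word2/n layout is an assumption about how the bigram table is stored.

```r
library(dplyr)
library(igraph)

# Made-up bigram counts, with the two words in separate columns
bigram_counts <- tibble(
  word1 = c("private", "car", "14", "sea"),
  word2 = c("jet", "convoy", "car", "level"),
  n     = c(600, 550, 520, 90)
)

bigram_graph <- bigram_counts %>%
  filter(n > 400) %>%                     # keep only frequent bigrams
  # The first two columns become edges (word1 -> word2); n becomes an edge attribute
  graph_from_data_frame(directed = TRUE)
```

From there, igraph’s own plot() or the ggraph package (as in Text Mining with R) can draw the graph with arrows pointing from the first word to the second.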
Conclusions (for now)
So what have I learned from this? First of all, people tweet about climate a lot, though it looks like many of the tweets might be driven by news events. News-driven tweeting isn’t surprising (in fact, agenda-setting theory would predict it), but there’s an open question about the effect of fringe news sites on public perception of these issues.
Politics appears to be a big part of the Twitter discussion of climate change in this dataset, which isn’t particularly surprising. “Obama” appeared in around 10% of the tweets and “Trump” wasn’t far behind. However, most of the tweets don’t mention either Obama or Trump, so there’s a lot more to explore in these data.
In fact, this is an extraordinarily rich/fun dataset to play around with. It presents a nice overview of the climate conversation on Twitter. Each of the visualizations reveals a little bit more about the tweets, helping me to gain a better understanding of the text relatively quickly.
But this is just a start: there’s a whole garden of forking paths 😉 to explore, including sentiment analysis, looking at which websites were linked to by whom, the reach of different tweets, modeling tweet popularity, changes in tweets over time, and more. I’ll dig into it eventually, maybe even in a more formal way. In the meantime, I’ve uploaded the data to GitHub, along with most of my analysis scripts, so feel free to go nuts.