Scraping Twitter with Tweepy and Python

Part of Twitter’s draw is the vast number of voices offering their opinions and thoughts on the latest events. In this article, we are going to look at the Tweepy module to show how we can search for a term used in tweets and return the thoughts of people talking about that topic. We’ll then look to make sense of them crudely by drawing a word cloud to show popular terms.

We’ll need the Tweepy and Wordcloud modules installed for this, so let’s fire these up alongside matplotlib.

In [1]:
import tweepy
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

First up, you will need to get yourself keys to tap into Twitter’s API. These are freely available from Twitter’s developer site if you have a regular account.

When you have them, follow the code below to plug into the API. I’ve hidden my tokens and secrets, and would strongly recommend that you do too if you share any code!

Tweepy kindly handles all of the heavy lifting here; you just need to provide it with your information:

In [2]:
access_token = "HIDDEN"
access_token_secret = "HIDDEN"
consumer_key = "HIDDEN"
consumer_secret = "HIDDEN"


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)
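
One way to keep the keys out of any notebook you share is to read them from environment variables instead of typing them into a cell. The sketch below is a minimal example under that assumption. The variable names (TWITTER_CONSUMER_KEY and so on) and the load_credentials helper are made up for illustration; Tweepy does not require them.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}

def load_credentials(env=None):
    # Read from the real environment unless a mapping is passed in (handy for testing)
    env = os.environ if env is None else env

    # Fail early with a clear message if anything is missing
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))

    # Map our friendly names to the values found in the environment
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned values can then stand in for the "HIDDEN" strings above, for example creds = load_credentials() followed by tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"]).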

So we are looking to collect tweets on a particular term. Fortunately, Tweepy makes this pretty easy for us with its ‘Cursor’ class. The principle of Tweepy’s cursor is just like the one on your screen: it moves through the tweets in Twitter’s API and does what we tell it to when it finds something. It also handles the paging for us, working through the vast ‘pages’ of tweets that run through Twitter every second.
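
To make the idea of paging concrete, here is a toy sketch of what a cursor does for us. It is not Tweepy’s implementation: FAKE_PAGES, fetch_page and iterate_items are hypothetical names, and the "API" is just a list of lists. The point is only that the caller asks for a number of items and never has to think about pages.

```python
from itertools import islice

# A stand-in for an API that only hands back one page of results at a time.
# FAKE_PAGES and fetch_page are hypothetical; they are not part of Tweepy.
FAKE_PAGES = [
    ["tweet 1", "tweet 2", "tweet 3"],
    ["tweet 4", "tweet 5", "tweet 6"],
    ["tweet 7"],
]

def fetch_page(page_number):
    # Return the requested page, or an empty list once we run out
    if page_number < len(FAKE_PAGES):
        return FAKE_PAGES[page_number]
    return []

def iterate_items(fetch, limit):
    # Request page after page, handing items back one at a time
    def all_items():
        page_number = 0
        while True:
            page = fetch(page_number)
            if not page:
                return
            yield from page
            page_number += 1

    # Stop once we have handed back `limit` items
    return islice(all_items(), limit)

first_five = list(iterate_items(fetch_page, 5))
print(first_five)
```

Because the generator is lazy, asking for five items only fetches the first two pages; the third is never requested.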

In our example, we are going to create a function that takes our query and returns the 1000 most recent tweets that contain the query. We are then going to turn them into a string, tidy the string and return it. Follow the commented code below to learn how:

In [3]:
import re

#Define a function that takes our search query, a limit of 1000 tweets by default, defaults to the English language
#and allows us to pass a list of words to remove from the text
def tweetSearch(query, limit = 1000, language = "en", remove = None):
    
    #Create an empty list to hold the text of each tweet
    tweets = []
    
    #Iterate through Twitter using Tweepy to find our query in our language, with our defined limit
    #For every tweet that has our query, add it to our list in lower case
    #Note: in Tweepy 4 and later, api.search is named api.search_tweets
    for tweet in tweepy.Cursor(api.search, q=query, lang=language).items(limit):
        tweets.append(tweet.text.lower())
    
    #Join the tweets with spaces, so the last word of one tweet doesn't run into the first word of the next
    text = " ".join(tweets)
    
    #Twitter has lots of links, we need to remove the common parts of links to clean our data
    #Firstly, create a list of terms that we want to remove. This contains https & co, alongside any words in our remove list
    removeWords = ["https", "co"] + list(remove or [])
    
    #For each word in our removeWords list, replace it with nothing in our main text - deleting it
    #Match whole words only, so that "co" isn't stripped out of words like "could"
    for word in removeWords:
        text = re.sub(r"\b" + re.escape(word) + r"\b", "", text)
    
    #return our clean text
    return text
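
The function above deals with links by deleting the fragments "https" and "co", which leaves the rest of each link behind. An alternative, sketched here as an assumption rather than as part of the original function, is to strip whole links with a regular expression. The stripLinks name and the pattern are illustrative only.

```python
import re

# Matches http:// or https:// followed by any run of non-space characters
URL_PATTERN = re.compile(r"https?://\S+")

def stripLinks(text):
    # Remove each link, then collapse the leftover runs of whitespace
    withoutLinks = URL_PATTERN.sub("", text)
    return " ".join(withoutLinks.split())

print(stripLinks("welcome to arsenal https://t.co/abc123 #coyg"))
```

This could be applied to each tweet’s text before it is added to the main string.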

With that all set up, let’s give it a spin with Arsenal’s biggest stories of the window so far. Hopefully we can get our finger on the pulse of what is happening with new signing Mkhitaryan & potential Gooner Aubameyang. Let’s run the command to get the text, then plot it in a wordcloud:

In [ ]:
#Generate our text with our new function
#Remove all mentions of the name itself, as this will obviously be the most common!
Mkhitaryan = tweetSearch("Mkhitaryan", remove = ["mkhitaryan"])

In [5]:
#Create the wordcloud with the text created above
wordcloud = WordCloud().generate(Mkhitaryan)

#Plot the text with the lines below
plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Lots of club propaganda about how a player has always dreamt of playing for their new club?! We probably didn’t need a new function to tell us that!

And let’s do the same to learn a bit more about what the Twitter hivemind currently thinks about Aubameyang:

In [7]:
Auba = tweetSearch("Aubameyang")

In [8]:
wordcloud = WordCloud().generate(Auba)

plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Equally predictably, we have “Sky Sources” talking about a bid in excess of a figure. This is the usual phraseology that we would expect in the build-up to a transfer. I wish we had something unexpected and more imaginative, but at least we know we are getting something accurate. Hopefully you can find something more useful!
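
A word cloud sizes words roughly by how often they appear, so the same information can be read off as plain counts. Here is a minimal sketch using Python’s collections.Counter. The topWords helper and the sample string are invented for illustration; in practice you would pass in the text returned by tweetSearch.

```python
from collections import Counter

def topWords(text, n = 10, minLength = 3):
    # Split on whitespace and ignore very short words such as "a" or "to"
    words = [word for word in text.split() if len(word) >= minLength]
    # most_common returns (word, count) pairs, highest count first
    return Counter(words).most_common(n)

# Made-up sample text standing in for real tweets
sample = "sky sources bid sky sources fee sky"
print(topWords(sample, n = 2))
```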

Summary

As you already know, Twitter is a huge collective of voices. Taken all at once, they are white noise, but we can be smart about picking out terms and trying to understand the underlying opinions and individual voices. In this example, we have looked at the news on a new signing and a potential signing, and can see the usual story that the media puts across for players in these scenarios.

Alternative uses could be to run this during a match for crowd-sourced player ratings… or getting opinions on an awful new badge that a club has just released! We also aren’t limited to word clouds here, and you should look at natural language processing for some incredibly smart techniques that you can use to understand the sentiment in these messages.

You might also want to take a look at the docs to customise your wordclouds.

Next up – take a look through our other visualisation tutorials that you might also apply here.