You just signed up for PyDataLondon and you are super excited about it! Since you hear that measuring Twitter sentiment is all the rage these days (be it for speculating in the stock market, or identifying a viral product), you decide that you also want in. Let's try to apply some NLP (natural language processing) goodness to analyze #PyDataLondon tweets!
First grab the data that we've downloaded for you.
The data is saved in the pickle format.
Don't be worried if you don't understand this part - it's just to set you up for the main parts.
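If pickle is new to you, here's a minimal round-trip sketch (the file path below is a throwaway temp file, not our dataset):

```python
import os
import pickle
import tempfile

# a tiny made-up object to save and restore
data = {'tweets': ['hello #PyDataLondon']}
path = os.path.join(tempfile.gettempdir(), 'pickle_demo.pkl')
with open(path, 'wb') as f:
    pickle.dump(data, f)       # serialize to disk
with open(path, 'rb') as f:
    restored = pickle.load(f)  # read it back
restored == data  # True
```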
import pickle
with open('./datasets/twitter_data.pkl', 'rb') as pickled_file:
    twitter_data = pickle.load(pickled_file)
Let's check how many tweets we have.
Note that IPython automatically displays the last output in a cell,
so it is enough to write len(twitter_data)
instead of print(len(twitter_data))
len(twitter_data)
Let's see what a tweet looks like
# Each tweet is represented by a dictionary with the following fields:
twitter_data[0].keys()
# Text of the first tweet
twitter_data[0]['text']
# We can extract the text from the tweets
tweets_text = [tweet['text'] for tweet in twitter_data]
# To see if it works, print out the first 10 tweets
tweets_text[:10]
Let's also take a look at the number of characters in a tweet. This dataset is from before Twitter changed its character limit, so you would expect mostly tweets under 140 characters.
tweet_lengths = [len(text) for text in tweets_text]
# Let's print the length of the first 10 tweets
tweet_lengths[:10]
We can get a better understanding of our data if we plot a histogram instead of looking at a list of numbers.
import pandas as pd
# Get notebook to show graphs
%pylab inline
# Use new pretty style of plots
matplotlib.style.use('ggplot')
# Because data scientists hate charts with no labels, let's add them :D
plt.ylabel('frequency')
plt.xlabel('number of characters in tweet')
# We can transform our list of tweet lengths into a pandas Series,
# which will let us use the hist() method to create a histogram
pd.Series(tweet_lengths).hist(bins=20)
# What's the average number of characters? What's the maximum or minimum?
# We will again use pandas Series instead of the python builtin type (list)
# It will allow us to use the describe method
tweet_lengths_series = pd.Series(tweet_lengths)
tweet_lengths_series.describe()
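To see what describe() reports, here's a minimal standalone example with made-up numbers:

```python
import pandas as pd

# made-up tweet lengths
lengths = pd.Series([10, 20, 30, 40])
# describe() reports count, mean, std, min, the quartiles and max
stats = lengths.describe()
stats['mean']  # 25.0
```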
We are going to use a technique called word vectors to find out which words are most commonly used together with which other words. On the way to doing that, we will also see some very cool visualizations for word counts.
from collections import defaultdict
# Count the words used in our tweets
word_count = defaultdict(int)
for tweet in tweets_text:
    for word in tweet.split():
        word_count[word] += 1
print('{} unique words'.format(len(word_count)))
# Here is a python standard library feature that is quite cool!
from collections import Counter
words = Counter(word_count)
print(words.most_common(10))
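For intuition, Counter can also count any iterable directly — a tiny standalone example on the characters of a string:

```python
from collections import Counter

# Counter consumes any iterable -- here, the characters of a string
letter_counts = Counter("abracadabra")
# most_common(n) returns the n highest counts, largest first
top_two = letter_counts.most_common(2)
top_two[0]  # ('a', 5)
```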
If you were asked to find the best chart to visualize word counts, how would you do it? Here's a cool little non-standard library that you should be able to install with a single command. Python is amazing!
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=600).generate_from_frequencies(words)
plt.imshow(wordcloud)
plt.axis("off")
Word clouds are so cool. Let's make the picture take up the whole screen, so we can stare at it IN ALL ITS GLORY :D
def enlarge(multiplier=2):
    """If you want to understand more about this function, refer to the data visualization notebook."""
    figure = plt.gcf()
    original_width, original_height = figure.get_size_inches()
    new_size = (original_width * multiplier, original_height * multiplier)
    figure.set_size_inches(new_size)
enlarge()
plt.imshow(wordcloud)
plt.axis("off")
Let's get back on track again... Too much chart porn is bad for you after all.
First, let's do some long overdue data cleanup that we spotted from the word cloud. We probably don't care about retweets, prepositions etc. And on that note, we also probably don't care about the words which only occur a couple times.
# It is good practice to exclude the most common words,
# like articles (the, a, ...), prepositions (on, by, ...) or some abbreviations (rt - retweeted)
# Note the comma after every word: without it, adjacent string literals
# silently concatenate ('this' 'do' becomes 'thisdo')
exclude_words = {
    'rt', 'to', 'for', 'the', 'with', 'at', 'via', 'on', 'if', 'by', 'how', 'are', 'this',
    'do', 'into', 'or', '-', 'you', 'is', 'a', 'i', 'it', 'in', 'and', 'of', 'from', '>',
}
word_count_filtered = {k: v for k, v in word_count.items() if k.lower() not in exclude_words}
# Let's represent the word_count_filtered as pandas DataFrame
words = pd.DataFrame.from_dict(word_count_filtered, orient='index').rename(columns={0: 'frequency'})
# The results look as follows
words.head(15)
# We want to limit our vocabulary to only the most common words
limit = 30
shortened_list = words[words.frequency > limit]
print(
    'If we limit the words to any word that occurs at least {} times, '
    'we are left with {} words (from {} words)'.format(
        limit, len(shortened_list), len(words)
    )
)
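The words[words.frequency > limit] line above uses pandas boolean indexing; here is a tiny standalone example of the same idiom on made-up frequencies:

```python
import pandas as pd

# made-up word frequencies
toy = pd.DataFrame({'frequency': [5, 40, 12, 90]},
                   index=['cat', 'dog', 'fox', 'owl'])
# boolean indexing keeps only the rows where the condition holds
common = toy[toy.frequency > 30]
list(common.index)  # ['dog', 'owl']
```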
Now we are finally all set to figure out the question we had previously posed: if a word is in the tweet, how frequently do these other words also show up in the tweet?
# First, let's create a DataFrame filled with zeros
occurrence_frequency = pd.DataFrame(0, index=shortened_list.index.values, columns=shortened_list.index.values)
# Sanity check (let's see if we succeeded, by printing the first block of the matrix)
occurrence_frequency.iloc[:5, :5]
# Next, let's remove all the unnecessary words from our tweets
allowed_words = occurrence_frequency.index
cleaned_tweets = []
for text in tweets_text:
    words_in_one_tweet = text.split()
    cleaned_tweets.append([w for w in words_in_one_tweet if w in allowed_words])
# To check if everything works, we print the first 10 tweets
# we should see only the most common words
cleaned_tweets[:10]
# A triple for-loop to add up and fill in the counts for each word vis-a-vis other words
# (using .loc instead of chained indexing, which pandas may refuse to write through)
for word_list in cleaned_tweets:
    for word in word_list:
        for other_word in word_list:
            occurrence_frequency.loc[other_word, word] += 1
# Let's display our results (first 10 lines)
occurrence_frequency.head(10)
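As an aside, the triple loop's counts can also be obtained without explicit loops, as a matrix product of per-tweet count vectors. A sketch with a made-up toy vocabulary:

```python
import pandas as pd

vocab = ['data', 'python', 'fun']
toy_tweets = [['data', 'python'], ['data', 'python', 'fun'], ['python']]
# one row per tweet, one column per vocabulary word, holding that word's count
counts = pd.DataFrame([[t.count(w) for w in vocab] for t in toy_tweets],
                      columns=vocab)
# X^T X sums the products of word counts over tweets: the co-occurrence matrix
cooccurrence = counts.T.dot(counts)
cooccurrence.loc['data', 'python']  # 2 (they appear together in two tweets)
```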
Great! Now we have everything set up and we are ready to look at the cosine similarity between different words.
We are thinking of each word as an n-dimensional vector (where each dimension is the co-occurrence frequency with another specific word). The cosine similarity basically looks and says: "hey, `word_a` co-occurs a lot with `word_b` but does not appear with `word_c`. Oh hey, `word_d` also co-occurs a lot with `word_b` but not with `word_c`. I guess `word_a` and `word_d` must be quite similar then."
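To make the idea concrete, here's a small self-contained sketch with made-up co-occurrence vectors (scipy's pdist below computes the cosine *distance*, which is one minus this similarity):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# made-up co-occurrence vectors over the dimensions (word_b, word_c)
word_a = np.array([10.0, 0.0])  # co-occurs only with word_b
word_d = np.array([8.0, 1.0])   # mostly word_b, rarely word_c
word_x = np.array([0.0, 9.0])   # co-occurs only with word_c

cosine_sim(word_a, word_d)  # close to 1: similar words
cosine_sim(word_a, word_x)  # 0: nothing in common
```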
from scipy.spatial.distance import pdist, squareform
cosine_distances = squareform(pdist(occurrence_frequency, metric='cosine'))
cosine_distances.shape
# Let's look at the top left corner of our array
cosine_distances[:5,:5]
You can see that the distance between any word and itself is 0. Let's flip it around for a second and look at similarity instead.
cosine_similarities_array = np.exp(-cosine_distances)
similarity = pd.DataFrame(
cosine_similarities_array,
index=occurrence_frequency.index,
columns=occurrence_frequency.index
)
similarity.head(10)
Now you can see that any word is 100% similar with itself.
Well that is great and all, but how would you visualize word similarity?
It turns out that scikit-learn has just the tool for us:
from sklearn import manifold
# see http://scikit-learn.org/stable/modules/manifold.html#multidimensional-scaling
mds = manifold.MDS(n_components=2, dissimilarity='precomputed')
words_in_2d = mds.fit_transform(cosine_distances)
words_in_2d[:5]
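As a quick self-contained illustration, here is MDS applied to a tiny made-up distance matrix (the random_state argument is our addition, purely for reproducibility):

```python
import numpy as np
from sklearn import manifold

# a tiny made-up distance matrix between three "words"
distances = np.array([[0.0, 1.0, 2.0],
                      [1.0, 0.0, 1.0],
                      [2.0, 1.0, 0.0]])
mds = manifold.MDS(n_components=2, dissimilarity='precomputed', random_state=0)
# one 2-d point per input "word", with pairwise distances preserved as well as possible
coords = mds.fit_transform(distances)
coords.shape  # (3, 2)
```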
MDS allows us to go from the n by n matrix down to a more manageable lower-dimension representation of the n words.
In this case, we choose a 2-d representation, which allows us to...
# make a bubble chart
counts = [word_count[word] for word in occurrence_frequency.index.values]
plt.scatter(x=words_in_2d[:,0], y=words_in_2d[:,1], s=counts)
# let's enlarge it and add labels
enlarge()
important_words = words[words.frequency > 80].index.values
for word in important_words:
    idx = occurrence_frequency.index.get_loc(word)
    plt.annotate(word, xy=words_in_2d[idx], xytext=(0, 0), textcoords='offset points')
plt.scatter(x=words_in_2d[:,0], y=words_in_2d[:,1], s=counts, alpha=0.3)
That's cool - you can see clusters of related words emerging.
If you've gotten to here, a big congratulations on finishing the first part of this tutorial!
If you still have time, here are a couple of suggestions for you to work on:
from IPython.core.display import HTML
HTML("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Fun with Clusters at <a href="https://twitter.com/hashtag/PyDataLondon?src=hash&ref_src=twsrc%5Etfw">#PyDataLondon</a> <a href="https://t.co/j42lbx4kyx">pic.twitter.com/j42lbx4kyx</a></p>— Lewis Oaten (@lewisoaten) <a href="https://twitter.com/lewisoaten/status/728548835082047489?ref_src=twsrc%5Etfw">May 6, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")
Turns out you can color the clusters!
In this part we will use TextBlob to determine the sentiment of the tweets. TextBlob ships with pre-trained classifiers that we can use for this purpose, so it is quite plug and play.
First, let's make sure we understand how it works:
from textblob import TextBlob
# Let's check the polarity of a positive sentence (try some other sentences as well!)
blob = TextBlob("Life is good.")
blob.polarity
# Now we can check the polarity of a negative sentence (try some other sentences as well!)
blob = TextBlob("Life is tough.")
blob.polarity
For TextBlob, we also need to clean the tweets to remove links and special characters.
import re
def clean_tweet(tweet):
    """Remove @mentions, links and special characters from a tweet."""
    # the pattern is a raw string so the backslashes reach the regex engine intact
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
cleaned_text = [clean_tweet(tweet['text']) for tweet in twitter_data]
cleaned_text[:5]
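To sanity-check the pattern, here's a self-contained version applied to a made-up tweet (same regex as above, wrapped as a raw string):

```python
import re

def clean_tweet(tweet):
    # drops @mentions, non-alphanumeric characters and URLs, then collapses whitespace
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

clean_tweet("Loving #PyDataLondon! @pydata https://t.co/abc123")
# 'Loving PyDataLondon'
```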
Let's check the sentiment of each tweet!
tweets_with_polarity = [(TextBlob(t).polarity, t) for t in cleaned_text]
# let's check the results
tweets_with_polarity[:5]
# the most positive tweets
sorted(tweets_with_polarity, key=lambda tup: tup[0])[-10:]
# the most negative tweets
sorted(tweets_with_polarity, key=lambda tup: tup[0])[:5]
Check out this tutorial if you are interested.
In part 1, we looked at word count / word level analytics. Inspired by the unreasonable effectiveness of character-level language models, let's try to use a Maximum Likelihood Character Level Language Model to generate Shakespeare!
# First we need a large body of text
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
# let's see what the file contains
with open("shakespeare_input.txt") as f:
    shakespeare = f.read()
print(shakespeare[:300])
from collections import Counter
def train_char_lm(data, order=4):
    """Train the Maximum Likelihood Character Level Language Model."""
    language_model = defaultdict(Counter)
    # we add special characters at the beginning of the text to get things started
    padding = "~" * order
    data = padding + data
    # count how many times a given letter follows a particular n-char history
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        language_model[history][char] += 1
    # we normalize our results
    normalized = {hist: normalize(chars) for hist, chars in language_model.items()}
    return normalized
def normalize(counter):
    """Normalize counter by the sum of all values."""
    sum_of_values = float(sum(counter.values()))
    return [(key, value / sum_of_values) for key, value in counter.items()]
# Let's train our model!
language_model = train_char_lm(shakespeare, order=4)
# Check what the model looks like
list(language_model.items())[:6]
This means that after `Firs`, we always get `t` with probability 1. But after `First`, we might see a space with probability 0.83, a comma with probability 0.082, etc.
Let's check which letter is the most probable after `hous`. Since we trained a model of order 4, we look only at the last 4 letters.
# Other example
language_model['hous']
The most probable, as expected, is `e` (house).
Why `a`? Because `hous` can be part of `thousands`.
Play around with this!
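If you want to trace the counting by hand, here is a tiny self-contained toy (same counting idea as train_char_lm above, but order 2, no normalization, and a made-up input):

```python
from collections import Counter, defaultdict

def train_toy_lm(data, order=2):
    """Count which character follows each `order`-character history."""
    model = defaultdict(Counter)
    data = "~" * order + data  # same padding trick as above
    for i in range(len(data) - order):
        model[data[i:i + order]][data[i + order]] += 1
    return model

toy = train_toy_lm("banana", order=2)
dict(toy["an"])  # {'a': 2}: after "an" the only letter ever seen is "a"
```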
Now let's use the model to generate some Shakespearean!
from random import random
def generate_letter(model, history, order):
    """Generate the next letter with the given probabilities."""
    history = history[-order:]
    probabilities = model[history]
    x = random()
    for character, prob in probabilities:
        x = x - prob
        if x <= 0:
            return character
def generate_text(model, order, nletters=1000):
    """Generate new text using our model."""
    # Use the special character to get things started
    history = "~" * order
    out = []
    for i in range(nletters):
        c = generate_letter(model, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)
print(generate_text(language_model, 4))
It is amazing how such a simple model is enough to generate text that has the structure of a play, with capitalized character names in the script, etc.
Run the above again and try generating more text!
We can also increase the model order to get even better results, though training will take considerably longer. Once the model is built, however, generating new text is quite fast.
# Finally, try order 10. It can take a while...
language_model = train_char_lm(shakespeare, order=10)
print(generate_text(language_model, 10))