NLP Notebook (Twitter, Sentiment Analysis and A Shakespeare Generator)

Part 1 - Twitter NLP

You just signed up for PyDataLondon and you are super excited about it! Since you hear that measuring twitter sentiment is all the craze these days (be it for speculating in the stock market, or identifying a viral product), you decide that you also want in. Let's try to apply some NLP (natural language processing) goodness to analyze #PyDataLondon tweets!

Load the data

First grab the data that we've downloaded for you. The data is saved in the pickle format.
Don't be worried if you don't understand this part - it's just to set you up for the main parts.

In [1]:
import pickle

with open('./datasets/twitter_data.pkl', 'rb') as pickled_file:
    twitter_data = pickle.load(pickled_file)

Let's check how many tweets we have.
Note, that iPython automatically displays the last output in the cell, so it is enough to write len(tweets_list) instead of print(len(tweets_list))

In [2]:
len(twitter_data)
Out[2]:
3653

Explore the data

Let's see what a tweet looks like

In [3]:
# Each tweet is represented by a dictionary with a following fields:
twitter_data[0].keys()
Out[3]:
dict_keys(['retweet_count', 'in_reply_to_status_id', 'favorited', 'id', 'in_reply_to_status_id_str', 'filter_level', 'possibly_sensitive', 'favorite_count', 'in_reply_to_screen_name', 'entities', 'in_reply_to_user_id', 'is_quote_status', 'text', 'created_at', 'truncated', 'id_str', 'coordinates', 'timestamp_ms', 'in_reply_to_user_id_str', 'retweeted_status', 'retweeted', 'contributors', 'user', 'lang', 'source', 'geo', 'place'])
In [4]:
# Text of the first tweet
twitter_data[0]['text']
Out[4]:
'RT @nosolosig: Interpolación de datos meteorológicos: #Python y #ArcGis https://t.co/L38lFalyDw por @Gistraininges'
In [5]:
# We can extract the text from the tweets
tweets_text = [tweet['text'] for tweet in twitter_data]

# To see if it works, print out the first 10 tweets
tweets_text[:10]
Out[5]:
['RT @nosolosig: Interpolación de datos meteorológicos: #Python y #ArcGis https://t.co/L38lFalyDw por @Gistraininges',
 '@cityZenflagNews MyPOV: #QoTD #datascience is ask good questions.',
 'Free online learning- Python for data science  https://t.co/sbopcNdW5j',
 'RT @gugod: I never realized Python devs are that fancy https://t.co/0nlgwhCzA0',
 'New #internship opening at #Work4 in #SanFrancisco! Python #Developer #Intern https://t.co/gl5owKOD3O #Paris https://t.co/w7IqYAj9Ql',
 'RT @JobHero_io: #C++ #Python @mixpanel is seeking a Machine Learning Engineer to join their team in SF >> https://t.co/kg9EdUTJQg https://t…',
 'drunk C++, 4am Python, unknowability',
 'Python REST API Framework https://t.co/U3mvHsGT20 #webdesign',
 'Blender Game Engine et Gamekit jeu Force Cube sans script python https://t.co/0dRek2OMUW via @YouTube',
 'RT @JobHero_io: #Python #AngularJS @bastillenet is seeking a Full Stack UI Engineer to join their team >> https://t.co/bQMuw0OFzs https://t…']

Let's also take a look at the number of characters in a tweets. This dataset is from before Twitter changed their character limit, so you would expect there to be mostly < 140 character tweets.

In [6]:
tweet_lengths = [len(text) for text in tweets_text]

# Let's print legth of the first 10 tweets
tweet_lengths[:10]
Out[6]:
[114, 65, 70, 78, 132, 146, 36, 60, 101, 146]

We can have better understanding of our data if we plot a histogram instead of looking at the list of numbers

In [7]:
import pandas as pd
# Get notebook to show graphs
%pylab inline

# Use new pretty style of plots
matplotlib.style.use('ggplot')

# Because data scientists hate charts with no labels, let's add them :D
plt.ylabel('frequency')
plt.xlabel('number of characters in tweet')

# We can transform our list of tweet lengths from list to pandas Series
# it will let us to use hist() method to create histogram
pd.Series(tweet_lengths).hist(bins=20)
Populating the interactive namespace from numpy and matplotlib
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa11ac45be0>
In [8]:
# What's the average number of characters? What's the maximum or minimum?
# We will again use pandas Series instead of the python builtin type (list)
# It will allow us to use the describe method
tweet_lengths_series = pd.Series(tweet_lengths)

tweet_lengths_series.describe()
Out[8]:
count    3653.000000
mean      112.016151
std        31.293920
min         6.000000
25%        92.000000
50%       123.000000
75%       139.000000
max       152.000000
dtype: float64

Words counts

We are going to use a technique called word vectors to find out which words are most commonly used together with which other words. On the way to doing that, we will also see some very cool visualizations for word counts.

In [9]:
from collections import defaultdict

word_count = defaultdict(int)

for tweet in tweets_text:
    for word in tweet.split():
        word_count[word] += 1

# Count the words used in our tweets
print('{} unique words'.format(len(word_count)))
12683 unique words
In [10]:
# Here is a python standard library feature that is quite cool!
from collections import Counter

words = Counter(word_count)
print(words.most_common(10))
[('RT', 1704), ('Python', 1030), ('to', 758), ('#DataScience', 675), ('in', 595), ('#datascience', 579), ('for', 508), ('a', 486), ('-', 471), ('the', 463)]

Visualization

If you were asked to find the best chart to visualize word counts, how would you do it? Here's a cool little non-standard library that you should be able to install with a single command. Python is amazing!

In [11]:
from wordcloud import WordCloud

temp = {'a': 3, 'b': 1}

wordcloud = WordCloud(width=800, height=600).generate_from_frequencies(words)
plt.imshow(wordcloud)
plt.axis("off")
Out[11]:
(-0.5, 799.5, 599.5, -0.5)

Word clouds are so coool. Let's make the picture take up the whole screen, so we can stare at it IN ALL ITS GLORY :D

In [12]:
def enlarge(multiplier=2):
    """If you want to understand more about this function, refer to the data visualization notebook."""
    figure = plt.gcf()
    original_width, original_height = figure.get_size_inches()
    new_size = (original_width * multiplier, original_height * multiplier)
    figure.set_size_inches(new_size)

enlarge()
plt.imshow(wordcloud)
plt.axis("off")
Out[12]:
(-0.5, 799.5, 599.5, -0.5)

Data cleanup

Let's get back on track again... Too much chart porn is bad for you after all.

First, let's do some long overdue data cleanup that we spotted from the word cloud. We probably don't care about retweets, prepositions etc. And on that note, we also probably don't care about the words which only occur a couple times.

In [13]:
# It is good practice to exclude the most common words,
# like articles (the, a, ...), prepositions (on, by, ...) or some abreviations (rt - retweeted)
exclude_words = {
    'rt', 'to', 'for', 'the', 'with', 'at', 'via', 'on', 'if', 'by', 'how', 'are', 'this'
    'do', 'into', 'or', '-', 'you', 'is', 'a', 'i', 'it', 'in', 'and', 'of', 'from', '&gt'
}

word_count_filtered = {k: v for k, v in word_count.items() if k.lower() not in exclude_words}

# Let's represent the word_count_filtered as pandas DataFrame
words = pd.DataFrame.from_dict(word_count_filtered, orient='index').rename(columns={0: 'frequency'})

# The results are as following
words.head(15)
Out[13]:
frequency
@nosolosig: 1
Interpolación 1
de 168
datos 6
meteorológicos: 1
#Python 235
y 104
#ArcGis 1
https://t.co/L38lFalyDw 1
por 7
@Gistraininges 1
@cityZenflagNews 1
MyPOV: 1
#QoTD 1
#datascience 579
In [14]:
# We want to limit our vocabulary to only the most common words
limit = 30

shortened_list = words[words.frequency > limit]
print(
    'If we limit the words to any word that at least occurs {} times, '
    'we are left with {} words (from {} words)'.format(
        limit, len(shortened_list), len(words)
    )
)
If we limit the words to any word that at least occurs 30 times, we are left with 145 words (from 12627 words)

Colocation/co-occurrence frequency

Now we are finally all set to figure out the question we had previously posed: if a word is in the tweet, how frequently do these other words also show up in the tweet?

In [15]:
# First, let's create a DataFrame filled with zeros
occurrence_frequency = pd.DataFrame(0, index=shortened_list.index.values, columns=shortened_list.index.values)

# Sanity check (let's see if we succeeded, by printed the first blok of the matrix)
occurrence_frequency.iloc[:5, :5]
Out[15]:
de #Python y #datascience Python
de 0 0 0 0 0
#Python 0 0 0 0 0
y 0 0 0 0 0
#datascience 0 0 0 0 0
Python 0 0 0 0 0
In [16]:
# Next, let's remove all the unncessary words from our tweets
allowed_words = occurrence_frequency.index

cleaned_tweets = []
for text in tweets_text:
    words_in_one_tweet = text.split()
    cleaned_tweets.append([w for w in words_in_one_tweet if w in allowed_words])

# To check if everything works, we print the first 10 tweets
# we should see only the most common words
cleaned_tweets[:10]
Out[16]:
[['de', '#Python', 'y'],
 ['#datascience'],
 ['Python', 'data'],
 ['Python', 'that'],
 ['New', 'Python'],
 ['#Python',
  'Machine',
  'Learning',
  'Engineer',
  'their',
  '&gt;&gt;',
  'https://t…'],
 ['Python,'],
 ['Python'],
 ['python', '@YouTube'],
 ['#Python', 'Engineer', 'their', '&gt;&gt;', 'https://t…']]
In [17]:
# A triple for-loop to add up and fill in the counts for each word vis-a-vis other words
for word_list in cleaned_tweets:
    for word in word_list:
        for other_word in word_list:
            occurrence_frequency[word][other_word] += 1
In [18]:
# Let's display our results (first 10 lines)
occurrence_frequency.head(10)
Out[18]:
de #Python y #datascience Python data that New Machine Learning ... Slept Night. Stopped Eating https://t.co/AZMPaN7WCO https://t.co/25xQgIocEY 2601 Armada @EricIdle #data4good
de 292 29 83 13 69 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
#Python 29 237 9 4 35 3 2 1 9 12 ... 0 0 0 0 0 0 0 0 0 0
y 83 9 110 1 68 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
#datascience 13 4 1 579 9 93 3 8 8 13 ... 0 0 0 0 0 0 0 0 0 5
Python 69 35 68 9 1086 8 23 17 19 90 ... 37 37 37 37 37 37 0 0 7 0
data 0 3 0 93 8 202 0 0 3 0 ... 0 0 0 0 0 0 0 0 0 9
that 0 2 0 3 23 0 69 0 0 0 ... 0 0 0 0 0 0 0 0 3 0
New 0 1 0 8 17 0 0 37 0 0 ... 0 0 0 0 0 0 0 0 0 0
Machine 0 9 0 8 19 3 0 0 65 47 ... 0 0 0 0 0 0 0 0 0 0
Learning 0 12 0 13 90 0 0 0 47 139 ... 0 0 0 0 0 0 0 0 0 0

10 rows × 145 columns

Great! Now we have everything setup and we are ready to look at the cosine similarity between different words.

We are thinking of each word as a n-dimensional vector (where each dimension is the co-occurence frequency for another specific word). The cosine similarity basically looks and says, "hey word_a co-occurs a lot with word_b but does not appear with word_c. Oh hey, word_d also co-occurs a lot with word_b but not with word_c. I guess that word_a and word_d must be quite similar then."

In [19]:
from scipy.spatial.distance import pdist, squareform

cosine_distances = squareform(pdist(occurrence_frequency, metric='cosine'))
cosine_distances.shape
Out[19]:
(145, 145)
In [20]:
# Let's look at the top left corner of our array
cosine_distances[:5,:5]
Out[20]:
array([[0.        , 0.76923043, 0.22826184, 0.94164195, 0.67882878],
       [0.76923043, 0.        , 0.84850698, 0.94141068, 0.7916521 ],
       [0.22826184, 0.84850698, 0.        , 0.98641635, 0.60571218],
       [0.94164195, 0.94141068, 0.98641635, 0.        , 0.95062265],
       [0.67882878, 0.7916521 , 0.60571218, 0.95062265, 0.        ]])

You can see that the distances between any word and itself is 0. Let's flip it around for a second and look at similarity instead.

In [21]:
cosine_similarities_array = np.exp(-cosine_distances)
similarity = pd.DataFrame(
    cosine_similarities_array, 
    index=occurrence_frequency.index, 
    columns=occurrence_frequency.index
)
similarity.head(10)
Out[21]:
de #Python y #datascience Python data that New Machine Learning ... Slept Night. Stopped Eating https://t.co/AZMPaN7WCO https://t.co/25xQgIocEY 2601 Armada @EricIdle #data4good
de 1.000000 0.463370 0.795916 0.389987 0.507211 0.378848 0.392999 0.403138 0.389935 0.406722 ... 0.388428 0.388726 0.388428 0.388428 0.388428 0.388428 0.372163 0.372163 0.380086 0.375304
#Python 0.463370 1.000000 0.428054 0.390077 0.453096 0.390730 0.406480 0.415572 0.459683 0.459537 ... 0.383354 0.383341 0.383354 0.383354 0.383354 0.383354 0.394035 0.394035 0.383846 0.399630
y 0.795916 0.428054 1.000000 0.372911 0.545686 0.371793 0.396287 0.405142 0.389582 0.411863 ... 0.393984 0.394179 0.393984 0.393984 0.393984 0.393984 0.368789 0.368789 0.382965 0.369067
#datascience 0.389987 0.390077 0.372911 1.000000 0.386500 0.667969 0.399671 0.458954 0.426873 0.428370 ... 0.370691 0.370685 0.370691 0.370691 0.370691 0.370691 0.380188 0.380188 0.371371 0.456797
Python 0.507211 0.453096 0.545686 0.386500 1.000000 0.386593 0.510194 0.560222 0.493864 0.640019 ... 0.537908 0.537584 0.537908 0.537908 0.537908 0.537908 0.379933 0.379933 0.438537 0.378252
data 0.378848 0.390730 0.371793 0.667969 0.386593 1.000000 0.390528 0.408013 0.404771 0.397461 ... 0.371904 0.371905 0.371904 0.371904 0.371904 0.371904 0.392177 0.392177 0.371890 0.503738
that 0.392999 0.406480 0.396287 0.399671 0.510194 0.390528 1.000000 0.425079 0.402809 0.433386 ... 0.404519 0.404430 0.404519 0.404519 0.404519 0.404519 0.374788 0.374788 0.429359 0.380910
New 0.403138 0.415572 0.405142 0.458954 0.560222 0.408013 0.425079 1.000000 0.429176 0.474012 ... 0.409153 0.409052 0.409153 0.409153 0.409153 0.409153 0.391130 0.391130 0.391947 0.403866
Machine 0.389935 0.459683 0.389582 0.426873 0.493864 0.404771 0.402809 0.429176 1.000000 0.775371 ... 0.393028 0.392969 0.393028 0.393028 0.393028 0.393028 0.387764 0.387764 0.389608 0.395484
Learning 0.406722 0.459537 0.411863 0.428370 0.640019 0.397461 0.433386 0.474012 0.775371 1.000000 ... 0.419875 0.419747 0.419875 0.419875 0.419875 0.419875 0.409075 0.409075 0.399843 0.433697

10 rows × 145 columns

Now you can see that any word is 100% similar with itself.
Well that is great and all, but how would you visualize word similarity?
It turns out that scikit learn has just the tool for us:

In [22]:
from sklearn import manifold

# see http://scikit-learn.org/stable/modules/manifold.html#multidimensional-scaling
mds = manifold.MDS(n_components=2, dissimilarity='precomputed')
words_in_2d = mds.fit_transform(cosine_distances)
words_in_2d[:5]
Out[22]:
array([[ 0.02600584, -0.68047923],
       [-0.33685271, -0.48213439],
       [ 0.09227946, -0.70892272],
       [-0.02415638,  0.54710068],
       [ 0.27501097, -0.29394945]])

MDS allows us to go from the n by n matrix down to a more manageable lower-dimension representation of the n words.
In this case, we choose a 2-d representation, which allows us to...

In [23]:
# make a bubble chart
counts = [word_count[word] for word in occurrence_frequency.index.values]
plt.scatter(x=words_in_2d[:,0], y=words_in_2d[:,1], s=counts)
Out[23]:
<matplotlib.collections.PathCollection at 0x7fa105a3bc88>
In [24]:
# let's enlarge it and add labels
enlarge()
important_words = words[words.frequency > 80].index.values
for word in important_words:
    idx = occurrence_frequency.index.get_loc(word)
    plt.annotate(word, xy=words_in_2d[idx], xytext=(0,0), textcoords='offset points')
plt.scatter(x=words_in_2d[:,0], y=words_in_2d[:,1], s=counts, alpha=0.3)
Out[24]:
<matplotlib.collections.PathCollection at 0x7fa1059bbfd0>

That's cool- you can see there is:

  • a cluster with monty + python
  • a cluster of (I'm guessing) Spanish words
  • a cluster of data science / big data / machine learning / data analytics, which weirdly also contains @kirkdborne. Checking his twitter, it turns out he posts a lot about data science!

Dig Deeper

If you've gotten to here, a big congratulations on finishing the first part of this tutorial!

If you stil have time, here are a couple suggestions for you to work on:

  • Try to write your own code to download twitter tweets. Here is a guide that is quite comprehensive. You will have to setup a twitter developer's account, create an app and get an api token first though.
  • Try to use what we have developed so far to create your own search algorithm. eg: search for all the tweets that has to do with machine learning (and make it smart enough to automatically show anything related to data science, big data, data analytics etc)
  • We kept bumping up against resource limits, especially during the triple for loop when filling out the occurrence_frequency counts. Given n tweets, there are probably k*n words, and so it has (very very roughly) a computation complexity of O(n^3). Most of the other computations we did were mainly O(kn). Can we rewrite the code to make it better?
  • For this last scatter plot we just generated showing which words are frequently used with which other words, can we use a clustering algorithm to color them, so that we can see the clusters that we observed more clearly?
In [25]:
from IPython.core.display import HTML
HTML("""
    <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Fun with Clusters at <a href="https://twitter.com/hashtag/PyDataLondon?src=hash&amp;ref_src=twsrc%5Etfw">#PyDataLondon</a> <a href="https://t.co/j42lbx4kyx">pic.twitter.com/j42lbx4kyx</a></p>&mdash; Lewis Oaten (@lewisoaten) <a href="https://twitter.com/lewisoaten/status/728548835082047489?ref_src=twsrc%5Etfw">May 6, 2016</a></blockquote>
    <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")
Out[25]:

Turns out you can color the clusters!

Part 2 - Sentiment Analysis

In this part we will use textblob to determine the sentiment of the tweets. Textblob already has ready-trained classifiers that we can use for this purpose, so it is quite plug and play.

First, let's make sure we understand how it works:

In [26]:
from textblob import TextBlob

# Let's check a polarity of a positive sentence (try some other sentences as well!)
blob = TextBlob("The life is good.")
blob.polarity
Out[26]:
0.7
In [27]:
# Nowe we can check a polarity of a negative sentence (try some other sentences as well!)
blob = TextBlob("The life is tough.")
blob.polarity
Out[27]:
-0.3888888888888889

For textblob, we also need to clean the tweets to remove links and special characters.

In [28]:
import re

def clean_tweet(tweet):
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

cleaned_text = [clean_tweet(tweet['text']) for tweet in twitter_data]

cleaned_text[:5]
Out[28]:
['RT Interpolaci n de datos meteorol gicos Python y ArcGis por',
 'MyPOV QoTD datascience is ask good questions',
 'Free online learning Python for data science',
 'RT I never realized Python devs are that fancy',
 'New internship opening at Work4 in SanFrancisco Python Developer Intern Paris']

Let's check the sentiment of each tweet!

In [29]:
tweets_with_polarity = [(TextBlob(t).polarity, t) for t in cleaned_text]
    
# let's check the results
tweets_with_polarity[:5]
Out[29]:
[(0.0, 'RT Interpolaci n de datos meteorol gicos Python y ArcGis por'),
 (0.7, 'MyPOV QoTD datascience is ask good questions'),
 (0.4, 'Free online learning Python for data science'),
 (0.0, 'RT I never realized Python devs are that fancy'),
 (0.13636363636363635,
  'New internship opening at Work4 in SanFrancisco Python Developer Intern Paris')]
In [30]:
# the most positive tweets
sorted(tweets_with_polarity, key=lambda tup: tup[0])[-10:]
Out[30]:
[(1.0,
  'RT Awesome overview amp explanation Trifecta Python MachineLearning Dueling Languages by'),
 (1.0,
  'RT Why MachineLearning with Python is the best combination abdsc BigData D'),
 (1.0,
  'RT Why MachineLearning with Python is the best combination abdsc BigData D'),
 (1.0,
  'RT 5 Best Programming Languages to Learn for Beginners programming javascript java python cod'),
 (1.0, 'Wonderful wonderful Monty Python'),
 (1.0,
  'RT Best Ways to Learn Programming for Beginners programming code Codecademy php python javas'),
 (1.0, 'RT Why MachineLearning with Python is the best combination'),
 (1.0,
  'Checking out Machine Learning with Python Why do they form the best combinat on AnalyticBridge'),
 (1.0,
  'RT Checking out Machine Learning with Python Why do they form the best combinat on AnalyticBridge'),
 (1.0, 'I wish I understood this cause it sounds Awesome payattention')]
In [31]:
# the most negative tweets
sorted(tweets_with_polarity, key=lambda tup: tup[0])[:5]
Out[31]:
[(-1.0,
  'I don t trust benchmarks nasty things but looking at I want to believe Dat performance'),
 (-0.875,
  'Bfff Brutal sima joder Python Regius Stormtrooper BallPythonLove lt 3'),
 (-0.8,
  'job alert Senior Python Developer Dublin Dublin The Client My client is a software development company base'),
 (-0.7999999999999999,
  'like do i use php 5 or php 7 where d 6 go python 2 6 or python 2 7 fucking stupid to multi support like that just pick 1'),
 (-0.7,
  'Creating a legend for a map should not be this painful python matplotlib Basemap')]

Dig Deeper

Check out this tutorial if you are interested.

Part 3 - A Shakespeare Generator

In part 1, we looked at word count / word level analytics. Inspired by the unreasonable effectiveness of character-level language models, let's try to use a Maximum Likelihood Character Level Language Model to generate Shakespeare!

In [32]:
# First we need a large body of text
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
--2018-04-25 19:49:58--  http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt [following]
--2018-04-25 19:49:58--  https://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4573338 (4.4M) [text/plain]
Saving to: ‘shakespeare_input.txt.1’

100%[======================================>] 4,573,338   6.95MB/s   in 0.6s   

2018-04-25 19:49:59 (6.95 MB/s) - ‘shakespeare_input.txt.1’ saved [4573338/4573338]

In [33]:
# let's see what the file contains

with open("shakespeare_input.txt") as f:
    shakespeare = f.read()
print(shakespeare[:300])
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us
In [34]:
from collections import Counter

def train_char_lm(data, order=4):
    """Train the Maximum Likelihood Character Level Language Model."""
    language_model = defaultdict(Counter)
    
    # we add special characters at the beginning of the text to get things started
    padding = "~" * order
    data = padding + data
    
    # count how many times a given letter follows after a particular n-char history.
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        language_model[history][char] += 1

    # we normalize our results
    normalized = {hist: normalize(chars) for hist, chars in language_model.items()}
    return normalized


def normalize(counter):
    """Normalize counter by the sum of all values."""
    sum_of_values = float(sum(list(counter.values())))
    return [(key, value/sum_of_values) for key, value in counter.items()]
In [35]:
# Let's us train our model!
language_model = train_char_lm(shakespeare, order=4)
In [36]:
# Check how the model look like
list(language_model.items())[:6]
Out[36]:
[('~~~~', [('F', 1.0)]),
 ('~~~F', [('i', 1.0)]),
 ('~~Fi', [('r', 1.0)]),
 ('~Fir', [('s', 1.0)]),
 ('Firs', [('t', 1.0)]),
 ('irst',
  [(' ', 0.8337095560571859),
   (',', 0.08201655379984951),
   (':', 0.011286681715575621),
   ('?', 0.004514672686230248),
   ('y', 0.006019563581640331),
   ('\n', 0.014296463506395787),
   ('.', 0.028592927012791574),
   ('-', 0.008276899924755455),
   ("'", 0.0007524454477050414),
   (';', 0.007524454477050414),
   ('s', 0.0007524454477050414),
   ('l', 0.0015048908954100827),
   ('i', 0.0007524454477050414)])]

It means, that after Firs, we always get t with probability 1. But after First, we might see a space with probability 0.83, or comma with probability 0.082 etc.

Let's us check which letter is the most probable after hous. Since we generated a model with order 4, we can look only at last 4 letters.

In [37]:
# Other example
language_model['hous']
Out[37]:
[('a', 0.38618346545866367), ('e', 0.6138165345413363)]

The most probable, as expected, is e (house).

Why a? Becuase hous can be a part of the thousands.

Play around with this!

Now let's use the model to generate some Shakespearean!

In [38]:
from random import random

def generate_letter(model, history, order):
    """Generate next letter with given probabilities."""
    history = history[-order:]
    probabilities = model[history]
    x = random()
    for character, prob in probabilities:
        x = x - prob
        if x <= 0:
            return character
In [39]:
def generate_text(model, order, nletters=1000):
    """Generate new text using our model."""
    # Use the special character to get things started
    history = "~" * order
    out = []
    for i in range(nletters):
        c = generate_letter(model, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)
In [40]:
print(generate_text(language_model, 4))
First.
O, find
There welcomes Sir Hamlet!

NESTOR:
If heaven with posses.

POSTHUMBERLANDO:
Most unbelieves see accidenhood yet me! Flamity.

BRUTUS:
Go, loss, my life!
And time ever know's of penant,
Like not her from amazed:
Whom that bring us dear.

MISTRESS QUICKLY:
A noble too.

BALTHASAR:
Prince:
A doubted
Find oppose rease patience orders heart
in away sorrow this rest thou
sick in a thou tall us to be glad damned lie.

COUNTESS:
Nay, those the Count a villain? ell.

BEATRICE:
In iron on the privil aged in away, sin Rome; I verifice,--

PAROLLES:
Yours of yourselves,
That thou music womb, what the made it.
If moss'd: this it known I am in
her lips and so good inter quaintend
The knave, all spends sojour.

BASTARD:
What Ulysses on yields in think it, I change from hence friend, are up that fell three.
A places.
And doctor mine ears, upon the that thing afore, thus
From dame,
Why servant:
Go, carest humourn'd?

ROSALIND:
Here's mine your my loved authorns does you awhiles heads, w

It is amazing how such a simple model is enough to generate text that has a structure of a play, with capitalized character names in the script etc.

Run the above again and try generating more text!

We can also increase the model order to get even better results. However, it will take exponentially more time to create the model. However, once we have the model, generating new text should be quite fast.

In [41]:
# Finally, check the order 10. It can take a while...
language_model = train_char_lm(shakespeare, order=10)
print(generate_text(language_model, 10))
First Citizen:
Peace, son! and show thee, of a fool, inconstant to build upon!
Now, pray, sir?

FALSTAFF:
Well, and how?

ULYSSES:
Stand where 'tis truly owe
To that which yet distinction should have heard thence; these are all undone.

SHYLOCK:
An oath, and leave those horses from the good conclusion, he did beat me to acquaint you with him;
But none can drive him from your hands.

CAPULET:
Peace, master of his life before thy most perfect conscience.

SUFFOLK:
Ay, but yet
Let us be keen, and rather than have tidings
Of any penny tribute paid by howling Troy
To the shore, that you are, long; and be you well.
Yield: come before your fault upon me that can teach
thy fool to lie: I would your duty this way have you been true,
If heaven has an end in all: yet, you see that he is dead.

SOMERSET:
The quarrel 'twixt your throne and state,
In private, and I am done.

MIRANDA:
Alack, for mercy-lacking uses.

HUBERT:
Upon my target, thus.

PRINCESS:
From which place
We have yet many among us w

Dig Deeper

  • Try to repeat the above using tweets instead of Shakespeare text. Does it work? Is the text in tweets long enough to train our model well?
  • Our model seems to be impressive. But is the generated text really original? If we trained the model to an order of 100 or even 1000 on a really powerful machine, what would the output be if we tried to generate some text?
  • Believe it or not, there are better methods out there. If you are interested, check out this article by Andrej Karpathy describing how to generating Shakespeare-like text using Recurrent Neural Networks.