# NLP Notebook (Twitter, Sentiment Analysis and A Shakespeare Generator)

## Part 1 - Twitter NLP

You just signed up for PyDataLondon and you are super excited about it! Since you hear that measuring twitter sentiment is all the craze these days (be it for speculating in the stock market, or identifying a viral product), you decide that you also want in. Let's try to apply some NLP (natural language processing) goodness to analyze #PyDataLondon tweets!

### Load the data

First grab the data that we've downloaded for you.
The data is saved in the [pickle format](https://docs.python.org/3/library/pickle.html#data-stream-format).  
Don't be worried if you don't understand this part - it's just to set you up for the main parts.

In [None]:
import pickle

with open('./datasets/twitter_data.pkl', 'rb') as pickled_file:
    twitter_data = pickle.load(pickled_file)
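
If pickle is new to you, the round trip below may help. It is only a sketch: the `sample_tweets` list and the temporary file are made up for illustration and are not part of the workshop dataset.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real dataset: a list of tweet-like dictionaries
sample_tweets = [{'text': 'Hello #PyDataLondon'}, {'text': 'NLP is fun'}]

# Write the object to a temporary file in binary mode...
with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as pickled_file:
    pickle.dump(sample_tweets, pickled_file)
    path = pickled_file.name

# ...and read it back in binary mode ('rb'), just like the cell above
with open(path, 'rb') as pickled_file:
    restored_tweets = pickle.load(pickled_file)

# Clean up the temporary file
os.remove(path)

restored_tweets == sample_tweets
```

One caveat worth knowing: unpickling can execute arbitrary code, so only load pickle files from sources you trust.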

Let's check how many tweets we have.  
Note that IPython automatically displays the value of the last expression in a cell,
so it is enough to write `len(twitter_data)` instead of `print(len(twitter_data))`.

In [None]:
len(twitter_data)

### Explore the data

Let's see what a tweet looks like

In [None]:
# Each tweet is represented by a dictionary with the following fields:
twitter_data[0].keys()

In [None]:
# Text of the first tweet
twitter_data[0]['text']

In [None]:
# We can extract the text from the tweets
tweets_text = [tweet['text'] for tweet in twitter_data]

# To see if it works, print out the first 10 tweets
tweets_text[:10]

Let's also take a look at the number of characters in each tweet. This dataset is from before Twitter raised its character limit, so you would expect most tweets to be at most 140 characters long.

In [None]:
tweet_lengths = [len(text) for text in tweets_text]

# Let's print the length of the first 10 tweets
tweet_lengths[:10]

We can get a better understanding of our data if we plot a histogram instead of looking at a list of numbers.

In [None]:
import pandas as pd
# Get notebook to show graphs
%pylab inline

# Use new pretty style of plots
matplotlib.style.use('ggplot')

# Because data scientists hate charts with no labels, let's add them :D
plt.ylabel('frequency')
plt.xlabel('number of characters in tweet')

# We can convert our list of tweet lengths to a pandas Series,
# which lets us use the hist() method to create a histogram
pd.Series(tweet_lengths).hist(bins=20)

In [None]:
# What's the average number of characters? What's the maximum or minimum?
# We will again use pandas Series instead of the python builtin type (list)
# It will allow us to use the describe method
tweet_lengths_series = pd.Series(tweet_lengths)

tweet_lengths_series.describe()

### Word counts

We are going to use a technique called [word vectors](http://www.eecs.qmul.ac.uk/~dm303/static/eecs_open14/eecs_open14.pdf) to find out which words are most commonly used together with which other words. On the way to doing that, we will also see some very cool visualizations for word counts.

In [None]:
from collections import defaultdict

word_count = defaultdict(int)

for tweet in tweets_text:
    for word in tweet.split():
        word_count[word] += 1

# Count the words used in our tweets
print('{} unique words'.format(len(word_count)))

In [None]:
# Here is a python standard library feature that is quite cool!
from collections import Counter

words = Counter(word_count)
print(words.most_common(10))
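
As a side note, `Counter` can also do the counting itself, without the `defaultdict` loop. The sketch below uses three made-up tweets (`sample_tweets` is an illustrative name, not the workshop data) to show the idea.

```python
from collections import Counter

# Made-up tweets, only for illustration
sample_tweets = [
    "python data python",
    "data science python",
    "london python",
]

# Feed every word of every tweet straight into a Counter
sample_counts = Counter(
    word for tweet in sample_tweets for word in tweet.split()
)

print(sample_counts.most_common(2))  # [('python', 4), ('data', 2)]
```

On the real data this would be `Counter(word for tweet in tweets_text for word in tweet.split())`, which should give the same counts as the loop above.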

### Visualization

If you were asked to find the best chart to visualize word counts, how would you do it? Here's a cool little non-standard library that you should be able to install with a single command. Python is amazing!

In [None]:
from wordcloud import WordCloud

# generate_from_frequencies expects a mapping of word -> count, e.g. {'a': 3, 'b': 1}

wordcloud = WordCloud(width=800, height=600).generate_from_frequencies(words)
plt.imshow(wordcloud)
plt.axis("off")

Word clouds are so coool. Let's make the picture take up the whole screen, so we can stare at it __IN ALL ITS GLORY__ :D

In [None]:
def enlarge(multiplier=2):
    """If you want to understand more about this function, refer to the data visualization notebook."""
    figure = plt.gcf()
    original_width, original_height = figure.get_size_inches()
    new_size = (original_width * multiplier, original_height * multiplier)
    figure.set_size_inches(new_size)

enlarge()
plt.imshow(wordcloud)
plt.axis("off")

### Data cleanup

Let's get back on track again... Too much chart porn is bad for you after all.

First, let's do some long overdue data cleanup that we spotted from the word cloud. We probably don't care about retweets, prepositions etc. And on that note, we also probably don't care about the words which only occur a couple times.

In [None]:
# It is good practice to exclude the most common words,
# like articles (the, a, ...), prepositions (on, by, ...) or some abbreviations (rt - retweeted)
exclude_words = {
    'rt', 'to', 'for', 'the', 'with', 'at', 'via', 'on', 'if', 'by', 'how', 'are', 'this',
    'do', 'into', 'or', '-', 'you', 'is', 'a', 'i', 'it', 'in', 'and', 'of', 'from', '&gt'
}

word_count_filtered = {k: v for k, v in word_count.items() if k.lower() not in exclude_words}

# Let's represent the word_count_filtered as pandas DataFrame
words = pd.DataFrame.from_dict(word_count_filtered, orient='index').rename(columns={0: 'frequency'})

# The results are as follows
words.head(15)

In [None]:
# We want to limit our vocabulary to only the most common words
limit = 30

shortened_list = words[words.frequency > limit]
print(
    'If we limit the words to those that occur more than {} times, '
    'we are left with {} words (from {} words)'.format(
        limit, len(shortened_list), len(words)
    )
)

### Collocation/co-occurrence frequency

Now we are finally all set to figure out the question we had previously posed: if a word is in the tweet, how frequently do these other words also show up in the tweet?

In [None]:
# First, let's create a DataFrame filled with zeros
occurrence_frequency = pd.DataFrame(0, index=shortened_list.index.values, columns=shortened_list.index.values)

# Sanity check (let's see if we succeeded by printing the first block of the matrix)
occurrence_frequency.iloc[:5, :5]

In [None]:
# Next, let's remove all the unnecessary words from our tweets
allowed_words = occurrence_frequency.index

cleaned_tweets = []
for text in tweets_text:
    words_in_one_tweet = text.split()
    cleaned_tweets.append([w for w in words_in_one_tweet if w in allowed_words])

# To check if everything works, we print the first 10 tweets
# we should see only the most common words
cleaned_tweets[:10]

In [None]:
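# A triple for-loop to add up and fill in the counts for each word vis-a-vis other words.
# We update cells with .at instead of chained indexing (df[col][row] += 1),
# because chained assignment may silently modify a temporary copy in newer pandas versions.
for word_list in cleaned_tweets:
    for word in word_list:
        for other_word in word_list:
            occurrence_frequency.at[other_word, word] += 1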

In [None]:
# Let's display our results (first 10 lines)
occurrence_frequency.head(10)
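
If the loop above feels slow, one possible alternative is to accumulate the pair counts in a plain `Counter` first and only build a table at the end. This is a sketch on made-up data (`sample_cleaned_tweets` is a hypothetical stand-in for `cleaned_tweets`), not the approach used in the rest of the notebook.

```python
from collections import Counter
from itertools import product

# Hypothetical stand-in for cleaned_tweets: one list of words per tweet
sample_cleaned_tweets = [
    ["python", "data"],
    ["python", "data", "science"],
    ["data", "science"],
]

pair_counts = Counter()
for word_list in sample_cleaned_tweets:
    # product(..., repeat=2) yields every (word, other_word) pair of the tweet,
    # including (word, word), just like the two inner for-loops
    pair_counts.update(product(word_list, repeat=2))

print(pair_counts[("python", "data")])     # 2
print(pair_counts[("data", "data")])       # 3
print(pair_counts[("python", "science")])  # 1
```

Updating a dictionary-like counter avoids touching a DataFrame cell on every iteration, which is usually the expensive part.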

Great! Now we have everything set up and we are ready to look at the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between different words.

We are thinking of each word as an n-dimensional vector (where each dimension is the co-occurrence frequency with another specific word). The cosine similarity basically looks and says, "hey `word_a` co-occurs a lot with `word_b` but does not appear with `word_c`. Oh hey, `word_d` also co-occurs a lot with `word_b` but not with `word_c`. I guess that `word_a` and `word_d` must be quite similar then."
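
To make that intuition concrete, here is a small hand-rolled sketch of cosine similarity. The three vectors are made up for illustration, and the third dimension (`word_e`) and the vector `word_x` are hypothetical additions that do not appear in the text above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)

# Made-up co-occurrence counts with (word_b, word_c, word_e)
word_a = [10, 0, 1]
word_d = [20, 0, 2]  # same pattern as word_a, just twice as frequent
word_x = [0, 7, 0]   # only ever appears with word_c

print(round(cosine_similarity(word_a, word_d), 6))  # 1.0
print(cosine_similarity(word_a, word_x))            # 0.0
```

Note that `word_d` is twice as frequent as `word_a` but still gets similarity 1, because only the direction of the vector matters. The cosine distance used by `pdist` is simply 1 minus this similarity.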

In [None]:
from scipy.spatial.distance import pdist, squareform

cosine_distances = squareform(pdist(occurrence_frequency, metric='cosine'))
cosine_distances.shape

In [None]:
# Let's look at the top left corner of our array
cosine_distances[:5,:5]

You can see that the distance between any word and itself is 0.
Let's flip it around for a second and look at similarity instead.

In [None]:
# Cosine distance is defined as 1 - cosine similarity, so we invert that here
cosine_similarities_array = 1 - cosine_distances
similarity = pd.DataFrame(
    cosine_similarities_array, 
    index=occurrence_frequency.index, 
    columns=occurrence_frequency.index
)
similarity.head(10)

Now you can see that any word is 100% similar to itself.  
Well that is great and all, but how would you visualize word similarity?  
It turns out that scikit-learn has just the tool for us:

In [None]:
from sklearn import manifold

# see http://scikit-learn.org/stable/modules/manifold.html#multidimensional-scaling
mds = manifold.MDS(n_components=2, dissimilarity='precomputed')
words_in_2d = mds.fit_transform(cosine_distances)
words_in_2d[:5]

[MDS](https://en.wikipedia.org/wiki/Multidimensional_scaling) allows us to go from the n by n matrix down to a more manageable lower-dimension representation of the n words.  
In this case, we choose a 2-d representation, which allows us to...

In [None]:
# make a bubble chart
counts = [word_count[word] for word in occurrence_frequency.index.values]
plt.scatter(x=words_in_2d[:,0], y=words_in_2d[:,1], s=counts)

In [None]:
# let's enlarge it and add labels
enlarge()
important_words = words[words.frequency > 80].index.values
for word in important_words:
    idx = occurrence_frequency.index.get_loc(word)
    plt.annotate(word, xy=words_in_2d[idx], xytext=(0,0), textcoords='offset points')
plt.scatter(x=words_in_2d[:,0], y=words_in_2d[:,1], s=counts, alpha=0.3)

That's cool! You can see there is:
- a cluster with monty + python
- a cluster of (I'm guessing) Spanish words
- a cluster of data science / big data / machine learning / data analytics, which weirdly also contains @kirkdborne. Checking his twitter, it turns out he posts a lot about data science!

### Dig Deeper

If you've gotten to here, a big congratulations on finishing the first part of this tutorial!

If you still have time, here are a couple of suggestions for you to work on:

- Try to write your own code to download tweets. [Here](http://adilmoujahid.com/posts/2014/07/twitter-analytics/) is a guide that is quite comprehensive. You will have to set up a Twitter developer account, create an app and get an API token first though.
- Try to use what we have developed so far to create your own search algorithm, e.g. search for all the tweets that have to do with machine learning (and make it smart enough to automatically show anything related to data science, big data, data analytics etc.).
- We kept bumping up against resource limits, especially during the triple for loop when filling out the occurrence_frequency counts. Given n tweets of roughly k words each, that loop performs about n·k² cell updates, so it has (very very roughly) a [computation complexity](https://en.wikipedia.org/wiki/Big_O_notation) of O(nk²), and every single update goes through slow pandas indexing. Most of the other computations we did were mainly O(kn). Can we rewrite the code to make it better?
- For this last scatter plot we just generated showing which words are frequently used with which other words, can we use a clustering algorithm to color them, so that we can see the clusters that we observed more clearly?

In [None]:
from IPython.display import HTML
HTML("""
    <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Fun with Clusters at <a href="https://twitter.com/hashtag/PyDataLondon?src=hash&amp;ref_src=twsrc%5Etfw">#PyDataLondon</a> <a href="https://t.co/j42lbx4kyx">pic.twitter.com/j42lbx4kyx</a></p>&mdash; Lewis Oaten (@lewisoaten) <a href="https://twitter.com/lewisoaten/status/728548835082047489?ref_src=twsrc%5Etfw">May 6, 2016</a></blockquote>
    <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")


Turns out you can color the clusters!

## Part 2 - Sentiment Analysis

In this part we will use TextBlob to determine the sentiment of the tweets. TextBlob comes with ready-made sentiment analyzers that we can use for this purpose, so it is quite plug and play.

First, let's make sure we understand how it works:

In [None]:
from textblob import TextBlob

# Let's check the polarity of a positive sentence (try some other sentences as well!)
blob = TextBlob("The life is good.")
blob.polarity

In [None]:
# Now we can check the polarity of a negative sentence (try some other sentences as well!)
blob = TextBlob("The life is tough.")
blob.polarity

For TextBlob, we also need to clean the tweets by removing mentions, links and special characters.

In [None]:
import re

def clean_tweet(tweet):
    # Raw string (r"...") so that \w and \S reach the regex engine without escape warnings
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

cleaned_text = [clean_tweet(tweet['text']) for tweet in twitter_data]

cleaned_text[:5]
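
The regular expression in `clean_tweet` is quite dense, so here is a sketch that splits it into its three alternatives and applies it to one made-up tweet. The names `MENTION`, `NON_ALNUM` and `LINK`, as well as the sample tweet and its URL, are illustrative assumptions.

```python
import re

# The same three alternatives as in clean_tweet, one per line
MENTION = r"(@[A-Za-z0-9]+)"      # @handles made of letters and digits
NON_ALNUM = r"([^0-9A-Za-z \t])"  # any character that is not a letter, digit, space or tab
LINK = r"(\w+:\/\/\S+)"           # scheme://rest-of-the-link
pattern = re.compile("|".join([MENTION, NON_ALNUM, LINK]))

# Made-up tweet, only for illustration
sample_tweet = "RT @pydata: Loving #PyDataLondon! https://t.co/abc123"

# Replace every match with a space, then collapse repeated whitespace
cleaned_sample = " ".join(pattern.sub(" ", sample_tweet).split())
print(cleaned_sample)  # RT Loving PyDataLondon
```

Notice that the hashtag keeps its word and only loses the `#`. Because the second alternative only allows ASCII letters and digits, emoji and accented characters are dropped as well.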

Let's check the sentiment of each tweet!

In [None]:
tweets_with_polarity = [(TextBlob(t).polarity, t) for t in cleaned_text]
    
# let's check the results
tweets_with_polarity[:5]

In [None]:
# the most positive tweets
sorted(tweets_with_polarity, key=lambda tup: tup[0])[-10:]

In [None]:
# the most negative tweets
sorted(tweets_with_polarity, key=lambda tup: tup[0])[:5]
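
A sorted list is hard to summarize at a glance, so one option is to bucket the polarities into labels and count them. This is only a sketch: `label_polarity`, its `threshold` of 0.1 and the sample pairs are all assumptions made for illustration, not TextBlob output.

```python
from collections import Counter

def label_polarity(polarity, threshold=0.1):
    """Map a polarity in [-1, 1] to a coarse label (threshold is an arbitrary choice)."""
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'

# Made-up (polarity, text) pairs in the same shape as tweets_with_polarity
sample_with_polarity = [
    (0.8, 'great talk'),
    (0.0, 'talk starts at ten'),
    (-0.6, 'awful wifi'),
    (0.35, 'nice venue'),
]

label_counts = Counter(label_polarity(p) for p, _ in sample_with_polarity)
print(label_counts['positive'], label_counts['neutral'], label_counts['negative'])  # 2 1 1
```

Swapping `sample_with_polarity` for `tweets_with_polarity` should give a rough overall picture of the mood in the dataset.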

### Dig Deeper
Check out [this tutorial](https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-) if you are interested.

## Part 3 - A Shakespeare Generator

In part 1, we looked at word count / word level analytics. Inspired by the [unreasonable effectiveness](http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139) of character-level language models, let's try to use a Maximum Likelihood Character Level Language Model to generate Shakespeare!

In [None]:
# First we need a large body of text
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt

In [None]:
# let's see what the file contains

with open("shakespeare_input.txt") as f:
    shakespeare = f.read()
print(shakespeare[:300])

In [None]:
from collections import Counter, defaultdict

def train_char_lm(data, order=4):
    """Train the Maximum Likelihood Character Level Language Model."""
    language_model = defaultdict(Counter)
    
    # we add special characters at the beginning of the text to get things started
    padding = "~" * order
    data = padding + data
    
    # count how many times a given letter follows after a particular n-char history.
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        language_model[history][char] += 1

    # we normalize our results
    normalized = {hist: normalize(chars) for hist, chars in language_model.items()}
    return normalized


def normalize(counter):
    """Normalize counter by the sum of all values."""
    sum_of_values = float(sum(list(counter.values())))
    return [(key, value/sum_of_values) for key, value in counter.items()]
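
Before training on all of Shakespeare, it may help to watch the counting step on a tiny string. The sketch below repeats the counting loop of `train_char_lm` in a self-contained helper (`count_followers` is a hypothetical name) and runs it on a made-up text with order 2.

```python
from collections import Counter, defaultdict

def count_followers(text, order):
    """Count which character follows each history of `order` characters."""
    padded = "~" * order + text
    counts = defaultdict(Counter)
    for i in range(len(padded) - order):
        history, char = padded[i:i + order], padded[i + order]
        counts[history][char] += 1
    return counts

toy_counts = count_followers("banana bandana", order=2)

# "an" is followed by "a" three times and by "d" once
after_an = toy_counts["an"]
total = sum(after_an.values())
toy_probabilities = {char: count / total for char, count in after_an.items()}
print(toy_probabilities)  # {'a': 0.75, 'd': 0.25}
```

Dividing each count by the total is exactly the normalization step, and the resulting fractions are the maximum likelihood estimates that give the model its name.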

In [None]:
# Let's train our model!
language_model = train_char_lm(shakespeare, order=4)

In [None]:
# Check what the model looks like
list(language_model.items())[:6]

This means that after `Firs`, we always get `t` (probability 1). But after `irst` (the last four characters of `First`), we might see a space with probability 0.83, or a comma with probability 0.082, etc.

Let's check which letter is the most probable after `hous`. Since we built a model of order 4, it only ever looks at the last 4 letters.

In [None]:
# Another example
language_model['hous']

The most probable next letter, as expected, is `e` (house).

Why `a`? Because `hous` can also be part of `thousands`.

Play around with this!

Now let's use the model to generate some Shakespearean text!

In [None]:
from random import random

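def generate_letter(model, history, order):
    """Generate next letter with given probabilities."""
    history = history[-order:]
    probabilities = model[history]
    x = random()
    for character, prob in probabilities:
        x = x - prob
        if x <= 0:
            return character
    # Floating-point rounding can leave x slightly above 0 after the loop;
    # in that case fall back to the last character instead of returning None
    return character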
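
The standard library can also do the weighted draw for us. The following sketch uses `random.choices` on a made-up distribution in the same `(character, probability)` format that `normalize` produces. `sample_next_char` and `toy_distribution` are illustrative names, and this is an alternative to the loop above rather than a replacement used elsewhere in the notebook.

```python
import random

def sample_next_char(distribution, rng=random):
    """Draw one character from a list of (character, probability) pairs."""
    characters = [character for character, _ in distribution]
    weights = [probability for _, probability in distribution]
    # random.choices returns a list of k items, so we take the first one
    return rng.choices(characters, weights=weights, k=1)[0]

# Made-up distribution: 'e' should come up about nine times out of ten
toy_distribution = [("e", 0.9), ("a", 0.1)]

rng = random.Random(0)  # seeded generator so the experiment is repeatable
draws = [sample_next_char(toy_distribution, rng) for _ in range(1000)]
print(draws.count("e"), draws.count("a"))
```

The two counts should land close to 900 and 100, which is a quick way to check that the sampling respects the probabilities.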

In [None]:
def generate_text(model, order, nletters=1000):
    """Generate new text using our model."""
    # Use the special character to get things started
    history = "~" * order
    out = []
    for i in range(nletters):
        c = generate_letter(model, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)

In [None]:
print(generate_text(language_model, 4))

It is amazing how such a simple model is enough to generate text that has the structure of a play, with capitalized character names in the script etc.

Run the above again and try generating more text!

We can also increase the model order to get even better results. The trade-off is that the model takes longer to build and needs more memory, because it has to store many more distinct histories. Once we have the model, though, generating new text should be quite fast.

In [None]:
# Finally, try order 10. It can take a while...
language_model = train_char_lm(shakespeare, order=10)
print(generate_text(language_model, 10))

### Dig Deeper

- Try to repeat the above using tweets instead of Shakespeare text. Does it work? Is the text in tweets long enough to train our model well?
- Our model seems to be impressive. But is the generated text really original? If we trained the model to an order of 100 or even 1000 on a really powerful machine, what would the output be if we tried to generate some text?
- Believe it or not, there are better methods out there. If you are interested, check out [this article](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy describing how to generate Shakespeare-like text using Recurrent Neural Networks.
