Data Visualization Notebook

In [1]:
# this is an important setup step to do upfront. without this none of your graphs will automatically display
%pylab inline
Populating the interactive namespace from numpy and matplotlib

Tweaking your charts

Here is a simple chart of the square function

In [2]:
import matplotlib.pyplot as plt
xs = range(100)
ys = [x * x for x in xs]
plt.plot(xs, ys)
Out[2]:
[<matplotlib.lines.Line2D at 0x7fedeccdd3c8>]

The default matplotlib graphs are ugly. Let's change them to a different plotting style. In fact, you can even create and customize your own style!

In [3]:
matplotlib.style.use('ggplot')
plt.plot(xs, ys)
Out[3]:
[<matplotlib.lines.Line2D at 0x7fedecb700f0>]
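
As a rough sketch of what "your own style" could look like: a style can be a plain dictionary of rcParams overrides, layered on top of an existing style. The name my_style and the values in it are made up for illustration. Using plt.style.context keeps the change local to the with block instead of changing every later chart.

```python
import matplotlib.pyplot as plt

# Hypothetical personal style: a handful of rcParams overrides.
my_style = {
    "axes.grid": True,
    "axes.facecolor": "#f5f5f5",
    "lines.linewidth": 2.5,
    "font.size": 12,
}

xs = range(100)
ys = [x * x for x in xs]

# Layer the overrides on top of ggplot, only inside this block.
with plt.style.context(["ggplot", my_style]):
    fig, ax = plt.subplots()
    (line,) = ax.plot(xs, ys)
```

Outside the with block, the previous settings are restored.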

Like any good data scientist, let's label our axes

In [4]:
plt.plot(xs, ys)
plt.ylabel('Happiness')
plt.xlabel('Tutorial Progress')
plt.title('How happy I am vs my tutorial progress')
Out[4]:
Text(0.5,1,'How happy I am vs my tutorial progress')

Let's say you become delirious when your happiness level reaches 4096 (because you are a big fan of 2048). Let's put a horizontal line in to denote that, and change the colors around so that you attain delirium past the red line.

In [5]:
plt.ylabel('Happiness')
plt.xlabel('Tutorial Progress')
plt.title('How happy I am vs my tutorial progress')

plt.plot(xs, ys, color='black')
plt.axhline(4096, color='red')
plt.text(x=10, y=4096, s='Delirium boundary', color='red')
Out[5]:
Text(10,4096,'Delirium boundary')

A note: passing in keyword arguments instead of positional arguments really helps you remember what the parameters are later on. It makes your code that much more descriptive and approachable for a coworker or fellow data scientist.
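
To make the point concrete, here is a small side-by-side sketch. Both calls place the same text; the variable names positional and keyword are just for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Positional: you have to remember that the order is x, y, then the string.
positional = ax.text(10, 4096, 'Delirium boundary')

# Keyword: the call documents itself.
keyword = ax.text(x=10, y=4096, s='Delirium boundary', color='red')
```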

Let's say the people sitting next to you have different utility (happiness) functions. How would you make a plot for 3 different people?

It turns out that we can do better to make our code reusable. Let's create our first function to increase reusability!

In [6]:
def label_everything():
    plt.ylabel('Happiness')
    plt.xlabel('Tutorial Progress')
    plt.title('Happiness vs tutorial progress')

# john has a linear utility function
john_ys = xs
plt.plot(xs, john_ys)
label_everything()
In [7]:
conrad_ys = [sqrt(x) for x in xs]
plt.plot(xs, conrad_ys)
label_everything()

Okay, so we can easily create plots for different people. But how do we combine them into a single plot?

In [8]:
plt.plot(xs, ys)
plt.plot(xs, conrad_ys)
plt.plot(xs, john_ys)
label_everything()

Hmmm. That doesn't really work.

In [9]:
plt.plot(xs, ys, label="ME!")
plt.plot(xs, conrad_ys, label="Conrad")
plt.plot(xs, john_ys, label="John")
label_everything()
plt.yscale('log')
plt.legend()
Out[9]:
<matplotlib.legend.Legend at 0x7fedec920400>

To move the legend around (e.g. outside of the chart), use the loc and bbox_to_anchor arguments of plt.legend().
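
One possible way to push the legend outside the chart is sketched below. The anchor coordinates (1.02, 1.0) are an arbitrary choice, not a recommended value; they are in axes coordinates, so anything past 1.0 on x lands to the right of the plot.

```python
import matplotlib.pyplot as plt

xs = range(1, 100)
fig, ax = plt.subplots()
ax.plot(xs, [x * x for x in xs], label="ME!")
ax.plot(xs, list(xs), label="John")

# Pin the legend's upper-left corner just outside the right edge of the axes.
legend = ax.legend(loc="upper left", bbox_to_anchor=(1.02, 1.0))
```

When saving such a figure, passing bbox_inches='tight' to savefig helps keep the legend from being cropped.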

More advanced topics

Another way to display our happiness is to show them side by side. Let's explore the dreaded complexities of subplots a little bit.

Let's use the subplots() function. There is also plt.subplot(), which creates a single subplot at a time rather than all of them at once. But let's create them all now to illustrate an important concept:

In [10]:
# create a figure with 3 subplots (ie. axes)
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, ncols=1, sharex=True)
label_everything()

You can see that the chart title and axis labels were applied to the last subplot (axes) created. This is because matplotlib actually secretly keeps track of the "current" figure and "current" axes you are on. All our labelling etc is implicitly done on the current figure/axes.

You can manipulate the current axes with plt.gca() to get the current axes, plt.sca(ax1) to set the current axes to ax1, and plt.cla() to clear the current axes. The current figure works similarly: plt.gcf() gets it, plt.figure(fig.number) makes an existing figure current, and plt.clf() clears it.
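
Here is a small sketch of that hidden state in action. The variable names current_before and current_after are only there to record what matplotlib considers "current" at each step.

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, ncols=1, sharex=True)

# Right after creation, the last subplot is the current axes.
current_before = plt.gca()

# Switch the current axes to the top subplot, then use the implicit interface.
plt.sca(ax1)
plt.title('Top panel')
current_after = plt.gca()

# Make this figure the current one again (useful once you have several).
plt.figure(fig.number)
current_figure = plt.gcf()
```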

In [11]:
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, ncols=1, sharex=True)

# let's see if this does what we think it does
plt.sca(ax2)
label_everything()

Now you can see why it might make sense to use plt.subplot: it implicitly sets the current axes, allowing you to add labels, lines, etc. to it easily.

It is actually possible to explicitly specify an axes as well. For example, you can use the command ax1.axhline(), which would allow you to set the horizontal line on the ax1 subplot even if the current axes is not ax1. However, you run into lots of fun when the interface for ax1 differs from what you expect. For example, instead of plt.title, you are actually expected to call ax1.set_title.
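
For comparison, here is what an explicit version of our labelling helper might look like. The name label_axes is made up for this sketch; note the set_ prefix on each method.

```python
import matplotlib.pyplot as plt

def label_axes(ax):
    """Label one specific axes, whatever the current axes happens to be."""
    ax.set_ylabel('Happiness')
    ax.set_xlabel('Tutorial Progress')
    ax.set_title('Happiness vs tutorial progress')

fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, sharex=True)

# ax2 is the current axes here, but we can still target ax1 directly.
label_axes(ax1)
ax1.axhline(4096, color='red')
```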

On the other hand, it's also so fun to juggle implicit state when you are working with multiple figures or axes.

In [12]:
def tell_the_tale_of_three_people():
    fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, ncols=1, sharex=True)

    fig.suptitle('A Tale of Three People')
    ax1.plot(xs, ys, label="ME!")
    ax2.plot(xs, conrad_ys, label="Conrad")
    ax3.plot(xs, john_ys, label="John")

tell_the_tale_of_three_people()

Let's say you think the figure looks squashed.

In [13]:
def enlarge(x_multiple=1, y_multiple=1):
    figure = plt.gcf()  # do you remember what gcf() does?
    original_width, original_height = figure.get_size_inches()
    new_size = (original_width * x_multiple, original_height * y_multiple)
    figure.set_size_inches(new_size)

tell_the_tale_of_three_people()
enlarge(y_multiple=1.5)
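
If you already know the size you want before plotting, an alternative sketch is to pass figsize when creating the figure. The 1.5 multiplier below simply mirrors the enlarge call above.

```python
import matplotlib.pyplot as plt

# Start from the default size and stretch the height by 1.5x.
default_width, default_height = plt.rcParams['figure.figsize']
fig, axes = plt.subplots(
    nrows=3, ncols=1, sharex=True,
    figsize=(default_width, default_height * 1.5),
)
```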

Now, let's say instead of having xs = range(100), you create xs = range(1000000). You may notice that everything slows down a lot. In particular, if you are tracking a lot of people's utility functions, a very costly part may be the generation of the ys. How do you fix this?

In [14]:
# one way, of course, is to just store the function

conrad = lambda x: sqrt(x)
john = lambda x: x

This is great if you just want to store the function / know about it. But what if you want to plot it etc?

First of all, let us understand why it is that the function definition is so much more succinct (in terms of space efficiency). It is because it's saying, "hey instead of generating and storing all these y's, instead, apply this function when you need it to get whatever y you want".
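
A tiny sketch of that "apply it when you need it" idea: keep only the function, and evaluate it at the handful of points you actually want to look at. The step of 10,000 is an arbitrary choice, and math.sqrt is used here only so the example runs without the pylab namespace.

```python
import math

conrad = lambda x: math.sqrt(x)

xs = range(1000000)

# Evaluate only every 10,000th point: 100 function calls instead of 1,000,000.
sample_xs = xs[::10000]
sample_ys = [conrad(x) for x in sample_xs]
```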

Why is it slow to create the 1,000,000 item ys right now? It's because your computer needs to loop over all 1,000,000 x's, and it doesn't start computing the next y until it's done the previous one.

In this case, computing each y is completely independent of the previous y's (there are situations where this is not the case, e.g. a recursive definition of the Fibonacci sequence). Is there a way to tell your computer to batch (vectorize) these computations and do them all at once?

In [15]:
# why with the magic of python, it's just one click away!

xs = range(1000000)
# use numpy arrays
xs_numpy = np.arange(1000000)

# use ipython magic to time how long it takes for the following to execute
%timeit [conrad(x) for x in xs]
%timeit np.apply_along_axis(conrad, 0, xs_numpy)
9.09 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
The slowest run took 7.95 times longer than the fastest. This could mean that an intermediate result is being cached.
18.8 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Look at that huge huge difference!
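
A note on where the speed-up likely comes from: on a 1-D array, np.apply_along_axis hands the whole array to the function in a single call, so the square root is computed by numpy over the entire array rather than in a Python loop. Under that reading, calling the numpy function on the array directly is the simplest form of the same idea. A minimal sketch (the loop is limited to the first 1,000 items to keep it quick):

```python
import numpy as np

xs_numpy = np.arange(1000000)

# One Python-level call per element.
ys_loop = [np.sqrt(x) for x in xs_numpy[:1000]]

# One call for the whole array; the looping happens inside numpy.
ys_vectorized = np.sqrt(xs_numpy)
```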

Another really cool library to try out is pandas. Pandas is built on top of numpy and has a lot of friendly functions, including helper functions for graphs!

In [16]:
import pandas as pd

xs_pandas = pd.Series(xs)

%timeit xs_pandas.apply(conrad)
12.7 s ± 3.86 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Interestingly, pandas does very poorly here. However, if you pass it optimized functions, it is a whole different story:

In [ ]:
%timeit [log(x) for x in xs]
%timeit np.apply_along_axis(np.log, 0, xs_numpy)
%timeit xs_pandas.apply(np.log)
/home/conrad/.virtualenvs/pydata/lib/python3.6/site-packages/IPython/kernel/__main__.py:1: RuntimeWarning: divide by zero encountered in log
  if __name__ == '__main__':
9.72 s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
/home/conrad/.virtualenvs/pydata/lib/python3.6/site-packages/numpy/lib/shape_base.py:132: RuntimeWarning: divide by zero encountered in log
  res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
The slowest run took 4.68 times longer than the fastest. This could mean that an intermediate result is being cached.
146 ms ± 66.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
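
Two side notes, as a sketch rather than a benchmark. The RuntimeWarning above appears because xs starts at 0 and log(0) is undefined. And numpy functions can also be called on a pandas Series directly, which should give the same values as .apply(np.log). The series below starts at 1 and is shortened to 100,000 items, both choices made only for this example.

```python
import numpy as np
import pandas as pd

# Start at 1 to avoid log(0).
xs_pandas = pd.Series(np.arange(1, 100001))

ys_apply = xs_pandas.apply(np.log)
ys_direct = np.log(xs_pandas)
```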

And of course the great thing about pandas is...

In [ ]:
ys_pandas = xs_pandas.apply(np.log)
ys_pandas[1:100].plot(kind="area")
In [ ]:
everyone = pd.DataFrame({'conrad': conrad_ys, 'john': john_ys, 'me': ys})
everyone.plot(subplots=True)
In [ ]:
# There's also a lot more helper functions
ys_pandas.head()
In [ ]:
# describe and transpose
ys_pandas.describe().T

Interactive content and FUN!

In [ ]:
# Let's try importing a youtube video!
from IPython.lib.display import YouTubeVideo
vid = YouTubeVideo("OSGv2VnC0go")
display(vid)

This is a very instructional video by the way. You should watch it if you haven't yet.

Putting your charts and notebooks online

Here is where we leave the trodden path. And by that I mean we are going to explore the world outside of your notebooks! You will need to use PythonAnywhere for the following sections. There are some pointers and instructions here. Talk to our coaches if you are confused.

The first exercise that we are going to try is to put a chart online. First, let's generate and save a chart on PythonAnywhere.

In [ ]:
ys_pandas[1:100].plot(kind="area")
plt.savefig('area_plot.png')
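
If the saved image looks blurry or has its labels cut off, savefig takes a few optional arguments that may help. This is a standalone sketch: the dpi value is arbitrary, and the file goes to a temporary directory under the made-up name area_plot_demo.png so it does not overwrite your real chart.

```python
import os
import tempfile
import matplotlib.pyplot as plt

xs = range(1, 100)
fig, ax = plt.subplots()
ax.fill_between(xs, [x * x for x in xs])

out_path = os.path.join(tempfile.gettempdir(), 'area_plot_demo.png')

# dpi controls the resolution; bbox_inches='tight' trims extra whitespace
# while keeping labels inside the image.
fig.savefig(out_path, dpi=150, bbox_inches='tight')
```
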
In [ ]:
# you should be able to see it newly generated when you use the linux command `ls`
!ls -lt

You should also be able to see it by going to your www.pythonanywhere.com dashboard and navigating via the files tab.

In [ ]:
# figure out where you are
!pwd
In [ ]:
command_output = !pwd
directory = command_output[0]
print('The chart you saved should be at {}/area_plot.png'.format(directory))

See if you can find it! When you click on it, you are able to view and download the picture. However, other people would not be able to access the picture without your PythonAnywhere account. Let's create a web app to host it online so everyone can access it!

Go to the web tab and click on "add a new web app". You can choose any of the framework options (eg: Django, Web2py, Flask, Bottle) as we won't be using any of the frameworks to serve dynamic pages today.

After you have set up the web app, scroll down to the static files configuration, and set up a url and a directory.

In [ ]:
print('For example, set the url to /pydata and the directory to {}'.format(directory))

Hit reload, wait for it to reload, and then go to your webapp domain + /pydata/area_plot.png

In [ ]:
command_output = !whoami
your_website = 'https://{}.pythonanywhere.com'.format(command_output[0])
image_url = '{}/pydata/area_plot.png'.format(your_website)
print('For example, go to ' + image_url)
In [ ]:
# In fact, you can now load that image back into this notebook over the internet
from IPython.display import Image
Image(url=image_url)

Similarly, you can do the exact same thing to serve an HTML version of your notebook online.

In [ ]:
!jupyter nbconvert --to html DataVisualization.ipynb
In [ ]:
# check that you have the new html notebook in your current directory
!ls
In [ ]:
expected_notebook_url = your_website + '/pydata/DataVisualization.html'
print('For example, go to ' + expected_notebook_url)

Remember how we just displayed the online chart back into this notebook via the internet?

Hmmm. What if we do the same thing with our notebook, and try displaying the notebook as a cell output?

In [ ]:
from IPython.display import IFrame
IFrame(expected_notebook_url, width=700, height=350)

Notebookception O.o ?!

(╯°□°)╯︵ ┻━┻)

(╯°□°)╯︵ ┻━┻)

(╯°□°)╯︵ ┻━┻)

If you want to look at data visualization more, I would suggest reading about common pitfalls. Look through the other notebooks for pretty charts that catch your eye, and try making a chart out of one! Here are some more pretty graphs that may inspire you.

Using a Database

Since you are already on PythonAnywhere for the easy hosting setup, you may also want to play around with accessing and manipulating data in mysql:

  1. creating a database- go to the databases tab and click create. You should also have a default database set up for you already. Click to start a mysql console.
  2. browsing around the database
    • run "show databases;". What is the information_schema database?
    • Make sure you are working on your own database by running "use <yourdbname>;". Then run "show tables;"- it will show you any existing tables that you have created. You can then use "describe <tableName>" to see more information about the individual fields.
  3. putting data into a database- look up how to do a CREATE TABLE, and how to INSERT
  4. getting data out of a database- try to do a SELECT
  5. backing up and restoring a database- this is probably best done outside of the mysql console. Look at some instructions here
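
If you would like to try steps 3 and 4 without leaving Python first, here is a rough sketch using the sqlite3 module from the standard library as a stand-in. It is not MySQL: the SQL is similar but details (types, placeholders, connection setup) differ. The table name happiness and its columns are invented for this example.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Step 3: CREATE TABLE and INSERT.
cur.execute('CREATE TABLE happiness (person TEXT, progress INTEGER, level REAL)')
rows = [('john', x, float(x)) for x in range(5)]
cur.executemany(
    'INSERT INTO happiness (person, progress, level) VALUES (?, ?, ?)', rows
)
conn.commit()

# Step 4: SELECT the data back out.
cur.execute(
    'SELECT progress, level FROM happiness WHERE person = ? ORDER BY progress',
    ('john',),
)
result = cur.fetchall()
conn.close()
```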

If you decide that writing SQL is not your thing, but you still want to use a database, you may also want to check out an ORM such as SQLAlchemy. If you want to build a website, web frameworks such as Django also have their own ORM built in.