Alexander Ramadan

Where I Sometimes Write Stuff


Plotting Data with Python and matplotlib
2015-03-17

I've recently become fascinated by data visualization, and how large sets of data can be distilled down to an easily consumed visualization.

Luckily, data visualization appears to be a great fit for my other learning interest, Python. I'm slowly working my way through the different aspects of the language, and as it is my first language, I still have a long way to go. I find that writing about the things I'm learning and experimenting with helps me retain the knowledge, as well as find errors or places for improvement.

So, I've been playing around with Matplotlib and wanted to figure out a way to easily plot some data from an external file.

I decided to create some fictitious comic book sales numbers for the past 70 or so years and then try to plot them on a standard line graph.

Here's how I worked through the problem...

Creating My Fake Data

My first issue was creating some fake data that I could plot.

As a sidenote, I'm currently looking for some actual data to try this on, but this was just a quick project that I wanted to get done in an evening.

First, I needed to create the file where I was going to write the years and sales data. Then I needed to generate the years from 1945 through 2014. You could type these out by hand, but that would take forever, so I used range() to speed up the process.

Finally I needed to generate some fake sales numbers, and combine the years with their fake sales data. I did this using random.randrange().

    
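    import random

    # "a" appends, so delete comicsSold.txt before running this again
    comicYears = open("comicsSold.txt", "a")
    years = range(1945, 2015)  # 1945 through 2014

    for year in years:
        sold = random.randrange(5000, 150000000)
        sold = str(sold)
        comicYears.write(str(year) + ',' + sold + '\n')

    comicYears.close()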

Here I am using a for loop that creates a fake sales number for each year, converts it to a string, and concatenates the year and the fake sales data, separated by a comma.
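One caveat with the snippet above: because the file is opened in "a" (append) mode, every extra run adds another 70 lines to it. Here is a minimal sketch of one way around that, assuming you are happy to overwrite the file each time. The helper name write_fake_sales and its seed parameter are my own inventions for illustration, not part of the original script.

```python
import random


def write_fake_sales(path, first_year=1945, last_year=2014, seed=None):
    """Write one 'year,sold' line per year to path and return the line count."""
    rng = random.Random(seed)  # a fixed seed gives the same numbers on every run
    count = 0
    # "w" replaces any existing file, so re-running does not pile up duplicate years
    with open(path, "w") as out_file:
        for year in range(first_year, last_year + 1):
            sold = rng.randrange(5000, 150000000)
            out_file.write(str(year) + "," + str(sold) + "\n")
            count += 1
    return count
```

Calling write_fake_sales("comicsSold.txt", seed=42) would then produce the same 70 lines every time, which is handy when comparing one graph against another.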


Here's a gist of this code for easier readability

link

The output looks like this

    
    1945,114950089
    1946,5327739
    1947,92066212
    1948,8359428
    1949,104528851
    1950,87344945
    1951,111024866
    1952,85318191
    1953,146137175
    1954,97641070
    1955,144609067
    1956,142349233
    1957,64969373

Now I've got my fake data that I will be plotting with matplotlib.

Now it's time to start graphing!

I'm going to use the popular Python library matplotlib to do the heavy lifting for my graphing.

Let's start by importing the matplotlib library so we have access to its modules. Calling plt.ion() turns on interactive mode. I then open the data file, read its contents, and split them into lines.

    
    import matplotlib.pyplot as plt
    plt.ion()

    file_open = open('comicsSold.txt', 'r')
    file_read = file_open.read()

    years = []
    sold = []

    file_split = file_read.splitlines()

So, now I've got a series of strings with a year and a sales number separated by a comma. Next, I'll need to add all my years to a list and all my sales figures to a sales list so that I can use them to plot.

    
    for line in file_split:
        year = line.split(',')[0]
        num = line.split(',')[1]
        years.append(year)
        sold.append(int(num))  # convert to int so the sales figures plot as numbers
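As an alternative to splitting each line by hand, Python's built-in csv module can do the splitting. This is only a sketch of that approach, not what my script does. The sample_lines list is a stand-in for the file, using a few lines copied from the sample output above.

```python
import csv

# A few lines copied from the sample output, standing in for the file
sample_lines = [
    "1945,114950089",
    "1946,5327739",
    "1947,92066212",
]

years = []
sold = []
# csv.reader accepts any iterable of strings, including an open file object
for year, num in csv.reader(sample_lines):
    years.append(int(year))
    sold.append(int(num))

print(years)  # [1945, 1946, 1947]
print(sold)   # [114950089, 5327739, 92066212]
```

Passing an open file to csv.reader in place of sample_lines should work the same way.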

Next I want to set up the correct x axis that will display the range of years I'm working with. To do this I need to access the first year and last year in my list.

    
    first_year = years[0]
    last_year = years[-1]

This last bit of code below sets my x axis to the range of years I'm plotting. I had to use +1 because range() excludes its stop value, so without it the last year in my list would be left out.

Besides making my graph inaccurate, that would also cause an error, because you can only plot when you have the same number of x axis elements and y axis elements. Adding 1 makes range() reach that last year, 2014, so my two axes have the same number of points.

    
    x = range(int(first_year), int(last_year)+1)

    plt.xlabel('Years')
    plt.ylabel('# of Comics Sold')
    plt.title('Comic Book Sales Data')
    plt.grid(True)

    plt.plot(x, sold)
    plt.show()
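To see why the +1 matters, here is a quick check of how many values range() produces with and without it. I'm assuming Python 3 here, and the variable names without_plus_one and with_plus_one are just for illustration.

```python
first_year = 1945
last_year = 2014

without_plus_one = range(first_year, last_year)
with_plus_one = range(first_year, last_year + 1)

# Print the number of values and the final value of each range
print(len(without_plus_one), without_plus_one[-1])  # 69 2013
print(len(with_plus_one), with_plus_one[-1])        # 70 2014
```

With 70 sales figures in the file, only the second range lines up with them one for one.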

Here is the code all together

    

    import matplotlib.pyplot as plt
    plt.ion()

    file_open = open('comicsSold.txt', 'r')
    file_read = file_open.read()

    years = []
    sold = []

    file_split = file_read.splitlines()

    for line in file_split:
        year = line.split(',')[0]
        num = line.split(',')[1]
        years.append(year)
        sold.append(int(num))  # convert to int so the sales figures plot as numbers

    first_year = years[0]
    last_year = years[-1]

    x = range(int(first_year), int(last_year)+1)

    plt.xlabel('Years')
    plt.ylabel('# of Comics Sold')
    plt.title('Comic Book Sales Data')
    plt.grid(True)

    plt.plot(x, sold)
    plt.show()

And here's the result

Here's a gist of this script

link

I'm sure this code could be refactored down quite a bit, but this was a quick and dirty attempt at getting something plotted.
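Here is one possible direction for that refactoring. Treat it as a rough sketch rather than a finished version. The function names load_sales and plot_sales are made up for this example. It assumes every non-blank line in the file has the form year,sold. Because the years are converted to integers as they are read, they can be plotted directly and the separate range() step goes away.

```python
def load_sales(path):
    """Return (years, sold) as two lists of ints read from 'year,sold' lines."""
    years = []
    sold = []
    with open(path) as data_file:  # the file is closed automatically
        for line in data_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            year, num = line.split(",")
            years.append(int(year))
            sold.append(int(num))
    return years, sold


def plot_sales(years, sold):
    """Draw the line graph; matplotlib is imported here so load_sales works without it."""
    import matplotlib.pyplot as plt

    plt.xlabel("Years")
    plt.ylabel("# of Comics Sold")
    plt.title("Comic Book Sales Data")
    plt.grid(True)
    plt.plot(years, sold)  # the years from the file serve as the x values
    plt.show()
```

To use it, call years, sold = load_sales('comicsSold.txt') and then plot_sales(years, sold).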