Performance lessons for reading ASCII files into numpy arrays

There are a couple of ways to read data from a CSV file into a numpy array:

  1. numpy.loadtxt
  2. pandas.read_csv [1]

I recently had to read this type of data for a project I'm working on. Since there is more than one way to do it, it's worth looking into the performance of each choice. This post is the result of that investigation.
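For reference, here's roughly what each call looks like on a whitespace-delimited file with a single header row ('data.txt' is just a placeholder filename):

    import numpy as np
    import pandas as pd

    # numpy.loadtxt: skip the header row ourselves, get back a plain ndarray
    arr = np.loadtxt('data.txt', skiprows=1)

    # pandas.read_csv: parses the header row itself, returns a DataFrame
    df = pd.read_csv('data.txt', delim_whitespace=True)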

Sample data setup

I wrote the following small function to generate a sufficient amount of 'random' data for testing:

import numpy as np

def generate_test_data(column_names, row_count, filename):
    """
    Generate file of random test data of size (row_count, len(column_names))

    column_names - List of column name strings to use as header row
    row_count - Number of rows of data to generate
    filename - Name of file to write test data to
    """

    col_count = len(column_names)
    rand_arr = np.random.rand(row_count, col_count)
    header_line = ' '.join(column_names)
    np.savetxt(filename, rand_arr, delimiter=' ', fmt='%1.5f',
               header=header_line, comments='')

Hopefully this function is straightforward, so I won't discuss it further.

For this test I simply used the above function to create a relatively small file:

    import os
    import string
    import tempfile

    # For testing just create a column for each lower-case letter in the
    # English alphabet
    columns = list(string.ascii_lowercase)
    row_count = 1000

    # We don't need the file open here; to time things fairly, each method
    # should open the file itself.
    fd, filename = tempfile.mkstemp()
    os.close(fd)

    generate_test_data(columns, row_count, filename)

This creates a space-separated file of random float data that is about 208 KB, comprising 26 columns and 1000 rows.
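As a quick sanity check (a hypothetical snippet that reuses the filename variable from above), we can read the file back and confirm its shape and size:

    data = np.loadtxt(filename, skiprows=1)
    print(data.shape)                 # (1000, 26)
    print(os.path.getsize(filename))  # roughly 208,000 bytes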

Test results

The following snippet is from an IPython shell, using the %timeit magic [2]:

    >>> import numpy as np
    >>> import pandas as pd
    >>> %timeit -n 100 pd.read_csv('test.out', delim_whitespace=True)
    100 loops, best of 3: 6.66 ms per loop
    >>> %timeit -n 100 f = open('test.out', 'r');f.readline();np.loadtxt(f, unpack=True)
    100 loops, best of 3: 28 ms per loop

Pandas wins!

The result is that Pandas is roughly four times faster here, but why? The short answer, as Pandas developer Wes McKinney posted in response to my question:

Short answer: file tokenization and type inference is being handled at the lowest level possible in C/Cython. If you look at the impl of numpy.loadtxt you'll see a lot of Python.

So there you have it, straight from the author! Interestingly enough, the massive speed increases for pandas.read_csv are relatively recent, and Wes has written a few great articles detailing them in full.

Notes

[1] Using pandas.read_csv will return a DataFrame object, which essentially wraps a 2-dimensional numpy array. So, the performance improvement of pandas.read_csv can come at the price of adding Pandas as another dependency to your project. Also, you'll be getting back a DataFrame object instead of a more stripped-down numpy array.
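If you do want a plain array, the DataFrame's underlying numpy data is easy to get at; a minimal sketch:

    df = pd.read_csv('test.out', delim_whitespace=True)
    arr = df.values  # the underlying 2-dimensional numpy array of floats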

[2] Note that I used the unpack=True argument to numpy.loadtxt because I want to read all the data as column-based arrays rather than the default row-based layout. This was just a requirement of the application I was profiling for. It isn't necessary with the Pandas library because data is read into a DataFrame object, which allows slicing by column.
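To make the difference concrete, here is a small sketch of both column-access styles, assuming the lower-case letter column names from the test setup above:

    # numpy: unpack=True transposes the result, giving one 1-D array per column
    with open('test.out', 'r') as f:
        f.readline()                       # skip the header row
        cols = np.loadtxt(f, unpack=True)  # shape (26, 1000)

    # Pandas: slice the DataFrame by column name instead
    df = pd.read_csv('test.out', delim_whitespace=True)
    a_col = df['a'].values                 # the 'a' column as a 1-D array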

Published: 03-08-2013 19:49:01
