6. Data

The Introduction to Python section covered basic Python data types such as strings, lists, tuples, and dictionaries. In this section we introduce some more advanced data types available in two Python packages, numpy and pandas.

We will begin by learning how to create, access, and update the basic numpy data structure, the n-dimensional array, as well as how to add, subtract, and multiply with arrays using vectorized arithmetic operations, that is, operations that apply elementwise to all the elements of an array. What we learn about operations on numbers will carry over to Boolean conditions, conditions that are True or False of the individual elements in an array. Applying a Boolean condition is also a vectorized operation, so applying a Boolean condition to an array produces a Boolean array of the same shape. We will learn to use such Boolean arrays to extract the portions of an array that satisfy a condition, allowing for high-level queries and manipulations of the data.
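As a preview, here is a minimal sketch of vectorized arithmetic and Boolean indexing. The array names and values are made up for illustration.

```python
import numpy as np

# Two small 1D arrays (illustrative values)
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Vectorized arithmetic: each operation applies elementwise
total = a + b
product = a * b

# A Boolean condition is also vectorized: the result is a Boolean array
mask = a > 2

# Using the Boolean array as an index keeps only the elements where it is True
selected = a[mask]

print(total)     # [11 22 33 44]
print(product)   # [ 10  40  90 160]
print(mask)      # [False False  True  True]
print(selected)  # [3 4]
```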

An immediate payoff from our brief survey of numpy is that all the principles for computing with numpy arrays will carry over with minor modifications to computing with pandas.

The pandas module is Python’s most popular toolset for manipulating data in tabular form (Excel sheets, data tables). The two main pandas data types are DataFrame and Series.

A DataFrame is a table of data. Datasets at all levels of analysis can be represented as DataFrames.

You can think of a DataFrame as being organized in rows and columns, like a numpy 2D array, but differing from it in one important respect: by default, a DataFrame is indexed by row and column labels (keyword indexing) rather than by position.

Despite this change in how indexing works, all the principles that apply to computing with numpy arrays will carry over with minor modifications to computing with pandas DataFrames. This is especially true of Boolean indexing, which will be your fundamental tool for selecting and reshaping data in pandas. Where a DataFrame is like a 2D array, a Series is like a 1D array; both the rows and the columns of pandas DataFrames are Series objects.
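The following sketch illustrates these ideas on a tiny, made-up table. The column names, row labels, and values are assumptions chosen only for the example.

```python
import pandas as pd

# A small DataFrame with explicit row labels (illustrative data)
df = pd.DataFrame(
    {"city": ["Lima", "Oslo", "Pune"], "temp_c": [19.0, 4.5, 31.0]},
    index=["a", "b", "c"],
)

# Keyword indexing: select a column by its label; the result is a Series
temps = df["temp_c"]

# Select a row by its label with .loc; the result is also a Series
row_b = df.loc["b"]

# Boolean indexing works as in numpy: the condition yields a Boolean Series,
# which then selects the rows where it is True
warm = df[df["temp_c"] > 10]

print(type(temps).__name__)  # Series
print(row_b["city"])         # Oslo
print(list(warm.index))      # ['a', 'c']
```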

We conclude our brief tour of pandas with a look at some of its aggregation tools, including cross-tabulation, grouping, and pivot tables, as well as some tools for merging data.
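To give a flavor of these tools, here is a short sketch using a hypothetical sales table and price list. The data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data: units sold per region and product, plus a price list
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "product": ["pen", "ink", "pen", "ink"],
    "units": [3, 5, 2, 7],
})
prices = pd.DataFrame({"product": ["pen", "ink"], "price": [1.5, 4.0]})

# Grouping: total units per region
by_region = sales.groupby("region")["units"].sum()

# Pivot table: regions as rows, products as columns, summed units as values
table = sales.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)

# Cross-tabulation: how many rows fall in each region/product combination
counts = pd.crosstab(sales["region"], sales["product"])

# Merging: attach each product's price to the matching sales rows
merged = sales.merge(prices, on="product")

print(by_region)
print(table)
print(counts)
print(merged)
```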

Contents