Tabular data and statistics#
A typical pattern in the development of Python's ecosystem is to recognize something efficient and practical elsewhere and to find ways to integrate or replicate that functionality in the Python world.
We’ve seen this for numpy and Matplotlib for example.
Another success story here is the development of a library called pandas.
This time it was someone from the world of statistics who appreciated the modern features of the Python language, but missed the statistical functions and procedures available in specialized languages such as R.
A key concept in R is the data model used throughout: the data frame.
In statistics, most data can be organized and presented in tabular form, quite comparable to an Excel sheet or a database table.
A lot of core functionality is already available on these object-oriented data frames as methods, such as finding extreme values, averages, percentiles and so on.
To do the same thing in Python, it would definitely be a good idea to implement this special data model first. A young statistician from the US, Wes McKinney, one day did just that. The result is nowadays one of the most important contributions to modern Data Science: the pandas library. This library, together with numpy, matplotlib, the Jupyter Notebook and machine learning solutions such as pytorch and scikit-learn, is the reason almost everyone doing Data Science uses Python these days.

Fig. 4 Wes McKinney (Courtesy unknown)#
Let’s play around with this a bit. It all starts with importing the library. In most cases you’ll find the following pattern to do so.
import pandas as pd
DataFrames can of course be constructed empty and filled afterwards. The more usual way, though, is to migrate an existing data model into a DataFrame. Most often you'll find one of the following two approaches. In the first case a dictionary is the base, with one key-value pair per column, where the dict's value holds all items for the complete column.
data = {
"angle": [-90, -75, -60, -45, -30, -15, 0, 15, 30, 45, 60, 75, 90],
"value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
"value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
}
df = pd.DataFrame(data)
df
angle | value1 | value2 | |
---|---|---|---|
0 | -90 | 0.13 | 0.12 |
1 | -75 | 0.15 | 0.16 |
2 | -60 | 0.18 | 0.20 |
3 | -45 | 0.19 | 0.23 |
4 | -30 | 0.21 | 0.24 |
5 | -15 | 0.23 | 0.25 |
6 | 0 | 0.25 | 0.26 |
7 | 15 | 0.25 | 0.23 |
8 | 30 | 0.24 | 0.21 |
9 | 45 | 0.23 | 0.18 |
10 | 60 | 0.20 | 0.18 |
11 | 75 | 0.16 | 0.15 |
12 | 90 | 0.12 | 0.12 |
The other example shows the same values, organized in a different fashion: here we have a list of records, where each record is a dict with three key-value pairs. The keys will be used to create the columns of the DataFrame.
data = []
data.append({"angle": -90, "value1": .13, "value2": .12})
data.append({"angle": -75, "value1": .15, "value2": .16})
data.append({"angle": -60, "value1": .18, "value2": .20})
data.append({"angle": -45, "value1": .19, "value2": .23})
data.append({"angle": -30, "value1": .21, "value2": .24})
data.append({"angle": -15, "value1": .23, "value2": .25})
data.append({"angle": 0, "value1": .25, "value2": .26})
data.append({"angle": 15, "value1": .25, "value2": .23})
data.append({"angle": 30, "value1": .24, "value2": .21})
data.append({"angle": 45, "value1": .23, "value2": .18})
data.append({"angle": 60, "value1": .20, "value2": .18})
data.append({"angle": 75, "value1": .16, "value2": .15})
data.append({"angle": 90, "value1": .12, "value2": .12})
df = pd.DataFrame(data)
df
angle | value1 | value2 | |
---|---|---|---|
0 | -90 | 0.13 | 0.12 |
1 | -75 | 0.15 | 0.16 |
2 | -60 | 0.18 | 0.20 |
3 | -45 | 0.19 | 0.23 |
4 | -30 | 0.21 | 0.24 |
5 | -15 | 0.23 | 0.25 |
6 | 0 | 0.25 | 0.26 |
7 | 15 | 0.25 | 0.23 |
8 | 30 | 0.24 | 0.21 |
9 | 45 | 0.23 | 0.18 |
10 | 60 | 0.20 | 0.18 |
11 | 75 | 0.16 | 0.15 |
12 | 90 | 0.12 | 0.12 |
Some simple statistics#
Both approaches create the very same DataFrame here, as you can see from the output. Note that the output shows four columns, even though only three have been specified.
The first column holds the DataFrame's index, which can be used to address the elements in a DataFrame. This is very similar to working with data in Excel or in databases, where you can use the index and the column names to access values in the table.
Before we do that, however, let's get a quick overview of the values this DataFrame contains. The describe method is a perfect way to get first insights into the values, showing you the count of items in each column, the mean and standard deviation, the minimum and maximum value found in each column, as well as the 25th, 50th and 75th percentiles.
df.describe()
angle | value1 | value2 | |
---|---|---|---|
count | 13.000000 | 13.000000 | 13.000000 |
mean | 0.000000 | 0.195385 | 0.194615 |
std | 58.416607 | 0.044836 | 0.047367 |
min | -90.000000 | 0.120000 | 0.120000 |
25% | -45.000000 | 0.160000 | 0.160000 |
50% | 0.000000 | 0.200000 | 0.200000 |
75% | 45.000000 | 0.230000 | 0.230000 |
max | 90.000000 | 0.250000 | 0.260000 |
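Since describe returns an ordinary DataFrame, individual statistics can also be picked out of its result by row label and column name, or computed directly on a column. A small sketch, using the same toy data as above:

```python
import pandas as pd

# same toy dataset as above
df = pd.DataFrame({
    "angle": list(range(-90, 91, 15)),
    "value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
    "value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
})

# describe() itself returns a DataFrame, so its cells can be addressed
# by row label ("mean", "std", ...) and column name
stats = df.describe()
mean_from_describe = stats.loc["mean", "value1"]

# the same number is available directly as a column method
mean_direct = df["value1"].mean()
```

Both ways yield the identical value, so for a single statistic the direct method call is usually the more readable choice.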
You can also have a look at the first rows of the DataFrame with the head method or the final ones using the tail method.
df.head()
angle | value1 | value2 | |
---|---|---|---|
0 | -90 | 0.13 | 0.12 |
1 | -75 | 0.15 | 0.16 |
2 | -60 | 0.18 | 0.20 |
3 | -45 | 0.19 | 0.23 |
4 | -30 | 0.21 | 0.24 |
df.tail()
angle | value1 | value2 | |
---|---|---|---|
8 | 30 | 0.24 | 0.21 |
9 | 45 | 0.23 | 0.18 |
10 | 60 | 0.20 | 0.18 |
11 | 75 | 0.16 | 0.15 |
12 | 90 | 0.12 | 0.12 |
For very large DataFrames it can be useful to check a random sample of rows to get an overview. You can use the sample method for that. The number of rows to sample can be passed as an optional parameter. The default is 1, so calling sample without arguments fetches a single randomly chosen row from the DataFrame.
df.sample()
angle | value1 | value2 | |
---|---|---|---|
2 | -60 | 0.18 | 0.2 |
Let’s fetch some more rows for a better impression.
df.sample(5)
angle | value1 | value2 | |
---|---|---|---|
4 | -30 | 0.21 | 0.24 |
6 | 0 | 0.25 | 0.26 |
12 | 90 | 0.12 | 0.12 |
2 | -60 | 0.18 | 0.20 |
10 | 60 | 0.20 | 0.18 |
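Note that sample draws a genuinely random subset, so repeated runs give different rows. If a reproducible sample is needed, a seed can be passed via the random_state parameter; a small sketch:

```python
import pandas as pd

# same toy dataset as above
df = pd.DataFrame({
    "angle": list(range(-90, 91, 15)),
    "value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
    "value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
})

# the same random_state always yields the same "random" rows
s1 = df.sample(5, random_state=42)
s2 = df.sample(5, random_state=42)
print(s1.equals(s2))  # True
```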
Indexing#
As mentioned before, items in a DataFrame can be accessed using the column names and the index. There are several ways to do so, all implemented as pythonically as possible. So we would expect some sort of square-bracket syntax to be used here, and that is exactly the case. You can fetch the complete column of items for “value1”, for example, like this.
values = df["value1"]
values
0 0.13
1 0.15
2 0.18
3 0.19
4 0.21
5 0.23
6 0.25
7 0.25
8 0.24
9 0.23
10 0.20
11 0.16
12 0.12
Name: value1, dtype: float64
The result, however, is not a simple iterable such as a list of values, but an object of type Series. Actually, a DataFrame is a collection of Series objects.
In addition to the raw values, the Series still carries the index, the dtype and the name of the column in the DataFrame.
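This metadata can be inspected directly as attributes on the Series object; a quick sketch:

```python
import pandas as pd

# same toy dataset as above
df = pd.DataFrame({
    "angle": list(range(-90, 91, 15)),
    "value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
    "value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
})
values = df["value1"]

# a Series carries metadata alongside its raw values
print(values.name)   # value1
print(values.dtype)  # float64
print(values.size)   # 13
```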
We can fetch single elements from the Series
as we would expect.
values[2]
0.18
This gives us the third element in our selected column.
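A caveat: with the default integer index this lookup is by label, which just happens to coincide with the position. The loc and iloc accessors introduced further below make the intent explicit; a sketch:

```python
import pandas as pd

# same toy dataset as above
df = pd.DataFrame({
    "angle": list(range(-90, 91, 15)),
    "value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
    "value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
})
values = df["value1"]

# .loc is always label-based, .iloc always position-based;
# with the default RangeIndex both point at the same element
print(values.loc[2], values.iloc[2])
```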
Direct indexing to a single element is possible as well, of course. The only thing to keep in mind is that you need to index columns before rows here.
df["value1"][2]
0.18
Even striding, i.e. indexing subsets of the DataFrame with a step, is possible. To select every second element in our “value1” column, for example, we can use this approach.
df["value1"][::2]
0 0.13
2 0.18
4 0.21
6 0.25
8 0.24
10 0.20
12 0.12
Name: value1, dtype: float64
Pretty much all the tricks from working with lists and numpy arrays work here as well. Selecting multiple columns by name is also possible. The syntax is a bit different though, as simply counting names and specifying steps is not directly possible; but using every second named column for an analysis will hardly make sense anyway. Selecting several columns from the original DataFrame goes like this.
df[["value2", "angle"]]
value2 | angle | |
---|---|---|
0 | 0.12 | -90 |
1 | 0.16 | -75 |
2 | 0.20 | -60 |
3 | 0.23 | -45 |
4 | 0.24 | -30 |
5 | 0.25 | -15 |
6 | 0.26 | 0 |
7 | 0.23 | 15 |
8 | 0.21 | 30 |
9 | 0.18 | 45 |
10 | 0.18 | 60 |
11 | 0.15 | 75 |
12 | 0.12 | 90 |
This effectively swaps the order of columns compared to our original DataFrame.
Of course combining the two approaches works as well.
df[["value2", "angle"]][1:-2:2]
value2 | angle | |
---|---|---|
1 | 0.16 | -75 |
3 | 0.23 | -45 |
5 | 0.25 | -15 |
7 | 0.23 | 15 |
9 | 0.18 | 45 |
Here, we create the column-based subset first, and after that select all rows from the second to the second-to-last in steps of two. Nice!
There are two other indexing approaches though, which are more consistent when it comes to further processing these sub-selections. The reason for their existence is that pandas tries to avoid creating copies of data as much as possible, to be efficient and fast.
When working with subsets of data the way we've seen before, this can become error-prone.
Both alternatives are location based, and one of them is purely focused on integer positions in both the row and column direction.
The purely position-based approach is available through the iloc attribute.
This is not a function, as you can see; it's used with the square-bracket syntax.
Notice that in contrast to the former approach, the order of axes is different here: we're indexing in row-first order!
Indexing into the axes otherwise works as expected. To select the first two rows of the “value1” column we can use the following approach.
df.iloc[0:2,1]
0 0.13
1 0.15
Name: value1, dtype: float64
The second approach, loc, allows using names for the columns again. Note that loc slices are label-based and include the end label, so asking for rows 0 to 2 of the “angle” column yields three elements.
df.loc[0:2, "angle"]
0 -90
1 -75
2 -60
Name: angle, dtype: int64
Moreover, the kind of name-based slicing that wasn't possible in the naive indexing approach is possible here. To create a subset based on just the two value columns, we can use the following statement.
df.loc[3:6, "value1":"value2"]
value1 | value2 | |
---|---|---|
3 | 0.19 | 0.23 |
4 | 0.21 | 0.24 |
5 | 0.23 | 0.25 |
6 | 0.25 | 0.26 |
Given enough columns, using the step parameter for indexing would work as well here.
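With only three columns there is little room for a column step here, but the row labels already demonstrate it. A sketch of a stepped loc slice, keeping in mind that loc includes the end label:

```python
import pandas as pd

# same toy dataset as above
df = pd.DataFrame({
    "angle": list(range(-90, 91, 15)),
    "value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
    "value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
})

# every fourth row by label, all columns from "angle" to "value2"
subset = df.loc[0:12:4, "angle":"value2"]
print(subset)
```

This selects the rows labeled 0, 4, 8 and 12, i.e. the angles -90, -30, 30 and 90.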
Modifying columns#
It’s quite easy to add columns, for example from other datasets or based on calculations with the existing ones. To add a column holding the difference of the two value columns, we could write
df["diff"] = df["value1"] - df["value2"]
df
angle | value1 | value2 | diff | |
---|---|---|---|---|
0 | -90 | 0.13 | 0.12 | 0.01 |
1 | -75 | 0.15 | 0.16 | -0.01 |
2 | -60 | 0.18 | 0.20 | -0.02 |
3 | -45 | 0.19 | 0.23 | -0.04 |
4 | -30 | 0.21 | 0.24 | -0.03 |
5 | -15 | 0.23 | 0.25 | -0.02 |
6 | 0 | 0.25 | 0.26 | -0.01 |
7 | 15 | 0.25 | 0.23 | 0.02 |
8 | 30 | 0.24 | 0.21 | 0.03 |
9 | 45 | 0.23 | 0.18 | 0.05 |
10 | 60 | 0.20 | 0.18 | 0.02 |
11 | 75 | 0.16 | 0.15 | 0.01 |
12 | 90 | 0.12 | 0.12 | 0.00 |
As you can see, our DataFrame now has a fourth column with all the calculated differences.
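The reverse operation, removing a column, works via the drop method. By default drop returns a new DataFrame and leaves the original untouched; a small sketch:

```python
import pandas as pd

# same toy dataset as above, including the computed "diff" column
df = pd.DataFrame({
    "angle": list(range(-90, 91, 15)),
    "value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
    "value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
})
df["diff"] = df["value1"] - df["value2"]

# drop returns a new DataFrame without the given column(s);
# the original df still contains "diff"
slim = df.drop(columns=["diff"])
print(list(slim.columns))  # ['angle', 'value1', 'value2']
```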
Simple calculus#
The DataFrame has several methods available that we already know from numpy. Finding the minimum value in each column can be achieved like so.
df.min()
angle -90.00
value1 0.12
value2 0.12
diff -0.04
dtype: float64
Showing the maximum values works the same way.
df.max()
angle 90.00
value1 0.25
value2 0.26
diff 0.05
dtype: float64
The same goes for summing up the values or calculating the mean.
df.sum()
angle 0.00
value1 2.54
value2 2.53
diff 0.01
dtype: float64
df.mean()
angle 0.000000
value1 0.195385
value2 0.194615
diff 0.000769
dtype: float64
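All of these aggregations run per column by default. They also accept an axis parameter to aggregate across each row instead; a sketch:

```python
import pandas as pd

# same toy dataset as above
df = pd.DataFrame({
    "angle": list(range(-90, 91, 15)),
    "value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
    "value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
})

# axis=1 aggregates over the columns of each row,
# e.g. the first row sums angle + value1 + value2 = -90 + 0.13 + 0.12
row_sums = df.sum(axis=1)
print(row_sums.iloc[0])
```

Summing across rows of mixed quantities rarely makes sense for real data, but the axis parameter is essential once all columns share the same unit.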
The diff method calculates the difference between consecutive rows. Note that the first row comes out as NaN, since there is no previous row to compare against.
df.diff()
angle | value1 | value2 | diff | |
---|---|---|---|---|
0 | NaN | NaN | NaN | NaN |
1 | 15.0 | 0.02 | 0.04 | -0.02 |
2 | 15.0 | 0.03 | 0.04 | -0.01 |
3 | 15.0 | 0.01 | 0.03 | -0.02 |
4 | 15.0 | 0.02 | 0.01 | 0.01 |
5 | 15.0 | 0.02 | 0.01 | 0.01 |
6 | 15.0 | 0.02 | 0.01 | 0.01 |
7 | 15.0 | 0.00 | -0.03 | 0.03 |
8 | 15.0 | -0.01 | -0.02 | 0.01 |
9 | 15.0 | -0.01 | -0.03 | 0.02 |
10 | 15.0 | -0.03 | 0.00 | -0.03 |
11 | 15.0 | -0.04 | -0.03 | -0.01 |
12 | 15.0 | -0.04 | -0.03 | -0.01 |
The cumulative sum over each column is available via the cumsum method.
df.cumsum()
angle | value1 | value2 | diff | |
---|---|---|---|---|
0 | -90 | 0.13 | 0.12 | 0.01 |
1 | -165 | 0.28 | 0.28 | 0.00 |
2 | -225 | 0.46 | 0.48 | -0.02 |
3 | -270 | 0.65 | 0.71 | -0.06 |
4 | -300 | 0.86 | 0.95 | -0.09 |
5 | -315 | 1.09 | 1.20 | -0.11 |
6 | -315 | 1.34 | 1.46 | -0.12 |
7 | -300 | 1.59 | 1.69 | -0.10 |
8 | -270 | 1.83 | 1.90 | -0.07 |
9 | -225 | 2.06 | 2.08 | -0.02 |
10 | -165 | 2.26 | 2.26 | 0.00 |
11 | -90 | 2.42 | 2.41 | 0.01 |
12 | 0 | 2.54 | 2.53 | 0.01 |
Value-based selections are possible as well. For example, select all rows in the DataFrame where the values in the “value1” column are above that column's mean value.
df[df["value1"] > df["value1"].mean()]
angle | value1 | value2 | diff | |
---|---|---|---|---|
4 | -30 | 0.21 | 0.24 | -0.03 |
5 | -15 | 0.23 | 0.25 | -0.02 |
6 | 0 | 0.25 | 0.26 | -0.01 |
7 | 15 | 0.25 | 0.23 | 0.02 |
8 | 30 | 0.24 | 0.21 | 0.03 |
9 | 45 | 0.23 | 0.18 | 0.05 |
10 | 60 | 0.20 | 0.18 | 0.02 |
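Several such conditions can be combined with the element-wise operators & (and), | (or) and ~ (not); each condition needs its own parentheses. A sketch selecting the above-mean rows on the non-negative side of the angle range:

```python
import pandas as pd

# same toy dataset as above
df = pd.DataFrame({
    "angle": list(range(-90, 91, 15)),
    "value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
    "value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
})

# each condition yields a boolean Series; & combines them element-wise
mask = (df["value1"] > df["value1"].mean()) & (df["angle"] >= 0)
selection = df[mask]
print(list(selection["angle"]))  # [0, 15, 30, 45, 60]
```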
Plotting#
Simple plotting is also available as a method on DataFrames. This has been implemented to provide a consistent API, where the DataFrame itself is always the core element.
It is well suited for a quick overview.
For more detailed plots, however, where more control over the plot itself is needed, using the approaches from the Plotting chapter is advised.
We need to prepare a bit for the plots to look nice and work as expected. pandas uses an available plotting backend, so let's set up our Matplotlib and seaborn based configuration first.
%matplotlib widget
import seaborn as sns
sns.set()
The simplest way to create a plot is to just call the plot method on the DataFrame. Some already familiar Matplotlib parameters can be used in this context as well.
df.plot(figsize=(8,5))
<Axes: >
Based on the legend shown in the plot we can see that all four available columns are plotted against the index by default.
If we want a different column on the x-axis, we need to change the index. pandas offers the set_index method for this.
It returns a different view on the same DataFrame with the chosen column as the new index.
df1 = df.set_index("angle")
df1
value1 | value2 | diff | |
---|---|---|---|
angle | |||
-90 | 0.13 | 0.12 | 0.01 |
-75 | 0.15 | 0.16 | -0.01 |
-60 | 0.18 | 0.20 | -0.02 |
-45 | 0.19 | 0.23 | -0.04 |
-30 | 0.21 | 0.24 | -0.03 |
-15 | 0.23 | 0.25 | -0.02 |
0 | 0.25 | 0.26 | -0.01 |
15 | 0.25 | 0.23 | 0.02 |
30 | 0.24 | 0.21 | 0.03 |
45 | 0.23 | 0.18 | 0.05 |
60 | 0.20 | 0.18 | 0.02 |
75 | 0.16 | 0.15 | 0.01 |
90 | 0.12 | 0.12 | 0.00 |
Calling the plot method on this DataFrame produces the output we probably intended.
df1.plot(figsize=(8,5))
<Axes: xlabel='angle'>
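Since plot returns a regular Matplotlib Axes object (as the `<Axes: xlabel='angle'>` output shows), the usual Matplotlib calls can be used to polish the result. A minimal sketch, using the non-interactive Agg backend so it also runs outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# same toy dataset as above, with "angle" as the index
df1 = pd.DataFrame({
    "angle": list(range(-90, 91, 15)),
    "value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
    "value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
}).set_index("angle")

# plot returns an Axes; customize it with standard Matplotlib calls
ax = df1.plot(figsize=(8, 5))
ax.set_ylabel("value")
ax.set_title("Values over angle")
```

For anything beyond such small touches, building the figure with Matplotlib directly, as shown in the Plotting chapter, gives full control.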