Tabular data and statistics

Tabular data and statistics#

A typical pattern in the development of Pythons eco system is to recognize something very efficient and practical somewhere else and find ways to integrate or resemble that functionality to make it usable in the Python world. We’ve seen this for numpy and Matplotlib for example. Another success story here is the development of a library called pandas. This time it was someone from the world of statistics that saw the nice and modern features of the Python language, but the lack of statistical functions and procedures that were available in specialized languages such as R. Another key concept in R is the data model used throughout, the data frame. In statistics most data used can be organized and presented in tabular form, pretty comparable to an Excel sheet or a database table. A lot of core functionality is available for these object orientated data frames already as methods such as finding extreme values, averages, percentiles and so on. If one would want to do the same thing in Python, it would definitely be a good idea to implement this special data model first. A young statistician from the US, Wes McKinney one day did. The result is nowadays one of the most important contributions to modern Data Science, the pandas library. This library, together with numpy and matplotlib, the Jupyter Notebook and machine learning solutions such as pytorch and scikit learn are the reasons almost everyone doing Data Science is using Python for this these days.

wes_mckinney — Fig. 4 Wes McKinney (Courtesy unknown)#

Let’s play around with this a bit. It all starts with importing the library. In most cases you’ll find the following pattern to do so.

import pandas as pd

DataFrames can of course be constructed from being empty onwards. The more usual way though is to migrate an existing data model into a dataframe. Most often you’ll find one of the following two approaches. In the first case a dictionary is the base having one key-value pair per column, where the dicts value holds all items for the complete column.

data = {
    "angle": [-90, -75, -60, -45, -30, -15, 0, 15, 30, 45, 60, 75, 90],
    "value1": [.13, .15, .18, .19, .21, .23, .25, .25, .24, .23, .20, .16, .12],
    "value2": [.12, .16, .20, .23, .24, .25, .26, .23, .21, .18, .18, .15, .12],
}

df = pd.DataFrame(data)
df

	angle	value1	value2
0	-90	0.13	0.12
1	-75	0.15	0.16
2	-60	0.18	0.20
3	-45	0.19	0.23
4	-30	0.21	0.24
5	-15	0.23	0.25
6	0	0.25	0.26
7	15	0.25	0.23
8	30	0.24	0.21
9	45	0.23	0.18
10	60	0.20	0.18
11	75	0.16	0.15
12	90	0.12	0.12

The other example shows the same values, but organized in a different fashion. Here we have a list of records, where each record is a dict having three key-value pairs. The keys will be used to create the columns for the DataFrame.

data = []
data.append({"angle": -90, "value1": .13, "value2": .12})
data.append({"angle": -75, "value1": .15, "value2": .16})
data.append({"angle": -60, "value1": .18, "value2": .20})
data.append({"angle": -45, "value1": .19, "value2": .23})
data.append({"angle": -30, "value1": .21, "value2": .24})
data.append({"angle": -15, "value1": .23, "value2": .25})
data.append({"angle": 0, "value1": .25, "value2": .26})
data.append({"angle": 15, "value1": .25, "value2": .23})
data.append({"angle": 30, "value1": .24, "value2": .21})
data.append({"angle": 45, "value1": .23, "value2": .18})
data.append({"angle": 60, "value1": .20, "value2": .18})
data.append({"angle": 75, "value1": .16, "value2": .15})
data.append({"angle": 90, "value1": .12, "value2": .12})

df = pd.DataFrame(data)
df

	angle	value1	value2
0	-90	0.13	0.12
1	-75	0.15	0.16
2	-60	0.18	0.20
3	-45	0.19	0.23
4	-30	0.21	0.24
5	-15	0.23	0.25
6	0	0.25	0.26
7	15	0.25	0.23
8	30	0.24	0.21
9	45	0.23	0.18
10	60	0.20	0.18
11	75	0.16	0.15
12	90	0.12	0.12

Some simple statistics#

Both approaches will create the very same DataFrame here, as you can see from its output. The DataFrame has three columns though, even though only two have been specified. The first column holds the DataFrames index, which can be used to address the elements in a DataFrame. This is very similar to working with data in Excel or databases for example, where you can use the index and the columns names to access values in the table. Before we do that however, lets have a quick overview on the values this DataFrame contains. The describe method is a perfect way to get first insights on the values showing you the count of items in each column, the mean value for the column, the minimum and maximum value found in each column, as well as the 25, 50 and 75 percentiles.

df.describe()

	angle	value1	value2
count	13.000000	13.000000	13.000000
mean	0.000000	0.195385	0.194615
std	58.416607	0.044836	0.047367
min	-90.000000	0.120000	0.120000
25%	-45.000000	0.160000	0.160000
50%	0.000000	0.200000	0.200000
75%	45.000000	0.230000	0.230000
max	90.000000	0.250000	0.260000

You can also have a look at the starting rows of the DataFrame with the head method or the final ones using the tail method.

df.head()

	angle	value1	value2
0	-90	0.13	0.12
1	-75	0.15	0.16
2	-60	0.18	0.20
3	-45	0.19	0.23
4	-30	0.21	0.24

df.tail()

	angle	value1	value2
8	30	0.24	0.21
9	45	0.23	0.18
10	60	0.20	0.18
11	75	0.16	0.15
12	90	0.12	0.12

For very large DataFrames it can be useful to check a random sample of rows to get an overview. You can use the sample method for that. The number of rows to be sampled can be used as an optional parameter. The default for this is 1, so calling sample without arguments fetches you a randomly chosen single row from the DataFrame

df.sample()

	angle	value1	value2
2	-60	0.18	0.2

Let’s fetch some more rows for a better impression.

df.sample(5)

	angle	value1	value2
4	-30	0.21	0.24
6	0	0.25	0.26
12	90	0.12	0.12
2	-60	0.18	0.20
10	60	0.20	0.18

Indexing#

As mentioned before items in a DataFrame can be accessed using the columns names and the index. There are several ways to do so. All being implemented as pythonic as possible. So we would expect some sort of square brackets to be used here. And just that is the case. You can fetch the complete column with items for “value1” for example like the following.

values = df["value1"]
values

   0.13
   0.15
   0.18
   0.19
   0.21
   0.23
   0.25
   0.25
   0.24
   0.23
  0.20
  0.16
  0.12
Name: value1, dtype: float64

The result however is not a simple iterable, such as a list of values, but of type Series. Actually the DataFrame is a collection of elements of type Series. In addition to the raw values there is still the index, the dtype and the name of the dataset in the DataFrame. We can fetch single elements from the Series as we would expect.

values[2]

0.18

This will give is the third element in our selected column.

A direct indexing to a single element is possible as well of course. The only thing to keep in mind here, is that you need to index columns before rows here.

df["value1"][2]

0.18

Even striding, which means to index subsets of the DataFrame is possible. To select every second element in our “value1” column for example, we can use this approach.

df["value1"][::2]

   0.13
   0.18
   0.21
   0.25
   0.24
  0.20
  0.12
Name: value1, dtype: float64

Pretty much all tricks from working with lists and numpy arrays will work here as well. Selecting multiple columns by name is possible as well. The syntax is a bit different though, as simply counting names and specifying steps is not directly possible. But using every second named column for an analysis will hardly make sense anyways. However selecting several columns from the original DataFrame goes like this.

df[["value2", "angle"]]

	value2	angle
0	0.12	-90
1	0.16	-75
2	0.20	-60
3	0.23	-45
4	0.24	-30
5	0.25	-15
6	0.26	0
7	0.23	15
8	0.21	30
9	0.18	45
10	0.18	60
11	0.15	75
12	0.12	90

This effectively swaps the order of columns compared to our original DataFrame. Of course combining the two approaches works as well.

df[["value2", "angle"]][1:-2:2]

	value2	angle
1	0.16	-75
3	0.23	-45
5	0.25	-15
7	0.23	15
9	0.18	45

Here, we create the column based subset first, and after that select all rows beginning from the second to the second last one in steps of two. Nice!

There are two other indexing approaches though, which are more consistent when it comes to further processing these sub selections. The reason for their existence is that pandas tries to avoid creating copies of data as good as possible to be efficient and fast. When working with subsets of data the way we’ve seen before, this can become error prone.

There are two further alternatives to index data, both location based and one of them being purely focused on indexes in both row and column direction. The purely index based approach can be used by the iloc attribute. This is not a function as you can see. It’s used with the square brackets syntax. Notice that in conrast to the former approach, the order of axes is different here. We’re indexing in row first order!

However indexing into the axes works as expected again. To select the first two rows of the “value1” column we can use the following approach.

df.iloc[0:2,1]

0    0.13
1    0.15
Name: value1, dtype: float64

The second approach allows to use names for the columns again. To select the first two elements from the “angle” column, we would write

df.loc[0:2, "angle"]

 -90
 -75
 -60
Name: angle, dtype: int64

Moreover, the type of name based striding which wasn’t possible in the naive indexing approach is possible here. To create a subset of values based on just the values columns we can use the following statement.

df.loc[3:6, "value1":"value2"]

	value1	value2
3	0.19	0.23
4	0.21	0.24
5	0.23	0.25
6	0.25	0.26

Given enough columns, using the step parameter for indexing would work as well here.

Modifying columns#

It’s quite easy to add columns, for example from other datasets or based on calculations with the existing ones. For example to add a column holding the difference for the two value columns, we could write

df["diff"] = df["value1"] - df["value2"]
df

	angle	value1	value2	diff
0	-90	0.13	0.12	0.01
1	-75	0.15	0.16	-0.01
2	-60	0.18	0.20	-0.02
3	-45	0.19	0.23	-0.04
4	-30	0.21	0.24	-0.03
5	-15	0.23	0.25	-0.02
6	0	0.25	0.26	-0.01
7	15	0.25	0.23	0.02
8	30	0.24	0.21	0.03
9	45	0.23	0.18	0.05
10	60	0.20	0.18	0.02
11	75	0.16	0.15	0.01
12	90	0.12	0.12	0.00

As you can see our DataFrame now has a fourth column with all the calculated differences.

Simple calculus#

The DataFrame has several methods we already now from numpy available. Finding the minimum values in each column can be achieved like so.

df.min()

angle    -90.00
value1     0.12
value2     0.12
diff      -0.04
dtype: float64

Showing the maximum values works the same way.

df.max()

angle     90.00
value1     0.25
value2     0.26
diff       0.05
dtype: float64

As do summing up the values or calculating the mean.

df.sum()

angle     0.00
value1    2.54
value2    2.53
diff      0.01
dtype: float64

df.mean()

angle     0.000000
value1    0.195385
value2    0.194615
diff      0.000769
dtype: float64

Calculating the difference of consecutive rows.

df.diff()

	angle	value1	value2	diff
0	NaN	NaN	NaN	NaN
1	15.0	0.02	0.04	-0.02
2	15.0	0.03	0.04	-0.01
3	15.0	0.01	0.03	-0.02
4	15.0	0.02	0.01	0.01
5	15.0	0.02	0.01	0.01
6	15.0	0.02	0.01	0.01
7	15.0	0.00	-0.03	0.03
8	15.0	-0.01	-0.02	0.01
9	15.0	-0.01	-0.03	0.02
10	15.0	-0.03	0.00	-0.03
11	15.0	-0.04	-0.03	-0.01
12	15.0	-0.04	-0.03	-0.01

Calculating the cumulative sum

df.cumsum()

	angle	value1	value2	diff
0	-90	0.13	0.12	0.01
1	-165	0.28	0.28	0.00
2	-225	0.46	0.48	-0.02
3	-270	0.65	0.71	-0.06
4	-300	0.86	0.95	-0.09
5	-315	1.09	1.20	-0.11
6	-315	1.34	1.46	-0.12
7	-300	1.59	1.69	-0.10
8	-270	1.83	1.90	-0.07
9	-225	2.06	2.08	-0.02
10	-165	2.26	2.26	0.00
11	-90	2.42	2.41	0.01
12	0	2.54	2.53	0.01

Value based selections.

Select all rows in the DataFrame where the values in the value1 column are above its mean value.

df[df["value1"] > df["value1"].mean()]

	angle	value1	value2	diff
4	-30	0.21	0.24	-0.03
5	-15	0.23	0.25	-0.02
6	0	0.25	0.26	-0.01
7	15	0.25	0.23	0.02
8	30	0.24	0.21	0.03
9	45	0.23	0.18	0.05
10	60	0.20	0.18	0.02

Plotting#

Simple plotting is also available as a method for DataFrames. This has been implemented to provide a consistent api, where the DataFrame itself is always the core element. This is well suited for a quick overview. However for more detailed plots, where more control is needed over the plot itself, using the approaches from the Plotting chapter is advised.

We need to prepare a bit for the plots to look nice and work as expected. pandas will use an available backend for the plots. So let’s prepare our Matplotlib and seaborn based approach first.

%matplotlib widget
import seaborn as sns
sns.set()

The most simple way to create a plot is to just call the plot method on the DataFrame. Some already known Matplotlib parameters can be used in this context as well.

df.plot(figsize=(8,5))

<Axes: >

Based on the legend shown in the plot we can see that all four available columns are plotted against the index by default.

If we wanted to use a different column to be the base, we need to change the index. pandas offers a way for this by using the set_index method. Itr returns a different view on the same DataFrame with the index now being the chosen column.

df1 = df.set_index("angle")
df1

	value1	value2	diff
angle
-90	0.13	0.12	0.01
-75	0.15	0.16	-0.01
-60	0.18	0.20	-0.02
-45	0.19	0.23	-0.04
-30	0.21	0.24	-0.03
-15	0.23	0.25	-0.02
0	0.25	0.26	-0.01
15	0.25	0.23	0.02
30	0.24	0.21	0.03
45	0.23	0.18	0.05
60	0.20	0.18	0.02
75	0.16	0.15	0.01
90	0.12	0.12	0.00

Calling the plot method on this DataFrame shows the probably more desired output.

df1.plot(figsize=(8,5))

<Axes: xlabel='angle'>