Python Pandas Tutorial

Filed in: Python

Pandas works best with Python 3; install it with pip install pandas.
Once pandas is installed, it is straightforward to create a Series and a DataFrame, the two structures that underpin everything we do. Indeed, a DataFrame is just a set of Series, as we shall see.
First, let's create a Series, which is just a 1D array with labels.



import pandas as pd

labels=['a','b','c','d']
ser=pd.Series([3, -8, 5,2], index=labels )
print(ser)
print('This seems like a named and ordered Dictionary')
labels=['d','c','b','x']
ser=pd.Series([3, -8, 5,2], index=labels )
print('New series post label change')
print(ser)

As the output below shows, a Series is rather like a dict: each value is stored against a label. Building the Series again with different labels changes how each value is referenced. The labels act as a kind of row name for the Series.



a    3
b   -8
c    5
d    2
dtype: int64
This seems like a named and ordered Dictionary
New series post label change
d    3
c   -8
b    5
x    2
dtype: int64
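To make the dictionary comparison concrete, here is a minimal sketch of label-based access on the first Series from above. The variable names are only for illustration.

```python
import pandas as pd

ser = pd.Series([3, -8, 5, 2], index=['a', 'b', 'c', 'd'])

# Look up a single value by its label, much like a dict key
value_b = ser['b']

# Membership tests check the labels, not the values
has_c = 'c' in ser
has_x = 'x' in ser

# Unlike a plain dict, position-based access also works
first_value = ser.iloc[0]

# Passing a list of labels returns a new, smaller Series
subset = ser[['d', 'a']]

print(value_b, has_c, has_x, first_value)
print(subset)
```

The list-of-labels form is handy when you only need a few entries and want them in a particular order.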

Now let us create a Series from a dictionary, since the two have similar attributes. Note what happens when the index contains a label that is not a key in the dictionary. This example pulls together everything seen so far.



#We could of course create the data via a dict
import pandas as pd

data={ 'a': 1,
       'b': 2,
       'c': 3,
       'd': 4
     }

labels=['a','b','c','d']

ser=pd.Series(data, index=labels )
print(ser)
labels=['a','b','c','x']
#an index label that is not a key in the dict gives you a NaN
ser=pd.Series(data, index=labels )
print(ser)
print('Dealing with Na/Null you can use bool arrays to find them and fix ')
print(pd.isna(ser))
print(pd.isnull(ser))
print(ser.isnull())
print('Lastly we can modify an index in place')
ser.index=['rob','bob','yob','gob']
print(ser)

Your output should look like this:



a    1
b    2
c    3
d    4
dtype: int64
a    1.0
b    2.0
c    3.0
x    NaN
dtype: float64
Dealing with Na/Null you can use bool arrays to find them and fix 
a    False
b    False
c    False
x     True
dtype: bool
a    False
b    False
c    False
x     True
dtype: bool
a    False
b    False
c    False
x     True
dtype: bool
Lastly we can modify an index in place
rob    1.0
bob    2.0
yob    3.0
gob    NaN
dtype: float64
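Once the missing entries have been found, they usually need to be dropped or filled. The following is a small sketch of the "find them and fix" step, reusing the same dict and labels as above; the choice of 0 as a fill value is just an assumption for the example.

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
ser = pd.Series(data, index=['a', 'b', 'c', 'x'])

# The boolean mask from isna() can be used to count or select entries
missing_count = int(ser.isna().sum())
present_only = ser[ser.notna()]

# dropna() removes missing entries; fillna() replaces them with a value
dropped = ser.dropna()
filled = ser.fillna(0)

print(missing_count)
print(present_only)
print(filled)
```

Note that the Series stays float64 even after filling, because the NaN forced the integers to floats when the Series was built.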

Now that we have seen a Series, let us look at the DataFrame. A DataFrame is like a dict of Series that all share the same index. Let's create one and add a new column as a Series so you can see what I mean.



#A DataFrame has both a row index and a column index
#It can be seen as a dict of Series all sharing the same index
#Let's prove it by creating a DF and pulling a Series out of it

import pandas as pd
import numpy as np
cols=['a','b','c']
idx=['r1','r2','r3']
df=pd.DataFrame(np.arange(9).reshape((3,3)), columns=cols, index=idx)
print(df)
#now pull out a Series from the DF
ser=df['a']
print(ser)
print('you can add a series to a df as long as they match ')
val=pd.Series([-1,-2,-3], index=['r1','r2','r4']) #val has no r3, so expect NaN there; r4 is not in df, so it is dropped
df['d']=val #add new col as a series
print(df)
print('now remove the col')
del df['d']
print(df)

The output looks like this:



    a  b  c
r1  0  1  2
r2  3  4  5
r3  6  7  8
r1    0
r2    3
r3    6
Name: a, dtype: int64
you can add a series to a df as long as they match 
    a  b  c    d
r1  0  1  2 -1.0
r2  3  4  5 -2.0
r3  6  7  8  NaN
now remove the col
    a  b  c
r1  0  1  2
r2  3  4  5
r3  6  7  8
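The "dict of Series" idea can also be shown directly. This is a minimal sketch with made-up values: two Series whose labels only partly overlap are combined into one frame. Unlike assigning a column to an existing frame, where the frame's own index is kept, building from a dict takes the union of the labels.

```python
import pandas as pd

# Two Series whose labels only partly overlap
s1 = pd.Series([1, 2, 3], index=['r1', 'r2', 'r3'])
s2 = pd.Series([10, 20, 30], index=['r2', 'r3', 'r4'])

# Building a frame from a dict of Series aligns the values on their labels
df = pd.DataFrame({'x': s1, 'y': s2})
print(df)

# Every column comes back out as a Series on the shared index
col_x = df['x']
print(list(df.index))
```

Where a Series has no value for a label, the frame holds NaN, just as in the example above.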

A DataFrame is essentially a table. Each column is a Series, so the more Series you have, the more columns you have. The column names are held in an Index object, and by default the rows are labelled from 0 up to one less than the total number of rows in your set.
If you want the data set I used here for this example please contact me directly.

So let's load a table so you can see what I mean.




import pandas as pd
df=pd.read_csv('pressure.csv')
print(df) #df.info() (with parentheses) would print a summary of columns and dtypes instead

This reads the CSV file and prints the table.
The output is:




       machine      sensor    signal  active  minutes  seconds  millis
0      feed_a  pressure_a  0.009055    0.01        0        0       0
1      feed_a  pressure_a  0.049098    0.01        0        0     200
2      feed_a  pressure_a  0.046490    0.01        0        0     400
3      feed_a  pressure_a  0.058478    0.01        0        0     600
4      feed_a  pressure_a  0.052379    0.01        0        0     800
...       ...         ...       ...     ...      ...      ...     ...
15595  feed_a  pressure_z  0.074816    0.01        1       59       0
15596  feed_a  pressure_z  0.055374    0.01        1       59     200
15597  feed_a  pressure_z  0.042579    0.01        1       59     400
15598  feed_a  pressure_z  0.044520    0.01        1       59     600
15599  feed_a  pressure_z  0.060234    0.01        1       59     800

So we have 15600 rows, we have column headings, and the rows are labelled from 0 to 15599.
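Rather than reading the row count off the printed table, you can ask the frame directly. The sketch below uses a tiny hypothetical stand-in for pressure.csv (same column names, invented rows), since the real file is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for pressure.csv with the same column names
df = pd.DataFrame({
    'machine': ['feed_a'] * 4,
    'sensor': ['pressure_a', 'pressure_a', 'pressure_b', 'pressure_b'],
    'signal': [0.009055, 0.049098, 0.046490, 0.058478],
    'active': [0.01, 0.01, 0.01, 0.01],
    'minutes': [0, 0, 0, 0],
    'seconds': [0, 0, 0, 0],
    'millis': [0, 200, 0, 200],
})

n_rows, n_cols = df.shape    # (rows, columns)
last_label = df.index[-1]    # default labels start at 0, so this is n_rows - 1

print(n_rows, n_cols, last_label)
df.info()                    # note the parentheses: prints a column/dtype summary
```

Because the default labels start at 0, the number of rows is always one more than the last label.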
So what are the column names? We can find out with:




import pandas as pd
df=pd.read_csv('pressure.csv')
print(df)
print(df.columns)

The df.columns call gives us:




Index(['machine', 'sensor', 'signal', 'active', 'minutes', 'seconds',
       'millis'],
      dtype='object')

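Notice how it says Index. This can be confusing, as these are column names. Pandas uses the same Index type for the labels on both axes: the column labels live in df.columns and the row labels live in df.index.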
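A quick way to see the two sets of labels side by side is to print both attributes. This is a small sketch on an invented two-row frame, not the pressure data.

```python
import pandas as pd

df = pd.DataFrame({'signal': [0.1, 0.2], 'millis': [0, 200]})

print(df.columns)   # column labels
print(df.index)     # row labels, a RangeIndex by default

# Both axes are label objects derived from pd.Index
cols_are_index = isinstance(df.columns, pd.Index)
rows_are_index = isinstance(df.index, pd.Index)
print(cols_are_index, rows_are_index)
```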

We can now look at a specific column or columns, a specific row or rows, or the head or tail of the table.




import pandas as pd
df=pd.read_csv('pressure.csv')
print(df)
#show the column names
print(df.columns)

#show machine series data
print(df['machine'])

#show the row labelled 5
print(df.loc[5])
#show the rows labelled 5 and 0
print(df.loc[[5,0]])

#show head and tail
print(df.head())
print(df.tail())
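One detail worth a quick sketch: loc selects by label, while iloc selects by position. With the default 0-based labels the two look the same, so the hypothetical frame below uses labels 10, 20 and 30 to make the difference visible. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in frame; the row labels are deliberately not 0..n-1
df = pd.DataFrame(
    {'sensor': ['pressure_a', 'pressure_b', 'pressure_c'],
     'signal': [0.38, 0.27, 0.05],
     'millis': [0, 200, 400]},
    index=[10, 20, 30],
)

by_label = df.loc[20]        # row whose label is 20
by_position = df.iloc[0]     # first row, whatever its label

# A list of column names returns a DataFrame rather than a Series
two_cols = df[['sensor', 'signal']]

print(by_label['sensor'], by_position['sensor'])
print(two_cols)
```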

I won't show the output, but you get the idea. We can now extract a specific column, or a specific set of columns, from the DataFrame, e.g. df['colname'], which is useful when you want to work on specific data. Let us try to find the mean of the signal column, but broken down so that we see the mean for each sensor.
This means that each sensor, pressure_a to pressure_z, will have its own mean.
Pandas covers the basic stats, but for a stat to make sense you need to do some sort of aggregation. What is the min? To answer, we need to find the lowest value across the whole set, i.e. aggregate the whole set. If we want the mean, we need either the mean of an entire column or, more likely, the mean of that column grouped by another column, i.e. the mean of signal grouped by sensor.

This is probably best explained with an example.



import pandas as pd
df=pd.read_csv('pressure.csv')
#min and max of every column across the whole set
print(df.agg(['min','max']))
#mean of signal for each sensor
grouping=df.groupby('sensor')['signal'].mean()
print(grouping)

and the results are:



    machine      sensor    signal    active  minutes  seconds  millis
min  feed_a  pressure_a  0.001247  0.010000        0        0       0
max  feed_a  pressure_z  0.843612  0.830312        1       59     800
sensor
pressure_a    0.383056
pressure_b    0.274544
pressure_c    0.053086
pressure_d    0.392225
pressure_e    0.413272
pressure_f    0.372712
pressure_g    0.361753
pressure_h    0.265497
pressure_i    0.200899
pressure_j    0.439096
pressure_k    0.487640
pressure_l    0.293329
pressure_m    0.267263
pressure_n    0.394381
pressure_o    0.133307
pressure_p    0.309840
pressure_q    0.423608
pressure_r    0.154208
pressure_s    0.139138
pressure_t    0.073541
pressure_u    0.314823
pressure_v    0.357417
pressure_w    0.128248
pressure_x    0.298298
pressure_y    0.481426
pressure_z    0.223841
Name: signal, dtype: float64
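The same grouping can return several statistics at once. The sketch below uses a few invented readings in place of pressure.csv, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical readings; the real pressure.csv is not reproduced here
df = pd.DataFrame({
    'sensor': ['pressure_a', 'pressure_a', 'pressure_b', 'pressure_b'],
    'signal': [0.2, 0.4, 0.1, 0.5],
})

# Mean of signal for each sensor
means = df.groupby('sensor')['signal'].mean()

# Several statistics per group in one call
summary = df.groupby('sensor')['signal'].agg(['min', 'max', 'mean'])

print(means)
print(summary)
```

The result of agg with a list is a DataFrame with one row per sensor and one column per statistic.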

As you can see, when you do basic stats you first need to think about how you will aggregate, or which columns you will group by, in order to get at the stat.
To round this off, we will look at a DataFrame where you set your own names for rows and columns, so you don't need to use numbers to refer to items.



import pandas as pd 
import numpy as np
#create dict that has the info 
data = { 'colA' : ['r1','r2','r3'],
         'colB' : np.random.randn(3),
         'colC' : np.random.randn(3)
       }
df=pd.DataFrame(data, columns=['colA','colB','colC'], index=['r1','r2','r3'])
#Now you can refer to a column or row by name rather than by arbitrary col/row numbers
print(df)

This gives you the following, which is so much easier to manipulate as you can use row and column headings (your numbers will differ, since the values are random):



   colA      colB      colC
r1   r1  0.254374 -0.655000
r2   r2  0.900610 -1.940164
r3   r3 -1.523257  0.691618
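To show what referring to items by name looks like in practice, here is a short sketch using loc with row and column names. It uses fixed, made-up numbers in place of the random ones so the results are repeatable.

```python
import pandas as pd

# Fixed numbers instead of random ones so the results are repeatable
df = pd.DataFrame(
    {'colB': [0.25, 0.90, -1.52],
     'colC': [-0.66, -1.94, 0.69]},
    index=['r1', 'r2', 'r3'],
)

cell = df.loc['r2', 'colB']              # single value by row and column name
row = df.loc['r3']                       # whole row as a Series
block = df.loc[['r1', 'r3'], ['colC']]   # sub-table by lists of names

print(cell)
print(row)
print(block)
```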

Tags: Pandas python|Python

Arif Jaffer

Polyglot Paradise
