Python Pandas Tutorial
Filed in: Python
For pandas you are best off using Python 3 and then running pip install pandas.

Once you have pandas installed, it is straightforward to create a Series and a DataFrame, which form the basis of everything we do. Indeed, a DataFrame is just a set of Series, as we shall see.
First, let's create a Series, which is just a 1-D array with labels:
import pandas as pd

# a Series holds values plus one label per value
labels = ['a', 'b', 'c', 'd']
ser = pd.Series([3, -8, 5, 2], index=labels)
print(ser)
print('This seems like a named and ordered Dictionary')

# same values, different labels
labels = ['d', 'c', 'b', 'x']
ser = pd.Series([3, -8, 5, 2], index=labels)
print('New series post label change')
print(ser)
As you can see, a Series is rather like a dict, and you can change the labels. The labels act as a kind of reference, or row name, for each value in the Series. The output is:
a    3
b   -8
c    5
d    2
dtype: int64
This seems like a named and ordered Dictionary
New series post label change
d    3
c   -8
b    5
x    2
dtype: int64
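To make the dict comparison concrete, here is a minimal sketch of looking values up in the Series from above. It assumes nothing beyond pandas itself; the variable names by_label, by_position and has_label are just illustrative.

```python
import pandas as pd

ser = pd.Series([3, -8, 5, 2], index=["a", "b", "c", "d"])

# Look a value up by its label, much like a dict key
by_label = ser["b"]

# Look a value up by its position with .iloc, like a list index
by_position = ser.iloc[1]

# Membership tests check the labels, not the values
has_label = "c" in ser

print(by_label, by_position, has_label)
```

Both lookups land on the same element here, because the label 'b' sits at position 1.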
Now let us create a Series from a dictionary, since the two have similar attributes. Also note what happens if we set the labels to values that are not keys in the dictionary. This example puts together everything seen so far:
# We could of course create the data via a dict
import pandas as pd

data = {'a': 1,
        'b': 2,
        'c': 3,
        'd': 4}
labels = ['a', 'b', 'c', 'd']
ser = pd.Series(data, index=labels)
print(ser)

labels = ['a', 'b', 'c', 'x']
# a label that is not a key in the dict gives you a NaN
ser = pd.Series(data, index=labels)
print(ser)

print('Dealing with Na/Null you can use bool arrays to find them and fix ')
# all three calls return the same boolean Series
print(pd.isna(ser))
print(pd.isnull(ser))
print(ser.isnull())

print('Lastly we can modify an index in place')
ser.index = ['rob', 'bob', 'yob', 'gob']
print(ser)
Your output should look like this:
a    1
b    2
c    3
d    4
dtype: int64
a    1.0
b    2.0
c    3.0
x    NaN
dtype: float64
Dealing with Na/Null you can use bool arrays to find them and fix
a    False
b    False
c    False
x     True
dtype: bool
a    False
b    False
c    False
x     True
dtype: bool
a    False
b    False
c    False
x     True
dtype: bool
Lastly we can modify an index in place
rob    1.0
bob    2.0
yob    3.0
gob    NaN
dtype: float64
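The example above finds the missing value but stops short of fixing it. As a rough sketch, these are two common ways to deal with it; the fill value of 0 is an assumption chosen purely for illustration, and the right choice depends on your data.

```python
import pandas as pd

data = {"a": 1, "b": 2, "c": 3, "d": 4}
ser = pd.Series(data, index=["a", "b", "c", "x"])  # 'x' is not a key, so NaN

# The boolean mask tells us which labels are missing
mask = ser.isna()
missing_labels = list(ser[mask].index)

# Option 1: replace missing values with a default (0 is an arbitrary choice)
filled = ser.fillna(0)

# Option 2: drop the missing entries altogether
dropped = ser.dropna()

print(missing_labels)
print(filled)
print(dropped)
```

Neither call changes ser itself; both return a new Series.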
Now that we have seen a Series, let us look at the DataFrame. A DataFrame is a dict of Series that all share the same index. Let's create one and add a new column as a Series so you can see what I mean:
# A DataFrame has both a row and a column index
# It can be seen as a dict of Series all sharing the same index
# Let's prove it by creating a DF and getting a Series out of it
import pandas as pd
import numpy as np

cols = ['a', 'b', 'c']
idx = ['r1', 'r2', 'r3']
df = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=cols, index=idx)
print(df)

# now pull out a Series from the DF
ser = df['a']
print(ser)

print('you can add a series to a df as long as they match ')
# the DF has no r4 and the Series has no r3, so expect NaN at r3
val = pd.Series([-1, -2, -3], index=['r1', 'r2', 'r4'])
df['d'] = val  # add new col as a Series
print(df)

print('now remove the col')
del df['d']
print(df)
The output looks like this:
    a  b  c
r1  0  1  2
r2  3  4  5
r3  6  7  8
r1    0
r2    3
r3    6
Name: a, dtype: int64
you can add a series to a df as long as they match
    a  b  c    d
r1  0  1  2 -1.0
r2  3  4  5 -2.0
r3  6  7  8  NaN
now remove the col
    a  b  c
r1  0  1  2
r2  3  4  5
r3  6  7  8
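Notice that the value for r4 quietly disappeared when the column was added. The sketch below contrasts the two behaviours on a small made-up pair of Series (s1 and s2 are illustrative names): building a DataFrame from a dict of Series keeps every label, while assigning a column to an existing DataFrame keeps only the labels the DataFrame already has.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["r1", "r2", "r3"])
s2 = pd.Series([10, 20, 30], index=["r1", "r2", "r4"])

# Building from a dict of Series: the row index is the union of all labels
df_union = pd.DataFrame({"a": s1, "b": s2})
print(df_union)

# Assigning to an existing DataFrame: the new column is aligned to the
# existing row index, so r4 is dropped and r3 becomes NaN
df_assign = pd.DataFrame({"a": s1})
df_assign["b"] = s2
print(df_assign)
```

Either way pandas matches values by label, never by position, which is why the labels matter so much.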
A DataFrame is essentially a table. Each column of data is a Series, so if you have lots of Series you have lots of columns. The column names are held in an Index, and by default every row is numbered from 0 up to one less than the total number of rows in your set.
If you want the data set I used for this example, please contact me directly.
So let's load a table so you can see what I mean:
import pandas as pd

df = pd.read_csv('pressure.csv')
# print the table itself; df.info() (with parentheses) prints a column summary instead
print(df)
This reads a CSV file and prints its contents.
The output is:
      machine      sensor    signal  active  minutes  seconds  millis
0      feed_a  pressure_a  0.009055    0.01        0        0       0
1      feed_a  pressure_a  0.049098    0.01        0        0     200
2      feed_a  pressure_a  0.046490    0.01        0        0     400
3      feed_a  pressure_a  0.058478    0.01        0        0     600
4      feed_a  pressure_a  0.052379    0.01        0        0     800
...       ...         ...       ...     ...      ...      ...     ...
15595  feed_a  pressure_z  0.074816    0.01        1       59       0
15596  feed_a  pressure_z  0.055374    0.01        1       59     200
15597  feed_a  pressure_z  0.042579    0.01        1       59     400
15598  feed_a  pressure_z  0.044520    0.01        1       59     600
15599  feed_a  pressure_z  0.060234    0.01        1       59     800
So we have 15600 rows, we have column headings, and the rows are numbered from 0 to 15599.
So what are the column names? We can ask for them directly:
import pandas as pd

df = pd.read_csv('pressure.csv')
print(df)
# the column names
print(df.columns)
This gives us:
Index(['machine', 'sensor', 'signal', 'active', 'minutes', 'seconds',
       'millis'],
      dtype='object')
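Notice how it says Index. This is confusing, as these are column names, while it is the row labels that are usually called the index. In pandas both are stored as Index objects: df.columns holds the column names and df.index holds the row labels.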
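Since pressure.csv is not included here, the following sketch shows the two kinds of labels on a tiny hand-made table. The column names are borrowed from the data set above, but the three rows of values are made up for illustration.

```python
import pandas as pd

# A tiny stand-in for the real file; the values are invented
df = pd.DataFrame({
    "machine": ["feed_a", "feed_a", "feed_a"],
    "sensor": ["pressure_a", "pressure_a", "pressure_b"],
    "signal": [0.01, 0.05, 0.04],
})

row_labels = df.index    # the default row numbering: 0, 1, 2
col_labels = df.columns  # the column names

print(list(row_labels))
print(list(col_labels))

# shape gives (number of rows, number of columns)
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```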
We can now look at a specific column or columns, a specific row or rows, or the head or tail of the table:
import pandas as pd

df = pd.read_csv('pressure.csv')
print(df)
# show the column Index
print(df.columns)
# show the machine column as a Series
print(df['machine'])
# show the row labelled 5
print(df.loc[5])
# show the rows labelled 5 and 0, in that order
print(df.loc[[5, 0]])
# show the first and last five rows
print(df.head())
print(df.tail())
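One thing worth knowing is that df.loc selects by label, not by position. With the default numbering the two coincide, so here is a small sketch on a made-up table whose row labels are deliberately 10, 20, 30 to show the difference. The values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sensor": ["pressure_a", "pressure_b", "pressure_c"],
        "signal": [0.1, 0.2, 0.3],
        "millis": [0, 200, 400],
    },
    index=[10, 20, 30],  # deliberately not 0, 1, 2
)

one_col = df["signal"]               # a single column is a Series
two_cols = df[["sensor", "signal"]]  # a list of names gives a DataFrame

by_label = df.loc[20]     # the row whose label is 20
by_position = df.iloc[0]  # the first row, whatever its label

print(by_label)
print(by_position)
```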
I won't show the output for the full data set, but you get the idea. We can now extract a specific column, or a specific set of columns, from the DataFrame, e.g. df['colname'], which is useful when you want to work on specific data. Let us try to find the mean of the signal column, but we want to see the mean for each sensor.
This implies that each sensor, pressure_a to pressure_z, will have its own mean.
The basic stats are covered in pandas, but a stat only makes sense once you decide how to aggregate. What is the min? To answer that we need to find the lowest value across the whole set, i.e. aggregate the whole set. If we want the mean, we need either the mean of an entire column or, more likely, the mean of that column aggregated by another column, i.e. the mean of signal aggregated by sensor.
This is probably best explained with an example:
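import pandas as pd

df = pd.read_csv('pressure.csv')
# min and max of every column across the whole table
print(df.agg(['min', 'max']))
# mean of the signal column, one value per sensor
grouping = df['signal'].groupby(df['sensor']).mean()
print(grouping)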
and the results are:
    machine      sensor    signal    active  minutes  seconds  millis
min  feed_a  pressure_a  0.001247  0.010000        0        0       0
max  feed_a  pressure_z  0.843612  0.830312        1       59     800
sensor
pressure_a    0.383056
pressure_b    0.274544
pressure_c    0.053086
pressure_d    0.392225
pressure_e    0.413272
pressure_f    0.372712
pressure_g    0.361753
pressure_h    0.265497
pressure_i    0.200899
pressure_j    0.439096
pressure_k    0.487640
pressure_l    0.293329
pressure_m    0.267263
pressure_n    0.394381
pressure_o    0.133307
pressure_p    0.309840
pressure_q    0.423608
pressure_r    0.154208
pressure_s    0.139138
pressure_t    0.073541
pressure_u    0.314823
pressure_v    0.357417
pressure_w    0.128248
pressure_x    0.298298
pressure_y    0.481426
pressure_z    0.223841
Name: signal, dtype: float64
As you can see, when you do basic stats you first need to think about how you will aggregate, or group by, certain other columns in order to get at the stat.
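To see the difference between a whole-column stat and a grouped one with numbers you can check by hand, here is a minimal sketch on a made-up four-row table. The sensor names mirror the data set above, but the signal values are invented so the arithmetic is easy to follow.

```python
import pandas as pd

df = pd.DataFrame({
    "sensor": ["pressure_a", "pressure_a", "pressure_b", "pressure_b"],
    "signal": [1.0, 3.0, 10.0, 20.0],
})

# Aggregating the whole column gives a single number
overall_mean = df["signal"].mean()

# Grouping first gives one row per sensor, with several stats at once
per_sensor = df.groupby("sensor")["signal"].agg(["mean", "min", "max"])

print(overall_mean)
print(per_sensor)
```

The overall mean mixes both sensors together, while the grouped result keeps them apart, which is usually what you want when the sensors measure different things.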
To round this off, we will look at a DataFrame where you set your own names for rows and columns, so you don't need to use numbers to refer to items:
import pandas as pd
import numpy as np

# create a dict that holds the column data
data = {'colA': ['r1', 'r2', 'r3'],
        'colB': np.random.randn(3),
        'colC': np.random.randn(3)}
df = pd.DataFrame(data, columns=['colA', 'colB', 'colC'], index=['r1', 'r2', 'r3'])
# now you can refer to a column or row by name rather than by arbitrary col/row numbers
print(df)
This gives you the following, which is much easier to manipulate because you can use row and column headings. The numbers are random, so yours will differ:
   colA      colB      colC
r1   r1  0.254374 -0.655000
r2   r2  0.900610 -1.940164
r3   r3 -1.523257  0.691618
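As a quick sketch of what those headings buy you, the snippet below reads and writes individual cells by name. It uses fixed made-up numbers instead of random ones so the results are repeatable; the names cell and row are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"colB": [0.25, 0.90, -1.52], "colC": [-0.66, -1.94, 0.69]},
    index=["r1", "r2", "r3"],
)

# Read one cell by row name and column name
cell = df.loc["r2", "colB"]

# Read a whole row by name; the result is a Series labelled by column
row = df.loc["r3"]

# Write one cell by name
df.loc["r1", "colC"] = 0.0

print(cell)
print(row)
print(df)
```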
