Week 9 - Pandas (Part III) - Settings and options, Advanced Calculations

Liaw Bei Le · May 22, 2021

Learning Python Pandas

Pandas - Settings and options:

Export our dataframe as csv
Configuring options to view max or selected no. of columns/ rows
To edit float values’ format to 2 d.p.

Pandas - Advanced Calculations:

Data type conversions

df.dtype
df.mean( )
df[‘columnname’][index]
.astype( )
.to_datetime( )

Working with strings as values in dataframes

df.str.replace( )
df.str.len( )
df.str.strip( )
df.str.lower( ) and df.str.upper( )
Swap the order of the first and last words of a string

Working with dates as values in dataframes

pd.period_range( ) and pd.DataFrame( )
.diff( )
.values.astype( )
date_df.dtypes
.dt.to_timestamp( )
Subtract 2 dates
Subtract time difference (in days) from a date
.dt.day_name( )

Working with missing values

df.isna( )
df[‘columnname’].cumsum( )
.groupby( )
df.dropna( )
df.fillna( )
df.interpolate( )

Apply / Map / ApplyMap / for loop

df.apply(lambda x: )
df[columnname].map( )
df.applymap(lambda x: )
using for loop

What I learnt:

Settings and options

Export our data as csv:

df.to_csv(‘filename.csv’,index=False) and include the index=False so that the index is not included in our csv

Configuring options:

pd.set_option(“display.max_columns”,None) or pd.set_option(“max_columns”,None) to view all columns without truncation
pd.set_option(“display.max_columns”,2) or pd.set_option(“max_columns”,2) to view 2 columns and hide the rest
pd.set_option(“display.max_rows”,None) or pd.set_option(“max_rows”,None) to view all rows without truncation
pd.set_option(“display.max_rows”,2) or pd.set_option(“max_rows”,2) to view 2 rows and hide the rest

To edit float values’ format:

sometimes the values are shown as 1.00e^10
can amend to 2 d.p. using pd.options.display.float_format = ‘{:,.2f}’.format

Advanced Calculations

Data type conversions

df.dtype to view data types of each series
df.mean( ) to see mean of series with integer and float types
df[‘columnname’][index] e.g. df[‘columnname’][0] to obtain the first value in the series
use .astype(float) or .astype(int) or .astype(str) to convert to float or integer or string types e.g. df[‘columnname’][0].astype(float) or df[‘year’][0].astype(str)
use .to_datetime( ) to create a new column with formatted date time values, e.g. new column df[‘year_dt’] = pd.to_datetime(df[‘year’],format=’%Y’)

Working with strings

Example dataframe:

df = pd.Series{['Amy, Wong ',' Ben; Tan','Cody, Lim']}  

df = df.str.replace(‘;’ , ‘,’) to replace the ; with ,
df.str.len( ) to view the number of character in each string of your series
df.str.strip( ) to remove the trailing and leading spaces in the names (e.g. first 2 names)
df = df.str.lower( ) or df.str.upper( ) to make all strings lower or upper case.

we can swap the order of the first and last words (names) using

df.str.split(', ')  
df=pd.Series([i[::-1] for i in df])  
df=[' '.join(i) for i in df]  

Working with dates

pd.period_range( ) to generate a series of dates

'''these show 4 dates, starting from 1/1/2020, each separated by 30 days'''
daterange = pd.period_range('1/1/2020', freq='30d', periods=4)  
date_df = pd.DataFrame(data=daterange,columns=['sample date'])
date_df

.diff( ) to calculate date difference from the prior date in days

date_df['date difference'] = date_df['sample date']).diff(periods=1)
date_df

.values.astype( ) to find first day of the month

date_df['first day of month'] = date_df['sample date']).values.astype('datetime64[M]')
date_df

date_df.dtypes to check the data types of the dates in the dataframe
.dt.to_timestamp( ) to convert date type to another date type
from period[30D] to datetime64[ns] type
```
date_df['sample date'] = date_df['sample date']).dt.to_timestamp()
date_df
```

Subtract 2 dates to get the dates’ difference in days

'''date 1 minus date 2'''
date_df['sample date'] - date_df['first day of month']

Subtract time difference (in days) from a date to find the dates that are the time difference away from specifed dates

'''date 1 minus date difference'''
date_df['sample date'] - date_df['date difference']

'''date 1 minus date difference'''
date_df['sample date'] - pd.Timedelta('30 d')

.dt.day_name( ) Get the day names of the date

'''Get day names e.g. Monday Tuesday Wednesday'''
date_df['sample date'].dt.day_name()

Working with missing values

df.isna( ) identify null values in a dataframe
df[‘columnname’].cumsum( ) finds the cumulative sum of values as it goes down the column, and skips null by default when going down the column to get the cumulative amounts.
If df[‘columnname’].cumsum(skipna=False), then the cumulative sum from first null value onwards will all be null.

.groupby( ) by default excludes any null values

'''Get day names e.g. Monday Tuesday Wednesday'''
df.groupby(by=['columnname']).max()

To prevent the groupby from excluding null values, use dropna=False

'''Get day names e.g. Monday Tuesday Wednesday'''
df.groupby(by=['columnname'],dropna=False).max()

df.dropna( ) default is to drop any rows that contain null values in any column
If we want to only drop rows that contain null values in certain column, we can use the subset parameter, or we can just drop the column(s) with no values by using df.dropna(axis=1)
df.fillna(0) replaces all null values with ‘0’ value, but may be problematic especially if we calculate mean and include these 0 values
df.fillna(method=’pad’) replaces all null values with the value before it, but may be problematic
df.interpolate( ) by default lets us replace all null values with a straight line estimate

Apply / Map / ApplyMap

Apply can be used for both data frames and series.
Map works for series only.
Applymap affects each element in a data frame.
When in doubt, for a loop should do the trick.

df.apply(lambda x: ) to alter values along an axis in your dataframe or in a series by applying a function.
Lambda function allows you to create the function in the Apply statement without having to create it in advance. This Lambda function will return a series which we will use to populate a (new) column in our data frame.

df = pd.DataFrame({"Region":['North','West','East','South','North','West', 'East','South'], "Team":['One','One','One','One','Two','Two','Two','Two'], "Squad":['A','B','C','D','E','F','G','H'], "Revenue":[7500,5500,2750,6400,2300,3750,1900,575], "Cost":[5200,5100,4400,5300,1250,1300,2100,50]})
'''create a new column Profit with "Profit or Loss" or "Loss" based on Revenue vs Cost values'''
df['Profit or Loss'] = df.apply(lambda x: 'Profit' if x['Revenue']>x['Cost'] else 'Loss',axis=1)
df

df[columnname].map( ) to substitute each value in a series, using either a function, dictionary, or series.

df = pd.DataFrame({"Region":['North','West','East','South','North','West', 'East','South'], "Team":['One','One','One','One','Two','Two','Two','Two'], "Squad":['A','B','C','D','E','F','G','H'], "Revenue":[7500,5500,2750,6400,2300,3750,1900,575], "Cost":[5200,5100,4400,5300,1250,1300,2100,50]})
'''Step 1: Create a dictionary that maps team one or team two (keys) to their corresponding color red or blue (values)'''
team_map = {"One":"Red","Two":"Blue"}
'''Step 2: Map the values in the mapped dictionary to a new column, which we'll call team color'''
df['Team Color'] = df['Team'].map(team_map)
df

df.applymap(lambda x: ) to apply a function to each element in your dataframe

'''a simple Lambda function which returns the character length of each value in our data frame.'''
df.applymap(lambda x: len(str(x)))

If all else fails, use a ‘for’ loop

'''Start with an empty list that we populate as we iterate through each row in our data frame.'''
new_col = []
'''To construct the for loop, we loop through each i in the range from 0 to the length of our data frame.'''
for i in range(0,len(df)):
  '''rev represents the revenue value for that particular row divided by the sum of all revenue in our data frame where the region equals this particular squad's region.'''
  rev = df['Revenue'][i]/df[df['Region']==df.loc[i,'Region']]['Revenue'].sum()
  '''we append this value before continuing the loop''' 
  new_col.append(rev)
'''Next, we set a new column "revenue share of region" equal to this list.'''
df['Revenue Share of Region'] = new_col
df.sort_values(by='Region')

Share: Twitter, Facebook