Pandas - Settings and options:
- Export our dataframe as csv
- Configuring options to view max or selected no. of columns/ rows
- To edit float values’ format to 2 d.p.
Pandas - Advanced Calculations:
Data type conversions
- df.dtype
- df.mean( )
- df[‘columnname’][index]
- .astype( )
- .to_datetime( )
Working with strings as values in dataframes
- df.str.replace( )
- df.str.len( )
- df.str.strip( )
- df.str.lower( ) and df.str.upper( )
- Swap the order of the first and last words of a string
Working with dates as values in dataframes
- pd.period_range( ) and pd.DataFrame( )
- .diff( )
- .values.astype( )
- date_df.dtypes
- .dt.to_timestamp( )
- Subtract 2 dates
- Subtract time difference (in days) from a date
- .dt.day_name( )
Working with missing values
- df.isna( )
- df[‘columnname’].cumsum( )
- .groupby( )
- df.dropna( )
- df.fillna( )
- df.interpolate( )
Apply / Map / ApplyMap / for loop
- df.apply(lambda x: )
- df[columnname].map( )
- df.applymap(lambda x: )
- using for loop
What I learnt:
Settings and options
Export our data as csv:
- df.to_csv(‘filename.csv’,index=False) and include the index=False so that the index is not included in our csv
Configuring options:
- pd.set_option(“display.max_columns”,None) or pd.set_option(“max_columns”,None) to view all columns without truncation
- pd.set_option(“display.max_columns”,2) or pd.set_option(“max_columns”,2) to view 2 columns and hide the rest
- pd.set_option(“display.max_rows”,None) or pd.set_option(“max_rows”,None) to view all rows without truncation
- pd.set_option(“display.max_rows”,2) or pd.set_option(“max_rows”,2) to view 2 rows and hide the rest
To edit float values’ format:
- sometimes the values are shown as 1.00e^10
- can amend to 2 d.p. using pd.options.display.float_format = ‘{:,.2f}’.format
Advanced Calculations
Data type conversions
- df.dtype to view data types of each series
- df.mean( ) to see mean of series with integer and float types
- df[‘columnname’][index] e.g. df[‘columnname’][0] to obtain the first value in the series
- use .astype(float) or .astype(int) or .astype(str) to convert to float or integer or string types e.g. df[‘columnname’][0].astype(float) or df[‘year’][0].astype(str)
- use .to_datetime( ) to create a new column with formatted date time values, e.g. new column df[‘year_dt’] = pd.to_datetime(df[‘year’],format=’%Y’)
Working with strings
Example dataframe:
df = pd.Series{['Amy, Wong ',' Ben; Tan','Cody, Lim']}
- df = df.str.replace(‘;’ , ‘,’) to replace the ; with ,
- df.str.len( ) to view the number of character in each string of your series
- df.str.strip( ) to remove the trailing and leading spaces in the names (e.g. first 2 names)
- df = df.str.lower( ) or df.str.upper( ) to make all strings lower or upper case.
- we can swap the order of the first and last words (names) using
df.str.split(', ') df=pd.Series([i[::-1] for i in df]) df=[' '.join(i) for i in df]
Working with dates
- pd.period_range( ) to generate a series of dates
'''these show 4 dates, starting from 1/1/2020, each separated by 30 days''' daterange = pd.period_range('1/1/2020', freq='30d', periods=4) date_df = pd.DataFrame(data=daterange,columns=['sample date']) date_df
- .diff( ) to calculate date difference from the prior date in days
date_df['date difference'] = date_df['sample date']).diff(periods=1) date_df
- .values.astype( ) to find first day of the month
date_df['first day of month'] = date_df['sample date']).values.astype('datetime64[M]') date_df
-
date_df.dtypes to check the data types of the dates in the dataframe
- .dt.to_timestamp( ) to convert date type to another date type
from period[30D] to datetime64[ns] typedate_df['sample date'] = date_df['sample date']).dt.to_timestamp() date_df
- Subtract 2 dates to get the dates’ difference in days
'''date 1 minus date 2''' date_df['sample date'] - date_df['first day of month']
- Subtract time difference (in days) from a date to find the dates that are the time difference away from specifed dates
'''date 1 minus date difference''' date_df['sample date'] - date_df['date difference']
or
'''date 1 minus date difference''' date_df['sample date'] - pd.Timedelta('30 d')
- .dt.day_name( ) Get the day names of the date
'''Get day names e.g. Monday Tuesday Wednesday''' date_df['sample date'].dt.day_name()
Working with missing values
-
df.isna( ) identify null values in a dataframe
-
df[‘columnname’].cumsum( ) finds the cumulative sum of values as it goes down the column, and skips null by default when going down the column to get the cumulative amounts.
If df[‘columnname’].cumsum(skipna=False), then the cumulative sum from first null value onwards will all be null. - .groupby( ) by default excludes any null values
'''Get day names e.g. Monday Tuesday Wednesday''' df.groupby(by=['columnname']).max()
To prevent the groupby from excluding null values, use dropna=False
'''Get day names e.g. Monday Tuesday Wednesday''' df.groupby(by=['columnname'],dropna=False).max()
-
df.dropna( ) default is to drop any rows that contain null values in any column
If we want to only drop rows that contain null values in certain column, we can use the subset parameter, or we can just drop the column(s) with no values by using df.dropna(axis=1) -
df.fillna(0) replaces all null values with ‘0’ value, but may be problematic especially if we calculate mean and include these 0 values
-
df.fillna(method=’pad’) replaces all null values with the value before it, but may be problematic
- df.interpolate( ) by default lets us replace all null values with a straight line estimate
Apply / Map / ApplyMap
Apply can be used for both data frames and series.
Map works for series only.
Applymap affects each element in a data frame.
When in doubt, for a loop should do the trick.
- df.apply(lambda x: ) to alter values along an axis in your dataframe or in a series by applying a function.
Lambda function allows you to create the function in the Apply statement without having to create it in advance. This Lambda function will return a series which we will use to populate a (new) column in our data frame.df = pd.DataFrame({"Region":['North','West','East','South','North','West', 'East','South'], "Team":['One','One','One','One','Two','Two','Two','Two'], "Squad":['A','B','C','D','E','F','G','H'], "Revenue":[7500,5500,2750,6400,2300,3750,1900,575], "Cost":[5200,5100,4400,5300,1250,1300,2100,50]}) '''create a new column Profit with "Profit or Loss" or "Loss" based on Revenue vs Cost values''' df['Profit or Loss'] = df.apply(lambda x: 'Profit' if x['Revenue']>x['Cost'] else 'Loss',axis=1) df
- df[columnname].map( ) to substitute each value in a series, using either a function, dictionary, or series.
df = pd.DataFrame({"Region":['North','West','East','South','North','West', 'East','South'], "Team":['One','One','One','One','Two','Two','Two','Two'], "Squad":['A','B','C','D','E','F','G','H'], "Revenue":[7500,5500,2750,6400,2300,3750,1900,575], "Cost":[5200,5100,4400,5300,1250,1300,2100,50]}) '''Step 1: Create a dictionary that maps team one or team two (keys) to their corresponding color red or blue (values)''' team_map = {"One":"Red","Two":"Blue"} '''Step 2: Map the values in the mapped dictionary to a new column, which we'll call team color''' df['Team Color'] = df['Team'].map(team_map) df
- df.applymap(lambda x: ) to apply a function to each element in your dataframe
'''a simple Lambda function which returns the character length of each value in our data frame.''' df.applymap(lambda x: len(str(x)))
- If all else fails, use a ‘for’ loop
'''Start with an empty list that we populate as we iterate through each row in our data frame.''' new_col = [] '''To construct the for loop, we loop through each i in the range from 0 to the length of our data frame.''' for i in range(0,len(df)): '''rev represents the revenue value for that particular row divided by the sum of all revenue in our data frame where the region equals this particular squad's region.''' rev = df['Revenue'][i]/df[df['Region']==df.loc[i,'Region']]['Revenue'].sum() '''we append this value before continuing the loop''' new_col.append(rev) '''Next, we set a new column "revenue share of region" equal to this list.''' df['Revenue Share of Region'] = new_col df.sort_values(by='Region')