Pandas - Basic Analysis:
- df.value_counts( )
- df.sort_values( )
- Boolean indexing
- String handling
Pandas - Indexes
- df.index[ ]
- df.set_index( )
- df.reset_index( )
- df.sort_index( )
- df.loc[ ] or df.series.loc[ ]
- df.iloc[ ]
Additional resources:
- Ways to Filter Pandas Dataframe with multiple conditions
What I learnt:
Basic Analysis
- df.seriesname.value_counts( ) returns a series object, counting all the unique values.
Returns a count of the unique values, the first value is the most frequently occurring element. The second, the second most frequently occurring element and so on.- This order can be reversed by setting the ascending flag to True
df.seriesname.value_counts(ascending=True)
- Dropna is True by default and we will not get a count of the na values (the missing data values). If our data set has a significant number of na values, we can set dropna to False.
df.seriesname.value_counts(dropna=False)
- This order can be reversed by setting the ascending flag to True
- df.seriesname.sort_values( ) or df.sort_values(“seriesname”) sorts the values in a series
- in ascending order by default
- By default, the NaNs, or missing data, are put right at the end
- We can also sort multiple series in ascending and descending order.
df.sort_values(["seriesname","seriesname2"])
df.sort_values(by["seriesname","seriesname2"])
df.sort_values(["seriesname","seriesname2"]), ascending=[True, False])
- Boolean indexing:
- AND: &
- OR: \
- NOT: ~
- multiple conditions must be grouped using brackets
- e.g. show rows with ‘Gold’ values only
df[df.columnname == 'Gold']
- e.g. show rows with ‘Gold’ and ‘Women’ values only
df[(df.columnname == 'Gold') & (df.columnname2=='Women')]
- e.g. other examples
'''df[<some conditions>]''' df[df['Age']<30]
'''df[<some conditions>][show corresponding selected columns]''' df[df['Age']>50][['Age', 'Name', 'Sex']]
'''to filter rows by score > 90''' df[df['score']>90]
'''filter and count:''' len(df[(df['Age']>50) & (df['Survived']==0) & (df['Sex']=='male')]) len(df[(df['Age']<30) & (df['Survived']==1)])
- e.g. show rows with ‘Gold’ values only
- String handling:
df[df.columnname.str.contains(“value”)]- show rows where values are ‘Florence’ only, at the specified column
df[df.columnname.str.contains("Florence")]
- show rows where values are ‘Florence’ only, at the specified column
Indexes
Indexes are immutable i.e. values cannot be changed
- df.index[ ]
- To view the 100th element in the dataframe
df.index[100]
- To view the 100th element in the dataframe
- df.set_index( ) allows us to set any of the series to be the index. We will use the set index with one key parameter, which is the keys.
- e.g. df.set_index(“columnname”) to set the column as the index / first column but original df won’t be changed.
To make and save changes to the original dataframe:- we need to specify inplace=True
df.set_index("columnname",inplace=True)
- or save it in a new dataframe df2 e.g.
df2 = df.set_index("columnname")
- we need to specify inplace=True
- e.g. df.set_index(“columnname”) to set the column as the index / first column but original df won’t be changed.
- df.reset_index( ) is the opposite of setting an index. Using a df where its index was set previously, e.g. df2, we can reset to its default integer-based index
- Need to specify that the inplace is true, if we want to reset and save changes to the df2 data frame,
df2.reset_index(inplace=True)
- or we can specify that inplace equals false, and then we need to set this resetted dataframe to another dataframe variable name.
df3=df2.reset_index()
- Need to specify that the inplace is true, if we want to reset and save changes to the df2 data frame,
- df.sort_index( ) sorts the first column (your index column) alphabetically.
- use inplace=True to save changes to the dataframe if you are not saving in another dataframe name.
- df.sort_index(inplace=True, ascending=False) to save changes to the df and sort by reverse alphabetical
- df.loc[ ] or df.series.loc[ ]
Subset data by row indexes or column names- To access all rows with a specified value in the column, by making the column the index first then filtering
- Ensure the column you want is the index first, using set_index().
Then, df.loc[‘value’] to filter rows showing that value only - Alternatively, we can just use df.loc[df.columnname == ‘value’]
- Ensure the column you want is the index first, using set_index().
- To access a specifed row: df.loc[“row_index”] where row index is the first item in the row
- To obtain value of an item in a specifed row based on the column:
- df.loc[“row_index”,”column_name”]
Example 1:df.loc["CH","capital"]
df["capital"].loc["CH"]
df.loc["CH"]["capital"]
Output (all 3 above gives the same output):
Beijing
Example 2:
If row index is 3 and column name ‘length’, and if ‘length’ column is the first column, its column index is 0.'''using row index and column name''' df.loc[3,'length']
'''using row index and column index''' df.loc[3,0]
Output (both 2 above gives the same output):
5cm
- df.loc[“row_index”,”column_name”]
- To access all rows with a specified value in the column, by making the column the index first then filtering
- df.iloc[ ]
Subsets data by position, so the row number or column order- primarily integer position based
- allows for the traditional Pythonic slicing
- To see the 1700th indexed entry:
df.iloc[1700]
- To see the 1542nd, 1745th,1855th indexed entry:
df.iloc[[1542,1745,1855]]
- Slicing using iloc
To see first 3 indexes:df.iloc[1:4]
Usually, we use loc if we need to set index first, use iloc if the default index allows us to use it easily already & especially if we want to access chunks of data using Python slicing.
Additional Resource(s):
Ways to Filter Pandas Dataframe with multiple conditions
- Using loc: display(df.loc[ (df[ ]) & (df[ ]) & (df[ ]) … ] ) [[‘columnname1’,’columnname2’,..]]
Example:display(dataFrame.loc[(dataFrame['Salary']>=100000) & (dataFrame['Age']< 40) & (dataFrame['JOB'].str.startswith('D')), ['Name','JOB']])
- Using NumPy
- Using Query: display(df.query[( )])
(eval and query works only with columns)
Example:display(dataFrame.query('Salary <= 100000 & Age < 40 & JOB.str.startswith("C").values'))
- Pandas Boolean indexing display( df[ (df[ ]) & (df[ ]) & (df[ ]) … ] ) [[‘columnname1’,’columnname2’,..]]
(as above) (“Boolean indexing” works with values in a column only):
Example:display(dataFrame[(dataFrame['Salary']>=100000) & (dataFrame['Age']<40) & dataFrame['JOB'].str.startswith('P')][['Name','Age','Salary']])
- Eval multiple conditions: display(df[df.eval( )])
(“eval” and “query” works only with columns)
Example:display(dataFrame[dataFrame.eval("Salary <=100000 & (Age <40) & JOB.str.startswith('A').values")])