Week 9 - Pandas (Part II) - Basic Analysis, Indexes

Liaw Bei Le · May 19, 2021

Pandas - Basic Analysis:

df.value_counts( )
df.sort_values( )
Boolean indexing
String handling

Pandas - Indexes

df.index[ ]
df.set_index( )
df.reset_index( )
df.sort_index( )
df.loc[ ] or df.series.loc[ ]
df.iloc[ ]

Additional resources:

Ways to Filter Pandas Dataframe with multiple conditions

What I learnt:

Basic Analysis

df.seriesname.value_counts( ) returns a series object, counting all the unique values.
Returns a count of the unique values, the first value is the most frequently occurring element. The second, the second most frequently occurring element and so on.
- This order can be reversed by setting the ascending flag to True
```
  df.seriesname.value_counts(ascending=True)
```
- Dropna is True by default and we will not get a count of the na values (the missing data values). If our data set has a significant number of na values, we can set dropna to False.
```
  df.seriesname.value_counts(dropna=False)
```
df.seriesname.sort_values( ) or df.sort_values(“seriesname”) sorts the values in a series
- in ascending order by default
- By default, the NaNs, or missing data, are put right at the end
- We can also sort multiple series in ascending and descending order.
```
  df.sort_values(["seriesname","seriesname2"])
```
```
  df.sort_values(by["seriesname","seriesname2"])
```
```
  df.sort_values(["seriesname","seriesname2"]), ascending=[True, False])
```

Boolean indexing:

AND: &
OR: \
NOT: ~

multiple conditions must be grouped using brackets

e.g. show rows with ‘Gold’ values only
```
  df[df.columnname == 'Gold']
```

e.g. show rows with ‘Gold’ and ‘Women’ values only

  df[(df.columnname == 'Gold') & (df.columnname2=='Women')]

e.g. other examples

  '''df[<some conditions>]'''
  df[df['Age']<30]

  '''df[<some conditions>][show corresponding selected columns]'''
  df[df['Age']>50][['Age', 'Name', 'Sex']]

  '''to filter rows by score > 90'''
  df[df['score']>90]

  '''filter and count:'''
  len(df[(df['Age']>50) & (df['Survived']==0) & (df['Sex']=='male')])
  len(df[(df['Age']<30) & (df['Survived']==1)])

String handling:
df[df.columnname.str.contains(“value”)]
- show rows where values are ‘Florence’ only, at the specified column
```
  df[df.columnname.str.contains("Florence")]
```

Indexes

Indexes are immutable i.e. values cannot be changed

df.index[ ]
- To view the 100th element in the dataframe
```
  df.index[100] 
```
df.set_index( ) allows us to set any of the series to be the index. We will use the set index with one key parameter, which is the keys.
- e.g. df.set_index(“columnname”) to set the column as the index / first column but original df won’t be changed.
  To make and save changes to the original dataframe:
  - we need to specify inplace=True
    df.set_index("columnname",inplace=True)
  - or save it in a new dataframe df2 e.g.
    df2 = df.set_index("columnname")
df.reset_index( ) is the opposite of setting an index. Using a df where its index was set previously, e.g. df2, we can reset to its default integer-based index
- Need to specify that the inplace is true, if we want to reset and save changes to the df2 data frame,
```
  df2.reset_index(inplace=True)
```
- or we can specify that inplace equals false, and then we need to set this resetted dataframe to another dataframe variable name.
```
  df3=df2.reset_index()
```
df.sort_index( ) sorts the first column (your index column) alphabetically.
- use inplace=True to save changes to the dataframe if you are not saving in another dataframe name.
- df.sort_index(inplace=True, ascending=False) to save changes to the df and sort by reverse alphabetical
df.loc[ ] or df.series.loc[ ]
Subset data by row indexes or column names
- To access all rows with a specified value in the column, by making the column the index first then filtering
  - Ensure the column you want is the index first, using set_index().
    Then, df.loc[‘value’] to filter rows showing that value only
  - Alternatively, we can just use df.loc[df.columnname == ‘value’]
- To access a specifed row: df.loc[“row_index”] where row index is the first item in the row
- To obtain value of an item in a specifed row based on the column:
  - df.loc[“row_index”,”column_name”]
    Example 1:
    df.loc["CH","capital"]
    df["capital"].loc["CH"]
    df.loc["CH"]["capital"]
    Output (all 3 above gives the same output):
    Beijing
    Example 2:
    If row index is 3 and column name ‘length’, and if ‘length’ column is the first column, its column index is 0.
    '''using row index and column name''' df.loc[3,'length']
    '''using row index and column index''' df.loc[3,0]
    Output (both 2 above gives the same output):
    5cm
df.iloc[ ]
Subsets data by position, so the row number or column order
- primarily integer position based
- allows for the traditional Pythonic slicing
- To see the 1700th indexed entry:
```
  df.iloc[1700]
```
- To see the 1542nd, 1745th,1855th indexed entry:
```
  df.iloc[[1542,1745,1855]]
```
- Slicing using iloc
  To see first 3 indexes:
```
  df.iloc[1:4]
```

Usually, we use loc if we need to set index first, use iloc if the default index allows us to use it easily already & especially if we want to access chunks of data using Python slicing.

Additional Resource(s):

Ways to Filter Pandas Dataframe with multiple conditions

Using loc: display(df.loc[ (df[ ]) & (df[ ]) & (df[ ]) … ] ) [[‘columnname1’,’columnname2’,..]]
Example:

display(dataFrame.loc[(dataFrame['Salary']>=100000) & (dataFrame['Age']< 40) & (dataFrame['JOB'].str.startswith('D')), ['Name','JOB']])

Using NumPy

Using Query: display(df.query[( )])
(eval and query works only with columns)
Example:

display(dataFrame.query('Salary <= 100000 & Age < 40 & JOB.str.startswith("C").values'))

Pandas Boolean indexing display( df[ (df[ ]) & (df[ ]) & (df[ ]) … ] ) [[‘columnname1’,’columnname2’,..]]
(as above) (“Boolean indexing” works with values in a column only):
Example:
```
display(dataFrame[(dataFrame['Salary']>=100000) & (dataFrame['Age']<40) & dataFrame['JOB'].str.startswith('P')][['Name','Age','Salary']])
```
Eval multiple conditions: display(df[df.eval( )])
(“eval” and “query” works only with columns)
Example:
```
display(dataFrame[dataFrame.eval("Salary <=100000 & (Age <40) & JOB.str.startswith('A').values")])
```

Share: Twitter, Facebook