Week 9 - Pandas (Part II) - Basic Analysis, Indexes

Liaw Bei Le · May 19, 2021

Pandas - Basic Analysis:

  1. df.value_counts( )
  2. df.sort_values( )
  3. Boolean indexing
  4. String handling

Pandas - Indexes

  1. df.index[ ]
  2. df.set_index( )
  3. df.reset_index( )
  4. df.sort_index( )
  5. df.loc[ ] or df.series.loc[ ]
  6. df.iloc[ ]

Additional resources:

  1. Ways to Filter Pandas Dataframe with multiple conditions

What I learnt:

Basic Analysis

  • df.seriesname.value_counts( ) returns a series object, counting all the unique values.
    Returns a count of the unique values, the first value is the most frequently occurring element. The second, the second most frequently occurring element and so on.
    • This order can be reversed by setting the ascending flag to True
        df.seriesname.value_counts(ascending=True)
      
    • Dropna is True by default and we will not get a count of the na values (the missing data values). If our data set has a significant number of na values, we can set dropna to False.
        df.seriesname.value_counts(dropna=False)
      
  • df.seriesname.sort_values( ) or df.sort_values(“seriesname”) sorts the values in a series
    • in ascending order by default
    • By default, the NaNs, or missing data, are put right at the end
    • We can also sort multiple series in ascending and descending order.
        df.sort_values(["seriesname","seriesname2"])
      
        df.sort_values(by["seriesname","seriesname2"])
      
        df.sort_values(["seriesname","seriesname2"]), ascending=[True, False])
      
  • Boolean indexing:
    • AND: &
    • OR: \
    • NOT: ~
    • multiple conditions must be grouped using brackets
      • e.g. show rows with ‘Gold’ values only
          df[df.columnname == 'Gold']
        
      • e.g. show rows with ‘Gold’ and ‘Women’ values only
          df[(df.columnname == 'Gold') & (df.columnname2=='Women')]
        
      • e.g. other examples
          '''df[<some conditions>]'''
          df[df['Age']<30]
        
          '''df[<some conditions>][show corresponding selected columns]'''
          df[df['Age']>50][['Age', 'Name', 'Sex']]
        
          '''to filter rows by score > 90'''
          df[df['score']>90]
        
          '''filter and count:'''
          len(df[(df['Age']>50) & (df['Survived']==0) & (df['Sex']=='male')])
          len(df[(df['Age']<30) & (df['Survived']==1)])
        
  • String handling:
    df[df.columnname.str.contains(“value”)]
    • show rows where values are ‘Florence’ only, at the specified column
        df[df.columnname.str.contains("Florence")]
      

Indexes

Indexes are immutable i.e. values cannot be changed

  • df.index[ ]
    • To view the 100th element in the dataframe
        df.index[100] 
      
  • df.set_index( ) allows us to set any of the series to be the index. We will use the set index with one key parameter, which is the keys.
    • e.g. df.set_index(“columnname”) to set the column as the index / first column but original df won’t be changed.
      To make and save changes to the original dataframe:
      • we need to specify inplace=True
          df.set_index("columnname",inplace=True)
        
      • or save it in a new dataframe df2 e.g.
          df2 = df.set_index("columnname")
        
  • df.reset_index( ) is the opposite of setting an index. Using a df where its index was set previously, e.g. df2, we can reset to its default integer-based index
    • Need to specify that the inplace is true, if we want to reset and save changes to the df2 data frame,
        df2.reset_index(inplace=True)
      
    • or we can specify that inplace equals false, and then we need to set this resetted dataframe to another dataframe variable name.
        df3=df2.reset_index()
      
  • df.sort_index( ) sorts the first column (your index column) alphabetically.
    • use inplace=True to save changes to the dataframe if you are not saving in another dataframe name.
    • df.sort_index(inplace=True, ascending=False) to save changes to the df and sort by reverse alphabetical
  • df.loc[ ] or df.series.loc[ ]
    Subset data by row indexes or column names
    • To access all rows with a specified value in the column, by making the column the index first then filtering
      • Ensure the column you want is the index first, using set_index().
        Then, df.loc[‘value’] to filter rows showing that value only
      • Alternatively, we can just use df.loc[df.columnname == ‘value’]
    • To access a specifed row: df.loc[“row_index”] where row index is the first item in the row
    • To obtain value of an item in a specifed row based on the column:
      • df.loc[“row_index”,”column_name”]
        Example 1:
          df.loc["CH","capital"]
        
          df["capital"].loc["CH"]
        
          df.loc["CH"]["capital"]
        

        Output (all 3 above gives the same output):

          Beijing  
        

        Example 2:
        If row index is 3 and column name ‘length’, and if ‘length’ column is the first column, its column index is 0.

          '''using row index and column name'''
          df.loc[3,'length']
        
          '''using row index and column index'''
          df.loc[3,0]
        

        Output (both 2 above gives the same output):

          5cm 
        
  • df.iloc[ ]
    Subsets data by position, so the row number or column order
    • primarily integer position based
    • allows for the traditional Pythonic slicing
    • To see the 1700th indexed entry:
        df.iloc[1700]
      
    • To see the 1542nd, 1745th,1855th indexed entry:
        df.iloc[[1542,1745,1855]]
      
    • Slicing using iloc
      To see first 3 indexes:
        df.iloc[1:4]
      

Usually, we use loc if we need to set index first, use iloc if the default index allows us to use it easily already & especially if we want to access chunks of data using Python slicing.


Additional Resource(s):

Ways to Filter Pandas Dataframe with multiple conditions

  • Using loc: display(df.loc[ (df[ ]) & (df[ ]) & (df[ ]) … ] ) [[‘columnname1’,’columnname2’,..]]
    Example:
    display(dataFrame.loc[(dataFrame['Salary']>=100000) & (dataFrame['Age']< 40) & (dataFrame['JOB'].str.startswith('D')), ['Name','JOB']])
    
  • Using NumPy
  • Using Query: display(df.query[( )])
    (eval and query works only with columns)
    Example:
    display(dataFrame.query('Salary <= 100000 & Age < 40 & JOB.str.startswith("C").values'))
    
  • Pandas Boolean indexing display( df[ (df[ ]) & (df[ ]) & (df[ ]) … ] ) [[‘columnname1’,’columnname2’,..]]
    (as above) (“Boolean indexing” works with values in a column only):
    Example:
    display(dataFrame[(dataFrame['Salary']>=100000) & (dataFrame['Age']<40) & dataFrame['JOB'].str.startswith('P')][['Name','Age','Salary']])
    
  • Eval multiple conditions: display(df[df.eval( )])
    (“eval” and “query” works only with columns)
    Example:
    display(dataFrame[dataFrame.eval("Salary <=100000 & (Age <40) & JOB.str.startswith('A').values")])
    

Twitter, Facebook