Week 9 - Pandas (Part I) - Introduction, Data input and validation, create columns & dataframes

Liaw Bei Le · May 17, 2021

Pandas - Data input and validation:

  1. Install and import Pandas
  2. pd.read_csv( )
  3. df.shape
  4. df.head( ) , df.tail( )
  5. df.info( )
  6. df.describe( )
  7. df.drop( )
  8. df.isnull( )
  9. df.dtypes
  10. df['seriesname'] or df["seriesname"] or df.seriesname
  11. df[['columnname','columnname2']]
  12. type( )
  13. Adding a column into the dataframe
  14. Creating a dataframe

Cheatsheets

Pandas Guide 1 | 2 | 3


What I learnt:

Introduction to Pandas

Pandas: its name is derived from "panel data"

  • is a Python library
  • allows high-level data manipulation
  • works with DataFrames

Each column of a dataframe is a Series; a single row, when selected on its own, is also returned as a Series.

In Jupyter Notebook, we can type the function and press shift+tab to see more info about the function and its parameters.

Data input and validation:

  • Install Pandas in Jupyter Notebook: if pandas is not installed yet, type !pip install pandas in a cell and run it.

  • Import Pandas in Jupyter Notebook:
      import pandas as pd
    
  • .read_csv( )
    We can import data from a wide variety of data sources, including CSVs, Excel files, JSONs, databases, parquet files, etc.
    Ensure the CSV file is in the same folder as your Jupyter Notebook .ipynb file; otherwise, use the full file path to import the dataset.
    • To read a CSV file:
      • df=pd.read_csv('datasetname.csv') or
      • df=pd.read_csv('csvfileaddress/datasetname.csv') , change all backslashes \ to forward slashes / in the file path
    • To read a CSV file and set a specified column as the index:
      use index_col: lets you choose which column to use as the index of the dataframe.
      The default value is None. We can specify a column name, or a column position, to be the index column.
      • df=pd.read_csv('datasetname.csv', index_col = "column_name")
      • df=pd.read_csv('datasetname.csv', index_col = 0) if we want the first column, at position 0, to be the index column
    • To read a CSV file and skip rows:
      • df=pd.read_csv('datasetname.csv', skiprows = 4)
        skips the first 4 lines of the file, including the header, so the 5th line is treated as the new header
      • df=pd.read_csv('brics.csv', sep=',', skiprows=range(1, 3))
        skips rows 1 and 2, i.e. the first 2 rows of data, and shows the 3rd data row onwards while keeping the header
    • To read a CSV file and only load selected columns into the dataframe, we can indicate usecols (a combined sketch of these read_csv options follows after this item)
      • df=pd.read_csv('datasetname.csv', usecols = ['columnname1','columnname2'])
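    A minimal sketch that puts these read_csv options together; the file name 'sales.csv' and the column names are hypothetical stand-ins:
      import pandas as pd

      # 'sales.csv', 'order_id' and 'amount' are made-up names, for illustration only
      df = pd.read_csv('sales.csv')                                  # plain read
      df = pd.read_csv('sales.csv', index_col='order_id')            # use a column as the index
      df = pd.read_csv('sales.csv', skiprows=range(1, 3))            # skip the first 2 data rows, keep the header
      df = pd.read_csv('sales.csv', usecols=['order_id', 'amount'])  # load only the listed columns
      print(df.head())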
  • df.shape shows us the number of rows and columns in the dataset, as a tuple
    • df.shape[0] to see the number of rows
    • df.shape[1] to see the number of columns
  • df.head( ) and df.tail( ) show the first and last 5 rows of the data
    • Can specify how many rows you want to see, e.g. df.head(10) to see the first 10 rows.
  • df.info( ) to get a summary of all fields and the no. of non-null values in each

  • df.describe( ) to get summary statistics of the numeric fields

  • df.drop(['fieldnametodrop'], axis=1) to delete a field from the data, e.g. when cleaning it; note that it returns a new dataframe unless you reassign it or pass inplace=True

  • df.isnull( ).sum( ) to see how many values in each field are null

  • df.dtypes to see the data types of the values in all your columns (a quick-check sketch combining these validation steps follows below)
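    A minimal quick-check sketch chaining these validation steps together, assuming a hypothetical 'sales.csv' with an unwanted 'notes' column:
      import pandas as pd

      df = pd.read_csv('sales.csv')        # hypothetical dataset

      print(df.shape)                      # (number of rows, number of columns)
      print(df.head())                     # first 5 rows
      df.info()                            # fields, dtypes and non-null counts
      print(df.describe())                 # summary statistics of the numeric fields
      print(df.isnull().sum())             # number of null values per field
      print(df.dtypes)                     # data type of each column

      df = df.drop(['notes'], axis=1)      # drop an unwanted field ('notes' is made up here)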

  • df['seriesname'] or df["seriesname"] or df.seriesname to show you the specified series (column).
    • Note: df.seriesname will not work if there is a space in the series name.
  • df[['columnname','columnname2']] to show multiple series, returned as a dataframe

  • type( ) is helpful in showing you the type of object you are working with, e.g. a dataframe or a series (see the sketch after this item)
    • type(df['seriesname']) is a Series
    • type(df) is a DataFrame
    • type(df[['columnname','columnname2']]) is a DataFrame
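    A short sketch of column selection and type( ), reusing the made-up scores data from the "Creating a dataframe" example further below:
      import pandas as pd

      df = pd.DataFrame({"name": ['Amy', 'Jeff', 'Phoebe'],
                         "city": ['San Francisco', 'San Francisco', 'Denver'],
                         "score": [75, 92, 94]})

      print(type(df))                   # <class 'pandas.core.frame.DataFrame'>
      print(type(df['score']))          # <class 'pandas.core.series.Series'>
      print(df.score)                   # dot notation works because 'score' has no spaces
      print(df[['name', 'score']])      # double brackets return a dataframe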
  • Adding a column:
    • df["True or False"] = [True, True, True, True, False]
      to create a column with True / False values by specifying them directly; the list must have one value per row
    • df["density"] = df["population"] / df["area"] * 1000000
      to create a column whose values are computed from other columns' values
      Example:
       # create a new column named "name_city"
       df['name_city'] = df['name'] + '_' + df['city']
      

      We could also use the apply, map, or applymap functions, or a for loop, which will be explained later [see 22 May post]. A runnable sketch of these column examples follows below.
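    A runnable sketch of adding columns, again reusing the made-up scores data; the 'passed' and 'score_pct' column names are invented for illustration:
      import pandas as pd

      df = pd.DataFrame({"name": ['Amy', 'Jeff', 'Phoebe'],
                         "city": ['San Francisco', 'San Francisco', 'Denver'],
                         "score": [75, 92, 94]})

      df["passed"] = [True, True, True]                # one value per row
      df["score_pct"] = df["score"] / 100              # computed from another column's values
      df["name_city"] = df["name"] + '_' + df["city"]  # join two text columns with an underscore
      print(df)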

  • Creating a dataframe
    • using a dictionary of keys and values
        import pandas as pd

        # create a df using a dictionary of keys and values
        scores = {"name": ['Amy', 'Jeff', 'Phoebe'],
                  "city": ['San Francisco', 'San Francisco', 'Denver'],
                  "score": [75, 92, 94]}
        df = pd.DataFrame(scores)

        # show the 'score' column
        display(df['score'])
      
        # column names as keys, column values as lists, row names in index
        df3 = pd.DataFrame({'PSA_Container_1': [10, 103, 55],
                            'PSA_Container_2': [37, 19, 17]},
                           index=['Beef', 'Fish', 'Spinach'])
        print(df3)

        # show the row name with the maximum value in column 1 and column 2
        print(df3.idxmax())
      
    • using df=pd.DataFrame(np.random.randn( ))
        import numpy as np

        # generate 20 rows, 5 columns of random numbers
        df1 = pd.DataFrame(np.random.randn(20, 5))
        print(df1)
      

Thoughts

This is just an introduction to Pandas. I will post more Pandas functions in my future posts.

The above functions are really useful in helping us to get an initial overview of our data and understand the data we are working with. I cannot wait to use them on the datasets in my projects when doing data analysis with Pandas!

Additional Resource(s):

  1. Kaggle is a great platform to retrieve sample datasets to practise on.

  2. How to read and write files with Pandas
