Week 9 - Pandas (Part I) - Introduction, Data input and validation, create columns & dataframes

Liaw Bei Le · May 17, 2021

Pandas - Data input and validation:

  1. Install and import Pandas
  2. pd.read_csv( )
  3. df.shape
  4. df.head( ) , df.tail( )
  5. df.info( )
  6. df.describe( )
  7. df.drop( )
  8. df.isnull( )
  9. df.dtypes
  10. df['seriesname'] or df["seriesname"] or df.seriesname
  11. df[['columnname','columnname2']]
  12. type( )
  13. Adding a column into the dataframe
  14. Creating a dataframe

Cheatsheets

Pandas Guide 1 | 2 | 3


What I learnt:

Introduction to Pandas

Pandas: its name is derived from "panel data"

  • is a Python library
  • allows high-level data manipulation
  • works with DataFrames

Each column of a dataframe is a Series; a single row, when selected on its own, is also returned as a Series.

In Jupyter Notebook, we can type the function and press shift+tab to see more info about the function and its parameters.

Data input and validation:

  • Install Pandas in Jupyter Notebook: if pandas is not installed yet, type !pip install pandas in a cell and run it.

  • Import Pandas in Jupyter Notebook:
      import pandas as pd
    
  • .read_csv( )
    We can import data from a wide variety of data sources, including CSVs, Excel files, JSONs, databases, parquet files, etc.
    Ensure the CSV file is in the same folder as your Jupyter Notebook .ipynb file; otherwise, use the full file path to import the dataset.
    • To read a CSV file:
      • df=pd.read_csv('datasetname.csv') or
      • df=pd.read_csv('csvfileaddress/datasetname.csv') , change all backslashes \ to forward slashes / in the file path
    • To read a CSV file and set a specified column as the index:
      use index_col: lets you choose which column to use as the index of the dataframe.
      The default value is None. We can specify a column name, or a column position, to be the index column.
      • df=pd.read_csv('datasetname.csv', index_col = "column_name")
      • df=pd.read_csv('datasetname.csv', index_col = 0) if we want the first column, at position 0, to be the index column
    • To read a CSV file and skip rows:
      • df=pd.read_csv('datasetname.csv', skiprows = 4)
        skips the first 4 lines of the file, including the header, so the 5th line is treated as the new header
      • df=pd.read_csv('brics.csv', sep=',', skiprows=range(1, 3))
        skips rows 1 and 2, i.e. the first 2 rows of data, and shows the 3rd data row onwards while keeping the header
    • To read a CSV file and only load selected columns into the dataframe, we can indicate usecols (a combined sketch of these read_csv options follows after this item)
      • df=pd.read_csv('datasetname.csv', usecols = ['columnname1','columnname2'])
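    A minimal sketch that puts these read_csv options together; the file name 'sales.csv' and the column names are hypothetical stand-ins:
      import pandas as pd

      # 'sales.csv', 'order_id' and 'amount' are made-up names, for illustration only
      df = pd.read_csv('sales.csv')                                  # plain read
      df = pd.read_csv('sales.csv', index_col='order_id')            # use a column as the index
      df = pd.read_csv('sales.csv', skiprows=range(1, 3))            # skip the first 2 data rows, keep the header
      df = pd.read_csv('sales.csv', usecols=['order_id', 'amount'])  # load only the listed columns
      print(df.head())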
  • df.shape shows us the number of rows and columns in the dataset, as a tuple
    • df.shape[0] to see the number of rows
    • df.shape[1] to see the number of columns
  • df.head( ) and df.tail( ) show the first and last 5 rows of the data
    • Can specify how many rows you want to see, e.g. df.head(10) to see the first 10 rows.
  • df.info( ) to get a summary of all fields and the no. of non-null values in each

  • df.describe( ) to get summary statistics of the numeric fields

  • df.drop(['fieldnametodrop'], axis=1) to delete a field from the data, e.g. when cleaning it; note that it returns a new dataframe unless you reassign it or pass inplace=True

  • df.isnull( ).sum( ) to see how many values in each field are null

  • df.dtypes to see the data types of the values in all your columns (a quick-check sketch combining these validation steps follows below)
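    A minimal quick-check sketch chaining these validation steps together, assuming a hypothetical 'sales.csv' with an unwanted 'notes' column:
      import pandas as pd

      df = pd.read_csv('sales.csv')        # hypothetical dataset

      print(df.shape)                      # (number of rows, number of columns)
      print(df.head())                     # first 5 rows
      df.info()                            # fields, dtypes and non-null counts
      print(df.describe())                 # summary statistics of the numeric fields
      print(df.isnull().sum())             # number of null values per field
      print(df.dtypes)                     # data type of each column

      df = df.drop(['notes'], axis=1)      # drop an unwanted field ('notes' is made up here)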

  • df['seriesname'] or df["seriesname"] or df.seriesname to show you the specified series (column).
    • Note: df.seriesname will not work if there is a space in the series name.
  • df[['columnname','columnname2']] to show multiple series, returned as a dataframe

  • type( ) is helpful in showing you the type of object you are working with, e.g. a dataframe or a series (see the sketch after this item)
    • type(df['seriesname']) is a Series
    • type(df) is a DataFrame
    • type(df[['columnname','columnname2']]) is a DataFrame
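    A short sketch of column selection and type( ), reusing the made-up scores data from the "Creating a dataframe" example further below:
      import pandas as pd

      df = pd.DataFrame({"name": ['Amy', 'Jeff', 'Phoebe'],
                         "city": ['San Francisco', 'San Francisco', 'Denver'],
                         "score": [75, 92, 94]})

      print(type(df))                   # <class 'pandas.core.frame.DataFrame'>
      print(type(df['score']))          # <class 'pandas.core.series.Series'>
      print(df.score)                   # dot notation works because 'score' has no spaces
      print(df[['name', 'score']])      # double brackets return a dataframe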
  • Adding a column:
    • df["True or False"] = [True, True, True, True, False]
      to create a column with True / False values by specifying them directly; the list must have one value per row
    • df["density"] = df["population"] / df["area"] * 1000000
      to create a column whose values are computed from other columns' values
      Example:
       # create a new column named "name_city"
       df['name_city'] = df['name'] + '_' + df['city']
      

      We could also use the apply, map, or applymap functions, or a for loop, which will be explained later [see 22 May post]. A runnable sketch of these column examples follows below.
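    A runnable sketch of adding columns, again reusing the made-up scores data; the 'passed' and 'score_pct' column names are invented for illustration:
      import pandas as pd

      df = pd.DataFrame({"name": ['Amy', 'Jeff', 'Phoebe'],
                         "city": ['San Francisco', 'San Francisco', 'Denver'],
                         "score": [75, 92, 94]})

      df["passed"] = [True, True, True]                # one value per row
      df["score_pct"] = df["score"] / 100              # computed from another column's values
      df["name_city"] = df["name"] + '_' + df["city"]  # join two text columns with an underscore
      print(df)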

  • Creating a dataframe
    • using a dictionary of keys and values
        import pandas as pd

        # create a df using a dictionary of keys and values
        scores = {"name": ['Amy', 'Jeff', 'Phoebe'],
                  "city": ['San Francisco', 'San Francisco', 'Denver'],
                  "score": [75, 92, 94]}
        df = pd.DataFrame(scores)

        # show the 'score' column
        display(df['score'])
      
        # column names as keys, column values as lists, row names in index
        df3 = pd.DataFrame({'PSA_Container_1': [10, 103, 55],
                            'PSA_Container_2': [37, 19, 17]},
                           index=['Beef', 'Fish', 'Spinach'])
        print(df3)

        # show the row name with the maximum value in column 1 and column 2
        print(df3.idxmax())
      
    • using df=pd.DataFrame(np.random.randn( ))
        import numpy as np

        # generate 20 rows, 5 columns of random numbers
        df1 = pd.DataFrame(np.random.randn(20, 5))
        print(df1)
      

Thoughts

This is just an introduction to Pandas. I will post more Pandas functions in my future posts.

The above functions are really useful in helping us to get an initial overview of our data and understand the data we are working with. I cannot wait to use them on the datasets in my projects when doing data analysis with Pandas!

Additional Resource(s):

  1. Kaggle is a great platform to retrieve sample datasets to practise on.

  2. How to read and write files with Pandas
