Pandas - Exploratory Data Analysis:
Graphs and Charts
- df.plot( );
- df.plot.area( );
- df.hist( );
- df[[‘columnname1’,’columnname2’]].plot.hist( );
- df.plot.scatter(x=’columnname1’, y=’columnname2’, color=df[‘colors’]);
- scatter_matrix(df);
- pd.crosstab( ).plot.bar( )
- df[‘columnname’].value_counts( ).plot.pie( )
Correlations and statistical functions
Measures of Central Tendency
- df.mean( )
- df.median( )
- df.mode( )
Variance Measures
- df.std( )
- df.boxplot( );
Relationship between variables
- df.corr( ) or df.corr( ).style.background_gradient( )
What I learnt:
Exploratory Data Analysis
Graphs and Charts
Example dataframe:
'''to create a time series data frame with two randomly-generated values per period.'''
daterange = pd.period_range('1/1/1950', freq='1d', periods=50)
date_df = pd.DataFrame(data=daterange,columns=['day'])
date_df['value1'] = np.random.randint(45,65,size=(len(date_df)))
date_df['value2'] = np.random.randint(25,35,size=(len(date_df)))
date_df.head(3)
- df.plot( ); to see a graph with two lines spanning the periods in our dataset. An auto-generated legend is placed on our visual as well. Use the semi-colon at the end of the function to prevent messy code output from being printed out along with our graph.
ax = date_df.plot();
- df.plot.area( ); to generate an area graph which is similar to our line chart above, but it’s filled (with different colours).
Note, a stacked area is an option too using stacked=True, so our lines can be stacked one upon the other.date_df.plot.area(stacked=True);
Example dataframe:
iris = pd.read_csv('iris.csv')
-
df.hist( ); shows a histogram for each variable in the iris dataset. This is the type of visual that can really speed along your process by calling out trends in a clean visual manner.
- df[[‘columnname1’,’columnname2’]].plot.hist( ); to overlay the histograms of our variables on the same axis. When you overlay visuals, we’ll want to increase the transparency by altering the alpha parameter.
iris[['sepal_width','sepal_length']].plot.hist(alpha=0.5);
or
df_cars.mpg.plot.hist(bins=10)
- df.plot.scatter(x=’columnname1’, y=’columnname2’, color=df[‘colors’]); to generate a scatter plot.
'''use .map to translate each species to a color for our plot''' # assign each unique value (type of species e.g. versicolor, setosa, virginica) in our column ('species') to a colour using dictionary colors = {"versicolor":"red","setosa":"green","virginica":"blue"} # in the column ('species') with the values (type of species), map all values to the corresponding colour, into a new column called 'colors' iris['colors'] = iris['species'].map(colors) '''designate sepal width for the x-axis and sepal length for the y-axis''' '''plot each iris observation and assign a color corresponding to the species''' iris.plot.scatter(x='sepal_width', y='sepal_length', color=iris['colors']);
- scatter_matrix(df); is a scatter matrix that generates scatter plots showing the relationship between each of your variables, and along the diagonal, you will see histograms showing how your observations are distributed. The figsize parameter allows you to better fit your visuals to the space available on your screen.
from pandas.plotting import scatter_matrix scatter_matrix(iris,figsize=(15, 9),);
- pd.crosstab( ).plot.bar( ) to create a bar chart
- pd.crosstab(df[‘columnname’,columns=’Count’]).plot.bar( )
if using x-axis as df[‘columnname’] i.e. only one column from the df, we can use y-axis to show the number of counts of the unique values in the column & we could give it a name in legend called ‘Count’'''Create crosstab for cyl (cylinder) column and columns as count''' crosstab_cyl = pd.crosstab(df_cars.cyl, columns='Count') crosstab_cyl.plot.bar()
- pd.crosstab(df[‘columnname1’],df[‘columnname2’]).plot.bar( )
we can also use x-axis as df[‘columnname1’] & y-axis as df[‘columnname2’] to show the number of counts of the unique values in the 2 columns'''Create crosstab for cyl (cylinder) column and and am and create a bar plot''' crosstab_cyl = pd.crosstab(df_cars.cyl, df_cars.am) crosstab_cyl.plot.bar()
- we can even make them stacked using stacked=True
crosstab_cyl.plot.bar(stacked=True)
- pd.crosstab(df[‘columnname’,columns=’Count’]).plot.bar( )
- df[‘columnname’].value_counts( ).plot.pie( ) to create a pie chart with the count of the unique values in the column
df_cars.cyl.value_counts().plot.pie()
Correlations and statistical functions
- Measures of Central Tendency
- df.mean( ) computes the average for each of our numeric variables
- df.median( ) to return the median value
- df.mode( ) to return the most frequent observations for each of our variables. A data set can have multiple modes per variable, the output returns the mode at index zero, but in the case of a tie, it will add additional rows to represent that.
- Variance Measures
- df.std( ) tell you the extent to which these variables deviate from their mean
- df.boxplot( ); here, the rectangles represent the extent from the first quartile to the third quartile in the data. One quartile represents one fourth of your distribution. The whiskers extending beyond are meant to show the range of the data. Dots exceeding the whiskers are considered outliers.
- Relationship between variables
- df.corr( ) allows us to see the correlation between any two variables. A positive number demonstrates that as one variable increases, the other tends to as well, while negative shows the opposite. The closer to one or negative one, the stronger the relationship. A.k.a. the correlation matrix, it is often a good first step before proceeding to machine learning or statistical modeling.
- To make this even easier to interpret, we can create a heat map from this matrix. Call df.corr( ).style.background_gradient( ) on our correlation matrix. Now with color, it’s very clear how strongly and in which direction our variables are correlated.
iris.corr().style.background_gradient(cmap='RdYlGn', axis=None)
Additional Resource(s):
LinkedIn Learning - Pandas Essential Training | LinkedIn Learning - Advanced Pandas