- Measures of Central Tendency
- mean, mode, median
- Measures of Distribution and Density
- min, max
- Skewness and Kurtosis
- df[col].skew( )
- df[col].kurt( )
- Measures of Variance
- Range
- Percentiles and Quartiles
- Outliers
- Variance and Standard Deviation
- Z Score
- Summarizing Data Distribution in Python
- .describe( )
These are my notes from Statistics Fundamentals.
What I learnt:
Measures of Central Tendency, Distribution and Density
Measures of Central Tendency: mean, mode, median
.mean( )
Refers to the average value of the dataset.
import pandas as pd
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000]})
print (df['Salary'].mean())
.mode( )
Indicates the most frequently occurring value. The dataset could have more than one value as the mode.
If there are two values with the highest frequency, the dataset is bimodal. When there is more than one mode value, the data is considered multimodal.
import pandas as pd
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000]})
print (df['Salary'].mode())
.median( )
To calculate the median, sort the values into ascending order and then find the middle-most value.
import pandas as pd
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000]})
print (df['Salary'].median())
Measures of Distribution and Density: min, max
import pandas as pd
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000]})
print ('Mean: ' + str(df['Salary'].mean()))
print ('Mode: ' + str(df['Salary'].mode()[0]))
print ('Median: ' + str(df['Salary'].median()))
print ('Min: ' + str(df['Salary'].min()))
print ('Max: ' + str(df['Salary'].max()))
The histogram shows the relative frequency of each salary band, based on the number of bins.
It also gives us a sense of the density of the data for each point on the salary scale.
With enough data points, and small enough bins, we could view this density as a line that shows the shape of the data distribution.
Example 1:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
'''create dataframe'''
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000]})
salary = df['Salary']
'''plot chart'''
salary.plot.hist(title='Salary Distribution', color='lightblue', bins=25)
'''dashed lines on graph for mean and median values'''
plt.axvline(salary.mean(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(salary.median(), color='green', linestyle='dashed', linewidth=2)
plt.show()
To show the density of the salary data as a line on top of the histogram, use stats.gaussian_kde( ):
'''plot chart'''
'''show the density of the salary data as a line on top of the histogram'''
density = stats.gaussian_kde(salary)
n, x, _ = plt.hist(salary, histtype='step', density=True, bins=25)
plt.plot(x, density(x)*5)
'''dashed lines on graph for mean and median values'''
plt.axvline(salary.mean(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(salary.median(), color='green', linestyle='dashed', linewidth=2)
plt.show()
Example 2:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
'''create dataframe'''
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Hours':[41,40,36,30,35,39,40]})
hours = df['Hours']
'''show the density of the data as a line on top of the histogram'''
density = stats.gaussian_kde(hours)
n, x, _ = plt.hist(hours, histtype='step', density=True, bins=25)
plt.plot(x, density(x)*7)
'''dashed lines on graph for mean and median values'''
plt.axvline(hours.mean(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(hours.median(), color='green', linestyle='dashed', linewidth=2)
plt.show()
Python Histograms | gaussian_kde( )
Skewness and Kurtosis
We can measure skewness (in which direction the data is skewed and to what degree) using df[col].skew( ) and kurtosis (how “peaked” the data is) using df[col].kurt( ) .
%matplotlib inline
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
'''create dataframe'''
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,30,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
numcols = ['Salary', 'Hours', 'Grade']
for col in numcols:
'''find skewness and kurtosis'''
print(df[col].name + ' skewness: ' + str(df[col].skew()))
print(df[col].name + ' kurtosis: ' + str(df[col].kurt()))
'''show the density of the data as a line on top of the histogram'''
density = stats.gaussian_kde(df[col])
n, x, _ = plt.hist(df[col], histtype='step', density=True, bins=25)
plt.plot(x, density(x)*6)
plt.show()
print('\n')
Now let’s look at the distribution of a real dataset - let’s see how the heights of the father’s measured in Galton’s study of parent and child heights are distributed:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
'''import dataset from statesmodels.api'''
df = sm.datasets.get_rdataset('GaltonFamilies', package='HistData').data
fathers = df['father']
'''show the density of the data as a line on top of the histogram'''
density = stats.gaussian_kde(fathers)
n, x, _ = plt.hist(fathers, histtype='step', density=True, bins=50)
plt.plot(x, density(x)*2.5)
plt.axvline(fathers.mean(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(fathers.median(), color='green', linestyle='dashed', linewidth=2)
plt.show()
Measures of Variance
Range
Range: identifies the difference between the lowest and highest values using df[col].max() - df[col].min()
# create a single Pandas dataframe for our school graduate data
import pandas as pd
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,30,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
numcols = ['Salary', 'Hours', 'Grade']
# calculate the range for each of the numeric features
for col in numcols:
print(df[col].name + ' range: ' + str(df[col].max() - df[col].min()))
Percentiles and Quartiles
Percentiles - strict (< or >), weak (<= or >=), rank (grouped percentiles for repeated values)
A percentile tells us where a given value is ranked in the overall distribution.
For example, 25% of the data in a distribution has a value lower than the 25th percentile; 75% of the data has a value lower than the 75th percentile, and so on.
Note that half the data has a value lower than the 50th percentile - so the 50th percentile is also the median!
Example:
import pandas as pd
from scipy import stats
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,30,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
Let’s examine Frederic’s grade using this approach. We know he scored 57, but how does he rank compared to his fellow students?
There are 7 students in total, and 5 of them scored < Frederic; so we can calculate the percentile for Frederic’s grade like this:
5/7×100≈71.4
So Frederic’s score puts him at the 71.4th percentile in his class.
Use .percentileofscore( ) in the scipy.stats package to calculate the percentile for a given value in a set of values:
print(stats.percentileofscore(df['Grade'], 57, 'strict'))
We’ve used the strict definition of percentile; but sometimes it’s calculated as being the percentage of values that are <= to the value you’re comparing. In this case, the calculation for Frederic’s percentile would include his own score:
6/7×100≈85.7
You can calculate this way by using the weak mode of .percentileofscore( ):
print(stats.percentileofscore(df['Grade'], 57, 'weak'))
Dan, Joann, and Ethan scored the same grade (50), so in a sense they share a percentile.
To deal with this grouped scenario, we can average the percentage rankings for the matching scores. We treat half of the scores matching the one we’re ranking as if they are below it, and half as if they are above it.
In this case, there were three matching scores of 50, and for each of these we calculate the percentile as if 1 was below and 1 was above. So the calculation for a percentile for Joann based on scores being less than or equal to 50 is:
(4/7)×100≈57.14
The value of 4 consists of the two scores that are below Joann’s score of 50, Joann’s own score, and half of the scores that are the same as Joann’s (of which there are two, so we count one).
In Python, calculate grouped percentiles like this using the rank mode of percentileofscore( ):
print(stats.percentileofscore(df['Grade'], 50, 'rank'))
Quartiles - we can consider the overall spread of the data by dividing those percentiles into four quartiles (minimum to the 25th percentile, 25th percentile to the 50th percentile (which is the median), 50th percentile to the 75th percentile, 75th percentile to the maximum).
Use .quantile( ) to find the threshold values at the 25th, 50th, and 75th percentiles.
# Quartiles
import pandas as pd
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,17,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
'''find the quartile thresholds for the weekly hours worked by the students'''
print(df['Hours'].quantile([0.25, 0.5, 0.75]))
'''create a box plot for the weekly hours'''
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,30,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
# Plot a box-whisker chart
df['Hours'].plot(kind='box', title='Weekly Hours Distribution', figsize=(10,8))
plt.show()
The box plot consists of:
- A rectangular box that shows where the data between the 25th and 75th percentile (the second and third quartile) lie. A.k.a. the interquartile range.
- Whiskers that extend to show the full range of the data.
- A line in the box that shows the location of the median (the 50th percentile).
Outliers
An outlier is a value that is so far from the center of the distribution compared to other values that it skews the distribution by affecting the mean. There are all sorts of reasons that you might have outliers in your data, including data entry errors, failures in sensors or data-generating equipment, or genuinely anomalous values.
'''create a box plot for the salaries earned by classmates with an outlier'''
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,30,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
# Plot a box-whisker chart
df['Salary'].plot(kind='box', title='Salary Distribution', figsize=(10,8))
plt.show()
Let’s see what the distribution of the data looks like without the outlier using showfliers=False:
'''create a box plot for the salaries earned by classmates with an outlier removed'''
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,17,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
# Plot a box-whisker chart
df['Salary'].plot(kind='box', title='Salary Distribution', figsize=(10,8), showfliers=False)
plt.show()
Let’s take a look at the distribution of final grades:
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,17,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
# Plot a box-whisker chart
df['Grade'].plot(kind='box', title='Grade Distribution', figsize=(10,8))
plt.show()
The reason that the low and high scores that look like outliers might just be because we have so few data points.
With more data, there are some more high and low scores; so we no longer consider the isolated cases to be outliers.
We need to really understand the data and what we’re trying to do with it, and ensure that we have a reasonable sample size, before determining what to do with outlier values.
Variance and Standard Deviation
Now it’s time to look at how to measure the amount of variance in the data.
Variance
The higher the variance, the more spread your data is around the mean.
We can use .var( ) to calculate the variance of a column in a dataframe
import pandas as pd
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,17,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
print(df['Grade'].var())
Standard Deviation
To calculate the variance, we squared the difference of each value from the mean.
However, this means that the variance is not in the same unit of measurement as our data. To get the measure of variance back into the same unit of measurement, we need to find its square root, which is the standard deviation for our data.
We can use .std( ) to calculate the standard deviation of a column in a dataframe.
import pandas as pd
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,17,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
print(df['Grade'].std())
Standard Deviation in a Normal Distribution
In statistics and data science, we spend a lot of time considering normal distributions; because they occur so frequently. The standard deviation has an important relationship to play in a normal distribution.
Run the following cell to show a histogram of a standard normal distribution (which is a distribution with a mean of 0 and a standard deviation of 1):
import pandas as pd
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,17,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
print(df['Grade'].std())
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
# Create a random standard normal distribution
df = pd.DataFrame(np.random.randn(100000, 1), columns=['Grade'])
# Plot the distribution as a histogram with a density curve
grade = df['Grade']
density = stats.gaussian_kde(grade)
n, x, _ = plt.hist(grade, color='lightgrey', density=True, bins=100)
plt.plot(x, density(x))
# Get the mean and standard deviation
s = df['Grade'].std()
m = df['Grade'].mean()
# Annotate 1 stdev
x1 = [m-s, m+s]
y1 = [0.25, 0.25]
plt.plot(x1,y1, color='magenta')
plt.annotate('1s (68.26%)', (x1[1],y1[1]))
# Annotate 2 stdevs
x2 = [m-(s*2), m+(s*2)]
y2 = [0.05, 0.05]
plt.plot(x2,y2, color='green')
plt.annotate('2s (95.45%)', (x2[1],y2[1]))
# Annotate 3 stdevs
x3 = [m-(s*3), m+(s*3)]
y3 = [0.005, 0.005]
plt.plot(x3,y3, color='orange')
plt.annotate('3s (99.73%)', (x3[1],y3[1]))
# Show the location of the mean
plt.axvline(grade.mean(), color='grey', linestyle='dashed', linewidth=1)
# Show chart
plt.show()
The horizontal colored lines show the percentage of data within 1, 2, and 3 standard deviations of the mean (plus or minus).
In any normal distribution:
- Approx 68.26% of values fall within one standard deviation from the mean.
- Approx 95.45% of values fall within two standard deviations from the mean.
- Approx 99.73% of values fall within three standard deviations from the mean.
Z Score
How do we know how many standard deviations above or below the mean a particular value is?
We call this a Z Score.
Summarizing Data Distribution in Python
We can also use .describe( ) to retrieve summary statistics for all numeric columns in a dataframe. These summary statistics include many of the statistics we’ve examined.
import pandas as pd
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic'],
'Salary':[50000,54000,50000,189000,55000,40000,59000],
'Hours':[41,40,36,17,35,39,40],
'Grade':[50,50,46,95,50,5,57]})
df.describe()