Week 15 - Machine Learning - (Supervised Learning) Applying Linear Regression to automobile prices from Principles of M.L. Python by Microsoft Learning

  • Linear regression using scikit-learn
  • Use categorical data
  • Apply transformations to features and labels to improve model performance
  • Compare regression models to improve model performance
  • Prepare the model matrix
    • Create dummy variables from categorical features
    • Concatenate the numeric features
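
  A minimal Python sketch of this step, using a small hypothetical auto_prices frame (the course notebook uses the full prepared automobile dataset, so the column names here are illustrative):

      import numpy as np
      import pandas as pd

      # Hypothetical slice of the automobile data.
      auto_prices = pd.DataFrame({
          'fuel_type':   ['gas', 'diesel', 'gas', 'gas'],
          'body_style':  ['sedan', 'sedan', 'hatchback', 'wagon'],
          'curb_weight': [2548, 2823, 2337, 2824],
          'horsepower':  [111, 154, 102, 115],
          'price':       [13495, 16500, 13950, 17450]})

      # Dummy (one-hot) variables from the categorical features.
      dummies = pd.get_dummies(auto_prices[['fuel_type', 'body_style']]).to_numpy(dtype=float)

      # Concatenate the numeric features to complete the model matrix.
      numeric = auto_prices[['curb_weight', 'horsepower']].to_numpy(dtype=float)
      Features = np.concatenate([dummies, numeric], axis=1)
      labels = auto_prices['price'].to_numpy(dtype=float)
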
  • Data preparation
    • Split the dataset
      • Create independently sampled training and test datasets
      • .train_test_split( ) to perform Bernoulli sampling
    • Scale numeric features
      • Z-Score normalization
        • .StandardScaler( ) fit on the training data, then .transform( ) applied to the training and test features
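
  Continuing the sketch above (Features, labels), one way to split and scale, assuming the last two columns of the model matrix are the numeric features:

      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler

      # Independently sampled training and test sets.
      X_train, X_test, y_train, y_test = train_test_split(
          Features, labels, test_size=0.25, random_state=9988)

      # Z-Score normalization: fit the scaler on the training features only,
      # then apply the same transform to both training and test features.
      scaler = StandardScaler().fit(X_train[:, -2:])
      X_train[:, -2:] = scaler.transform(X_train[:, -2:])
      X_test[:, -2:] = scaler.transform(X_test[:, -2:])
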
  • Construct the regression model
    • Instantiate and fit the model
      • .LinearRegression( ) & .fit( )
      • print the intercept term and model coefficients
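
  Continuing the same sketch, instantiating and fitting the model:

      from sklearn.linear_model import LinearRegression

      # Instantiate and fit the linear regression model on the training data.
      lin_mod = LinearRegression().fit(X_train, y_train)

      # The intercept term and the model coefficients.
      print('Intercept    =', lin_mod.intercept_)
      print('Coefficients =', lin_mod.coef_)
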
  • Evaluate model performance
    • Plot the predicted values computed from the training features
      • The predict method is applied to the model with the training data
    • Using the test dataset
    • Choose & compute performance metrics (see the sketch at the end of this list)
      • Mean squared error or MSE
      • Root mean squared error or RMSE
      • Mean absolute error or MAE
      • Median absolute error
      • R squared or R², also known as the coefficient of determination
      • Adjusted R squared
    • Residuals (errors) of the model should be Normally distributed
      • Plot a histogram and kernel density estimate of the residuals of the regression model
      • Quantile-Quantile Normal plot, or Q-Q Normal plot
        • If the residuals are approximately Normally distributed, the points will fall nearly along the straight line
      • Residual plot
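
  A sketch of computing the metrics above with scikit-learn, using hypothetical true and predicted test values (in the notebook these come from lin_mod.predict(X_test)); the adjusted R squared line is the standard formula, not a scikit-learn function:

      import numpy as np
      import sklearn.metrics as sklm

      # Hypothetical true test labels and model predictions (prices).
      y_test  = np.array([13495., 16500., 13950., 17450., 15250., 6295.])
      y_score = np.array([13910., 15870., 14480., 16995., 14760., 7020.])

      n, p = len(y_test), 2        # test cases and (assumed) number of model features
      mse = sklm.mean_squared_error(y_test, y_score)
      r2 = sklm.r2_score(y_test, y_score)

      print('Mean squared error      =', mse)
      print('Root mean squared error =', np.sqrt(mse))
      print('Mean absolute error     =', sklm.mean_absolute_error(y_test, y_score))
      print('Median absolute error   =', sklm.median_absolute_error(y_test, y_score))
      print('R squared               =', r2)
      # Adjusted R squared penalizes R squared for the number of model features;
      # it is only meaningful when n is comfortably larger than p.
      print('Adjusted R squared      =', 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1))
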

Week 15 - Machine Learning - (Supervised Learning) Classification from Principles of M.L. Python by Microsoft Learning

  • Logistic Regression
    • 2-class classification
    • A linear regression model with a non-linear transformation of its output
    • Response is binary, {0,1}, or positive and negative; the response is the predicted category
    • Response of the linear model is transformed, or ‘squashed’, to a value between 0 and 1 by the logistic (sigmoid) function (sketched below)
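
  A small sketch of the ‘squashing’ idea (the standard logistic function, not course code):

      import numpy as np

      def logistic(z):
          """Logistic (sigmoid) function: maps the linear response z into (0, 1),
          so it can be read as the probability of the positive class."""
          return 1.0 / (1.0 + np.exp(-z))

      # Strongly negative responses map near 0, strongly positive ones near 1.
      print(logistic(np.array([-5.0, 0.0, 5.0])))   # approx. [0.007, 0.5, 0.993]
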
  • Prepare data for scikit-learn model
    • In the dataframe there is usually 1 identifier column and 1 label column (the output values); the remaining columns are features (the input values)
    • Create a numpy array of the label values
    • Create the model matrix
      • Categorical variables need to be recoded as binary dummy variables
    • Numeric features must be concatenated to the numpy array
    • Split the cases into training and test data sets
    • Numeric features must be rescaled
  • Construct the logistic regression model
    • Compute the logistic regression model.
      • .LogisticRegression( )
      • Fit the linear model
    • Model coefficients
    • Compute a sample of class probabilities for the test feature set
      • Class with the highest probability is taken as the score (prediction)
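
  A minimal sketch with synthetic 2-class data standing in for the prepared German credit features:

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      # Synthetic stand-in for the scaled credit features and binary labels.
      rng = np.random.default_rng(345)
      X_train = rng.normal(size=(200, 2))
      y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] + rng.normal(size=200) > 0).astype(int)
      X_test = rng.normal(size=(10, 2))

      # Compute (fit) the logistic regression model and inspect its coefficients.
      logistic_mod = LogisticRegression().fit(X_train, y_train)
      print('Intercept    =', logistic_mod.intercept_)
      print('Coefficients =', logistic_mod.coef_)

      # Class probabilities for the test feature set: each row is
      # [P(class 0), P(class 1)]; the larger one gives the predicted class.
      probabilities = logistic_mod.predict_proba(X_test)
      print(probabilities[:5, :])
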
  • Score and evaluate the classification model
    • Transform computed class probabilities into actual class scores
      • Apply a threshold (e.g. 0.5) to the class probabilities of the test data to assign each case a score
      • Obtain results of the test data
    • Quantify the performance of the model
      • Always use multiple metrics to evaluate the performance of any machine learning model (see the sketch after this list)
        • Confusion matrix
        • Accuracy
        • Precision
        • Recall
        • F1
        • ROC and AUC
    • Class imbalance:
      a problem in machine learning where the total number of cases of one class (e.g. positive) is far smaller than the total number of cases of another class (e.g. negative)
      (e.g. two positive cases for each negative case).
    • Naive ‘classifier’ that sets all cases to positive
      • Compute ROC & AUC and compare
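
  Continuing the sketch above (probabilities, X_test), with hypothetical true test labels, one way to score and evaluate:

      import numpy as np
      import sklearn.metrics as sklm

      # Turn class probabilities into class scores with a 0.5 threshold.
      scores = np.where(probabilities[:, 1] > 0.5, 1, 0)

      # Hypothetical true labels for the 10 synthetic test cases.
      y_test = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])

      # Look at several metrics, never just one.
      print('Confusion matrix:\n', sklm.confusion_matrix(y_test, scores))
      print('Accuracy  =', sklm.accuracy_score(y_test, scores))
      print('Precision =', sklm.precision_score(y_test, scores))
      print('Recall    =', sklm.recall_score(y_test, scores))
      print('F1        =', sklm.f1_score(y_test, scores))
      # ROC and AUC are computed from the probabilities, not the hard scores.
      print('AUC       =', sklm.roc_auc_score(y_test, probabilities[:, 1]))
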
  • Compute a weighted model
    • Helps overcome class imbalance problem
    • Weight the classes when computing the logistic regression model
      • .LogisticRegression(class_weight = … )
    • Compute class probabilities for the test feature set
    • Compute the scores and metrics to check whether the weighted model differs significantly from the unweighted model
      • Look at accuracy, ROC & AUC
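
  A sketch of the weighted model, continuing from the synthetic data above; the weights here are purely illustrative (class_weight='balanced' is another option):

      import sklearn.metrics as sklm
      from sklearn.linear_model import LogisticRegression

      # Give one class more weight than the other when fitting the model.
      weighted_mod = LogisticRegression(class_weight={0: 0.6, 1: 0.4}).fit(X_train, y_train)

      # Probabilities for the test set, to be scored and compared with the
      # unweighted model using accuracy, ROC and AUC.
      weighted_probs = weighted_mod.predict_proba(X_test)
      print('Weighted AUC =', sklm.roc_auc_score(y_test, weighted_probs[:, 1]))
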
  • Scoring threshold can be adjusted
    • Helps overcome the problem of asymmetric cost of misclassification to the bank
    • To tip the model scoring toward correctly identifying the bad credit cases
    • Scoring and evaluating the model for a given threshold value
    • Exactly which threshold to pick is a business decision
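
  A sketch of scoring at different thresholds, continuing from weighted_probs and y_test above:

      import numpy as np
      import sklearn.metrics as sklm

      def score_model(probs, threshold):
          """Assign class 1 whenever its probability exceeds the threshold."""
          return np.where(probs[:, 1] > threshold, 1, 0)

      # Lowering the threshold scores more cases as class 1; in the course the
      # threshold is shifted toward correctly identifying the bad-credit cases.
      for threshold in [0.5, 0.45, 0.4, 0.35, 0.3]:
          scores = score_model(weighted_probs, threshold)
          print('threshold =', threshold,
                '| accuracy =', round(sklm.accuracy_score(y_test, scores), 3),
                '| recall =', round(sklm.recall_score(y_test, scores), 3))
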

Week 15 - Machine Learning - (Supervised Learning) Introduction to Regression from Principles of M.L. Python by Microsoft Learning

  • Linear regression using scikit-learn
  • Overview of linear regression
  • Data preparation
    • Split the dataset
      • Create independently sampled training and test datasets
      • .train_test_split( ) to perform Bernoulli sampling
    • Scale numeric features
      • Min-Max normalization
      • Z-Score normalization
        • .StandardScaler( ) fit on the training data, then .transform( ) applied to the training and test features
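
  A sketch contrasting the two normalizations on hypothetical numeric features; both scalers are fit on the training data and reused on the test data:

      import numpy as np
      from sklearn.preprocessing import MinMaxScaler, StandardScaler

      # Hypothetical numeric feature columns (e.g. curb weight and horsepower).
      X_train = np.array([[2548., 111.], [2823., 154.], [2337., 102.], [2824., 115.]])
      X_test  = np.array([[2507., 110.]])

      # Min-Max normalization: each feature rescaled to the [0, 1] range.
      minmax = MinMaxScaler().fit(X_train)
      print(minmax.transform(X_test))

      # Z-Score normalization: each feature rescaled to zero mean, unit variance.
      zscore = StandardScaler().fit(X_train)
      print(zscore.transform(X_test))
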
  • Train the regression model
    • Instantiate and fit the model
      • .LinearRegression( ) & .fit( )
      • print the model coefficients
    • Plot the predicted values computed from the training features
      • The predict method is applied to the model with the training data
  • Evaluate model performance
    • Using the test dataset
    • Choose & compute performance metrics
      • Mean squared error or MSE
      • Root mean squared error or RMSE
      • Mean absolute error or MAE
      • Median absolute error
      • R squared or R², also known as the coefficient of determination
      • Adjusted R squared
    • Residuals (errors) of the model should be Normally distributed
      • Plot a histogram and kernel density estimate of the residuals of the regression model
      • Quantile-Quantile Normal plot, or Q-Q Normal plot
        • If the residuals are approximately Normally distributed, the points will fall nearly along the straight line
      • Residual plot
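
  A sketch of the residual diagnostics with hypothetical residuals and predictions (in the notebook these come from the fitted model and the test data):

      import numpy as np
      import matplotlib.pyplot as plt
      import seaborn as sns
      import scipy.stats as ss

      # Hypothetical residuals (true minus predicted) and predicted values.
      rng = np.random.default_rng(2345)
      resids = rng.normal(scale=1500.0, size=200)
      predicted = rng.normal(loc=13000.0, scale=4000.0, size=200)

      # Histogram with a kernel density estimate of the residuals.
      sns.histplot(resids, kde=True)
      plt.xlabel('Residual value')
      plt.show()

      # Q-Q Normal plot: roughly Normal residuals fall close to the straight line.
      ss.probplot(resids, plot=plt)
      plt.show()

      # Residual plot: residuals vs. predicted values should show no clear pattern.
      plt.scatter(predicted, resids)
      plt.axhline(0.0, color='red')
      plt.xlabel('Predicted value')
      plt.ylabel('Residual')
      plt.show()
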

Week 14 - Machine Learning - Data Preparation for automobile prices & German bank credit from Principles of M.L. Python by Microsoft Learning

  • Data Preparation for Machine Learning (a pandas sketch follows this list)
  • Recode column names containing special characters
  • Treat missing values
  • Transform column data types
  • Feature engineering
    • Aggregating categories to reduce the number of categories
    • Transforming numeric variables to improve their distribution properties
    • Compute new features from two or more existing features
  • Remove duplicates
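
  A compact pandas sketch of several of these steps on a hypothetical raw automobile frame (column names and the '?' missing-value code are illustrative):

      import numpy as np
      import pandas as pd

      # Hypothetical raw data with the kinds of problems listed above.
      raw = pd.DataFrame({
          'fuel-type':    ['gas', 'diesel', 'gas', 'gas'],
          'num-of-doors': ['two', 'four', '?', 'two'],
          'price':        ['13495', '16500', '?', '13495'],
          'city-mpg':     [21, 24, 19, 21],
          'highway-mpg':  [27, 30, 25, 27]})

      # Recode column names containing special characters.
      raw.columns = [col.replace('-', '_') for col in raw.columns]

      # Treat missing values (coded here as '?') and transform column data types.
      raw = raw.replace('?', np.nan)
      raw['price'] = pd.to_numeric(raw['price'])
      raw = raw.dropna(subset=['price'])

      # Feature engineering: a log transform to improve the label's distribution
      # and a new feature computed from two existing features.
      raw['log_price'] = np.log(raw['price'])
      raw['avg_mpg'] = (raw['city_mpg'] + raw['highway_mpg']) / 2

      # Remove duplicate rows.
      raw = raw.drop_duplicates()
      print(raw)
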