Week 15 - Machine Learning - (Supervised Learning) Applying Linear Regression to automobile prices from Principles of M.L. Python by Microsoft Learning

  • Linear regression using scikit-learn
  • Use categorical data
  • Apply transformations to features and labels to improve model performance
  • Compare regression models to improve model performance
  • Prepare the model matrix
    • Create dummy variables from categorical features
    • Concatenate the numeric features
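
  A minimal Python sketch of this step, using a small hypothetical auto_prices frame (the course notebook uses the full prepared automobile dataset, so the column names here are illustrative):

      import numpy as np
      import pandas as pd

      # Hypothetical slice of the automobile data.
      auto_prices = pd.DataFrame({
          'fuel_type':   ['gas', 'diesel', 'gas', 'gas'],
          'body_style':  ['sedan', 'sedan', 'hatchback', 'wagon'],
          'curb_weight': [2548, 2823, 2337, 2824],
          'horsepower':  [111, 154, 102, 115],
          'price':       [13495, 16500, 13950, 17450]})

      # Dummy (one-hot) variables from the categorical features.
      dummies = pd.get_dummies(auto_prices[['fuel_type', 'body_style']]).to_numpy(dtype=float)

      # Concatenate the numeric features to complete the model matrix.
      numeric = auto_prices[['curb_weight', 'horsepower']].to_numpy(dtype=float)
      Features = np.concatenate([dummies, numeric], axis=1)
      labels = auto_prices['price'].to_numpy(dtype=float)
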
  • Data preparation
    • Split the dataset
      • Create independently sampled training and test datasets
      • .train_test_split( ) to perform Bernoulli sampling
    • Scale numeric features
      • Z-Score normalization
        • .StandardScaler( ) fit on the training data, then .transform( ) applied to the training and test features
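
  Continuing the sketch above (Features, labels), one way to split and scale, assuming the last two columns of the model matrix are the numeric features:

      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler

      # Independently sampled training and test sets.
      X_train, X_test, y_train, y_test = train_test_split(
          Features, labels, test_size=0.25, random_state=9988)

      # Z-Score normalization: fit the scaler on the training features only,
      # then apply the same transform to both training and test features.
      scaler = StandardScaler().fit(X_train[:, -2:])
      X_train[:, -2:] = scaler.transform(X_train[:, -2:])
      X_test[:, -2:] = scaler.transform(X_test[:, -2:])
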
  • Construct the regression model
    • Instantiate and fit the model
      • .LinearRegression( ) & .fit( )
      • print the intercept term and model coefficients
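
  Continuing the same sketch, instantiating and fitting the model:

      from sklearn.linear_model import LinearRegression

      # Instantiate and fit the linear regression model on the training data.
      lin_mod = LinearRegression().fit(X_train, y_train)

      # The intercept term and the model coefficients.
      print('Intercept    =', lin_mod.intercept_)
      print('Coefficients =', lin_mod.coef_)
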
  • Evaluate model performance
    • Plot the predicted values computed from the training features
      • The predict method is applied to the model with the training data
    • Using the test dataset
    • Choose & compute performance metrics (see the sketch at the end of this list)
      • Mean squared error or MSE
      • Root mean squared error or RMSE
      • Mean absolute error or MAE
      • Median absolute error
      • R squared or R², also known as the coefficient of determination
      • Adjusted R squared
    • Residuals (errors) of the model should be Normally distributed
      • Plot a histogram and kernel density estimate of the residuals of the regression model
      • Quantile-Quantile Normal plot, or Q-Q Normal plot
        • If the residuals are approximately Normally distributed, the points will fall nearly along the straight line
      • Residual plot
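
  A sketch of computing the metrics above with scikit-learn, using hypothetical true and predicted test values (in the notebook these come from lin_mod.predict(X_test)); the adjusted R squared line is the standard formula, not a scikit-learn function:

      import numpy as np
      import sklearn.metrics as sklm

      # Hypothetical true test labels and model predictions (prices).
      y_test  = np.array([13495., 16500., 13950., 17450., 15250., 6295.])
      y_score = np.array([13910., 15870., 14480., 16995., 14760., 7020.])

      n, p = len(y_test), 2        # test cases and (assumed) number of model features
      mse = sklm.mean_squared_error(y_test, y_score)
      r2 = sklm.r2_score(y_test, y_score)

      print('Mean squared error      =', mse)
      print('Root mean squared error =', np.sqrt(mse))
      print('Mean absolute error     =', sklm.mean_absolute_error(y_test, y_score))
      print('Median absolute error   =', sklm.median_absolute_error(y_test, y_score))
      print('R squared               =', r2)
      # Adjusted R squared penalizes R squared for the number of model features;
      # it is only meaningful when n is comfortably larger than p.
      print('Adjusted R squared      =', 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1))
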

Week 15 - Machine Learning - (Supervised Learning) Classification from Principles of M.L. Python by Microsoft Learning

  • Logistic Regression
    • 2-class classification
    • A linear regression model with a non-linear transformation of its output
    • Response is binary, {0,1}, or positive and negative; the response is the predicted category
    • Response of the linear model is transformed, or ‘squashed’, to a value between 0 and 1 by the logistic (sigmoid) function (sketched below)
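
  A small sketch of the ‘squashing’ idea (the standard logistic function, not course code):

      import numpy as np

      def logistic(z):
          """Logistic (sigmoid) function: maps the linear response z into (0, 1),
          so it can be read as the probability of the positive class."""
          return 1.0 / (1.0 + np.exp(-z))

      # Strongly negative responses map near 0, strongly positive ones near 1.
      print(logistic(np.array([-5.0, 0.0, 5.0])))   # approx. [0.007, 0.5, 0.993]
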
  • Prepare data for scikit-learn model
    • In the dataframe there is usually 1 identifier column and 1 label column (the output values); the remaining columns are features (the input values)
    • Create a numpy array of the label values
    • Create the model matrix
      • Categorical variables need to be recoded as binary dummy variables
    • Numeric features must be concatenated to the numpy array
    • Split the cases into training and test data sets
    • Numeric features must be rescaled
  • Construct the logistic regression model
    • Compute the logistic regression model.
      • .LogisticRegression( )
      • Fit the linear model
    • Model coefficients
    • Compute a sample of class probabilities for the test feature set
      • Class with the highest probability is taken as the score (prediction)
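
  A minimal sketch with synthetic 2-class data standing in for the prepared German credit features:

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      # Synthetic stand-in for the scaled credit features and binary labels.
      rng = np.random.default_rng(345)
      X_train = rng.normal(size=(200, 2))
      y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] + rng.normal(size=200) > 0).astype(int)
      X_test = rng.normal(size=(10, 2))

      # Compute (fit) the logistic regression model and inspect its coefficients.
      logistic_mod = LogisticRegression().fit(X_train, y_train)
      print('Intercept    =', logistic_mod.intercept_)
      print('Coefficients =', logistic_mod.coef_)

      # Class probabilities for the test feature set: each row is
      # [P(class 0), P(class 1)]; the larger one gives the predicted class.
      probabilities = logistic_mod.predict_proba(X_test)
      print(probabilities[:5, :])
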
  • Score and evaluate the classification model
    • Transform computed class probabilities into actual class scores
      • Apply a threshold (e.g. 0.5) to the class probabilities of the test data to assign each case a score
      • Obtain results of the test data
    • Quantify the performance of the model
      • Always use multiple metrics to evaluate the performance of any machine learning model (see the sketch after this list)
        • Confusion matrix
        • Accuracy
        • Precision
        • Recall
        • F1
        • ROC and AUC
    • Class imbalance:
      a problem in machine learning where the total number of cases of one class (e.g. positive) is far smaller than the total number of cases of another class (e.g. negative)
      (e.g. two positive cases for each negative case).
    • Naive ‘classifier’ that sets all cases to positive
      • Compute ROC & AUC and compare
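
  Continuing the sketch above (probabilities, X_test), with hypothetical true test labels, one way to score and evaluate:

      import numpy as np
      import sklearn.metrics as sklm

      # Turn class probabilities into class scores with a 0.5 threshold.
      scores = np.where(probabilities[:, 1] > 0.5, 1, 0)

      # Hypothetical true labels for the 10 synthetic test cases.
      y_test = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])

      # Look at several metrics, never just one.
      print('Confusion matrix:\n', sklm.confusion_matrix(y_test, scores))
      print('Accuracy  =', sklm.accuracy_score(y_test, scores))
      print('Precision =', sklm.precision_score(y_test, scores))
      print('Recall    =', sklm.recall_score(y_test, scores))
      print('F1        =', sklm.f1_score(y_test, scores))
      # ROC and AUC are computed from the probabilities, not the hard scores.
      print('AUC       =', sklm.roc_auc_score(y_test, probabilities[:, 1]))
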
  • Compute a weighted model
    • Helps overcome class imbalance problem
    • Weight the classes when computing the logistic regression model
      • .LogisticRegression(class_weight = … )
    • Compute class probabilities for the test feature set
    • Compute the scores and metrics to check whether the weighted model differs significantly from the unweighted model
      • Look at accuracy, ROC & AUC
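
  A sketch of the weighted model, continuing from the synthetic data above; the weights here are purely illustrative (class_weight='balanced' is another option):

      import sklearn.metrics as sklm
      from sklearn.linear_model import LogisticRegression

      # Give one class more weight than the other when fitting the model.
      weighted_mod = LogisticRegression(class_weight={0: 0.6, 1: 0.4}).fit(X_train, y_train)

      # Probabilities for the test set, to be scored and compared with the
      # unweighted model using accuracy, ROC and AUC.
      weighted_probs = weighted_mod.predict_proba(X_test)
      print('Weighted AUC =', sklm.roc_auc_score(y_test, weighted_probs[:, 1]))
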
  • Scoring threshold can be adjusted
    • Helps overcome the problem of asymmetric cost of misclassification to the bank
    • To tip the model scoring toward correctly identifying the bad credit cases
    • Scoring and evaluating the model for a given threshold value
    • Exactly which threshold to pick is a business decision
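
  A sketch of scoring at different thresholds, continuing from weighted_probs and y_test above:

      import numpy as np
      import sklearn.metrics as sklm

      def score_model(probs, threshold):
          """Assign class 1 whenever its probability exceeds the threshold."""
          return np.where(probs[:, 1] > threshold, 1, 0)

      # Lowering the threshold scores more cases as class 1; in the course the
      # threshold is shifted toward correctly identifying the bad-credit cases.
      for threshold in [0.5, 0.45, 0.4, 0.35, 0.3]:
          scores = score_model(weighted_probs, threshold)
          print('threshold =', threshold,
                '| accuracy =', round(sklm.accuracy_score(y_test, scores), 3),
                '| recall =', round(sklm.recall_score(y_test, scores), 3))
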

Week 15 - Machine Learning - (Supervised Learning) Introduction to Regression from Principles of M.L. Python by Microsoft Learning

  • Linear regression using scikit-learn
  • Overview of linear regression
  • Data preparation
    • Split the dataset
      • Create independently sampled training and test datasets
      • .train_test_split( ) to perform Bernoulli sampling
    • Scale numeric features
      • Min-Max normalization
      • Z-Score normalization
        • .StandardScaler( ) fit on the training data, then .transform( ) applied to the training and test features
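
  A sketch contrasting the two normalizations on hypothetical numeric features; both scalers are fit on the training data and reused on the test data:

      import numpy as np
      from sklearn.preprocessing import MinMaxScaler, StandardScaler

      # Hypothetical numeric feature columns (e.g. curb weight and horsepower).
      X_train = np.array([[2548., 111.], [2823., 154.], [2337., 102.], [2824., 115.]])
      X_test  = np.array([[2507., 110.]])

      # Min-Max normalization: each feature rescaled to the [0, 1] range.
      minmax = MinMaxScaler().fit(X_train)
      print(minmax.transform(X_test))

      # Z-Score normalization: each feature rescaled to zero mean, unit variance.
      zscore = StandardScaler().fit(X_train)
      print(zscore.transform(X_test))
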
  • Train the regression model
    • Instantiate and fit the model
      • .LinearRegression( ) & .fit( )
      • print the model coefficients
    • Plot the predicted values computed from the training features
      • The predict method is applied to the model with the training data
  • Evaluate model performance
    • Using the test dataset
    • Choose & compute performance metrics
      • Mean squared error or MSE
      • Root mean squared error or RMSE
      • Mean absolute error or MAE
      • Median absolute error
      • R squared or R², also known as the coefficient of determination
      • Adjusted R squared
    • Residuals (errors) of the model should be Normally distributed
      • Plot a histogram and kernel density estimate of the residuals of the regression model
      • Quantile-Quantile Normal plot, or Q-Q Normal plot
        • If the residuals are approximately Normally distributed, the points will fall nearly along the straight line
      • Residual plot
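
  A sketch of the residual diagnostics with hypothetical residuals and predictions (in the notebook these come from the fitted model and the test data):

      import numpy as np
      import matplotlib.pyplot as plt
      import seaborn as sns
      import scipy.stats as ss

      # Hypothetical residuals (true minus predicted) and predicted values.
      rng = np.random.default_rng(2345)
      resids = rng.normal(scale=1500.0, size=200)
      predicted = rng.normal(loc=13000.0, scale=4000.0, size=200)

      # Histogram with a kernel density estimate of the residuals.
      sns.histplot(resids, kde=True)
      plt.xlabel('Residual value')
      plt.show()

      # Q-Q Normal plot: roughly Normal residuals fall close to the straight line.
      ss.probplot(resids, plot=plt)
      plt.show()

      # Residual plot: residuals vs. predicted values should show no clear pattern.
      plt.scatter(predicted, resids)
      plt.axhline(0.0, color='red')
      plt.xlabel('Predicted value')
      plt.ylabel('Residual')
      plt.show()
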

Week 14 - Machine Learning - Data Preparation for automobile prices & German bank credit from Principles of M.L. Python by Microsoft Learning

  • Data Preparation for Machine Learning (a pandas sketch follows this list)
  • Recode column names containing special characters
  • Treat missing values
  • Transform column data types
  • Feature engineering
    • Aggregating categories to reduce the number of categories
    • Transforming numeric variables to improve their distribution properties
    • Compute new features from two or more existing features
  • Remove duplicates
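
  A compact pandas sketch of several of these steps on a hypothetical raw automobile frame (column names and the '?' missing-value code are illustrative):

      import numpy as np
      import pandas as pd

      # Hypothetical raw data with the kinds of problems listed above.
      raw = pd.DataFrame({
          'fuel-type':    ['gas', 'diesel', 'gas', 'gas'],
          'num-of-doors': ['two', 'four', '?', 'two'],
          'price':        ['13495', '16500', '?', '13495'],
          'city-mpg':     [21, 24, 19, 21],
          'highway-mpg':  [27, 30, 25, 27]})

      # Recode column names containing special characters.
      raw.columns = [col.replace('-', '_') for col in raw.columns]

      # Treat missing values (coded here as '?') and transform column data types.
      raw = raw.replace('?', np.nan)
      raw['price'] = pd.to_numeric(raw['price'])
      raw = raw.dropna(subset=['price'])

      # Feature engineering: a log transform to improve the label's distribution
      # and a new feature computed from two existing features.
      raw['log_price'] = np.log(raw['price'])
      raw['avg_mpg'] = (raw['city_mpg'] + raw['highway_mpg']) / 2

      # Remove duplicate rows.
      raw = raw.drop_duplicates()
      print(raw)
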