Week 15 - Machine Learning - (Supervised Learning) Classification from Principles of M.L. Python by Microsoft Learning

Liaw Bei Le · June 29, 2021

  • Logistic Regression
    • 2-class classification
    • Linear regression model with a non linear output
    • Response is binary, {0,1} , or positive and negative; response is the prediction of the category
    • Response of the linear model is transformed or ‘squashed’ to values close to 0 and 1 using a logistic function
  • Prepare data for scikit-learn model
    • In the dataframe, usually there’s 1 identifier column, 1 label column (y-axis output values), remaining columns are features (x-axis output values)
    • Create a numpy array of the label values
    • Create the model matrix
      • Categorical variables need to be recoded as binary dummy variables
    • Numeric features must be concatenated to the numpy array
    • Split the cases into training and test data sets
    • Numeric features must be rescaled
  • Construct the logistic regression model
    • Compute the logistic regression model.
      • .LogisticRegression( )
      • Fit the linear model
    • Model coefficients
    • Compute a sample of class probabilities for the test feature set
      • Class with the highest probability is taken as the score (prediction)
  • Score and evaluate the classification model
    • Transform computed class probabilities into actual class scores
      • Set the threshold to the probability of a score of 0 for the test data
      • Obtain results of the test data
    • Quantify the performance of the model
      • Always use multiple metrics to evaluate the performance of any machine learning model
        • Confusion matrix
        • Accuracy
        • Precision
        • Recall
        • F1
        • ROC and AUC
    • Class imbalance:
      problem in machine learning where the total number of a class of data (e.g. +ve) is far less than the total number of another class of data (e.g.-ve).
      (e.g. two positive cases for each negative case).
    • Naive ‘classifier’ that sets all cases to positive
      • Compute ROC & AUC and compare
  • Compute a weighted model
    • Helps overcome class imbalance problem
    • Weight the classes when computing the logistic regression model
      • .LogisticRegression(class_weight = … )
    • Compute class probabilities for the test feature set
    • To find if there is any significant difference with the unweighted model, compute the scores and the metrics
      • Look at accuracy, ROC & AUC
  • Scoring threshold can be adjusted
    • Helps overcome the problem of asymmetric cost of misclassification to the bank
    • To tip the model scoring toward correctly identifying the bad credit cases
    • Scoring and evaluating the model for a given threshold value
    • Exactly which threshold to pick is a business decision

Classification - Mushroom dataset:

  • Recode categorical variables to numeric
  • Obtain numeric features and label values
  • Split the cases into training and test data sets
  • Train a DecisionTreeClassifier with default parameters and random_state=0, then find the 5 most important features
  • Use validation_curve( ) to determine training and test scores for a Support Vector Classifier (SVC) with varying parameter values
  • Finding gamma values corresponding generalization performance on the dataset (underfitting, overfitting, good) based on scores

These are my notes taken from Microsoft Learning’s Principles of Machine Learning in Python - Module 4.

What I learnt:

Classification

Method: Logistic regression.

Goal of regression: Constructing a 2-class classification model and evaluate its performance.

I have reviewed and went through Classification in this link.

Twitter, Facebook