Week 15 - Machine Learning - (Supervised Learning) Classification from Principles of M.L. Python by Microsoft Learning

Liaw Bei Le · June 29, 2021

Logistic Regression
- 2-class classification
- Linear regression model with a non linear output
- Response is binary, {0,1} , or positive and negative; response is the prediction of the category
- Response of the linear model is transformed or ‘squashed’ to values close to 0 and 1 using a logistic function
Prepare data for scikit-learn model
- In the dataframe, usually there’s 1 identifier column, 1 label column (y-axis output values), remaining columns are features (x-axis output values)
- Create a numpy array of the label values
- Create the model matrix
  - Categorical variables need to be recoded as binary dummy variables
- Numeric features must be concatenated to the numpy array
- Split the cases into training and test data sets
- Numeric features must be rescaled
Construct the logistic regression model
- Compute the logistic regression model.
  - .LogisticRegression( )
  - Fit the linear model
- Model coefficients
- Compute a sample of class probabilities for the test feature set
  - Class with the highest probability is taken as the score (prediction)
Score and evaluate the classification model
- Transform computed class probabilities into actual class scores
  - Set the threshold to the probability of a score of 0 for the test data
  - Obtain results of the test data
- Quantify the performance of the model
  - Always use multiple metrics to evaluate the performance of any machine learning model
    - Confusion matrix
    - Accuracy
    - Precision
    - Recall
    - F1
    - ROC and AUC
- Class imbalance:
  problem in machine learning where the total number of a class of data (e.g. +ve) is far less than the total number of another class of data (e.g.-ve).
  (e.g. two positive cases for each negative case).
- Naive ‘classifier’ that sets all cases to positive
  - Compute ROC & AUC and compare
Compute a weighted model
- Helps overcome class imbalance problem
- Weight the classes when computing the logistic regression model
  - .LogisticRegression(class_weight = … )
- Compute class probabilities for the test feature set
- To find if there is any significant difference with the unweighted model, compute the scores and the metrics
  - Look at accuracy, ROC & AUC
Scoring threshold can be adjusted
- Helps overcome the problem of asymmetric cost of misclassification to the bank
- To tip the model scoring toward correctly identifying the bad credit cases
- Scoring and evaluating the model for a given threshold value
- Exactly which threshold to pick is a business decision

Classification - Mushroom dataset:

Recode categorical variables to numeric
Obtain numeric features and label values
Split the cases into training and test data sets
Train a DecisionTreeClassifier with default parameters and random_state=0, then find the 5 most important features
Use validation_curve( ) to determine training and test scores for a Support Vector Classifier (SVC) with varying parameter values
Finding gamma values corresponding generalization performance on the dataset (underfitting, overfitting, good) based on scores

These are my notes taken from Microsoft Learning’s Principles of Machine Learning in Python - Module 4.

What I learnt:

Classification

Method: Logistic regression.

Goal of regression: Constructing a 2-class classification model and evaluate its performance.

I have reviewed and went through Classification in this link.

Share: Twitter, Facebook