- Logistic Regression
- 2-class classification
- Linear regression model with a non linear output
- Response is binary, {0,1} , or positive and negative; response is the prediction of the category
- Response of the linear model is transformed or ‘squashed’ to values close to 0 and 1 using a logistic function
- Prepare data for scikit-learn model
- In the dataframe, usually there’s 1 identifier column, 1 label column (y-axis output values), remaining columns are features (x-axis output values)
- Create a numpy array of the label values
- Create the model matrix
- Categorical variables need to be recoded as binary dummy variables
- Numeric features must be concatenated to the numpy array
- Split the cases into training and test data sets
- Numeric features must be rescaled
- Construct the logistic regression model
- Compute the logistic regression model.
- .LogisticRegression( )
- Fit the linear model
- Model coefficients
- Compute a sample of class probabilities for the test feature set
- Class with the highest probability is taken as the score (prediction)
- Compute the logistic regression model.
- Score and evaluate the classification model
- Transform computed class probabilities into actual class scores
- Set the threshold to the probability of a score of 0 for the test data
- Obtain results of the test data
- Quantify the performance of the model
- Always use multiple metrics to evaluate the performance of any machine learning model
- Confusion matrix
- Accuracy
- Precision
- Recall
- F1
- ROC and AUC
- Always use multiple metrics to evaluate the performance of any machine learning model
- Class imbalance:
problem in machine learning where the total number of a class of data (e.g. +ve) is far less than the total number of another class of data (e.g.-ve).
(e.g. two positive cases for each negative case). - Naive ‘classifier’ that sets all cases to positive
- Compute ROC & AUC and compare
- Transform computed class probabilities into actual class scores
- Compute a weighted model
- Helps overcome class imbalance problem
- Weight the classes when computing the logistic regression model
- .LogisticRegression(class_weight = … )
- Compute class probabilities for the test feature set
- To find if there is any significant difference with the unweighted model, compute the scores and the metrics
- Look at accuracy, ROC & AUC
- Scoring threshold can be adjusted
- Helps overcome the problem of asymmetric cost of misclassification to the bank
- To tip the model scoring toward correctly identifying the bad credit cases
- Scoring and evaluating the model for a given threshold value
- Exactly which threshold to pick is a business decision
Classification - Mushroom dataset:
- Recode categorical variables to numeric
- Obtain numeric features and label values
- Split the cases into training and test data sets
- Train a DecisionTreeClassifier with default parameters and random_state=0, then find the 5 most important features
- Use validation_curve( ) to determine training and test scores for a Support Vector Classifier (SVC) with varying parameter values
- Finding gamma values corresponding generalization performance on the dataset (underfitting, overfitting, good) based on scores
These are my notes taken from Microsoft Learning’s Principles of Machine Learning in Python - Module 4.
What I learnt:
Classification
Method: Logistic regression.
Goal of regression: Constructing a 2-class classification model and evaluate its performance.
I have reviewed and went through Classification in this link.