Week 15 - Machine Learning - (Supervised Learning) Applying Linear Regression for automobile prices from Principles of M.L. Python by Microsoft Learning
- Linear regression using scikit-learn
- Use categorical data
- Apply transformations to features and labels to improve model performance
- Compare regression models to improve model performance
- Prepare the model matrix
- Create dummy variables from categorical features
- Concatenate the numeric features
- Data preparation
- Split the dataset
- Create independently sampled training and test datasets
- .train_test_split( ) to perform Bernoulli sampling
- Scale numeric features
- Z-Score normalization
- .StandardScaler( ), .fit( ) on the training features, then .transform( )
- Construct the regression model
- Instantiate and fit the model
- .LinearRegression( ) & .fit( )
- print the intercept term and model coefficients
- Evaluate model performance
- Plot the predicted values computed from the training features
- The predict method is applied to the model with the training data
- Using the test dataset
- Choose & compute performance metric
- Mean squared error or MSE
- Root mean squared error or RMSE
- Mean absolute error or MAE
- Median absolute error
- R squared or R², also known as the coefficient of determination
- Adjusted R squared
- Residuals (errors) of the model should be Normally distributed
- plot a kernel density plot and histogram of the residuals of the regression model
- Quantile-Quantile Normal plot, or Q-Q Normal plot
- If the residuals have a distribution which is approximately Normal, points will nearly fall along the straight line
- Residual plot
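
The model-matrix, split, scale and fit steps above can be sketched as follows. This is a minimal illustration, not the course notebook: the data is synthetic, the column names (`body_style`, `curb_weight`, `horsepower`, `price`) are hypothetical stand-ins for the automobile dataset, and the log transform of the label is one assumed example of a label transformation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the automobile data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
autos = pd.DataFrame({
    "body_style": rng.choice(["sedan", "hatchback", "wagon"], size=n),
    "curb_weight": rng.normal(2500, 400, size=n),
    "horsepower": rng.normal(100, 25, size=n),
})
autos["price"] = (5.0 * autos["curb_weight"] + 60.0 * autos["horsepower"]
                  + rng.normal(0, 500, size=n))

# Model matrix: dummy variables first, then the numeric features concatenated.
dummies = pd.get_dummies(autos["body_style"]).to_numpy(dtype=float)
numeric = autos[["curb_weight", "horsepower"]].to_numpy()
features = np.concatenate([dummies, numeric], axis=1)
labels = np.log(autos["price"].to_numpy())  # assumed label transformation

# Independently sampled training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)

# Z-score scaling: fit on the training numeric columns only, then apply to both.
n_dummy = dummies.shape[1]
scaler = StandardScaler().fit(X_train[:, n_dummy:])
X_train[:, n_dummy:] = scaler.transform(X_train[:, n_dummy:])
X_test[:, n_dummy:] = scaler.transform(X_test[:, n_dummy:])

# A full set of dummy columns already spans the intercept, so it is switched off.
model = LinearRegression(fit_intercept=False)
model.fit(X_train, y_train)
print(model.intercept_)
print(model.coef_)

train_predictions = model.predict(X_train)  # values to plot against y_train
```

Fitting the scaler on the training split only keeps information from the test set out of the model.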
Week 15 - Machine Learning - (Supervised Learning) Classification from Principles of M.L. Python by Microsoft Learning
- Logistic Regression
- 2-class classification
- Linear regression model with a nonlinear output
- Response is binary, {0,1} , or positive and negative; response is the prediction of the category
- Response of the linear model is transformed or ‘squashed’ to values close to 0 and 1 using a logistic function
- Prepare data for scikit-learn model
- In the dataframe, there is usually 1 identifier column and 1 label column (the y output values); the remaining columns are features (the x input values)
- Create a numpy array of the label values
- Create the model matrix
- Categorical variables need to be recoded as binary dummy variables
- Numeric features must be concatenated to the numpy array
- Split the cases into training and test data sets
- Numeric features must be rescaled
- Construct the logistic regression model
- Compute the logistic regression model.
- .LogisticRegression( )
- Fit the linear model
- Model coefficients
- Compute a sample of class probabilities for the test feature set
- Class with the highest probability is taken as the score (prediction)
- Score and evaluate the classification model
- Transform computed class probabilities into actual class scores
- Set the threshold to the probability of a score of 0 for the test data
- Obtain results of the test data
- Quantify the performance of the model
- Always use multiple metrics to evaluate the performance of any machine learning model
- Confusion matrix
- Accuracy
- Precision
- Recall
- F1
- ROC and AUC
- Class imbalance: a problem in machine learning where one class has far fewer cases than another (e.g. two positive cases for each negative case)
- Naive ‘classifier’ that sets all cases to positive
- Compute ROC & AUC and compare
- Compute a weighted model
- Helps overcome class imbalance problem
- Weight the classes when computing the logistic regression model
- .LogisticRegression(class_weight = … )
- Compute class probabilities for the test feature set
- To find if there is any significant difference with the unweighted model, compute the scores and the metrics
- Look at accuracy, ROC & AUC
- Scoring threshold can be adjusted
- Helps overcome the problem of asymmetric cost of misclassification to the bank
- To tip the model scoring toward correctly identifying the bad credit cases
- Scoring and evaluating the model for a given threshold value
- Exactly which threshold to pick is a business decision
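
The scoring, weighting and threshold steps above can be sketched as follows. This is an illustration under stated assumptions: the data is synthetic rather than the German credit set, `score_model` and `report` are hypothetical helper names, and `class_weight="balanced"` is just one possible weighting (the notes leave the value elided).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 2-class data (roughly one third positives).
rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

def score_model(probs, threshold):
    """Turn class probabilities into 0/1 scores using the positive-class column."""
    return (probs[:, 1] > threshold).astype(int)

def report(y_true, probs, threshold=0.5):
    """Collect several metrics at once, since no single one tells the whole story."""
    scores = score_model(probs, threshold)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, scores, average="binary", zero_division=0)
    return {
        "confusion": confusion_matrix(y_true, scores),
        "accuracy": accuracy_score(y_true, scores),
        "precision": precision, "recall": recall, "f1": f1,
        "auc": roc_auc_score(y_true, probs[:, 1]),
    }

# Unweighted model, then a class-weighted model for comparison.
plain = LogisticRegression().fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
plain_probs = plain.predict_proba(X_test)
weighted_probs = weighted.predict_proba(X_test)

# Baseline: a naive 'classifier' that scores every case as positive.
naive_accuracy = accuracy_score(y_test, np.ones_like(y_test))

print(report(y_test, plain_probs))
print(report(y_test, weighted_probs))
for threshold in (0.5, 0.4, 0.3):  # lower thresholds favour catching positives
    print(threshold, report(y_test, weighted_probs, threshold)["recall"])
```

Lowering the threshold can only keep or increase recall for the positive class, usually at the cost of precision, which is why the final choice depends on the relative cost of each kind of error.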
Week 15 - Machine Learning - (Supervised Learning) Introduction to Regression from Principles of M.L. Python by Microsoft Learning
- Linear regression using scikit-learn
- Overview of linear regression
- Data preparation
- Split the dataset
- Create independently sampled training and test datasets
- .train_test_split( ) to perform Bernoulli sampling
- Scale numeric features
- Min-Max normalization
- Z-Score normalization
- .StandardScaler( ), .fit( ) on the training features, then .transform( )
- Train the regression model
- Instantiate and fit the model
- .LinearRegression( ) & .fit( )
- print the model coefficients
- Plot the predicted values computed from the training features
- The predict method is applied to the model with the training data
- Evaluate model performance
- Using the test dataset
- Choose & compute performance metric
- Mean squared error or MSE
- Root mean squared error or RMSE
- Mean absolute error or MAE
- Median absolute error
- R squared or R², also known as the coefficient of determination
- Adjusted R squared
- Residuals (errors) of the model should be Normally distributed
- plot a kernel density plot and histogram of the residuals of the regression model
- Quantile-Quantile Normal plot, or Q-Q Normal plot
- If the residuals have a distribution which is approximately Normal, points will nearly fall along the straight line
- Residual plot
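
The two scaling options and the evaluation metrics listed above can be computed as in the sketch below. The data is synthetic and the variable names are illustrative. Adjusted R² is computed by hand from the usual formula, and `scipy.stats.probplot` supplies the numbers behind a Q-Q Normal plot without drawing it.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, r2_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data with a known linear relationship plus Normal noise.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2)

# Min-Max maps each training feature to [0, 1]; Z-Score gives mean 0, std 1.
X_train_mm = MinMaxScaler().fit(X_train).transform(X_train)
z_scaler = StandardScaler().fit(X_train)
X_train_z, X_test_z = z_scaler.transform(X_train), z_scaler.transform(X_test)

model = LinearRegression().fit(X_train_z, y_train)
predictions = model.predict(X_test_z)
residuals = y_test - predictions

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, predictions)
median_ae = median_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

# Adjusted R squared penalises the number of features relative to the sample size.
n_obs, n_feat = X_test_z.shape
adj_r2 = 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_feat - 1)

# Q-Q data: qq_r near 1 means the points lie close to the straight line.
(theoretical_q, ordered_resid), (slope, intercept, qq_r) = stats.probplot(
    residuals, dist="norm")

print(mse, rmse, mae, median_ae, r2, adj_r2, qq_r)
```

Plotting `predictions` against `residuals` gives the residual plot, and `theoretical_q` against `ordered_resid` gives the Q-Q Normal plot.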
Week 14 - Machine Learning - (Supervised Learning) Multivariate Linear Regression from M.L. course by Stanford Uni (Week 2 Part I)
- Multivariate Linear Regression
Week 14 - Machine Learning - Data Preparation for automobile prices & german bank credit from Principles of M.L. Python by Microsoft Learning
- Data Preparation for Machine Learning
- Recode column names containing special characters
- Treat missing values
- Transform column data types
- Feature engineering
- Aggregating categories to reduce the number of categories
- Transforming numeric variables to improve their distribution properties
- Compute new features from two or more existing features
- Remove duplicates
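
The preparation steps above can be sketched with pandas on a tiny hand-made table. The rows, the `?` missing-value placeholder and the category groupings are all assumptions for illustration, not values taken from the course data.

```python
import numpy as np
import pandas as pd

# Tiny made-up table: hyphenated names, a '?' placeholder, and one duplicate row.
raw = pd.DataFrame({
    "num-of-cylinders": ["four", "six", "four", "eight", "four", "twelve"],
    "curb-weight": [2548, 2823, 2548, 3500, 2337, 3900],
    "price": ["13495", "16500", "13495", "?", "13950", "36000"],
    "city-mpg": [21, 19, 21, 16, 24, 13],
    "highway-mpg": [27, 26, 27, 22, 30, 17],
})

# Recode column names containing special characters.
df = raw.rename(columns=lambda name: name.replace("-", "_"))

# Treat missing values: rows whose price is the '?' placeholder are dropped.
df = df[df["price"] != "?"].copy()

# Transform column data types: price arrives as text and becomes numeric.
df["price"] = pd.to_numeric(df["price"])

# Aggregate categories to reduce the number of categories (hypothetical grouping).
cylinder_groups = {"four": "four_or_less", "six": "five_six",
                   "eight": "eight_twelve", "twelve": "eight_twelve"}
df["num_of_cylinders"] = df["num_of_cylinders"].map(cylinder_groups)

# Transform a numeric variable to improve its distribution properties.
df["log_price"] = np.log(df["price"])

# Compute a new feature from two existing features.
df["avg_mpg"] = (df["city_mpg"] + df["highway_mpg"]) / 2

# Remove duplicates.
df = df.drop_duplicates().reset_index(drop=True)
print(df)
```

Dropping rows is only one way to treat missing values; imputing them is an alternative when too many rows would otherwise be lost.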
Week 14 - Machine Learning - (Supervised Learning) Visualizing Data for Classification for german bank credit from Principles of M.L. Python by Microsoft Learning
- Visualizing Data for Classification
- Exploratory data analysis
- Examine classes and class imbalance
- Visualize class separation by numeric features
- Visualize class separation by categorical features
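
Before drawing any plots, the same questions can be answered with summary tables, as in the sketch below. The data is synthetic and the column names (`bad_credit`, `loan_duration_mo`, `checking_account_status`) are hypothetical stand-ins for the German credit columns. Box plots and bar charts of these groupings would show the same information visually.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit data (hypothetical columns and values).
rng = np.random.default_rng(3)
n = 600
bad_credit = (rng.random(n) < 0.3).astype(int)
credit = pd.DataFrame({
    "bad_credit": bad_credit,
    "loan_duration_mo": rng.normal(20, 8, size=n) + 6 * bad_credit,
    "checking_account_status": np.where(
        rng.random(n) < 0.3 + 0.3 * bad_credit, "overdrawn", "positive"),
})

# Examine classes and class imbalance.
class_counts = credit["bad_credit"].value_counts()
class_share = credit["bad_credit"].value_counts(normalize=True)

# Class separation by a numeric feature: compare its distribution per class.
numeric_summary = credit.groupby("bad_credit")["loan_duration_mo"].describe()

# Class separation by a categorical feature: category share within each class.
categorical_table = pd.crosstab(
    credit["checking_account_status"], credit["bad_credit"], normalize="columns")

print(class_counts)
print(class_share)
print(numeric_summary)
print(categorical_table)
```

A feature separates the classes well when its per-class distributions (numeric) or per-class category shares (categorical) differ clearly.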