Machine Learning
- ML enables computers to learn without explicit programming
- ML vs Classical Programming
- Software Engineer
- The engineer defines rules, and those rules decide whether a mail is spam or not
- Challenges
- Rigid
- Lot of hard coding
- Machine Learning
- We want the computer to learn the patterns for us
- Feed training/sample data to an algorithm
- Idea is that the algorithm will identify the patterns
- It will create a mechanism (hypothesis)
- When a new email is given to the algorithm, it will classify it as spam/non-spam
- Basic ML pipeline
- Input: Training data (text, numerical)
- Which algorithm to use
- Trained model as output (also called hypothesis)
- What is Machine Learning
- A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- E: Experience -> Training Data
- T: Task -> eg: classify email as spam or not
- P: Performance -> Quantitative measure of how model is performing
- Task: Predicting Stock Prices
- Experience: Historical data of prices
- Performance: difference between actual and predicted prices
- Categorize images into cat, dog, rabbit
- Experience: Labelled images of cat, dog and rabbit
- Performance: +ve marks for correct prediction and -ve marks for incorrect prediction
- Training Data
- Preprocessing(Numpy and Pandas)
- Cleaning
- Removing outliers
- Transforming
- Types of tasks
- Classification
- Labelled data is given for training
- The trained model labels a new data point
- Regression(predict real continuous value)
- Data/Features about a house
- Clustering
- Customer data -> last purchase made, frequency of purchase, products checked out
- Linear regression Intuition - single variable
- Linear regression tries to fit "Best fit line" through the data
- Mean Absolute error
- Mean squared error
- Plot loss against w1 to find the minimum
- Linear regression Intuition - multi variable
- Age, km driven & price
- Model interpretability
- Gradient descent revision
- x_new = x − η·(dy/dx), where η is the learning rate
- keep on repeating until x converges
- Mathematical implementation
- ŷ = w1·x1 + w2·x2 + ... + wd·xd + w0
- loss = mean squared error
- Goal is minimize loss by updating weights and bias using gradient descent
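- A minimal NumPy sketch of this setup, assuming batch gradient descent on the MSE loss (the function name, learning rate and toy data are illustrative, not from the lecture):

```python
import numpy as np

def linear_regression_gd(X, y, lr=0.01, epochs=2000):
    """Fit y_hat = X @ w + b by minimizing MSE with batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        error = X @ w + b - y               # (y_hat - y) for every sample
        w -= lr * (2 / n) * (X.T @ error)   # w = w - lr * dLoss/dw
        b -= lr * (2 / n) * error.sum()     # b = b - lr * dLoss/db
    return w, b

# toy data: y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, size=100)
print(linear_regression_gd(X, y))   # weights close to [3.], bias close to 2.
```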
- Feature engineering
- outlier treatment
- scaling
- Assumptions of Linear Regressions
- Linearity
- Multi-Collinearity
- High correlation among the features
- Interpretability of the model will be lost
- Calculate the VIF score for each feature
- Find the feature with the highest VIF score and drop it
- Repeat until all remaining features have a VIF score under 10 (see the sketch after this list)
- Normality of residual
- Heteroskedasticity
- Auto correlation
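- A sketch of the VIF procedure from the multi-collinearity bullets above, using variance_inflation_factor from statsmodels (the helper name and the threshold of 10 follow the notes; X is assumed to be a pandas DataFrame of features):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Repeatedly drop the feature with the highest VIF until all VIFs are under threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() < threshold:
            return X
        worst = vifs.idxmax()   # feature with the highest VIF score
        print(f"dropping {worst} (VIF = {vifs[worst]:.1f})")
        X = X.drop(columns=[worst])
```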
- Types of gradient descent
- Batch Gradient descent
- Mini batch GD
- Stochastic Gradient Descent (SGD)
- only one single data point is used in each iteration
- There will be lot of fluctuations in loss until we reach minima
- 1 epoch -> the number of iterations needed for the entire dataset to be passed once through optimization
- Polynomial regression
- underfitting and overfitting
- Bias-Variance Tradeoff
- Bias -> error = yi − ŷ (systematic error)
- Variance -> spread of the model's predictions across training sets -> less consistency
- Regularization(L2) -> Ridge regression
- Complex model with degree 4 - how to reduce complexity
- Reduce influence of higher degree terms
- Reduce weights associated to these functions
- new loss function
- loss = (1/n)·Σ(yi − ŷi)² + λ·Σ(wj)²
- λ·Σ(wj)² is the regularization term
- Regularization(L1)
- loss = (1/n)·Σ(yi − ŷi)² + λ·Σ|wj|
- Ridge, Lasso & ElasticNet regression
- loss = (1/n)·Σ(yi − ŷi)² + λ·(c·Σ(wj)² + (1 − c)·Σ|wj|)
- High bias -> should we increase or decrease model complexity? -> increase; this is the underfitting case
- High variance -> should we increase or decrease model complexity? -> decrease, as the model has learnt noise; this is the overfitting case
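- The three regularized losses above map onto scikit-learn estimators; in this sketch alpha plays the role of λ and l1_ratio roughly plays the role of the mixing constant c (the synthetic data and the specific values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

models = {
    "ridge": Ridge(alpha=1.0),                                           # L2 penalty: lambda * sum(wj^2)
    "lasso": Lasso(alpha=0.1, max_iter=10_000),                          # L1 penalty: lambda * sum(|wj|)
    "elasticnet": ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000),  # mix of L1 and L2
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.coef_.round(2))
```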
- Hyperparameter tuning
- Parameters are learned by the ML algorithm from the training data
- Hyperparameters are chosen by engineers through experimentation
- degree of polynomial features -> test with various degrees and identify the best one
- regularization constant (lambda) -> tuned the same way, by trying several values
- cross validation
- k-fold cross validation
- Divide the training data into multiple chunks (e.g., 4 chunks)
- Use one chunk for validation and the other 3 for training, in a cyclic manner, so every chunk goes through both the training and validation cycle (see the sketch below)
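- A minimal scikit-learn sketch of 4-fold cross validation (the model, scoring metric and synthetic data are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

kf = KFold(n_splits=4, shuffle=True, random_state=0)   # 4 chunks, cycled
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")
print(scores, scores.mean())   # one R^2 score per validation chunk
```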
- intro to logistic regression
- Case Study: AT & T customer churn/attrition
- Features -> #customer care calls, account length, discount ::: Target -> churn
- Binary Classification problem
- Supervised learning problem
- yi belongs to {0,1}
- In Linear Regressions,
- target can take any Real number value
- It fits the best fit line through the training data
- It learns the best fit line by minimizing the loss (MSE)
- In Logistic Regression, it fits the best separating line instead
- hyperplane: w1·x1 + w2·x2 + ... + wd·xd + w0 = 0
- Goal: find the wi and w0 that give the best separating decision boundary
- Thresholding or step function is not differentiable at x=0
- Sigmoid function
- sigmoid(z) = 1/(1 + e^(−z))
- max value is 1, min value is 0
- range is (0,1)
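- The sigmoid in a couple of lines of NumPy (values chosen only to show the squashing behaviour):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real z into (0, 1); sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~[0.00005, 0.5, 0.99995]
```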
ML: Logistic Regression-2 -> Classification algorithm
- Geometric interpretation of sigmoid
- Maximum likelihood estimation
- Optimization
- Sklearn implementation
- Accuracy metric
- Hyperparameter tuning
- log loss
- Why not MSE(Mean Squared Error)
- We want our loss function to be convex function(Single minima)
- log odds
- odds = p(win)/p(loss)
- odds of csk winning is 4:1
- higher the odds imply higher chance of winning
- probability of belonging to class 1 / probability of belonging to class 0 -> p/(1 − p)
- impact of outliers
- two types
- outliers on correct side
- outliers on incorrect side
- multiclass classification
- Train 3 separate one-vs-rest logistic regression models; the class with the highest probability is taken
- M1: p(orange) vs rest
- M2: p(apple) vs rest
- M3: p(grape) vs rest
- confusion matrix
- precision
- recall
- f1 score
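- These four metrics computed with scikit-learn on a made-up set of labels (purely illustrative):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))               # rows = actual, cols = predicted
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("f1:", f1_score(y_true, y_pred))                # harmonic mean of precision and recall
```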
ML: CM contd. + Imbalanced Data
Business Case: Jamboree Review
- Classification vs Regression problem
- A classification problem is one where the target is not a real value, i.e., the target is discrete. Eg: categorizing shops
- Regression has real or continuous value for target
- Steps
- Calculate the Euclidean distance from the new point to all data points
- Sort the data points by distance, keeping the class field alongside
- Take the top k points with the minimum distance
- Take the majority vote (a from-scratch sketch follows this list)
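- A from-scratch NumPy sketch of exactly these four steps (the function name, k and the toy points are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)    # 1. Euclidean distance to all points
    order = np.argsort(dists)                          # 2. sort by distance
    top_k_labels = y_train[order[:k]]                  # 3. top k nearest points
    return Counter(top_k_labels).most_common(1)[0][0]  # 4. majority vote

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))   # -> 0
```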
- KNN is a non-parametric algorithm
- Because all the work happens at prediction time, it is slow
- No training required, only testing/prediction
- Assumption:
- All data points in a neighbourhood are similar, i.e., the data is locally homogeneous
- One advantage of KNN over logistic regression: it handles multi-class problems natively
- sklearn implementation
- bias variance of knn
- when k is very small
- it will have high variance & low bias
- it learns noise in training data
- model is overfit
- when k is very large
- underfit
- How to identify the correct value of K
- Hyperparameter tuning
- Time complexity of knn
- O(1) -> no training
- Distance metrics
- Time complexity (at query time)
- n data points with d features
- O(d) time for the distance between 2 points
- O(nd) for distances to all n data points
- Sort -> O(n log n)
- Pick the k nearest neighbours -> O(k)
- Total time complexity -> O(nd + n log n)
- Space complexity
- O(n(d+1)) -> store all n points with d features plus the label
- When dimensionality is very high, use cosine similarity
- cosine similarity = (x1 · x2) / (|x1| · |x2|)
- KNN in real world
- Direct application in real world is impractical
- LSH - Locality sensitive Hashing
- knn Imputation
- Find the K nearest neighbours of the row whose feature fi is missing (using the other features)
- Find the mean value of feature fi among those neighbours
- Impute the missing value with the computed mean (sklearn's KNNImputer does this; see the sketch below)
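- scikit-learn ships this idea as KNNImputer; a small sketch (the array values are made up):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# each missing value is replaced by the mean of that feature over the
# k nearest rows (distances computed on the non-missing features)
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```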
- Cons of attrition
- new employee might ask for more compensation
- Training
- Time and resources in recruiting
- Chances of attrition for an employee
- identify key factors for attrition
- Factors that might contribute towards attrition
- unhealthy work culture
- overtime
- age
- We can create non-linear decision boundaries using decision trees
- Entropy
- Measure impurity of a node
- Measure the level of heterogeneity
- Entropy of a node for binary classification
- h(y) = −[ p(1)·log2 p(1) + p(0)·log2 p(0) ] = −[ p·log2(p) + (1 − p)·log2(1 − p) ]
- We will always go for the split that gives maximum information gain
- For further splits we will keep on repeating the same idea: Go for the split with maximum Info gain.
- Keep on doing this until you get pure nodes
- Can we have features with more than 2 categories
- yes: fi -> a, b, c
- how do we decide on split
- by maximizing information gain at each split
- Gini Impurity
- Serves the same purpose as entropy but is much quicker to compute
- GI = 1 − Σ p(yi)²
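- Both impurity measures in NumPy, for a vector of class probabilities (function names are illustrative):

```python
import numpy as np

def entropy(p):
    """h(y) = -sum(p_i * log2(p_i)); 0 for a pure node, 1 for a 50/50 binary node."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    """GI = 1 - sum(p_i^2); 0 for a pure node, 0.5 for a 50/50 binary node."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # 1.0, 0.5 -> maximally impure binary node
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # 0.0, 0.0 -> pure node
```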
- Splitting Numerical features
- By threshold
- Sort the values in ascending order
- Starting from the top, take each value as a candidate threshold
- Find the information gain for each threshold
- Pick the threshold with the maximum information gain and split the feature on it
- Underfit vs Overfit
- shallow tree -> underfit
- deep tree -> overfit
- Goal is to find appropriate depth
- Decision Stump
- tree with depth 1
- Pruning -> removing unwanted branches from decision tree
- Hyperparameter Tuning
- How to decide when to stop decision tree
- max depth
- min-sample-leaf
- min no of samples a leaf can have
- max-leaf-nodes
- Encoding categorical features
- Generally, encoding of categorical features is required
- Target encoding is recommended if a feature has too many unique values
- pincode: ~1 lakh unique values
- Target encoding -> binning of the feature (by the target statistic)
- feature Scaling and Data Imbalance
- Data Imbalance
- feature Importance
- Decision Tree Regression
- Ensemble models
- Combine multiple models into a group
- Take multiple ML models (base learners) that are as different from each other as possible
- Combine the models to generate final predictions
- Bagging (-> Random Forests) - Bootstrap Aggregation
- We will pass a subset of datapoints and subset of features to each model
- Bootstrap -> sampling with replacement
- Random forest -> combining multiple decision trees
- combining models will reduce variance
- Base learners are Decision Trees with low bias and high variance
- OOB - Out Of Bag points - can be used to check performance of RF model
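- A short scikit-learn sketch of a Random Forest with the OOB score mentioned above (the built-in breast cancer dataset and the hyperparameter values are used purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,      # number of base decision trees (K)
    max_features="sqrt",   # column sampling at each split
    oob_score=True,        # evaluate on the out-of-bag points
    random_state=0,
)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
```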
- Boosting
- Stacking
- Cascading
- Bias - Variance Tradeoff
- k = no of base learners/DT
- if K increases, variance decreases and vice versa
- Error = Bias ^2 + Variance + Irreducible error
- if K increases, Variance and error decreases
- Row sampling rate (RSR = rows sampled per tree / total rows)
- If RSR increases, variance will decrease, and vice versa
- Column sampling rate (CSR = features sampled per tree / total features)
- If CSR increases, variance will decrease, and vice versa
- max depth
- If max depth increases, bias decreases and variance increases, and vice versa
- Ensemble technique
- combining various base models
- Base learners - High bias and low variance decision trees
- underfit model
- additive combining to reduce bias
- Boosting -> sequential modeling
- high bias means high error
- Why boosting
- RF can be easily parallelized
- we cannot parallelize Gradient boosting
- we can minimize loss using Gradient boosting
- In boosting we are reducing residuals at each step
- Indirectly minimizing the loss function at each step
- Let's define a custom loss function -> L
- Compute the pseudo residual -> −∂L/∂ŷi (the negative gradient of the loss w.r.t. the prediction)
- Reduce the pseudo residual in each iteration
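- A small sketch of this residual-fitting idea with scikit-learn's GradientBoostingRegressor (recent scikit-learn assumed; the synthetic data and hyperparameter values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

gbdt = GradientBoostingRegressor(
    n_estimators=300,      # shallow trees fitted sequentially on pseudo residuals
    max_depth=3,           # high-bias, low-variance base learners
    learning_rate=0.05,    # shrinks each tree's contribution
    loss="squared_error",  # for this loss the pseudo residual is simply (y - y_hat)
)
gbdt.fit(X_tr, y_tr)
print("test MSE:", mean_squared_error(y_te, gbdt.predict(X_te)))
```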
- Reduce False negative by Improving Recall
- Reduce False Positive by Improving precision
- GBDT implementation
- Bias variance
- Stochastic Gradient Descent
- Huber loss
- EMG signal classification case study
- Ensemble techniques
- bagging
- boosting
- stacking
- cascading
- XG Boost -> Extreme Gradient Boosting
- Light GBM
- GOSS - Gradient-based One-Side Sampling
- GOSS is a sampling technique and EFB (Exclusive Feature Bundling) is dimensionality reduction
- Stacking
- Mainly used in competitions (e.g., Kaggle)
- Cascading
- Naive Bayes -> a classification algorithm, used especially for text classification
- Sentiment analysis
- Positive or Negative feedback
- Spam vs Ham
- Text classification -> Natural Language processing (NLP)
- Transformers (98.5% to 99%)
- Naive Bayes (80% - 90%)
- Naive Bayes can be used for Multiclass classification as well.
- Data Preprocessing
- Lowercase
- Tokenizing the text
- Welcome to the movie -> "welcome" "to" "the" "movie"
- Remove all the stop words
- remove all special characters and punctuation
- lemmatization
- change, changing, changed -> change
- Naive Assumption
- The occurrence of each unique word in a sentence is assumed independent of the other words
- recap -> preprocessing -> vectorization -> bag of words/count of word in document
- Bag of words -> 1 if word is present else 0
- count of words -> count of respective word in that message
- Laplace smoothing
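- A CountVectorizer + MultinomialNB sketch tying the preprocessing, bag-of-words and Laplace smoothing ideas together (the tiny corpus is invented; alpha is the Laplace smoothing constant):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free lottery prize now",
    "limited offer win cash",
    "meeting scheduled for monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),  # tokenize + count of words
    MultinomialNB(alpha=1.0),                               # alpha = Laplace smoothing
)
model.fit(texts, labels)
print(model.predict(["free cash offer", "monday report review"]))   # expected [1 0]
```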
- Hyperparameter tuning
- Impact of imbalance
- underflow problems
- feature interpretability
- Impact of outliers
- Words that have a very low occurrence in the training dataset
- We can set a minimum-count threshold for all words (e.g., keep only words with count > 10)
ML: SVM-1 (Support Vector Machine)
- Introduction
- concept of margins
- kernel functions
- Hard Margin SVM
- Goal 1: Maximize the margin
- Goal 2: There should be no misclassified points
- Soft Margin SVM
- Hyperparameter C
- Mathematics
- Comparison with log loss
- Data Imbalance
- Primal dual Equivalence
- Support Vectors
- Kernel SVM
- Polynomial kernel
- RBF Kernel
- SVM Code
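- A minimal SVC sketch covering the C hyperparameter and the RBF kernel listed above (iris data and the specific values are used only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# larger C -> harder margin (fewer misclassified points allowed), smaller C -> softer margin
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_tr, y_tr)
print("test accuracy:", svm.score(X_te, y_te))
```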
- Customer Segmentation case study
- Machine Learning Tasks
- Regression, Classification, Clustering, Recommendation & Timeseries
- When the target variable takes values from a specific set of outputs, e.g., {0, 1}, it is a Classification problem
- When the target variable can take any real number as output, it is a Regression problem
- When there is NO target variable, it is Unsupervised learning
- Clustering
- Group sample/data points on the basis of Similarity/distance
- Similarity/distance can be calculated using
- Euclidean
- Manhattan
- Minkowski
- Cosine Similarity
- Kernel functions
- Intra and inter cluster distance
- intra distance calculation methods
- distance from the centroid to the farthest point
- distance between the 2 farthest points
- inter cluster distances
- distance between centroids
- distance between farthest points
- distance between nearest points
- Ideal values for Intra and Inter cluster distances
- low intra cluster distances
- High inter cluster distances
- what metric to evaluate clustering
- should make business sense
- Technical way is Dunn Index -> low intra and high inter cluster distances
- Dunn Index
- Dunn Index = min over (i, j) of inter-cluster distance(i, j) / max over k of intra-cluster distance(k) -> the smallest distance between any two clusters divided by the largest intra-cluster distance
- we want to maximize Dunn Index
- KMeans Intro
- Simple, Popular and Baseline
- Each cluster will have its unique centroid
- We can identify the clusters using centroid
- centroid is mean/avg of all data points inside the cluster
- Core Idea
- K centroids; each data point is assigned to the centroid that is closest to it
- Each centroid represents a cluster
- Task
- Find K centroids representing K clusters
- Optimization problem - NP-Hard
- Approx algo/heuristic algo
- Lloyd's algo
- KMeans Mathematical Foundation
- Lloyds Algo
- Initialization: Pick K pts randomly
- Assignment: For each xi in D, select the nearest centroid cj, add xi to sj
- Update the centroids
- Repeat steps 2 & 3 until the centroids don't change
- Determining K
- WCSS (Within cluster sum of squares)
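- scikit-learn's KMeans exposes WCSS as inertia_; a sketch of the elbow approach for choosing K (blob data generated only for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia_ = WCSS; look for the "elbow" as k grows
```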
- Initialization
- Bad initialization leads to suboptimal clustering
- If we initialize centroids too close to each other, it will take more time/iterations to reach convergence
- Brute force to address sub optimal clustering
- do the initialization multiple times (say 100 times)
- Idea 1: Take points as far away as possible
- Idea 2: Pick the next centroid with probability proportional to distance, so higher distance means higher probability (this is the K-Means++ initialization)
- Issue with K Means and K Medians
- Centroids identify each cluster
- Stock Portfolio case study
- Hierarchical Clustering intro
- Agglomerative Clustering
- Start with data points and eventually build a single cluster
- combining data points
- bottom up approach
- Dendrogram -> the tree that is formed with the above approach
- Divisive Clustering
- Starts with all the data as a single cluster and breaks into smaller clusters
- Top down approach
- Proximity Matrix
- Implementation - SciPy & scikit-learn
- Limitations
- computationally expensive O(n^2)
- Introduction
- Another clustering algorithm (GMM - Gaussian Mixture Models)
- K-Means and Hierarchical clustering, where each point belongs to exactly one cluster, are hard clustering; GMM does soft clustering
- Multi dimensional GMM
- Expectation maximization
- initialize mean and stddev values
- for each xi, calculate the probability of it belonging to c1, c2
- update mu1 = Σ(pi · xi) / Σ pi (probability-weighted mean), and similarly for the other parameters
- GMM algo
- d features, k clusters
- initialize: initialize mu and sigma for all clusters
- expectation: calculate the probability of each xi belonging to the jth cluster
- maximization: update mu and sigma for each cluster using these probabilities
- end of updates: stop when the Gaussian parameters converge
- sklearn implementation
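- The sklearn implementation referred to above is GaussianMixture; a minimal soft-clustering sketch (blob data for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(X)                                # EM: expectation + maximization until convergence
print(gmm.means_)                         # one mean (mu) per Gaussian
print(gmm.predict_proba(X[:3]).round(3))  # soft assignment: probability per cluster
```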
- online vs offline algorithm
- Introduction to DBScan
- DBScan algo
- DBSCAN -> Density-Based Spatial Clustering of Applications with Noise
- Core Points
- Border Points
- Noise Points
- Border points can have core points as neighbours, but noise points don't have core points as neighbours
- Hyperparameter tuning
- Pros and Cons
- Introduction to anomaly detection
- Introduction
- RANSAC - Random Sample Consensus
- Elliptic Envelope
- FastMCD -> Fast Minimum Covariance Determinant
- Isolation Forest
- inliers will have deeper nodes
- Local Outlier Factor
- the distance to the k nearest neighbours is smaller for inliers
- the density around inliers is higher than around outliers
- LOF score
- one class svm
- comparing different outlier detection methods
- motivations for high-dimensional visualization
- principal component analysis
- Principal components are orthogonal to each other
- PCA math
- PCA Scratch impl
- sklearn pca impl
- Digits dataset
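- The sklearn PCA implementation on the digits dataset mentioned above (2 components chosen just for a 2-D view):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)          # 64-dimensional images of digits
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                    # project onto 2 orthogonal principal components
X_2d = pca.fit_transform(X_scaled)
print(X_2d.shape)                            # (1797, 2)
print(pca.explained_variance_ratio_)         # fraction of variance captured by each component
```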
- t-SNE intro
- Mean/Median
- Backfill and forward fill
- Forward fill - Fill value with previous value
- Backfill - Fill value with next value
- Linear Imputation
- avg of previous and next values
- Anomalies
- exceptions
- data entry error -> correct it
- a correct entry but a one-time event -> change it, or
- treat it as a missing value and use linear imputation
- Time series = Trend + Seasonality + residuals
- how to figure out trend
- moving avg with window size 3
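- A pandas sketch of extracting the trend with a moving average of window size 3 (the series values are invented):

```python
import pandas as pd

sales = pd.Series([10, 12, 14, 13, 17, 20, 19, 23, 25, 24])

trend = sales.rolling(window=3, center=True).mean()   # moving average, window size 3
detrended = sales - trend                             # seasonality + residual remain
print(pd.DataFrame({"sales": sales, "trend": trend, "detrended": detrended}))
```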
- Mobiplus case study
- Missing Values
- Anomalies
- Breaking down a time series
- Additive and Multiplicative Seasonality
- Additive Time Series = Trend + Seasonality + residual
- Multiplicative Time Series = Trend * Seasonality * residual
- Decomposition from Scratch
- Generalizing forecast methodology
- Simple forecasting methods
- Mean/Median
- Naive Forecasting
- Seasonal forecast
- Drift method
- Simple exponential smoothing
- Smoothing methods for forecasting
- Moving avg forecasting
- Simple exponential smoothing
- Double exponential smoothing (Holt's method)
- Triple exponential smoothing (Holt-Winters)
- concept of stationarity
- Stationarity
- Auto correlation and partial auto-correlation function
- Auto regression model
- Moving avg model
- ARMA model
- ARIMA model
- SARIMA model
- ARIMA model family
- AR
- MA
- ARMA
- ARIMA
- ARIMA model
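- A minimal statsmodels ARIMA sketch (the synthetic series and the (p, d, q) order are purely illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(0.5, 1.0, size=200)))   # synthetic trending series

model = ARIMA(y, order=(1, 1, 1))   # p=1 (AR terms), d=1 (differencing), q=1 (MA terms)
fit = model.fit()
print(fit.forecast(steps=5))        # point forecasts for the next 5 steps
```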
- Ranged Estimates - Confidence interval
- Change Point
- Exogeneous Variables
- Forecasting using Lin Reg
- Facebook prophet
- Walmart case study
- Apriori Algorithm
- Market basket analysis
- Identify item sets that occur together very frequently
- Association Rules
- Association Metrics
- Introduction and formulation
- collaborative filtering
- Content based filtering
- Recommendation as Regression/Classification
- Matrix factorization
- Principal component analysis
- Singular value decomposition
Business Case: AdEase review
Git & GitHub: Setup for MLOps
Building Cars24 ML tool using Streamlit
Develop Web APIs using Flask
Containerization - Docker & DockerHub
Deploying APIs on AWS using ECS
GitHub Actions - Setting up CI pipelines
GitHub Actions - Setting up CD pipelines
Business Case:Zee Review
Experiment Tracking & Data Management using MLFlow
ML System Design - 1
ML System Design - 2
Building ML pipelines with AWS Sagemaker
Processing large scale data using Apache Spark
- Use Cases
- Auto Suggest -> (Search engine -> predict suffix )
- AD on search result page
- Compressed representation
- Image segmentation (computer vision)
- Why NN
- Automate feature extraction (no manual feature engineering)
- Perform really well with high dimensional data
- NN
- Input -> process -> output
- dendrites -> cell body -> axon
- thicker dendrites are more important
- Task: whether to touch an object or not
- Object dimension, Probable temperature, known or unknown object -> Neuron(computation) -> touch or not
- Summarizing Biological Neuron
- Takes Input -> perform some processing -> fires output to other neurons
- inputs are called features
- Every input is associated with weight
- It is said that Neural networks are logistic regression on steroids
- Neuron = Linear + Activation
logistic regression unit = Neuron
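- The "linear + activation" neuron above in a few lines of NumPy (the weights and inputs are made up):

```python
import numpy as np

def neuron(x, w, b):
    """A single neuron: linear combination of inputs followed by a sigmoid activation."""
    z = np.dot(w, x) + b               # linear part: w1*x1 + ... + wd*xd + w0
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid activation -> a logistic regression unit

x = np.array([0.5, 1.2, -0.3])   # input features
w = np.array([0.4, -0.6, 0.9])   # one weight per input (a "thicker dendrite" = larger weight)
b = 0.1
print(neuron(x, w, b))           # output in (0, 1)
```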
N-Layer Neural Network - 1
N-Layer Neural Network - 2
N-Layer BackPropagation
Tensorflow and Keras - 1
Tensorflow and Keras - 2
Optimizers for NNs
Hyper Parameter Tuning for NNs
Autoencoders
Practical aspects of designing MLPs and debugging
Model interpretability: LIME
DSML Module Test: Neural Networks
NN : Model interpretability: LIME Contd.
Neural Networks - No Class Day
Introduction to Computer Vision(CNN)
Revisiting CNN: Deal with Overfitting
CNN under the hood
Introduction to Transfer Learning
Image similarity: Understanding Embeddings
CNN for medical diagnosis
Object Localisation and Detection -1
Object Localisation and Detection -2
Object Segmentation
Business Case: Porter review
Object Segmentation Contd.
Object Segmentation(contd) and Siamese Net
Generative Models & GANs Introduction
- NLP -> Making machines understand Text
- Areas in which NLP can be used
- Information Retrieval (IR) / NLP -> Language models -> LLMs
- Language Modeling : Predicting next word in sentence
- Spelling correction
- Keyword-based information retrieval
- Topic modeling
- Text classification
- Information extraction
- Closed domain conversational agent
- Text Summarization
- inshorts
- Creating automated abstracts
- Question Answering
- machine translation
- Open domain Conversational Agent
- Tokenization -> Sentences to Words/Tokens
- Split
- regular expressions
- word_tokenize from NLTK
- Reduce the dataset size by removing unwanted tokens
- remove hashtags and hyperlinks before tokenization
- remove stopwords and punctuations
- Normalize
- Stemming vs lemmatization
- Embeddings -> Extract meaningful features from tokens
- Vocabulary -> set of unique words
- Text => Sentences -> words
- Case Study
- handle text data
- Logistic regression model
- Confusion matrix
- TF-IDF
- n-gram
- Why do we capture context
- To extract semantic and syntactic info
- co-occurrence matrix
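- A small scikit-learn sketch of TF-IDF with unigrams and bigrams, tying together the TF-IDF and n-gram bullets above (recent scikit-learn assumed; the corpus is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
    "great acting and a great story",
]

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")  # unigrams + bigrams
X = vec.fit_transform(corpus)        # sparse matrix: documents x n-gram features
print(vec.get_feature_names_out())   # the learned vocabulary
print(X.toarray().round(2))
```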
NLP : BERT