Machine Learning
- ML enables computers to learn without explicit programming
- ML vs Classical Programming
- Software Engineer
- The engineer defines rules, and those rules decide whether a mail is spam or not
- Challenges
- Rigid
- Lot of hard coding
- Machine Learning
- We want the computer to learn the patterns for us
- Feed training/sample data to an algorithm
- Idea is that the algorithm will identify the patterns
- It will create a mechanism (hypothesis)
- When a new email is given to the algorithm, it will classify it as spam/non-spam
- Basic ML pipeline
- Input: Training data (text, numerical)
- Which algorithm to use
- Trained model as output (also called hypothesis)
- What is Machine Learning
- A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- E: Experience -> Training Data
- T: Task -> eg: classify email as spam or not
- P: Performance -> Quantitative measure of how model is performing
- Task: Predicting Stock Prices
- Experience: Historical data of prices
- Performance: difference between actual and predicted prices
- Categorize images into cat, dog, rabbit
- Experience: Labelled images of cat, dog and rabbit
- Performance: +ve marks for correct prediction and -ve marks for incorrect prediction
- Training Data
- Preprocessing(Numpy and Pandas)
- Cleaning
- Removing outliers
- Transforming
- Types of tasks
- Classification
- Labelled data is given for training
- The trained model labels a new data point
- Regression(predict real continuous value)
- Data/Features about a house
- Clustering
- Customer data -> last purchase made, frequency of purchase, products checked out
- Linear regression Intuition - single variable
- Linear regression tries to fit "Best fit line" through the data
- Mean Absolute error
- Mean squared error
- Plot loss against w1 to find the minimum
- Linear regression Intuition - multi variable
- Age, km driven & price
- Model interpretability
- Gradient descent revision
- x_new = x − η·(dy/dx), where η is the learning rate
- keep on repeating until x converges
- Mathematical implementation
- ŷ = w1·x1 + w2·x2 + ... + wd·xd + w0
- loss = mean squared error
- Goal is minimize loss by updating weights and bias using gradient descent
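- A minimal NumPy sketch of this setup, assuming batch gradient descent on the MSE loss (the function name, learning rate and toy data are illustrative, not from the lecture):

```python
import numpy as np

def linear_regression_gd(X, y, lr=0.01, epochs=2000):
    """Fit y_hat = X @ w + b by minimizing MSE with batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        error = X @ w + b - y               # (y_hat - y) for every sample
        w -= lr * (2 / n) * (X.T @ error)   # w = w - lr * dLoss/dw
        b -= lr * (2 / n) * error.sum()     # b = b - lr * dLoss/db
    return w, b

# toy data: y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, size=100)
print(linear_regression_gd(X, y))   # weights close to [3.], bias close to 2.
```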
- Feature engineering
- outlier treatment
- scaling
- Assumptions of Linear Regressions
- Linearity
- Multi-Collinearity
- High correlation among the features
- Interpretability of the model will be lost
- Calculate the VIF score for each feature
- Find the feature with the highest VIF score and drop it
- Repeat until all remaining features have a VIF score under 10 (see the sketch after this list)
- Normality of residual
- Heteroskedasticity
- Auto correlation
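- A sketch of the VIF procedure from the multi-collinearity bullets above, using variance_inflation_factor from statsmodels (the helper name and the threshold of 10 follow the notes; X is assumed to be a pandas DataFrame of features):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Repeatedly drop the feature with the highest VIF until all VIFs are under threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() < threshold:
            return X
        worst = vifs.idxmax()   # feature with the highest VIF score
        print(f"dropping {worst} (VIF = {vifs[worst]:.1f})")
        X = X.drop(columns=[worst])
```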
- Types of gradient descent
- Batch Gradient descent
- Mini batch GD
- Stochastic Gradient Descent (SGD)
- only one single data point is used in each iteration
- There will be lot of fluctuations in loss until we reach minima
- 1 epoch -> the number of iterations needed for the entire dataset to be passed once through optimization
- Polynomial regression
- underfitting and overfitting
- Bias-Variance Tradeoff
- Bias -> error = yi − ŷ (systematic error)
- Variance -> spread of the model's predictions across training sets -> less consistency
- Regularization(L2) -> Ridge regression
- Complex model with degree 4 - how to reduce complexity
- Reduce influence of higher degree terms
- Reduce weights associated to these functions
- new loss function
- loss = (1/n)·Σ(yi − ŷi)² + λ·Σ(wj)²
- λ·Σ(wj)² is the regularization term
- Regularization(L1)
- loss = (1/n)·Σ(yi − ŷi)² + λ·Σ|wj|
- Ridge, Lasso & ElasticNet regression
- loss = (1/n)·Σ(yi − ŷi)² + λ·(c·Σ(wj)² + (1 − c)·Σ|wj|)
- High bias -> should we increase or decrease model complexity? -> increase; this is the underfitting case
- High variance -> should we increase or decrease model complexity? -> decrease, as the model has learnt noise; this is the overfitting case
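- The three regularized losses above map onto scikit-learn estimators; in this sketch alpha plays the role of λ and l1_ratio roughly plays the role of the mixing constant c (the synthetic data and the specific values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

models = {
    "ridge": Ridge(alpha=1.0),                                           # L2 penalty: lambda * sum(wj^2)
    "lasso": Lasso(alpha=0.1, max_iter=10_000),                          # L1 penalty: lambda * sum(|wj|)
    "elasticnet": ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000),  # mix of L1 and L2
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.coef_.round(2))
```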
- Hyperparameter tuning
- Parameters are learned by the ML algorithm from the training data
- Hyperparameters are chosen by engineers through experimentation
- degree of polynomial features -> test with various degrees and identify the best one
- regularization constant (lambda) -> tuned the same way, by trying several values
- cross validation
- k-fold cross validation
- Divide the training data into multiple chunks (e.g., 4 chunks)
- Use one chunk for validation and the other 3 for training, in a cyclic manner, so every chunk goes through both the training and validation cycle (see the sketch below)
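- A minimal scikit-learn sketch of 4-fold cross validation (the model, scoring metric and synthetic data are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

kf = KFold(n_splits=4, shuffle=True, random_state=0)   # 4 chunks, cycled
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")
print(scores, scores.mean())   # one R^2 score per validation chunk
```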
- intro to logistic regression
- Case Study: AT & T customer churn/attrition
- Features -> #customer care calls, account length, discount ::: Target -> churn
- Binary Classification problem
- Supervised learning problem
- yi belongs to {0,1}
- In Linear Regressions,
- target can take any Real number value
- It fits the best fit line through the training data
- It learns the best fit line by minimizing the loss (MSE)
- In Logistic Regression, it fits the best separating line instead
- hyperplane: w1·x1 + w2·x2 + ... + wd·xd + w0 = 0
- Goal: find the wi and w0 that give the best separating decision boundary
- Thresholding or step function is not differentiable at x=0
- Sigmoid function
- sigmoid(z) = 1/(1 + e^(−z))
- max value is 1, min value is 0
- range is (0,1)
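- The sigmoid in a couple of lines of NumPy (values chosen only to show the squashing behaviour):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real z into (0, 1); sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~[0.00005, 0.5, 0.99995]
```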
ML: Logistic Regression-2 -> Classification algorithm
- Geometric interpretation of sigmoid
- Maximum likelihood estimation
- Optimization
- Sklearn implementation
- Accuracy metric
- Hyperparameter tuning
- log loss
- Why not MSE(Mean Squared Error)
- We want our loss function to be convex function(Single minima)
- log odds
- odds = p(win)/p(loss)
- odds of csk winning is 4:1
- higher the odds imply higher chance of winning
- probability of belonging to class 1 / probability of belonging to class 0 -> p/(1 − p)
- impact of outliers
- two types
- outliers on correct side
- outliers on incorrect side
- multiclass classification
- Train 3 separate one-vs-rest logistic regression models; the class with the highest probability is taken
- M1: p(orange) vs rest
- M2: p(apple) vs rest
- M3: p(grape) vs rest
- confusion matrix
- precision
- recall
- f1 score
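- These four metrics computed with scikit-learn on a made-up set of labels (purely illustrative):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))               # rows = actual, cols = predicted
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("f1:", f1_score(y_true, y_pred))                # harmonic mean of precision and recall
```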
ML: CM contd. + Imbalanced Data
Business Case: Jamboree Review
- Classification vs Regression problem
- A classification problem is one where the target is not a real value, i.e., the target is discrete. Eg: categorizing shops
- Regression has real or continuous value for target
- Steps
- Calculate the Euclidean distance from the new point to all data points
- Sort the data points by distance, keeping the class field alongside
- Take the top k points with the minimum distance
- Take the majority vote (a from-scratch sketch follows this list)
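- A from-scratch NumPy sketch of exactly these four steps (the function name, k and the toy points are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)    # 1. Euclidean distance to all points
    order = np.argsort(dists)                          # 2. sort by distance
    top_k_labels = y_train[order[:k]]                  # 3. top k nearest points
    return Counter(top_k_labels).most_common(1)[0][0]  # 4. majority vote

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))   # -> 0
```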
- KNN is a non-parametric algorithm
- Because all the work happens at prediction time, it is slow
- No training required, only testing/prediction
- Assumption:
- All data points in a neighbourhood are similar, i.e., the data is locally homogeneous
- One advantage of KNN over logistic regression: it handles multi-class problems natively
- sklearn implementation
- bias variance of knn
- when k is very small
- it will have high variance & low bias
- it learns noise in training data
- model is overfit
- when k is very large
- underfit
- How to identify the correct value of K
- Hyperparameter tuning
- Time complexity of knn
- O(1) -> no training
- Distance metrics
- Time complexity (at query time)
- n data points with d features
- O(d) time for the distance between 2 points
- O(nd) for distances to all n data points
- Sort -> O(n log n)
- Pick the k nearest neighbours -> O(k)
- Total time complexity -> O(nd + n log n)
- Space complexity
- O(n(d+1)) -> store all n points with d features plus the label
- When dimensionality is very high, use cosine similarity
- cosine similarity = (x1 · x2) / (|x1| · |x2|)
- KNN in real world
- Direct application in real world is impractical
- LSH - Locality sensitive Hashing
- knn Imputation
- Find the K nearest neighbours of the row whose feature fi is missing (using the other features)
- Find the mean value of feature fi among those neighbours
- Impute the missing value with the computed mean (sklearn's KNNImputer does this; see the sketch below)
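- scikit-learn ships this idea as KNNImputer; a small sketch (the array values are made up):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# each missing value is replaced by the mean of that feature over the
# k nearest rows (distances computed on the non-missing features)
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```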
- Cons of attrition
- new employee might ask for more compensation
- Training
- Time and resources in recruiting
- Chances of attrition for an employee
- identify key factors for attrition
- Factors that might contribute towards attrition
- unhealthy work culture
- overtime
- age
- We can create non-linear decision boundaries using decision trees
- Entropy
- Measure impurity of a node
- Measure the level of heterogeneity
- Entropy of a node for binary classification
- h(y) = −[ p(1)·log2 p(1) + p(0)·log2 p(0) ] = −[ p·log2(p) + (1 − p)·log2(1 − p) ]
- We will always go for the split that gives maximum information gain
- For further splits we will keep on repeating the same idea: Go for the split with maximum Info gain.
- Keep on doing this until you get pure nodes
- Can we have features with more than 2 categories
- yes: fi -> a, b, c
- how do we decide on split
- by maximizing information gain at each split
- Gini Impurity
- Serves the same purpose as entropy but is much quicker to compute
- GI = 1 − Σ p(yi)²
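- Both impurity measures in NumPy, for a vector of class probabilities (function names are illustrative):

```python
import numpy as np

def entropy(p):
    """h(y) = -sum(p_i * log2(p_i)); 0 for a pure node, 1 for a 50/50 binary node."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    """GI = 1 - sum(p_i^2); 0 for a pure node, 0.5 for a 50/50 binary node."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # 1.0, 0.5 -> maximally impure binary node
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # 0.0, 0.0 -> pure node
```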
- Splitting Numerical features
- By threshold
- Sort the values in ascending order
- Starting from the top, take each value as a candidate threshold
- Find the information gain for each threshold
- Pick the threshold with the maximum information gain and split the feature on it
- Underfit vs Overfit
- shallow tree -> underfit
- deep tree -> overfit
- Goal is to find appropriate depth
- Decision Stump
- tree with depth 1
- Pruning -> removing unwanted branches from decision tree
- Hyperparameter Tuning
- How to decide when to stop decision tree
- max depth
- min-sample-leaf
- min no of samples a leaf can have
- max-leaf-nodes
- Encoding categorical features
- Generally, encoding of categorical features is required
- Target encoding is recommended if a feature has too many unique values
- pincode: ~1 lakh unique values
- Target encoding -> binning of the feature (by the target statistic)
- feature Scaling and Data Imbalance
- Data Imbalance
- feature Importance
- Decision Tree Regression
- Ensemble models
- Combine multiple models into a group
- Take multiple ML models (base learners) that are as different from each other as possible
- Combine the models to generate final predictions
- Bagging (-> Random Forests) - Bootstrap Aggregation
- We will pass a subset of datapoints and subset of features to each model
- Bootstrap -> sampling with replacement
- Random forest -> combining multiple decision trees
- combining models will reduce variance
- Base learners are Decision Trees with low bias and high variance
- OOB - Out Of Bag points - can be used to check performance of RF model
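- A short scikit-learn sketch of a Random Forest with the OOB score mentioned above (the built-in breast cancer dataset and the hyperparameter values are used purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,      # number of base decision trees (K)
    max_features="sqrt",   # column sampling at each split
    oob_score=True,        # evaluate on the out-of-bag points
    random_state=0,
)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
```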
- Boosting
- Stacking
- Cascading
- Bias - Variance Tradeoff
- k = no of base learners/DT
- if K increases, variance decreases and vice versa
- Error = Bias ^2 + Variance + Irreducible error
- if K increases, Variance and error decreases
- Row sampling rate (RSR = rows sampled per tree / total rows)
- If RSR increases, variance will decrease, and vice versa
- Column sampling rate (CSR = features sampled per tree / total features)
- If CSR increases, variance will decrease, and vice versa
- max depth
- If max depth increases, bias decreases and variance increases, and vice versa
- Ensemble technique
- combining various base models
- Base learners - High bias and low variance decision trees
- underfit model
- additive combining to reduce bias
- Boosting -> sequential modeling
- high bias means high error
- Why boosting
- RF can be easily parallelized
- we cannot parallelize Gradient boosting
- we can minimize loss using Gradient boosting
- In boosting we are reducing residuals at each step
- Indirectly minimizing the loss function at each step
- Let's define a custom loss function -> L
- Compute the pseudo residual -> −∂L/∂ŷi (the negative gradient of the loss w.r.t. the prediction)
- Reduce the pseudo residual in each iteration
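- A small sketch of this residual-fitting idea with scikit-learn's GradientBoostingRegressor (recent scikit-learn assumed; the synthetic data and hyperparameter values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

gbdt = GradientBoostingRegressor(
    n_estimators=300,      # shallow trees fitted sequentially on pseudo residuals
    max_depth=3,           # high-bias, low-variance base learners
    learning_rate=0.05,    # shrinks each tree's contribution
    loss="squared_error",  # for this loss the pseudo residual is simply (y - y_hat)
)
gbdt.fit(X_tr, y_tr)
print("test MSE:", mean_squared_error(y_te, gbdt.predict(X_te)))
```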
- Reduce False negative by Improving Recall
- Reduce False Positive by Improving precision
- GBDT implementation
- Bias variance
- Stochastic Gradient Descent
- Huber loss
- EMG signal classification case study
- Ensemble techniques
- bagging
- boosting
- stacking
- cascading
- XG Boost -> Extreme Gradient Boosting
- Light GBM
- GOSS - Gradient-based One-Side Sampling
- GOSS is a sampling technique and EFB (Exclusive Feature Bundling) is dimensionality reduction
- Stacking
- Mainly used in competitions (e.g., Kaggle)
- Cascading
- Naive Bayes -> a classification algorithm, used especially for text classification
- Sentiment analysis
- Positive or Negative feedback
- Spam vs Ham
- Text classification -> Natural Language processing (NLP)
- Transformers (98.5% to 99%)
- Naive Bayes (80% - 90%)
- Naive Bayes can be used for Multiclass classification as well.
- Data Preprocessing
- Lowercase
- Tokenizing the text
- Welcome to the movie -> "welcome" "to" "the" "movie"
- Remove all the stop words
- remove all special characters and punctuation
- lemmatization
- change, changing, changed -> change
- Naive Assumption
- The occurrence of each unique word in a sentence is assumed independent of the other words
- recap -> preprocessing -> vectorization -> bag of words/count of word in document
- Bag of words -> 1 if word is present else 0
- count of words -> count of respective word in that message
- Laplace smoothing
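- A CountVectorizer + MultinomialNB sketch tying the preprocessing, bag-of-words and Laplace smoothing ideas together (the tiny corpus is invented; alpha is the Laplace smoothing constant):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free lottery prize now",
    "limited offer win cash",
    "meeting scheduled for monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),  # tokenize + count of words
    MultinomialNB(alpha=1.0),                               # alpha = Laplace smoothing
)
model.fit(texts, labels)
print(model.predict(["free cash offer", "monday report review"]))   # expected [1 0]
```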
- Hyperparameter tuning
- Impact of imbalance
- underflow problems
- feature interpretability
- Impact of outliers
- Words that have a very low occurrence in the training dataset
- We can set a minimum-count threshold for all words (e.g., keep only words with count > 10)
ML: SVM-1 (Support Vector Machine)
- Introduction
- concept of margins
- kernel functions
- Hard Margin SVM
- Goal 1: Maximize the margin
- Goal 2: There should be no misclassified points
- Soft Margin SVM
- Hyperparameter C
- Mathematics
- Comparison with log loss
- Data Imbalance
- Primal dual Equivalence
- Support Vectors
- Kernel SVM
- Polynomial kernel
- RBF Kernel
- SVM Code
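- A minimal SVC sketch covering the C hyperparameter and the RBF kernel listed above (iris data and the specific values are used only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# larger C -> harder margin (fewer misclassified points allowed), smaller C -> softer margin
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_tr, y_tr)
print("test accuracy:", svm.score(X_te, y_te))
```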
- Customer Segmentation case study
- Machine Learning Tasks
- Regression, Classification, Clustering, Recommendation & Timeseries
- When the target variable takes values from a specific set of outputs, e.g., {0, 1}, it is a Classification problem
- When the target variable can take any real number as output, it is a Regression problem
- When there is NO target variable, it is Unsupervised learning
- Clustering
- Group sample/data points on the basis of Similarity/distance
- Similarity/distance can be calculated using
- Euclidean
- Manhattan
- Minkowski
- Cosine Similarity
- Kernel functions
- Intra and inter cluster distance
- intra distance calculation methods
- distance from the centroid to the farthest point
- distance between the 2 farthest points
- inter cluster distances
- distance between centroids
- distance between farthest points
- distance between nearest points
- Ideal values for Intra and Inter cluster distances
- low intra cluster distances
- High inter cluster distances
- what metric to evaluate clustering
- should make business sense
- Technical way is Dunn Index -> low intra and high inter cluster distances
- Dunn Index
- Dunn Index = min over (i, j) of inter-cluster distance(i, j) / max over k of intra-cluster distance(k) -> the smallest distance between any two clusters divided by the largest intra-cluster distance
- we want to maximize Dunn Index
- KMeans Intro
- Simple, Popular and Baseline
- Each cluster will have its unique centroid
- We can identify the clusters using centroid
- centroid is mean/avg of all data points inside the cluster
- Core Idea
- K centroids; each data point is assigned to the centroid that is closest to it
- Each centroid represents a cluster
- Task
- Find K centroids representing K clusters
- Optimization problem - NP-Hard
- Approx algo/heuristic algo
- Lloyd's algo
- KMeans Mathematical Foundation
- Lloyds Algo
- Initialization: Pick K pts randomly
- Assignment: For each xi in D, select the nearest centroid cj, add xi to sj
- Update the centroids
- Repeat steps 2 & 3 until the centroids don't change
- Determining K
- WCSS (Within cluster sum of squares)
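- scikit-learn's KMeans exposes WCSS as inertia_; a sketch of the elbow approach for choosing K (blob data generated only for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia_ = WCSS; look for the "elbow" as k grows
```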
- Initialization
- Bad initialization leads to suboptimal clustering
- If we initialize centroids too close to each other, it will take more time/iterations to reach convergence
- Brute force to address sub optimal clustering
- do the initialization multiple times (say 100 times)
- Idea 1: Take points as far away as possible
- Idea 2: Pick the next centroid with probability proportional to distance, so higher distance means higher probability (this is the K-Means++ initialization)
- Issue with K Means and K Medians
- Centroids identify each cluster
- Stock Portfolio case study
- Hierarchical Clustering intro
- Agglomerative Clustering
- Start with data points and eventually build a single cluster
- combining data points
- bottom up approach
- Dendrogram -> the tree that is formed with the above approach
- Divisive Clustering
- Starts with all the data as a single cluster and breaks into smaller clusters
- Top down approach
- Proximity Matrix
- Implementation - SciPy & scikit-learn
- Limitations
- computationally expensive O(n^2)
- Introduction
- Another clustering algorithm (GMM - Gaussian Mixture Models)
- K-Means and Hierarchical clustering, where each point belongs to exactly one cluster, are hard clustering; GMM does soft clustering
- Multi dimensional GMM
- Expectation maximization
- initialize mean and stddev values
- for each xi, calculate the probability of it belonging to c1, c2
- update mu1 = Σ(pi · xi) / Σ pi (probability-weighted mean), and similarly for the other parameters
- GMM algo
- d features, k clusters
- initialize: initialize mu and sigma for all clusters
- expectation: calculate the probability of each xi belonging to the jth cluster
- maximization: update mu and sigma for each cluster using these probabilities
- end of updates: stop when the Gaussian parameters converge
- sklearn implementation
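- The sklearn implementation referred to above is GaussianMixture; a minimal soft-clustering sketch (blob data for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(X)                                # EM: expectation + maximization until convergence
print(gmm.means_)                         # one mean (mu) per Gaussian
print(gmm.predict_proba(X[:3]).round(3))  # soft assignment: probability per cluster
```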
- online vs offline algorithm
- Introduction to DBScan
- DBScan algo
- DBSCAN -> Density-Based Spatial Clustering of Applications with Noise
- Core Points
- Border Points
- Noise Points
- Border points can have core points as neighbours, but noise points don't have core points as neighbours
- Hyperparameter tuning
- Pros and Cons
- Introduction to anomaly detection
- Introduction
- RANSAC - Random Sample Consensus
- Elliptic Envelope
- FastMCD -> Fast Minimum Covariance Determinant
- Isolation Forest
- inliers will have deeper nodes
- Local Outlier Factor
- the distance to the k nearest neighbours is smaller for inliers
- the density around inliers is higher than around outliers
- LOF score
- one class svm
- comparing different outlier detection methods
- motivations for high-dimensional visualization
- principal component analysis
- Principal components are orthogonal to each other
- PCA math
- PCA Scratch impl
- sklearn pca impl
- Digits dataset
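- The sklearn PCA implementation on the digits dataset mentioned above (2 components chosen just for a 2-D view):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)          # 64-dimensional images of digits
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                    # project onto 2 orthogonal principal components
X_2d = pca.fit_transform(X_scaled)
print(X_2d.shape)                            # (1797, 2)
print(pca.explained_variance_ratio_)         # fraction of variance captured by each component
```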
- t-SNE intro
- Mean/Median
- Backfill and forward fill
- Forward fill - Fill value with previous value
- Backfill - Fill value with next value
- Linear Imputation
- avg of previous and next values
- Anomalies
- exceptions
- data entry error -> correct it
- a correct entry but a one-time event -> change it, or
- treat it as a missing value and use linear imputation
- Time series = Trend + Seasonality + residuals
- how to figure out trend
- moving avg with window size 3
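- A pandas sketch of extracting the trend with a moving average of window size 3 (the series values are invented):

```python
import pandas as pd

sales = pd.Series([10, 12, 14, 13, 17, 20, 19, 23, 25, 24])

trend = sales.rolling(window=3, center=True).mean()   # moving average, window size 3
detrended = sales - trend                             # seasonality + residual remain
print(pd.DataFrame({"sales": sales, "trend": trend, "detrended": detrended}))
```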
- Mobiplus case study
- Missing Values
- Anomalies
- Breaking down a time series
- Additive and Multiplicative Seasonality
- Additive Time Series = Trend + Seasonality + residual
- Multiplicative Time Series = Trend * Seasonality * residual
- Decomposition from Scratch
- Generalizing forecast methodology
- Simple forecasting methods
- Mean/Median
- Naive Forecasting
- Seasonal forecast
- Drift method
- Simple exponential smoothing
- Smoothing methods for forecasting
- Moving avg forecasting
- Simple exponential smoothing
- Double exponential smoothing (Holt's method)
- Triple exponential smoothing (Holt-Winters)
- concept of stationarity
- Stationarity
- Auto correlation and partial auto-correlation function
- Auto regression model
- Moving avg model
- ARMA model
- ARIMA model
- SARIMA model
- ARIMA model family
- AR
- MA
- ARMA
- ARIMA
- ARIMA model
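- A minimal statsmodels ARIMA sketch (the synthetic series and the (p, d, q) order are purely illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(0.5, 1.0, size=200)))   # synthetic trending series

model = ARIMA(y, order=(1, 1, 1))   # p=1 (AR terms), d=1 (differencing), q=1 (MA terms)
fit = model.fit()
print(fit.forecast(steps=5))        # point forecasts for the next 5 steps
```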
- Ranged Estimates - Confidence interval
- Change Point
- Exogeneous Variables
- Forecasting using Lin Reg
- Facebook prophet
- Walmart case study
- Apriori Algorithm
- Market basket analysis
- Identify item sets that occur together very frequently
- Association Rules
- Association Metrics
- Introduction and formulation
- collaborative filtering
- Content based filtering
- Recommendation as Regression/Classification
- Matrix factorization
- Principal component analysis
- Singular value decomposition
Business Case: AdEase review
Git & GitHub: Setup for MLOps
Building Cars24 ML tool using Streamlit
Develop Web APIs using Flask
Containerization - Docker & DockerHub
Deploying APIs on AWS using ECS
GitHub Actions - Setting up CI pipelines
GitHub Actions - Setting up CD pipelines
Business Case:Zee Review
Experiment Tracking & Data Management using MLFlow
ML System Design - 1
ML System Design - 2
Building ML pipelines with AWS Sagemaker
Processing large scale data using Apache Spark
- Use Cases
- Auto Suggest -> (Search engine -> predict suffix )
- AD on search result page
- Compressed representation
- Image segmentation (computer vision)
- Why NN
- Automate feature extraction (no manual feature engineering)
- Perform really well with high dimensional data
- NN
- Input -> process -> output
- dendrites -> cell body -> axon
- thicker dendrites are more important
- Task: whether to touch an object or not
- Object dimension, Probable temperature, known or unknown object -> Neuron(computation) -> touch or not
- Summarizing Biological Neuron
- Takes Input -> perform some processing -> fires output to other neurons
- inputs are called features
- Every input is associated with weight
- It is said that Neural networks are logistic regression on steroids
- Neuron = Linear + Activation
logistic regression unit = Neuron
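- The "linear + activation" neuron above in a few lines of NumPy (the weights and inputs are made up):

```python
import numpy as np

def neuron(x, w, b):
    """A single neuron: linear combination of inputs followed by a sigmoid activation."""
    z = np.dot(w, x) + b               # linear part: w1*x1 + ... + wd*xd + w0
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid activation -> a logistic regression unit

x = np.array([0.5, 1.2, -0.3])   # input features
w = np.array([0.4, -0.6, 0.9])   # one weight per input (a "thicker dendrite" = larger weight)
b = 0.1
print(neuron(x, w, b))           # output in (0, 1)
```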
N-Layer Neural Network - 1
N-Layer Neural Network - 2
N-Layer BackPropagation
Tensorflow and Keras - 1
Tensorflow and Keras - 2
Optimizers for NNs
Hyper Parameter Tuning for NNs
Autoencoders
Practical aspects of designing MLPs and debugging
Model interpretability: LIME
DSML Module Test: Neural Networks
NN : Model interpretability: LIME Contd.
Neural Networks - No Class Day
Introduction to Computer Vision(CNN)
Revisiting CNN: Deal with Overfitting
CNN under the hood
Introduction to Transfer Learning
Image similarity: Understanding Embeddings
CNN for medical diagnosis
Object Localisation and Detection -1
Object Localisation and Detection -2
Object Segmentation
Business Case: Porter review
Object Segmentation Contd.
Object Segmentation(contd) and Siamese Net
Generative Models & GANs Introduction
- NLP -> Making machines understand Text
- Areas in which NLP can be used
- Information Retrieval (IR) / NLP -> Language models -> LLMs
- Language Modeling : Predicting next word in sentence
- Spelling correction
- Keyword-based information retrieval
- Topic modeling
- Text classification
- Information extraction
- Closed domain conversational agent
- Text Summarization
- inshorts
- Creating automated abstracts
- Question Answering
- machine translation
- Open domain Conversational Agent
- Tokenization -> Sentences to Words/Tokens
- Split
- regular expressions
- word_tokenize from NLTK
- Reduce the dataset size by removing unwanted tokens
- remove hashtags and hyperlinks before tokenization
- remove stopwords and punctuations
- Normalize
- Stemming vs lemmatization
- Embeddings -> Extract meaningful features from tokens
- Vocabulary -> set of unique words
- Text => Sentences -> words
- Case Study
- handle text data
- Logistic regression model
- Confusion matrix
- TF-IDF
- n-gram
- Why do we capture context
- To extract semantic and syntactic info
- co-occurrence matrix
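- A small scikit-learn sketch of TF-IDF with unigrams and bigrams, tying together the TF-IDF and n-gram bullets above (recent scikit-learn assumed; the corpus is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
    "great acting and a great story",
]

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")  # unigrams + bigrams
X = vec.fit_transform(corpus)        # sparse matrix: documents x n-gram features
print(vec.get_feature_names_out())   # the learned vocabulary
print(X.toarray().round(2))
```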
NLP : BERT