Machine Learning


  • ML enables computers to learn without explicit programming
  • ML vs Classical Programming
    • Software Engineer
      • Hand-write rules and, based on those rules, decide whether a mail is spam or not
        • Challenges
          • Rigid
          • Lot of hard coding
    • Machine Learning
      • We want the computer to learn the patterns for us
      • Feed training/sample data to an algorithm
      • Idea is that the algorithm will identify the patterns
      • It will create a mechanism (hypothesis)
      • When a new email is given to algorithm it will identify spam/non-spam
  • Basic ML pipeline
    • Input: Training data(Text, Numerical)
    • Which algorithm to use
    • Trained model as output(also called hypothesis)
  • What is machine Learning
    • A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
      • E: Experience -> Training Data
      • T: Task -> eg: classify email as spam or not
      • P: Performance -> Quantitative measure of how model is performing
  • Task: Predicting Stock Prices
    • Experience: Historical data of prices
    • Performance: differences of actual vs predicted prices
  • Categorize images into cat, dog, rabbit
    • Experience: Labelled images of cat, dog and rabbit
    • Performance: +ve marks for correct prediction and -ve marks for incorrect prediction
  • Training Data
    • Preprocessing(Numpy and Pandas)
      • Cleaning
      • Removing outliers
      • Transforming
  • Types of tasks
    • Classification
      • Labelled data is given for training
      • The trained model then labels a new data point
    • Regression (predict a real, continuous value)
      • e.g., predict the price of a house from data/features about the house
    • Clustering
      • Customer data -> last purchase made, frequency of purchase, products checked out, 

  • Linear regression Intuition - single variable
    • Linear regression tries to fit "Best fit line" through the data
    • Mean Absolute error
    • Mean squared error
    • Draw graph between w1 and loss to get the minima
  • Linear regression Intuition - multi variable
    • Age, km driven & price
  • Model interpretability
  • Gradient descent revision
    • x := x - η·dy/dx (η is the learning rate)
    • keep repeating until x converges
  • Mathematical implementation
    • ŷ = w1·x1 + w2·x2 + ... + wd·xd + w0
    • loss = mean squared error
    • Goal is minimize loss by updating weights and bias using gradient descent
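A minimal NumPy sketch of this update rule on synthetic data (illustrative only, not from the class), showing ŷ = Xw + w0 and the MSE-gradient updates:

```python
import numpy as np

# Synthetic data: 100 samples, 2 features (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

w = np.zeros(2)   # weights w1..wd
b = 0.0           # bias w0
lr = 0.1          # learning rate (eta)

for _ in range(500):
    y_hat = X @ w + b                  # y_hat = w1*x1 + ... + wd*xd + w0
    error = y_hat - y
    grad_w = 2 * X.T @ error / len(y)  # d(MSE)/dw
    grad_b = 2 * error.mean()          # d(MSE)/dw0
    w -= lr * grad_w                   # w := w - eta * gradient
    b -= lr * grad_b

print(w, b)  # should land close to [3, -2] and 5
```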

  • Feature engineering
  • outlier treatment
  • scaling
  • Assumptions of Linear Regressions
    • Linearity
    • Multi-Collinearity
      • High correlation among the features
      • Interpretability of the model will be lost
      • Calculate the VIF score for every feature
      • Find the feature with the highest VIF score
      • Drop that feature
      • Repeat until all remaining features have a VIF score under 10 (see the sketch after this list)
    • Normality of residual
    • Heteroskedasticity

    • Auto correlation
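A small sketch of the VIF loop described above, assuming a pandas DataFrame X of numeric features and using statsmodels' variance_inflation_factor:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the feature with the highest VIF until all VIFs are under the threshold."""
    # (a constant column is often added via statsmodels add_constant before computing VIF)
    X = X.copy()
    while True:
        vif = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vif.max() < threshold:
            return X
        X = X.drop(columns=[vif.idxmax()])  # drop the worst offender and recompute
```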

  • Types of gradient descent
    • Batch Gradient descent
    • Mini batch GD
    • Stochastic - 
      • only one single data point is used in each iteration
      • There will be lot of fluctuations in loss until we reach minima
    • 1 epoch -> one full pass of the entire dataset through the optimization
  • Polynomial regression

  • underfitting and overfitting
  • Bias & Variance tradeoff
    • Bias -> systematic error between prediction and truth (yi - ŷi)
    • Variance -> spread of the model's predictions across training sets -> less consistency
  • Regularization(L2) -> Ridge regression
    • Complex model with degree  4 - how to reduce complexity
      • Reduce influence of higher degree terms
        • Reduce weights associated to these functions
      • new loss function
        • loss = (1/n)·Σ(yi - ŷi)^2 + λ·Σ(wj)^2
        • λ·Σ(wj)^2 is the regularization term
  • Regularization(L1)
    • loss = (1/n)·Σ(yi - ŷi)^2 + λ·Σ|wj|
  • Ridge, Lasso & ElasticNet regression
    • loss = (1/n)·Σ(yi - ŷi)^2 + λ·(c·Σ(wj)^2 + (1-c)·Σ|wj|)
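A hedged sklearn sketch of the three penalties on synthetic data; alpha plays the role of λ, and ElasticNet's l1_ratio plays the role of the mixing constant (sklearn applies it to the L1 term):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty (shrinks weights)
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty (drives some weights to 0)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print(ridge.coef_, lasso.coef_, enet.coef_)
```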
  • High bias -> should we increase or decrease the complexity of the model? -> increase it - this is the underfitting case
  • High variance -> should we increase or decrease the complexity of the model? -> decrease it, as the model has learnt from noise - this is the overfitting case
  • Hyperparameter tuning
    • Parameters are learnt by the ML algorithm
    • Hyperparameters are set by engineers through experimentation
      • degree of polynomial features -> test with various degrees and identify the best one
      • regularization constant (lambda) -> 

  • cross validation
  • k-fold cross validation
    • Divide the training data into multiple chunks (e.g., 4 chunks)
    • Use one chunk for validation and the other 3 for training, cycling through, so that every chunk serves in both the train and validation roles (see the sketch below)
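A minimal sklearn sketch of 4-fold cross validation on synthetic regression data (model and dataset are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)

# cv=4 -> 4 chunks; each chunk is the validation set exactly once
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=4, scoring="r2")
print(scores, scores.mean())
```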

  • intro to logistic regression
    • Case Study: AT & T customer churn/attrition
      • Features -> #customer care calls, account length, discount ::: Target -> churn
      • Binary Classification problem
      • Supervised learning problem
      • yi belongs to {0,1}
      • In Linear Regressions, 
        • target can take any Real number value
        • It fits best fit line thru training data
        • It is learning the best fit line by minimizing loss(MSE)
      • In Logistic regression - it is best separating line
        • hyperplane = w1x1 + w2x2+... wdxd +w0 
        • Goal: find the wi and w0 that give the best separating decision boundary
      • Thresholding or step function is not differentiable at x=0
      • Sigmoid function
        • 1/(1+e^-z)
        • output approaches 1 for large positive z and 0 for large negative z
        • range is (0, 1)
  • sigmoid function
ML: Logistic Regression-2 -> Classification algorithm
  • Geometric interpretation of sigmoid

  • Maximum likelihood estimation

  • Optimization
  • Sklearn implementation
  • Accuracy metric
  • Hyperparameter tuning
  • log loss
  • Why not MSE(Mean Squared Error)
    • We want our loss function to be convex function(Single minima)


  • log odds
    • odds = p(win)/p(loss)
    • the odds of CSK winning are 4:1
    • higher odds imply a higher chance of winning
    • probability of belonging to class 1 / probability of belonging to class 0 -> p/(1-p)

  • impact of outliers
    • two types
      • outliers on correct side
      • outliers on incorrect side
  • multiclass classification
    • Train 3 separate one-vs-rest logistic regression models; the class with the highest probability is taken
      • M1: classifies only one target p(Orange)
      • p(apple)
      • p(grape)
  • confusion matrix
  • precision
  • recall
  • f1 score
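A small sklearn sketch of these four metrics on hypothetical labels and predictions:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical churn labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # hypothetical model predictions

print(confusion_matrix(y_true, y_pred))    # rows = actual, columns = predicted
print(precision_score(y_true, y_pred))     # TP / (TP + FP)
print(recall_score(y_true, y_pred))        # TP / (TP + FN)
print(f1_score(y_true, y_pred))            # harmonic mean of precision and recall
```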
ML: CM contd. + Imbalanced Data


Business Case: Jamboree Review


  • Classification vs Regression problem
    • A classification problem is one where the target is discrete rather than a real value, e.g., categorizing shops
    • A regression problem has a real/continuous-valued target
  • Steps
    • Calculate the Euclidean distance from the new point to all data points
    • Sort the data by distance (keep the class label alongside)
    • Take the top k points with minimum distance
    • Take the majority vote of their classes (see the sketch below)
  • KNN is a non-parametric, lazy algorithm
    • No training phase - all the work happens at prediction time, which makes prediction slow
  • Assumption:
    • All data points in a neighbourhood are similar, i.e., the data is locally homogeneous
  • KNN handles multi-class problems natively, which is an advantage over plain (binary) logistic regression
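A minimal from-scratch sketch of the KNN steps listed above (toy data, Euclidean distance, majority vote); assumes NumPy only:

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote of its k nearest neighbours (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest])                    # count class labels among them
    return votes.most_common(1)[0][0]                    # majority class

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> 1
```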



  • sklearn implementation
  • bias variance of knn
    • when k is very small 
      •  it will have high variance & low bias
      • it learns noise in training data
      • model is overfit
    • when k is very large 
      • underfit
    • How to identify the correct value of K
      • Hyperparameter tuning
  • Hyperparameter tuning
  • Time complexity of knn
    • O(1)  -> no training
  • distance matrices
    • Time complexity
      • n data points with d features
      • O(d) time for the distance between 2 points
      • O(nd) for distances to all n data points
      • sort -> O(n log n)
      • pick the k nearest neighbours -> O(k)
      • Time complexity -> O(nd + n log n)
    • Space complexity
      • O(n(d+1))
  • When dimensionality is very high, use cosine similarity
    • cosine similarity = (x1 · x2) / (||x1||·||x2||)
  • KNN in real world
    • Direct application in real world is impractical
    • LSH - Locality sensitive Hashing
  • knn Imputation
    • Find the K nearest neighbours of the row whose value for feature fi is missing
    • Compute the mean of fi over those neighbours
    • Impute the missing value with the computed mean (see the sketch below)
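A small sketch of the same idea using sklearn's KNNImputer on a toy array (the column values are made up):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [25.0, 50_000.0],
    [27.0, np.nan],      # missing salary to be imputed
    [26.0, 52_000.0],
    [45.0, 90_000.0],
])

imputer = KNNImputer(n_neighbors=2)  # average the feature over the 2 nearest rows
print(imputer.fit_transform(X))
```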

  • Cons of attrition
    • new employee might ask for more compensation
    • Training
    • Time and resources in recruiting
  • Chances of attrition for an employee
    • identify key factors for attrition
  • Factors that might contribute towards attrition
    • unhealthy work culture
    • overtime
    • age
  • We can create non-linear decision boundaries using decision trees
  • Entropy
    • Measure impurity of a node
    • Measure the level of heterogeneity
    • Entropy of a node for binary classification
      • h(y) = -[p(1)·log2 p(1) + p(0)·log2 p(0)] = -[p·log2(p) + (1-p)·log2(1-p)]
    • We will always go for the split that gives max info gain
      • For further splits we will keep on repeating the same idea: Go for the split with maximum Info gain. 
      • Keep on doing this until you get pure nodes
  • Can we have features with more than 2 categories
    • yes: fi -> a, b, c

  • how do we decide on split
    • by maximizing information gain at each split
  • Gini Impurity
    • Serves the same purpose as entropy but is faster to compute (no logarithm)
    • GI = 1 - Σ p(yi)^2
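A minimal sketch of entropy and Gini impurity for a binary node, just to see that both peak at p = 0.5 (most impure) and vanish at pure nodes:

```python
import numpy as np

def entropy(p):
    """Entropy of a binary node with P(class 1) = p."""
    if p in (0.0, 1.0):
        return 0.0                      # pure node
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def gini(p):
    """Gini impurity of a binary node with P(class 1) = p."""
    return 1.0 - (p ** 2 + (1 - p) ** 2)

for p in (0.0, 0.1, 0.5, 0.9):
    print(p, entropy(p), gini(p))
```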
  • Splitting Numerical features
    • By threshold
      • Sort in ascending order
      • Going from the top, take each value as a candidate threshold
      • Find the information gain for each threshold
      • Split the feature on the threshold with maximum info gain
  • Underfit vs Overfit
    • shallow tree -> underfit 
    • deep tree -> overfit
    • Goal is to find appropriate depth
    • Decision Stump
      • tree with depth 1
    •  Pruning -> removing unwanted branches from decision tree
    • Hyperparameter Tuning
      • How to decide when to stop decision tree
        • max depth
      • min-sample-leaf
        • min no of samples a leaf can have
      • max-leaf-nodes
    • Encoding categorical features
      • Generally encoding is required 
      • Target encoding is recommended when a feature has too many unique values
        • e.g., pincode: ~1 lakh unique values
        • Target encoding -> binning of the feature
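A hedged sketch of tuning the decision-tree hyperparameters above with a grid search (dataset and grid values are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

params = {
    "max_depth": [2, 4, 6, None],     # shallow -> underfit, deep -> overfit
    "min_samples_leaf": [1, 5, 20],   # minimum number of samples a leaf can have
    "max_leaf_nodes": [None, 10, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=4)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```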
  • feature Scaling and Data Imbalance
    • Data Imbalance

  • feature Importance

  • Decision Tree Regression

  • Ensemble models
    • Group multiple models together
    • It takes multiple ML models(Base learners) which are as different as possible
    • Combine the models to generate final predictions
      • Bagging (-> Random Forests) - Bootstrap Aggregation
        • We pass a subset of data points and a subset of features to each model
        • Bootstrap -> sampling with replacement
        • Random forest -> combining multiple decision trees 
        • combining models will reduce variance
        • Base learners are Decision Trees with low bias and high variance
        • OOB - Out Of Bag points - can be used to check performance of RF model
      • Boosting
      • Stacking
      • Cascading
    • Bias - Variance Tradeoff
      • k = no of base learners/DT
      • if K increases, variance decreases and vice versa
      • Error = Bias ^2 + Variance + Irreducible error
      • if K increases, Variance and error decreases
    • Row sampling rate (m/n)
      • If RSR increases, variance will decrease and vice versa
    • Column sampling rate (m/n)
      • If CSR increases, variance will decrease and vice versa
    • max depth
      • If max depth increases, bias decreases but variance increases, and vice versa
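A sketch of how these knobs map onto sklearn's RandomForestClassifier on a toy dataset: n_estimators ≈ K, max_samples ≈ row sampling rate, max_features ≈ column sampling rate, and oob_score uses the Out-Of-Bag points:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,   # K base learners (decision trees)
    max_samples=0.8,    # row sampling rate (bootstrap samples)
    max_features=0.5,   # column sampling rate considered per split
    max_depth=None,     # deep, low-bias / high-variance base learners
    oob_score=True,     # evaluate on Out-Of-Bag points
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)    # OOB estimate of generalization accuracy
```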
  • Ensemble technique 
  • combining various base models
  • Base learners - High bias and low variance decision trees
    • underfit model
    • additive combining to reduce bias
  • Boosting -> sequential modeling
  • high bias means high error
  • Why boosting
    • RF can be easily parallelized
    • we cannot parallelize Gradient boosting
    • we can minimize loss using Gradient boosting
    • In boosting we are reducing residuals at each step
    • Indirectly minimizing the loss function at each step
    • Let's define a custom loss function L
    • Compute the pseudo-residual -> -∂L/∂ŷi (the negative gradient of the loss w.r.t. the current prediction)
    • Fit the next learner to the pseudo-residuals in each iteration to reduce them (see the sketch below)
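A minimal from-scratch sketch of the boosting loop for squared loss on synthetic data: each shallow tree is fit to the current residuals (the negative gradient, up to a constant) and added with a learning rate. Illustrative only, not the class implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

lr = 0.1                              # learning rate (shrinkage)
pred = np.full_like(y, y.mean())      # start from a constant prediction
trees = []

for _ in range(100):                  # sequential modeling
    residuals = y - pred              # pseudo-residuals for squared loss
    tree = DecisionTreeRegressor(max_depth=2)  # high-bias, low-variance base learner
    tree.fit(X, residuals)
    pred += lr * tree.predict(X)      # additive combining reduces bias
    trees.append(tree)

print(np.mean((y - pred) ** 2))       # training MSE shrinks as trees are added
```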
  • Reduce False negative by Improving Recall
  • Reduce False Positive by Improving precision

  • GBDT implementation
  • Bias variance 
  • Stochastic Gradient descent
  • Huber loss

  • EMG signal classification case study

  • Ensemble techniques
    • bagging
    • boosting
    • stacking
    • cascading
  • XG Boost -> Extreme Gradient Boosting

  • Light GBM
    • Gradient based one side sampling
    • GOSS (Gradient-based One-Side Sampling) is sampling and EFB (Exclusive Feature Bundling) is dimensionality reduction

  • Stacking 
    • Mainly used for competition -kaggle

  • Cascading
  • Naive Bayes is a classification algorithm - used especially for text classification
    • Sentiment analysis 
      • Positive or Negative feedback
      • Spam vs Ham
    • Text classification -> Natural Language processing (NLP)
      • Transformers (98.5% to 99%)
      • Naive Bayes (80% - 90%)
    • Naive Bayes can be used for Multiclass classification as well.
  • Data Preprocessing
    • Lowercase
    • Tokenizing the text
      • Welcome to the movie -> "welcome" "to" "the" "movie"
    • Remove all the stop words
    • remove all special characters and punctuation
    • lemmatization
      • change, changing, changed -> change
  • Naive Assumption
    • Occurrence of individual unique words in a sentence is independent from other words
  • recap -> preprocessing -> vectorization -> bag of words/count of word in document
    • Bag of words -> 1 if word is present else 0
    • count of words -> count of respective word in that message
  • laplace smoothing
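A small sklearn sketch of the whole recap (vectorize counts, then Multinomial Naive Bayes; alpha=1.0 is Laplace smoothing) on toy spam/ham texts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10 am tomorrow",
         "free lottery ticket win big", "project status update attached"]
labels = [1, 0, 1, 0]                  # 1 = spam, 0 = ham (toy data)

model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),  # tokenize + count of words
    MultinomialNB(alpha=1.0),                               # alpha=1.0 -> Laplace smoothing
)
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))             # expected: [1] (spam)
```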

  • Hyperparameter tuning
  • Impact of imbalance
  • underflow problems
  • feature interpretability
  • Impact of outliers
    • Words that have very low occurrence in Training dataset
    • We can set threshold for all words (count>10)


ML: SVM-1 (Support Vector Machine) 
  • Introduction
    • concept of margins
    • kernel functions
  • Hard Margin SVM
    • Goal 1: Maximize the margin
    • Goal 2: There should be no misclassified points
  • Soft Margin SVM

  • Hyperparameter C

  • Mathematics
  • Comparison with log loss
  • Data Imbalance

  • Primal dual Equivalence
  • Support Vectors
  • Kernel SVM
  • Polynomial kernel
  • RBF Kernel
  • SVM Code
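A hedged sklearn sketch of a soft-margin kernel SVM (dataset, C and kernel values are placeholders; class_weight='balanced' is one common way to handle data imbalance):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# C controls the soft margin: small C -> wider margin, more misclassification allowed
# kernel='rbf' (or 'poly') gives a non-linear kernel SVM
model = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf", class_weight="balanced"))
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```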

  • Customer Segmentation case study
  • Machine Learning Tasks
    • Regression, Classification, Clustering, Recommendation & Timeseries
    • when target variable has specific set of outputs {0,1} then it is Classification Problem
    • when target variable can have any real number as output then it is Regression Problem
    • when NO Target variable then it is Unsupervised learning
  • Clustering
    • Group sample/data points on the basis of Similarity/distance
    • Similarity/distance can be calculated using
      • Euclidean
      • Manhattan
      • Minkowski
      • Cosine Similarity
      • Kernel functions
    • Intra and inter cluster distance
      • intra distance calculation methods
        • distance from the centroid to the farthest point
        • distance between the 2 farthest points
      • inter cluster distances
        • distance between centroids
        • distance between farthest points
        • distance between nearest points
      • Ideal values for Intra and Inter cluster distances
        • low intra cluster distances
        • High inter cluster distances
      • what metric to evaluate clustering
        • should make business sense
        • Technical way is Dunn Index -> low intra and high inter cluster distances
  • Dunn Index
    • Dunn Index = min over (i, j) of inter-cluster distance(i, j) / max over k of intra-cluster distance(k) -> the smallest separation between any two clusters divided by the largest spread within any single cluster
    • we want to maximize Dunn Index
  • KMeans Intro
    • Simple, Popular and Baseline
    • Each cluster will have its unique centroid
    • We can identify the clusters using centroid
    • centroid is mean/avg of all data points inside the cluster
    • Core Idea
      • K centroids; each data point is assigned to the centroid closest to it
      • Each centroid represents a cluster
    • Task
      • Find K centroids representing K clusters
    • Optimization problem - NP-Hard
    • Approx algo/heuristic algo
      • Lloyd's algorithm

  • KMeans Mathematical Foundation
  • Lloyds Algo
    • Initialization: Pick K pts randomly
    • Assignment: For each xi in D, select the nearest centroid cj, add xi to sj
    • Update the centroids
    • Repeat steps 2 & 3 until the centroids don't change
  • Determining K
    • WCSS (Within cluster sum of squares)
  • Initialization
    • Bad initialization leads to suboptimal clustering
    • If we initialize centroids too close to each other, it takes more iterations to reach convergence
  • Brute force to address sub optimal clustering
    • do the initialization multiple times (say 100 times)
  • Idea 1: Take points as far away as possible
  • Idea 2 (K-Means++): Pick the next centroid with probability proportional to distance from the existing centroids - higher distance means higher probability (see the sketch below)
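A small sklearn sketch combining both ideas: k-means++ initialization, multiple restarts (n_init), and WCSS (inertia_) for choosing K on synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Elbow method: WCSS (inertia_) for different K
for k in range(2, 8):
    km = KMeans(
        n_clusters=k,
        init="k-means++",  # distance-proportional initialization (Idea 2)
        n_init=10,         # rerun with 10 different initializations, keep the best
        random_state=0,
    )
    km.fit(X)
    print(k, km.inertia_)  # within-cluster sum of squares; look for the "elbow"
```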
  • Issue with K Means and K Medians
    • Centroids identify each cluster

  • Stock Portfolio case study
  • Hierarchical Clustering intro
  • Agglomerative Clustering
    • Start with data points and eventually build a single cluster
    • combining data points
    • bottom up approach
    • Dendrogram -> the tree that is formed with the above approach
  • Divisive Clustering
    • Starts with all the data as a single cluster and breaks into smaller clusters
    • Top down approach
  • Proximity Matrix
  • Implementation - SciPy & scikit-learn
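A minimal SciPy sketch of agglomerative clustering and its dendrogram on synthetic data (the linkage method is a placeholder choice):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

Z = linkage(X, method="ward")  # bottom-up: merge the closest clusters step by step
dendrogram(Z)                  # the tree built by agglomerative clustering
plt.show()
```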
  • Limitations
    • computationally expensive O(n^2)
  • Introduction
    • GMM (Gaussian Mixture Model) - another clustering algorithm, this time a soft clustering method
    • K-Means and Hierarchical clustering, where one point belongs to exactly one cluster, are hard clustering

  • Multi dimensional GMM
  • Expectation maximization
    • initialize mean and stddev values
    • for each xi, calculate probability of belonging to c1, c2
    • update each mean as the probability-weighted average of the points, e.g., μ1 = Σi pi1·xi / Σi pi1
  • GMM algo
    • d features, k clusters
    • initialize : initialize mu and sigma for all clusters
    • expectation: calculate the probability of each xi belonging to the jth cluster
    • maximization: update μ and σ from those probabilities; stop when the Gaussian parameters converge
  • sklearn implementation
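A small sklearn sketch of GMM soft clustering on synthetic blobs (number of components and covariance type are placeholder choices):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[1.0, 2.0, 0.5], random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                       # EM: expectation + maximization until convergence

print(gmm.means_)                # learned cluster means (mu)
print(gmm.predict_proba(X[:3]))  # soft assignments: P(x_i belongs to cluster j)
```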
  • online vs offline algorithm

  • Introduction to DBScan

  • DBScan algo
    • Density-Based Spatial Clustering of Applications with Noise
      • Core Points
      • Border Points
      • Noise Points
    • Border points have at least one core point as a neighbour, but noise points have no core points as neighbours
  • Hyperparameter tuning
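A minimal sklearn sketch of DBSCAN on synthetic moons data; eps and min_samples are placeholder hyperparameter values, and label -1 marks noise points:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.08, random_state=0)

# eps = neighbourhood radius, min_samples = points needed to be a core point
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))                   # cluster ids; -1 marks noise points
print((db.labels_ == -1).sum(), "noise points")
```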

  • Pros and Cons
  • Introduction to anomaly detection


  • Introduction
    • RANSAC - Random Sample Consensus
  • Elliptic Envelope
    • FastMCD -> Fast Minimum Covariance Determinant
  • Isolation Forest
    • inliers will have deeper nodes
  • Local Outlier Factor
    • the distance to the k nearest neighbours is small for inliers
    • the local density around an inlier is similar to that of its neighbours; outliers have much lower local density

  • Local outlier factor
    • lof score 
  • one class svm
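A hedged sklearn sketch comparing two of these detectors on toy data (contamination and n_neighbors are placeholder values); both return -1 for points flagged as outliers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),    # inliers
               rng.uniform(-6, 6, size=(10, 2))])  # a few scattered outliers

iso = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X)

print("IsolationForest flagged:", (iso == -1).sum())
print("LOF flagged:", (lof == -1).sum())
```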

  • comparing different outlier detection methods
  • motivations for high dimensional visualization
  • principal component analysis
    • Principal components are orthogonal to each other
  • pca maths

  • PCA Math
  • PCA Scratch impl
  • sklearn pca impl
  • Digits dataset
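A small sklearn sketch of PCA on the digits dataset mentioned above, projecting 64 dimensions down to 2 orthogonal principal components:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

pca = PCA(n_components=2)             # 2 orthogonal principal components
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (1797, 2) -> now plottable in 2D
print(pca.explained_variance_ratio_)  # fraction of variance kept by each component
```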
  • t-SNE intro



  • Handling missing values in a time series
    • Mean/Median
    • Backfill and forward fill
      • Forward fill - Fill value with previous value
      • Backfill - Fill value with next value
    • Linear Imputation
      • avg of previous and next values
    • Anomalies
      • exceptions
        • data entry error -> correct it
        • correct entry but a one-time event -> change it
          • treat it as a missing value and use linear imputation
    • Time series = Trend + Seasonality + residuals
    • how to figure out trend
      • moving avg with window size 3
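A minimal pandas sketch of extracting the trend with a window-3 moving average (the monthly numbers are made up):

```python
import pandas as pd

# Hypothetical monthly sales series (toy numbers)
sales = pd.Series(
    [100, 120, 130, 110, 125, 140, 115, 130, 150, 120, 135, 160],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

trend = sales.rolling(window=3, center=True).mean()  # moving average, window size 3
remainder = sales - trend                            # seasonality + residual left over
print(trend)
```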

  • Mobiplus case study

  • Missing Values
  • Anomalies
  • Breaking down a time series
  • Additive and Multiplicative Seasonality
    • Additive Time Series = Trend + Seasonality + residual
    • Multiplicative Time Series = Trend * Seasonality * residual
  • Decomposition from Scratch
  • Generalizing forecast methodology
  • Simple forecasting methods
    • Mean/Median
    • Naive Forecasting
    • Seasonal forecast
    • Drift method
  • Simple exponential smoothing
  • Smoothing methods for forecasting
    • Moving avg forecasting

    • Simple exponential smoothing
    • Double exponential smoothing (Holt's method)
    • Triple exponential smoothing (Holt-Winters method)
  • concept of stationarity
  • Stationarity

  • Auto correlation and partial auto-correlation function
  • Auto regression model
  • Moving avg model
  • ARMA model
  • ARIMA model
  • SARIMA model

  • ARIMA model family
    • AR
    • MA
    • ARMA
    • ARIMA
  • ARIMA model
  • Ranged Estimates - Confidence interval
  • Change Point
  • Exogeneous Variables

  • Forecasting using Lin Reg
  • Facebook prophet
  • Walmart case study
  • Apriori Algorithm
  • Market basket analysis
    • Identify item sets that occur together very frequently

  • Association Rules
  • Association Metrics
  • Introduction and formulation
  • collaborative filtering
  • Content based filtering
  • Recommendation as Regression/Classification


  • Matrix factorization
  • Principal component analysis
  • Singular value decomposition




Business Case: AdEase review

Git & GitHub: Setup for MLOps

Building Cars24 ML tool using Streamlit

Develop Web APIs using Flask

Containerization - Docker & DockerHub

Deploying APIs on AWS using ECS

GitHub Actions - Setting up CI pipelines

GitHub Actions - Setting up CD pipelines

Business Case:Zee Review

Experiment Tracking & Data Management using MLFlow

ML System Design - 1

ML System Design - 2

Building ML pipelines with AWS Sagemaker

Processing large scale data using Apache Spark

  • Use Cases
    • Auto Suggest -> (Search engine -> predict suffix )
    • AD on search result page
    • Compressed representation
    • Image segmentation (computer vision)
  • Why NN
    • Automate feature selection
    • Perform really well with high dimensional data
  • NN
    • Input -> process -> output
    • dendrites -> cell body -> axon
    • thicker dendrites are more important
  • Task: Whether to touch an object or not
    • Object dimension, Probable temperature, known or unknown object -> Neuron(computation) -> touch or not
  • Summarizing Biological Neuron
    • Takes Input -> perform some processing -> fires output to other neurons
    • inputs are called features 
    • Every input is associated with weight
  • It is said that Neural networks are logistic regression on steroids
  • Neuron = Linear combination + Activation
    • A logistic regression unit is a single neuron (linear + sigmoid) - see the sketch below
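A tiny NumPy sketch of a single neuron as linear combination + sigmoid activation, i.e., a logistic regression unit (weights and inputs are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """One neuron: weighted sum of inputs plus bias, passed through an activation."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, 1.5, -0.3])  # inputs/features
w = np.array([0.8, -0.4, 0.2])  # weights (one per input/dendrite)
b = 0.1                         # bias
print(neuron(x, w, b))          # output lies in (0, 1)
```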


N-Layer Neural Network - 1
N-Layer Neural Network - 2
N-Layer BackPropagation
Tensorflow and Keras - 1
Tensorflow and Keras - 2
Optimizers for NNs
Hyper Parameter Tuning for NNs
Autoencoders
Practical aspects of designing MLPs and debugging
Model interpretability: LIME
DSML Module Test: Neural Networks
NN : Model interpretability: LIME Contd.
Neural Networks - No Class Day

Introduction to Computer Vision(CNN)
Revisiting CNN: Dealing with Overfitting
CNN under the hood
Introduction to Transfer Learning
Image similarity: Understanding Embeddings
CNN for medical diagnosis
Object Localisation and Detection -1
Object Localisation and Detection -2
Object Segmentation
Business Case: Porter review
Object Segmentation Contd.
Object Segmentation(contd) and Siamese Net
Generative Models & GANs Introduction

  • NLP -> Making machines understand Text
  • Areas in which NLP can be used
    • Information Retrieval (IR) / NLP -> Language models -> LLMs
    • Language Modeling : Predicting next word in sentence 
      • Spelling correction
      • Keyword based information retrieval
      • Topic modeling
      • Text classification
      • Information extraction
      • Closed domain conversational agent
      • Text Summarization
        • inshorts
        • Creating automated abstracts
      • Question Answering
      • machine translation
      • Open domain Conversational Agent
  • Tokenization -> Sentences to Words/Tokens 
    • Split
    • regular expressions
    • word_tokenize from NLTK
  • Reduce the dataset size by removing unwanted tokens
    • remove hashtags and hyperlinks before tokenization
    • remove stopwords and punctuations
  • Normalize
    • Stemming vs lemmatization

  • Embeddings -> Extract meaningful features from tokens
  • Vocabulary -> set of unique words
  • Text => Sentences -> words

  • Case Study
    • handle text data
    • Logistic regression model

  • Confusion matrix

  • TF-IDF
  • n-gram
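A small sklearn sketch of TF-IDF with n-grams on toy documents (ngram_range=(1, 2) means unigrams + bigrams; TF-IDF down-weights very common words):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the movie was great", "the movie was terrible", "great acting, great plot"]

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # vocabulary of unigrams and bigrams
print(X.toarray().round(2))         # TF-IDF weights per document
```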

  • Why do we capture context
    • To extract semantic and syntactic info
  • co-occurrence matrix









NLP : BERT

