Python
Data is the new oil -> every time you are on net, you generate data #datadrivendecisions
Raw data -> clean -> manipulate and make sense of data -> model (predictions)
Why to learn
Beginner - Data analyst
- Tableau & Excel
- SQL
- Python
Intermediate -> Data analysis and Visualization
- Python Libraries (NumPy, Pandas, Matplotlib)
- Probability & stats
- Product Analytics(case studies)
Advanced (Data scientist)
- Foundations of ML & DL (Adv python, Math for ML & DL)
- Machine Learning
- Deep Learning
- ML Ops
- ADV DSA
- Puzzle: 3 boxes labelled Apples, Oranges, and Mixed, all with incorrect labels. Pick any box, draw a single fruit from it, and correct the labels on every box.
- Clue: Every label on every box is incorrect
- Pick a fruit from the box labelled A+O
- Say you pick an orange. Correct the label on that box to O.
- The boxes left are labelled A & O, but their correct labels can only be A or A+O
- The box labelled A can't be A. So it is A+O
- So, the box labelled O is A
- Python
- idle editor
- cmd prompt
- IDE
- visual studio
- Jupyter
- colab -> https://colab.research.google.com/?utm_source=scs-index
- v2=input("some text") -> always gives strings
- ** -> exponentiation
- // -> floor division
- math.ceil(math.pi*A*A) #needs import math
- print("a","b","c",sep="*")
print("a",end=" ") - case=int(input());
t=1;
while t<case:
number = int(input());
int i=1;
while i<=10:
print(i * number, end=" ");
i++;
t++;
- a=[1,2,3,4,5,6,7]
for i in a: - for i in [1,2,3,4,5,6,7]:
- for i in range(1,8):
- range(1,8)
- range(8) -> 0 to 7
- range(start, end, jump)
- pass #code for future use. Does nothing
- none
- continue
- break
- counter=0
while True:
a=input("")
counter+=1
if a=='q':
break;
print(counter)
T = int(input())
for i in range(T):
A = int(input())
B = int(input())
lcm = max(A, B)
while True:
    if (lcm % A == 0) and (lcm % B == 0):
        break
    lcm += max(A, B)   #lcm += lcm only doubles the value and can skip the LCM
print(lcm)
- print("* "*6)
- chr(65) means A
Take an integer N as input, print the corresponding pattern for N.
For example if N = 5 then pattern will be like:
____* ___** __*** _**** *****
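- One possible solution sketch (assuming the underscores in the example stand for spaces):
N = int(input())
for i in range(1, N + 1):
    print(" " * (N - i) + "*" * i)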
- help(range)
- doc strings
- def myfunction():
"""
documentation of function
"""
- Lambda functions(Anonymous functions)
- One line function
- lambda input: output
- v = lambda x : x+10
- v(2)
- (lambda x:x+90)(100)
- lambda x : x+10 if x>8 else x-20
- Data Structures:
- The way to structure the data.
- Store is properly
- Process it
- Retrieve the same
- list.append
- list.insert(5,55)
- list.extend([4,5,6]) //append multiple values
- list+[3,4]
N=int(input())
N1 = input().split()
lst=[]
for i in N1:
    lst.append(int(i))
X = int(input())
Y = int(input())
lst.insert(X-1,Y)
for i in lst:
    print(i,end=" ")
#avoid naming a variable "list"; a return statement only belongs inside a function
- [start:end:step] //
- runs[0:3:1]
- runs[2,5]
- oddmatches[0:len(list):2] => oddmatches[:len(list):2] => oddmatches[::2]
- a=[1,2,3,4,5]
last=a[-1]
rest=a[:len(a)-1] #or a[:-1]
[last]+rest - last=a.pop() #removes and returns the last element
a.pop(index) - a.remove(val) #removes the first occurrence of the value
- a.index(val)
- a.count(val)
- reverse list
- a[::-1]
- a[-1::-1]
- a.reverse() #reverses the list in same variable
- reversed(a) #for strings, tuple
- List 2D
- [[1,2,3],[4,5,6],[7,8,9]]
- rows=3
cols=3
for i in range(rows): - for j in range(cols):
- print a[i][j]
- for i in random:
for j in i:
print j - outerlist=[]
for i in range(3):
innerlist=[]
for j in range(3):
a=int(input())
innerlist.append(a)
outerlist.append(innerlist)
outerlist
- 80% of data is in string format
- ' " """ are same
- """ are used in multi line comments
- ASCII -> American standard code for information interchange
- "1"*6
- F-Strings and String Formatting
- l=3, b=2, a=l*b
print("lenght = {}, breadth ={}, area={}".format(l,b,a))
print(f"lenght = {l}, breadth ={b}, area={a}") - ord("a") // to print ASCII value of a char
- chr(97)
- a="adsf"
for i in a:
print i - Reverse a string
- a=input()
a[::-1] - list(reversed(a))
- Palindrome or not
if a[::-1].lower() == a.lower() : - String of comma separated values, convert to string of individual values
- for i in range(len(strval)):
a=strval[i];
res=a #strval[i] is already a character; chr() is not needed here
str1=str1+res
return(str1) - a=input()
a.split(",") - for i in a.split("-"):
print(chr(int(i)) - a="adf"
j=list(reversed(a)) //gives list
"*".join(j) - join -> list to str
split -> str to list
- .find() -> find substring location in a string
- "it is a dancing doll".find("dancing") -> 8
- -1 if not found
- .index() -> location of any specific element in a list
- .replace("a","b")
- .count("a")
- .isdigit()
- .isalpha() #True only if all characters are alphabetic
- .isupper() & .islower()
- in operator #returns true if string is present inside other string
- "str" in "Thias asdf asdf"
- ReadOnly Lists
- t=(4,5,6,0)
- t=(4,)
- t[2]
- Packing and Unpacking
- when data type is not defined, it will be packed as tuple
- a=1,3,54 -> is a tuple
- unpacking
- a,b,c=(1,4,56)
- a= [(2,'asdf'),(3,'asdf'),(4,'asdf'),(5,'asdf')]
for i in a:
print i #prints tuple - for i,j in a
print i #prints id - list(tuple) #convert tuple to list
- Unique data
- similar to lists, tuples
- {1,2,3}
- No order & hence no indexing
- {} is dictionary
- b=set() #is empty set
- {1,2,3}
- .add(5)
- .remove(2)
- .pop() # remove any one element
- .update({4,5,6}) #append/update
- list("Venket") -> ["V","e","n","k","e","t"]
- set("Venket")-> {"V","e","n","k","t"}
- Symmetric Difference
- A^B => (A-B) + (B-A)
- SETS Operations
- voda={"Abc","Def","Ehj"}
air={"Abc","Def","Ehj"} - voda.intersection(air)
{"Abc","Def","Ehj"} - voda.union(air)
- voda.difference(air)
- voda.symmetric_difference(air)
- Sets can only store immutable objects
- s={(1,2,3,4)}
- Data structure is to organize data so that we can retrieve data and store data
- Word: Meaning OR {key:value}
- {a: "meaning of a"}
- a={"first":"first val","second":"second value","third":"third value"}
- Dictionaries
- Not ordered
- duplicate keys not allowed
- not indexable by position (access is by key)
- Values can be any data structure & duplicates can exist
- a["first"]="val2"
- a.update(b) #add to dictionaries
- res=a.get("z","not found")
print(res) - a.pop("key")
- Iterating
- for i in a:
print(i)
print(a[i]) - a.keys()
- a.values()
- a.items() #gives tuples
- for k,v in a.items():
print(k,v) - How to check if key exists
- "ads" in a.keys()
- Allowed datatypes as a key in dictionary
- any data type that is immutable (hashable)
- tuple
- strings
- boolean
- int
- Take a String as input. Create a dictionary using the following criteria
- There will be one key for each unique character
- key will be the character and value will be its count of occurrence
- Get all unique chars
- count that using count method of string
- str1="adfasdfasdf"
d={}
for i in set(str1):
d[i]=str1.count(i);
- It is like a Utility in Python
- List Comprehension
- [<output> for loop ]
- [i for i in range(1,100) ]
- [i for i in range(1,100) if i%2==0]
- ["Even" if i%2==0 else "Odd" for i in range(1,100)]
- students=["one","two","three"]
marks=[2,3,4]
{students[i]:marks[i] for i in range(len(students))} - Memory
- Stacks -> store variable and refer to address of actual value
- Heaps -> store actual values
- a=5
print(id(a)) - Garbage Collection
- Mutable vs Immutable
- shallow copy
- a=b #same address
- Deep Copy
- b=a[:] #copies the top-level elements; use copy.deepcopy for nested objects (see the sketch below)
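- A small sketch contrasting shallow vs deep copy (a nested list makes the difference visible; uses the standard copy module):
import copy
a = [[1, 2], [3, 4]]
b = a                   # same object, same address
c = a[:]                # new outer list, but the inner lists are still shared
d = copy.deepcopy(a)    # fully independent copy
a[0][0] = 99
print(b[0][0], c[0][0], d[0][0])   # 99 99 1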
- DAV -> Data Analysis and Visualization
- NumPy -> Numerical Python
- Pandas ->
- Matplotlib/seaborn for visualization
- NPS -> Net Promoter Score
- 1-6 is sad (Detractors)
- 7-8 is neutral (Passive)
- 9-10 happy (Promoter)
- % of Promoters - % of Detractors
- If NPS is > 70% -> Great performance
- Range of NPS-> -100 to 100
- Huge volume of data
- adv features
- pypi.org
- !pip install numpy
- import numpy as np
- Why Numpy
- NumPy is like a List but built using C code
- List data is not contiguous whereas NumPy data is contiguous. Hence, NumPy is faster.
- It takes less space as the memory is contiguous
- Lists can contain heterogeneous data; NumPy arrays are homogeneous
- It does mathematical operations in one line instead of multiple lines
- a=[1,2,3]
np.array(a) - [i**2 for i in range(1,100) ]
- a=[1,2,3]
arr=np.array(a)
arr**2 - %timeit [i**2 for i in range(1,100) ]
- %timeit arr**2
- arr.ndim
- a=array([[1,2,3], [1,2,3]])
- np.arange(10)
- np.arange(1, 10, 0.5)
- NumPy stores everything as a single data type
- np.array([1,2,3],dtype='float')
- a.dtype
- np.array(a)[3:5]
- m1=np.array([1,2,3,4,5,6,7,8,9])
new=m1>6 #returns true/false for each element in an array
m1[new] #returns values satisfied - m1[[2,5]] #multiple index in a list
- m1[[True,False,True,False,...]] #boolean mask (same length as the array); returns values where True
- m1 = np.arange(1,20)
filter = m1%2 ==0 #true and false
m1[filter] - score = np.loadtxt('survey.txt',dtype='int')
score[:5] #first 5 elements
score.shape #count of elements
len(score) #count of elements
score.min()
score.max()
detractors = score[score <= 6]
len(detractors)
detractors.shape[0] #count of elements
promoters = score[score >= 9] #promoters are scores of 9-10
len(promoters)
total=len(score)
perc_dect=(len(detractors)/total)*100
perc_promo=(len(promoters)/total)*100
perc_promo - perc_dect
- Case study on fitbit
- date, step_count, mood, calories_burned, hours_of_sleep, active
- data = np.loadtxt('survey.txt',dtype='str')
data.shape #rows & cols
data[:5] #first 5 rows - a=np.array(range(16))
a.shape #elements
type(a.shape) #tuple - Reshape
- a=np.array(range(16))
a.reshape(4,4) #rows, cols - a=np.array(range(10,91,10))
a.shape
a.reshape(2,-1) #arrange with 2 rows - a[1:3,2:4] #no errors even if index exceeds size
- a[: , 1] #gives 1D
- a[: , 1:2] #gives in 2 D
- a1=np.arange(10,91,10)
a1 #array([10, 20, 30, 40, 50, 60, 70, 80, 90])
a1[[2,3]] #gives array([30,40]) - fancy indexing, result is still 1D
- a=a1.reshape(-1,3)
- a[[0],[0]]
- a[[0,1,2],[0,1,2]] #diagonal elements
- Transpose
- data_t=data.T
- date,step_count,mood, calories_burned, hours_of_sleep,activity_status=data.T
- step_count.astype('int')
- calories_burned = np.array(calories_burned, dtype='int')
- np.unique(mood)
- Filter
- m1=np.arange(12)
- m1>6
- m1=np.arange(12).reshape(3,4)
- m1>6
- m1[m1>6]
- filter = mood=='Happy'
- step_count[filter] #steps on happy day
- Aggregate functions
- a=np.arange(1,5)
np.sum(a)
np.mean(a)
np.min(a)
np.count_nonzero(a) - a=np.arange(12).reshape(3,4)
np.sum(a) - axis 0 -> column
axis 1 -> rows - np.max(a, axis=0)
- a=np.array([1,2,3,4])
b=np.array([2,5,4,1])
a>b - np.any(a>b) #checks if there is any True
np.all(a>b) - arr = np.array([1,2,3,4,-4,8,5,-2])
np.where(arr<0,"wrong value","correct value") - step_cnt_happy_or_neutral = step_count[(mood=='Neutral') | (mood=='Happy')]
len(step_cnt_happy_or_neutral)
np.mean(step_cnt_happy_or_neutral.astype('int')) - step_count.astype('int')>4000
- mood[step_count.astype('int')>4000]
np.unique(mood[step_count.astype('int')>4000], return_counts=True)
- import numpy as np
data=np.loadtxt("../fit.txt", dtype='str')
date,step_count,mood, calories_burned, hours_of_sleep,activity_status=data.T - max steps from dataset
- step_count = np.array(step_count, dtype = 'int')
np.max(step_count) - Index of max count & get date for the max count record
- step_count.argmax() #gives the index of the record with max steps
- date[step_count.argmax()]
- np.ones
- np.ones((5,2)) #matrix with 1s
- np.ones(4, dtype='int')
- Multiplication
- a=np.array([1,2,3,4])
b=np.array([1,2,3,4])
a+b #performs operation element by element
a*b #performs operation element by element - Matrix Multiplication
- np.dot(a,b)
- np.matmul(a,b) #cannot work with a matrix and a scalar value.
- a@c
- np.dot(a,3)
- Argmax
- np.argmin([2,6,7,348,3,1,4]) -> returns index of min value
- np.argmin(b, axis=0) -> min in columns
- def vdesu(x):
    if x % 2 == 0:
        x += 2
    else:
        x += 3
    return x
a=np.arange(1,13)
v1=np.vectorize(vdesu)
v1(a) - import math
f1=np.vectorize(math.log)
f1(np.array([1,2])) - https://colab.research.google.com/drive/1leiKIvtZg-Lc7EMBceyCZ4NgnnbhVCrB?usp=sharing
- Array Broadcast
- a=np.arange(0,40,10)
np.tile(a, (3, 1)) #repeat the row 3 times -> shape (3, 4) - b=np.arange(0,4)
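- Completing the broadcast example above (a sketch: a (3,4) matrix plus a (4,) vector adds element-wise without a loop):
import numpy as np
a = np.tile(np.arange(0, 40, 10), (3, 1))   # shape (3, 4)
b = np.arange(0, 4)                         # shape (4,)
print(a + b)                                # b is broadcast across every row of a
print(a + 5)                                # a scalar broadcasts to every element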
- Image App
- import matplotlib.pyplot as plt
kaju=plt.imread("dog.jpeg")
plt.imshow(kaju)
kaju.shape #(rows, columns, channels)
kaju[1,1,1] #row, column, channel/color
plt.imshow(kaju[::-1,:,:]) #reverse picture
plt.imshow(kaju[:,::-1,:]) #mirror image - zoom the face
- plt.imshow(kaju[20:250,20:450,:]) # crop
- plt.imshow(kaju[::20,::20,:]) #blurred image by jumping pixels
- Contrast
- np.where(kaju>150,255,0)
- plt.imshow(np.array([[[0,255,0,]]])) //R G B for each pixel with 0 as darkest and 255 is lightest
- inverse color
plt.imshow(kaju[:,:,::-1])
- 1D array is called Vector
2D array is called Matrix
more than 2D are called as Tensors - B = np.arange(24).reshape(2,3,4) //2 matrices of 3*4 rows and cols
- import matplotlib.pyplot as plt
img=np.array(plt.imread('fruits.png'))
plt.imshow(img) - Shallow Copy
- Only copies header with change in shape.
- b=a
- c=a.view()
- Deep Copy
- c=a.copy() #doesn't work when the dtype is Object
- Math functions by default creates deep copy
- np.shares_memory(a,c)
- Deep copy for Object data type
- import copy
copy = copy.deepcopy(arr) - https://colab.research.google.com/drive/11PCLFO4MR_nKeM4QqFbIslq7sqJedmsX?usp=sharing
- Splitting
- x=np.arange(9)
np.split(x,3)
np.split(x,[4,6]) #split on index
np.split(x,[2,5,8]) #split on index - np.hsplit(x,2) #horizontal split - splits on column to get multiple arrays on 2D matrices
- np.vsplit(x,2) #vertical split - splits on rows
- VStack
- a=np.arange(10)
b=np.arange(11,21)
c=np.vstack([a,b])
d=np.vstack([a,c]) - a=np.arange(10)
b=np.arange(11,21)
c=np.hstack([a,b]) - z=np.array([[2,4]])
zz=np.concatenate([z,z], axis=0) #vstack
flat=np.concatenate([z,z], axis=None) - arr = np.arange(6)
a = np.expand_dims(arr, axis=0) #used to increase the dim. axis denotes where it should have 1 dim
a.shape #returns (1,6)
a = np.expand_dims(arr, axis=1)
a.shape #return (6,1) - arr=np.arange(6)
arr[np.newaxis,:] #new axis at first index
arr[:,np.newaxis] #new axis on column - arr=np.arange(9).reshape(1,1,9)
1*1*9
np.squeeze(arr) #removes all dimensions of size 1 and returns the original arr
arr=np.arange(10).reshape(2,1,5)
np.squeeze(arr) #returns 2*5 matrix
- Works on top of NumPy
- Can load any datatype
- Same as excel or csv
- can write sqls on it
- pip install pandas
import numpy as np
import pandas as pd
df=pd.read_csv("mckinsay.csv") - type(df) -> pandas.core.frame.DataFrame
- when csv has one column it is called Series & multiple columns is DataFrame
- df.info() #complete info about the DataFrame
- String is Object in Pandas
- df.head(5) #first 5 rows
- df.tail(10)
- df.shape #gives (rows, cols)
- df.columns #to get column names
- df.keys() #to get column names
- df['country']
- df[['country','name']]
- df['country'].unique()
- df['country'].nunique()
- df['country'].value_counts() #similar to groupby count
- df.rename({"country":"Country","population":"Population"}, axis=1) #columns
- df.drop('country',axis=1)
- df['new year'] = df['year']+2
- Create DataFrame from Scratch
- pd.DataFrame(ll, columns=['first','second'..]) #pass the lists of lists matrix
- pd.DataFrame({'country':['Afg','Afg','Afg'],
'year':[1900, 1950, 1980]
}) - https://drive.google.com/file/d/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_/view?usp=sharing
- Working with Rows
- df.index = [i for i in range(1, df.shape[0]+1)] #change the default/implicit index to an explicit one
- Loc/ILoc
- ILoc - index/implicit based location
- df.loc[2] #row with explicit index label 2
df.iloc[2] #row at implicit position 2 (the third row) - df.iloc[1:3,:]
- import numpy as np
import pandas as pd
df=pd.read_csv('mckinsey.csv') - Set column as index
- temp = df.set_index("country")
temp.iloc[0:5,1:3]
temp.reset_index(inplace=True) - Add New row
- new_row={'col':val,'col2':val2....} #using dictionary
df.append(new_row, ignore_index=True) - df.loc[len(df)] = ['value1','valu2'..] #add a row at the end
- Drop rows
- df.drop(1) #explict drop
- df.duplicated()
- df.drop_duplicates()
- df.drop_duplicates(keep='first/last/False')
- df.drop_duplicates(subset='<column name>')
- Mathematical
- df['life_exp'].mean()/.sum()
- Sort
- df.sort_values("life_exp",ascending=False)
- df.sort_values(['year','life_exp'],ascending=[False,True])
- Join
- users = pd.DataFrame({"userid":[1,2,3], "name":["a","b","c"]})
- msgs= pd.DataFrame({"userid":[1,1,2,3], "msg":["hmm","acha","ok","asdf"]})
- pd.concat([users,msgs],axis=1)
- users.merge(msgs,on='userid') #inner join
users.merge(msgs,on='userid',how='outer') #outer join - users.merge(msgs, left_on='id', right_on='userid')
- IMDB
- movies = pd.read_csv('movies.csv',index_col=0) #remove unnamed col
movies.shape - directors = pd.read_csv('directors.csv',index_col=0)
directors.shape - movies['director_id'].value_counts() #directors making movies
movies['director_id'].nunique() - movies['director_id'].isin(directors['id'])
np.all(movies['director_id'].isin(directors['id'])) - data=movies.merge(directors,how='left', left_on='director_id', right_on='id')
data.drop(['director_id','id_y'],axis=1,inplace=True) - data.info()
- data.describe() # gives count, mean, std, min, 25%, 50%, 75%. max for each int columns
- data.describe(include=object) #gives values applicable to Object type columns
- data['revenue'] = (data['revenue']/1000000).round(2)
- data[data['vote_average']>7]
- data.loc[data['vote_average']>7,['title','director_name']]
- def encode(text):
    if text=='Male':
        return 0
    else:
        return 1 - df.iloc[0]['gender']
encode(df.iloc[0]['gender'])
data['gender'].apply(encode)
df['gender_mapping'] = data['gender'].apply(encode) - How to find sum of revenue and budget per movie
- data[['revenue','budget']].apply(np.sum, axis=1)
- How can I find profit per move(revenue - budget)?
- def prof(x):
return x['revenue'] - x['budget'] - data['profit'] = data[['revenue','budget']].apply(prof,axis=1)
data - Group By
- data.loc[data['director_name']=='Raja Mouli','title'].count()
- data['director_name'].nunique()
data['director_name'].value_counts() - data.groupby('director_name').ngroups
data.groupby('director_name').groups
data.groupby('director_name').get_group('Venkat')
data.groupby('director_name').get_group('Venkat')['title']
data.groupby('director_name').get_group('Venkat')['title'].count() - How can we find multiple aggregations of any feature
- data.groupby('director_name')['year'].aggregate(['min','max'])
- Highest budget movie for every director
- data.groupby('director_name')['budget'].max()
- Filter out director names with max budget >100Million
- data_dir_budget = data.groupby('director_name')['budget'].max().reset_index() #to get a normal data frame
names = data_dir_budget.loc[data_dir_budget["budget"] >= 100000000, "director_name"] #100 Million
data['director_name'].isin(names)
data.loc[data['director_name'].isin(names)] - Lambda Function
- x = lambda a : a+10
x = lambda a,b : a+b
x(2,6) - which director is getting max vote_average
- data.groupby('director_name')['vote_average'].max().sort_values(ascending=False)
- data.groupby('director_name').filter(lambda x:x['vote_average'].max()>=8.3)
- Filter Risky Movies
- def func(x):
x['risky'] = x['budget'] - x['revenue'].mean() >=0
return x
data_risky = data.groupby('director_name').apply(func)
data_risky.loc[data_risky['risky']] - lambda a,b: <True> if a>b else <False>
lambda a,b: "Dancing" if a>b else "Cooking" - Filter only the ages that are greater than 18
- ages = [13, 12,17, 19, 56,7]
ages[lambda a: True if a>18 else False] #does not work on a plain list - motivation for filter()
filter(lambda a: a>18,ages) #filter helps pass list to lambda
list(filter(lambda a: a>18,ages)) - Square of every number in list
- a = [13, 12,17, 19, 56,7]
list(map(lambda a: a**2,a))
- data.groupby('director_name')['title'].count().sort_values(ascending=False)
- data_agg = data.groupby('director_name')[['year','title']].aggregate({"year":['min','max'],'title':'count'})
data_agg.columns
[i for i in data_agg.columns]
["_".join(i) for i in data_agg.columns]
data_agg.columns = ["_".join(i) for i in data_agg.columns] - data_agg.reset_index()
- data_agg['years_active'] = data_agg['year_max'] - data_agg['year_min']
data_agg['movies_per_year'] = data_agg['title_count']/data_agg['years_active']
data_agg
- data = pd.read_csv('pfizer_1.csv')
- MELT #convert few columns to rows
- pd.melt(data,id_vars=['Data','Drug_Name','Parameter'])
- pd.melt(data,id_vars=['Data','Drug_Name','Parameter'], var_name='time', value_name='reading')
- PIVOT #to reshape the data
- import numpy as np
df = pd.read_csv('weather.csv')
df.pivot(index = 'city', columns='date') - df.pivot(index = 'city', columns='date', values='humidity') #to get humidity alone
- df.pivot(index = 'date', columns='city')
- Pivot Table
- df.pivot_table(index='city', columns='date', aggfunc='mean')
- pd.pivot_table(data_tidy, index='Drug_Name', columns='Data', values=['Temperature'], aggfunc=np.mean)
- Handling Missing Values
- type(None) => NoneType #used for non-number entries. It is object data type
- type(np.nan) => float #used for numbers
- pd.Series([1,2,np.nan, None]) => None will be converted to nan
- data.isna() # to check null values
- data.isnull() # to check null values
- data.isna().sum() #sum of null values in the data set
- data.dropna() #
- data['2.30'].fillna(0)
- data['2.30'].fillna(data['2.30'].mean())
- data_melt = pd.melt(data, id_vars = ['Date','Drug_name','Parameter'], var_name = 'time', value_name = 'reading')
data_tidy = data_melt.pivot(index=['Date','time','Drug_name'], columns = 'Parameter', values='reading')
data_tidy = data_tidy.reset_index() - def temp_mean(x):
x['Temperature_avg'] = x['Temperature'].mean()
return x - data_tidy.groupby(['Drug_Name']).apply(temp_mean)
- data_tidy.groupby(['Drug_Name'])['Temperature'].mean()
- data_tidy.isnull().sum()
- Display the rows where temp is missing
- data_tidy['Temperature'].isnull()
data_tidy[data_tidy['Temperature'].isnull()]
data_tidy['Temperature'].fillna(data_tidy['Temperature_avg']) - Pandas Cut
- tem_points = [5,20,35,50,60]
temp_labels = ['low','medium','high','very high']
data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'],bins=tem_points, labels=temp_labels) - String Function and motivation for datetime
- data_tidy['Drug_Name'].str.replace("hydrochloride","asdf")
data_tidy[data_tidy['Drug_Name'].str.contains("hydrochloride")] - pd.to_datetime(data_tidy['timestamp'])
- data_tidy['timestamp'][0].year
- data_tidy['timestamp'].dt.year
- data_tidy['timestamp'][0].strftime('%y')
- Exploratory -> Python is good
- Understanding data/what are characteristics of data.
- Explanatory -> Tableau is good
- Story telling for others
- Why data visualization in Python
- Quick Analysis
- unstructured data(tableau, excel, PowerBI requires structured data)
- Easy and wide manipulation options
- Science behind data visualization
- Anatomy of chart
- how to use the right plot
- Art in data visualization
- Color, scale, labels
- highlighting something
- Libraries
- matplotlib
- seaborn(wrapper on matplotlib to make simpler and beautiful)
- !pip install matplotlib
!pip install seaborn
import seaborn as sns
import matplotlib.pyplot as plt - Terminologies
- Columns are called as Features/Variables
- Rows are records/data points/samples
- Data Types
- Numerical
- Categorical
- Ordinal -> has order like low, medium, high which have inherent order
- Non-ordinal -> Male, female where both are same and no order
- Choose right plot
- How many variables/features are involved
- Variables type -> Numerical/Categorical
- Types of variables
- 1 variable -> univariate
- 2 variable -> bi variate
- 3 or more -> multi variate
- Univariate
- Numerical
- Categorical
- Bivariate
- Numerical - Numerical
- Categorical - Categorical
- Numerical - Categorical
- Multivariate
- Num - Num - Num
- C-C-C
- N-N-C
- N-C-C
- Anatomy of matplotlib
- Figure -> entire visualization
- suptitle -> Title of entire visualization
- Axes -> Charts1..n
- Title
- Major tick, Minor tick
- Axis
- plot
- xlabel, ylabel
- legend
- x_val=[0,1,2]
y_val=[3,5,9]
plt.plot(x_val,y_val) - data=pd.read_csv('final_vg.csv')
data.head() - Univariate
- Categorical
- Count -> bar #Ideal categories should be around 5
cat_counts = data['genre'].value_counts()
x_bar = cat_counts.index
y_bar = cat_counts
plt.figure(figsize=(12,8))
plt.bar(x_bar[:5], y_bar[:5], color="red",width=0.2)
plt.xticks(rotation=90)
plt.xlabel("genre")
plt.show() #ensures only final chart is shown - #seaborn
sns.countplot(x="Genre",data=data, order=data['Genre'].value_counts().index, color='blue')
plt.xticks(rotation=90)
plt.show() - %age -> pie #
- plt.pie(y_bar, labels=x_bar,startangle=90,explode=(0.2,0,0,0,0,0,0,0,0,0,0))
plt.show() - Numerical
- how is data distributed
- outliers
- is it skewed
- special numbers -> min, max, range..
- histogram -> divide data into bins and depict the frequency
- Histogram - Popularity of video games over the years. which year has max popularity
- plt.hist(data['year'])
plt.show() - count, bins, _ = plt.hist(data['year'])
count
bins - KDE === Kernel Density Estimate Plot
- sns.kdeplot(data['Year'])
- Box Plot
- outlier
- lower whisker #Q1 - 1.5*IQR (or the min score, whichever is higher)
- Q1/lower quartile #25th percentile
- Q2 #50th percentile
- Q3/upper quartile #75th percentile
- upper whisker #Q3 + 1.5*IQR (or the max score, whichever is lower)
- outlier
- Inter quartile range - Q3-Q1
- plt.figure(figsize=(12,5))
sns.boxplot(y=data["Global_Sales"])
- Revision
- Univariate
- Categorical - Bar, Pie
- Numerical - Hist, KDE, Box Plot
- Bivariate Analysis
- Numerical-Numerical(continuous - continuous)
- sales, year
- how does sales change over years
- how are features associated(correlation)
- Line Plot
- ih=data.loc[data['Name']=='Ice Hockey']
sns.lineplot(x='Year', y='Global_Sales', data=ih)
plt.grid() - rank, sales
- Line chart fails if you have too many points at same x-axis
- Scatter plot helps in understanding grouping & co-relation
- Categorical-Categorial
- publisher, platform
- Preferred platform for publisher
- distribution of publisher for top 3 platforms
- Distribution of one wrt other category
- Stacked bar # platform on x-axis & publisher on y-axis
- Dodged bar chart #platform on x-axis & publisher for each platform as bar
- top3_pub=data['Publisher'].value_counts().index[:3]
top3_gen=data['Genre'].value_counts().index[:3]
top3_plat=data['Platform'].value_counts().index[:3]
top3_data=data.loc[((data['Publisher'].isin(top3_pub) & data['Genre'].isin(top3_gen)) & data['Platform'].isin(top3_plat))] - Compare the top3 platforms these publishers use
- plt.figure(figsize=(12,8))
sns.countplot(x='Publisher', hue='Platform', data=top3_data, dodge=True) #dodged bars; a stacked-bar sketch follows below - stacked bar vs dodged
- If total is of more importance - Use Stacked Bar chart
- If comparison is of more importance - use dodge bar chart
- For 2 categorical variables - best representation is dodge chart
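- A stacked-bar sketch for the same two categoricals (uses the top3_data frame built above; crosstab counts plotted with pandas):
ct = pd.crosstab(top3_data['Publisher'], top3_data['Platform'])
ct.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.show()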
- Categorical - Numerical
- What qns can be asked
- What is avg/sum for every publisher
- Sales distribution for top3 publisher
- Multi box Plots
- sns.boxplot(x='Publisher', y='Global_Sales', data=top3_data)
- Bar chart
- sns.barplot(x='Publisher', y='Global_Sales', data=top3_data, estimator=np.mean)
- Revision
- subplots
- plt.figure(figsize=(12,8))
plt.subplot(2,3,1)
sns.barplot(x='Publisher', y='Global_Sales', data=top3_data, estimator=np.mean)
plt.subplot(2,3,3)
sns.barplot(x='Publisher', y='Global_Sales', data=top3_data, estimator=np.mean) - fig, ax = plt.subplots(2,2,figsize=(12,8))
ax[0,0].scatter(top3_data['NA_Sales'],top3_data['EU_Sales']) - multivariate
- C-C-C
- Not a practical case and hence not covering
- N-N-N
- N-N
- scatter -> size
- Line -> can't be used
- C-C-N
- C-C -> Stacked/Dodged bar -> add hue/color
- C-N
- Box ->multi-box with color
- multi-bar
- C-N-N
- eg: compare NA sale with EU sale for each Genre
- scatter plot with hue to show color for Genre
- line plot with Hue
- Advance charts
- Joint plot
- Scatter + Histogram -> sns.jointplot(x='NA_Sales', y='EU_Sales', data=top3_data)
Scatter + Density -> sns.jointplot(x='NA_Sales', y='EU_Sales', data=top3_data, hue='Genre') - pair plot
- K numerical variables.
- Every single pair of them to be compared
- N1 with N2
N2 with N3
N1 with N3 - sns.pairplot(data=top3_data)
- Used for correlation analysis
- heatmap
- top3_data.corr()
sns.heatmap(top3_data.corr())
sns.heatmap(top3_data.corr(), cmap="coolwarm")
sns.heatmap(top3_data.corr(), cmap="coolwarm",annot=True)
- Where do we use probability in real world
- online shopping -> landing page
- Text recommendation using the dictionary created for each user
- chatgpt -> LLM models is probability behind it
- online/ott platforms
- Probability
- favorable outcomes/total outcomes
- Sample Space
- set of all possible outcomes
- Event
- subset of sample space
- Any collection of outcomes
- p(event)
- size of event/size of sample space
- Set Operations
- Intersection
- Union
- Complement
- opposite
- 80% like cappuccino, 40% espresso, 30% both
How many like cappuccino and not espresso - 50%
- Collectively Exhaustive - if the given events cover all possible outcomes
- Gather data (http://www.kaddle.com/shivamb/netflix-shows) -> netflix_titles.csv
- Data Preprocessing
- Cleaning data
- missing values
- Transformation
- Scaling
- encoding
- Exploratory data Analysis(EDA)
- Explore
- Visualize
- Modeling
- Predict
- Evaluate & Monitor
df.describe(include='object').T
df.isnull().sum()/len(df)*100 #missing data by %age
df['type'].value_counts(normalize=True)
df['type'].value_counts().plot(kind='pie',autopct="%.2f")
missing values
- Numerical -> replace with median, mean
- Categorical -> replace with Mode(for country, Genre, showtype), unknown, other, na
duration col
- Separate TV show and movies
- perform analysis on both the data sets separately
unnest
- stack()
- explode()
date col
- year, month, week, day of week
duplicity
inconsistencies
- Conditional probability
- Calculating probability after a specific condition is met eg: probability of students attending class on Friday
- auto complete/recommendation system -> How are you [doing, things, liking]
- p[Xb='you' | Xa='how are'] #the event after the pipe has already occurred; the first one is the event whose probability is being calculated
- It is known that - 60% people use Swiggy, 50% use Zomato, 20% use both. Among those who use Swiggy, what fraction also use Zomato
- p[swiggy]=0.6
p[zomato]=0.5
p[swiggy & Zomato]=0.2
p[zomato | swiggy] = 20/60 #p(z&s)/p(s) - It is known that - 30% of emails are spam, and 70% are not spam. The word "Purchase" occurs in 80% of spam mails. It is also occurs in 10% of non-spam emails. Overall, in what percentage of emails would we see the work "purchase"?
- p(spam)=0.3
p(Not spam)=0.7
p(purchase|spam)=0.8
p(purchase|non-spam)=0.1 - Tree method
- 24 spam emails have "purchase"
7 non-spam emails have "purchase"
31 mails have "purchase" and total 100 => hence 31% - A = spam & purchase
B = not spam & Purchase
p(purchase|spam) = P(p & S)/p(S)
p(purchase|notspam) = P(p & NS)/p(NS) - It is known that 5% of all LinkedIn users are premium users. 10% of premium users are actively seeking new job opportunities. Only 2% of non-premium users are actively seeking new job opportunities. Overall, what percentage of people are actively seeking new job opportunities.
- p[js|prem] = p[js & prem]/p[prem]
P(Premium users)=0.05
P(Non Premium users)=0.95
p(seeking job|Premium users)=0.1
p(seeking job|non-Premium users)=0.02 - Tree method
- Premium users & seeking job -> 0.5%
- Non-Premium users & seeking job -> 1.9%
- p[JS]=p[js & prem] + p[js & non-prem]
- p[JS]=p[js|prem] * p[prem] + p[js|non prem] * p[non-prem] #total law of probability
- Q: An e-commerce website shows two types of ads: Type A and Type B. 60% of the visitors see Type A ads, and 40% visitors see Type B ads. The click-through rate for Type A ads is 5% and the click-through rate for Type B ads is 3%. What is the overall click through rate?
- Ans: (3+1.2) = 4.2
- p[click]=p[click&A] + p[click&B] = (3+1.2) = 4.2
- p[click]=p[A] * p[click|A] + p[B] * p[click|B]
=0.6 * 0.05 + 0.4 * 0.03 = 0.042 - Conditional Probability Formula
- p[A|B] = p[A & B]/p[B]
- Total Probability
- p[C] = p[C|A] * p[A] + p[C|A'] * p[A'] #Total Probability
- Multiplication Rule
- p[A & B] = p[A|B] * p[B] #called multiplication rule
- Q: In an NPS survey, it is seen that 70% are promoters, 20% are neutral, 10% are detractors.
90% of promoters, 40% of neutral, and 5% of detractors recommend the product to a friend. What is the overall percentage of people who recommend the product. - 71.5
- A disease affects 10% of the population. Among those who have the disease, 80% get "Positive" test results. Among those who don't have the disease, 5% get "Positive" test result. Overall, what percentage of people tested "Positive"?
- p[D] = 10% = 0.1
p[+ve] = p[+ve|D] * p[D] + p[+ve|ND] * p[ND]
p[+ve|D] = 80%
p[+ve|ND] = 5% = 0.05 - p[+ve] = p[D] * p[+ve|D] + p[ND] * p[+ve|ND] = 0.1 * 0.8 + 0.9 * 0.05 = 0.08 + 0.045 = 0.125
- 12.5%
- what is P(+ve & disease) = p[+ve|D] * p[D] = 0.8 * 0.1 = 0.08
- what is P(+ve & no disease) = p[+ve|ND] * p[ND] = 0.05 * 0.9 = 0.045
- Given a test is +ve, what are the chances that I am actually infected?
- p[D|+ve]?
- Total +ve= 125 of 1000
+ve & D = 80
p[D | +ve] = 80/125 = 0.64 - p[D | +ve] = p[D & +ve] / p[+ve]
p[+ve | D] = p[D & +ve] / p[D]
p[D | +ve] = p[+ve | D] * p[D] / p[+ve]
p[+ve] = p[D] * p[+ve|D] + p[ND] * p[+ve|ND] = 0.1 * 0.8 + 0.9 * 0.05 = 0.08 + 0.045 = 0.125
p[D | +ve] = p[+ve | D] * p[D] / (p[D] * p[+ve|D] + p[ND] * p[+ve|ND]) - For a new cohort in DSML, we have the following information:
30% of the people know SQL
80% of the people know SQL and also Excel
40% of the people who do not know SQL, also know Excel - p[sql] = 0.3
p[no sql]=0.7
p[excel | sql] = 0.8
p[excel | no sql] = 0.4
p[excel] = .3 * .8 + .7 * .4 = .52 - Among those who know Excel, what percentage know sql
p[sql|excel] = p[excel | sql] * p[sql] / (p[excel | sql] * p[sql] + p[excel | no sql] * p[no sql]) = 0.8 *0.3 / (0.8 *0.3 + 0.7 * 0.4) = .24 / .52 = 46.15
p[sql & excel] = 0.24
p[Nsql & excel] = 0.28 - In a city, 7% of people are on Twitter.
5% on Linkedin
4% on both - A random person is choosen, what is the probability that he is on twitter?
- 7%
- A random person on Linked is choosen, what is the probability that he is on twitter?
- p[T | L] = p[T & L] / p[L] = 0.04/0.05 =0.8
- If providing a information about event B changes the probability of event A then we call them dependent events
- if p[T|L] != p[T] then they are dependent
- A website has noticed the following stats. Among those who saw the ad
70% saw on youtube
50% saw on Amazon
35% saw on both - A random person is chosen, what is the probability that he saw it on Youtube
- p(y) = 0.7
- A random person who saw the ad on Amazon is chosen. What is the probability that he also saw the ad on Youtube
- p(y|A) = p(y & A)/p(A) = 0.35/0.5 = 0.7
- if p(y|A) = p(y) then they are independent events
- Independent Events
- p[y|A] = p[y]
p(y & A)/p(A) = p[y]
p(y & A) = p[y] * p(A) - Interview Questions
- A and B are two independent events, where it is known that $P(A u B) = 0.5$ and $P(A) = 0.3$. What is $P(B)
- $P(A u B) = p(A) + p(B) - p(A n B)
Given they are independent $P(A u B) = p(A) + p(B) - p(A) * p( B)
0.5 = 0.3 + p(B) - 0.3 * p(B)
0.2 = 0.7 p(B)
p(B) = 2/7 - Amit can solve a math problem with probability of 0.7, and Bharat can solve it with a probability of 0.5. Both of them attempt this problem independently.
- What is the probability that both of them will solve it.
- p(A n B) = p(A) * p(B) = 0.7 * 0.5 = 0.35 # they are independent events
- What is the probability that neither of them solve it
- p(A u B)' = 1 - P(A u B) = 1 -( p(A) + p(B) - p(A n B))
= 1 - (0.7 + 0.5 - 0.35) = 0.15 - Mutually exclusive events are dependent
- disjoint event are always dependent
- 50% of people who gave the first round of an interview were called back for 2nd round.
95% of the people who got involved for second round, felt that they had a good first round.
75% of the people who did not get invited for 2nd round also felt that they had a good first round. - Given that a person felt good about the first round, what is the probability that he cleared the first round
- p[cleared | felt good] = (.50*.95)/(.50*.95 + .50*.75) = 0.475/(0.475 + 0.375) = ~0.56
- A city has 2 taxi companies A & B. A has 60 % of taxies and B has 40% of taxies. A taxis are involved in 3% of accidents and B taxis are involved in 6% of accidents. If a taxi is involved in accident what is the probability that it is B taxi
- A=60%, B=40%
p(acc | A)=0.03 & p(acc | B)=0.06
p(B | acc)=?
p(B | acc) =p(acc | B) * p(B) / ( p(acc | A) * p(A) + p(acc | B) * p(B))
=(0.06 * 0.4)/((0.06 * 0.4) + (0.03 * 0.6)) = ~0.57 - It is known that 30% of emails are spam and 70% are not spam. The word "purchase" occurs in 80% of spam emails. It also occurs in 10% of non-spam emails. A new mail does not have the word "purchase" what is the probability that it is spam?
- p(spam | not purchase) = p(not purchase| spam ) * p(spam ) / ( p(not purchase| spam ) * p(spam ) + p(Not purchase| Not spam ) * p(Not spam )) =~0.086
- 5% of all LinkedIn users are premium users. 10% of premium users are seeking new jobs.
2% of non-premium users are seeking new jobs. A randomly chosen person is NOT seeking new jobs. What is the probability that he is a premium user? - p[seeking job | premium] = .1* 0.05 , p[not seeking job | premium] = .9* 0.05
p[seeking job | non premium] = .02* 0.95, p[not seeking job | non premium] = .98 * .95 - p[premium | not seeking]= p[not seeking job | premium] / ( p[not seeking job | premium] + p[not seeking job | non premium]) = .045/(0.045+.931)=0.046
- A website shows two types of ads:
60% of the visitors see Type A ads, and 40% visitors see Type B ads. The click-through rate for A is 5%, and for B is 3%. A visitor to the website does not click the ad. What is the probability that he saw Type A ad? - p[typeA|no click] = p[no click|typeA]*p[typeA]/(p[no click|typeA]*p[typeA] + p[no click|typeB]*p[typeB])
= .6*.95/(.6*.95 + .4*.97) = ~0.595 - Facebook has a content team that labels pieces of content on the platform as either spam or not spam. 90% of them are diligent raters and will mark 20% of the content as spam and 80% as non spam. The remaining 10% are not diligent raters and will mark 0% of the content as spam and 100% as non spam. Assume the pieces of content are labelled independently of one another for every rater. Given that the rater has labelled four pieces of content as good, what is the probability that the rater is diligent.
- p(4 good content|non diligent)=1
p(4 good content|diligent)=0.8^4 = ~0.41 - p[diligent|4 good content] = p[4 good content|diligent]*p[diligent]/(p[4 good content|diligent]*p[diligent] + p[4 good content|not diligent]*p[not diligent])
= (0.41 * 0.9) / (0.41 * 0.9 + 1 * 0.1) = 0.369/0.469 = ~0.79 - Suppose 5 percent of men and 0.25 percent of the women are color-blind.
A random color-blind person is chosen. What is the probability of this person being male? Assume there are equal number of men and women overall. - p(male | color blind) = p(color blind | men)*p(men)/(p(color blind | men)*p(men) + p(color blind | women)*p(women))
= (.5 * .05)/(.5*.05 + .5 * .0025) = ~95% - A gambler has in his pocket a fair coin and a two-headed coin. He selects one of the coins at random, and flips it. It lands heads. Compute probability that it is fair coin.
- p[fair|heads]=p[heads|fair] *p[fair]/(p[heads|fair] *p[fair] + p[heads|unfair] *p[unfair])
= (0.5 * 0.5) / (0.5 * 0.5 + 1 * 0.5) = 0.33 - A gambler has in his pocket a fair coin and a two-headed coin. He selects one of the coins at random, and flips it twice. It shows heads both the times. What is the probability that it is fair coin?
- p[fair|2heads]=p[2heads|fair] *p[fair]/(p[2heads|fair] *p[fair] + p[2heads|unfair] *p[unfair])
= 0.5 * 0.5 * 0.5 / (0.5 * 0.5 * 0.5 + 1 * 1*0.5)
=0.2 = 1/5 - Toss coin 3 times & you get HHT. What are the chances that it is a fair coin
- p[fair|2H 1T]=p[2H 1T|fair] *p[fair]/(p[2H 1T|fair] *p[fair] + p[2H 1T | unfair] *p[unfair])
=0.5^3 * 0.5 /(0.5^3 * 0.5 + 0) = 1 - A family has 2 children, at least one of them is a girl. What is probability that both are girls.
- p(B) = 0.5 p(G) = 0.5
sample space = {BG,GB,BB,GG}
p(atleast 1 girl) = 3/4
p(2 girls) = 1/4
p(both are girls | atleast one of them is girl) =
p(B|A) = p(A n B)/p(A)
p(2 girls | atleast 1 girls) =p(2 girls N atleast 1 girls)/ p(atleast 1 girls)
1/4 / 3/4 = 1/3
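- A small helper that codifies the Bayes-rule pattern used in the problems above (function name is just for illustration; checked against the disease-test numbers):
def bayes(prior, like_given_h, like_given_not_h):
    # P(H | E) = P(E|H)P(H) / [ P(E|H)P(H) + P(E|not H)P(not H) ]
    evidence = prior * like_given_h + (1 - prior) * like_given_not_h
    return prior * like_given_h / evidence
print(bayes(0.1, 0.8, 0.05))   # disease test: P(D | +ve) = 0.64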
- Describe data in detail
- Speed meter -> describes speed (tells you the speed)
Google maps -> tells you how long it takes to reach destination. It calculates based on avg speed, past data & traffic on road. It is derived/inferred info. - Describing data - this is avg, max, min, mean. as a fixed quantity. It is called descriptive Statistics.
- If we are using given data to infer some other information, it is called inferential statistics.
- Descriptive means Summarizing. Driving at X km/hr.
- Inferential means drawing conclusions from data. You will reach in 1hr.
- Hypothesis testing, Predictive analytics, Confidence Interval, Recommender systems.
- Glass Door -> Salary at FAANG
mean say 35L and max is 40L from the users on Glass door for same position and experience
What salary will you ask for -> expectation is avg or abv avg. - Median vs Mean
- Mean gets impacted with outlier. Hence go for Median
- Median is more robust to outliers.
- Mode
- Most frequently occurring number or data
- Weighted mean
- sum(Wi * Vali)/sum(Wi)
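- A quick numeric check with made-up marks (80, 90, 70) and weights (2, 3, 5):
weights = [2, 3, 5]
values = [80, 90, 70]
print(sum(w * v for w, v in zip(weights, values)) / sum(weights))   # 780/10 = 78.0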
- The mean weight of 2 children in a family is 40 Kgs. If the weight of the mother is included, the mean becomes 45. What is the weight of the mother?
- (A+B)/2 = 40
(A+B+M)/3 = 45
80 + M = 135
M=55 - Range
- Highest - lowest (max-min)
- Percentile
- %age of values less than or equal to given value
- 30, 30, 35, 40, 45 -> %tile of 40 is 4/5*100= 80%
- Median - 50th percentile
- Q1 - 25th
- Q3 - 75th
- Inter Quartile range
- Q3 - Q1(75th - 25th)
- upper whisker = Q3 + 1.5 IQR #stop at logical max
- lower whisker = Q1 - 1.5 IQR #stop at logical min
- Case Study: Sehwag vs Dravid - who is more consistent?
- https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/035/130/original/sehwag.csv?1684996594
- import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sehwag=pd.read_csv("sehwag.csv")
p_25 = np.percentile(sehwag['Runs'],25) #25 percentile
p_50 = np.percentile(sehwag['Runs'],50) #50th percentile
p_75 = np.percentile(sehwag['Runs'],75) #75th percentile
iqr_sehwag= p_75 - p_25
sns.boxplot(data=sehwag['Runs'],orient="h")
upper = p_75 + 1.5 * iqr_sehwag
lower = p_25 - 1.5 * iqr_sehwag #assume to be 0 if it is going below 0
#count of outliers
len(sehwag[sehwag['Runs'] > upper])
#repeat above with 'dravid.csv'
25% for Dravid is 10 and 8 for Sehwag
75% for Dravid is 54 and 46 for Sehwag
Dravid has 1% outliers and 6% for Sehwag - For consistency - fewer outliers should be there. Hence, Dravid is more consistent than Sehwag
- https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/035/126/original/weight-height.csv?1684995383
df_hw = pd.read_csv("weight-height.csv")
df_hw.head()
#Plot of Value against its percentile
#Plot of percentile for every value -> Cumulative Distribution Function
- CDF - Cumulative distribution function
- Plot with #number of people <than given height on Y axis & height on X axis
- Plot with % of people < than given height on Y axis & height on X axis #is CDF
- from statsmodels.distributions.empirical_distribution import ECDF #empirical means from data
- e=ECDF(df_hw['Height'])
plt.plot(e.x, e.y) - Standard Deviation
- variance = ((h1-mu)^2 + (h2-mu)^2 + (h3-mu)^2 + ... + (hn-mu)^2)/n
- standard deviation = sqrt(variance)
- When normal data is plotted as a histogram and the mean and standard deviation are calculated, the range mu-sigma to mu+sigma contains about 68% of the data (an empirical/experimental observation)
mu - 2*sigma to mu + 2*sigma corresponds to 95% of the data
mu - 3*sigma to mu + 3*sigma corresponds to 99.7% of the data
This is called the 68/95/99.7 rule
Any curve that follows the 68/95/99.7 rule is called a NORMAL/Gaussian curve - The height of people is Gaussian with mean 65 inches and standard deviation 2.5 inches. What fraction of people are shorter than 67.5?
- 50 + 34 = 84%
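- A scipy cross-check of the 68/95/99.7 rule and the answer above (values are approximate):
from scipy.stats import norm
print(norm.cdf(1) - norm.cdf(-1))    # ~0.68
print(norm.cdf(2) - norm.cdf(-2))    # ~0.95
print(norm.cdf(3) - norm.cdf(-3))    # ~0.997
print(norm.cdf((67.5 - 65) / 2.5))   # shorter than 67.5 -> ~0.84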
- The height of people is Gaussian with mean 65 inches and standard deviation 2.5 inches. What fraction of people whose height between 60 and 72.5?
- mu-2sigma TO mu is 95/2 = 47.5
mu to mu+3sigma is 99.7/2 = 49.85
Total = 97.35 - How many standard deviation away is 69.1 from 65?
- (x-mu)/sigma = 4.1/2.5 = 1.64
- mu + z* sigma = X #X is Standard deviation away
z = (X - mu)/sigma - Z-score table
- For a given value of Z how many % values are less than that in the data
- How many std deviation away from the mean the value is
- For a given value of Z what % of data is less than the corresponding item
- for any value X, calculate Z -> X -mu / sigma
using z score table -> value from the table is P
P% values are less than X in the original data - How many people have height less than 69.1
- Z = (69.1 - 65)/2.5 = 1.64
69.1 is 1.64 sigma away from mean(65)
mu + 1.64*sigma is 69.1
94.95% of the data <= 69.1 - Z score table -> CDF for a given z value
- Libraries scipy and statsmodel for probability stats
- from scipy.stats import norm
norm.cdf(-1) #fraction of values below mu - sigma (one sigma below the mean) - How many people are having height less than 69.1
- z = (69.1 - 65)/2.5
norm.cdf(z) - Cricket ball manufacturer. Mean of the ball size is 50mm. Standard deviation is 2mm
What is the corresponding value to z-score of 1.5. - 1.5 = (x - 50)/2 = 53
- What fraction of bass have diameter smaller than 53mm?
- z=1.5 sigma away
norm.cdf(1.5) = 93.3% less than 53mm - How many balls have diameter >= 53mm?
- 1 - norm.cdf(1.5)
- The height of people is Gaussian with mean 65 inches and standard deviation 2.5 inches. What fraction of people are shorter than 67.5?
- norm.cdf((67.5-65)/2.5) = 84.13
- mu = 65 inches, sigma =2.5
96% people are shorter than me. What is my height - norm.ppf(0.96)*2.5 = X - 65
X=69.37 - MS Interview qn
- Skaters take a mean of 7.42 seconds and std dev of 0.34 seconds for 500 meters. What should his speed be such that he is faster than 95% of his competitors?
- I take less time than 95% of the competitors
95% of people have higher time than me
5% of people have less time than me
z=norm.ppf(0.05)
X= z*sigma +mu =norm.ppf(0.05) * 0.34 +7.42 = 6.86 seconds
speed = 500/6.86 = 72.88 Meters per sec - A retain outlet sells around 1000 toothpastes a week, with a std dev = 200.
If we have 1300 stock units as our inventory then what fraction of weeks will we go out of stock. - 1-norm.cdf(300/200) = 6.7%
- df=pd.read_csv('netflix_titles.csv')
- df.isna().sum() #checking for null values
df.isna().sum()/len(df)*100 #%age of missing values - Major Challenges with the data set
- Clubbed/Nested data
- Director, cast, country, listed_in
- Missing Values
- Drop
- Replace
- MODE based imputation for Categorical values
- Smarter idea: MODE with extra context like Genre when cast is missing
- preprocessing required
- When numeric and string is combined. Split the same to find avg like values on numeric
- For separate analysis of movies & TV shows, it is important to first distinguish them. Find out what percentage of titles present in the dataset are TV shows and what percentage of them are movies?
- df['type'].value_counts(normalize=True) * 100
- Clubbed/Nested data
- Business Problem
- Market research team at AeroFit wants to identify the characteristics of the target audience for each type of treadmill offered by the company, to provide a better recommendation of the treadmills to the new customers. The team decides to investigate whether there are differences across the product with respect to customer characteristics.
- Perform descriptive analysis to create a customer profile for each AeroFit treadmill product by developing appropriate tables and charts.
- For each Aerofit treadmill product, construct two-way contingency tables and compute all conditional and marginal probabilities along with their insights/impact on the business.
- Ask: For each type of Treadmill, one have to answer which type of people are more likely to purchase it.
- why: Identify target audience
- what does good look like?
- Import the dataset and do usual data analysis steps like checking the structure & characteristics of the dataset
- Detect Outliers (using boxplot, "describe" method by checking the difference between mean and median)
- check if features like marital status, age have any effect on the product purchased(using countplot, hist plots, boxplots etc)
- Representing the marginal probability like - what percent of customers have purchased KP281, KP481 or KP781 in a table (can use pandas.crosstab here; see the sketch after this list)
- check correlation among different factors using heatmaps or pairplots
- with all the above steps you can answer questions like: what is the probability of a male customer buying a KP781 treadmill?
- Customer Profiling - Categorization of users
- Probability - marginal, conditional probability
- Some recommendations and actionable insights based on the inferences
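- A minimal sketch of the contingency-table step (the file name and the 'Gender'/'Product' column names are assumed from the Aerofit dataset):
import pandas as pd
df = pd.read_csv('aerofit.csv')                                      # assumed file name
print(pd.crosstab(df['Gender'], df['Product'], normalize='all'))     # joint/marginal probabilities
print(pd.crosstab(df['Gender'], df['Product'], normalize='index'))   # P(Product | Gender), e.g. P(KP781 | Male)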
- Gaussian distribution
- It is called normal distribution - For any random variable in real life, if we plot a graph of their distribution it will ideally be similar to Gaussian distribution
- Most random variable follows Gaussian
- It is centered around mean
- Standard Deviation: a way to measure how far values typically are from the center
- Z-score : How many standard deviation away is a value from the center
- CDF -> how many % values are less than the given element
- norm.cdf(z)
- z table
- Sampling in Pandas
- df_height.sample(10)
- sample_mean_10 = [np.mean(df_height.sample(10)) for i in range(10000)]
- sns.histplot(sample_mean_10)
- A distribution of sample means will always be a normal/Gaussian distribution
- Central Limit Theorem
- Mean of distribution of sample means of any dataset is equal to the Population Mean
- As the number of items per sample decreases, the standard deviation of the sample means increases (simulated in the sketch below)
- Std Dev of the sample mean distribution is proportional to 1/sqrt(n) #n is the number of items in a single sample
- Std Dev of the sample mean distribution is called the Standard Error.
- Std Err = sigma / sqrt(n) #sigma is the population std dev
- mu sample ~ mu population
- x' ~ N(mu, sigma/sqrt(n))
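- A small simulation of the 1/sqrt(n) behaviour of the standard error (synthetic, deliberately non-Gaussian population, just for illustration):
import numpy as np
rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)
for n in [5, 25, 100]:
    sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]
    print(n, np.std(sample_means), population.std() / np.sqrt(n))   # the two numbers roughly match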
- Examples
- Systolic blood pressure of a group of people is known to have an average of 122 mmHg and a standard deviation of 10 mmHg. Calculate the probability that the average blood pressure of 16 people will be greater than 125 mmHg.
- std error = 10/sqrt(16) = 2.5; 1-norm.cdf((125-122)/2.5) = ~0.115
- Weekly toothpaste sales have a mean 1000 and std dev 200. What is the probability that the average weekly sales next month is more than 1110.
- Ask: Calculate the mean of a sample of 4 weeks. sigma of sample mean = std error = sigma population/sqrt(n) = 200/sqrt(4) = 100; 1-norm.cdf((1110-1000)/100) = ~0.135
- In an e-commerce website, the average purchase amount per customer is $80 with a standard deviation of $15. if we randomly select a sample of 50 customers, what is the probability that the average purchase amount in the sample will be less than $75.
- std error = 15/sqrt(50) = 2.1213; norm.cdf(-5/2.1213) = 0.0092
- Recap
- %tile & CDF are equivalent
- z-score is how many standard deviations away from mean
- Question
- The average time taken for customers to complete a purchase is 4 minutes with a std dev of 1 min. Find the probability that a randomly selected customer will complete a purchase within 6 minutes? Assume Gaussian
- z = (6-4)/1 = 2
norm.cdf((6-4)/1) = ~0.977 - The average time taken for customers to complete a purchase is 4 minutes with a std dev of 1 min. Find the probability that the average time of the next 5 customers is less than 6 minutes
- std error = 1/sqrt(5)
import numpy as np
from scipy.stats import norm
norm.cdf((6-4)/(1/np.sqrt(5))) => 0.9999
- The average order value of an e-commerce website is 50, with a standard deviation of 5.
What is the probability that the average of the next 3 orders exceeds 60? - import numpy as np
from scipy.stats import norm
1-norm.cdf((60-50)/(5/np.sqrt(3)))
- Confidence Interval
- Few sample data -> can we predict a range for population. What is the probability that value will lie in the range.
- Avg age of content creators of Instagram
- 90%(Confidence) sure that age lie between 13-20(Interval).
- For any norm distribution, take Za & Zb at 5% and 95% of the curve
Za=norm.ppf(0.05) = -1.64
Zb=norm.ppf(0.95) = 1.64
90% of all values will lie between mu - 1.64*sigma & mu + 1.64*sigma
Take a random X (mean of a random sample). There is a 90% chance that X will fall between
mu - 1.64*sigma & mu + 1.64*sigma, i.e.
mu - 1.64*sigma < X < mu + 1.64*sigma
here the sigma of the sample-mean distribution is sigma-population/sqrt(N), so
mu - 1.64*sigma-population/sqrt(N) < X < mu + 1.64*sigma-population/sqrt(N)
from many experiments it is known/assumed that sigma-sample ~ sigma-population
hence mu - 1.64*sigma-sample/sqrt(N) < X < mu + 1.64*sigma-sample/sqrt(N) - Given a sample with muSample & sigmaSample. 90% chance that the population mean will lie between muSample +- 1.64 * sigmaSample/sqrt(n)
muSample +- norm.ppf(0.05) * sigmaSample/sqrt(n)
muSample +- norm.ppf(0.025) * sigmaSample/sqrt(n) #95% confidence - As sample size increases the range decreases
- Examples
- The sample mean recovery time of 100 patients after taking a drug was seen to be 10.5 days with a standard deviation of 2 days. Find the 95% confidence interval of the true mean.
- muSample +- norm.ppf(0.025) * sigmaSample/sqrt(n)
import numpy as np
from scipy.stats import norm
samplemean = 10.5
samplesigma = 2
conflevel = 0.025
size = 100
print(samplemean + norm.ppf(conflevel) * samplesigma/np.sqrt(size))
print(samplemean - norm.ppf(conflevel) * samplesigma/np.sqrt(size))
10.11 to 10.89
- CI using Bootstrap
- survey_1 = [35,36,33,34,35]
np.mean(survey_1)
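- A minimal bootstrap sketch for a 95% CI of the mean (resample with replacement many times, then take the 2.5th/97.5th percentiles of the resampled means):
import numpy as np
rng = np.random.default_rng(0)
survey_1 = [35, 36, 33, 34, 35]
boot_means = [np.mean(rng.choice(survey_1, size=len(survey_1), replace=True)) for _ in range(10_000)]
print(np.percentile(boot_means, [2.5, 97.5]))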
- Probability = count of favorable events/total events
- Questions
- India and Pakistan play a 3-match series. How many results are possible? Note that we consider(Ind, Ind, Pak) different from (Ind, Pak, Ind) etc.
- Tree -> with all possibilties = 8
- Possible outcomes 2 * 2* 2 = 8
- In a bowl-out, for a specific ball you have to choose a bowler and a wicketkeeper. Suppose you have 5 bowlers and 3 wicketkeepers. How many ways you can select for a ball?
- 5 ways for bowler and 3 ways for wicketkeeper = 15
- Tree -> Each bowler, select a keeper - there will be 15 combinations
- There are 3 ways to move from Chennai to Bangalore, and 4 ways to move from Bangalore to Delhi. There are 2 ways to move from Chennai to Hyderabad, and 3 ways to move from Hyderabad to Delhi. In how many ways can we move from Chennai to Delhi?
- 2 trees/maps
- Permutations
- N Objects to Arrange in R Slots
- N!/(N-R)! = nPr
- 5 letters -> A, A, B, C, D and 2 slots. How many possible ways can you arrange these
- 13 (manual process by creating trees using A1, A2 and eliminating the duplicate arrangements from the repeated A)
- Combinations
- Counting when Order doesn't matter
- Total possible ways of choosing = nPr/r!
- Question
- You have to choose top 3 order of batsmen for India from (Rohit, Kohli, Shreyas, Rahul).
How many possible ways you can choose those players = 4C3 = 4
How many possible ways to arrange them? 4*3*2 = 24 - A Maruti showroom has 3 colour in their "Baleno" model and 3 colours in "Swift" model. In how many ways can they place it such that Baleno and Swift are kept in alternate slots.
- 3*3*2*2*1*1 + 3*3*2*2*1*1 = 36+36 = 72
- Arrange A A B C D in 2 slots
- If no common elements -> 4C2 = 6
- Both are same - 1
- Ways to arrange - 4C2 * 2 = 12
- Both are same -1
- Total = 13
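- A brute-force check of the count above with itertools (distinct ordered arrangements of 2 letters drawn from A, A, B, C, D):
from itertools import permutations
print(len(set(permutations('AABCD', 2))))   # 13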
- Coin Tosses: When coin is tossed 100 times, what are the chances that we get 52 heads
- 100C52 * (0.5)^100
- Binomial -> criteria is
- Fixed number of trials (e.g. interviews) - say n
- For each interview success rate is say P
- Each interview is independent of other interview
- X : number of successes
- if n=1 then X will be {0,1}
- probability of success say is 0.1 then
- p(x=0) =0.9 & p(x=1) =0.1
- PMF:
- Probability Mass function: Distribution where X-axis is actual event values and Y-axis is probability
- Plot between probability against values is called PMF
- p=0.1
x_vals=[0,1]
probs=[1-p,p]
sns.barplot(x=x_vals, y=probs) - from scipy.stats import binom
x_vals=[0,1,2,3]
probs=binom.pmf(x_vals,n=3,p=0.1)
sns.barplot(x=x_vals, y=probs) - when n=1 the same is called Bernoulli distribution
- when n=50
- x_vals=np.arange(0,51)
probs=binom.pmf(x_vals,n=50,p=0.1)
sns.barplot(x=x_vals, y=probs) - When N interviews are given
- Total possible outcomes is 2^N
- p(X=k) = nCk * p^k * (1-p)^(n-k)
- The above is similar to choose K boxes out of the n
when k=1 it will be nc1
so similarly for k boxes it is nck - nCk => math.comb(n,k)
- Geometric Progression
- Keep giving interviews till you get success
- {S}
{F,S}
{F,F,S}
{F,F,F,S}
{F,F,F,F,S} - on what interview will you get success?
- p(x=1)=0.1
p(x=2)=(0.9)* 0.1
p(x=3)=(0.9)*(0.9)*0.1
p(x=k)=(0.9)**(k-1) * 0.1 - p(x=k)=(1-p)**(k-1) * p
- code
- from scipy.stats import geom
p=0.1
x_vals = np.arange(1,20)
probs_geom = geom.pmf(x_vals,p) - question
- The probability that Messi scores a penalty shot is 0.8. What is the probability that he will have 7 or fewer successes in 10 chances
- x_vals = np.arange(0, 11)
probs=binom.pmf(x_vals, n=10, p=0.8)
sns.barplot(x=x_vals, y=probs) - np.sum([binom.pmf(k=i, n=10, p=0.8) for i in np.arange(0,8)])
- binom.cdf(k=7,n=10,p=0.8) #cumulative probability till the given K
- What is the probability that we will score 8 or more
- 1- binom.cdf(k=7,n=10,p=0.8)
- Suppose we float 10 quizzes with four options each. Calculate the probability that a student, who randomly guesses, answers 2 questions correctly.
- p(x=2) = nC2 * p^2 * (1-p)^(n-2)
- binom.pmf(k=2,n=10,p=0.25)
- n=10
k=2
p=0.25
math.comb(n,k) * p**k * (1-p)**(n-k) - Types of Probability
- Marginal Probability
- Probability that Sachin score a century
- Probability that Sachin team wins
- Conditional probability
- df_sachin[["century","Won"]].value_counts() #results groupby century and won
- Given that Sachin has scored a century what are the chances that India wins
- 30/46 #for given set
- Given that India wins what are the chances that Sachin has scored a century
30/184 - Joint Probability
- Probability that sachin scores a century and India wins
- pd.crosstab(index=df_sachin["century"], columns=df_sachin["won"]) #contingency table (see the sketch below)
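- Sketch (tiny made-up 0/1 data just to show the crosstab calls; the real analysis uses df_sachin)
import pandas as pd
df = pd.DataFrame({"century": [1, 1, 0, 0, 1, 0], "won": [1, 0, 1, 0, 1, 0]})
print(pd.crosstab(df["century"], df["won"]))                     # contingency table (counts)
print(pd.crosstab(df["century"], df["won"], normalize="index"))  # conditional: P(won | century)
print(pd.crosstab(df["century"], df["won"], normalize="all"))    # joint: P(century AND won)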
- what kind of distribution
- N is fixed, there are 2 possible outcomes, and the probability of one outcome is known -> Binomial distribution
- Geometric
- search on google/netflix
- dictionary lookup
- Assume something & test the assumption
- Statistically prove whether the assumption is correct or not. This is essentially called Hypothesis Testing.
- Default assumption is called NULL Hypothesis
- Suspect is accused of murder. (criminal or not criminal)
- Innocent until proven guilty -
- Assumption is Innocent and asked to prove the assumption is incorrect
- Burden of proof is on those who want to reject the default assumption.
- who are introducing the new claim
- Cricket
- Third umpire - When there is a dispute on the field, the decision of the on-field umpire is challenged
- on-field umpire gives a soft call for lbw. - what is the default assumption for the 3rd umpire (not out, out) OR on-field umpire is correct vs not correct
- Covid test(+ve or -ve). -ve default assumption
- Null Hypothesis(H0) -> Default Assumption
- Alternate Hypothesis(Ha) -> your assumption when you reject the null hypothesis
- p[data at least this extreme | Ho] -> this is called the P-value
- If P-Value is low - should you accept your hypothesis or reject
- Reject null hypothesis
- Default behavior -> Null hypothesis
- Collect evidence/data opposite to null hypothesis
- p(evidence | if Ho is true) is low -> reject the null hypothesis
- A Juice brand claims that its new manufacturing process has reduced the sugar content in its juice boxes to 8 gms. Now, food safety and standards authority of India(FSSAI) wants to test the claim of the juice brand, and choose the correct option:
- Ho: The sugar content is unchanged (default behaviour); Ha: The new manufacturing process has reduced the sugar content to 8 gms
- The default assumption should NOT be the claim; rather it should be the default behaviour
- If P-value is less than alpha(significance level) then we reject the null hypothesis
- When test says - NO Virus and reality also have No Virus then TRUE NEGATIVE
When test says - Virus and reality also have Virus then TRUE POSITIVE
When test says - Virus and reality NO Virus then FALSE POSITIVE(Type 1 error)
When test says - NO Virus and reality have Virus then FALSE NEGATIVE(Type 2 error) - Test is Negative -> unable to reject NULL hypothesis
- Left handed or left tail test
right handed or right tail test
Two/double tail test
- How to decide Ho
- Default behavior
- fair coin
- innocent
- based on who is testing a claim
- Hypothesis testing framework
- NULL and alternate hypothesis
- Identify the distribution (gaussian, etc.)
- Left, right, Two tailed
- Compute p-value[Probability of Seeing observed values GIVEN Ho is True]
- Compare p-value with alpha
- Central Limit Theorem (CLT)
- D.Mart - avg weekly sale is 1800 with std dev 100. They hired a marketing team to improve sales. Marketing team started with 50/2000 stores to test the strategy. Firm says the avg sale of 50 stores increased to 1850.
- NULL and alternate hypothesis
- Ho -> sale is same
- Ha -> sale increased (claim of marketing team)
- Identify the distribution (gaussian, etc.)
- gaussian
- Left, right, Two tailed
- right tailed
- Compute p-value[Probability of Seeing observed values GIVEN Ho is True]
- mu population =1800 & std dev = 100
mu of sampling distribution =1800 & std dev of sampling distribution = 100/sqrt(50) - p[sample mean >= 1850 | Ho is true]
- z= (1850-1800)/(100/sqrt(50))
1-norm.cdf(z) = 0.0002 - Compare p-value with alpha
- alpha=0.01(if observed value lies in extreme 1% then we can reject Ho. 99% confidence that the claim made is true)
alpha>p-value => reject Ho - Yes, marketing has increased the sales
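- A minimal sketch of the D-Mart z-test above (numpy/scipy, as used elsewhere in these notes)
import numpy as np
from scipy.stats import norm
mu, sigma, n, sample_mean = 1800, 100, 50, 1850
z = (sample_mean - mu) / (sigma / np.sqrt(n))   # ~3.54
p_value = 1 - norm.cdf(z)                       # right-tailed, ~0.0002
print(p_value < 0.01)                           # True -> reject Ho at alpha=0.01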
- alpha -> significance
- 1-alpha -> confidence level
- The weights of apples in a fruit market are normally distributed with a mean of 150 grams and standard deviation of 20 grams. If an apple weighs 140 grams, what is its z-score
- mu=150, stddev=20
z-score = (140-150)/20 = -0.5 - A coffee shop claims that their coffee cups contain, on average, at least 12 ounces of coffee. A random sample of 36 coffee cups showed an average of 11.8 ounces with a std dev of 1.5 ounces. Conduct a z-test to determine if the coffee shop's claim is supported. What is the p-value?
- Notes
- There are 3 distributions: the population, the sampling distribution of the sample mean, and the single sample
mu pop = mu of sampling distribution
sigma of sampling distribution = sigma pop/sqrt(n) - if pop std dev is not given, assume it to be the same as the single sample's std dev
- mu = 12 std dev = when not given it is same as sample = 1.5
single sample size = 36, mean =11.8 and std dev =1.5
sample mean distribution, mean = 12, stddev = 1.5/sqrt(36) = 0.25 - z=(11.8-12)/(1.5/sqrt(36)) = -0.8
- Ho = avg content is 12 ounces (the claim holds)
Ha = avg is < 12 ounces - left tailed
- p-value = % of sample means <= 11.8 = norm.cdf(-0.8) ~ 0.21 > 0.05 -> cannot reject the claim (see the sketch below)
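- Sketch of the coffee-cup test, assuming the left-tailed framing above
import numpy as np
from scipy.stats import norm
z = (11.8 - 12) / (1.5 / np.sqrt(36))   # -0.8
p_value = norm.cdf(z)                   # P(Z <= -0.8) ~ 0.21 > 0.05 -> cannot reject the claim
print(z, p_value)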
- A fitness app claims that its users walk an average of 8000 steps per day. A random sample of 30 users showed an average of 7600 steps per day with standard deviation of 1200 steps. Conduct a right tailed Z-test at a 5% significance level to determine if the app's claim is supported. What is the p-value?
- claim 8000 steps
- mu population = 8000 sigma population=1200
alpha = 0.05
Observed value = 7600 and n=30 - Sample mean distribution
mean =8000, std dev = 1200/sqrt(30)
z-score = (7600-8000)/(1200/sqrt(30)) = -1.83 (see the sketch below) - Critical value
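- Sketch of the fitness-app numbers; note the sample mean (7600) is below the claimed 8000, so the evidence sits in the left tail even though the question says "right tailed"
import numpy as np
from scipy.stats import norm
z = (7600 - 8000) / (1200 / np.sqrt(30))   # ~ -1.83
p_left = norm.cdf(z)                        # ~0.034 < 0.05
p_right = 1 - norm.cdf(z)                   # ~0.966
print(z, p_left, p_right)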
- A french cake shop claims that the average number of pastries they can produce in a day exceeds 500. The average number of pastries produced per day over a 70-day period was found to be 530. Assume that the population standard deviation for the pastries produced per day is 125. Test the claim using a z-test with the critical z-value = 1.64 at alpha (significance level) = 0.05, and state your interpretation.
- framework
- NULL and alternate hypothesis
- Identify the distribution (gaussian, etc.)
- Left, right, Two tailed (decided by Ho and Ha values NOT by observed and mean)
- Compute p-value[Probability of Seeing observed values GIVEN Ho is True]
- Compare p-value with alpha
- Ho -> Cake shop produces 500 cakes per day, mu =500, Ha>500
- right tailed test
- 70 day sample -> std dev = sigma pop/sqrt(70)
- z-score = (530-500)/(125/sqrt(70)) = 2.01
- p-value = 1-norm.cdf(z) = 0.022
- alpha = 0.05
- p-value < alpha -> hence reject Ho
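- Sketch of the pastry-shop calculation above
import numpy as np
from scipy.stats import norm
z = (530 - 500) / (125 / np.sqrt(70))   # ~2.01 > critical value 1.64
p_value = 1 - norm.cdf(z)               # ~0.022 < alpha=0.05 -> reject Ho
print(z, p_value)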
- https://www.scaler.com/instructor/meetings/i/t-test-24/
- iq_score = [110,105,98,102,99,104,115,95]
population avg =100
Ho = pill has no effect, avg is still 100
H1 = Avg>100 - z-score = (x-mu)/(sigma population/sqrt(n))
- t-score/t statistics = (x-mu)/(sigma sample/sqrt(n))
- from scipy.stats import ttest_1samp, ttest_ind
t_statistic, pvalue = ttest_1samp(iq_score, mean_for_Ho, alternative='greater' / 'less' / 'two-sided')
t_statistic, pvalue = ttest_1samp(iq_score, 100, alternative='greater') - When Population mean not available -> 2 tests/samples
- compare general people with people took medicine
- T test using 2 samples
- we try to find if the samples belong to same population
- T Test
- 1 sample -> Sample against population
- 2 sample -> sample against sample
- For Number vs Categorical (only 2 categories at a time)
- Applicable when we compare numeric data of 2 categories
- Innings1 vs Innings2 runs
- Drug1 vs Drug2 recovery
- >2 sample -> Anova test
- recap
- z-test
- world follows normal distribution
- Calculate p-value using z-score
- norm.cdf
- Requires population mean and std dev which is a challenge
- t-test
- the test statistic doesn't follow the normal distribution; it follows the t-distribution
- Requires population mean and std dev of sample
- Use when population std dev is not provided OR when N is small
- t-test function from python lib
- Numeric values with 2 categories - t-test
- Numeric values with 2+ categories - Anova test
- Numeric with Numeric - Correlation
- Category with Category -> Chisquared test
- compare if gender makes a diff in Product Sales
- Degree of Freedom
- Given a cumulative/aggregate value, how many values are required to be known to calculate all values in the data.
- (#row-1)*(#col-1) -> for matrix
- (#row-1) + (#col-1) -> for 2 arrays
- Coin Toss
- 50 times toss of fair coin
- Expected 25 heads and 25 tails. Actual/observed is 28 heads and 22 tails. Is the coin fair. Degree of freedom is 1
- Ho -> coin is fair, Ha -> coin is unfair
- X^2 = sum((observed - expected)^2 / expected) => chi-square statistic
- If X^2 is low, there is a high chance that Ho is true
- Here we compare Observed vs Expected counts for Heads and Tails (Categorical vs Categorical)
- import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chisquare
from scipy.stats import chi2
alpha = 0.05
#Ho : coin is fair
#H1: coin is unfair
#chisquare(f_obs=observed_values, f_exp=expected_values)
#returns chi statistic & p-value -> the p-value is the right-tail area of the chi-square distribution
chisquare(f_obs=[28, 22], f_exp=[25, 25]) - chi2_contingency( -> lets Python calculate the Expected values
[[],[]
]) -> Returns chi_stats, P_value, Df(degree of freedom), expected table - Test of Independence
OR
Are 2 categorical variables dependent on each other or not - A marketing manager wants to determine if there is a relationship between the type of advertising (online, print or TV) and the purchase decision(buy or not buy) of a product. The manager collects data from 300 customers and records their advertising exposure and purchase decisions. What statistical test should the manager use to analyze this data?
- Chi Square- independence test
- Assumptions in Chi Square tests
- Variables are category - category
- Observations are independent
- Each cell is mutually exclusive
- Chi square only works if each cell >=5
- What is Hypothesis testing
- Testing a claim
- Observed data is different/significantly different than Ho
- Is the difference between the observed data and the Ho mean by chance (high p-value) or significant (low p-value)?
- Z-test
- t-test
- Chi Square test
- df_aerofit.head()
- Product, Age, Gender, Education, MaritalStatus, Usage, Fitness, Income, Miles
- Does Gender affect the Income?
- T-test (2 sample) as one is Categorical and the other is numeric
- Does Gender have any impact on Product?
- Chi-square test
- Product impact on Income
- sns.boxplot(x='Product', y='Income', data=df_aerofit)
- Which test to use when
- Num vs Num -> Correlation
- Cat vs Cat -> Chi square
- Cat vs Num
- 2 Categories -> t-test
- > 2 categories -> Anova
- Does the product get impacted by income
- Ho -> No impact of income over product purchased
- Divide data into 3 equal parts
- Create one column -> which randomly distributes the data in 3 parts
- df_aerofit["random_group"] = np.random.choice(
["g1","g2","g3"],
size=len(df_aerofit)
) - Variance between groups -> If Ho is true this should be low
- Variance within groups -> If Ho is true this should be high
- F_ratio = Variance between groups/Variance within groups -> If Ho is true this should be low & if Ha is true this should be high
- F_score = Variance between groups/Variance within groups
- Coding
- income_g1 = df_aerofit[df_aerofit["random_group"]=="g1"]["Income"]
income_g2 = df_aerofit[df_aerofit["random_group"]=="g2"]["Income"]
income_g3 = df_aerofit[df_aerofit["random_group"]=="g3"]["Income"]
income_product1 = df_aerofit[df_aerofit["product"]=="product1"]["Income"]
income_product2 = df_aerofit[df_aerofit["product"]=="product2"]["Income"]
income_product3 = df_aerofit[df_aerofit["product"]=="product3"]["Income"]
from scipy.stats import f_oneway
f_oneway(income_g1,income_g2,income_g3)
f_oneway(income_product1,income_product2,income_product3) - Compare impact of a categorical column to another numeric column
Product aginst income - Assumptions of Anova
- Data is gaussian
- Data is independent across each record
- Equal variance in diff groups
- When the above assumptions don't hold, use the "Kruskal-Wallis Test" (it does not require them)
- >= 2 categories vs numeric column
- from scipy.stats import f_oneway, kruskal
kruskal(income_product1,income_product2,income_product3) - Test if a distribution is Gaussian OR not
- sns.histplot(height) // visual check, may not be accurate
68/95/99 rule -> mu +/- 1 sigma covers ~68% - Check the 1st percentile of your data vs the 1st percentile of a gaussian distribution (theoretical quantile)
do same for every percentile - Draw graph for your quantile vs gaussian quantile data
- from statsmodels.graphics.gofplots import qqplot
qqplot(height) // will be diagonal line - This is z-score on x-axis and your data in graph
- Food delivery
- path='waiting_time.csv'
df_wt=pd.read_csv(path)
sns.histplot(df_wt["time"])
qqplot(df_wt["time"], line='s')
plt.show() -> not a straight line in diagonal direction. Hence, not a Gaussian - We are checking if some data is Gaussian or not
- Ho -> data is Gaussian
H1 -> not Gaussian - Shapiro Test -> To test if data is Gaussian or not
- You take 50-200 sample points
run Shapiro test
p-value is low -> reject Ho - from scipy.stats import shapiro
height_samp = height.sample(100)
shapiro(height_samp) - Equal Variance Assumption
- Ho -> Variance is equal
H1 -> not equal - Levene Test
- from scipy.stats import ttest_ind, levene
ttest_ind(height_men,height_women)
height_men.var()
height_women.var()
levene(height_men,height_women) - Reject Ho if pvalue is low
- Numeric Data
- Z Test -> if sigma population is known
- T Test
- Category vs Numeric
- 2 Categories -> T Test 2 samples
- > 2 Categories -> Anova
- > 2 Categories -> Kruskal-Wallis if Anova's assumptions fail
- Category vs Category
- Chi Square tests
- To Check Gaussian
- QQ Plot
- Shapiro Test
- To Check if Variance is same or not
- Levene test
- Numeric vs Numeric
- Correlation test
- df_hw=pd.read_csv("weight-height.csv")
sns.scatterplot(x=df_hw["Height"],y=df_hw["Weight"]) - Co-Variance
- cov = 1/n * sum((hi - h_mu) * (wi - w_mu))
- If co-variance is +ve, then it is called +vely co-related
- Ice cream sales VS Amount of Rainfall
- co-variance will be negative. It is called -vely co-related
- Height vs Rainfall
- Net area will be 0 => uncorrelated
- Co-variance value changes based on unit of height/weight. Hence, it is decided to divide with certain metric to standardize data. The metric is Standard Deviation. The value that is obtained after dividing by standard deviation is Co-relation co-efficient.
- rho = 1/n * sum(((hi - h_mu)/stddev_h) * ((wi - w_mu)/stddev_w))
- Value of rho will be between -1 & 1
- 0 ~ uncorrelated, 1 is positive correlation, -1 is negative correlation
- df_hw[["Height","Weight"]].corr()
- Why is co-variance not a good quantitative measure to check correlation?
- Because it changes based on scale
- Salary vs Experience
- Counter-intuitive
- Intuition -> the relation is +ve
mathematically -> the relation is neutral - Reason
- Data is not along a line
- Non-linear relationship
- Non linear relationships are not captured properly by Correlation Co-efficient
- Pearson Correlation: is the correlation that we learnt until now
- Works only with Linear relationships
- Spearman Correlation
- sort both the values(salary, experience) and get the rankx and ranky
- rho = 1/n * sum(((rankx - rankx_mu)/stddev_rankx) * ((ranky - ranky_mu)/stddev_ranky))
- from scipy.stats import pearsonr, spearmanr
pearsonr(df_hw["Height"],df_hw["Weight"]) - spearmanr(df_hw["Height"],df_hw["Weight"])
- Spearman works extremely well in monotonic increase/decrease data
- the p-value tells you how likely it is to see a correlation this strong between two actually unrelated variables with this many data points
- Pearson is used in general. When we know the graph is not linear use spearman
- If Pearson is giving near 0 values, then verify with Spearman as well
- import pandas as pd
import seaborn as sbn
df=pd.read_csv("aerofit.csv")
sbn.boxplot(y='income', x='Product', data=df) //to check outliers & get insights
sbn.boxplot(y='income', x='Gender', hue='Product', data=df) //get insights based on Gender
df['Product'].value_counts()
sbn.heatmap(df.corr(), annot=True) //correlation between different attributes
pd.crosstab(index=df['Gender'], columns=df['Product'], margins=True)
pd.crosstab(index=df['Gender'], columns=df['Product'], margins=True, normalize='columns')*100
#What is famous among females (conditional probability)
pd.crosstab(index=df['Gender'], columns=df['Product'], margins=True, normalize='index')*100
#Out of 100 products sold for KP781, how many are being bought by females
pd.crosstab(index=df['Gender'], columns=df['Product'], margins=True, normalize=True)*100 - Marginal Probability -> no conditions -> simple probability
- Conditional Probability -> p(4|'Red') -> card is a 4 given it is red = 2/26
- Joint Probability -> p(4 of blacks) = p(AnB) = p(A|B) * p(B) = 2/26 * 26/52 = 2/52
- Business Problem
- The management team at Walmart Inc. wants to analyze the customer purchase behavior(specifically, purchase amount) against the customer's gender and the various other factors to help the business make better decisions. They want to understand if the spending habits differ between male and female customers: Do women spend more on Black Friday than men? The company collected the transactional data of customers who purchased products from the Walmart stores during Black Friday. The dataset has the following features:
user_id, Product_id, Gender, Age, Occupation - When sample data is given and asked the projection for Population, do the bootstrapping and find the Confidence Interval for Population.
- CDF - Cumulative distribution function. Sum of all probabilities up to the given value
- upto 5 heads
- prob that atmost 5 interviews are cracked
- p(x<= given value)
- PPF (Percent Point Function)
- Inverse of the CDF: the value of x for a given cumulative probability
- ppf(q) returns x such that p(X<=x) = q
- PMF
- Discrete distributions (Binomial | Geometric)
- Poisson Distribution
- Will Messi score a goal in match
- How many goals will Messi score in the match
- Rate(lambda/mu) -> Average number of events in a given time interval
- Avg no of goals in a match(90 mins) eg: say 2.5 goals
- Avg no of customer visiting a place
- Rate will change with interval
- Rules
- It should be countable
- Independence - Occurrence of one event doesn't impact other events
- Rate -> It is constant
- No simultaneous events -> 2 goals can't happen at same time
- Question
- Bangalore has 3 accidents per day on average. What are the chances that Bangalore will have atmost 5 accidents tomorrow or 2 accidents tomorrow
- p(x<=5) -> cdf function
- poisson.cdf(k=5, mu=3)
- p(x=4) -> poisson.pmf(k=4, mu=3)
- Binomial PMF
- nCk * p^k * (1-p)^(n-k)
- Poissons PMF
- (lambda^k * e^(-lambda)) / k!
- import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import poisson, binom
poisson.cdf(k=5, mu=3) //atmost 5 accidents
poisson.pmf(k=4, mu=3) //exactly 4 accidents - Question
- On an average there are 3 typos per page. What is the probability that a random page has atmost one typo
- poisson.cdf(k=1, mu=3) //atmost
- Restaurant opens for 8 hrs. Avg number of customers for 8 hrs is 74. What is the probability that in next 2 hrs there will be at most 15 people.
- poisson.cdf(k=15, mu=74/8*2) //atmost 15 customers
- 1- poisson.cdf(k=6, mu=74/8*2) //atleast 7 customers
- You receive 240 messages per hour on average - assume Poisson distributed. What is the probability of the one message arriving over a 30-second time interval
- poisson.pmf(k=1, mu=240/3600*30) //exactly 1
- No message in 15 seconds?
- poisson.pmf(k=0, mu=240/3600*15) //exactly 0
- There are 80 students in a kindergarten class. What is the probability that exactly 3 of them will forget their lunch today? Each one of them has 0.015 probability of forgetting their lunch on any given day.
- poisson.pmf(k=3, mu=0.015*80) //exactly 3
- using binomial
- p=0.015, n=80
binom.pmf(k=3, n=80, p=0.015) - BootStrapping
- Used when we want to estimate a central value but in the real world we only get 6-10 values. How do we check that the estimate makes sense?
- Select values at random (with replacement) from the sample and compute the statistic. Repeat it ~10K times (see the sketch below)
The graph of the resampled means will look like a normal distribution
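- A minimal bootstrap sketch (illustrative numbers of my own): resample with replacement many times and read a confidence interval off the percentiles
import numpy as np
rng = np.random.default_rng(0)
sample = np.array([52, 48, 55, 47, 50, 53])   # the handful of real-world values
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean() for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])   # 95% confidence interval for the mean
print(lo, hi)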
- Recap Poissons dist
- Criteria to use
- Constant rate
- Counting / countable in a limited quantity between 50 - 100
- when n is large & p is small (lambda = n*p), poisson and binomial give similar results
- Independent events
- No simultaneous events
- Questions
- Avg accidents in the city = 3/ day. Probability that there will be 5 accidents tomorrow
- poisson.pmf(k=5, mu=3) //exactly 5
- Whats app -> 240 messages/hr on avg. How many messages will you get on avg in 30secs
- 240/3600*30 = 2 msgs/30sec.
- Probability of 1 msg in next 30 secs
- poisson.pmf(k=1, mu=2) //exactly 1
- Probability of 0 msg in next 15 secs
- poisson.pmf(k=0, mu=1) //exactly 0
- 3 messages in 20 secs
- poisson.pmf(k=3, mu=240/3600*20)
- What is the avg time between 2 messages(Scale)
- 3600/240
- avg messages per second
- 240/3600
- Rate - avg occurrences in a timeframe(lambda/mean)
- Scale - avg time between 2 occurrences
- Probability of no messages in 10 secs
- poisson.pmf(k=0, mu=240/3600*10)
- Probability of waiting more than 10 seconds for the next message
- p(t>10) = 1-p(t<=10) = poisson.pmf(k=0, mu=240/3600*10) #same as the probability of zero messages in those 10 secs
- this is unknown distribution - CDF
- This unknown distribution is called as Exponential distribution
- Exponential Distribution
- p(x=0) = 1 - p(t<=10)
e^(-lambda) * lambda^0 / 0! = e^(-lambda)   (lambda here = rate for the 10-sec interval)
e^(-lambda) = 1 - p(t<=10)
expon.cdf: p(t<=10) = 1 - e^(-lambda) - Distribution of events in 10 secs -> Poissons distribution
- Distribution of time till next event -> Exponential distribution
- Probability of no events in next 10 seconds
- poisson.pmf(k=0, mu=240/3600*10)
- Time till next events follow a exponential distribution. Probability of time being greater than 10 seconds
- 1 - dist.cdf(10)
1 - expon.cdf(x=10, scale=15) - Question
- 7 days -> 490 trains in total
rate = 490/7 = 70 per day = 70/24 per hr - Avg time between 2 trains = 24/70 hr - Scale
- Calculate probability that it will take atleast 30 mins for next train to come
- p(t > 30 mins) = 1 - p(t <= 0.5 hr)
= 1 - expon.cdf(x=0.5, scale=24/70) - Suppose you are managing a call centre, and you have observed that, on average, you receive 10 customer service calls per hour, following an exponential distribution. You want to calculate the probability of waiting less than 5 mins before the next call arrives
- Rate = 10 calls/hr
scale = 60/10 = 6 mins
expon.cdf(x=5, scale=6) - Software development -> avg time to debug an issue is 5 min. Find the probability that the problem is debugged in 4 to 5 mins
- p[4<=T<=5] = p[T<=5] - p[T<=4]
- expon.cdf(x=5, scale=5) - expon.cdf(x=4, scale=5)
- More than 6 mins to debug
- p[t>6] = 1- p[t<6]
- 1 - expon.cdf(x=6, scale=5)
- Given that you have already spent 3 mins on debugging without finding anything, what is the probability that it will take more than 9 min to debug from beginning
- p[T>9 | T>3] = p[(T>9) n p(T>3)] /p(T>3) = p[(T>9)] /p(T>3)
= (1-expon.cdf(x=9,scale=5)) / (1-expon.cdf(x=3,scale=5)) - OR
- 1-expon.cdf(x=6,scale=5)
- Memory less - history doesn't matter
- p[T>x | T>y] = p[T> x-y]
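- Quick numeric check of the memoryless property with scipy's exponential (scale=5 as in the debugging example)
from scipy.stats import expon
p_conditional = expon.sf(9, scale=5) / expon.sf(3, scale=5)   # P(T>9 | T>3), sf = 1 - cdf
p_fresh = expon.sf(6, scale=5)                                # P(T>6)
print(p_conditional, p_fresh)                                 # both ~0.301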
- Paired T-Test
- Does having a doubt clearing session improve people's scores?
- Before vs after analysis - Paired T-test
- from scipy.stats import ttest_rel
ttest_rel(df_ps["test_1"], df_ps["test_2"])
- Log normal dist
- take log(data) and create distribution for the log data
If log(data) is normally distributed, then the original data is said to follow a log-normal distribution (see the sketch below) - Why does log convert a non-normal to a normal distribution?
- log -> property -> non linear compression of axis
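- Sketch (synthetic log-normal data of my own) of the "log it and re-check" idea
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import shapiro
from statsmodels.graphics.gofplots import qqplot
rng = np.random.default_rng(0)
data = rng.lognormal(mean=3, sigma=0.5, size=200)
qqplot(np.log(data), line='s')   # roughly a diagonal line -> log(data) is normal
plt.show()
print(shapiro(np.log(data)))     # high p-value -> cannot reject "log(data) is Gaussian"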
- Parameters for Gaussian distribution
- Feature engineering
- Attributes of a table/csv are features, apart from the one that is to be predicted
- eg: in Aerofit
- Education, gender, income, fitness, usage are features, product is Target
- Using Features -> predict the Target
- Identify if a person is fit or not
- Conduct Survey of height, weight
- Submit data for expert advice
- Expert marks the records as fit/unfit -> Ground truth
- Machine learning is about identifying features & automating the process
- Height Weight -> fitness
- BMI = Weight/Height^2
- BMI will be new feature to identify fitness
- Feature engineering is creating new features from the given features, which helps in better prediction of the target (see the sketch below).
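- A tiny sketch of a hand-crafted feature (hypothetical Height in metres and Weight in kg)
import pandas as pd
df = pd.DataFrame({"Height": [1.60, 1.75, 1.82], "Weight": [70, 68, 95]})
df["BMI"] = df["Weight"] / df["Height"]**2   # new feature derived from the given ones
print(df)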
- Features can be created 2 ways
- Automated -> Machine Learning
- Manual -> Domain Knowledge
- data=pd.read_csv('loan.csv')
Gender, marriage, income.. -> features, loan_status is Target
data=data.drop('Loan_ID', axis=1) //loan id is not required
data.describe()
data.describe(include='object')
data.isna().sum() // drop off the rows with null values if count is very less
Identifying whether feature affects final target
### univariate analysis -> Checking one feature if it is effecting final target
sns.countplot(data=data, x='Loan_Status')
data.groupby("Loan_Status")['ApplicantIncome'].mean()
#use ttest to check if income and loan status are independent(Ho) or dependent(Ha)
a=data[data["Loan_Status"]=="Y"]["ApplicantIncome"]
b=data[data["Loan_Status"]=="N"]["ApplicantIncome"]
ttest_ind(a,b) #fail to reject null hypothesis as pvalue is high
#Bin vs Status -> two categorical variable so use ChiSquare test
bins=[0,3000,5000,8000,81000]
group = ['Low','Average','High','Very High']
data["TotalIncome_bin"] = pd.cut(data["TotalIncome"], bins, labels=group)
vals=pd.crosstab(data["TotalIncome_bin"], data["Loan_Status"])
chi2_contingency(vals)
#Pvalue is still high so can't reject null hypothesis
#loan amount & loan term against your salary may be right comparison
data['Loan_Amount_Term'].value_counts()
#Domain Knowledge
loanamount/loan term => emi per year
data['Loan_Amount_per_year'] = data['LoanAmount']/data['Loan_Amount_Term']
data['EMI'] = data['Loan_Amount_per_year'] / 12 #approximation
data['Able_topay_EMI'] = (data['TotalIncome']*0.3 > data['EMI']).astype('int')
sns.countplot(x='Able_topay_EMI', data = data, hue = 'Loan_Status')
vals=pd.crosstab(data['Able_topay_EMI'], data['Loan_Status'])
chi2_contingency(vals)
#if new feature is able to reject Ho then take it else drop it
- Recap
- Most data scientists spend 60-70% of time in identifying important features ie features that heavily impact target variables.
- Missing Values
- Numeric
- Mean, median, Mode
- Categorical variables
- Mode
- delete rows if count is less than 1-2%
- Drop column (feature elimination) if more than 50% are null
- Fill it with a new value(example credit history)
- Filling null values -> sklearn
- from sklearn.impute import SimpleImputer
t = SimpleImputer()
SimpleImputer? #to find different functions available in class
SimpleImputer(strategy="most_frequent").fit_transform(a) #fills na with the value the chosen strategy evaluates - num_missing=['EMI','Loan_Amount_per_year','LoanAmount','Loan_Amount_Term']
mean_imputer=SimpleImputer(strategy="mean")
for col in num_missing:
data[col]=pd.DataFrame(mean_imputer.fit_transform(pd.DataFrame(data[col]))) - Categorical vs Numerical
- One hot encoding
- Convert all possible values into columns eg: male/female has 2 diff columns
- Label encoding
- Just replace string/char with diff int values (1 for male, 0 for female)
- Works well when there are only 3 categories
- from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data[col]=label_encoder.fit_transform(data[col])
data[col].value_counts() - The problem with Label encoding is that it gives preferences to categories. The ones with high value will be given more preference. To handle the challenge we use Target encoding
- Target encoding
- Based on how much impact the target has to the given variable, we assign them a value
- from category_encoders import TargetEncoder
pd.crosstab(data["Self_Employed"], data["Loan_Status"], normalize='index')
col="Self_Employed"
te=TargetEncoder()
data[col]=te.fit_transform(data[col],data["Loan_Status"])
data[col].value_counts() - KS Test
- A T-test compares the means of 2 data sets and, based on the result, tells whether they come from the same distribution or not
- A T-test can be misleading when 2 different distributions have close means
- KS Test does CDF comparison
- from scipy.stats import kstest
from statsmodels.distributions.empirical_distribution import ECDF
- kstest(a,b) // if the p-value is small then they are 2 diff distributions
- Used when mean-comparison methods give misleading results (see the sketch below)
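- Sketch (synthetic data of my own): two samples with the same mean but a different spread, where a mean-based test is blind but the KS test is not
import numpy as np
from scipy.stats import kstest
rng = np.random.default_rng(0)
a = rng.normal(loc=0, scale=1, size=300)   # same mean ...
b = rng.normal(loc=0, scale=3, size=300)   # ... very different spread
print(kstest(a, b))   # small p-value -> the two distributions differ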
- Product Metrics & Design
- RCA
- RFM
- Customer Segmentation
- Guesstimates
- A/B Testing
- Misc case studies
- Data Scientist
- Helps in generating insights & reduce failures.
- Ensure less chance of failure and high chance of success
- Product Acumen / Business Acumen - How good an understanding one has of product development & how to make sure the product works
- Analyzing Metrics & Designing Metrics
- Youtube traffic went down 5% on last Sunday
- Judgement criteria for interviewers
- Structure - Demonstrate a systematic approach
- Comprehensiveness - Covers all important aspects
- Feasibility - Practical enough that it could be implemented realistically
- Framework
- Clarity - requirements
- Plan
- Conclude
- Product Diagnostics - Analysis of a metric
- User signup are down
- Monthly active users have reduced
- New Product/feature - Measure the success/performance of new product(1 hr delivery)
- How is health of Amazon product
- Product Design/recommending a new feature
- move some section from one place to another on the same page
- Should the class at scaler to be moved from 9 to 10
- Product Improvement
- This is more difficult
- How will you improve google maps
- measures in BPS -> basis points
- 0.01% is one BPS
- Product Diagnostics -
- CRIED - Clarify, rule out, Internal, External data
- CTR -> Click thru rate
- Case: Search Results for a facebook events - Clicks increased 15% WoW
- Metric: %user clicking on each event
Change: 15% increase - Clarify
- User clicking on event search means?
- WoW -> for how many weeks
- 15% increase in Users/Clicks?
- equation of metric
- Increase or decrease over what(avg/time)
- Ruleout
- Technical glitches
- outliers -> Diwali, Festivals
- Time
- sudden increase or gradual increase
- Region
- Geographically concentrated
- Other related features affected
- within your company to check if it is only feature or other features as well
- Platform
- Android/ios
- Desktop/Mobile
- mac/windows
- app/website
- Cannibalization
- Your increase or decrease should not come from some other page OR from your own product
- Move of traffic from one section to another section
- eg: checkout icon from Product page to Search page
- Segmentation
- User segmentation based on Age, Gender, new/existing, Casual/Power users, language
- TROPICS - for Internal data
- External Data
- Competitors data (public data)
- new competition
- Good/Bad PR
- Product Success - Defining Metrics
- FB introducing new feature "Save for later" -> Define metric and process to get the same
- Clarify
- What is the feature - are you saving images, videos,
- How long are you saving
- Are people going to be reminded
- Grouped/All together
- Who benefits from feature(user, Content creators, business users, internal teams)
- Business Goal - Think from different perspectives
- User goal
- Marketer goal
- business goal
- Define metric
- AAAERR ->
- Awareness - Are people aware of / using your product
- % of users who save at least 1 item
- % of users returning to see items
- Acquisition - How many users
- No of users coming due to this feature
- CAC - Average cost to acquire one customer - total cost/#users acquired
- Adoption/Activation
- % of total posts saved(saved posts/total posts)
- Engagement
- % of items reopened
- avg time spent (increase/decrease)
- Revenue
- ad spent
- clicks on ads
- Retention
- How often are people coming back
- Guardrail metrics
- Your product success should not make other products go down
- Summarize
- Strategy to be followed to solve business case:
- Basic Exploration
- Missing values
- Outliers
- Strategy to deal with missing values & outliers
- Univariate/Bivariate/Multivariate analysis
- Sample data -> Inferential statistics
- To confirm if it is true for population -> Hypothesis testing/CLT
- Recommendations/Insights
- Walmart
- 5.5 lac transactions
- 7K userIds
- import pandas as pd
import seaborn as sbn
df=pd.read_csv('walmart.csv')
df.groupby('Gender')['Purchase'].describe()
sbn.boxplot(x='Gender', y='Purchase', data=df) #No major diff between the median spend of male and female. Hence, we cannot say with clarity who is spending more
#For population check with CLT - df.sample(300).groupby('Gender')['Purchase'].describe()
import numpy as np
male_sample_means = [df[df['Gender']=='M'].sample(1000, replace=True)['Purchase'].mean() for i in range(1000)]
female_sample_means = [df[df['Gender']=='F'].sample(1000, replace=True)['Purchase'].mean() for i in range(1000)]
np.mean(male_sample_means)
#upperlimit_males = mean + Z-score * standard error
upperlimit_males = np.mean(male_sample_means) + 1.96 * np.std(male_sample_means)
lowerlimit_males = np.mean(male_sample_means) - 1.96 * np.std(male_sample_means)
upperlimit_females = np.mean(female_sample_means) + 1.96 * np.std(female_sample_means)
lowerlimit_females = np.mean(female_sample_means) - 1.96 * np.std(female_sample_means)
#uncertain because the CIs are overlapping
#what can be done to eliminate the overlap
Increase the sample size
reduce the confidence level - Yulu is India's leading micro-mobility service provider, which offers unique vehicles for the daily commute. Starting off as a mission to eliminate traffic congestion in India, Yulu provides the safest commute solution through a user-friendly mobile app to enable shared, solo and sustainable commuting.
Yulu zones are located at all the appropriate locations (including metro stations, bus stands, office spaces, residential areas, corporate offices, etc) to make those first and last miles smooth, affordable, and convenient.
Yulu has recently suffered considerable dips in their revenues. They have contracted a consulting company to understand the factors on which the demand for these shared electric cycles depends. Specifically, they want to understand the factors affecting the demand for these shared electric cycles in the American market.
How can you help here?
The company wants to know:
Which variables are significant in predicting the demand for shared electric cycles in the Indian market?
How well those variables describe the electric cycle demands.
- Recap
- Product Diagnostics: Analyze the change in the metric
- CRIED framework -> Clarify, rule out, Internal data, external data
- Internal data -> TROPICS(Time, Region, Other related features, Platform, Cannibalization, Segmentation)
- Define metric for new product launch
- Clarify (What, why, how, who)
- Business goals (Customer/User persona with their goals)
- Define metric -> AAAERRG (Awareness, Acquisition, Adoption/Activation, Engagement, Revenue, Retention, Guardrail)
- Summarize
- Pyramid
- North Star Metric (top of pyramid, MOST Important)
- Instagram: Monthly active users
- Gaana.com: Avg time spent per user per week
- India Bank: Avg transaction value per month
- Whatsapp: Daily active users
- L1 metric (AAAERRG) (It can be assigned to various stakeholders to lead and track) - supporting metrics
- Important to track the health of the product / feature and is owned by various stakeholders
- Health of the product
- eg: Gaana.com
- Engagement - avg spent per user per week
- L2 metrics (more granular metric)
- Supporting metrics
- eg: gaana.com
- avg time spent per gender per week
- avg time spent per platform per week
- avg time spent per region
- Business: You tube going to launch shorts feature. As a data scientist, help youtube to define metrics.
- clarify
- why this feature
- Is this available over app or website
- Timeline of shorts
- Business goals
- Target customers
- Normal users - increase engagement
- content creators - more options to create content
- North Star Metric
- Avg time spent on youtube shorts by active users
- L1 Metric
- Awareness
- %age of active users using Youtube shorts
- L2 metric
- %age of male active users using youtube shorts
- Acquisition
- Avg AD spent per user(CAC)
- Avg revenue share with content creators
- Adoption
- Once the feature is launched and we have acquired a customer, has that customer re-used the feature in short window
- Avg no of user who used Youtube shorts in 24 hrs window after watching it for once
- Engagement
- The users are enjoying the feature or the product. If there is increase in usage in longer window(a week)
- Retention
- Capture if users are still using it or have stopped using it once the company has closed the marketing cycle.
- Revenue
- Avg no of active users per month(shorts)
- Guardrail metric
- #users uninstalls after feature install
- Flowchart of metrics for youtube shorts
- North star : Avg time spent on youtube by active users
- L1 - Reach/Awareness -> Avg no of active users using Youtube shorts
- L2 -> %age of male active users using Youtube shorts
-> %age of active users with between 18 to 40(genZ) - Engagement
- Avg no of users who used youtube shorts in a week
- #likes/shares/save in youtube shorts
- Guard rail(Business specific)
- Interview questions
- Metric definition based
- Metric change based
- Process
- Describe the features based on your understanding
- Determine goals -> major business goal
- Customer Journey -> add/define metric that are most relevant to customer journey
- map and quality ->
- Evaluate your metric -> get into conversation & correction of his/her feedback
- Recap
- Diagnose -> CRIED framework
- TROPICS for Internal data
- Defining metric
- Clarify
- Business Goals
- Define metrics (AAAERR)
- Summarize
- Product Metrics & KPI
- North Star (Every team in company aligns with)
- Monthly active users is North Star & 20K MAU is KPI
- L1(Each individual team focus on that metric, team/pillar level metric)
- L2(just for figuring out how things work/working)
- Fitness Industry
- Purefit has an app
- free - some videos & general videos
- paid - customized plan, expert advice session, all video access
- Retention is big challenge as it is not compulsory for life
- North Star metric: Consistent 2-month users
- Clarity - what, why , who, how - Target audience
- Buz goals
- Customer goals,
- Trainer goals- Revenue, Publicity/PR
- Define metric
- North Star metric: consistent for 2 months. 3 times a week & 3 weeks a month
- L1 metric: AAAERR
- Awareness - #users per month, #app downloads
- Acquisition - CAC, #customers from referral
- Adoption - #users who spent > 15min after installing app
- Engagement - Avg time spent, #logins, watch time, #video seen
- Adoption and Engagement is about using the features
Retention is driven from how good your product is - Retention: #users coming back on Nth day
- Engagement => No of classes attended/duration of class attended
- Revenue =>
- Guard rails
- User ratings
- feedbacks
- app uninstalls
- match rate between users and context
- avg time spent on videos
- avg time spent on recommended videos
- L2 metrics
- Finance/Fintech Industry
- Indian Bank -> Launching a mobile app. Identify when one can call the app as Success
- Goal: Design metric to check the success of app
- Clarify
- for whom - Existing customers + new age customers
- Why - competition
- what features - Banking, UPI, cards, Investments
- Security: OTP, Fingerprint
- Goals
- Existing Bank - Convenience, transactions
- New Customers - Ease of banking
- Metrics (only when you have metrics, you will understand what data to be collected. Only when you have data, then one will know success/failure)
- NS : Avg transaction through app/month
- Awareness: #downloads, % of app users
- Acquisition-> CAC
- Adoption -> Existing users converted to app users, new user acquired via app
- Engagement
- % customers using the app/banking services
- % users with >5 txns
- no of txns/user
- Bounce rate
- left after the 1st page
- left after install
- Guard rail metric
- % of failed txns
- %app crashes
- TAT - turn around time
- Task Completion Rate
- Cheque balance
- Transfer money
- Invest
- Credit card bill
- Abandon rate
- # of users abandoned after initiating
- Root Cause: Identifying why something is broken
- Root cause Analysis: Systematic process to identify root cause
- Goals of RCA
- Find the root cause
- Understand problem - design soln - fix the root cause
- Apply learnings for future -> robust testing
- CRIED framework
- Clarify - Define the problem properly
- Ruleout
- Technical glitches
- Planned events
- Seasonality
- Data problems - missing data, duplicate logs,
- Internal factors - new launch, app updates, UI changes
- external factors - Govt policy change, competitor,
- ECommerce
- Myntra - Problem: Decline from 5% to 3% in ordering rate. Ordering rate is total orders/total visitors on the website
- Data in ECommerce
- Click stream data - Online activity data - how many pages visited, buttons/links clicked, search term you type, Which item in search result is clicked, Scroll, hover your cursor on image, how much time spent on page, back button, right click
- Cookie - website will store something in your browser
- Benefits of Click Stream
- User Information helps in Personalization & identifying problems
- Defining User Journey/Routes
- Customer trends/Insights
- UX -> Do people like design of website/app
- Digital Marketing -> ads
- Clarity
- Ordering rate = no of orders / total sessions of the day
- What is a session / Visitors
- If there are 12 hrs gap then companies treat them as diff sessions
- Session Cookie - expiry time
- Bounce Session
- You open a page/website/app and leave with out doing any thing
- Session has 1 page view
- Bounce rate = Bounce session / Total Sessions
- Drop in Search to Product pages
- Not finding right product
- out of stock
- Product to Check out
- Extra charges
- price
- no offers
- External factors
- Competition - cheaper price on other sites
- fast delivery on other sites
- bad reviews
- large suppliers moving out
- Govt policies
- Check out to Payment page
- issue with Bank page
- OTP issue
- Payment gateway issue
- Transport/online Travel/Uber - why cancellations have increased
- Clarify
- What time
- what area do you see cancellations
- specific devices
- Driver asking for money
- New competition
- Which type of car is having more cancellations
- df=pd.read_csv('uber-data.csv', parse_dates=[4,5], dayfirst=True, na_values="NA")
- Solutions
- Extra incentive for driver to take airport cab in evening
- for airport, increase the distance of free car searching/checking
- Based on data, proactively put certain cars on the hot spots
- Cancellation charges to be borne by the driver
- cancellation based rating score
- More incentive in early morning hrs
- Internal data
- Looked at available data
- Created new features
- TROPICS/Framework for slicing
- Observations -> Root causes -> Solutions
- Direct causes
- Immediate factors impacting the problem
- Addressing direct causes solves the problem immediately
- look at visible effects/symptoms
- Root causes
- underlying factors
- Addressing root causes ensures that problem do not repeat
- look at reason behind a problem
- Competitor Analysis
- External Data
- Market presence
- Delivery logistics
- Product Range
- User Experience
- Support
- Marketing and Ads
- Return Policy
- Mobile Apps
- Offers and Discounts
- Payment Options
- Pricing
- CRM: Customer relationship model. A system that can manage any interaction between customer and company/business
- CRM Features & Functionalities
- Contact management
- Lead Management(Potential customers)
- Opportunity Management
- Sales Forecasting
- Mobile CRM
- Reports & Dashboards
- Sales Analytics
- Marketing Automation
- Sales Data
- Sales Force Automation
- Campaign Management
- Amazon - marketing team is given 50L budget to maximize revenue/profit using this budget
- Customer Segmentation(RFM- Recency, Frequency, Monetary)
- Age, gender, income -> Demographic segmentation
- Personality, lifestyle, fashion sense, Interest - Psychographic
- Purchase Profile / behavior -
- Frequent buyer -> Frequency
- last of day of purchase -> recency
- avg amt spent -> Monetary
- Brands/category
- time spent on website
- High margin items -> Monetary
- # of orders -> Frequency
- RFM values are low for all 3 then organizations consider them as Lost Customers
- Create a heat map with Recency on the x-axis and Frequency & Monetary on the y-axis
- You would get categories from Lost, Price sensitive, Can't lose them, Loyal, Champions
- Strategy for Marketing
- Acquire new users
- RFM with values 5 1 [4 5] -> new customers who have done high value purchase
- Retention
- RFM with values 1 4 5 ->
- If you want to optimize revenue/profit -> always target groups with high M(4,5)
- 5% of Indians spend 95% of online purchase
- Always choose groups where only one thing need to be improved
- Should not we incentivize customers with 5 5 5
- Give Early Access
- Exclusive Offers
- Premium Loyalty Programs
- Personalized recommendation
- Moderate-High R F M
- Valuable customers but require moderate marketing
- Discounts (Limited)
- Loyalty programs
- Moderate R F M
- Product Bundles
- Limited time deal
- Re-engagement emails/messages with offers/promotions benefits
- Moderate-low R F M
- win-back campaigns -> targeted mails
- Abandoned cart message
- General Personalized/free product descriptions
- Low R F M
- Customer survey
- Once in a moon offers(but don't spend too much money)
- lottery offers(spin wheel)
- R F M model doesn't work for Organizations who sell laptops like DELL
- R F M model works for electronics shop
- Recency definition will change nature of business(B2B vs B2C)
- Data Processing
- Calculate Recency, Frequency, monetary value
- Bin/Group/Quantile for R, F, M & give values between 1-5
- Convert RFM subsets -> logical sub
- Monetary value -> Unit price * qty
- Frequency -> count/month
- Recency -> last order date
- --Item level details
select InvoiceNo, StockCode,InvoiceDate,CustomerId,Quantity*UnitPrice as item_amount
from crm.sales a
--Order Level
select InvoiceNo, InvoiceDate,CustomerId, sum(Quantity*UnitPrice) as order_amount
from crm.sales a group by InvoiceNo, InvoiceDate,CustomerId
---
with orders as ( select InvoiceNo, InvoiceDate,CustomerId, sum(Quantity*UnitPrice) as order_amount from crm.sales a group by InvoiceNo, InvoiceDate,CustomerId)
select customerId, a.monetary, date_diff(b.last_date_overall, a.last_order, DAY) as recency
from (select customerId, sum(order_amount) as monetary, max(invoicedate) as last_order,
min(Invoicedate) as first_order from orders group by CustomerId) a,
(select max(invoiceDate) as last_date_overall from orders) b - sum(orders.order_amount) as monetary, max(InvoiceDate) as last_order,
from orders a,
(select max(InvoiceDate) as last_date from orders) b
group by CustomerId
--item level table
SELECT InvoiceNo,StockCode,InvoiceDate,CustomerID,
Quantity*UnitPrice as item_amount
from crm.sales a ;
-- order level table
SELECT InvoiceNo,InvoiceDate,CustomerID,
SUM(Quantity*UnitPrice) as order_amount
from crm.sales a
group by InvoiceNo,InvoiceDate,CustomerID;
-- order level table
with orders as ( SELECT InvoiceNo,InvoiceDate,CustomerID,
SUM(Quantity*UnitPrice) as order_amount
from crm.sales a
group by InvoiceNo,InvoiceDate,CustomerID),
customers as ( SELECT a.CustomerID,a.monetary,
date_diff(b.last_date_overall,a.last_order,DAY) as recency,
a.total_orders/(date_diff(DATE(a.last_order),DATE(a.first_order),month) + 1) as frequency
from
(select CustomerID,
sum(order_amount) as monetary,
count(distinct InvoiceNo) as total_orders,
max(InvoiceDate) as last_order,
min(InvoiceDate) as first_order
from orders group by CustomerID) a ,
(select max(InvoiceDate) as last_date_overall from orders) b)
select
*,
ntile(5) over (order by customers.monetary asc) as m_score,
ntile(5) over (order by customers.recency desc) as r_score,
ntile(5) over (order by customers.frequency asc) as f_score
from customers;
with orders as ( SELECT InvoiceNo, InvoiceDate, CustomerID,
SUM(Quantity*UnitPrice) as order_amount
from crm.sales a group by InvoiceNo,InvoiceDate,CustomerID),
customers as (SELECT a.CustomerID,a.monetary,
date_diff(b.last_date_overall,a.last_order,DAY) as recency,
a.total_orders/(date_diff(DATE(a.last_order),DATE(a.first_order),month) + 1) as frequency
from (select CustomerID,
sum(order_amount) as monetary,
count(distinct InvoiceNo) as total_orders,
max(InvoiceDate) as last_order,
min(InvoiceDate) as first_order
from orders group by CustomerID ) a ,
(select max(InvoiceDate) as last_date_overall from orders) b),
boundaries as (select
approx_quantiles(monetary,5) as m_boundary,
approx_quantiles(recency,5) as r_boundary,
approx_quantiles(frequency,5) as f_boundary
from customers),
rfm as (select a.*,case when a.monetary<= b.m_boundary[offset(1)] then 1
when a.monetary<= b.m_boundary[offset(2)] then 2
when a.monetary<= b.m_boundary[offset(3)] then 3
when a.monetary<= b.m_boundary[offset(4)] then 4
when a.monetary<= b.m_boundary[offset(5)] then 5
END as m_score,
case when a.recency<= b.r_boundary[offset(1)] then 5
when a.recency<= b.r_boundary[offset(2)] then 4
when a.recency<= b.r_boundary[offset(3)] then 3
when a.recency<= b.r_boundary[offset(4)] then 2
when a.recency<= b.r_boundary[offset(5)] then 1
END as r_score,
case when a.frequency<= b.f_boundary[offset(1)] then 1
when a.frequency<= b.f_boundary[offset(2)] then 2
when a.frequency<= b.f_boundary[offset(3)] then 3
when a.frequency<= b.f_boundary[offset(4)] then 4
when a.frequency<= b.f_boundary[offset(5)] then 5
END as f_score
from customers a,boundaries b),
rf as ( select *,ROUND((f_score+m_score)/2,0) as fm_Score from rfm )
select * ,
CASE
WHEN (r_score = 5 AND fm_score = 5) OR (r_score = 5 AND fm_score = 4) OR (r_score = 4
AND fm_score = 5) THEN 'Champions'
WHEN (r_score = 5 AND fm_score =3) OR (r_score = 4 AND fm_score = 4) OR (r_score = 3
AND fm_score = 5) OR (r_score = 3 AND fm_score = 4) THEN 'Loyal Customers'
WHEN (r_score = 5 AND fm_score = 2) OR (r_score = 4 AND fm_score = 2) OR (r_score = 3
AND fm_score = 3) OR (r_score = 4 AND fm_score = 3) THEN 'Potential Loyalists'
WHEN r_score = 5 AND fm_score = 1 THEN 'Recent Customers'
WHEN (r_score = 4 AND fm_score = 1) OR (r_score = 3 AND fm_score = 1) THEN
'Promising'
WHEN (r_score = 3 AND fm_score = 2) OR (r_score = 2 AND fm_score = 3) OR (r_score = 2
AND fm_score = 2) THEN 'Customers Needing Attention'
WHEN r_score = 2 AND fm_score = 1 THEN 'About to Sleep'
WHEN (r_score = 2 AND fm_score = 5) OR (r_score = 2 AND fm_score = 4) OR (r_score = 1
AND fm_score = 3) THEN 'At Risk'
WHEN (r_score = 1 AND fm_score = 5) OR (r_score = 1 AND fm_score = 4) THEN 'Cant Lose Them'
WHEN r_score = 1 AND fm_score = 2 THEN 'Hibernating'
WHEN r_score = 1 AND fm_score = 1 THEN 'Lost'
END AS rfm_segment
from rf
Quartile/Percentile
- Percentile -> %values less than or equal to given value
- Quartile/Percentile requires sorting which is expensive. Hence Approx Quantile is introduced
- Approx Quantile(G K Algorithm)
- Greenwald-Khanna Algorithm - It calculates approximate quantiles, which is about 10 times faster, but has a small error/delta wrt the exact ntile
- approx_percentile in Oracle sql
- boundaries as (select approx_quantiles(monetary,5) as m_boundary,
approx_quantiles(recency,5) as recency,
approx_quantiles(frequency,5) as frequency
from customers)
select a.*,b.* from customers a, boundaries b - ntile -> sorts the data, calculates the boundaries, assigns a score/group to each row (see the pandas sketch below)
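- The same R/F/M scoring can be sketched in pandas with quantile bins (illustrative numbers; assumes a customers table like the SQL above)
import pandas as pd
customers = pd.DataFrame({
    "recency":   [5, 40, 200, 10, 90],
    "frequency": [12, 3, 1, 8, 2],
    "monetary":  [900, 150, 40, 600, 120],
})
customers["r_score"] = pd.qcut(customers["recency"], 5, labels=[5, 4, 3, 2, 1])    # more recent -> higher score
customers["f_score"] = pd.qcut(customers["frequency"], 5, labels=[1, 2, 3, 4, 5])
customers["m_score"] = pd.qcut(customers["monetary"], 5, labels=[1, 2, 3, 4, 5])   # real data with ties may need duplicates='drop'
print(customers)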
- Industry standard segmentations
- Technographic Segmentation -> gadgets, online services and softwares
- Behavioral Segmentation ->
- Needs-based Segmentation -> budget friendly, back pain, broken leg, chronic decease
- Customer Status -> leads, new customer, loyal/long time, at risk, churned
- A/B Testing: Dividing sample into 2 groups randomly. This random division and testing is called A/B testing.
- Suppose a company made a new drug for fever better than paracetamol
- Take sample of 100 people(sample). Divide into 2 groups and give new drug and Paracetamol. Measure #daystorecover
- Case Study: Facebook is planning to launch new feature where one can choose background when posting a status
- Clarify
- What is the objective - "More engagement"
- Has this been tested before??
- Is there external proof if this works?
- Who will this feature be applicable for (everyone, subset)?
- Product Management - Thinks about feature
Data Science - Designs the experiment + Insights
Engg - Implement - Metrics
- North Star Metric - %age of engaged users (liked, commented, posted, reacted recently in a session) who also spent 2 mins on a post
- Supporting Metrics - Daily active users
- Guard rails: These metrics should not degrade
- Avg time spent per user per week
- % of users consuming rich media
- revenue/user
- Designing experiment
- Ho => Pa = Pb, Ha is Pa != Pb
Pa= #engaged users/#total users - Choice of test
- 2 numeric values -> 2 sample t test
- Choose experiment control/Test object
- Sample size calculator
- Metric for the central(base line metric)
- alpha = 0.05
- Minimum detectable effect -
- What change is considered meaningful for the business to take action
- Experiment Duration
- daily 5k customers
sample size- 80K - duration should be sample size/daily customers = 80K/5K = 16 days (see the sketch below)
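- One way to test Ho: Pa = Pb at the end of the experiment is a two-proportion z-test (statsmodels assumed; the counts here are illustrative)
from statsmodels.stats.proportion import proportions_ztest
engaged = [4100, 4280]        # engaged users in control (A) and test (B)
total   = [40000, 40000]      # users exposed in each group
z_stat, p_value = proportions_ztest(count=engaged, nobs=total)   # two-sided: Ha is Pa != Pb
print(z_stat, p_value)        # reject Ho if p_value < alpha (0.05)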
- Pitfalls/Problem with A/B testing
- Primacy effect
- People are reluctant to change
- Novelty effect
- Due to hype, initially a lot of people use it. Which can increase the test engagement. eg: CRED upi
- Due to above the results may seem to be undermined/exaggerated
- Solution
- Run for longer time
- Conduct test only for new users
- Network Effect
- Happens in social media as one can see other posts
- Ensure that such impact is minimized
- Outcome Bias
- What other factors might be causing it
- Note: 90% of A/B test fails
- Launch Recommendation
- Calculate the monetary/revenue impact
- 0.4% increase in user engagement then 0.1% increase in revenue
- Calculate overall revenue impact on the entire population
- Check the cost of launching to everyone
- Infra,(hardware, api)
- Long-term Impact(10-12 down the line)
- Ensure Guard rail metrics don't go down
- It is a ride-sharing platform. Electrical bikes are stored at hotspots; users unlock a bike using the app, move from one place to another and pay for the usage. The ask is to find out what drives rentals - Weather, Season, Holiday, weekend,
- import pandas as pd
df=pd.read_csv("bike_sharing.csv")
df['weather'].value_counts()
import seaborn as sbn
sbn.boxplot(x='workingday', y='count', data=df) #check the mean, outliers
#should you remove outliers-> No, as data will be biased
#Esp for Hypothesis, do not remove
Ho=The count of bikes on Working day <= the count on non-working day
Ha=The count of bikes on Working day > the count on non-working day
# t-test vs z-test -> Pop std dev not known, sample size big. Hence, both are same
working=df[df['workingday']==1]['count']
non_working=df[df['workingday']==0]['count']
df.groupby('workingday')['count'].describe()
from scipy.stats import ttest_ind
test_stats, p_val = ttest_ind(working, non_working, alternative='greater', equal_var=False)
p_val<0.05 => hence, working day /nonworking day has impact
#check the effect of Weather. One categorical and other numerical. Hence, use Anova
w1=df[df['weather']==1]['count'].sample(800)
w2=df[df['weather']==2]['count'].sample(800)
w3=df[df['weather']==3]['count'].sample(800)
#Anova
Ho=the count of bikes are independent of weather
Ha=the count of bikes is affected by weather
#assumptions of anova
#1 Normal -> QQPlot, DistPlot, Shapiro
#2 Should have equal variance -- No, describe, LEVENE
import seaborn as sbn
sbn.distplot(w1)
sbn.distplot(w2)
sbn.distplot(w3)
#all above are right skewed
import numpy as np
from scipy.stats import shapiro
t_test, p_value = shapiro(w1)
from scipy.stats import levene
t_test, p_value = levene(w1,w2,w3)
#Kruskal Wallis Test
#Link for code
from scipy.stats import f_oneway
t_test, p_value = f_oneway(w1,w2,w3)
p_value < 0.05 -> hence, weather is impacting
- Estimates with sensible assumptions/guesswork
- need not be perfect
- Thought process is correct or not
- why guesstimate questions
- ability to break down open-ended problems into smaller chunks
- Qn: Calculate how many flights depart per day from Delhi airport
- Break the problem into small problems. Take guesses on each of these parts. Combine the results of the parts
- Clarify
- Domestic vs International -> 80:20
- Passenger vs Cargo
- all flight carriers
- Breakdown
- Domestic vs International
- Passenger vs Cargo
- Peak hours/normal hours/non-operational hrs
- weekend vs weekday
- festive vs non-festive
- Make Assumptions
- Peak hrs (5-9 am, 7-10pm)
- normal hrs
- non-operational hrs(1am - 2am)
- Guess/Calculate
- use beautiful number (2 & 10s)
- for breaking down use %ages
- Domestic -> 10 per hr, & 5 international per hr
= 10*24 + 5*24 = 360 - Conclude
- Case Study
- Games24x7 wants to run a tournament where the 1st prize is 1 lakh and the 2nd is 50K. What should be the per-game entry fee for users?
- Clarify
- Mobile solo game
- Arcade(5 small games), Tournament (paid)
- one Tournament per day
- Ads - free users see ads
- Royalty cards/skin/customizations - users pay for it
- Expenditure
- Prize money
- Operations(Server hosting, maintenance cost)
- Promotions (YouTube/Twitch, influencer marketing)
- Total active users vs paying (fee) users
- % of total users using royalty
- Per user Royalty revenue
- Ads(5 free arcade games-> 30 sec)
- How many ads per game
- Server Cost(5 lac per month)
- Maintenance cost - 5l/month
- Promotions
- Revenue/month
- Fees = x * 5000 * 0.2 * 30
- Royalty - 200 * 5/100*5K
- ads: 50 * 1 * 4K * 30
- Expenditure
- prize money = 30 * 1.5L = 4.5 l
- server maintenance= 10L
- ads= 6L
- Revenue = 30x + 0.5+60 => x=1.5 rs
- Analyze
- breakdown
- Calculate
- validate
- How many IPhone users are there in India
- Clarify
- Market share - 2%
- apple subscriptions
- Total population - 1.5 B and mobile users 0.8B
- rich vs poor
- iphone vs android(20:80)
- urban vs rural
- age group
- income
- Total population
- 40% of the population are kids & old people, so the remaining ~60% => ~0.84 billion
- 70% of people in this age group lie in upper/middle class => 0.6B
- 10% of people prefer it => 0.06 Billion
- Guess how many refrigerators are sold in India every year?
- Vanity metric
- good-to-have metric, but it doesn't directly impact the overall usage
- BCG case study
- Sales data
- Marketing
- Click Stream
- Estimate the revenue earned by Google via their AdSense product
- AdSense -> an ad network: Google can show ads on websites/apps that Google does not host. The owner of a website/app can let Google place ads on their site
- count of companies in India -> 3.4 lakh digital companies in India
- #website holders-> 1.5 to 2 cr
- Google AdSense is the intermediary between publishers and companies (advertisers)
- Publishers
- News sites
- Podcasts
- Bloggers
- E-commerce -> smaller ones use AdSense
- AdSlots
- Why ads ? To sell product OR acquire a customer
- Basic Terms
- CPC (Cost Per Click)
- Cost charged to the brand for every click on the ad
- It is Google's responsibility to show the correct set of ads on the correct set of websites
- Google has Quality Score. It maintains score for both Publisher, Ads
- Based on above CPC is charged
- CPM(Cost per Mille)
- Cost for every 1000 views/impressions
- CPM is much lower than CPC
- CTR(Click thru rate)
- #clicks/#impressions
- Framework
- Ask Qn
- Where ads will be published
- all websites apart from google products
- all type of publishers
- How long
- one year
- Revenue means
- overall amount that brand will pay
- High level understanding of problem or metric
- #publishers * revenue/day/publisher * 365
- State assumptions
- Avg cpc, cpm, ctr
- 80% of revenue comes from top 20%
- Estimation tree
- top 20% publishers * revenue/day/publisher * 365
- Click ad revenue
- #visitors * #clicks/visitor/day * CTR * CPC
- Total ad views * CTR
- Total visitors * ad views/visitor * CTR * CPC
- #ads seen by visitor in a day
- Total pages * Ads per page * % of CPC ads
- Every day 1000 users visit, and every user views 5 pages. A user will see 10 CPC and 5 CPM ads.
- Revenue per user would be 10 * CTR * CPC  // click-ad revenue
- Impression ads revenue
- #visitors * impressions/visitor * CPM/1000
- impressions/visitor = #total pages * ads per page * % of CPM ads
- Bottom-up (calculating/plugging in the values)
- Sanity check/validate
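- A minimal bottom-up sketch of the estimation tree above; every number here (publisher count, visitors, CTR, CPC, CPM, ad-slot split) is an assumed placeholder, not a real figure:
publishers = 2_000_000           # assumed active AdSense publishers
visitors_per_publisher = 1_000   # assumed daily visitors per publisher
pages_per_visitor = 5
ads_per_page = 3
cpc_share, cpm_share = 0.4, 0.6  # assumed split of CPC vs CPM ad slots
ctr = 0.01                       # assumed 1% click-through rate
cpc = 0.2                        # assumed $ per click
cpm = 1.0                        # assumed $ per 1000 impressions
ad_views = visitors_per_publisher * pages_per_visitor * ads_per_page
click_rev = ad_views * cpc_share * ctr * cpc  # click-ad revenue per publisher per day
impr_rev = ad_views * cpm_share * cpm / 1000  # impression-ad revenue per publisher per day
annual_revenue = publishers * (click_rev + impr_rev) * 365
print(f"Estimated annual revenue: ${annual_revenue/1e9:.1f} B")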
- Similar qns on Guesstimate
- website traffic
- traffic on signal
- people visiting a restaurant
- Revenue of zomato
- Items sold
- Why do flights overbook
- No shows
- late
- cancel
- connecting flights
- Lufthansa -> 4.9 million passengers didn't show up in 2005. They re-sold 570K of those seats and earned $105M
- Passenger Bumping
- refund
- penalty
- arrange next immediate flight
- Optimum number of Overbookings(Maximize the profit for the company)
- Approach (a rough sketch follows below)
- Number of seats on the flight - 100
- Historical data of no-shows (no-show probability)
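- A minimal sketch of the expected-profit approach: assume 100 seats, a historical show-up probability, a fare and a bump cost (all assumed numbers), and pick the number of tickets that maximizes expected profit:
from scipy.stats import binom

seats, fare, bump_cost = 100, 5000, 12000  # assumed values
p_show = 0.92                              # assumed probability a ticketed passenger shows up

def expected_profit(tickets_sold):
    profit = 0.0
    for shows in range(tickets_sold + 1):
        prob = binom.pmf(shows, tickets_sold, p_show)
        bumped = max(0, shows - seats)     # passengers we have to compensate
        profit += prob * (tickets_sold * fare - bumped * bump_cost)
    return profit

best = max(range(seats, seats + 21), key=expected_profit)
print("Tickets to sell that maximize expected profit:", best)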
- Airbnb (online marketplace for bed and breakfast stays)
- Recommend what should be the optimum/recommended and minimum number of photos a host has to upload
- Data given has the following fields
- listingId, PostingDate, posting_time, location, Images, Bookings, Host_type
- Host_type has Regular and Superhost types. Regular hosts have 1-2 listings; Superhosts are the ones who have many properties listed
- Date, Open_listing_0_2, Open_listing_3_5, Open_listing_6_10, Open_listing_11_15, Open_listings
- listings bucketed by number of photos
- Property_image, Total_listing, Redundant_listing, non redundant listing, % of redundant listings
- redundant listing -> no reservation in the last one year; open listing -> no reservation for the given day
- Problem Statement: Recommend the Optimum images count and minimum images count
- Results
- Min images - 6 & optimum range - 11-15
- General trend
- highest monthly avg bookings for this range
- lowest number of open listings
- low redundancy
- for most buckets, 11-15 photos is the most-booked listing range
- Assumptions
- Images are the main reason for booking
- Image quality is not considered
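- A minimal pandas sketch of the kind of aggregation behind these results; the file name and the 'Images'/'Bookings' column names are assumptions based on the fields listed above:
import pandas as pd

df = pd.read_csv("airbnb_listings.csv")  # hypothetical file with the fields above
bins = [0, 2, 5, 10, 15, float("inf")]
labels = ["0-2", "3-5", "6-10", "11-15", "15+"]
df["image_bucket"] = pd.cut(df["Images"], bins=bins, labels=labels, include_lowest=True)
print(df.groupby("image_bucket")["Bookings"].mean())  # compare avg bookings per photo bucket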
- Interpreted language, like JS
- Line-by-line execution
- High-level languages - Python, JS
mid-level languages - C++, Java
Assembly code
machine code (byte code)
- Everything is an Object in Python
- since everything is an object, it will have associated properties & behavior
- Class of an object
- type()
- mutable vs immutable
- a=4
print(id(a))
a=5
print(id(a))  # id changes because ints are immutable; a is rebound to a new object
- Iteration Protocol
- The entire process of visiting each item once is called iteration
- Iterable -> collection of items
Iterator -> pointer which points to the items
iteration -> process of going over all items one by one
- s="hello"
itr=iter(s)
print(next(itr))
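- A small sketch of what a for loop does under the hood with the iteration protocol (keep calling next() until StopIteration):
s = "hello"
itr = iter(s)           # get an iterator from the iterable
while True:
    try:
        ch = next(itr)  # ask the iterator for the next item
    except StopIteration:
        break           # iterator exhausted -> loop ends
    print(ch)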
- Data Structures
- Comprehension
- Strings
- Memory in python
- CPU
- 9.5GHz -> Operations per sec
- Algorithms are analyzed with number of operations
- https://www.youtube.com/watch?v=HyznrdDSSGM&list=PLowKtXNTBypGqImE405J2565dvjafglHU
- Multiple Inheritance
- Functional Programming
- Paradigm of writing code
- Code in functional programming can be thought of as a sequence of multiple functions
- Why to use it
- reuse the code
- lambda function
- One line functions
- anonymous
- onetime use
- sq = lambda x : x**2
- Higher order function
- They return another function
- They take another function as input (see the sketch below)
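- A quick sketch of both directions: map takes a function as input, and multiplier (a made-up helper) returns a new function:
nums = [1, 2, 3, 4]
print(list(map(lambda x: x**2, nums)))  # map is a higher-order function: it takes a function as input

def multiplier(n):
    return lambda x: x * n  # returns a new one-line function

double = multiplier(2)
print(double(10))  # 20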
- Decorators
- decorate the functions
- adding more functionalities
- def foo():
    print("Hello everyone!")
- def pretty(func):
    def inner():
        print('-'*50)
        func()
        print('-'*50)
    return inner
- pretty(foo)()  # call the returned inner() to see the decorated output
- def best(func):
    def inner():
        print('we are the best')
        func()
        print('we are best')
    return inner
@best
def greeter():
    print("good evening")
greeter()
Output
we are the best
good evening
we are best
- Args & KWARGS
- def custom_sum(a,b,*args):
    print(f"a - {a}")
    print(f"b - {b}")
    print(f"args - {args}")
- custom_sum(5,6,7,8)
a - 5
b - 6
args - (7, 8)   # *args collects the extra positional arguments into a tuple
- x,y,z, *more = (2,3,4,5,6,6,7,9)
- Kwargs
- def create_person(name, age, gender):
    Person = {
        "name": name,
        "age": age,
        "gender": gender
    }
    return Person
- def create_person(name, age, gender, **kwargs):
    Person = {
        "name": name,
        "age": age,
        "gender": gender
    }
    Person.update(kwargs)  # merge any extra keyword arguments into the dict
    return Person
- kwargs -> keyworded arguments; **kwargs collects them into a dict
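- A quick usage sketch of the **kwargs version above; the extra fields (city, hobby) are just illustrative keyword arguments that get merged into the dict:
p = create_person("Asha", 30, "F", city="Pune", hobby="chess")
print(p)  # {'name': 'Asha', 'age': 30, 'gender': 'F', 'city': 'Pune', 'hobby': 'chess'}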
- one or more py files make a module (eg: math)
one or more modules make a package/library (eg: pandas)
- Modules
- import math
- Problems
- it imports the entire module
- one have to write math. before every function
- from math import *
- math. is not required before function call
- Problems
- name collisions: imported names can silently override existing ones
- from math import factorial, ceil, floor, pi
- pi
- ceil
- import math as m
- best method to import
- from math import factorial as fact, ceil as c, floor as f, pi as p
- import random
random.seed(100)
random.randint(0, 10)
- import requests
url = "http://...sample.jpeg"
res = requests.get(url)
with open("sample.jpeg","wb") as img: #wb -> write binary
    img.write(res.content)
- file = open("scaler.txt","w")
file.write("first line")
file.close()
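- A small sketch reading the same file back with a context manager, so the file is closed automatically:
with open("scaler.txt", "r") as f:  # "r" -> read mode
    for line in f:
        print(line.strip())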
- Conceptualize -> Visualize -> Math -> Code
- Fish Sorting Example
- OLS -> Ordinary Least Squares
- Properties is considered as Features(Independent variables) & outcome/what we predict is Target(Dependent Variable)
- Input -> Model -> Output
- Process of building an ML algorithm
- Data Collection
- Data Visualization -> Plot, PCA, TSNE -> reduces dimensions
- Choosing an appropriate Geometrical structure to separate classes
- Choosing a LOSS function which helps decide the best structure. (sum of distance of data from line)
- Training/optimization -> Gradient descent
- Coordinate Geometry
- Straight line -> y=mx+c
- where m is slope
- c is y intercept (when x is 0)
- General equation of line is
w1x+w2y+w0 = 0
w2y = -w1x -w0
y = -(w1/w2)x - (w0/w2)
slope = -w1/w2 & intercept = -w0/w2
- For parallel lines the slopes m1, m2 are equal
for perpendicular lines m1 * m2 = -1
- 2 dimensions - line => w1x1 + w2x2 + w0 = 0
3 dimensions - plane => w1x1 + w2x2 + w3x3 + w0 = 0
4 dimensions - hyper plane (higher dimensional plane)
- Vectors
- Ordered set of numbers
- represented by x bar [x1 x2]
- Magnitude -> distance from origin
- Magnitude of x bar is sqrt(x1^2 + x2^2)
- Norm of a vector -> Magnitude or length of vector
- L2 norm
- length of distance between two points(Euclidian distance)
- sqrt((x2-x1)^2 + (y2-y1)^2)
- L1 norm
- Manhattan distance
- |x2-x1| + |y2-y1|
- Dot Product
- For vectors (a1,b1) & (a2,b2) the dot product is
a1*a2 + b1*b2
- a.b = |a| * |b| * cos(theta) (theta is the angle between the two vectors)
- Angle between two vectors
- If the dot product of two vectors is 0 then they are perpendicular to each other (cos 90 = 0)
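- A minimal numpy sketch of the quantities above: L1/L2 norms, dot product, and the angle between two vectors:
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, -3.0])
print(np.linalg.norm(a))         # L2 norm = 5.0
print(np.linalg.norm(a, ord=1))  # L1 norm = 7.0
print(np.dot(a, b))              # 0.0 -> the vectors are perpendicular
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))  # 90 degrees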
- Matrix Multiplication
- Unit Vectors
- Vectors have both magnitude as well as direction
- Unit vectors are the ones with magnitude as 1
- represented by x hat
- norm of x hat is 1
- They are used to represent direction
- Vector Projection
- Norm of a vector
- It is the distance from the origin, i.e., the length/magnitude of the vector
- sqrt(x1^2 + x2^2...)
- Manhattan distance (L1 norm) = |x1| + |x2| + ... + |xn|
- Dot product between two vectors
- a transpose * b -> matrix multiplication
- which is always equal to norm of a * norm of b * cos theta
- if the angle between two vectors is acute then dot product is +ve
- If dot product of two vectors is zero, then they are perpendicular to each other
- when the angle is between 90 and 180 degrees, the dot product is negative
- Relation between Weight Vector and hyper plane
- The dot product of the weight vector and the x vector is 0 when the line passes through the origin
- Which means they are perpendicular/orthogonal to each other
- Recap
- Loss function
- Perceptron Algorithm
- Recap
- eta -> learning rate
- Perceptron learning algorithm
- Problem solving
- Mathematical representation of classification problem
- Gradient descent is an algorithm for optimization
- Calculus topics
- Maxima, minima
- calculus in multi variable
- calculus in singlevariable
- derivative, slope, tangent
- limits, continuity, differentiability
- functions
- Functions
- Domain: All the possible values that the input can take
- Range: Collection of all possible outputs
- Sigmoid function
- y=1/(1+e^-x)
Domain is (-infinity, +infinity)
Range is (0,1)
- sin function
- y=sin x
Domain is (-infinity, +infinity)
Range is [-1,1]
- cos function
- y=cos x
Domain is (-infinity, +infinity)
Range is [-1,1]
- tan function
- y=tan x
Domain is all reals except odd multiples of pi/2
Range is (-infinity, +infinity)
- Signum/Step function
- y=1 when x>0
y=-1 when x<0
y=0 when x=0
Domain is (-infinity, +infinity)
Range is {-1, 0, 1}
- Limits
- x* = argmin (x-2)^2
means find the value of x such that (x-2)^2 is minimum
value is 2 in this case
- What is the value of x+2 as x approaches 1
ans: 3
- limit of (x^2-1)/(x-1) as x tends to 1
ans is 2
- Continuity
- Not a continuous function
- Signum/Step function
- y=1 when x>0
y=-1 when x<0
y=0 when x=0
- y=x^2 for all x except 0; at x=0 it is defined as 2
- Condition for Continuity
- At every point x0 in its domain, RHL = LHL = f(x0)
- i.e., lim(x -> x0-) f(x) = lim(x -> x0+) f(x) = f(x0)
- Differentiation
- If there are 2 points on a straight line, (x1,y1) & (x2,y2), the slope of the straight line is
tan(theta) = (y2-y1)/(x2-x1)
- Differentiability
- Rules of Differentiation
- Derivatives for optimization
- Rules of differentiation
- use of derivatives
- for minima
- dy/dx is 0 and d^2y/dx^2 > 0
- for maxima
- dy/dx is 0 and d^2y/dx^2 < 0
- for saddle point
- dy/dx is 0 and d^2y/dx^2 = 0 (the test is inconclusive)
- maxima/minima/saddle point are called critical points
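- A quick sympy sketch of the first/second derivative test on f(x) = (x-2)^2 from the earlier argmin example:
import sympy as sp

x = sp.symbols('x')
f = (x - 2)**2
critical = sp.solve(sp.diff(f, x), x)  # dy/dx = 0 -> [2]
print(critical, sp.diff(f, x, 2))      # second derivative = 2 > 0 -> minimum at x = 2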
- Intro to multi-variable calculus
- Partial derivative
- Intro to gradients
- Gradient represents the direction with steepest increase.
- Gradient descent intuition
- Generalization of G.D
- Gradients of some common functions
- Constrained optimization
- Types of Gradient descent
- Batch/Vanilla
- We use the entire dataset to update the w vector at each iteration
- This is a very slow process and a lot of computation is required
- Mini Batch gradient descent
- Instead of taking all the data points, you take a subset of data points
- We choose K data points randomly where K<N
- Updates will be faster
- Stochastic Gradient Descent
- We only take one data point to update the weights
- Batchsize k=1
- epoch -> the iterations needed to cover the entire dataset once (a sketch of the three update styles follows below)
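- A minimal sketch of the three update styles on a toy squared-loss model y ~ w*x; the data, learning rate and epoch count are made-up:
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3 * x + rng.normal(scale=0.1, size=200)  # true w = 3

def gradient_descent(batch_size, lr=0.05, epochs=50):
    w, n = 0.0, len(x)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = np.mean(2 * (w * x[b] - y[b]) * x[b])  # gradient of squared loss w.r.t. w
            w -= lr * grad
    return w

print(gradient_descent(batch_size=len(x)))  # batch / vanilla GD: one update per epoch
print(gradient_descent(batch_size=32))      # mini-batch GD: K points per update
print(gradient_descent(batch_size=1))       # stochastic GD: one point per update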
- Challenges when more dimensions(50+) are present
- Visualizations, Computations, trainings will be large and difficult
- Curse of dimensionality
- Maths becomes complex
- Data becomes sparse and distances become less meaningful
- Steps
- Find the mean of the data and shift origin to the mean
- Rotate the axes such that the x-axis is in the direction where variation is maximum
- PCA
- Reduces dimensions
- If we plainly take subset of features then we will be losing lot of information
- In PCA, we are reducing features but trying to retain as much info as possible
- PCA works well when features are correlated
- Implementing PCA
- Standardization of data
- Find the direction with maximum variance
- Take a unit vector u
- project all the data points onto u
- maximize the sum of squared projections (the variance along u)
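- A minimal sklearn sketch of the steps above (standardize, fit PCA, check retained variance); the data here is random placeholder data just to show the API:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 10))  # placeholder data with 10 features
X_std = StandardScaler().fit_transform(X)            # standardization step
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)                 # project onto the top 3 directions of max variance
print(X_reduced.shape)                               # (500, 3)
print(pca.explained_variance_ratio_)                 # fraction of variance retained per component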
- Recommendation: movie suggestions, ad suggestions
- time series forecasting: stock price prediction, sales prediction, freight prediction, demand prediction
- Supervised vs Unsupervised learning
- Classification/Regression -> supervised learning -> Training data -> target value/ labels are provided. Figure out relationship between features and target value.
- semi supervised learning
- reinforcement learning - algorithm will create its own features and
- unsupervised learning
- no target data is given
- no relation between features and target value
- clustering
- Recommendations
- similarity between data points
- What do you think about the nature of Car Resale price prediction?
- Predicting a continuous value, hence Regression (a discrete value would be classification)
- Linear Regression Example
- Dataset: Cars24 used car dataset
- Features: year, km driven, mileage, rate, model
- Task: Predict selling price of used car
- Experience -> Training data
- In this lecture we are not developing math.
- Linear regression implementation using the Scikit-learn (sklearn) library
- Steps
- Raw data -> preprocessing
- outlier removal, missing values treatment
- Categorical -> Numerical
- EDA -> Exploratory data analysis
- Feature Engineering -> new features from raw data
- train-test split
- data normalization(scaling of data)
- Techniques to convert Categorical to Numerical data
- one hot encoding, label encoding, target encoding
- Feature normalization(Scaling)
- Bring all the features to the same scale
- Standardization -> z = (x - mu)/sigma
- min-max scaling -> (x - min)/(max - min)
- from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
x = scaler.fit_transform(df.iloc[:,1:])             # scale all feature columns to [0, 1]
x = pd.DataFrame(x, columns=df.iloc[:,1:].columns)  # back to a DataFrame with the original column names
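- A minimal end-to-end sketch of the steps listed above (split, scale, fit, evaluate), assuming df already has numeric features and a 'selling_price' target column (the column name is an assumption based on the task):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = df.drop(columns=["selling_price"])  # assumed target column name
y = df["selling_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler only on train data to avoid leakage
X_test = scaler.transform(X_test)

model = LinearRegression().fit(X_train, y_train)
print(r2_score(y_test, model.predict(X_test)))  # R^2 on the held-out test set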