Python
Data is the new oil -> every time you are on net, you generate data #datadrivendecisions
Raw data -> clean -> manipulate and make sense of data -> model (predictions)
Why to learn
Beginner - Data analyst
- Tableau & Excel
- SQL
- Python
Intermediate -> Data analysis and Visualization
- Python Libraries (NumPy, Pandas, Matplotlib)
- Probability & stats
- Product Analytics(case studies)
Advanced (Data scientist)
- Foundations of ML & DL (Adv python, Math for ML & DL)
- Machine Learning
- Deep Learning
- ML Ops
- ADV DSA
- Puzzle: 3 boxes labelled Apples, Oranges, and Mixed, all with incorrect labels. Pick any box, draw a single fruit from it, and correct the labels on every box.
- Clue: Every label on every box is incorrect
- Pick a fruit from the box labelled A+O
- Say you pick an orange. Correct the label on that box to O.
- The boxes left are labelled A & O, but their correct labels can only be A or A+O
- The box labelled A can't be A. So it is A+O
- So, the box labelled O is A
- Python
- idle editor
- cmd prompt
- IDE
- visual studio
- Jupyter
- colab -> https://colab.research.google.com/?utm_source=scs-index
- v2=input("some text") -> always gives strings
- ** -> exponentiation
- // -> floor division
- math.ceil(math.pi*A*A) #needs import math
- print("a","b","c",sep="*")
print("a",end=" ") - case=int(input());
t=1;
while t<case:
number = int(input());
int i=1;
while i<=10:
print(i * number, end=" ");
i++;
t++;
- a=[1,2,3,4,5,6,7]
for i in a: - for i in [1,2,3,4,5,6,7]:
- for i in range(1,8):
- range(1,8)
- range(8) -> 0 to 7
- range(start, end, jump)
- pass #code for future use. Does nothing
- none
- continue
- break
- counter=0
while True:
a=input("")
counter+=1
if a=='q':
break;
print(counter)
T = int(input())
for i in range(T):
A = int(input())
B = int(input())
lcm = max(A, B)
while True:
    if (lcm % A == 0) and (lcm % B == 0):
        break
    lcm += max(A, B)   #lcm += lcm only doubles the value and can skip the LCM
print(lcm)
- print("* "*6)
- chr(65) means A
Take an integer N as input, print the corresponding pattern for N.
For example if N = 5 then pattern will be like:
____* ___** __*** _**** *****
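- One possible solution sketch (assuming the underscores in the example stand for spaces):
N = int(input())
for i in range(1, N + 1):
    print(" " * (N - i) + "*" * i)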
- help(range)
- doc strings
- def myfunction():
"""
documentation of function
"""
- Lambda functions(Anonymous functions)
- One line function
- lambda input: output
- v = lambda x : x+10
- v(2)
- (lambda x:x+90)(100)
- lambda x : x+10 if x>8 else x-20
- Data Structures:
- The way to structure the data.
- Store is properly
- Process it
- Retrieve the same
- list.append
- list.insert(5,55)
- list.extend([4,5,6]) //append multiple values
- list+[3,4]
N=int(input())
N1 = input().split()
lst=[]
for i in N1:
    lst.append(int(i))
X = int(input())
Y = int(input())
lst.insert(X-1,Y)
for i in lst:
    print(i,end=" ")
#avoid naming a variable "list"; a return statement only belongs inside a function
- [start:end:step] //
- runs[0:3:1]
- runs[2,5]
- oddmatches[0:len(list):2] => oddmatches[:len(list):2] => oddmatches[::2]
- a=[1,2,3,4,5]
last=a[-1]
rest=a[:len(a)-1] #or a[:-1]
[last]+rest - last=a.pop() #removes and returns the last element
a.pop(index) - a.remove(val) #removes the first occurrence of the value
- a.index(val)
- a.count(val)
- reverse list
- a[::-1]
- a[-1::-1]
- a.reverse() #reverses the list in same variable
- reversed(a) #for strings, tuple
- List 2D
- [[1,2,3],[4,5,6],[7,8,9]]
- rows=3
cols=3
for i in range(rows): - for j in range(cols):
- print a[i][j]
- for i in random:
for j in i:
print j - outerlist=[]
for i in range(3):
innerlist=[]
for j in range(3):
a=int(input())
innerlist.append(a)
outerlist.append(innerlist)
outerlist
- 80% of data is in string format
- ' " """ are same
- """ are used in multi line comments
- ASCII -> American standard code for information interchange
- "1"*6
- F-Strings and String Formatting
- l=3, b=2, a=l*b
print("lenght = {}, breadth ={}, area={}".format(l,b,a))
print(f"lenght = {l}, breadth ={b}, area={a}") - ord("a") // to print ASCII value of a char
- chr(97)
- a="adsf"
for i in a:
print i - Reverse a string
- a=input()
a[::-1] - list(reversed(a))
- Palindrome or not
if a[::-1].lower() == a.lower() : - String of comma separated values, convert to string of individual values
- for i in range(len(strval)):
a=strval[i];
res=a #strval[i] is already a character; chr() is not needed here
str1=str1+res
return(str1) - a=input()
a.split(",") - for i in a.split("-"):
print(chr(int(i)) - a="adf"
j=list(reversed(a)) //gives list
"*".join(j) - join -> list to str
split -> str to list
- .find() -> find substring location in a string
- "it is a dancing doll".find("dancing") -> 8
- -1 if not found
- .index() -> location of any specific element in a list
- .replace("a","b")
- .count("a")
- .isdigit()
- .isalpha() #True only if all characters are alphabetic
- .isupper() & .islower()
- in operator #returns true if string is present inside other string
- "str" in "Thias asdf asdf"
- ReadOnly Lists
- t=(4,5,6,0)
- t=(4,)
- t[2]
- Packing and Unpacking
- when data type is not defined, it will be packed as tuple
- a=1,3,54 -> is a tuple
- unpacking
- a,b,c=(1,4,56)
- a= [(2,'asdf'),(3,'asdf'),(4,'asdf'),(5,'asdf')]
for i in a:
print i #prints tuple - for i,j in a
print i #prints id - list(tuple) #convert tuple to list
- Unique data
- similar to lists, tuples
- {1,2,3}
- No order & hence no indexing
- {} is dictionary
- b=set() #is empty set
- {1,2,3}
- .add(5)
- .remove(2)
- .pop() # remove any one element
- .update({4,5,6}) #append/update
- list("Venket") -> ["V","e","n","k","e","t"]
- set("Venket")-> {"V","e","n","k","t"}
- Symmetric Difference
- A^B => (A-B) + (B-A)
- SETS Operations
- voda={"Abc","Def","Ehj"}
air={"Abc","Def","Ehj"} - voda.intersection(air)
{"Abc","Def","Ehj"} - voda.union(air)
- voda.difference(air)
- voda.symmetric_difference(air)
- Sets can only store immutable objects
- s={(1,2,3,4)}
- Data structure is to organize data so that we can retrieve data and store data
- Word: Meaning OR {key:value}
- {a: "meaning of a"}
- a={"first":"first val","second":"second value","third":"third value"}
- Dictionaries
- Not ordered
- duplicate keys not allowed
- not indexable by position (access is by key)
- Values can be any data structure & duplicates can exist
- a["first"]="val2"
- a.update(b) #add to dictionaries
- res=a.get("z","not found")
print(res) - a.pop("key")
- Iterating
- for i in a:
print(i)
print(a[i]) - a.keys()
- a.values()
- a.items() #gives tuples
- for k,v in a.items():
print(k,v) - How to check if key exists
- "ads" in a.keys()
- Allowed datatypes as a key in dictionary
- any data type that is immutable (hashable)
- tuple
- strings
- boolean
- int
- Take a String as input. Create a dictionary using the following criteria
- There will be one key for each unique character
- key will be the character and value will be its count of occurrence
- Get all unique chars
- count that using count method of string
- str1="adfasdfasdf"
d={}
for i in set(str1):
d[i]=str1.count(i);
- It is like a Utility in Python
- List Comprehension
- [<output> for loop ]
- [i for i in range(1,100) ]
- [i for i in range(1,100) if i%2==0]
- ["Even" if i%2==0 else "Odd" for i in range(1,100)]
- students=["one","two","three"]
marks=[2,3,4]
{students[i]:marks[i] for i in range(len(students))} - Memory
- Stacks -> store variable and refer to address of actual value
- Heaps -> store actual values
- a=5
print(id(a)) - Garbage Collection
- Mutable vs Immutable
- shallow copy
- a=b #same address
- Deep Copy
- b=a[:] #copies the top-level elements; use copy.deepcopy for nested objects (see the sketch below)
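- A small sketch contrasting shallow vs deep copy (a nested list makes the difference visible; uses the standard copy module):
import copy
a = [[1, 2], [3, 4]]
b = a                   # same object, same address
c = a[:]                # new outer list, but the inner lists are still shared
d = copy.deepcopy(a)    # fully independent copy
a[0][0] = 99
print(b[0][0], c[0][0], d[0][0])   # 99 99 1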
- DAV -> Data Analysis and Visualization
- NumPy -> Numerical Python
- Pandas ->
- Matplotlib/seaborn for visualization
- NPS -> Net Promoter Score
- 1-6 is sad (Detractors)
- 7-8 is neutral (Passive)
- 9-10 happy (Promoter)
- % of Promoters - % of Detractors
- If NPS is > 70% -> Great performance
- Range of NPS-> -100 to 100
- Huge volume of data
- adv features
- pypi.org
- !pip install numpy
- import numpy as np
- Why Numpy
- NumPy is like a List but built using C code
- List data is not contiguous whereas NumPy data is contiguous. Hence, NumPy is faster.
- It takes less space as the memory is contiguous
- Lists can contain heterogeneous data; NumPy arrays are homogeneous
- It does mathematical operations in one line instead of multiple lines
- a=[1,2,3]
np.array(a) - [i**2 for i in range(1,100) ]
- a=[1,2,3]
arr=np.array(a)
arr**2 - %timeit [i**2 for i in range(1,100) ]
- %timeit arr**2
- arr.ndim
- a=array([[1,2,3], [1,2,3]])
- np.arange(10)
- np.arange(1, 10, 0.5)
- NumPy stores everything as a single data type
- np.array([1,2,3],dtype='float')
- a.dtype
- np.array(a)[3:5]
- m1=np.array([1,2,3,4,5,6,7,8,9])
new=m1>6 #returns true/false for each element in an array
m1[new] #returns values satisfied - m1[[2,5]] #multiple index in a list
- m1[[True,False,True,False,...]] #boolean mask (same length as the array); returns values where True
- m1 = np.arange(1,20)
filter = m1%2 ==0 #true and false
m1[filter] - score = np.loadtxt('survey.txt',dtype='int')
score[:5] #first 5 elements
score.shape #count of elements
len(score) #count of elements
score.min()
score.max()
detractors = score[score <= 6]
len(detractors)
detractors.shape[0] #count of elements
promoters = score[score >= 9] #promoters are scores of 9-10
len(promoters)
total=len(score)
perc_dect=(len(detractors)/total)*100
perc_promo=(len(promoters)/total)*100
perc_promo - perc_dect
- Case study on fitbit
- date, step_count, mood, calories_burned, hours_of_sleep, active
- data = np.loadtxt('survey.txt',dtype='str')
data.shape #rows & cols
data[:5] #first 5 rows - a=np.array(range(16))
a.shape #elements
type(a.shape) #tuple - Reshape
- a=np.array(range(16))
a.reshape(4,4) #rows, cols - a=np.array(range(10,91,10))
a.shape
a.reshape(2,-1) #arrange with 2 rows - a[1:3,2:4] #no errors even if index exceeds size
- a[: , 1] #gives 1D
- a[: , 1:2] #gives in 2 D
- a1=np.arange(10,91,10)
a1 #array([10, 20, 30, 40, 50, 60, 70, 80, 90])
a1[[2,3]] #gives array([30,40]) - fancy indexing, result is still 1D
- a=a1.reshape(-1,3)
- a[[0],[0]]
- a[[0,1,2],[0,1,2]] #diagonal elements
- Transpose
- data_t=data.T
- date,step_count,mood, calories_burned, hours_of_sleep,activity_status=data.T
- step_count.astype('int')
- calories_burned = np.array(calories_burned, dtype='int')
- np.unique(mood)
- Filter
- m1=np.arange(12)
- m1>6
- m1=np.arange(12).reshape(3,4)
- m1>6
- m1[m1>6]
- filter = mood=='Happy'
- step_count[filter] #steps on happy day
- Aggregate functions
- a=np.arange(1,5)
np.sum(a)
np.mean(a)
np.min(a)
np.count_nonzero(a) - a=np.arange(12).reshape(3,4)
np.sum(a) - axis 0 -> column
axis 1 -> rows - np.max(a, axis=0)
- a=np.array([1,2,3,4])
b=np.array([2,5,4,1])
a>b - np.any(a>b) #checks if there is any True
np.all(a>b) - arr = np.array([1,2,3,4,-4,8,5,-2])
np.where(arr<0,"wrong value","correct value") - step_cnt_happy_or_neutral = step_count[(mood=='Neutral') | (mood=='Happy')]
len(step_cnt_happy_or_neutral)
np.mean(step_cnt_happy_or_neutral.astype('int')) - step_count.astype('int')>4000
- mood[step_count.astype('int')>4000]
np.unique(mood[step_count.astype('int')>4000], return_counts=True)
- import numpy as np
data=np.loadtxt("../fit.txt", dtype='str')
date,step_count,mood, calories_burned, hours_of_sleep,activity_status=data.T - max steps from dataset
- step_count = np.array(step_count, dtype = 'int')
np.max(step_count) - Index of max count & get date for the max count record
- step_count.argmax() #gives the index of the record with max steps
- date[step_count.argmax()]
- np.ones
- np.ones((5,2)) #matrix with 1s
- np.ones(4, dtype='int')
- Multiplication
- a=np.array([1,2,3,4])
b=np.array([1,2,3,4])
a+b #performs operation element by element
a*b #performs operation element by element - Matrix Multiplication
- np.dot(a,b)
- np.matmul(a,b) #cannot work with a matrix and a scalar value.
- a@c
- np.dot(a,3)
- Argmax
- np.argmin([2,6,7,348,3,1,4]) -> returns index of min value
- np.argmin(b, axis=0) -> min in columns
- def vdesu(x):
    if x % 2 == 0:
        x += 2
    else:
        x += 3
    return x
a=np.arange(1,13)
v1=np.vectorize(vdesu)
v1(a) - import math
f1=np.vectorize(math.log)
f1(np.array([1,2])) - https://colab.research.google.com/drive/1leiKIvtZg-Lc7EMBceyCZ4NgnnbhVCrB?usp=sharing
- Array Broadcast
- a=np.arange(0,40,10)
np.tile(a, (3, 1)) #repeat the row 3 times -> shape (3, 4) - b=np.arange(0,4)
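- Completing the broadcast example above (a sketch: a (3,4) matrix plus a (4,) vector adds element-wise without a loop):
import numpy as np
a = np.tile(np.arange(0, 40, 10), (3, 1))   # shape (3, 4)
b = np.arange(0, 4)                         # shape (4,)
print(a + b)                                # b is broadcast across every row of a
print(a + 5)                                # a scalar broadcasts to every element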
- Image App
- import matplotlib.pyplot as plt
kaju=plt.imread("dog.jpeg")
plt.imshow(kaju)
kaju.shape #(rows, columns, channels)
kaju[1,1,1] #row, column, channel/color
plt.imshow(kaju[::-1,:,:]) #reverse picture
plt.imshow(kaju[:,::-1,:]) #mirror image - zoom the face
- plt.imshow(kaju[20:250,20:450,:]) # crop
- plt.imshow(kaju[::20,::20,:]) #blurred image by jumping pixels
- Contrast
- np.where(kaju>150,255,0)
- plt.imshow(np.array([[[0,255,0,]]])) //R G B for each pixel with 0 as darkest and 255 is lightest
- inverse color
plt.imshow(kaju[:,:,::-1])
- 1D array is called Vector
2D array is called Matrix
more than 2D are called as Tensors - B = np.arange(24).reshape(2,3,4) //2 matrices of 3*4 rows and cols
- import matplotlib.pyplot as plt
img=np.array(plt.imread('fruits.png'))
plt.imshow(img) - Shallow Copy
- Only copies header with change in shape.
- b=a
- c=a.view()
- Deep Copy
- c=a.copy() #doesn't work when the dtype is Object
- Math functions by default creates deep copy
- np.shares_memory(a,c)
- Deep copy for Object data type
- import copy
copy = copy.deepcopy(arr) - https://colab.research.google.com/drive/11PCLFO4MR_nKeM4QqFbIslq7sqJedmsX?usp=sharing
- Splitting
- x=np.arange(9)
np.split(x,3)
np.split(x,[4,6]) #split on index
np.split(x,[2,5,8]) #split on index - np.hsplit(x,2) #horizontal split - splits on column to get multiple arrays on 2D matrices
- np.vsplit(x,2) #vertical split - splits on rows
- VStack
- a=np.arange(10)
b=np.arange(11,21)
c=np.vstack([a,b])
d=np.vstack([a,c]) - a=np.arange(10)
b=np.arange(11,21)
c=np.hstack([a,b]) - z=np.array([[2,4]])
zz=np.concatenate([z,z], axis=0) #vstack
flat=np.concatenate([z,z], axis=None) - arr = np.arange(6)
a = np.expand_dims(arr, axis=0) #used to increase the dim. axis denotes where it should have 1 dim
a.shape #returns (1,6)
a = np.expand_dims(arr, axis=1)
a.shape #return (6,1) - arr=np.arange(6)
arr[np.newaxis,:] #new axis at first index
arr[:,np.newaxis] #new axis on column - arr=np.arange(9).reshape(1,1,9)
1*1*9
np.squeeze(arr) #removes all dimensions of size 1 and returns the original arr
arr=np.arange(10).reshape(2,1,5)
np.squeeze(arr) #returns 2*5 matrix
- Works on top of NumPy
- Can load any datatype
- Same as excel or csv
- can write sqls on it
- pip install pandas
import numpy as np
import pandas as pd
df=pd.read_csv("mckinsay.csv") - type(df) -> pandas.core.frame.DataFrame
- when csv has one column it is called Series & multiple columns is DataFrame
- df.info() #complete info about the DataFrame
- String is Object in Pandas
- df.head(5) #first 5 rows
- df.tail(10)
- df.shape #gives (rows, cols)
- df.columns #to get column names
- df.keys() #to get column names
- df['country']
- df[['country','name']]
- df['country'].unique()
- df['country'].nunique()
- df['country'].value_counts() #similar to groupby count
- df.rename({"country":"Country","population":"Population"}, axis=1) #columns
- df.drop('country',axis=1)
- df['new year'] = df['year']+2
- Create DataFrame from Scratch
- pd.DataFrame(ll, columns=['first','second'..]) #pass the lists of lists matrix
- pd.DataFrame({'country':['Afg','Afg','Afg'],
'year':[1900, 1950, 1980]
}) - https://drive.google.com/file/d/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_/view?usp=sharing
- Working with Rows
- df.index = [i for i in range(1, df.shape[0]+1)] #change the default/implicit index to an explicit one
- Loc/ILoc
- ILoc - index/implicit based location
- df.loc[2] #row with explicit index label 2
df.iloc[2] #row at implicit position 2 (the third row) - df.iloc[1:3,:]
- import numpy as np
import pandas as pd
df=pd.read_csv('mckinsey.csv') - Set column as index
- temp = df.set_index("country")
temp.iloc[0:5,1:3]
temp.reset_index(inplace=True) - Add New row
- new_row={'col':val,'col2':val2....} #using dictionary
df.append(new_row, ignore_index=True) - df.loc[len(df)] = ['value1','valu2'..] #add a row at the end
- Drop rows
- df.drop(1) #explict drop
- df.duplicated()
- df.drop_duplicates()
- df.drop_duplicates(keep='first/last/False')
- df.drop_duplicates(subset='<column name>')
- Mathematical
- df['life_exp'].mean()/.sum()
- Sort
- df.sort_values("life_exp",ascending=False)
- df.sort_values(['year','life_exp'],ascending=[False,True])
- Join
- users = pd.DataFrame({"userid":[1,2,3], "name":["a","b","c"]})
- msgs= pd.DataFrame({"userid":[1,1,2,3], "msg":["hmm","acha","ok","asdf"]})
- pd.concat([users,msgs],axis=1)
- users.merge(msgs,on='userid') #inner join
users.merge(msgs,on='userid',how='outer') #outer join - users.merge(msgs, left_on='id', right_on='userid')
- IMDB
- movies = pd.read_csv('movies.csv',index_col=0) #remove unnamed col
movies.shape - directors = pd.read_csv('directors.csv',index_col=0)
directors.shape - movies['director_id'].value_counts() #directors making movies
movies['director_id'].nunique() - movies['director_id'].isin(directors['id'])
np.all(movies['director_id'].isin(directors['id'])) - data=movies.merge(directors,how='left', left_on='director_id', right_on='id')
data.drop(['director_id','id_y'],axis=1,inplace=True) - data.info()
- data.describe() # gives count, mean, std, min, 25%, 50%, 75%. max for each int columns
- data.describe(include=object) #gives values applicable to Object type columns
- data['revenue'] = (data['revenue']/1000000).round(2)
- data[data['vote_average']>7]
- data.loc[data['vote_average']>7,['title','director_name']]
- def encode(text):
    if text=='Male':
        return 0
    else:
        return 1 - df.iloc[0]['gender']
encode(df.iloc[0]['gender'])
data['gender'].apply(encode)
df['gender_mapping'] = data['gender'].apply(encode) - How to find sum of revenue and budget per movie
- data[['revenue','budget']].apply(np.sum, axis=1)
- How can I find profit per move(revenue - budget)?
- def prof(x):
return x['revenue'] - x['budget'] - data['profit'] = data[['revenue','budget']].apply(prof,axis=1)
data - Group By
- data.loc[data['director_name']=='Raja Mouli','title'].count()
- data['director_name'].nunique()
data['director_name'].value_counts() - data.groupby('director_name').ngroups
data.groupby('director_name').groups
data.groupby('director_name').get_group('Venkat')
data.groupby('director_name').get_group('Venkat')['title']
data.groupby('director_name').get_group('Venkat')['title'].count() - How can we find multiple aggregations of any feature
- data.groupby('director_name')['year'].aggregate(['min','max'])
- Highest budget movie for every director
- data.groupby('director_name')['budget'].max()
- Filter out director names with max budget >100Million
- data_dir_budget = data.groupby('director_name')['budget'].max().reset_index() #to get a normal data frame
names = data_dir_budget.loc[data_dir_budget["budget"] >= 100000000, "director_name"] #100 Million
data['director_name'].isin(names)
data.loc[data['director_name'].isin(names)] - Lambda Function
- x = lambda a : a+10
x = lambda a,b : a+b
x(2,6) - which director is getting max vote_average
- data.groupby('director_name')['vote_average'].max().sort_values(ascending=False)
- data.groupby('director_name').filter(lambda x:x['vote_average'].max()>=8.3)
- Filter Risky Movies
- def func(x):
x['risky'] = x['budget'] - x['revenue'].mean() >=0
return x
data_risky = data.groupby('director_name').apply(func)
data_risky.loc[data_risky['risky']] - lambda a,b: <True> if a>b else <False>
lambda a,b: "Dancing" if a>b else "Cooking" - Filter only the ages that are greater than 18
- ages = [13, 12,17, 19, 56,7]
ages[lambda a: True if a>18 else False] #does not work on a plain list - motivation for filter()
filter(lambda a: a>18,ages) #filter helps pass list to lambda
list(filter(lambda a: a>18,ages)) - Square of every number in list
- a = [13, 12,17, 19, 56,7]
list(map(lambda a: a**2,a))
- data.groupby('director_name')['title'].count().sort_values(ascending=False)
- data_agg = data.groupby('director_name')[['year','title']].aggregate({"year":['min','max'],'title':'count'})
data_agg.columns
[i for i in data_agg.columns]
["_".join(i) for i in data_agg.columns]
data_agg.columns = ["_".join(i) for i in data_agg.columns] - data_agg.reset_index()
- data_agg['years_active'] = data_agg['year_max'] - data_agg['year_min']
data_agg['movies_per_year'] = data_agg['title_count']/data_agg['years_active']
data_agg
- data = pd.read_csv('pfizer_1.csv')
- MELT #convert few columns to rows
- pd.melt(data,id_vars=['Data','Drug_Name','Parameter'])
- pd.melt(data,id_vars=['Data','Drug_Name','Parameter'], var_name='time', value_name='reading')
- PIVOT #to reshape the data
- import numpy as np
df = pd.read_csv('weather.csv')
df.pivot(index = 'city', columns='date') - df.pivot(index = 'city', columns='date', values='humidity') #to get humidity alone
- df.pivot(index = 'date', columns='city')
- Pivot Table
- df.pivot_table(index='city', columns='date', aggfunc='mean')
- pd.pivot_table(data_tidy, index='Drug_Name', columns='Data', values=['Temperature'], aggfunc=np.mean)
- Handling Missing Values
- type(None) => NoneType #used for non-number entries. It is object data type
- type(np.nan) => float #used for numbers
- pd.Series([1,2,np.nan, None]) => None will be converted to nan
- data.isna() # to check null values
- data.isnull() # to check null values
- data.isna().sum() #sum of null values in the data set
- data.dropna() #
- data['2.30'].fillna(0)
- data['2.30'].fillna(data['2.30'].mean())
- data_melt = pd.melt(data, id_vars = ['Date','Drug_name','Parameter'], var_name = 'time', value_name = 'reading')
data_tidy = data_melt.pivot(index=['Date','time','Drug_name'], columns = 'Parameter', values='reading')
data_tidy = data_tidy.reset_index() - def temp_mean(x):
x['Temperature_avg'] = x['Temperature'].mean()
return x - data_tidy.groupby(['Drug_Name']).apply(temp_mean)
- data_tidy.groupby(['Drug_Name'])['Temperature'].mean()
- data_tidy.isnull().sum()
- Display the rows where temp is missing
- data_tidy['Temperature'].isnull()
data_tidy[data_tidy['Temperature'].isnull()]
data_tidy['Temperature'].fillna(data_tidy['Temperature_avg']) - Pandas Cut
- tem_points = [5,20,35,50,60]
temp_labels = ['low','medium','high','very high']
data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'],bins=tem_points, labels=temp_labels) - String Function and motivation for datetime
- data_tidy['Drug_Name'].str.replace("hydrochloride","asdf")
data_tidy[data_tidy['Drug_Name'].str.contains("hydrochloride")] - pd.to_datetime(data_tidy['timestamp'])
- data_tidy['timestamp'][0].year
- data_tidy['timestamp'].dt.year
- data_tidy['timestamp'][0].strftime('%y')
- Exploratory -> Python is good
- Understanding data/what are characteristics of data.
- Explanatory -> Tableau is good
- Story telling for others
- Why data visualization in Python
- Quick Analysis
- unstructured data(tableau, excel, PowerBI requires structured data)
- Easy and wide manipulation options
- Science behind data visualization
- Anatomy of chart
- how to use the right plot
- Art in data visualization
- Color, scale, labels
- highlighting something
- Libraries
- matplotlib
- seaborn(wrapper on matplotlib to make simpler and beautiful)
- !pip install matplotlib
!pip install seaborn
import seaborn as sns
import matplotlib.pyplot as plt - Terminologies
- Columns are called as Features/Variables
- Rows are records/data points/samples
- Data Types
- Numerical
- Categorical
- Ordinal -> has order like low, medium, high which have inherent order
- Non-ordinal -> Male, female where both are same and no order
- Choose right plot
- How many variables/features are involved
- Variables type -> Numerical/Categorical
- Types of variables
- 1 variable -> univariate
- 2 variable -> bi variate
- 3 or more -> multi variate
- Univariate
- Numerical
- Categorical
- Bivariate
- Numerical - Numerical
- Categorical - Categorical
- Numerical - Categorical
- Multivariate
- Num - Num - Num
- C-C-C
- N-N-C
- N-C-C
- Anatomy of matplotlib
- Figure -> entire visualization
- suptitle -> Title of entire visualization
- Axes -> Charts1..n
- Title
- Major tick, Minor tick
- Axis
- plot
- xlabel, ylabel
- legend
- x_val=[0,1,2]
y_val=[3,5,9]
plt.plot(x_val,y_val) - data=pd.read_csv('final_vg.csv')
data.head() - Univariate
- Categorical
- Count -> bar #Ideal categories should be around 5
cat_counts = data['genre'].value_counts()
x_bar = cat_counts.index
y_bar = cat_counts
plt.figure(figsize=(12,8))
plt.bar(x_bar[:5], y_bar[:5], color="red",width=0.2)
plt.xticks(rotation=90)
plt.xlabel("genre")
plt.show() #ensures only final chart is shown - #seaborn
sns.countplot(x="Genre",data=data, order=data['Genre'].value_counts().index, color='blue')
plt.xticks(rotation=90)
plt.show() - %age -> pie #
- plt.pie(y_bar, labels=x_bar,startangle=90,explode=(0.2,0,0,0,0,0,0,0,0,0,0))
plt.show() - Numerical
- how is data distributed
- outliers
- is it skewed
- special numbers -> min, max, range..
- histogram -> divide data into bins and depict the frequency
- Histogram - Popularity of video games over the years. which year has max popularity
- plt.hist(data['year'])
plt.show() - count, bins, _ = plt.hist(data['year'])
count
bins - KDE === Kernel Density Estimate Plot
- sns.kdeplot(data['Year'])
- Box Plot
- outlier
- lower whisker #Q1 - 1.5*IQR (or the min score, whichever is higher)
- Q1/lower quartile #25th percentile
- Q2 #50th percentile
- Q3/upper quartile #75th percentile
- upper whisker #Q3 + 1.5*IQR (or the max score, whichever is lower)
- outlier
- Inter quartile range - Q3-Q1
- plt.figure(figsize=(12,5))
sns.boxplot(y=data["Global_Sales"])
- Revision
- Univariate
- Categorical - Bar, Pie
- Numerical - Hist, KDE, Box Plot
- Bivariate Analysis
- Numerical-Numerical(continuous - continuous)
- sales, year
- how does sales change over years
- how are features associated(correlation)
- Line Plot
- ih=data.loc[data['Name']=='Ice Hockey']
sns.lineplot(x='Year', y='Global_Sales', data=ih)
plt.grid() - rank, sales
- Line chart fails if you have too many points at same x-axis
- Scatter plot helps in understanding grouping & co-relation
- Categorical-Categorial
- publisher, platform
- Preferred platform for publisher
- distribution of publisher for top 3 platforms
- Distribution of one wrt other category
- Stacked bar # platform on x-axis & publisher on y-axis
- Dodged bar chart #platform on x-axis & publisher for each platform as bar
- top3_pub=data['Publisher'].value_counts().index[:3]
top3_gen=data['Genre'].value_counts().index[:3]
top3_plat=data['Platform'].value_counts().index[:3]
top3_data=data.loc[((data['Publisher'].isin(top3_pub) & data['Genre'].isin(top3_gen)) & data['Platform'].isin(top3_plat))] - Compare the top3 platforms these publishers use
- plt.figure(figsize=(12,8))
sns.countplot(x='Publisher', hue='Platform', data=top3_data, dodge=True) #dodged bars; a stacked-bar sketch follows below - stacked bar vs dodged
- If total is of more importance - Use Stacked Bar chart
- If comparison is of more importance - use dodge bar chart
- For 2 categorical variables - best representation is dodge chart
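- A stacked-bar sketch for the same two categoricals (uses the top3_data frame built above; crosstab counts plotted with pandas):
ct = pd.crosstab(top3_data['Publisher'], top3_data['Platform'])
ct.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.show()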
- Categorical - Numerical
- What qns can be asked
- What is avg/sum for every publisher
- Sales distribution for top3 publisher
- Multi box Plots
- sns.boxplot(x='Publisher', y='Global_Sales', data=top3_data)
- Bar chart
- sns.barplot(x='Publisher', y='Global_Sales', data=top3_data, estimator=np.mean)
- Revision
- subplots
- plt.figure(figsize=(12,8))
plt.subplot(2,3,1)
sns.barplot(x='Publisher', y='Global_Sales', data=top3_data, estimator=np.mean)
plt.subplot(2,3,3)
sns.barplot(x='Publisher', y='Global_Sales', data=top3_data, estimator=np.mean) - fig, ax = plt.subplots(2,2,figsize=(12,8))
ax[0,0].scatter(top3_data['NA_Sales'],top3_data['EU_Sales']) - multivariate
- C-C-C
- Not a practical case and hence not covering
- N-N-N
- N-N
- scatter -> size
- Line -> can't be used
- C-C-N
- C-C -> Stacked/Dodged bar -> add hue/color
- C-N
- Box ->multi-box with color
- multi-bar
- C-N-N
- eg: compare NA sale with EU sale for each Genre
- scatter plot with hue to show color for Genre
- line plot with Hue
- Advance charts
- Joint plot
- Scatter + Histogram -> sns.jointplot(x='NA_Sales', y='EU_Sales', data=top3_data)
Scatter + Density -> sns.jointplot(x='NA_Sales', y='EU_Sales', data=top3_data, hue='Genre') - pair plot
- K numerical variables.
- Every single pair of them to be compared
- N1 with N2
N2 with N3
N1 with N3 - sns.pairplot(data=top3_data)
- Used for correlation analysis
- heatmap
- top3_data.corr()
sns.heatmap(top3_data.corr())
sns.heatmap(top3_data.corr(), cmap="coolwarm")
sns.heatmap(top3_data.corr(), cmap="coolwarm",annot=True)
- Where do we use probability in real world
- online shopping -> landing page
- Text recommendation using the dictionary created for each user
- chatgpt -> LLM models is probability behind it
- online/ott platforms
- Probability
- favorable outcomes/total outcomes
- Sample Space
- set of all possible outcomes
- Event
- subset of sample space
- Any collection of outcomes
- p(event)
- size of event/size of sample space
- Set Operations
- Intersection
- Union
- Complement
- opposite
- 80% like cappuccino, 40% espresso, 30% both
How many like cappuccino and not espresso - 50%
- Collectively Exhaustive - if the given events cover all possible outcomes
- Gather data (http://www.kaddle.com/shivamb/netflix-shows) -> netflix_titles.csv
- Data Preprocessing
- Cleaning data
- missing values
- Transformation
- Scaling
- encoding
- Exploratory data Analysis(EDA)
- Explore
- Visualize
- Modeling
- Predict
- Evaluate & Monitor
df.describe(include='object').T
df.isnull().sum()/len(df)*100 #missing data by %age
df['type'].value_counts(normalize=True)
df['type'].value_counts().plot(kind='pie',autopct="%.2f")
missing values
- Numerical -> replace with median, mean
- Categorical -> replace with Mode(for country, Genre, showtype), unknown, other, na
duration col
- Separate TV show and movies
- perform analysis on both the data sets separately
unnest
- stack()
- explode()
date col
- year, month, week, day of week
duplicity
inconsistencies
- Conditional probability
- Calculating probability after a specific condition is met eg: probability of students attending class on Friday
- auto complete/recommendation system -> How are you [doing, things, liking]
- p[Xb='you' | Xa='how are'] #the event after the pipe has already occurred; the first one is the event whose probability is being calculated
- It is known that - 60% people use Swiggy, 50% use Zomato, 20% use both. Among those who use Swiggy, what fraction also use Zomato
- p[swiggy]=0.6
p[zomato]=0.5
p[swiggy & Zomato]=0.2
p[zomato | swiggy] = 20/60 #p(z&s)/p(s) - It is known that - 30% of emails are spam, and 70% are not spam. The word "Purchase" occurs in 80% of spam mails. It is also occurs in 10% of non-spam emails. Overall, in what percentage of emails would we see the work "purchase"?
- p(spam)=0.3
p(Not spam)=0.7
p(purchase|spam)=0.8
p(purchase|non-spam)=0.1 - Tree method
- 24 spam emails have "purchase"
7 non-spam emails have "purchase"
31 mails have "purchase" and total 100 => hence 31% - A = spam & purchase
B = not spam & Purchase
p(purchase|spam) = P(p & S)/p(S)
p(purchase|notspam) = P(p & NS)/p(NS) - It is known that 5% of all LinkedIn users are premium users. 10% of premium users are actively seeking new job opportunities. Only 2% of non-premium users are actively seeking new job opportunities. Overall, what percentage of people are actively seeking new job opportunities.
- p[js|prem] = p[js & prem]/p[prem]
P(Premium users)=0.05
P(Non Premium users)=0.95
p(seeking job|Premium users)=0.1
p(seeking job|non-Premium users)=0.02 - Tree method
- Premium users & seeking job -> 0.5%
- Non-Premium users & seeking job -> 1.9%
- p[JS]=p[js & prem] + p[js & non-prem]
- p[JS]=p[js|prem] * p[prem] + p[js|non prem] * p[non-prem] #total law of probability
- Q: An e-commerce website shows two types of ads: Type A and Type B. 60% of the visitors see Type A ads, and 40% visitors see Type B ads. The click-through rate for Type A ads is 5% and the click-through rate for Type B ads is 3%. What is the overall click through rate?
- Ans: (3+1.2) = 4.2
- p[click]=p[click&A] + p[click&B] = (3+1.2) = 4.2
- p[click]=p[A] * p[click|A] + p[B] * p[click|B]
=0.6 * 0.05 + 0.4 * 0.03 = 0.042 - Conditional Probability Formula
- p[A|B] = p[A & B]/p[B]
- Total Probability
- p[C] = p[C|A] * p[A] + p[C|A'] * p[A'] #Total Probability
- Multiplication Rule
- p[A & B] = p[A|B] * p[B] #called multiplication rule
- Q: In an NPS survey, it is seen that 70% are promoters, 20% are neutral, 10% are detractors.
90% of promoters, 40% of neutral, and 5% of detractors recommend the product to a friend. What is the overall percentage of people who recommend the product. - 71.5
- A disease affects 10% of the population. Among those who have the disease, 80% get "Positive" test results. Among those who don't have the disease, 5% get "Positive" test result. Overall, what percentage of people tested "Positive"?
- p[D] = 10% = 0.1
p[+ve] = p[+ve|D] * p[D] + p[+ve|ND] * p[ND]
p[+ve|D] = 80%
p[+ve|ND] = 5% = 0.05 - p[+ve] = p[D] * p[+ve|D] + p[ND] * p[+ve|ND] = 0.1 * 0.8 + 0.9 * 0.05 = 0.08 + 0.045 = 0.125
- 12.5%
- what is P(+ve & disease) = p[+ve|D] * p[D] = 0.8 * 0.1 = 0.08
- what is P(+ve & no disease) = p[+ve|ND] * p[ND] = 0.05 * 0.9 = 0.045
- Given a test is +ve, what are the chances that I am actually infected?
- p[D|+ve]?
- Total +ve= 125 of 1000
+ve & D = 80
p[D | +ve] = 80/125 = 0.64 - p[D | +ve] = p[D & +ve] / p[+ve]
p[+ve | D] = p[D & +ve] / p[D]
p[D | +ve] = p[+ve | D] * p[D] / p[+ve]
p[+ve] = p[D] * p[+ve|D] + p[ND] * p[+ve|ND] = 0.1 * 0.8 + 0.9 * 0.05 = 0.08 + 0.045 = 0.125
p[D | +ve] = p[+ve | D] * p[D] / (p[D] * p[+ve|D] + p[ND] * p[+ve|ND]) - For a new cohort in DSML, we have the following information:
30% of the people know SQL
80% of the people know SQL and also Excel
40% of the people who do not know SQL, also know Excel - p[sql] = 0.3
p[no sql]=0.7
p[excel | sql] = 0.8
p[excel | no sql] = 0.4
p[excel] = .3 * .8 + .7 * .4 = .52 - Among those who know Excel, what percentage know sql
p[sql|excel] = p[excel | sql] * p[sql] / (p[excel | sql] * p[sql] + p[excel | no sql] * p[no sql]) = 0.8 *0.3 / (0.8 *0.3 + 0.7 * 0.4) = .24 / .52 = 46.15
p[sql & excel] = 0.24
p[Nsql & excel] = 0.28 - In a city, 7% of people are on Twitter.
5% on Linkedin
4% on both - A random person is choosen, what is the probability that he is on twitter?
- 7%
- A random person on Linked is choosen, what is the probability that he is on twitter?
- p[T | L] = p[T & L] / p[L] = 0.04/0.05 =0.8
- If providing a information about event B changes the probability of event A then we call them dependent events
- if p[T|L] != p[T] then they are dependent
- A website has noticed the following stats. Among those who saw the ad
70% saw on youtube
50% saw on Amazon
35% saw on both - A random person is chosen, what is the probability that he saw it on Youtube
- p(y) = 0.7
- A random person who saw the ad on Amazon is chosen. What is the probability that he also saw the ad on Youtube
- p(y|A) = p(y & A)/p(A) = 0.35/0.5 = 0.7
- if p(y|A) = p(y) then they are independent events
- Independent Events
- p[y|A] = p[y]
p(y & A)/p(A) = p[y]
p(y & A) = p[y] * p(A) - Interview Questions
- A and B are two independent events, where it is known that $P(A u B) = 0.5$ and $P(A) = 0.3$. What is $P(B)
- $P(A u B) = p(A) + p(B) - p(A n B)
Given they are independent $P(A u B) = p(A) + p(B) - p(A) * p( B)
0.5 = 0.3 + p(B) - 0.3 * p(B)
0.2 = 0.7 p(B)
p(B) = 2/7 - Amit can solve a math problem with probability of 0.7, and Bharat can solve it with a probability of 0.5. Both of them attempt this problem independently.
- What is the probability that both of them will solve it.
- p(A n B) = p(A) * p(B) = 0.7 * 0.5 = 0.35 # they are independent events
- What is the probability that neither of them solve it
- p(A u B)' = 1 - P(A u B) = 1 -( p(A) + p(B) - p(A n B))
= 1 - (0.7 + 0.5 - 0.35) = 0.15 - Mutually exclusive events are dependent
- disjoint event are always dependent
- 50% of people who gave the first round of an interview were called back for 2nd round.
95% of the people who got involved for second round, felt that they had a good first round.
75% of the people who did not get invited for 2nd round also felt that they had a good first round. - Given that a person felt good about the first round, what is the probability that he cleared the first round
- p[cleared | felt good] = (.50*.95)/(.50*.95 + .50*.75) = 0.475/(0.475 + 0.375) = ~0.56
- A city has 2 taxi companies A & B. A has 60 % of taxies and B has 40% of taxies. A taxis are involved in 3% of accidents and B taxis are involved in 6% of accidents. If a taxi is involved in accident what is the probability that it is B taxi
- A=60%, B=40%
p(acc | A)=0.03 & p(acc | B)=0.06
p(B | acc)=?
p(B | acc) =p(acc | B) * p(B) / ( p(acc | A) * p(A) + p(acc | B) * p(B))
=(0.06 * 0.4)/((0.06 * 0.4) + (0.03 * 0.6)) = ~0.57 - It is known that 30% of emails are spam and 70% are not spam. The word "purchase" occurs in 80% of spam emails. It also occurs in 10% of non-spam emails. A new mail does not have the word "purchase" what is the probability that it is spam?
- p(spam | not purchase) = p(not purchase| spam ) * p(spam ) / ( p(not purchase| spam ) * p(spam ) + p(Not purchase| Not spam ) * p(Not spam )) =~0.086
- 5% of all LinkedIn users are premium users. 10% of premium users are seeking new jobs.
2% of non-premium users are seeking new jobs. A randomly chosen person is NOT seeking new jobs. What is the probability that he is a premium user? - p[seeking job | premium] = .1* 0.05 , p[not seeking job | premium] = .9* 0.05
p[seeking job | non premium] = .02* 0.95, p[not seeking job | non premium] = .98 * .95 - p[premium | not seeking]= p[not seeking job | premium] / ( p[not seeking job | premium] + p[not seeking job | non premium]) = .045/(0.045+.931)=0.046
- A website shows two types of ads:
60% of the visitors see Type A ads, and 40% visitors see Type B ads. The click-through rate for A is 5%, and for B is 3%. A visitor to the website does not click the ad. What is the probability that he saw Type A ad? - p[typeA|no click] = p[no click|typeA]*p[typeA]/(p[no click|typeA]*p[typeA] + p[no click|typeB]*p[typeB])
= .6*.95/(.6*.95 + .4*.97) = ~0.595 - Facebook has a content team that labels pieces of content on the platform as either spam or not spam. 90% of them are diligent raters and will mark 20% of the content as spam and 80% as non spam. The remaining 10% are not diligent raters and will mark 0% of the content as spam and 100% as non spam. Assume the pieces of content are labelled independently of one another for every rater. Given that the rater has labelled four pieces of content as good, what is the probability that the rater is diligent.
- p(4 good content|non diligent)=1
p(4 good content|diligent)=0.8^4 = ~0.41 - p[diligent|4 good content] = p[4 good content|diligent]*p[diligent]/(p[4 good content|diligent]*p[diligent] + p[4 good content|not diligent]*p[not diligent])
= (0.41 * 0.9) / (0.41 * 0.9 + 1 * 0.1) = 0.369/0.469 = ~0.79 - Suppose 5 percent of men and 0.25 percent of the women are color-blind.
A random color-blind person is chosen. What is the probability of this person being male? Assume there are equal number of men and women overall. - p(male | color blind) = p(color blind | men)*p(men)/(p(color blind | men)*p(men) + p(color blind | women)*p(women))
= (.5 * .05)/(.5*.05 + .5 * .0025) = ~95% - A gambler has in his pocket a fair coin and a two-headed coin. He selects one of the coins at random, and flips it. It lands heads. Compute probability that it is fair coin.
- p[fair|heads]=p[heads|fair] *p[fair]/(p[heads|fair] *p[fair] + p[heads|unfair] *p[unfair])
= (0.5 * 0.5) / (0.5 * 0.5 + 1 * 0.5) = 0.33 - A gambler has in his pocket a fair coin and a two-headed coin. He selects one of the coins at random, and flips it twice. It shows heads both the times. What is the probability that it is fair coin?
- p[fair|2heads]=p[2heads|fair] *p[fair]/(p[2heads|fair] *p[fair] + p[2heads|unfair] *p[unfair])
= 0.5 * 0.5 * 0.5 / (0.5 * 0.5 * 0.5 + 1 * 1*0.5)
=0.2 = 1/5 - Toss coin 3 times & you get HHT. What are the chances that it is a fair coin
- p[fair|2H 1T]=p[2H 1T|fair] *p[fair]/(p[2H 1T|fair] *p[fair] + p[2H 1T | unfair] *p[unfair])
=0.5^3 * 0.5 /(0.5^3 * 0.5 + 0) = 1 - A family has 2 children, at least one of them is a girl. What is probability that both are girls.
- p(B) = 0.5 p(G) = 0.5
sample space = {BG,GB,BB,GG}
p(atleast 1 girl) = 3/4
p(2 girls) = 1/4
p(both are girls | atleast one of them is girl) =
p(B|A) = p(A n B)/p(A)
p(2 girls | atleast 1 girls) =p(2 girls N atleast 1 girls)/ p(atleast 1 girls)
1/4 / 3/4 = 1/3
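- A small helper that codifies the Bayes-rule pattern used in the problems above (function name is just for illustration; checked against the disease-test numbers):
def bayes(prior, like_given_h, like_given_not_h):
    # P(H | E) = P(E|H)P(H) / [ P(E|H)P(H) + P(E|not H)P(not H) ]
    evidence = prior * like_given_h + (1 - prior) * like_given_not_h
    return prior * like_given_h / evidence
print(bayes(0.1, 0.8, 0.05))   # disease test: P(D | +ve) = 0.64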
- Describe data in detail
- Speed meter -> describes speed (tells you the speed)
Google maps -> tells you how long it takes to reach destination. It calculates based on avg speed, past data & traffic on road. It is derived/inferred info. - Describing data - this is avg, max, min, mean. as a fixed quantity. It is called descriptive Statistics.
- If we are using given data to infer some other information, it is called inferential statistics.
- Descriptive means Summarizing. Driving at X km/hr.
- Inferential means drawing conclusions from data. You will reach in 1hr.
- Hypothesis testing, Predictive analytics, Confidence Interval, Recommender systems.
- Glass Door -> Salary at FAANG
mean say 35L and max is 40L from the users on Glass door for same position and experience
What salary will you ask for -> expectation is avg or abv avg. - Median vs Mean
- Mean gets impacted with outlier. Hence go for Median
- Median is more robust to outliers.
- Mode
- Most frequently occurring number or data
- Weighted mean
- sum(Wi * Vali)/sum(Wi)
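- A quick numeric check with made-up marks (80, 90, 70) and weights (2, 3, 5):
weights = [2, 3, 5]
values = [80, 90, 70]
print(sum(w * v for w, v in zip(weights, values)) / sum(weights))   # 780/10 = 78.0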
- The mean weight of 2 children in a family is 40 Kgs. If the weight of the mother is included, the mean becomes 45. What is the weight of the mother?
- (A+B)/2 = 40
(A+B+M)/3 = 45
80 + M = 135
M=55 - Range
- Highest - lowest (max-min)
- Percentile
- %age of values less than or equal to given value
- 30, 30, 35, 40, 45 -> %tile of 40 is 4/5*100= 80%
- Median - 50th percentile
- Q1 - 25th
- Q3 - 75th
- Inter Quartile range
- Q3 - Q1(75th - 25th)
- upper whisker = Q3 + 1.5 IQR #stop at logical max
- lower whisker = Q1 - 1.5 IQR #stop at logical min
- Case Study: Sehwag vs Dravid - who is more consistent?
- https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/035/130/original/sehwag.csv?1684996594
- import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sehwag=pd.read_csv("sehwag.csv")
p_25 = np.percentile(sehwag['Runs'],25) #25 percentile
p_50 = np.percentile(sehwag['Runs'],50) #50th percentile
p_75 = np.percentile(sehwag['Runs'],75) #75th percentile
iqr_sehwag= p_75 - p_25
sns.boxplot(data=sehwag['Runs'],orient="h")
upper = p_75 + 1.5 * iqr_sehwag
lower = p_25 - 1.5 * iqr_sehwag #assume to be 0 if it is going below 0
#count of outliers
len(sehwag[sehwag['Runs'] > upper])
#repeat above with 'dravid.csv'
25% for Dravid is 10 and 8 for Sehwag
75% for Dravid is 54 and 46 for Sehwag
Dravid has 1% outliers and 6% for Sehwag - For consistency - fewer outliers should be there. Hence, Dravid is more consistent than Sehwag
- https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/035/126/original/weight-height.csv?1684995383
df_hw = pd.read_csv("weight-height.csv")
df_hw.head()
#Plot of Value against its percentile
#Plot of percentile for every value -> Cumulative Distribution Function
- CDF - Cumulative distribution function
- Plot with #number of people <than given height on Y axis & height on X axis
- Plot with % of people < than given height on Y axis & height on X axis #is CDF
- from statsmodels.distributions.empirical_distribution import ECDF #empirical means from data
- e=ECDF(df_hw['Height'])
plt.plot(e.x, e.y) - Standard Deviation
- variance = ((h1-mu)^2 + (h2-mu)^2 + (h3-mu)^2 + ... + (hn-mu)^2)/n
- standard deviation = sqrt(variance)
- When normal data is plotted as a histogram and the mean and standard deviation are calculated, the range mu-sigma to mu+sigma contains about 68% of the data (an empirical/experimental observation)
mu - 2*sigma to mu + 2*sigma corresponds to 95% of the data
mu - 3*sigma to mu + 3*sigma corresponds to 99.7% of the data
This is called the 68/95/99.7 rule
Any curve that follows the 68/95/99.7 rule is called a NORMAL/Gaussian curve - The height of people is Gaussian with mean 65 inches and standard deviation 2.5 inches. What fraction of people are shorter than 67.5?
- 50 + 34 = 84%
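- A scipy cross-check of the 68/95/99.7 rule and the answer above (values are approximate):
from scipy.stats import norm
print(norm.cdf(1) - norm.cdf(-1))    # ~0.68
print(norm.cdf(2) - norm.cdf(-2))    # ~0.95
print(norm.cdf(3) - norm.cdf(-3))    # ~0.997
print(norm.cdf((67.5 - 65) / 2.5))   # shorter than 67.5 -> ~0.84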
- The height of people is Gaussian with mean 65 inches and standard deviation 2.5 inches. What fraction of people whose height between 60 and 72.5?
- mu-2sigma TO mu is 95/2 = 47.5
mu to mu+3sigma is 99.7/2 = 49.85
Total = 97.35 - How many standard deviation away is 69.1 from 65?
- (x-mu)/sigma = 4.1/2.5 = 1.64
- mu + z* sigma = X #X is Standard deviation away
z = (X - mu)/sigma - Z-score table
- For a given value of Z how many % values are less than that in the data
- How many std deviation away from the mean the value is
- For a given value of Z what % of data is less than the corresponding item
- for any value X, calculate Z -> X -mu / sigma
using z score table -> value from the table is P
P% values are less than X in the original data - How many people have height less than 69.1
- Z = (69.1 - 65)/2.5 = 1.64
69.1 is 1.64 sigma away from mean(65)
mu + 1.64*sigma is 69.1
94.95% of the data <= 69.1 - Z score table -> CDF for a given z value
- Libraries scipy and statsmodel for probability stats
- from scipy.stats import norm
norm.cdf(-1) #fraction of values below mu - sigma (one sigma below the mean) - How many people are having height less than 69.1
- z = (69.1 - 65)/2.5
norm.cdf(z) - Cricket ball manufacturer. Mean of the ball size is 50mm. Standard deviation is 2mm
What is the corresponding value to z-score of 1.5. - 1.5 = (x - 50)/2 = 53
- What fraction of bass have diameter smaller than 53mm?
- z=1.5 sigma away
norm.cdf(1.5) = 93.3% less than 53mm - How many balls have diameter >= 53mm?
- 1 - norm.cdf(1.5)
- The height of people is Gaussian with mean 65 inches and standard deviation 2.5 inches. What fraction of people are shorter than 67.5?
- norm.cdf((67.5-65)/2.5) = 84.13
- mu = 65 inches, sigma =2.5
96% people are shorter than me. What is my height - norm.ppf(0.96)*2.5 = X - 65
X=69.37 - MS Interview qn
- Skaters take a mean of 7.42 seconds and std dev of 0.34 seconds for 500 meters. What should his speed be such that he is faster than 95% of his competitors?
- I take less time than 95% of the competitors
95% of people have higher time than me
5% of people have less time than me
z=norm.ppf(0.05)
X= z*sigma +mu =norm.ppf(0.05) * 0.34 +7.42 = 6.86 seconds
speed = 500/6.86 = 72.88 Meters per sec - A retain outlet sells around 1000 toothpastes a week, with a std dev = 200.
If we have 1300 stock units as our inventory then what fraction of weeks will we go out of stock. - 1-norm.cdf(300/200) = 6.7%
- df=pd.read_csv('netflix_titles.csv')
- df.isna().sum() #checking for null values
df.isna().sum()/len(df)*100 #%age of missing values - Major Challenges with the data set
- Clubbed/Nested data
- Director, cast, country, listed_in
- Missing Values
- Drop
- Replace
- MODE based imputation for Categorical values
- Smarter idea: MODE with extra context like Genre when cast is missing
- preprocessing required
- When numeric and string is combined. Split the same to find avg like values on numeric
- For separate analysis of movies & TV shows, it is important to first distinguish them. Find out what percentage of titles present in the dataset are TV shows and what percentage of them are movies?
- df['type'].value_counts(normalize=True) * 100
- Clubbed/Nested data
- Business Problem
- Market research team at AeroFit wants to identify the characteristics of the target audience for each type of treadmill offered by the company, to provide a better recommendation of the treadmills to the new customers. The team decides to investigate whether there are differences across the product with respect to customer characteristics.
- Perform descriptive analysis to create a customer profile for each AeroFit treadmill product by developing appropriate tables and charts.
- For each Aerofit treadmill product, construct two-way contingency tables and compute all conditional and marginal probabilities along with their insights/impact on the business.
- Ask: For each type of Treadmill, one have to answer which type of people are more likely to purchase it.
- why: Identify target audience
- what does good look like?
- Import the dataset and do usual data analysis steps like checking the structure & characteristics of the dataset
- Detect Outliers (using boxplot, "describe" method by checking the difference between mean and median)
- check if features like marital status, age have any effect on the product purchased(using countplot, hist plots, boxplots etc)
- Representing the marginal probability like - what percent of customers have purchased KP281, KP481 or KP781 in a table (can use pandas.crosstab here; see the sketch after this list)
- check correlation among different factors using heatmaps or pairplots
- with all the above steps you can answer questions like: what is the probability of a male customer buying a KP781 treadmill?
- Customer Profiling - Categorization of users
- Probability - marginal, conditional probability
- Some recommendations and actionable insights based on the inferences
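- A minimal sketch of the contingency-table step (the file name and the 'Gender'/'Product' column names are assumed from the Aerofit dataset):
import pandas as pd
df = pd.read_csv('aerofit.csv')                                      # assumed file name
print(pd.crosstab(df['Gender'], df['Product'], normalize='all'))     # joint/marginal probabilities
print(pd.crosstab(df['Gender'], df['Product'], normalize='index'))   # P(Product | Gender), e.g. P(KP781 | Male)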
- Gaussian distribution
- It is called normal distribution - For any random variable in real life, if we plot a graph of their distribution it will ideally be similar to Gaussian distribution
- Most random variable follows Gaussian
- It is centered around mean
- Standard Deviation: a way to measure how far values typically are from the center
- Z-score : How many standard deviation away is a value from the center
- CDF -> how many % values are less than the given element
- norm.cdf(z)
- z table
- Sampling in Pandas
- df_height.sample(10)
- sample_mean_10 = [np.mean(df_height.sample(10)) for i in range(10000)]
- sns.histplot(sample_mean_10)
- A distribution of sample means will always be a normal/Gaussian distribution
- Central Limit Theorem
- Mean of distribution of sample means of any dataset is equal to the Population Mean
- As the number of items per sample decreases, the standard deviation of the sample means increases (simulated in the sketch below)
- Std Dev of the sample mean distribution is proportional to 1/sqrt(n) #n is the number of items in a single sample
- Std Dev of the sample mean distribution is called the Standard Error.
- Std Err = sigma / sqrt(n) #sigma is the population std dev
- mu sample ~ mu population
- x' ~ N(mu, sigma/sqrt(n))
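- A small simulation of the 1/sqrt(n) behaviour of the standard error (synthetic, deliberately non-Gaussian population, just for illustration):
import numpy as np
rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)
for n in [5, 25, 100]:
    sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]
    print(n, np.std(sample_means), population.std() / np.sqrt(n))   # the two numbers roughly match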
- Examples
- Systolic blood pressure of a group of people is known to have an average of 122 mmHg and a standard deviation of 10 mmHg. Calculate the probability that the average blood pressure of 16 people will be greater than 125 mmHg.
- std error = 10/sqrt(16) = 2.5; 1-norm.cdf((125-122)/2.5) = ~0.115
- Weekly toothpaste sales have a mean 1000 and std dev 200. What is the probability that the average weekly sales next month is more than 1110.
- Ask: Calculate the mean of a sample of 4 weeks. sigma of sample mean = std error = sigma population/sqrt(n) = 200/sqrt(4) = 100; 1-norm.cdf((1110-1000)/100) = ~0.135
- In an e-commerce website, the average purchase amount per customer is $80 with a standard deviation of $15. if we randomly select a sample of 50 customers, what is the probability that the average purchase amount in the sample will be less than $75.
- std error = 15/sqrt(50) = 2.1213; norm.cdf(-5/2.1213) = 0.0092
- Recap
- %tile & CDF are equivalent
- z-score is how many standard deviations away from mean
- Question
- The average time taken for customers to complete a purchase is 4 minutes with a std dev of 1 min. Find the probability that a randomly selected customer will complete a purchase within 6 minutes? Assume Gaussian
- z = (6-4)/1 = 2
norm.cdf((6-4)/1) = ~0.977 - The average time taken for customers to complete a purchase is 4 minutes with a std dev of 1 min. Find the probability that the average time of the next 5 customers is less than 6 minutes
- std error = 1/sqrt(5)
import numpy as np
from scipy.stats import norm
norm.cdf((6-4)/(1/np.sqrt(5))) => 0.9999
- The average order value of an e-commerce website is 50, with a standard deviation of 5.
What is the probability that the average of the next 3 orders exceeds 60? - import numpy as np
from scipy.stats import norm
1-norm.cdf((60-50)/(5/np.sqrt(3)))
- Confidence Interval
- Few sample data -> can we predict a range for population. What is the probability that value will lie in the range.
- Avg age of content creators of Instagram
- 90%(Confidence) sure that age lie between 13-20(Interval).
- For any norm distribution, take Za & Zb at 5% and 95% of the curve
Za=norm.ppf(0.05) = -1.64
Zb=norm.ppf(0.95) = 1.64
90% of all values will lie between mu - 1.64*sigma & mu + 1.64*sigma
Take a random X (mean of a random sample). There is a 90% chance that X will fall between
mu - 1.64*sigma & mu + 1.64*sigma, i.e.
mu - 1.64*sigma < X < mu + 1.64*sigma
here the sigma of the sample-mean distribution is sigma-population/sqrt(N), so
mu - 1.64*sigma-population/sqrt(N) < X < mu + 1.64*sigma-population/sqrt(N)
from many experiments it is known/assumed that sigma-sample ~ sigma-population
hence mu - 1.64*sigma-sample/sqrt(N) < X < mu + 1.64*sigma-sample/sqrt(N) - Given a sample with muSample & sigmaSample. 90% chance that the population mean will lie between muSample +- 1.64 * sigmaSample/sqrt(n)
muSample +- norm.ppf(0.05) * sigmaSample/sqrt(n)
muSample +- norm.ppf(0.025) * sigmaSample/sqrt(n) #95% confidence - As sample size increases the range decreases
- Examples
- The sample mean recovery time of 100 patients after taking a drug was seen to be 10.5 days with a standard deviation of 2 days. Find the 95% confidence interval of the true mean.
- muSample +- norm.ppf(0.025) * sigmaSample/sqrt(n)
import numpy as np
from scipy.stats import norm
samplemean = 10.5
samplesigma = 2
conflevel = 0.025
size = 100
print(samplemean + norm.ppf(conflevel) * samplesigma/np.sqrt(size))
print(samplemean - norm.ppf(conflevel) * samplesigma/np.sqrt(size))
10.11 to 10.89
- CI using Bootstrap
- survey_1 = [35,36,33,34,35]
np.mean(survey_1)
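- A minimal bootstrap sketch for a 95% CI of the mean (resample with replacement many times, then take the 2.5th/97.5th percentiles of the resampled means):
import numpy as np
rng = np.random.default_rng(0)
survey_1 = [35, 36, 33, 34, 35]
boot_means = [np.mean(rng.choice(survey_1, size=len(survey_1), replace=True)) for _ in range(10_000)]
print(np.percentile(boot_means, [2.5, 97.5]))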
- Probability = count of favorable events/total events
- Questions
- India and Pakistan play a 3-match series. How many results are possible? Note that we consider(Ind, Ind, Pak) different from (Ind, Pak, Ind) etc.
- Tree -> with all possibilties = 8
- Possible outcomes 2 * 2* 2 = 8
- In a bowl-out, for a specific ball you have to choose a bowler and a wicketkeeper. Suppose you have 5 bowlers and 3 wicketkeepers. How many ways you can select for a ball?
- 5 ways for bowler and 3 ways for wicketkeeper = 15
- Tree -> Each bowler, select a keeper - there will be 15 combinations
- There are 3 ways to move from Chennai to Bangalore, and 4 ways to move from Bangalore to Delhi. There are 2 ways to move from Chennai to Hyderabad, and 3 ways to move from Hyderabad to Delhi. In how many ways can we move from Chennai to Delhi?
- 2 trees/maps
- Permutations
- N Objects to Arrange in R Slots
- N!/(N-R)! = nPr
- 5 letters -> A, A, B, C, D and 2 slots. How many possible ways can you arrange these
- 13 (manual process by creating trees using A1, A2 and eliminating the duplicate arrangements from the repeated A)
- Combinations
- Counting when Order doesn't matter
- Total possible ways of choosing = nPr/r!
- Question
- You have to choose top 3 order of batsmen for India from (Rohit, Kohli, Shreyas, Rahul).
How many possible ways you can choose those players = 4C3 = 4
How many possible ways to arrange them? 4*3*2 = 24 - A Maruti showroom has 3 colour in their "Baleno" model and 3 colours in "Swift" model. In how many ways can they place it such that Baleno and Swift are kept in alternate slots.
- 3*3*2*2*1*1 + 3*3*2*2*1*1 = 36+36 = 72
- Arrange A A B C D in 2 slots
- If no common elements -> 4C2 = 6
- Both are same - 1
- Ways to arrange - 4C2 * 2 = 12
- Both are same -1
- Total = 13
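- A brute-force check of the count above with itertools (distinct ordered arrangements of 2 letters drawn from A, A, B, C, D):
from itertools import permutations
print(len(set(permutations('AABCD', 2))))   # 13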
- Coin Tosses: When coin is tossed 100 times, what are the chances that we get 52 heads
- 100C52 * (0.5)^100
- Binomial -> criteria is
- Fixed number of trials (e.g. interviews) - say n
- For each interview success rate is say P
- Each interview is independent of other interview
- X : number of successes
- if n=1 then X will be {0,1}
- probability of success say is 0.1 then
- p(x=0) =0.9 & p(x=1) =0.1
- PMF:
- Probability Mass function: Distribution where X-axis is actual event values and Y-axis is probability
- Plot between probability against values is called PMF
- p=0.1
x_vals=[0,1]
probs=[1-p,p]
sns.barplot(x=x_vals, y=probs) - from scipy.stats import binom
x_vals=[0,1,2,3]
probs=binom.pmf(x_vals,n=3,p=0.1)
sns.barplot(x=x_vals, y=probs) - when n=1 the same is called Bernoulli distribution
- when n=50
- x_vals=np.arange(0,51)
probs=binom.pmf(x_vals,n=50,p=0.1)
sns.barplot(x=x_vals, y=probs) - When N interviews are given
- Total possible outcomes is 2^N
- p(X=k) = nCk * p^k * (1-p)^(n-k)
- The above is similar to choose K boxes out of the n
when k=1 it will be nc1
so similarly for k boxes it is nck - nCk => math.comb(n,k)
- Geometric Progression
- Keep giving interviews till you get success
- {S}
{F,S}
{F,F,S}
{F,F,F,S}
{F,F,F,F,S} - on what interview will you get success?
- p(x=1)=0.1
p(x=2)=(0.9)* 0.1
p(x=3)=(0.9)*(0.9)*0.1
p(x=k)=(0.9)**(k-1) * 0.1 - p(x=k)=(1-p)**(k-1) * p
- code
- from scipy.stats import geom
p=0.1
x_vals = np.arange(1,20)
probs_geom = geom.pmf(x_vals,p) - question
- The probability that Messi scores a penalty shot is 0.8. What is the probability that he will have 7 or fewer successes in 10 chances
- x_vals = np.arange(0, 11)
probs=binom.pmf(x_vals, n=10, p=0.8)
sns.barplot(x=x_vals, y=probs) - np.sum([binom.pmf(k=i, n=10, p=0.8) for i in np.arange(0,8)])
- binom.cdf(k=7,n=10,p=0.8) #cumulative probability till the given K
- What is the probability that we will score 8 or more
- 1- binom.cdf(k=7,n=10,p=0.8)
- Suppose we float 10 quizzes with four options each. Calculate the probability that a student, who randomly guesses, answers 2 questions correctly.
- p(x=2) = nC2 * p^2 * (1-p)^(n-2)
- binom.pmf(k=2,n=10,p=0.25)
- n=10
k=2
p=0.25
math.comb(n,k) * p**k * (1-p)**(n-k) - Types of Probability
- Marginal Probability
- Probability that Sachin score a century
- Probability that Sachin team wins
- Conditional probability
- df_sachin[["century","Won"]].value_counts() #results groupby century and won
- Given that Sachin has scored a century what are the chances that India wins
- 30/46 #for given set
- Given that India wins what are the chances that Sachin has scored a century
30/184 - Joint Probability
- Probability that sachin scores a century and India wins
- pd.crosstab(index=df_sachin["century"], columns=df_sachin["won"]) #contingency table (see the sketch below)
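- Sketch (tiny made-up 0/1 data just to show the crosstab calls; the real analysis uses df_sachin)
import pandas as pd
df = pd.DataFrame({"century": [1, 1, 0, 0, 1, 0], "won": [1, 0, 1, 0, 1, 0]})
print(pd.crosstab(df["century"], df["won"]))                     # contingency table (counts)
print(pd.crosstab(df["century"], df["won"], normalize="index"))  # conditional: P(won | century)
print(pd.crosstab(df["century"], df["won"], normalize="all"))    # joint: P(century AND won)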
- what kind of distribution
- N is fixed, there are 2 possible outcomes, and the probability of one outcome is known -> Binomial distribution
- Geometric
- search on google/netflix
- dictionary lookup
- Assume something & test the assumption
- Statistically prove whether the assumption is correct or not. This is essentially called Hypothesis Testing.
- Default assumption is called NULL Hypothesis
- Suspect is accused of murder. (criminal or not criminal)
- Innocent until proven guilty -
- Assumption is Innocent and asked to prove the assumption is incorrect
- Burden of proof is on those who want to reject the default assumption.
- who are introducing the new claim
- Cricket
- Third umpire - When there is a dispute on the field, the decision of the on-field umpire is challenged
- on-field umpire gives a soft call for lbw. - what is the default assumption for the 3rd umpire (not out, out) OR on-field umpire is correct vs not correct
- Covid test(+ve or -ve). -ve default assumption
- Null Hypothesis(H0) -> Default Assumption
- Alternate Hypothesis(Ha) -> your assumption when you reject the null hypothesis
- p[data at least this extreme | Ho] -> this is called the P-value
- If P-Value is low - should you accept your hypothesis or reject
- Reject null hypothesis
- Default behavior -> Null hypothesis
- Collect evidence/data opposite to null hypothesis
- p(evidence | if Ho is true) is low -> reject the null hypothesis
- A Juice brand claims that its new manufacturing process has reduced the sugar content in its juice boxes to 8 gms. Now, food safety and standards authority of India(FSSAI) wants to test the claim of the juice brand, and choose the correct option:
- Ho: The sugar content is unchanged (default behaviour); Ha: The new manufacturing process has reduced the sugar content to 8 gms
- The default assumption should NOT be the claim; rather it should be the default behaviour
- If P-value is less than alpha(significance level) then we reject the null hypothesis
- When test says - NO Virus and reality also have No Virus then TRUE NEGATIVE
When test says - Virus and reality also have Virus then TRUE POSITIVE
When test says - Virus and reality NO Virus then FALSE POSITIVE(Type 1 error)
When test says - NO Virus and reality have Virus then FALSE NEGATIVE(Type 2 error) - Test is Negative -> unable to reject NULL hypothesis
- Left handed or left tail test
right handed or right tail test
Two/double tail test
- How to decide Ho
- Default behavior
- fair coin
- innocent
- based on who is testing a claim
- Hypothesis testing framework
- NULL and alternate hypothesis
- Identify the distribution (gaussian, etc.)
- Left, right, Two tailed
- Compute p-value[Probability of Seeing observed values GIVEN Ho is True]
- Compare p-value with alpha
- Central Limit Theorem (CLT)
- D.Mart - avg weekly sale is 1800 with std dev 100. They hired a marketing team to improve sales. Marketing team started with 50/2000 stores to test the strategy. Firm says the avg sale of 50 stores increased to 1850.
- NULL and alternate hypothesis
- Ho -> sale is same
- Ha -> sale increased (claim of marketing team)
- Identify the distribution (gaussian, etc.)
- gaussian
- Left, right, Two tailed
- right tailed
- Compute p-value[Probability of Seeing observed values GIVEN Ho is True]
- mu population =1800 & std dev = 100
mu of sampling distribution =1800 & std dev of sampling distribution = 100/sqrt(50) - p[sample mean >= 1850 | Ho is true]
- z= (1850-1800)/(100/sqrt(50))
1-norm.cdf(z) = 0.0002 - Compare p-value with alpha
- alpha=0.01(if observed value lies in extreme 1% then we can reject Ho. 99% confidence that the claim made is true)
alpha>p-value => reject Ho - Yes, marketing has increased the sales
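- A minimal sketch of the D-Mart z-test above (numpy/scipy, as used elsewhere in these notes)
import numpy as np
from scipy.stats import norm
mu, sigma, n, sample_mean = 1800, 100, 50, 1850
z = (sample_mean - mu) / (sigma / np.sqrt(n))   # ~3.54
p_value = 1 - norm.cdf(z)                       # right-tailed, ~0.0002
print(p_value < 0.01)                           # True -> reject Ho at alpha=0.01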
- alpha -> significance
- 1-alpha -> confidence level
- The weights of apples in a fruit market are normally distributed with a mean of 150 grams and standard deviation of 20 grams. If an apple weighs 140 grams, what is its z-score
- mu=150, stddev=20
z-score = (140-150)/20 = -0.5 - A coffee shop claims that their coffee cups contain, on average, at least 12 ounces of coffee. A random sample of 36 coffee cups showed an average of 11.8 ounces with a std dev of 1.5 ounces. Conduct a z-test to determine if the coffee shop's claim is supported. What is the p-value?
- Notes
- There are 3 distributions: the population, the sampling distribution of the sample mean, and the single sample
mu pop = mu of sampling distribution
sigma of sampling distribution = sigma pop/sqrt(n) - if pop std dev is not given, assume it to be the same as the single sample's std dev
- mu = 12 std dev = when not given it is same as sample = 1.5
single sample size = 36, mean =11.8 and std dev =1.5
sample mean distribution, mean = 12, stddev = 1.5/sqrt(36) = 0.25 - z=(11.8-12)/(1.5/sqrt(36)) = -0.8
- Ho = avg content is 12 ounces (the claim holds)
Ha = avg is < 12 ounces - left tailed
- p-value = % of sample means <= 11.8 = norm.cdf(-0.8) ~ 0.21 > 0.05 -> cannot reject the claim (see the sketch below)
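- Sketch of the coffee-cup test, assuming the left-tailed framing above
import numpy as np
from scipy.stats import norm
z = (11.8 - 12) / (1.5 / np.sqrt(36))   # -0.8
p_value = norm.cdf(z)                   # P(Z <= -0.8) ~ 0.21 > 0.05 -> cannot reject the claim
print(z, p_value)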
- A fitness app claims that its users walk an average of 8000 steps per day. A random sample of 30 users showed an average of 7600 steps per day with standard deviation of 1200 steps. Conduct a right tailed Z-test at a 5% significance level to determine if the app's claim is supported. What is the p-value?
- claim 8000 steps
- mu population = 8000 sigma population=1200
alpha = 0.05
Observed value = 7600 and n=30 - Sample mean distribution
mean =8000, std dev = 1200/sqrt(30)
z-score = (7600-8000)/(1200/sqrt(30)) = -1.83 (see the sketch below) - Critical value
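- Sketch of the fitness-app numbers; note the sample mean (7600) is below the claimed 8000, so the evidence sits in the left tail even though the question says "right tailed"
import numpy as np
from scipy.stats import norm
z = (7600 - 8000) / (1200 / np.sqrt(30))   # ~ -1.83
p_left = norm.cdf(z)                        # ~0.034 < 0.05
p_right = 1 - norm.cdf(z)                   # ~0.966
print(z, p_left, p_right)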
- A french cake shop claims that the average number of pastries they can produce in a day exceeds 500. The average number of pastries produced per day over a 70-day period was found to be 530. Assume that the population standard deviation for the pastries produced per day is 125. Test the claim using a z-test with the critical z-value = 1.64 at alpha (significance level) = 0.05, and state your interpretation.
- framework
- NULL and alternate hypothesis
- Identify the distribution (gaussian, etc.)
- Left, right, Two tailed (decided by Ho and Ha values NOT by observed and mean)
- Compute p-value[Probability of Seeing observed values GIVEN Ho is True]
- Compare p-value with alpha
- Ho -> Cake shop produces 500 cakes per day, mu =500, Ha>500
- right tailed test
- 70 day sample -> std dev = sigma pop/sqrt(70)
- z-score = (530-500)/(125/sqrt(70)) = 2.01
- p-value = 1-norm.cdf(z) = 0.022
- alpha = 0.05
- p-value < alpha -> hence reject Ho
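- Sketch of the pastry-shop calculation above
import numpy as np
from scipy.stats import norm
z = (530 - 500) / (125 / np.sqrt(70))   # ~2.01 > critical value 1.64
p_value = 1 - norm.cdf(z)               # ~0.022 < alpha=0.05 -> reject Ho
print(z, p_value)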
- https://www.scaler.com/instructor/meetings/i/t-test-24/
- iq_score = [110,105,98,102,99,104,115,95]
population avg =100
Ho = pill has no effect, avg is still 100
H1 = Avg>100 - z-score = (x-mu)/(sigma population/sqrt(n))
- t-score/t statistics = (x-mu)/(sigma sample/sqrt(n))
- from scipy.stats import ttest_1samp, ttest_ind
t_statistic, pvalue = ttest_1samp(iq_score, mean_for_Ho, alternative='greater' / 'less' / 'two-sided')
t_statistic, pvalue = ttest_1samp(iq_score, 100, alternative='greater') - When Population mean not available -> 2 tests/samples
- compare general people with people took medicine
- T test using 2 samples
- we try to find if the samples belong to same population
- T Test
- 1 sample -> Sample against population
- 2 sample -> sample against sample
- For Number vs Categorical (only 2 categories at a time)
- Applicable when we compare numeric data of 2 categories
- Innings1 vs Innings2 runs
- Drug1 vs Drug2 recovery
- >2 sample -> Anova test
- recap
- z-test
- world follows normal distribution
- Calculate p-value using z-score
- norm.cdf
- Requires population mean and std dev which is a challenge
- t-test
- the test statistic doesn't follow the normal distribution; it follows the t-distribution
- Requires population mean and std dev of sample
- Use when population std dev is not provided OR when N is small
- t-test function from python lib
- Numeric values with 2 categories - t-test
- Numeric values with 2+ categories - Anova test
- Numeric with Numeric - Correlation
- Category with Category -> Chisquared test
- compare if gender makes a diff in Product Sales
- Degree of Freedom
- Given a cumulative/aggregate value, how many values are required to be known to calculate all values in the data.
- (#row-1)*(#col-1) -> for matrix
- (#row-1) + (#col-1) -> for 2 arrays
- Coin Toss
- 50 times toss of fair coin
- Expected 25 heads and 25 tails. Actual/observed is 28 heads and 22 tails. Is the coin fair. Degree of freedom is 1
- Ho -> coin is fair, Ha -> coin is unfair
- X^2 = sum((observed - expected)^2 / expected) => chi-square statistic
- If X^2 is low, there is a high chance that Ho is true
- Here we compare Observed vs Expected counts for Heads and Tails (Categorical vs Categorical)
- import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chisquare
from scipy.stats import chi2
alpha = 0.05
#Ho : coin is fair
#H1: coin is unfair
#chisquare(f_obs=observed_values, f_exp=expected_values)
#returns chi statistic & p-value -> the p-value is the right-tail area of the chi-square distribution
chisquare(f_obs=[28, 22], f_exp=[25, 25]) - chi2_contingency( -> lets Python calculate the Expected values
[[],[]
]) -> Returns chi_stats, P_value, Df(degree of freedom), expected table - Test of Independence
OR
Are 2 categorical variables dependent on each other or not - A marketing manager wants to determine if there is a relationship between the type of advertising (online, print or TV) and the purchase decision(buy or not buy) of a product. The manager collects data from 300 customers and records their advertising exposure and purchase decisions. What statistical test should the manager use to analyze this data?
- Chi Square- independence test
- Assumptions in Chi Square tests
- Variables are category - category
- Observations are independent
- Each cell is mutually exclusive
- Chi square only works if each cell >=5
- What is Hypothesis testing
- Testing a claim
- Observed data is different/significantly different than Ho
- Is the difference between the observed data and the Ho mean by chance (high p-value) or significant (low p-value)?
- Z-test
- t-test
- Chi Square test
- df_aerofit.head()
- Product, Age, Gender, Education, MaritalStatus, Usage, Fitness, Income, Miles
- Does Gender affect the Income?
- T-test (2 sample) as one is Categorical and the other is numeric
- Does Gender have any impact on Product?
- Chi-square test
- Product impact on Income
- sns.boxplot(x='Product', y='Income', data=df_aerofit)
- Which test to use when
- Num vs Num -> Correlation
- Cat vs Cat -> Chi square
- Cat vs Num
- 2 Categories -> t-test
- > 2 categories -> Anova
- Does the product get impacted by income
- Ho -> No impact of income over product purchased
- Divide data into 3 equal parts
- Create one column -> which randomly distributes the data in 3 parts
- df_aerofit["random_group"] = np.random.choice(
["g1","g2","g3"],
size=len(df_aerofit)
) - Variance between groups -> If Ho is true this should be low
- Variance within groups -> If Ho is true this should be high
- F_ratio = Variance between groups/Variance within groups -> If Ho is true this should be low & if Ha is true this should be high
- F_score = Variance between groups/Variance within groups
- Coding
- income_g1 = df_aerofit[df_aerofit["random_group"]=="g1"]["Income"]
income_g2 = df_aerofit[df_aerofit["random_group"]=="g2"]["Income"]
income_g3 = df_aerofit[df_aerofit["random_group"]=="g3"]["Income"]
income_product1 = df_aerofit[df_aerofit["product"]=="product1"]["Income"]
income_product2 = df_aerofit[df_aerofit["product"]=="product2"]["Income"]
income_product3 = df_aerofit[df_aerofit["product"]=="product3"]["Income"]
from scipy.stats import f_oneway
f_oneway(income_g1,income_g2,income_g3)
f_oneway(income_product1,income_product2,income_product3) - Compare impact of a categorical column to another numeric column
Product aginst income - Assumptions of Anova
- Data is gaussian
- Data is independent across each record
- Equal variance in diff groups
- When the above assumptions don't hold, use the "Kruskal-Wallis Test" (it does not require them)
- >= 2 categories vs numeric column
- from scipy.stats import f_oneway, kruskal
kruskal(income_product1,income_product2,income_product3) - Test if a distribution is Gaussian OR not
- sns.histplot(height) // visual check, may not be accurate
68/95/99 rule -> mu +/- 1 sigma covers ~68% - Check the 1st percentile of your data vs the 1st percentile of a gaussian distribution (theoretical quantile)
do same for every percentile - Draw graph for your quantile vs gaussian quantile data
- from statsmodels.graphics.gofplots import qqplot
qqplot(height) // will be diagonal line - This is z-score on x-axis and your data in graph
- Food delivery
- path='waiting_time.csv'
df_wt=pd.read_csv(path)
sns.histplot(df_wt["time"])
qqplot(df_wt["time"], line='s')
plt.show() -> not a straight line in diagonal direction. Hence, not a Gaussian - We are checking if some data is Gaussian or not
- Ho -> data is Gaussian
H1 -> not Gaussian - Shapiro Test -> To test if data is Gaussian or not
- You take 50-200 sample points
run Shapiro test
p-value is low -> reject Ho - from scipy.stats import shapiro
height_samp = height.sample(100)
shapiro(height_samp) - Equal Variance Assumption
- Ho -> Variance is equal
H1 -> not equal - Levene Test
- from scipy.stats import ttest_ind, levene
ttest_ind(height_men,height_women)
height_men.var()
height_women.var()
levene(height_men,height_women) - Reject Ho if pvalue is low
- Numeric Data
- Z Test -> if sigma population is known
- T Test
- Category vs Numeric
- 2 Categories -> T Test 2 samples
- > 2 Categories -> Anova
- > 2 Categories -> Kruskal-Wallis if Anova's assumptions fail
- Category vs Category
- Chi Square tests
- To Check Gaussian
- QQ Plot
- Shapiro Test
- To Check if Variance is same or not
- Levene test
- Numeric vs Numeric
- Correlation test
- df_hw=pd.read_csv("weight-height.csv")
sns.scatterplot(x=df_hw["Height"],y=df_hw["Weight"]) - Co-Variance
- cov = 1/n * sum((hi - h_mu) * (wi - w_mu))
- If co-variance is +ve, then it is called +vely co-related
- Ice cream sales VS Amount of Rainfall
- co-variance will be negative. It is called -vely co-related
- Height vs Rainfall
- Net area will be 0 => uncorrelated
- Co-variance value changes based on unit of height/weight. Hence, it is decided to divide with certain metric to standardize data. The metric is Standard Deviation. The value that is obtained after dividing by standard deviation is Co-relation co-efficient.
- rho = 1/n * sum(((hi - h_mu)/stddev_h) * ((wi - w_mu)/stddev_w))
- Value of rho will be between -1 & 1
- 0 ~ uncorrelated, 1 is positive correlation, -1 is negative correlation
- df_hw[["Height","Weight"]].corr()
- Why is co-variance not a good quantitative measure to check correlation?
- Because it changes based on scale
- Salary vs Experience
- Counter-intuitive
- Intuition -> the relation is +ve
mathematically -> the relation is neutral - Reason
- Data is not along a line
- Non-linear relationship
- Non linear relationships are not captured properly by Correlation Co-efficient
- Pearson Correlation: is the correlation that we learnt until now
- Works only with Linear relationships
- Spearman Correlation
- sort both the values(salary, experience) and get the rankx and ranky
- rho = 1/n * sum(((rankx - rankx_mu)/stddev_rankx) * ((ranky - ranky_mu)/stddev_ranky))
- from scipy.stats import pearsonr, spearmanr
pearsonr(df_hw["Height"],df_hw["Weight"]) - spearmanr(df_hw["Height"],df_hw["Weight"])
- Spearman works extremely well in monotonic increase/decrease data
- the p-value tells you how likely it is to see a correlation this strong between two actually unrelated variables with this many data points
- Pearson is used in general. When we know the graph is not linear use spearman
- If Pearson is giving near 0 values, then verify with Spearman as well
- import pandas as pd
import seaborn as sbn
df=pd.read_csv("aerofit.csv")
sbn.boxplot(y='income', x='Product', data=df) //to check outliers & get insights
sbn.boxplot(y='income', x='Gender', hue='Product', data=df) //get insights based on Gender
df['Product'].value_counts()
sbn.heatmap(df.corr(), annot=True) //correlation between different attributes
pd.crosstab(index=df['Gender'], columns=df['Product'], margins=True)
pd.crosstab(index=df['Gender'], columns=df['Product'], margins=True, normalize='columns')*100
#What is famous among females (conditional probability)
pd.crosstab(index=df['Gender'], columns=df['Product'], margins=True, normalize='index')*100
#Out of 100 products sold for KP781, how many are being bought by females
pd.crosstab(index=df['Gender'], columns=df['Product'], margins=True, normalize=True)*100 - Marginal Probability -> no conditions -> simple probability
- Conditional Probability -> p(4|'Red') -> card is a 4 given it is red = 2/26
- Joint Probability -> p(4 of blacks) = p(AnB) = p(A|B) * p(B) = 2/26 * 26/52 = 2/52
- Business Problem
- The management team at Walmart Inc. wants to analyze the customer purchase behavior(specifically, purchase amount) against the customer's gender and the various other factors to help the business make better decisions. They want to understand if the spending habits differ between male and female customers: Do women spend more on Black Friday than men? The company collected the transactional data of customers who purchased products from the Walmart stores during Black Friday. The dataset has the following features:
user_id, Product_id, Gender, Age, Occupation - When sample data is given and asked the projection for Population, do the bootstrapping and find the Confidence Interval for Population.
- CDF - Cumulative distribution function. Sum of all probabilities up to the given value
- upto 5 heads
- prob that atmost 5 interviews are cracked
- p(x<= given value)
- PPF (Percent Point Function)
- Inverse of the CDF: the value of x for a given cumulative probability
- ppf(q) returns x such that p(X<=x) = q
- PMF
- Discrete distributions (Binomial | Geometric)
- Poisson Distribution
- Will Messi score a goal in match
- How many goals will Messi score in the match
- Rate(lambda/mu) -> Average number of events in a given time interval
- Avg no of goals in a match(90 mins) eg: say 2.5 goals
- Avg no of customer visiting a place
- Rate will change with interval
- Rules
- It should be countable
- Independence - Occurrence of one event doesn't impact other events
- Rate -> It is constant
- No simultaneous events -> 2 goals can't happen at same time
- Question
- Bangalore has 3 accidents per day on average. What are the chances that Bangalore will have atmost 5 accidents tomorrow or 2 accidents tomorrow
- p(x<=5) -> cdf function
- poisson.cdf(k=5, mu=3)
- p(x=4) -> poisson.pmf(k=4, mu=3)
- Binomial PMF
- nCk * p^k * (1-p)^(n-k)
- Poissons PMF
- (lambda^k * e^(-lambda)) / k!
- import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import poisson, binom
poisson.cdf(k=5, mu=3) //atmost 5 accidents
poisson.pmf(k=4, mu=3) //exactly 4 accidents - Question
- On an average there are 3 typos per page. What is the probability that a random page has atmost one typo
- poisson.cdf(k=1, mu=3) //atmost
- Restaurant opens for 8 hrs. Avg number of customers for 8 hrs is 74. What is the probability that in next 2 hrs there will be at most 15 people.
- poisson.cdf(k=15, mu=74/8*2) //atmost 15 customers
- 1- poisson.cdf(k=6, mu=74/8*2) //atleast 7 customers
- You receive 240 messages per hour on average - assume Poisson distributed. What is the probability of the one message arriving over a 30-second time interval
- poisson.pmf(k=1, mu=240/3600*30) //exactly 1
- No message in 15 seconds?
- poisson.pmf(k=0, mu=240/3600*15) //exactly 0
- There are 80 students in a kindergarten class. What is the probability that exactly 3 of them will forget their lunch today? Each one of them has 0.015 probability of forgetting their lunch on any given day.
- poisson.pmf(k=3, mu=0.015*80) //exactly 3
- using binomial
- p=0.015, n=80
binom.pmf(k=3, n=80, p=0.015) - BootStrapping
- Used when we want to estimate a central value but in the real world we only get 6-10 values. How do we check that the estimate makes sense?
- Select values at random (with replacement) from the sample and compute the statistic. Repeat it ~10K times (see the sketch below)
The graph of the resampled means will look like a normal distribution
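- A minimal bootstrap sketch (illustrative numbers of my own): resample with replacement many times and read a confidence interval off the percentiles
import numpy as np
rng = np.random.default_rng(0)
sample = np.array([52, 48, 55, 47, 50, 53])   # the handful of real-world values
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean() for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])   # 95% confidence interval for the mean
print(lo, hi)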
- Recap Poissons dist
- Criteria to use
- Constant rate
- Counting / countable in a limited quantity between 50 - 100
- when n is large & p is small (lambda = n*p), poisson and binomial give similar results
- Independent events
- No simultaneous events
- Questions
- Avg accidents in the city = 3/ day. Probability that there will be 5 accidents tomorrow
- poisson.pmf(k=5, mu=3) //exactly 5
- Whats app -> 240 messages/hr on avg. How many messages will you get on avg in 30secs
- 240/3600*30 = 2 msgs/30sec.
- Probability of 1 msg in next 30 secs
- poisson.pmf(k=1, mu=2) //exactly 1
- Probability of 0 msg in next 15 secs
- poisson.pmf(k=0, mu=1) //exactly 0
- 3 messages in 20 secs
- poisson.pmf(k=3, mu=240/3600*20)
- What is the avg time between 2 messages(Scale)
- 3600/240
- avg messages per second
- 240/3600
- Rate - avg occurrences in a timeframe(lambda/mean)
- Scale - avg time between 2 occurrences
- Probability of no messages in 10 secs
- poisson.pmf(k=0, mu=240/3600*10)
- Probability of waiting more than 10 seconds for the next message
- p(t>10) = 1-p(t<=10) = poisson.pmf(k=0, mu=240/3600*10) #same as the probability of zero messages in those 10 secs
- this is unknown distribution - CDF
- This unknown distribution is called as Exponential distribution
- Exponential Distribution
- p(x=0) = 1 - p(t<=10)
e^(-lambda) * lambda^0 / 0! = e^(-lambda)   (lambda here = rate for the 10-sec interval)
e^(-lambda) = 1 - p(t<=10)
expon.cdf: p(t<=10) = 1 - e^(-lambda) - Distribution of events in 10 secs -> Poissons distribution
- Distribution of time till next event -> Exponential distribution
- Probability of no events in next 10 seconds
- poisson.pmf(k=0, mu=240/3600*10)
- Time till next events follow a exponential distribution. Probability of time being greater than 10 seconds
- 1 - dist.cdf(10)
1 - expon.cdf(x=10, scale=15) - Question
- 7 days -> 490 trains in total
rate = 490/7 = 70 per day = 70/24 per hr - Avg time between 2 trains = 24/70 hr - Scale
- Calculate probability that it will take atleast 30 mins for next train to come
- p(t > 30 mins) = 1 - p(t <= 0.5 hr)
= 1 - expon.cdf(x=0.5, scale=24/70) - Suppose you are managing a call centre, and you have observed that, on average, you receive 10 customer service calls per hour, following an exponential distribution. You want to calculate the probability of waiting less than 5 mins before the next call arrives
- Rate = 10 calls/hr
scale = 60/10 = 6 mins
expon.cdf(x=5, scale=6) - Software development -> avg time to debug an issue is 5 min. Find the probability that the problem is debugged in 4 to 5 mins
- p[4<=T<=5] = p[T<=5] - p[T<=4]
- expon.cdf(x=5, scale=5) - expon.cdf(x=4, scale=5)
- More than 6 mins to debug
- p[t>6] = 1- p[t<6]
- 1 - expon.cdf(x=6, scale=5)
- Given that you have already spent 3 mins on debugging without finding anything, what is the probability that it will take more than 9 min to debug from beginning
- p[T>9 | T>3] = p[(T>9) n p(T>3)] /p(T>3) = p[(T>9)] /p(T>3)
= (1-expon.cdf(x=9,scale=5)) / (1-expon.cdf(x=3,scale=5)) - OR
- 1-expon.cdf(x=6,scale=5)
- Memory less - history doesn't matter
- p[T>x | T>y] = p[T> x-y]
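- Quick numeric check of the memoryless property with scipy's exponential (scale=5 as in the debugging example)
from scipy.stats import expon
p_conditional = expon.sf(9, scale=5) / expon.sf(3, scale=5)   # P(T>9 | T>3), sf = 1 - cdf
p_fresh = expon.sf(6, scale=5)                                # P(T>6)
print(p_conditional, p_fresh)                                 # both ~0.301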
- Paired T-Test
- Does having a doubt clearing session improve people's scores?
- Before vs after analysis - Paired T-test
- from scipy.stats import ttest_rel
ttest_rel(df_ps["test_1"], df_ps["test_2"])
- Log normal dist
- take log(data) and create distribution for the log data
If log(data) is normally distributed, then the original data is said to follow a log-normal distribution (see the sketch below) - Why does log convert a non-normal to a normal distribution?
- log -> property -> non linear compression of axis
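- Sketch (synthetic log-normal data of my own) of the "log it and re-check" idea
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import shapiro
from statsmodels.graphics.gofplots import qqplot
rng = np.random.default_rng(0)
data = rng.lognormal(mean=3, sigma=0.5, size=200)
qqplot(np.log(data), line='s')   # roughly a diagonal line -> log(data) is normal
plt.show()
print(shapiro(np.log(data)))     # high p-value -> cannot reject "log(data) is Gaussian"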
- Parameters for Gaussian distribution
- Feature engineering
- Attributes of a table/csv are features, apart from the one that is to be predicted
- eg: in Aerofit
- Education, gender, income, fitness, usage are features, product is Target
- Using Features -> predict the Target
- Identify if a person is fit or not
- Conduct Survey of height, weight
- Submit data for expert advice
- Expert marks the records as fit/unfit -> Ground truth
- Machine learning is about identifying features & automating the process
- Height Weight -> fitness
- BMI = Weight/Height^2
- BMI will be new feature to identify fitness
- Feature engineering is creating new features from the given features, which helps in better prediction of the target (see the sketch below).
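- A tiny sketch of a hand-crafted feature (hypothetical Height in metres and Weight in kg)
import pandas as pd
df = pd.DataFrame({"Height": [1.60, 1.75, 1.82], "Weight": [70, 68, 95]})
df["BMI"] = df["Weight"] / df["Height"]**2   # new feature derived from the given ones
print(df)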
- Features can be created 2 ways
- Automated -> Machine Learning
- Manual -> Domain Knowledge
- data=pd.read_csv('loan.csv')
Gender, marriage, income.. -> features, loan_status is Target
data=data.drop('Loan_ID', axis=1) //loan id is not required
data.describe()
data.describe(include='object')
data.isna().sum() // drop off the rows with null values if count is very less
Identifying whether feature affects final target
### univariate analysis -> Checking one feature if it is effecting final target
sns.countplot(data=data, x='Loan_Status')
data.groupby("Loan_Status")['ApplicantIncome'].mean()
#use ttest to check if income and loan status are independent(Ho) or dependent(Ha)
a=data[data["Loan_Status"]=="Y"]["ApplicantIncome"]
b=data[data["Loan_Status"]=="N"]["ApplicantIncome"]
ttest_ind(a,b) #fail to reject null hypothesis as pvalue is high
#Bin vs Status -> two categorical variable so use ChiSquare test
bins=[0,3000,5000,8000,81000]
group = ['Low','Average','High','Very High']
data["TotalIncome_bin"] = pd.cut(data["TotalIncome"], bins, labels=group)
vals=pd.crosstab(data["TotalIncome_bin"], data["Loan_Status"])
chi2_contingency(vals)
#Pvalue is still high so can't reject null hypothesis
#loan amount & loan term against your salary may be right comparison
data['Loan_Amount_Term'].value_counts()
#Domain Knowledge
loanamount/loan term => emi per year
data['Loan_Amount_per_year'] = data['LoanAmount']/data['Loan_Amount_Term']
data['EMI'] = data['Loan_Amount_per_year'] / 12 #approximation
data['Able_topay_EMI'] = (data['TotalIncome']*0.3 > data['EMI']).astype('int')
sns.countplot(x='Able_topay_EMI', data = data, hue = 'Loan_Status')
vals=pd.crosstab(data['Able_topay_EMI'], data['Loan_Status'])
chi2_contingency(vals)
#if new feature is able to reject Ho then take it else drop it
- Recap
- Most data scientists spend 60-70% of time in identifying important features ie features that heavily impact target variables.
- Missing Values
- Numeric
- Mean, median, Mode
- Categorical variables
- Mode
- delete rows if count is less than 1-2%
- Drop column (feature elimination) if more than 50% are null
- Fill it with a new value(example credit history)
- Filling null values -> sklearn
- from sklearn.impute import SimpleImputer
t = SimpleImputer()
SimpleImputer? #to find different functions available in class
SimpleImputer(strategy="most_frequent").fit_transform(a) #fills na with the value the chosen strategy evaluates - num_missing=['EMI','Loan_Amount_per_year','LoanAmount','Loan_Amount_Term']
mean_imputer=SimpleImputer(strategy="mean")
for col in num_missing:
data[col]=pd.DataFrame(mean_imputer.fit_transform(pd.DataFrame(data[col]))) - Categorical vs Numerical
- One hot encoding
- Convert all possible values into columns eg: male/female has 2 diff columns
- Label encoding
- Just replace string/char with diff int values (1 for male, 0 for female)
- Works well when there are only 3 categories
- from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data[col]=label_encoder.fit_transform(data[col])
data[col].value_counts() - The problem with Label encoding is that it gives preferences to categories. The ones with high value will be given more preference. To handle the challenge we use Target encoding
- Target encoding
- Based on how much impact the target has to the given variable, we assign them a value
- from category_encoders import TargetEncoder
pd.crosstab(data["Self_Employed"], data["Loan_Status"], normalize='index')
col="Self_Employed"
te=TargetEncoder()
data[col]=te.fit_transform(data[col],data["Loan_Status"])
data[col].value_counts() - KS Test
- A T-test compares the means of 2 data sets and, based on the result, tells whether they come from the same distribution or not
- A T-test can be misleading when 2 different distributions have close means
- KS Test does CDF comparison
- from scipy.stats import kstest
from statsmodels.distributions.empirical_distribution import ECDF
- kstest(a,b) // if the p-value is small then they are 2 diff distributions
- Used when mean-comparison methods give misleading results (see the sketch below)
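- Sketch (synthetic data of my own): two samples with the same mean but a different spread, where a mean-based test is blind but the KS test is not
import numpy as np
from scipy.stats import kstest
rng = np.random.default_rng(0)
a = rng.normal(loc=0, scale=1, size=300)   # same mean ...
b = rng.normal(loc=0, scale=3, size=300)   # ... very different spread
print(kstest(a, b))   # small p-value -> the two distributions differ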
- Product Metrics & Design
- RCA
- RFM
- Customer Segmentation
- Guesstimates
- A/B Testing
- Misc case studies
- Data Scientist
- Helps in generating insights & reduce failures.
- Ensure less chance of failure and high chance of success
- Product Acumen / Business Acumen - How good an understanding one has of product development & how to make sure the product works
- Analyzing Metrics & Designing Metrics
- Youtube traffic went down 5% on last Sunday
- Judgement criteria for interviewers
- Structure - Demonstrate a systematic approach
- Comprehensiveness - Covers all important aspects
- Feasibility - Practical enough that it could be implemented realistically
- Framework
- Clarity - requirements
- Plan
- Conclude
- Product Diagnostics - Analysis of a metric
- User signup are down
- Monthly active users have reduced
- New Product/feature - Measure the success/performance of new product(1 hr delivery)
- How is health of Amazon product
- Product Design/recommending a new feature
- move some section from one place to another on the same page
- Should the class at scaler to be moved from 9 to 10
- Product Improvement
- This is more difficult
- How will you improve google maps
- measures in BPS -> basis points
- 0.01% is one BPS
- Product Diagnostics -
- CRIED - Clarify, rule out, Internal, External data
- CTR -> Click thru rate
- Case: Search Results for a facebook events - Clicks increased 15% WoW
- Metric: %user clicking on each event
Change: 15% increase - Clarify
- User clicking on event search means?
- WoW -> for how many weeks
- 15% increase in Users/Clicks?
- equation of metric
- Increase or decrease over what(avg/time)
- Ruleout
- Technical glitches
- outliers -> Diwali, Festivals
- Time
- sudden increase or gradual increase
- Region
- Geographically concentrated
- Other related features affected
- within your company to check if it is only feature or other features as well
- Platform
- Android/ios
- Desktop/Mobile
- mac/windows
- app/website
- Cannibalization
- Your increase or decrease should not come from some other page OR from your own product
- Move of traffic from one section to another section
- eg: checkout icon from Product page to Search page
- Segmentation
- User segmentation based on Age, Gender, new/existing, Casual/Power users, language
- TROPICS - for Internal data
- External Data
- Competitors data (public data)
- new competition
- Good/Bad PR
- Product Success - Defining Metrics
- FB introducing new feature "Save for later" -> Define metric and process to get the same
- Clarify
- What is the feature - are you saving images, videos,
- How long are you saving
- Are people going to be reminded
- Grouped/All together
- Who benefits from feature(user, Content creators, business users, internal teams)
- Business Goal - Think from different perspectives
- User goal
- Marketer goal
- business goal
- Define metric
- AAAERR ->
- Awareness - Are people aware of / using your product
- % of users who save at least 1 item
- % of users returning to see items
- Acquisition - How many users
- No of users coming due to this feature
- CAC - Average cost to acquire one customer - total cost/#users acquired
- Adoption/Activation
- % of total posts saved(saved posts/total posts)
- Engagement
- % of items reopened
- avg time spent (increase/decrease)
- Revenue
- ad spent
- clicks on ads
- Retention
- How often are people coming back
- Guardrail metrics
- Your product success should not make other products go down
- Summarize
- Strategy to be followed to solve business case:
- Basic Exploration
- Missing values
- Outliers
- Strategy to deal with missing values & outliers
- Univariate/Bivariate/Multivariate analysis
- Sample data -> Inferential statistics
- To confirm if it is true for population -> Hypothesis testing/CLT
- Recommendations/Insights
- Walmart
- 5.5 lac transactions
- 7K userIds
- import pandas as pd
import seaborn as sbn
df=pd.read_csv('walmart.csv')
df.groupby('Gender')['Purchase'].describe()
sbn.boxplot(x='Gender', y='Purchase', data=df) #No major diff between the median spend of male and female. Hence, we cannot say with clarity who is spending more
#For population check with CLT - df.sample(300).groupby('Gender')['Purchase'].describe()
import numpy as np
male_sample_means = [df[df['Gender']=='M'].sample(1000, replace=True)['Purchase'].mean() for i in range(1000)]
female_sample_means = [df[df['Gender']=='F'].sample(1000, replace=True)['Purchase'].mean() for i in range(1000)]
np.mean(male_sample_means)
#upperlimit_males = mean + Z-score * standard error
upperlimit_males = np.mean(male_sample_means) + 1.96 * np.std(male_sample_means)
lowerlimit_males = np.mean(male_sample_means) - 1.96 * np.std(male_sample_means)
upperlimit_females = np.mean(female_sample_means) + 1.96 * np.std(female_sample_means)
lowerlimit_females = np.mean(female_sample_means) - 1.96 * np.std(female_sample_means)
#uncertain because the CIs are overlapping
#what can be done to eliminate the overlap
Increase the sample size
reduce the confidence level - Yulu is India's leading micro-mobility service provider, which offers unique vehicles for the daily commute. Starting off as a mission to eliminate traffic congestion in India, Yulu provides the safest commute solution through a user-friendly mobile app to enable shared, solo and sustainable commuting.
Yulu zones are located at all the appropriate locations (including metro stations, bus stands, office spaces, residential areas, corporate offices, etc) to make those first and last miles smooth, affordable, and convenient.
Yulu has recently suffered considerable dips in their revenues. They have contracted a consulting company to understand the factors on which the demand for these shared electric cycles depends. Specifically, they want to understand the factors affecting the demand for these shared electric cycles in the American market.
How can you help here?
The company wants to know:
Which variables are significant in predicting the demand for shared electric cycles in the Indian market?
How well those variables describe the electric cycle demands.
- Recap
- Product Diagnostics: Analyze the change in the metric
- CRIED framework -> Clarify, rule out, Internal data, external data
- Internal data -> TROPICS(Time, Region, Other related features, Platform, Cannibalization, Segmentation)
- Define metric for new product launch
- Clarify (What, why, how, who)
- Business goals (Customer/User persona with their goals)
- Define metric -> AAAERRG (Awareness, Acquisition, Adoption/Activation, Engagement, Revenue, Retention, Guardrail)
- Summarize
- Pyramid
- North Star Metric (top of pyramid, MOST Important)
- Instagram: Monthly active users
- Gaana.com: Avg time spent per user per week
- India Bank: Avg transaction value per month
- Whatsapp: Daily active users
- L1 metric (AAAERRG) (It can be assigned to various stakeholders to lead and track) - supporting metrics
- Important to track the health of the product / feature and is owned by various stakeholders
- Health of the product
- eg: Gaana.com
- Engagement - avg spent per user per week
- L2 metrics (more granular metric)
- Supporting metrics
- eg: gaana.com
- avg time spent per gender per week
- avg time spent per platform per week
- avg time spent per region
- Business: You tube going to launch shorts feature. As a data scientist, help youtube to define metrics.
- clarify
- why this feature
- Is this available over app or website
- Timeline of shorts
- Business goals
- Target customers
- Normal users - increase engagement
- content creators - more options to create content
- North Star Metric
- Avg time spent on youtube shorts by active users
- L1 Metric
- Awareness
- %age of active users using Youtube shorts
- L2 metric
- %age of male active users using youtube shorts
- Acquisition
- Avg AD spent per user(CAC)
- Avg revenue share with content creators
- Adoption
- Once the feature is launched and we have acquired a customer, has that customer re-used the feature in short window
- Avg no of user who used Youtube shorts in 24 hrs window after watching it for once
- Engagement
- The users are enjoying the feature or the product. If there is increase in usage in longer window(a week)
- Retention
- Capture if users are still using it or have stopped using it once the company has closed the marketing cycle.
- Revenue
- Avg no of active users per month(shorts)
- Guardrail metric
- #users uninstalls after feature install
- Flowchart of metrics for youtube shorts
- North star : Avg time spent on youtube by active users
- L1 - Reach/Awareness -> Avg no of active users using Youtube shorts
- L2 -> %age of male active users using Youtube shorts
-> %age of active users with between 18 to 40(genZ) - Engagement
- Avg no of users who used youtube shorts in a week
- #likes/shares/save in youtube shorts
- Guard rail(Business specific)
- Interview questions
- Metric definition based
- Metric change based
- Process
- Describe the features based on your understanding
- Determine goals -> major business goal
- Customer Journey -> add/define metric that are most relevant to customer journey
- map and quality ->
- Evaluate your metric -> get into conversation & correction of his/her feedback
- Recap
- Diagnose -> CRIED framework
- TROPICS for Internal data
- Defining metric
- Clarify
- Business Goals
- Define metrics (AAAERR)
- Summarize
- Product Metrics & KPI
- North Star (Every team in company aligns with)
- Monthly active users is North Star & 20K MAU is KPI
- L1(Each individual team focus on that metric, team/pillar level metric)
- L2(just for figuring out how things work/working)
- Fitness Industry
- Purefit has an app
- free - some videos & general videos
- paid - customized plan, expert advice session, all video access
- Retention is big challenge as it is not compulsory for life
- North Star metric: Consistent 2-month users
- Clarity - what, why , who, how - Target audience
- Buz goals
- Customer goals,
- Trainer goals- Revenue, Publicity/PR
- Define metric
- North Star metric: consistent for 2 months. 3 times a week & 3 weeks a month
- L1 metric: AAAERR
- Awareness - #users per month, #app downloads
- Acquisition - CAC, #customers from referral
- Adoption - #users who spent > 15min after installing app
- Engagement - Avg time spent, #logins, watch time, #video seen
- Adoption and Engagement is about using the features
Retention is driven from how good your product is - Retention: #users coming back on Nth day
- Engagement => No of classes attended/duration of class attended
- Revenue =>
- Guard rails
- User ratings
- feedbacks
- app uninstalls
- match rate between users and context
- avg time spent on videos
- avg time spent on recommended videos
- L2 metrics
- Finance/Fintech Industry
- Indian Bank -> Launching a mobile app. Identify when one can call the app as Success
- Goal: Design metric to check the success of app
- Clarify
- for whom - Existing customers + new age customers
- Why - competition
- what features - Banking, UPI, cards, Investments
- Security: OTP, Fingerprint
- Goals
- Existing Bank - Convenience, transactions
- New Customers - Ease of banking
- Metrics (only when you have metrics, you will understand what data to be collected. Only when you have data, then one will know success/failure)
- NS : Avg transaction through app/month
- Awareness: #downloads, % of app users
- Acquisition-> CAC
- Adoption -> Existing users converted to app users, new user acquired via app
- Engagement
- % customers using the app/banking services
- % users with >5 txns
- no of txns/user
- Bounce rate
- left after the 1st page
- left after install
- Guard rail metric
- % of failed txns
- %app crashes
- TAT - turn around time
- Task Completion Rate
- Cheque balance
- Transfer money
- Invest
- Credit card bill
- Abandon rate
- # of users abandoned after initiating
- Root Cause: Identifying why something is broken
- Root cause Analysis: Systematic process to identify root cause
- Goals of RCA
- Find the root cause
- Understand problem - design soln - fix the root cause
- Apply learnings for future -> robust testing
- CRIED framework
- Clarify - Define the problem properly
- Ruleout
- Technical glitches
- Planned events
- Seasonality
- Data problems - missing data, duplicate logs,
- Internal factors - new launch, app updates, UI changes
- external factors - Govt policy change, competitor,
- ECommerce
- Myntra - Problem: Decline from 5% to 3% in ordering rate. Ordering rate is total orders/total visitors on the website
- Data in ECommerce
- Click stream data - Online activity data - how many pages visited, buttons/links clicked, search term you type, Which item in search result is clicked, Scroll, hover your cursor on image, how much time spent on page, back button, right click
- Cookie - website will store something in your browser
- Benefits of Click Stream
- User Information helps in Personalization & identifying problems
- Defining User Journey/Routes
- Customer trends/Insights
- UX -> Do people like design of website/app
- Digital Marketing -> ads
- Clarity
- Ordering rate = no of orders / total sessions of the day
- What is a session / Visitors
- If there are 12 hrs gap then companies treat them as diff sessions
- Session Cookie - expiry time
- Bounce Session
- You open a page/website/app and leave with out doing any thing
- Session has 1 page view
- Bounce rate = Bounce session / Total Sessions
- Drop in Search to Product pages
- Not finding right product
- out of stock
- Product to Check out
- Extra charges
- price
- no offers
- External factors
- Competition - cheaper price on other sites
- fast delivery on other sites
- bad reviews
- large suppliers moving out
- Govt policies
- Check out to Payment page
- issue with Bank page
- OTP issue
- Payment gateway issue
- Transport/online Travel/Uber - why cancellations have increased
- Clarify
- What time
- what area do you see cancellations
- specific devices
- Driver asking for money
- New competition
- Which type of car is having more cancellations
- df=pd.read_csv('uber-data.csv', parse_dates=[4,5], dayfirst=True, na_values="NA")
- Solutions
- Extra incentive for driver to take airport cab in evening
- for airport, increase the distance of free car searching/checking
- Based on data, proactively put certain cars on the hot spots
- Cancellation charges to be borne by the driver
- cancellation based rating score
- More incentive in early morning hrs
- Internal data
- Looked at available data
- Created new features
- TROPICS/Framework for slicing
- Observations -> Root causes -> Solutions
- Direct causes
- Immediate factors impacting the problem
- Addressing direct causes solves the problem immediately
- look at visible effects/symptoms
- Root causes
- underlying factors
- Addressing root causes ensures that problem do not repeat
- look at reason behind a problem
- Competitor Analysis
- External Data
- Market presence
- Delivery logistics
- Product Range
- User Experience
- Support
- Marketing and Ads
- Return Policy
- Mobile Apps
- Offers and Discounts
- Payment Options
- Pricing
- CRM: Customer relationship model. A system that can manage any interaction between customer and company/business
- CRM Features & Functionalities
- Contact management
- Lead Management(Potential customers)
- Opportunity Management
- Sales Forecasting
- Mobile CRM
- Reports & Dashboards
- Sales Analytics
- Marketing Automation
- Sales Data
- Sales Force Automation
- Campaign Management
- Amazon - marketing team is given 50L budget to maximize revenue/profit using this budget
- Customer Segmentation(RFM- Recency, Frequency, Monetary)
- Age, gender, income -> Demographic segmentation
- Personality, lifestyle, fashion sense, Interest - Psychographic
- Purchase Profile / behavior -
- Frequent buyer -> Frequency
- last of day of purchase -> recency
- avg amt spent -> Monetary
- Brands/category
- time spent on website
- High margin items -> Monetary
- # of orders -> Frequency
- RFM values are low for all 3 then organizations consider them as Lost Customers
- Create a heat map with Recency on the x-axis and Frequency & Monetary on the y-axis
- You would get categories from Lost, Price sensitive, Can't lose them, Loyal, Champions
- Strategy for Marketing
- Acquire new users
- RFM with values 5 1 [4 5] -> new customers who have done high value purchase
- Retention
- RFM with values 1 4 5 ->
- If you want to optimize revenue/profit -> always target groups with high M(4,5)
- 5% of Indians spend 95% of online purchase
- Always choose groups where only one thing need to be improved
- Should not we incentivize customers with 5 5 5
- Give Early Access
- Exclusive Offers
- Premium Loyalty Programs
- Personalized recommendation
- Moderate-High R F M
- Valuable customers but require moderate marketing
- Discounts (Limited)
- Loyalty programs
- Moderate R F M
- Product Bundles
- Limited time deal
- Re-engagement emails/messages with offers/promotions benefits
- Moderate-low R F M
- win-back campaigns -> targeted mails
- Abandoned cart message
- General Personalized/free product descriptions
- Low R F M
- Customer survey
- Once in a moon offers(but don't spend too much money)
- lottery offers(spin wheel)
- R F M model doesn't work for Organizations who sell laptops like DELL
- R F M model works for electronics shop
- Recency definition will change nature of business(B2B vs B2C)
- Data Processing
- Calculate Recency, Frequency, monetary value
- Bin/Group/Quantile for R, F, M & give values between 1-5
- Convert RFM subsets -> logical sub
- Monetary value -> Unit price * qty
- Frequency -> count/month
- Recency -> last order date
- --Item level details
select InvoiceNo, StockCode,InvoiceDate,CustomerId,Quantity*UnitPrice as item_amount
from crm.sales a
--Order Level
select InvoiceNo, InvoiceDate,CustomerId, sum(Quantity*UnitPrice) as order_amount
from crm.sales a group by InvoiceNo, InvoiceDate,CustomerId
---
with orders as ( select InvoiceNo, InvoiceDate,CustomerId, sum(Quantity*UnitPrice) as order_amount from crm.sales a group by InvoiceNo, InvoiceDate,CustomerId)
select customerId, a.monetary, date_diff(b.last_date_overall, a.last_order, DAY) as recency
from (select customerId, sum(order_amount) as monetary, max(invoicedate) as last_order,
min(Invoicedate) as first_order from orders group by CustomerId) a,
(select max(invoiceDate) as last_date_overall from orders) b - sum(orders.order_amount) as monetary, max(InvoiceDate) as last_order,
from orders a,
(select max(InvoiceDate) as last_date from orders) b
group by CustomerId
--item level table
SELECT InvoiceNo,StockCode,InvoiceDate,CustomerID,
Quantity*UnitPrice as item_amount
from crm.sales a ;
-- order level table
SELECT InvoiceNo,InvoiceDate,CustomerID,
SUM(Quantity*UnitPrice) as order_amount
from crm.sales a
group by InvoiceNo,InvoiceDate,CustomerID;
-- order level table
with orders as ( SELECT InvoiceNo,InvoiceDate,CustomerID,
SUM(Quantity*UnitPrice) as order_amount
from crm.sales a
group by InvoiceNo,InvoiceDate,CustomerID),
customers as ( SELECT a.CustomerID,a.monetary,
date_diff(b.last_date_overall,a.last_order,DAY) as recency,
a.total_orders/(date_diff(DATE(a.last_order),DATE(a.first_order),month) + 1) as frequency
from
(select CustomerID,
sum(order_amount) as monetary,
count(distinct InvoiceNo) as total_orders,
max(InvoiceDate) as last_order,
min(InvoiceDate) as first_order
from orders group by CustomerID) a ,
(select max(InvoiceDate) as last_date_overall from orders) b)
select
*,
ntile(5) over (order by customers.monetary asc) as m_score,
ntile(5) over (order by customers.recency desc) as r_score,
ntile(5) over (order by customers.frequency asc) as f_score
from customers;
with orders as ( SELECT InvoiceNo, InvoiceDate, CustomerID,
SUM(Quantity*UnitPrice) as order_amount
from crm.sales a group by InvoiceNo,InvoiceDate,CustomerID),
customers as (SELECT a.CustomerID,a.monetary,
date_diff(b.last_date_overall,a.last_order,DAY) as recency,
a.total_orders/(date_diff(DATE(a.last_order),DATE(a.first_order),month) + 1) as frequency
from (select CustomerID,
sum(order_amount) as monetary,
count(distinct InvoiceNo) as total_orders,
max(InvoiceDate) as last_order,
min(InvoiceDate) as first_order
from orders group by CustomerID ) a ,
(select max(InvoiceDate) as last_date_overall from orders) b),
boundaries as (select
approx_quantiles(monetary,5) as m_boundary,
approx_quantiles(recency,5) as r_boundary,
approx_quantiles(frequency,5) as f_boundary
from customers),
rfm as (select a.*,case when a.monetary<= b.m_boundary[offset(1)] then 1
when a.monetary<= b.m_boundary[offset(2)] then 2
when a.monetary<= b.m_boundary[offset(3)] then 3
when a.monetary<= b.m_boundary[offset(4)] then 4
when a.monetary<= b.m_boundary[offset(5)] then 5
END as m_score,
case when a.recency<= b.r_boundary[offset(1)] then 5
when a.recency<= b.r_boundary[offset(2)] then 4
when a.recency<= b.r_boundary[offset(3)] then 3
when a.recency<= b.r_boundary[offset(4)] then 2
when a.recency<= b.r_boundary[offset(5)] then 1
END as r_score,
case when a.frequency<= b.f_boundary[offset(1)] then 1
when a.frequency<= b.f_boundary[offset(2)] then 2
when a.frequency<= b.f_boundary[offset(3)] then 3
when a.frequency<= b.f_boundary[offset(4)] then 4
when a.frequency<= b.f_boundary[offset(5)] then 5
END as f_score
from customers a,boundaries b),
rf as ( select *,ROUND((f_score+m_score)/2,0) as fm_Score from rfm )
select * ,
CASE
WHEN (r_score = 5 AND fm_score = 5) OR (r_score = 5 AND fm_score = 4) OR (r_score = 4
AND fm_score = 5) THEN 'Champions'
WHEN (r_score = 5 AND fm_score =3) OR (r_score = 4 AND fm_score = 4) OR (r_score = 3
AND fm_score = 5) OR (r_score = 3 AND fm_score = 4) THEN 'Loyal Customers'
WHEN (r_score = 5 AND fm_score = 2) OR (r_score = 4 AND fm_score = 2) OR (r_score = 3
AND fm_score = 3) OR (r_score = 4 AND fm_score = 3) THEN 'Potential Loyalists'
WHEN r_score = 5 AND fm_score = 1 THEN 'Recent Customers'
WHEN (r_score = 4 AND fm_score = 1) OR (r_score = 3 AND fm_score = 1) THEN
'Promising'
WHEN (r_score = 3 AND fm_score = 2) OR (r_score = 2 AND fm_score = 3) OR (r_score = 2
AND fm_score = 2) THEN 'Customers Needing Attention'
WHEN r_score = 2 AND fm_score = 1 THEN 'About to Sleep'
WHEN (r_score = 2 AND fm_score = 5) OR (r_score = 2 AND fm_score = 4) OR (r_score = 1
AND fm_score = 3) THEN 'At Risk'
WHEN (r_score = 1 AND fm_score = 5) OR (r_score = 1 AND fm_score = 4) THEN 'Cant Lose Them'
WHEN r_score = 1 AND fm_score = 2 THEN 'Hibernating'
WHEN r_score = 1 AND fm_score = 1 THEN 'Lost'
END AS rfm_segment
from rf
Quartile/Percentile
- Percentile -> %values less than or equal to given value
- Quartile/Percentile requires sorting which is expensive. Hence Approx Quantile is introduced
- Approx Quantile(G K Algorithm)
- Greenwald-Khanna Algorithm - It calculates approximate quantiles, which is about 10 times faster, but has a small error/delta wrt the exact ntile
- approx_percentile in Oracle sql
- boundaries as (select approx_quantiles(monetary,5) as m_boundary,
approx_quantiles(recency,5) as recency,
approx_quantiles(frequency,5) as frequency
from customers)
select a.*,b.* from customers a, boundaries b - ntile -> sorts the data, calculates the boundaries, assigns a score/group to each row (see the pandas sketch below)
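- The same R/F/M scoring can be sketched in pandas with quantile bins (illustrative numbers; assumes a customers table like the SQL above)
import pandas as pd
customers = pd.DataFrame({
    "recency":   [5, 40, 200, 10, 90],
    "frequency": [12, 3, 1, 8, 2],
    "monetary":  [900, 150, 40, 600, 120],
})
customers["r_score"] = pd.qcut(customers["recency"], 5, labels=[5, 4, 3, 2, 1])    # more recent -> higher score
customers["f_score"] = pd.qcut(customers["frequency"], 5, labels=[1, 2, 3, 4, 5])
customers["m_score"] = pd.qcut(customers["monetary"], 5, labels=[1, 2, 3, 4, 5])   # real data with ties may need duplicates='drop'
print(customers)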
- Industry standard segmentations
- Technographic Segmentation -> gadgets, online services and softwares
- Behavioral Segmentation ->
- Needs-based Segmentation -> budget friendly, back pain, broken leg, chronic decease
- Customer Status -> leads, new customer, loyal/long time, at risk, churned
- A/B Testing: Dividing sample into 2 groups randomly. This random division and testing is called A/B testing.
- Suppose a company made a new drug for fever better than paracetamol
- Take sample of 100 people(sample). Divide into 2 groups and give new drug and Paracetamol. Measure #daystorecover
- Case Study: Facebook is planning to launch new feature where one can choose background when posting a status
- Clarify
- What is the objective - "More engagement"
- Has this been tested before??
- Is there external proof if this works?
- Who will this feature be applicable for (everyone, subset)?
- Product Management - Thinks about feature
Data Science - Designs the experiment + Insights
Engg - Implement - Metrics
- North Star Metric - %age of engaged users (liked, commented, posted, reacted recently in a session) who also spent 2 mins on a post
- Supporting Metrics - Daily active users
- Guard rails: These metrics should not degrade
- Avg time spent per user per week
- % of users consuming rich media
- revenue/user
- Designing experiment
- Ho => Pa = Pb, Ha is Pa != Pb
Pa= #engaged users/#total users - Choice of test
- 2 numeric values -> 2 sample t test
- Choose experiment control/Test object
- Sample size calculator
- Metric for the central(base line metric)
- alpha = 0.05
- Minimum detectable effect -
- What change is considered meaningful for the business to take action
- Experiment Duration
- daily 5k customers
sample size- 80K - duration should be sample size/daily customers = 80K/5K = 16 days (see the sketch below)
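- One way to test Ho: Pa = Pb at the end of the experiment is a two-proportion z-test (statsmodels assumed; the counts here are illustrative)
from statsmodels.stats.proportion import proportions_ztest
engaged = [4100, 4280]        # engaged users in control (A) and test (B)
total   = [40000, 40000]      # users exposed in each group
z_stat, p_value = proportions_ztest(count=engaged, nobs=total)   # two-sided: Ha is Pa != Pb
print(z_stat, p_value)        # reject Ho if p_value < alpha (0.05)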
- Pitfalls/Problem with A/B testing
- Primacy effect
- People are reluctant to change
- Novelty effect
- Due to hype, initially a lot of people use it. Which can increase the test engagement. eg: CRED upi
- Due to above the results may seem to be undermined/exaggerated
- Solution
- Run for longer time
- Conduct test only for new users
- Network Effect
- Happens in social media as one can see other posts
- Ensure that such impact is minimized
- Outcome Bias
- What other factors might be causing it
- Note: 90% of A/B test fails
- Launch Recommendation
- Calculate the monetary/revenue impact
- 0.4% increase in user engagement then 0.1% increase in revenue
- Calculate overall revenue impact on the entire population
- Check the cost of launching to everyone
- Infra,(hardware, api)
- Long-term Impact(10-12 down the line)
- Ensure Guard rail metrics don't go down
- It is a ride-sharing platform. Electrical bikes are stored at hotspots; users unlock a bike using the app, move from one place to another and pay for the usage. The ask is to find out what drives rentals - Weather, Season, Holiday, weekend,
- import pandas as pd
df=pd.read_csv("bike_sharing.csv")
df['weather'].value_counts()
import seaborn as sbn
sbn.boxplot(x='workingday', y='count', data=df) #check the mean, outliers
#should you remove outliers-> No, as data will be biased
#Esp for Hypothesis, do not remove
Ho=The count of bikes on Working day <= the count on non-working day
Ha=The count of bikes on Working day > the count on non-working day
# t-test vs z-test -> Pop std dev not known, sample size big. Hence, both are same
working=df[df['workingday']==1]['count']
non_working=df[df['workingday']==0]['count']
df.groupby('workingday')['count'].describe()
from scipy.stats import ttest_ind
test_stats, p_val = ttest_ind(working, non_working, alternative='greater', equal_var=False)
p_val<0.05 => hence, working day /nonworking day has impact
#check the effect of Weather. One categorical and other numerical. Hence, use Anova
w1=df[df['weather']==1]['count'].sample(800)
w2=df[df['weather']==2]['count'].sample(800)
w3=df[df['weather']==3]['count'].sample(800)
#Anova
Ho=the count of bikes are independent of weather
Ha=the count of bikes is affected by weather
#assumptions of anova
#1 Normal -> QQPlot, DistPlot, Shapiro
#2 Should have equal variance -- No, describe, LEVENE
import seaborn as sbn
sbn.distplot(w1)
sbn.distplot(w2)
sbn.distplot(w3)
#all above are right skewed
import numpy as np
from scipy.stats import shapiro
t_test, p_value = shapiro(w1)
from scipy.stats import levene
t_test, p_value = levene(w1,w2,w3)
#Kruskal Wallis Test
#Link for code
from scipy.stats import f_oneway
t_test, p_value = f_oneway(w1,w2,w3)
p_value < 0.05 -> hence, weather is impacting
- Estimates with sensible assumptions/guesswork
- need not be perfect
- Thought process is correct or not
- why guesstimate questions
- ability to break down open-ended problems into smaller chunks
- Qn: Calculate how many flights depart per day from Delhi airport
- Break the problem into small problems. Take guesses on each of these parts. Combine the results of the parts
- Clarify
- Domestic vs International -> 80:20
- Passenger vs Cargo
- all flight carriers
- Breakdown
- Domestic vs International
- Passenger vs Cargo
- Peak hours/normal hours/non-operational hrs
- weekend vs weekday
- festive vs non-festive
- Make Assumptions
- Peak hrs (5-9 am, 7-10pm)
- normal hrs
- non-operational hrs(1am - 2am)
- Guess/Calculate
- use beautiful number (2 & 10s)
- for breaking down use %ages
- Domestic -> 10 per hr, & 5 international per hr
= 10*24 + 5*24 = 360 - Conclude
- Case Study
- Games24x7 wants to run a tournament where the 1st prize is 1 lakh and the 2nd is 50K. What should be the per-game entry fee for users?
- Clarify
- Mobile solo game
- Arcade(5 small games), Tournament (paid)
- one Tournament per day
- Ads - free users see ads
- Royalty cards/skin/customizations - users pay for it
- Expenditure
- Prize money
- Operations(Server hosting, maintenance cost)
- Promotions (YouTube/Twitch, influencer marketing)
- Total active users vs paying (fee) users
- % of total users using royalty
- Per user Royalty revenue
- Ads(5 free arcade games-> 30 sec)
- How many ads per game
- Server Cost(5 lac per month)
- Maintenance cost - 5l/month
- Promotions
- Revenue/month
- Fees = x * 5000 * 0.2 * 30
- Royalty - 200 * 5/100*5K
- ads: 50 * 1 * 4K * 30
- Expenditure
- prize money = 30 * 1.5L = 4.5 l
- server maintenance= 10L
- ads= 6L
- Revenue = 30x + 0.5+60 => x=1.5 rs
- Analyze
- breakdown
- Calculate
- validate
- How many IPhone users are there in India
- Clarify
- Market share - 2%
- apple subscriptions
- Total population - 1.5 B and mobile users 0.8B
- rich vs poor
- iphone vs android(20:80)
- urban vs rural
- age group
- income
- Total population
- 40% of the population are kids & old people, so the remaining ~60% => ~0.84 billion
- 70% of people in this age group lie in upper/middle class => 0.6B
- 10% of people prefer it => 0.06 Billion
- Guess how many refrigerators are sold in India every year?
- Vanity metric
- good-to-have metric, but it doesn't directly impact the overall usage
- BCG case study
- Sales data
- Marketing
- Click Stream
- Estimate the revenue earned by Google via their AdSense product
- AdSense -> an ad network: Google can show ads on websites/apps that Google does not host. The owner of a website/app can let Google place ads on their site
- count of companies in India -> 3.4 lakh digital companies in India
- #website holders-> 1.5 to 2 cr
- Google AdSense is the intermediary between publishers and companies (advertisers)
- Publishers
- News sites
- Podcasts
- Bloggers
- E-commerce -> smaller ones use AdSense
- AdSlots
- Why ads ? To sell product OR acquire a customer
- Basic Terms
- CPC (Cost Per Click)
- Cost charged to the brand for every click on the ad
- It is Google's responsibility to show the correct set of ads on the correct set of websites
- Google has Quality Score. It maintains score for both Publisher, Ads
- Based on above CPC is charged
- CPM(Cost per Mille)
- Cost for every 1000 views/impressions
- CPM is much lower than CPC
- CTR(Click thru rate)
- #clicks/#impressions
- Framework
- Ask Qn
- Where ads will be published
- all websites apart from google products
- all type of publishers
- How long
- one year
- Revenue means
- overall amount that brand will pay
- High level understanding of problem or metric
- #publishers * revenue/day/publisher * 365
- State assumptions
- Avg cpc, cpm, ctr
- 80% of revenue comes from top 20%
- Estimation tree
- top 20% publishers * revenue/day/publisher * 365
- Click ad revenue
- #visitors * #clicks/visitor/day * CTR * CPC
- Total ad views * CTR
- Total visitors * ad views/visitor * CTR * CPC
- #ads seen by visitor in a day
- Total pages * Ads per page * % of CPC ads
- Every day 1000 users visit, and every user views 5 pages. A user will see 10 CPC and 5 CPM ads.
- Revenue per user would be 10 * CTR * CPC  // click-ad revenue
- Impression ads revenue
- #visitors * impressions/visitor * CPM/1000
- impressions/visitor = #total pages * ads per page * % of CPM ads
- Bottom-up (calculating/plugging in the values)
- Sanity check/validate
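- A minimal bottom-up sketch of the estimation tree above; every number here (publisher count, visitors, CTR, CPC, CPM, ad-slot split) is an assumed placeholder, not a real figure:
publishers = 2_000_000           # assumed active AdSense publishers
visitors_per_publisher = 1_000   # assumed daily visitors per publisher
pages_per_visitor = 5
ads_per_page = 3
cpc_share, cpm_share = 0.4, 0.6  # assumed split of CPC vs CPM ad slots
ctr = 0.01                       # assumed 1% click-through rate
cpc = 0.2                        # assumed $ per click
cpm = 1.0                        # assumed $ per 1000 impressions
ad_views = visitors_per_publisher * pages_per_visitor * ads_per_page
click_rev = ad_views * cpc_share * ctr * cpc  # click-ad revenue per publisher per day
impr_rev = ad_views * cpm_share * cpm / 1000  # impression-ad revenue per publisher per day
annual_revenue = publishers * (click_rev + impr_rev) * 365
print(f"Estimated annual revenue: ${annual_revenue/1e9:.1f} B")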
- Similar qns on Guesstimate
- website traffic
- traffic on signal
- people visiting a restaurant
- Revenue of zomato
- Items sold
- Why do flights overbook
- No shows
- late
- cancel
- connecting flights
- Lufthansa -> 4.9 million passengers didn't show up in 2005. They re-sold 570K of those seats and earned $105M
- Passenger Bumping
- refund
- penalty
- arrange next immediate flight
- Optimum number of Overbookings(Maximize the profit for the company)
- Approach (a rough sketch follows below)
- Number of seats on the flight - 100
- Historical data of no-shows (no-show probability)
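- A minimal sketch of the expected-profit approach: assume 100 seats, a historical show-up probability, a fare and a bump cost (all assumed numbers), and pick the number of tickets that maximizes expected profit:
from scipy.stats import binom

seats, fare, bump_cost = 100, 5000, 12000  # assumed values
p_show = 0.92                              # assumed probability a ticketed passenger shows up

def expected_profit(tickets_sold):
    profit = 0.0
    for shows in range(tickets_sold + 1):
        prob = binom.pmf(shows, tickets_sold, p_show)
        bumped = max(0, shows - seats)     # passengers we have to compensate
        profit += prob * (tickets_sold * fare - bumped * bump_cost)
    return profit

best = max(range(seats, seats + 21), key=expected_profit)
print("Tickets to sell that maximize expected profit:", best)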
- Airbnb (online marketplace for bed and breakfast stays)
- Recommend what should be the optimum/recommended and minimum number of photos a host has to upload
- Data given has the following fields
- listingId, PostingDate, posting_time, location, Images, Bookings, Host_type
- Host_type has Regular and Superhost types. Regular hosts have 1-2 listings; Superhosts are the ones who have many properties listed
- Date, Open_listing_0_2, Open_listing_3_5, Open_listing_6_10, Open_listing_11_15, Open_listings
- listings bucketed by number of photos
- Property_image, Total_listing, Redundant_listing, non redundant listing, % of redundant listings
- redundant listing -> no reservation in the last one year; open listing -> no reservation for the given day
- Problem Statement: Recommend the Optimum images count and minimum images count
- Results
- Min images - 6 & optimum range - 11-15
- General trend
- highest monthly avg bookings for this range
- lowest number of open listings
- low redundancy
- for most buckets, 11-15 photos is the most-booked listing range
- Assumptions
- Images are the main reason for booking
- Image quality is not considered
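- A minimal pandas sketch of the kind of aggregation behind these results; the file name and the 'Images'/'Bookings' column names are assumptions based on the fields listed above:
import pandas as pd

df = pd.read_csv("airbnb_listings.csv")  # hypothetical file with the fields above
bins = [0, 2, 5, 10, 15, float("inf")]
labels = ["0-2", "3-5", "6-10", "11-15", "15+"]
df["image_bucket"] = pd.cut(df["Images"], bins=bins, labels=labels, include_lowest=True)
print(df.groupby("image_bucket")["Bookings"].mean())  # compare avg bookings per photo bucket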
- Interpreted language, like JS
- Line-by-line execution
- High-level languages - Python, JS
mid-level languages - C++, Java
Assembly code
machine code (byte code)
- Everything is an Object in Python
- since everything is an object, it will have associated properties & behavior
- Class of an object
- type()
- mutable vs immutable
- a=4
print(id(a))
a=5
print(id(a))  # id changes because ints are immutable; a is rebound to a new object
- Iteration Protocol
- The entire process of visiting each item once is called iteration
- Iterable -> collection of items
Iterator -> pointer which points to the items
iteration -> process of going over all items one by one
- s="hello"
itr=iter(s)
print(next(itr))
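- A small sketch of what a for loop does under the hood with the iteration protocol (keep calling next() until StopIteration):
s = "hello"
itr = iter(s)           # get an iterator from the iterable
while True:
    try:
        ch = next(itr)  # ask the iterator for the next item
    except StopIteration:
        break           # iterator exhausted -> loop ends
    print(ch)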
- Data Structures
- Comprehension
- Strings
- Memory in python
- CPU
- 9.5GHz -> Operations per sec
- Algorithms are analyzed with number of operations
- https://www.youtube.com/watch?v=HyznrdDSSGM&list=PLowKtXNTBypGqImE405J2565dvjafglHU
- Multiple Inheritance
- Functional Programming
- Paradigm of writing code
- Code in functional programming can be thought of as a sequence of multiple functions
- Why to use it
- reuse the code
- lambda function
- One line functions
- anonymous
- onetime use
- sq = lambda x : x**2
- Higher order function
- They return another function
- They take another function as input (see the sketch below)
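- A quick sketch of both directions: map takes a function as input, and multiplier (a made-up helper) returns a new function:
nums = [1, 2, 3, 4]
print(list(map(lambda x: x**2, nums)))  # map is a higher-order function: it takes a function as input

def multiplier(n):
    return lambda x: x * n  # returns a new one-line function

double = multiplier(2)
print(double(10))  # 20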
- Decorators
- decorate the functions
- adding more functionalities
- def foo():
    print("Hello everyone!")
- def pretty(func):
    def inner():
        print('-'*50)
        func()
        print('-'*50)
    return inner
- pretty(foo)()  # call the returned inner() to see the decorated output
- def best(func):
    def inner():
        print('we are the best')
        func()
        print('we are best')
    return inner
@best
def greeter():
    print("good evening")
greeter()
Output
we are the best
good evening
we are best
- Args & KWARGS
- def custom_sum(a,b,*args):
    print(f"a - {a}")
    print(f"b - {b}")
    print(f"args - {args}")
- custom_sum(5,6,7,8)
a - 5
b - 6
args - (7, 8)   # *args collects the extra positional arguments into a tuple
- x,y,z, *more = (2,3,4,5,6,6,7,9)
- Kwargs
- def create_person(name, age, gender):
    Person = {
        "name": name,
        "age": age,
        "gender": gender
    }
    return Person
- def create_person(name, age, gender, **kwargs):
    Person = {
        "name": name,
        "age": age,
        "gender": gender
    }
    Person.update(kwargs)  # merge any extra keyword arguments into the dict
    return Person
- kwargs -> keyworded arguments; **kwargs collects them into a dict
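- A quick usage sketch of the **kwargs version above; the extra fields (city, hobby) are just illustrative keyword arguments that get merged into the dict:
p = create_person("Asha", 30, "F", city="Pune", hobby="chess")
print(p)  # {'name': 'Asha', 'age': 30, 'gender': 'F', 'city': 'Pune', 'hobby': 'chess'}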
- one or more py files make a module (eg: math)
one or more modules make a package/library (eg: pandas)
- Modules
- import math
- Problems
- it imports the entire module
- one have to write math. before every function
- from math import *
- math. is not required before function call
- Problems
- name collisions: imported names can silently override existing ones
- from math import factorial, ceil, floor, pi
- pi
- ceil
- import math as m
- best method to import
- from math import factorial as fact, ceil as c, floor as f, pi as p
- import random
random.seed(100)
random.randint(0, 10)
- import requests
url = "http://...sample.jpeg"
res = requests.get(url)
with open("sample.jpeg","wb") as img: #wb -> write binary
    img.write(res.content)
- file = open("scaler.txt","w")
file.write("first line")
file.close()
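- A small sketch reading the same file back with a context manager, so the file is closed automatically:
with open("scaler.txt", "r") as f:  # "r" -> read mode
    for line in f:
        print(line.strip())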
- Conceptualize -> Visualize -> Math -> Code
- Fish Sorting Example
- OLS -> Ordinary Least Squares
- Properties is considered as Features(Independent variables) & outcome/what we predict is Target(Dependent Variable)
- Input -> Model -> Output
- Process of building an ML algorithm
- Data Collection
- Data Visualization -> Plot, PCA, TSNE -> reduces dimensions
- Choosing an appropriate Geometrical structure to separate classes
- Choosing a LOSS function which helps decide the best structure. (sum of distance of data from line)
- Training/optimization -> Gradient descent
- Coordinate Geometry
- Straight line -> y=mx+c
- where m is slope
- c is y intercept (when x is 0)
- General equation of line is
w1x+w2y+w0 = 0
w2y = -w1x -w0
y = -(w1/w2)x - (w0/w2)
slope = -w1/w2 & intercept = -w0/w2
- For parallel lines the slopes m1, m2 are equal
for perpendicular lines m1 * m2 = -1
- 2 dimensions - line => w1x1 + w2x2 + w0 = 0
3 dimensions - plane => w1x1 + w2x2 + w3x3 + w0 = 0
4 dimensions - hyper plane (higher dimensional plane)
- Vectors
- Ordered set of numbers
- represented by x bar [x1 x2]
- Magnitude -> distance from origin
- Magnitude of x bar is sqrt(x1^2 + x2^2)
- Norm of a vector -> Magnitude or length of vector
- L2 norm
- length of distance between two points(Euclidian distance)
- sqrt((x2-x1)^2 + (y2-y1)^2)
- L1 norm
- Manhattan distance
- |x2-x1| + |y2-y1|
- Dot Product
- For vectors (a1,b1) & (a2,b2) the dot product is
a1*a2 + b1*b2
- a.b = |a| * |b| * cos(theta) (theta is the angle between the two vectors)
- Angle between two vectors
- If the dot product of two vectors is 0 then they are perpendicular to each other (cos 90 = 0)
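- A minimal numpy sketch of the quantities above: L1/L2 norms, dot product, and the angle between two vectors:
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, -3.0])
print(np.linalg.norm(a))         # L2 norm = 5.0
print(np.linalg.norm(a, ord=1))  # L1 norm = 7.0
print(np.dot(a, b))              # 0.0 -> the vectors are perpendicular
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))  # 90 degrees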
- Matrix Multiplication
- Unit Vectors
- Vectors have both magnitude as well as direction
- Unit vectors are the ones with magnitude as 1
- represented by x hat
- norm of x hat is 1
- They are used to represent direction
- Vector Projection
- Norm of a vector
- It is the distance from the origin, i.e., the length/magnitude of the vector
- sqrt(x1^2 + x2^2...)
- Manhattan distance (L1 norm) = |x1| + |x2| + ... + |xn|
- Dot product between two vectors
- a transpose * b -> matrix multiplication
- which is always equal to norm of a * norm of b * cos theta
- if the angle between two vectors is acute then dot product is +ve
- If dot product of two vectors is zero, then they are perpendicular to each other
- when the angle is between 90 and 180 degrees, the dot product is negative
- Relation between Weight Vector and hyper plane
- The dot product of the weight vector and the x vector is 0 when the line passes through the origin
- Which means they are perpendicular/orthogonal to each other
- Recap
- Loss function
- Perceptron Algorithm
- Recap
- eta -> learning rate
- Perceptron learning algorithm
- Problem solving
- Mathematical representation of classification problem
- Gradient descent is an algorithm for optimization
- Calculus topics
- Maxima, minima
- calculus in multi variable
- calculus in singlevariable
- derivative, slope, tangent
- limits, continuity, differentiability
- functions
- Functions
- Domain: All the possible values that the input can take
- Range: Collection of all possible outputs
- Sigmoid function
- y=1/(1+e^-x)
Domain is (-infinity, +infinity)
Range is (0,1)
- sin function
- y=sin x
Domain is (-infinity, +infinity)
Range is [-1,1]
- cos function
- y=cos x
Domain is (-infinity, +infinity)
Range is [-1,1]
- tan function
- y=tan x
Domain is all reals except odd multiples of pi/2
Range is (-infinity, +infinity)
- Signum/Step function
- y=1 when x>0
y=-1 when x<0
y=0 when x=0
Domain is (-infinity, +infinity)
Range is {-1, 0, 1}
- Limits
- x* = argmin (x-2)^2
means find the value of x such that (x-2)^2 is minimum
value is 2 in this case
- What is the value of x+2 as x approaches 1
ans: 3
- limit of (x^2-1)/(x-1) as x tends to 1
ans is 2
- Continuity
- Not a continuous function
- Signum/Step function
- y=1 when x>0
y=-1 when x<0
y=0 when x=0
- y=x^2 for all x except 0; at x=0 it is defined as 2
- Condition for Continuity
- At every point x0 in its domain, RHL = LHL = f(x0)
- i.e., lim(x -> x0-) f(x) = lim(x -> x0+) f(x) = f(x0)
- Differentiation
- If there are 2 points on a straight line, (x1,y1) & (x2,y2), the slope of the straight line is
tan(theta) = (y2-y1)/(x2-x1)
- Differentiability
- Rules of Differentiation
- Derivatives for optimization
- Rules of differentiation
- use of derivatives
- for minima
- dy/dx is 0 and d^2y/dx^2 > 0
- for maxima
- dy/dx is 0 and d^2y/dx^2 < 0
- for saddle point
- dy/dx is 0 and d^2y/dx^2 = 0 (the test is inconclusive)
- maxima/minima/saddle point are called critical points
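- A quick sympy sketch of the first/second derivative test on f(x) = (x-2)^2 from the earlier argmin example:
import sympy as sp

x = sp.symbols('x')
f = (x - 2)**2
critical = sp.solve(sp.diff(f, x), x)  # dy/dx = 0 -> [2]
print(critical, sp.diff(f, x, 2))      # second derivative = 2 > 0 -> minimum at x = 2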
- Intro to multi-variable calculus
- Partial derivative
- Intro to gradients
- Gradient represents the direction with steepest increase.
- Gradient descent intuition
- Generalization of G.D
- Gradients of some common functions
- Constrained optimization
- Types of Gradient descent
- Batch/Vanilla
- We use the entire dataset to update the w vector at each iteration
- This is a very slow process and a lot of computation is required
- Mini Batch gradient descent
- Instead of taking all the data points, you take a subset of data points
- We choose K data points randomly where K<N
- Updates will be faster
- Stochastic Gradient Descent
- We only take one data point to update the weights
- Batchsize k=1
- epoch -> the iterations needed to cover the entire dataset once (a sketch of the three update styles follows below)
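- A minimal sketch of the three update styles on a toy squared-loss model y ~ w*x; the data, learning rate and epoch count are made-up:
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3 * x + rng.normal(scale=0.1, size=200)  # true w = 3

def gradient_descent(batch_size, lr=0.05, epochs=50):
    w, n = 0.0, len(x)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = np.mean(2 * (w * x[b] - y[b]) * x[b])  # gradient of squared loss w.r.t. w
            w -= lr * grad
    return w

print(gradient_descent(batch_size=len(x)))  # batch / vanilla GD: one update per epoch
print(gradient_descent(batch_size=32))      # mini-batch GD: K points per update
print(gradient_descent(batch_size=1))       # stochastic GD: one point per update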
- Challenges when more dimensions(50+) are present
- Visualizations, Computations, trainings will be large and difficult
- Curse of dimensionality
- Maths becomes complex
- Data becomes sparse and distances become less meaningful
- Steps
- Find the mean of the data and shift origin to the mean
- Rotate the axes such that the x-axis is in the direction where variation is maximum
- PCA
- Reduces dimensions
- If we plainly take subset of features then we will be losing lot of information
- In PCA, we are reducing features but trying to retain as much info as possible
- PCA works well when features are correlated
- Implementing PCA
- Standardization of data
- Find the direction with maximum variance
- Take a unit vector u
- project all the data points onto u
- maximize the sum of squared projections (the variance along u)
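- A minimal sklearn sketch of the steps above (standardize, fit PCA, check retained variance); the data here is random placeholder data just to show the API:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 10))  # placeholder data with 10 features
X_std = StandardScaler().fit_transform(X)            # standardization step
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)                 # project onto the top 3 directions of max variance
print(X_reduced.shape)                               # (500, 3)
print(pca.explained_variance_ratio_)                 # fraction of variance retained per component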
- Recommendation: movie suggestions, ad suggestions
- time series forecasting: stock price prediction, sales prediction, freight prediction, demand prediction
- Supervised vs Unsupervised learning
- Classification/Regression -> supervised learning -> Training data -> target value/ labels are provided. Figure out relationship between features and target value.
- semi supervised learning
- reinforcement learning - algorithm will create its own features and
- unsupervised learning
- no target data is given
- no relation between features and target value
- clustering
- Recommendations
- similarity between data points
- What do you think about the nature of Car Resale price prediction?
- Predicting a continuous value, hence Regression (a discrete value would be classification)
- Linear Regression Example
- Dataset: Cars24 used car dataset
- Features: year, km driven, mileage, rate, model
- Task: Predict selling price of used car
- Experience -> Training data
- In this lecture we are not developing math.
- Linear regression implementation using the Scikit-learn (sklearn) library
- Steps
- Raw data -> preprocessing
- outlier removal, missing values treatment
- Categorical -> Numerical
- EDA -> Exploratory data analysis
- Feature Engineering -> new features from raw data
- train-test split
- data normalization(scaling of data)
- Techniques to convert Categorical to Numerical data
- one hot encoding, label encoding, target encoding
- Feature normalization(Scaling)
- Bring all the features to the same scale
- Standardization -> z = (x - mu)/sigma
- min-max scaling -> (x - min)/(max - min)
- from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
x = scaler.fit_transform(df.iloc[:,1:])             # scale all feature columns to [0, 1]
x = pd.DataFrame(x, columns=df.iloc[:,1:].columns)  # back to a DataFrame with the original column names
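- A minimal end-to-end sketch of the steps listed above (split, scale, fit, evaluate), assuming df already has numeric features and a 'selling_price' target column (the column name is an assumption based on the task):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = df.drop(columns=["selling_price"])  # assumed target column name
y = df["selling_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler only on train data to avoid leakage
X_test = scaler.transform(X_test)

model = LinearRegression().fit(X_train, y_train)
print(r2_score(y_test, model.predict(X_test)))  # R^2 on the held-out test set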