Title: INTRODUCTION TO STATISTICS IN DATA SCIENCE
Description: INTRODUCTION TO STATISTICS IN DATA SCIENCE

Extracts from the notes are below, to see the PDF you'll receive please use the links above

INTRODUCTION TO STATISTICS IN DATA SCIENCE
The topics will be divided into seven days and the materials will be uploaded here after the
session
...
This week we are focusing on stats. Next week we'll be having continuous live sessions on blockchain, then Android, then DSA, DevOps, Python, and Power BI. The projects you have written about cover both machine learning and deep learning
...
Statistics with respect to data science is divided into two concepts, descriptive stats and inferential stats. In descriptive stats, one of the topics that I really want to mention is measures of central tendency
...
On the fourth or fifth day you will be solving some amazing, complex problems
...
We'll try to find out how to determine whether a distribution is a normal distribution or not. All of that we will try to discuss. These are some of the topics that I have written down
...
What is statistics? Statistics is the science of collecting, organizing, and analyzing data
...
The most intrinsic meaning of data is that it can be measured. One example is the IQ of a class. Suppose I want to give one more example: the ages of students
...
Inferential stats is a technique where we use the data that we have measured to form conclusions
...
I'll definitely go a little bit slow for the first two days, but afterwards, since I need to complete this in seven days, I'll go a little bit fast. I'll come back to help you understand descriptive stats when we deep-dive into the various topics. Now it is time that we really understand population and sample. Population basically means, let's consider one example again
...
There are different kinds of sampling techniques
...

The first sampling technique, which is used most of the time, is called simple random sampling
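To make simple random sampling concrete, here is a minimal Python sketch. The `ages` list and the helper name `simple_random_sample` are made up for illustration and do not come from the session; the only idea shown is that every member of the population has the same chance of being picked.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k members without replacement; each member is equally likely."""
    rng = random.Random(seed)  # the seed only makes the draw repeatable
    return rng.sample(population, k)

# Hypothetical population: ages of 20 students
ages = [18, 19, 20, 21, 22, 19, 18, 23, 24, 20,
        21, 22, 19, 18, 25, 20, 21, 22, 23, 19]

sample = simple_random_sample(ages, k=5, seed=42)
print(sample)
```

Any statistic computed on `sample` (for example its mean) is then used as an estimate of the same statistic for the whole population.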
...
The second type of sampling is called stratified sampling
...
Stratified sampling basically means layering. So this is what stratified sampling means. Let me give you one example
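One way to picture the layering idea in code is the sketch below. It is not the example given in the session: the records, the choice of gender as the stratum, and the helper `stratified_sample` are all assumptions made for illustration. The population is split into non-overlapping groups, and the same fraction is drawn at random from each group.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=None):
    """Split records into strata, then sample the same fraction from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 6 records in one stratum, 4 in the other
people = [(f"p{i}", "male") for i in range(6)] + \
         [(f"p{i}", "female") for i in range(6, 10)]

chosen = stratified_sample(people, stratum_of=lambda p: p[1], fraction=0.5, seed=1)
print(chosen)
```

With a fraction of 0.5 the sample keeps the 6:4 proportion of the strata (3 and 2 records), which is the point of stratifying.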
...
Strata basically means
...
based on different
...
sampling satisfies all the conditions. Is everybody clear till here? This was with respect to the second sampling technique. Now coming to the third one
...

Okay so please make sure that you enroll for this free community session
...

The first topic that we are going to discuss is the arithmetic mean for population and sample
...
The population size is denoted by capital N
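For reference, the two arithmetic means can be written as follows. This is a sketch of the standard definitions; the symbols $\mu$ (population mean), $\bar{x}$ (sample mean), and lowercase $n$ (sample size) are the usual conventions and are assumed here rather than quoted from the session.

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

The computation is the same in both cases; only the set of values being averaged (all N members versus the n sampled ones) differs.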
...

Now coming to the first thing: whenever we are discussing the mean, you need to remember that we are trying to find the average
...
are three main things
...
I can use mean median mode
...
If
...
want to know what central tendency is, or what a measure of central tendency is
...

Average and mean are one and the same, guys. Understand: average means mean
...
So we consider these numbers and the outliers. Outliers really have an adverse impact on the entire distribution
...
The first thing that you really need to do is sort
...
different techniques to remove the outliers, which I'll also be discussing with you today
...
So remember, outliers have a major impact, because here you can see that the central part of the distribution is basically shifting, and the difference is quite huge
...
I keep on increasing the number of outliers then the distribution will
become normal right
...
Because of this outlier, my mean was how much 12
...
There is hardly any difference, only a very small one, so that is the reason why we use the median
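The effect described here can be checked with a few lines of Python. The numbers are hypothetical (they are not the ones used in the session); the sketch only shows that one extreme value pulls the mean far more than the median.

```python
from statistics import mean, median

values = [1, 2, 3, 4, 5]          # hypothetical data without an outlier
with_outlier = values + [100]     # the same data plus one extreme value

print(mean(values), median(values))              # mean and median agree
print(mean(with_outlier), median(with_outlier))  # mean jumps, median barely moves
```

In this made-up data the mean moves from 3 to about 19.17 while the median only moves from 3 to 3.5.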
...
17 divided
by 6 is nothing but 2
...
The next thing is that, from this equation, I will calculate x minus mu. So what is x minus mu over here? Just do the calculation: here I get minus 1.83
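The x minus mu step feeds directly into variance and standard deviation. Below is a minimal sketch with an assumed data set: six made-up values that sum to 17, so the mean comes out near 2.83 and the first deviation near minus 1.83. The actual values used in the session are not shown in these extracts, and the population form (divide by N) is used here as an assumption.

```python
from statistics import mean, pstdev, pvariance

data = [1, 2, 2, 3, 4, 5]               # hypothetical: six values summing to 17
mu = mean(data)                          # 17 / 6, about 2.83

deviations = [x - mu for x in data]      # x minus mu for every value
variance = sum(d ** 2 for d in deviations) / len(data)   # population variance
std_dev = variance ** 0.5                # population standard deviation

print(round(mu, 2), round(deviations[0], 2))
print(round(variance, 3), round(std_dev, 3))

# Cross-check against the standard library's population formulas
assert abs(variance - pvariance(data)) < 1e-9
assert abs(std_dev - pstdev(data)) < 1e-9
```

Squaring the deviations before averaging is what stops the positive and negative differences from cancelling out.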
...
I may show you some of the images to help you understand, but I hope you're getting it. Is it clear?
...
The spread is definitely high. When we say the spread is high, that basically means the elements that are present in the central
...

With standard deviation
...
49 to 2
...
83 to 4
...
The dispersion becomes high because you have a larger number of values inside it. So understand one thing
...
Everything up to now will get covered in those practical sessions, so yes
...
We'll see some examples
...
Mean, median, mode: everything will get covered over here
...
This is continuous data
...
We already know that this is a
discrete continuous data in this particular case Age
...
These graphs can play a very important role whenever we are creating reports or doing exploratory data analysis, and in many other things
...
So here you can see that this forms exactly a bell curve
...
We
can definitely say this has a normal or Gaussian distribution
...
This distribution is very important because from it we can derive a lot of conclusions
...
7 percentage rule now
what does this basically mean
...
Around 68 percent
...

distribution is called a bell curve: that specific region in the central area
...
Like this kind
of curve
...
If you talk about petal length or sepal length, the data actually follows a Gaussian distribution, and in that case it will follow this 68 95 99
...
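The percentages in this rule can be reproduced from the standard normal distribution itself. This is a small sketch using Python's `statistics.NormalDist`; the helper name `share_within` is made up for illustration. It computes the share of a normal distribution that lies within 1, 2, and 3 standard deviations of the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The same shares hold for any mean and standard deviation, because the calculation only depends on the distance from the mean measured in standard deviations.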
IN
...

The domain expert is a doctor. The doctor has taken various samples from different places
...
I'll complete all the advanced things
...
The last thing: tell me, should I directly show you the last day
...
Zscore is equal to 3
...
25 standard deviation
...
Z-score formula: x of i minus mu, divided by the standard deviation, that is, $z = (x_i - \mu) / \sigma$
...
So if I apply the Z-score formula to everything: initially my distribution was like this: 1, 2, 3, 4, 5, 6, 7
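Applying the Z-score to the values 1 to 7 can be sketched as follows. The population standard deviation is assumed here (dividing by N); the session may have used a different convention, so treat the exact numbers as illustrative.

```python
from statistics import mean, pstdev

data = [1, 2, 3, 4, 5, 6, 7]
mu = mean(data)          # 4
sigma = pstdev(data)     # population standard deviation, 2.0 for this data

# Subtract the mean and divide by the standard deviation for every value
z_scores = [(x - mu) / sigma for x in data]
print(z_scores)

# After the transformation the data has mean 0 and standard deviation 1
print(mean(z_scores), pstdev(z_scores))
```

The shape of the distribution does not change; only the scale does, which is why the result is described as a standard normal distribution when the input was normal.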
...
After applying a z score
...
One of the most important properties with respect to the standard normal
...

One practical application, and we do this in machine learning
...
Now let's consider that I am solving a machine learning problem statement
...
I can take up this entire
data and apply Z score
...
Now, how do we do normalization
...
You just have to provide 0 to 1 and automatically
...
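Normalization into a 0 to 1 range is usually done with min-max scaling. The sketch below is a plain-Python version with a made-up helper name, `min_max_scale`; it is not the library call used in the session, only the arithmetic behind that kind of scaler.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    smallest, largest = min(values), max(values)
    span = largest - smallest
    if span == 0:                      # all values identical: nothing to spread out
        return [low for _ in values]
    return [low + (v - smallest) * (high - low) / span for v in values]

scaled = min_max_scale([1, 2, 3, 4, 5, 6, 7])
print(scaled)
```

Unlike the Z-score, this keeps every value inside a fixed range, which is why the range (here 0 to 1) is the only thing that has to be provided.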
Last year also an ODI series happened, and this year also it
...
The series average of 2021 was somewhere around 250, the standard deviation of the scores was somewhere about 10, and Rishabh's final score was 17
...
The second topic we are going to discuss is probability. The third thing that we are
...

discuss something called permutation and combination
...
we will discuss
...
The next thing we will be doing is this: let's define our data set
...
what I am actually going to do over
...
import numpy as np, import matplotlib.pyplot as plt, and then
...

%matplotlib inline. Okay, so I'll be executing this now
...
We usually also write this formula with root n, but I'll talk about why I'm specifically not specifying root n over here
...
I'm finding it for every data point in the data set
...
It's okay whether it is normally distributed or not, but I am actually trying it for the first time
...
If
...
You can drop off, because you'll not be able to understand
...
understand the
previous one
...
This is one way we can use the Z-score
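A minimal sketch of using the Z-score to find outliers is shown below. The data set, the helper name `zscore_outliers`, and the cut-off of 3 standard deviations are assumptions for illustration; the session's own data set is not shown in these extracts, and it used NumPy, while this sketch sticks to the standard library so it runs anywhere.

```python
from statistics import mean, pstdev

def zscore_outliers(data, threshold=3.0):
    """Return the values whose Z-score is further than `threshold` from 0."""
    mu = mean(data)
    sigma = pstdev(data)
    return [x for x in data if abs((x - mu) / sigma) > threshold]

# Hypothetical data: twenty ordinary values and one extreme value
data = [10, 11, 12, 13, 14] * 4 + [100]

print(zscore_outliers(data))
```

Note that with very small data sets no value can reach a Z-score of 3, so this check needs a reasonable number of observations to be useful.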
...
These are the steps that I will be
...
We have seen how to calculate Q1 and Q3 already in our previous session
...
So these are my steps
...

A box plot is based on the lower fence and the higher fence, and you can write a condition to remove all the elements as required
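The fence condition can be sketched in Python as below. The multiplier 1.5, the data, and the helper name `iqr_fences` are assumptions for illustration. Quartile conventions also differ between tools, so Q1 and Q3 computed here with the standard library's "inclusive" method may not match a hand calculation exactly.

```python
from statistics import quantiles

def iqr_fences(data, k=1.5):
    """Return (lower fence, higher fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 27]   # hypothetical data with one large value
lower_fence, upper_fence = iqr_fences(data)

# The condition: keep only the values that sit between the two fences
cleaned = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence)
print(outliers)
```

In this made-up data the lower fence comes out negative, so nothing is removed on the low side and only the value above the higher fence is dropped.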
...
If the lower fence is negative, then based on that condition you can remove any value less than it, and here also you can see 7
...
5
...
Also many places
...
What is the probability that when I roll a die I get a six? How many ways can this event occur? It can only occur in one way, and what is
...
So this is how we
basically find out right
...
want
to toss a coin toss a
...
Obviously, I know what my sample space is: head and tail. What is
...
You'll just say 1 by 2
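Both answers follow from the same ratio: the number of ways the event can occur divided by the number of possible outcomes, assuming every outcome is equally likely. A tiny sketch (the helper name `probability` is made up for illustration):

```python
from fractions import Fraction

def probability(event, sample_space):
    """Favourable outcomes divided by all outcomes (equally likely outcomes assumed)."""
    return Fraction(len(event), len(sample_space))

die = {1, 2, 3, 4, 5, 6}
coin = {"head", "tail"}

p_six = probability({6}, die)          # rolling a six
p_head = probability({"head"}, coin)   # tossing a head

print(p_six, p_head)
```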
...
2
...

Multiple events can occur at the same time in non-mutually exclusive scenarios
...
Obviously, you understood what mutually exclusive means
...
For this we can write: the probability of A or B, where A and B are events, is equal to the probability of A plus the probability of B
...
If you have mutually exclusive events, at that point of time
...
The probability of getting a queen is nothing but 4 by 52, because in every deck there will be four queen cards
...
probability of
queen and heart is only 1
...
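The two addition rules can be compared side by side. This is a sketch under the usual assumptions of a standard 52-card deck (4 queens, 13 hearts, 1 queen of hearts) and a fair six-sided die; the variable names are made up for illustration.

```python
from fractions import Fraction

# Non-mutually exclusive: a card can be a queen AND a heart at the same time,
# so the overlap has to be subtracted once.
p_queen = Fraction(4, 52)
p_heart = Fraction(13, 52)
p_queen_and_heart = Fraction(1, 52)
p_queen_or_heart = p_queen + p_heart - p_queen_and_heart

# Mutually exclusive: one roll of a die cannot be both 1 and 6,
# so the probabilities are simply added.
p_one_or_six = Fraction(1, 6) + Fraction(1, 6)

print(p_queen_or_heart, p_one_or_six)
```

When the events cannot happen together the overlap term is zero, which is why the simpler sum is enough in the mutually exclusive case.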
So these are the possible things that can
occur
...
This is the thing now if I come to the formula
...
error: super important. In machine learning you will probably be discussing the confusion matrix
...
The first topic that we are going to discuss is Type 1 and Type 2 errors
...
The fourth topic is the z-test and t-test, and if we get time we will also finish up the chi-square test
...
He testifies that his voice can be heard clearly and that he has no problem with his voice
...

This kind of decision is specifically called a Type 1 error. So this is my outcome two
...
Next week I will also be announcing a seven-day machine learning algorithms live session, where we will focus on understanding all the machine learning algorithms that are present in the data science world. Good news for you all. So here you can check it out. Now, one-tailed and two-tailed tests: let's try to understand what one-tailed and two-tailed tests are
...
True positives and true negatives are always right, so just check this confusion matrix. Don't worry, I'll be teaching you everything in machine learning, this topic also
...
05, obviously it is a 95% confidence interval, but this is only focused on finding "greater". So I'll put this entire value over here: this region will be my 5% and this region is my 95%. So this becomes a one-tailed test, on the right-hand side, because here the important keyword is "greater"
...
We will be estimating something for the population data. In this particular example, let's consider that if I have the sample mean, I'll try to estimate the population
...
9. Okay, and probably my population mean is mu equal to 3. This may be equal to the mean, but it may be less than the mean or it may be greater than the mean. So in this case we define something called confidence intervals, so that we can get closer to the population mean. A confidence interval is usually given by the formula: point estimate plus or minus margin of error
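For a mean with a known population standard deviation, the margin of error is the critical z value times the standard error. The sketch below shows that case; the numbers passed in (sample mean 520, population standard deviation 100, sample size 25) are hypothetical inputs chosen for illustration, and the helper name `z_confidence_interval` is made up.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, population_sd, n, confidence=0.95):
    """Point estimate +/- margin of error, with margin = z * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin_of_error = z_critical * population_sd / sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical inputs, not taken from the session
lower, upper = z_confidence_interval(sample_mean=520, population_sd=100, n=25)
print(round(lower, 1), round(upper, 1))
```

`inv_cdf` plays the role of looking up the value in a z-table: it returns the z value that leaves alpha / 2 of the area in the upper tail.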
...
next
thing is that i will take a sample
...
A sample of 25 test takers has a mean of
...
What is your sample size, small n? It is nothing but 25. What is your confidence interval? With respect to this, the alpha I will get is 0
...
your mean what's your mean
over here mean is
...
To find the confidence interval, the next thing to notice is that I have taken a sample of 25, but usually the sample size will be greater than or equal to 30
...
I have to check in the z-table, so for this I will go to my browser and check it: where is 0
...
The lower bound it is nothing but pi 20 minus 1
...
2 59
...
8
...
Please hit the like button before joining as usual
...
I have taken entire course two days
back which entire course are you talking about tamar guys
...
I can hear this on my YouTube channel. It's quite high, so I don't think so
...
Chi square test is a non-parametric
...
This is very very important
...
So using alpha is equal
...
05 would you
conclude the population distribution has changed in the last 10 years
...
They found out this data now right now after 10 years when they took the
sample of N is equal to 500
...
reject the null hypothesis
...
It may be more right, so i'm actually talking about
the data
...
You are in the right part okay now
you know your degree of freedom
...
My alpha value is 0
...

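Chi-square is equal to the summation of (f_o minus f_e) whole squared, divided by f_e: $\chi^2 = \sum (f_o - f_e)^2 / f_e$, where f_o is the observed frequency and f_e is the expected frequency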
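The formula can be sketched in a few lines of Python. The observed counts and the expected shares below are hypothetical (only the sample size of 500, the 2 degrees of freedom, and the 0.05 significance level echo the session), and the closed-form critical value used here is valid only for 2 degrees of freedom; for other cases a chi-square table is still needed.

```python
from math import log

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n = 500
expected_share = [0.2, 0.3, 0.5]       # hypothetical shares from the older data
observed = [140, 160, 200]             # hypothetical counts in the new sample
expected = [share * n for share in expected_share]

chi2 = chi_square_statistic(observed, expected)

alpha = 0.05
# With 3 categories there are 2 degrees of freedom. For exactly 2 degrees of
# freedom the critical value has the closed form -2 * ln(alpha), about 5.99.
critical_value = -2 * log(alpha)

reject_null = chi2 > critical_value
print(round(chi2, 2), round(critical_value, 2), reject_null)
```

A statistic above the critical value means the observed counts are too far from the expected ones to be explained by chance at that significance level.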
...
The output of the test is 494 and 232 is greater than 5
...
The population distribution has changed
...
Nero Sarkar just search for chi square table
...
05. Okay, the degree of freedom is 2, so I am getting 5
...
025 sorry
...
The p-value basically means that based on
this p value
...
If
...
Then obviously we reject the null hypothesis
...
If the p-value is 0
...
We are going to reject it suppose if we get 0
...
Let's say that in this particular case, I am going to use the mean as 110
...
002
...
Let me start on covariance now. Shall I continue, guys? Is it clear now? Like that, you can do the same for the t-test. What data do you require? See whatever question I am writing with respect to this: that is the kind of data you need
...
05
...
In this case, we accept the null hypothesis
...
We accept null hypothesis, we reject it only no okay
...
Let's go ahead and discuss the next topic, which is called covariance
...
This relationship is basically used over here. Here you can see these two conditions
...
In that particular case, I can use a formula which is called covariance
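A minimal sketch of the covariance calculation follows. The sample form (dividing by n minus 1) is assumed, the helper name `sample_covariance` is made up, and the data are hypothetical; a positive result means the two variables tend to move together, a negative one means they move in opposite directions.

```python
def sample_covariance(xs, ys):
    """Average product of the deviations of x and y from their own means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    products = ((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

hours = [1, 2, 3, 4, 5]            # hypothetical x values
scores = [2, 4, 6, 8, 10]          # y rises as x rises

cov_up = sample_covariance(hours, scores)
cov_down = sample_covariance(hours, scores[::-1])   # y falls as x rises
print(cov_up, cov_down)
```

Covariance only gives the direction of the relationship; its size depends on the units of the data, so it does not say how strong the relationship is.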
...
We will be covering all the machine learning algorithms; that is what we are going to do
...
Then
...
As
...
If
...
The average weight of
all residents in Bangalore City is 168 pounds with a standard deviation 3
...
We need to check whether the sample is able to
...
So the test we can definitely use is the Z-test
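A one-sample Z-test can be sketched as follows. Only the population mean of 168 pounds comes from the problem statement above; the standard deviation of 3.9, the sample size of 36, and the sample mean of 169.5 are assumed values for illustration, as is the helper name `one_sample_z_test`. The test is two-tailed, and the null hypothesis is rejected when the p-value is less than or equal to alpha.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, population_mean, population_sd, n, alpha=0.05):
    """Two-tailed Z-test; returns (z statistic, p-value, reject null?)."""
    standard_error = population_sd / sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Both tails are symmetrical, so take one tail area and double it
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value <= alpha

# 168 is from the problem statement; the other inputs are assumptions
z, p_value, reject_null = one_sample_z_test(
    sample_mean=169.5, population_mean=168, population_sd=3.9, n=36
)
print(round(z, 2), round(p_value, 4), reject_null)
```

Comparing the absolute z value with the critical value 1.96 gives the same decision as comparing the p-value with an alpha of 0.05.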
...
I hope everybody is clear because we have already
done this in our previous session
...
025
...
9750 right so we are going to check this area of curve and usually we get 1
...

96
...
The area under the curve of this particular curve is basically 0
...
If I subtract this from this, how much will I be getting? I'm actually getting 0.001778, right? Is it not 0
...
Because both the areas are symmetrical. Understand one thing
...
If I am getting
one value over there probably I will be able to see that specific part right because this part is
symmetrical to this part
...
If
...
This means we have to
reject the null hypothesis
...
We fail to reject, or accept, the
...
It failed to reject
...
We can also try this out in every problem that we have discussed over these many days. Is it clear, everyone?
...
So if I add this, then I will be getting some value, and then I check whether this is less than the significance value: less than or equal to alpha
...
5 okay so this is a college over here
...
Your life
...
05 and confidence interval do the age where I okay
...
You have won a Nobel Prize by telling me that I have made a mistake
...
It can be less than 24 and equal to 24, so it becomes a two-tailed test
...
This so tough guys so tough daily
...
The course requests are too funny, you know. Any kind of course request will come; some guy will come and say that
...
There are very few people who have huge wealth, and they will be
...
There will be very few people who write big comments, so this is one example, and again
...
I have uploaded a detailed
video in my stats playlist
...
There are three important things that will come up: probability density function, probability mass function, and cumulative distribution function
...
The third one that we specifically discuss is something called the cumulative distribution function
...

The probability of head it is nothing but 0
...
05. What does this basically mean? Now if I consider 45, it falls somewhere here. And just the day before yesterday, guys, I uploaded a video regarding derivatives
...
The probability of tail is nothing but 0
...
The
probability 53 in the y axis over here I will be having all the outcomes and the
...
For every distribution you will have a PDF and a CDF. As for how to derive the PDF, there will be a formula
...

At the end of the day, whatever technique you use, the idea is the same as for the PDF. Now over here you have a probability mass function with some lambda value
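A probability mass function with a lambda parameter is most commonly the Poisson distribution; that is an assumption here, since the extract does not name the distribution. Under that assumption, the PMF and the CDF built from it can be sketched as below (the helper names are made up for illustration).

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return exp(-lam) * lam ** k / factorial(k)

def poisson_cdf(k, lam):
    """Probability of at most k events: the PMF added up from 0 to k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 2   # hypothetical average rate
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4), round(poisson_cdf(k, lam), 4))
```

The CDF is just the running total of the PMF, which is the relationship between the two functions for any discrete distribution.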
...

Then for this, the function is this one (you don't need to memorize it), and the other is the CDF
...
The next video will be related to inferential statistics, where I'll give you another trick with which you can probably solve every problem that you can be asked in the interviews
