Title: Maths and Stats for Data Science
Description: These notes are all you need to understand the maths and stats knowledge for data science
Document Preview
Extracts from the notes are below, to see the PDF you'll receive please use the links above
Statistics And Mathematics for Data Science
...
Descriptive Statistics
In descriptive statistics, you conduct your experiment on the entire
population or dataset
...
Another example is analyzing the sales growth
of a clothing store with locations in North America; you would need to
consider all three stores, not just some of them, to calculate the
average sales growth of 5%
...
For instance, Moderna, Pfizer, and J&J developed a
vaccine and tested its efficacy on 40,000 patients, concluding that the
vaccine is 95% effective
...
This is inferential
statistics
...
When you take into account
the entire population in your data analysis process, it is called
descriptive analysis
...
Mean, Median, Mode, and Percentile in Data Science and
Machine Learning
In this article, we will discuss the concepts of mean, median, mode,
and percentile in data science and machine learning
...
Example Scenario
Suppose you want to open a luxury car showroom in Monroe
Township, New Jersey
...
One measure you can use is the
average income of people in the area
...
In such
cases, the median can be a better measure to use instead of the
mean
...
If there is an
even number of data points, the median is calculated by taking the
average of the two middle values
...
Percentile and Outlier Removal
The percentile is a measure that indicates the percentage of values
that are less than or equal to a given value
...
The
interquartile range (IQR) is the range between the 25th and 75th
percentiles and is useful in outlier removal
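As a minimal sketch of IQR-based outlier removal, the snippet below uses NumPy on a small made-up income list. The data values and the 1.5 * IQR fences are assumptions for illustration (the 1.5 multiplier is a common convention, not something stated in these notes).

```python
import numpy as np

# Hypothetical incomes in thousands; the last value is an obvious outlier.
incomes = np.array([42, 45, 48, 50, 52, 55, 58, 60, 63, 400], dtype=float)

# 25th and 75th percentiles, and the range between them (the IQR).
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1

# Assumed convention: keep values within 1.5 * IQR of the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
cleaned = incomes[(incomes >= lower) & (incomes <= upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Kept values:", cleaned)
```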
...
Python Code
Below is Python code to calculate the median and remove the 99th
percentile outlier:
# Calculate median
def median(data):
    n = len(data)
    s = sorted(data)
    return s[n // 2] if n % 2 != 0 else (s[n // 2 - 1] + s[n // 2]) / 2

# Remove 99th percentile outlier
def remove_outlier(data):
    p = np
...
Try removing outliers using the percentile method
...
The
formula for calculating percentile can be found on the website
statisticshowto
...
On the other hand, mode is the most frequently
occurring value in a dataset
...
Using Python Code to Work with Income Data
If you have income data in a CSV file, you can load it into a pandas
data frame and use percentile to remove outliers
...
The quantile function can be used to find any
percentile, and the interpolation parameter can change the result
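To see how the interpolation parameter changes the result, here is a small sketch using pandas `Series.quantile`. The income values are made up for illustration; the point is only that the same percentile can land on different numbers depending on the method.

```python
import pandas as pd

# Hypothetical incomes; the large last value makes the differences visible.
incomes = pd.Series([40, 50, 60, 70, 1000])

methods = ("linear", "lower", "higher", "nearest", "midpoint")

# The 90th percentile falls between 70 and 1000, so each method picks
# a different point in that gap.
results = {m: incomes.quantile(0.9, interpolation=m) for m in methods}

for method, value in results.items():
    print(method, value)
```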
...
Missing values can be filled using the mean or
median, with the latter being more appropriate when outliers are
present
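A short sketch of filling a missing value with the median rather than the mean, using pandas. The column name `income` and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical income column with one missing value and one outlier.
df = pd.DataFrame({"income": [40.0, 50.0, np.nan, 60.0, 1000.0]})

mean_fill = df["income"].mean()      # pulled upward by the 1000 outlier
median_fill = df["income"].median()  # stays close to the typical values

# Fill the gap with the median, the more robust choice here.
df["income_filled"] = df["income"].fillna(median_fill)

print("mean:", mean_fill, "median:", median_fill)
print(df)
```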
...
A solution link is provided,
but it is recommended to try on your own first
...
One metric to
represent this concept is mean absolute deviation (MAD), which is the
mean of the absolute differences between individual data points and the
average
...
Another metric to consider is standard deviation, which is calculated
by taking the square root of the average of the squared differences
between individual data points and the average
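The two metrics can be compared side by side with a minimal NumPy sketch. The data values are made up; the standard deviation here divides by n (the population form), which matches the "average of the squared differences" wording.

```python
import numpy as np

# Hypothetical data points.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = data.mean()

# Mean absolute deviation: average of |x - mean|.
mean_abs_dev = np.mean(np.abs(data - mean))

# Standard deviation: square root of the average of (x - mean)^2.
std_dev = np.sqrt(np.mean((data - mean) ** 2))

print("mean absolute deviation:", mean_abs_dev)
print("standard deviation:", std_dev)
```

Squaring gives large deviations more weight, which is why the standard deviation is never smaller than the mean absolute deviation.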
...
It's important to note that standard deviation is related to the L2 norm,
while MAD is related to the L1 norm
...
We will also discuss the use of Z-score and do
some Python coding to understand how these concepts are applied in
real-life data analysis
...
The remaining data samples are located on
the left and right-hand side of the average value
...
Normal distribution occurs naturally in many data sets, such as test
scores, apartment prices, or employee performance
...
The histogram shows the frequency distribution, while the bell
curve indicates the normal distribution of data samples around the
average value
...
Why is Normal Distribution Important?
Normal distribution is crucial for data scientists and machine learning
engineers since it helps in the data cleaning process
...
Outliers are data points that are very different from the
average values and can lead to skewed machine learning models
...
Statisticians found that 68
...
Similarly, 95
...
7% fall within plus or minus three standard
deviations
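One way to check these proportions is to simulate normally distributed data and count how many samples fall within one, two, and three standard deviations. This is a sketch on randomly generated data (the seed and sample size are arbitrary choices), not the dataset used in the notes.

```python
import numpy as np

# Simulated standard normal samples (mean 0, standard deviation 1).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Share of samples within k standard deviations of the mean.
shares = {k: np.mean(np.abs(samples) <= k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.3%}")
```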
...
Python Coding to Understand Z-Score and Outlier Removal
We can use Python to apply the standard deviation formula and
remove outliers from the data set
...
We will first use the Seaborn library to plot
the histogram and bell curve, and then apply the standard deviation
formula to remove the outliers
...
The dataset
contains ten thousand data points of different people's heights
...
read_csv('height
...
sns
...
height, kde=True)
Step 3: Calculate Standard Deviation
We will use the describe function to calculate the standard deviation
of the height column
...
height
...
df = df[(df
...
height
...
height
...
By understanding these concepts
and applying them through Python coding, we can remove outliers and
clean the data set, leading to better machine learning models
...
To do this, you
first need to calculate the mean and standard deviation
...
84
inches, you can use plus and minus 3 standard deviations to remove
the outliers
...
82 and 77
...
● If a height is less than 54
...
91 inches, then
it's an outlier
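The rule above can be sketched in pandas as follows. The heights are simulated (with two extreme values added on purpose), so the bounds printed here will not match the numbers in the notes; the column name `height` follows the notes, everything else is an assumption.

```python
import numpy as np
import pandas as pd

# Simulated heights in inches, plus two deliberately extreme values.
rng = np.random.default_rng(42)
heights = np.append(rng.normal(loc=66.0, scale=4.0, size=10_000), [30.0, 100.0])
df = pd.DataFrame({"height": heights})

mean = df["height"].mean()
std = df["height"].std()

# Anything outside mean +/- 3 standard deviations is treated as an outlier.
lower = mean - 3 * std
upper = mean + 3 * std
no_outliers = df[(df["height"] > lower) & (df["height"] < upper)]

print("bounds:", round(lower, 2), round(upper, 2))
print("rows removed:", len(df) - len(no_outliers))
```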
...
Z-Score
Z-Score is a measure of how many standard deviations away a data
point is from the mean
...
Once you have the Z-Score for each data point, it's
even easier to remove the outliers
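A minimal sketch of the z-score approach: compute (value - mean) / standard deviation for every row, then filter on the absolute z-score. The data is simulated and the cutoff of 3 is assumed to mirror the three-standard-deviation rule.

```python
import numpy as np
import pandas as pd

# Simulated heights in inches, plus two deliberately extreme values.
rng = np.random.default_rng(7)
heights = np.append(rng.normal(loc=66.0, scale=4.0, size=1_000), [30.0, 100.0])
df = pd.DataFrame({"height": heights})

# Z-score: how many standard deviations a value is from the mean.
df["zscore"] = (df["height"] - df["height"].mean()) / df["height"].std()

outliers = df[df["zscore"].abs() > 3]
cleaned = df[df["zscore"].abs() <= 3]

print(outliers)
```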
...
Working on exercises like
this is important for understanding and retaining the concepts
...
It helps in finding the
time or number of years required for an investment to reach a certain
amount or value
...
Let's
explore some applications of log in more detail
...
Let's say a bank gives a 5x return on
investment every year
...
This concept can be
represented using an exponent where 5 raised to 2 is 25 and 5 raised
to 3 is 125
...
For instance, log
to the base 5 of 125 is 3, which means it took three years to reach that
amount
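The same relationship can be checked with Python's standard `math` module. This is just the arithmetic from the example: exponentiation answers "how much after n years", and the logarithm answers "how many years to reach this amount".

```python
import math

multiplier = 5  # the 5x yearly return from the example

# Exponent: value of 1 unit after each year.
for years in (1, 2, 3):
    print(years, "year(s):", multiplier ** years)

# Logarithm: years needed to reach 125 at 5x per year.
# Floating-point log can be off by a tiny amount, so round for display.
years_needed = math.log(125, multiplier)
print("years to reach 125:", round(years_needed, 6))
```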
...
For instance, if we
plot the revenues of six companies using a bar chart, Amazon's
revenue will be significantly higher than the others and make it hard to
compare the smaller companies
...
plot function
...
Log can also be used in machine learning to avoid bias in models
...
Logarithmic Function in Earthquake Measurement
Another classic example of a logarithmic function is earthquake
measurement
...
This
makes it easier for people to understand the severity of an earthquake
and take necessary precautions
...
Understanding Log-Normal Distribution
Before diving into log-normal distribution, it's essential to have a basic
understanding of normal distribution
...
However, when it comes to
income distribution, the curve is right-skewed due to the presence of
outliers like Jeff Bezos and Elon Musk
...
This transformation turns the original distribution into a log-normal
distribution, which is commonly used in data science
...
For instance, when building
a model to predict whether a person should be given a loan or not, the
income variable needs to be transformed using a log function to
ensure that all values are on a similar scale
...
A sample dataset of US income data was taken from the
census
...
By applying the log
function to the x-axis, the distribution was transformed into a more
normal shape
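The effect of the log transform can be sketched without the census data by simulating right-skewed incomes. The simulated values and the hand-written `skewness` helper are assumptions for illustration; a skewness near zero indicates a roughly symmetric, bell-like shape.

```python
import numpy as np

# Simulated right-skewed incomes (a stand-in for real income data).
rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=11.0, sigma=0.8, size=10_000)

def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

raw_skew = skewness(incomes)          # strongly positive: long right tail
log_skew = skewness(np.log(incomes))  # close to zero after the transform

print("skewness before log:", round(raw_skew, 3))
print("skewness after log:", round(log_skew, 3))
```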
...
A/B Testing: A Powerful Technique for Decision Making
When you have two options and you don't know which one is
performing better, A/B testing can help you make the decision
...
The
business manager proposes moving the button to the middle of the
page to increase leads
...
By
comparing the number of sales leads each version generates, Atliq
can determine which version is better
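A raw comparison of lead counts can be misleading if the difference is small, so one common follow-up is a two-proportion z-test. The notes do not say which test was used; the sketch below is an assumed approach, with made-up visitor and lead counts and a hypothetical helper name `two_proportion_z`.

```python
import math

def two_proportion_z(leads_a, visitors_a, leads_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = leads_a / visitors_a
    p_b = leads_b / visitors_b
    # Pooled rate under the assumption that both versions perform the same.
    pooled = (leads_a + leads_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: version A (original page) vs version B (button moved).
z, p_value = two_proportion_z(leads_a=50, visitors_a=1000,
                              leads_b=70, visitors_b=1000)
print("z:", round(z, 3), "p-value:", round(p_value, 4))
```

A small p-value suggests the difference is unlikely to be due to chance alone; with these made-up counts the result sits near the usual 0.05 threshold, which shows why the raw counts are not enough on their own.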
...
A/B testing is commonly used by websites, including Amazon,
LinkedIn, and Facebook, when making layout changes
...
A/B testing is a simple but powerful
technique that can help make informed decisions
...
A/B testing is a simple and powerful technique that is widely used
across different industries
...
Understanding Hypothesis Testing
Hypothesis testing is the process of testing a claim or hypothesis
about a population parameter using sample data
...
The null
hypothesis is the claim being tested, while the alternate hypothesis is
the opposite of the null hypothesis
...
Alternate hypothesis - The Sun does not revolve around the Earth
...
Alternate hypothesis - Employees can be more
productive working from home
...
Alternate hypothesis - Certain foods
can cause irritable bowel disease
...
Application of Hypothesis Testing in Data Science
Data scientists may use hypothesis testing to compare models and
determine which model is better for a specific task
...
Understanding hypothesis testing is an essential skill in data science,
and it can be useful in various career paths
...
Outliers are values in a dataset that are significantly different
from the other values
...
We will start with a
simple use case of studying people's height
...
The
median height is the middle number when all the heights are sorted in
ascending order
...
Using Modified Z-Score to Detect Outliers in Property Prices
Consider a dataset of property prices where we want to detect an
outlier
...
In this case, 300k is an outlier
...
If you use the mean as your basis for detecting outliers, it may
not detect this value as an outlier because the mean is affected by this
high value
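This effect is easy to reproduce on a tiny dataset. The prices below are made up (in thousands), with 300 standing in for the 300k outlier; they are not the values from the notes.

```python
import numpy as np

# Hypothetical property prices in thousands; 300 is the outlier.
prices = np.array([40, 45, 50, 55, 60, 300], dtype=float)

mean = prices.mean()
median = np.median(prices)
std = prices.std()

# Regular z-score of the outlier: the outlier itself inflates both the
# mean and the standard deviation, so its z-score stays below 3.
z_of_300 = (300 - mean) / std

print("mean:", round(mean, 2), "median:", median)
print("z-score of 300:", round(z_of_300, 2))
```

The median stays near the typical price while the mean is dragged toward the outlier, which is the motivation for using a median-based method.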
...
Calculating MAD and Modified Z-Score
MAD here stands for median absolute deviation. It is similar to standard
deviation but uses the median instead of the mean as its basis point
...
5, which is
used in the modified z-score
...
6745), where x is a single data point
...
In the demo, we calculate the z-score and modified z-score for a
dataset of people's heights and show how to use them to detect
outliers
...
Any value that has a z-score of 3
...
However, when the sample size is smaller, outliers can have a bigger
impact on the average and standard deviation
...
To demonstrate this concept, let's use a movies dataset
...
We can then use
the regular z-score method to detect outliers, but since this may not be
effective for smaller sample sizes, we can also use modified z-scores
...
read_csv('movies
...
describe()
# Define a function to compute z-scores
def get_zscore(value, mean, std):
    return (value - mean) / std

# Create a z-score column in the dataframe
df['zscore'] = df['revenue_mln']
...
5
df[df['zscore'] > z_cutoff]

# Define a function to compute median absolute deviation
import numpy as np

def get_mad(s):
    median = np
...
abs(s - median)
mad = np
...
median(df['revenue_mln'])
df['modified_zscore'] =
df['revenue_mln']
...
abs(x - median), 0,
mad))
# Detect outliers using modified z-score
modified_z_cutoff = 3
...
The modified z-score is a statistical method that
measures the distance between a data point and the median,
normalized by the median absolute deviation (MAD)
...
Defining the Modified Z-Score Function
To use the modified z-score method, we need to define a function
called get_modified_zscore
...
The formula for the modified
z-score is:
$0.6745 \times (x - \text{median}) / \text{MAD}$
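A minimal sketch of the modified z-score calculation is shown below. The function names `get_mad` and `get_modified_zscore` follow the notes, but their signatures, the sample values, and the 3.5 cutoff are assumptions (3.5 is a commonly used threshold); the sketch also assumes MAD is not zero.

```python
import numpy as np

def get_mad(values):
    """Median absolute deviation: median of |x - median|."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    return np.median(np.abs(values - median))

def get_modified_zscore(x, median, mad):
    """Modified z-score: 0.6745 * (x - median) / MAD. Assumes mad > 0."""
    return 0.6745 * (x - median) / mad

# Hypothetical revenues in millions; 300 is the outlier.
revenues = np.array([40, 45, 50, 55, 60, 300], dtype=float)

median = np.median(revenues)
mad = get_mad(revenues)
scores = get_modified_zscore(revenues, median, mad)

# Assumed cutoff of 3.5 on the absolute modified z-score.
outliers = revenues[np.abs(scores) > 3.5]

print("MAD:", mad)
print("outliers:", outliers)
```

Because the median and MAD barely move when one extreme value is added, the outlier gets a large score even in a very small sample.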
To test this function, let's find the modified z-score of the movie
"Avatar", which has a revenue of $2
...
4mad = 266
...
This was
expected since "Avatar" has one of the highest revenues of any movie
...
We will create a new column in our data
frame called "modified_zscore", which will contain the modified
z-score for each movie:
df['modified_zscore'] = df['revenue_in_millions']
...
A general guideline is to consider
any data point with a modified z-score greater than 3
...
● With the z-score method, we only detected one outlier:
"Avatar"
...
513 billion), and "The
Bodyguard" (revenue of $411 million)
...
Conclusion
The modified z-score is a useful method for detecting outliers in
skewed data distributions that have outliers that affect the mean
...
If the data distribution has few
data points and outliers that affect the mean, the modified z-score
method may be more appropriate