Title: Data preprocessing
Description: Class notes on data preprocessing. If you study these notes, you will get good marks.
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a
single attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
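The effect of aggregation can be illustrated with a small sketch. The city names, regions, and sales figures below are hypothetical, and `aggregate_by_region` and `relative_spread` are illustrative helpers, not part of the notes. The data is chosen so that the regional totals vary less than the city series; real data only tends to behave this way.

```python
from statistics import mean, pstdev

# Hypothetical daily sales per city: (region, city, values per day).
city_sales = [
    ("North", "A", [10, 30, 12, 28]),
    ("North", "B", [28, 14, 30, 8]),
    ("South", "C", [5, 25, 7, 23]),
    ("South", "D", [22, 6, 20, 8]),
]


def aggregate_by_region(rows):
    """Sum the per-city series into a single series per region."""
    regions = {}
    for region, _city, series in rows:
        if region not in regions:
            regions[region] = list(series)
        else:
            regions[region] = [a + b for a, b in zip(regions[region], series)]
    return regions


def relative_spread(series):
    """Coefficient of variation: population std divided by the mean."""
    return pstdev(series) / mean(series)


region_sales = aggregate_by_region(city_sales)

# Data reduction: four objects (cities) become two objects (regions).
print(len(city_sales), "->", len(region_sales))

# Change of scale and stability: compare city A with its region.
print(relative_spread(city_sales[0][2]), relative_spread(region_sales["North"]))
```

Here the relative spread is used only as one possible way to compare variability across series with different totals.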
Sampling
• Sampling is the main technique employed for data selection
...
• Statisticians sample because obtaining the entire set of data of interest
is too expensive or time consuming
...
Sample Size
[Figure: the data sampled at 8000 points, 2000 points, and 500 points]
Sampling …
• The key principle for effective sampling is the
following:
– Using a sample will work almost as well as using the
entire data set, if the sample is representative
– A sample is representative if it has approximately the
same property (of interest) as the original set of data
Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
• Sampling without replacement
– As each item is selected, it is removed from the population
• Sampling with replacement
– Objects are not removed from the population as they are selected for the sample
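The difference between the two sampling schemes can be sketched with Python's standard `random` module. The population of 100 integers, the sample size of 10, and the fixed seed are assumptions made only for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 items

# Simple random sampling without replacement: every item is equally
# likely to be picked, and a picked item cannot be picked again.
without_replacement = rng.sample(population, k=10)

# Simple random sampling with replacement: picked items stay in the
# population, so the same item may appear more than once.
with_replacement = rng.choices(population, k=10)

print(without_replacement)
print(with_replacement)
```

With replacement, the sample size may exceed the population size; without replacement it cannot.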
...
– Is higher when objects are more alike
...
Euclidean Distance
• Euclidean Distance
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
where n is the number of dimensions (attributes) and p_k and q_k are,
respectively, the kth attributes (components) of data objects p and q
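A minimal sketch of the Euclidean distance defined above, using only the standard library. The function name `euclidean_distance` and the two example points are assumptions for illustration.

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared differences over all n attributes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


# Hypothetical two-attribute objects: differences are 3 and 4.
print(euclidean_distance((0, 2), (3, 6)))  # 5.0
```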
...
Mahalanobis Distance
$$\mathrm{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1} (p - q)^T$$
$\Sigma$ is the covariance matrix of the input data X
$$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$$
For red points, the Euclidean distance is 14
...
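The Mahalanobis computation can be sketched in plain Python for the special case of two attributes, where the covariance matrix is 2×2 and can be inverted by hand. The helper names and the data set `X` are hypothetical. The sketch follows the formula as written in these notes, without a final square root; some texts take the square root of this quantity.

```python
from statistics import mean


def covariance_matrix_2d(X):
    """Sample covariance matrix (divides by n - 1) for rows with two attributes."""
    n = len(X)
    means = [mean(row[j] for row in X) for j in range(2)]
    return [
        [
            sum((row[j] - means[j]) * (row[k] - means[k]) for row in X) / (n - 1)
            for k in range(2)
        ]
        for j in range(2)
    ]


def mahalanobis_2d(p, q, cov):
    """(p - q) * inverse(cov) * (p - q)^T for two-attribute objects."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [p[0] - q[0], p[1] - q[1]]
    # Row vector (p - q) times the inverse covariance matrix.
    left = [
        diff[0] * inv[0][0] + diff[1] * inv[1][0],
        diff[0] * inv[0][1] + diff[1] * inv[1][1],
    ]
    # Times the column vector (p - q)^T.
    return left[0] * diff[0] + left[1] * diff[1]


# Hypothetical input data X with two attributes per object.
X = [(0, 0), (2, 0), (0, 2), (2, 2)]
cov = covariance_matrix_2d(X)
print(mahalanobis_2d((0, 0), (2, 2), cov))
```

When the covariance matrix is the identity, the result reduces to the squared Euclidean distance, which is a convenient sanity check.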
Cosine Similarity
• If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 • d2) / ||d1|| ||d2|| ,
where • indicates vector dot product and || d || is the length of vector d
...
||d1|| = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = 0.3150
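The cosine formula can be checked with a short sketch. The vector `d2` is read off the ||d2|| expression in the notes. The vector `d1` is an assumption: its definition is elided in this preview, so a vector whose squared length is 42 is used here for illustration.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]  # assumed: squared length is 42
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]  # from the ||d2|| expression in the notes

print(round(cosine_similarity(d1, d2), 4))  # 0.315
```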
Similarity Between Binary Vectors
• A common situation is that objects, p and q, have only
binary attributes
• Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
= (M11) / (M01 + M10 + M11)
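The two coefficients can be computed directly from the four counts. In this sketch the function name `binary_similarity` and the example vectors are hypothetical, and returning 0.0 for the Jaccard coefficient when every attribute is 0 in both objects is an assumed convention (the ratio is undefined in that case).

```python
def binary_similarity(p, q):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    if len(p) != len(q) or not p:
        raise ValueError("p and q must be non-empty and of equal length")
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)

    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    not_both_zero = m01 + m10 + m11
    # Assumed convention: Jaccard is reported as 0.0 when it is undefined.
    jaccard = m11 / not_both_zero if not_both_zero else 0.0
    return smc, jaccard


# Hypothetical objects: M01 = 2, M10 = 1, M00 = 7, M11 = 0.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarity(p, q))  # (0.7, 0.0)
```

The example shows why the two measures can disagree: the many shared zeros count as matches for SMC but are ignored by the Jaccard coefficient.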
Correlation
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, p and q, take
their dot product, and divide by n − 1, where n is the number of
attributes and std is the sample standard deviation
p'k = (pk − mean(p)) / std(p)
q'k = (qk − mean(q)) / std(q)
correlation(p, q) = (p' • q') / (n − 1)
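A minimal sketch of the standardize-then-dot-product recipe, using the standard `statistics` module. One detail is made explicit here as an assumption: because `stdev` is the sample standard deviation, the dot product of the standardized vectors is divided by n − 1 so that the result lies between −1 and 1. The function name and example data are hypothetical.

```python
from statistics import mean, stdev


def correlation(p, q):
    """Pearson correlation via standardized vectors and a dot product."""
    n = len(p)
    if n != len(q) or n < 2:
        raise ValueError("p and q need the same length, with at least 2 attributes")
    mean_p, mean_q = mean(p), mean(q)
    std_p, std_q = stdev(p), stdev(q)  # sample standard deviations
    if std_p == 0 or std_q == 0:
        raise ValueError("correlation is undefined for a constant vector")
    p_std = [(pk - mean_p) / std_p for pk in p]
    q_std = [(qk - mean_q) / std_q for qk in q]
    # Dot product of the standardized vectors, scaled by n - 1.
    return sum(a * b for a, b in zip(p_std, q_std)) / (n - 1)


print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0, perfect positive linear relationship
print(correlation([1, 2, 3], [3, 2, 1]))  # -1.0, perfect negative linear relationship
```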
Visually Evaluating Correlation
[Figure: scatter plots showing the similarity from –1 to 1]