Title: Data Preprocessing
Description: These are data preprocessing class notes. If you learn from these notes, you will get good marks.



Data Preprocessing

• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Aggregation
• Combining two or more attributes (or objects) into a
single attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects

– Change of scale
• Cities aggregated into regions, states, countries, etc.

– More “stable” data
• Aggregated data tends to have less variability
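As a concrete illustration of these points, here is a minimal sketch in Python using pandas; the city, region, and sales values are hypothetical and only meant to show cities being aggregated into regions.

```python
import pandas as pd

# Hypothetical city-level records (illustrative values only).
df = pd.DataFrame({
    "city":   ["Mumbai", "Pune", "Chennai", "Madurai"],
    "region": ["West", "West", "South", "South"],
    "sales":  [120.0, 80.0, 95.0, 45.0],
})

# Aggregate cities into regions: fewer objects (data reduction), a coarser
# scale (change of scale), and aggregated values that vary less than the
# raw city-level values ("more stable" data).
by_region = df.groupby("region")["sales"].agg(["sum", "mean", "std"]).reset_index()
print(by_region)
```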

Sampling
• Sampling is the main technique employed for data selection
...


• Statisticians sample because obtaining the entire set of data of interest
is too expensive or time consuming
...


Sample Size

[Figure: the same data set shown at 8000, 2000, and 500 points]

Sampling …
• The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data set, if the sample is representative
– A sample is representative if it has approximately the same property (of interest) as the original set of data

Types of Sampling

• Simple Random Sampling
– There is an equal probability of selecting any particular item

• Sampling without replacement
– As each item is selected, it is removed from the population

• Sampling with replacement
– Objects are not removed from the population as they are selected for the sample
...
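A minimal sketch of simple random sampling with and without replacement, assuming NumPy and a hypothetical population of 10,000 labelled objects; the last line compares sample means against the population mean, echoing the representativeness principle above.

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(10_000)  # hypothetical data objects, labelled 0..9999

# Simple random sampling WITHOUT replacement: each object is selected at most once.
sample_without = rng.choice(population, size=500, replace=False)

# Simple random sampling WITH replacement: objects stay in the population,
# so the same object may be drawn more than once.
sample_with = rng.choice(population, size=500, replace=True)

# A representative sample preserves the property of interest (here, the mean).
print(population.mean(), sample_without.mean(), sample_with.mean())
```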

– Similarity is higher when objects are more alike
...


Euclidean Distance

\[ \mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} \]

where n is the number of dimensions (attributes) and \(p_k\) and \(q_k\) are, respectively, the kth attributes (components) of data objects p and q.
...
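A minimal sketch of this formula, assuming NumPy; the two 3-dimensional points are hypothetical.

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance: square root of the sum of squared attribute differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Hypothetical 3-dimensional data objects p and q.
print(euclidean([0, 2, 1], [2, 0, 1]))  # sqrt(8) ≈ 2.828
```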


Mahalanobis Distance

\[ \mathrm{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1} (p - q)^{T} \]

where \(\Sigma\) is the covariance matrix of the input data X:

\[ \Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} \bigl(X_{ij} - \overline{X}_j\bigr)\bigl(X_{ik} - \overline{X}_k\bigr) \]

For red points, the Euclidean distance is 14
...
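A minimal sketch of this definition, assuming NumPy; the data set X is randomly generated for illustration, and the function returns the quadratic form exactly as written above (some references take the square root of this quantity).

```python
import numpy as np

def mahalanobis(p, q, X):
    """(p - q) Sigma^{-1} (p - q)^T, with Sigma the covariance matrix of X."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    sigma = np.cov(np.asarray(X, dtype=float), rowvar=False)  # uses 1/(n-1)
    diff = p - q
    return float(diff @ np.linalg.inv(sigma) @ diff)

# Hypothetical 2-dimensional data set; if Sigma were the identity matrix,
# the result would reduce to the squared Euclidean distance.
X = np.random.default_rng(0).normal(size=(100, 2))
print(mahalanobis([0.0, 0.0], [1.0, 1.0], X))
```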

Cosine Similarity

• If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 • d2) / ||d1|| ||d2|| ,

where • indicates vector dot product and || d || is the length of vector d
...
||d1|| = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 0.3150
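A minimal sketch of this computation, assuming NumPy. The vector d2 and the norms (42)^0.5 and (6)^0.5 follow the worked example above; d1 is a hypothetical vector chosen only so that ||d1|| = (42)^0.5, since the original vectors are not shown in this extract.

```python
import numpy as np

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    d1, d2 = np.asarray(d1, dtype=float), np.asarray(d2, dtype=float)
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]  # hypothetical: ||d1|| = sqrt(42) ≈ 6.481
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]  # from the example: ||d2|| = sqrt(6) ≈ 2.449
print(round(cosine_similarity(d1, d2), 4))  # ≈ 0.3150
```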




Similarity Between Binary Vectors
A common situation is that objects p and q have only binary attributes.
Compute similarities using the following quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1



Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attribute values
= (M11) / (M01 + M10 + M11)
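A minimal sketch that counts M01, M10, M00, M11 and applies both coefficients, assuming NumPy; the binary vectors p and q are hypothetical.

```python
import numpy as np

def smc_and_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for binary vectors."""
    p, q = np.asarray(p), np.asarray(q)
    m01 = int(np.sum((p == 0) & (q == 1)))
    m10 = int(np.sum((p == 1) & (q == 0)))
    m00 = int(np.sum((p == 0) & (q == 0)))
    m11 = int(np.sum((p == 1) & (q == 1)))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jaccard

# Hypothetical binary attribute vectors: many 0-0 matches, no 1-1 matches.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))  # SMC = 0.7, Jaccard = 0.0
```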

Correlation



Correlation measures the linear relationship between objects.
To compute correlation, we standardize data objects p and q and then take their dot product:
p k = ( p k − m e a n ( p ) ) / s td ( p )
q k = ( q k − m e a n ( q ) ) / s td ( q )
c o r re la tio n ( p , q ) = p  • q 
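A minimal sketch of this procedure, assuming NumPy and hypothetical data objects p and q. The notes give correlation(p, q) = p' • q'; the sketch divides that dot product by n − 1 (using the sample standard deviation) so the result matches the usual Pearson correlation returned by np.corrcoef.

```python
import numpy as np

def correlation(p, q):
    """Standardize p and q, then take their dot product (scaled by n - 1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    ps = (p - p.mean()) / p.std(ddof=1)
    qs = (q - q.mean()) / q.std(ddof=1)
    return float(ps @ qs) / (len(p) - 1)

# Hypothetical data objects.
p = [1.0, 2.0, 3.0, 4.0, 5.0]
q = [2.0, 4.0, 5.0, 4.0, 10.0]
print(correlation(p, q), np.corrcoef(p, q)[0, 1])  # both ≈ 0.843
```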

Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.