Search for notes by fellow students, in your own course and all over the country.
Browse our notes for titles which look like what you need, you can preview any of the notes via a sample of the contents. After you're happy these are the notes you're after simply pop them into your shopping cart.
Title: Comp3008 Machine Learning
Description: Notes for the comp3008 machine learning course at Southampton University
Description: Notes for the comp3008 machine learning course at Southampton University
Document Preview
Extracts from the notes are below, to see the PDF you'll receive please use the links above
Machine Learning
Comp3008
1
a)
Introductory Lecture
Course Concepts
Learning “rules” from data
More effective than building hard rules or expert systems
Newer approach to AI started in the 1980s
Scales much better and gives better results
Huge commercial applications
b)
Aims
Concentrate on a few representative techniques
Theoretical framework for understanding different techniques
maths: linear algebra, optimisation, probability, learning theory
Practical coursework
c)
Learning from data
Tradition AI was rule based but newer techniques are based on learning from data
Supervised Learning (labelled data)
Regression aka fitting functions
Classification
Unsupervised (finding structures in data)
Clustering (k-means)
Data Reduction (PCA)
Reinforcement Learning (game playing)
Data is typically high dimensional meaning it has many attributes or features
2
a)
Simple Classifiers
Classification
A standard task for machine learning
Typical scenario:
– some labelled data on which to train the data
– difficult to obtain the correct label
– we need a learning machine to perform the classification
Currently only concerned with binary classification
b)
K-Nearest Neighbours
Simple but effective
Compute the distance from a new data point to each of the old ones:
d x , x k =
∑ x −x
i
ki
2
i
Find the K-Nearest Neighbours
Assign the class of the new data point to the majority class of the nearest neighbours
Increasing K means outliers become less significant but takes more time to compute
Increasing K also has a smoothing effect which may not be desirable but works well in high
dimensions
KNN requires all features x i to be the same scale
Can be helpful to normalise features of input and sample data
Typically normalise each 'axis' to be between 0 and 1
Can sometimes be useful to look at the variance on an axis (see slides)
c)
Performance
Fast as there is no learning and it can be fairly accurate
Non-parametric algorithm
d)
Perceptron
Prediction given by f x , w where x is the input vector and w is 'weights' that we train
A classic parametric learning algorithm
Typically f x , w=w T x aka dotproduct of w and x
Also a response function (for the output) g x
Can have different response functions depending on the scenario:
– 'step' : for binary
– linear
– tanh
Perceptrons can be layered together so the output of one feeds as the input to another
Can use different types of inputs:
– local information in an image
– distance from 'centers' in the input space (RBF)
– Different polynormials (curve fitting)
– Layers of perceptrons (MLP)
– Eigenvectors of a kernel function (SVM)
One can also have multiple outputs the same set of input data
e)
Linear Seperability
The perceptron with a step function performs classification
Can be visualised as a decision boundary in input space
– in 2D this can be thought of as a straight line on one side one class sits, on the other side a
different class
The perceptron can only separate linear-separable inputs!
The perceptron is not powerful enough to do tasks such as parity (XOR) or connectedness
However, the world has a lot of linearly-seperable problems and these tend to be ones humans are
good at
This is because data is in very high dimensions and the linear function needed gives a lot of
flexibility
(see last years notes about perceptron learning)
3
a)
Regression
Function Fitting
The concept is to fit a function to a set of data points as a learning machine
Want to minimize the errors
Unfortunately there an infinite set of functions that can go through the set of data points
So one has to reduce the functions being tested
However, simple functions are often better, complex ones often over learn the data
b)
Least Squares Regression
D={ x k , y k } Data points
y predict = f x , w where w is the parameters to learn and x is the input vector to get a value for
Want to minimise the error compared to example data
k = f x k , w− y k this can be put into a vector
E x , w=∥∥2 E(x, w) is the objective function to minimise
c)
Linear Least Squares Regression
Simplest functions to learn are linear ones
Aim is to define a plane in space whereby the distance for each training point to this plane is
minimised
f x , w=x T w
k = x T w− y k
can be combined into an error vector using a matrix X concatenating column data vectors together
= X T w− y
E x , w=∥∥2 this is the total error distance to be minimised
E x , w= X T w− yT X T w− y
E x , w= X T w T − y T X T w− y
E x , w=w T X − y T X T w− y
E x , w=wT XX T w− y T X T w−w T Xy y T y
T
T
T
w Xy= y X w and is a number rather than matrix therefore
E x , w=wT XX T w−2wT Xy y T y
note this uses the trick of adding an extra feature
to cope with the bias
Minimum of this function (with respect to w) is when ∇ E x , w=0
∇ E x , w=∇∥∥2=∇∥X T w− y∥2=2 X T w− y
this would mean X T w= y but if X is MxN and M != N then this can't be solved directly
Therefore must find a more explicit route!
∇ E x , w=∇ wT XX T w−2wT Xy y T y
∇ w T XX T w= XX T XX T T w
T
T
T
T
∇ w XX w= XX XX w
T
T
T
∇ w XX w=2XX w
∇ 2w T Xy=2Xy
therefore
T
∇ E x , w=2 XX w− Xy
note due to E x , w=∥∥2 there is a single solution at Minima
0=2 XX T w− Xy
0= XX T w− Xy
T
Xy= XX w
XX T −1 Xy=w
This gives a direct solution for w in terms of the input and output values
d)
Theory
For square matrix X this is a one-to-one mapping such that:
y= X T w and w= X T −1 y
This is solvable and provides a good inverse mapping too
When we have more datapoints (examples) than parameters then generally we can not fit a curve
through every point so least squares provides a solution to this by minimising the errors
this is an over constrained problem
If we have less datapoints than parameters the problem is under-constrained and therefore there will
be many solutions of w for y= X T w and the problem is illposed
4
a)
Generalisation
Why
Have learnt how to make learning machines which learn from example data (a training set)
However we want the machine to work for unseen data aswell
This is known as the generalisation performance
Usually the best learning is not the best generalisation
b)
Generalisation Error
f(x) is the true function
p(x) is the probability of selecting x
Generalisation error defined as:
E g w=∫ f x , w − f x 2 p x dx i
...
the distance between the guess and the true function
times by the probability of it occurring for every x summed
However, this is usually impossible to compute as we don't know f(x) otherwise we wouldn't be
building a machine
It can be estimated though using some unseen data
T
unseen data T ={ x k , y k }
k =1
1
E g w≈E w∣T =
∑ f x k , w− y k 2 i
...
average error over the unseen set
T x , y ∈T
Important to TEST the generalisation performance using the validation set!
Can be used to help prevent over-fitting/over-learning of data which can easily occur in complex
functions as one tries unnecessarily to go through every point with the learnt function
k
c)
k
Ockham's Razor
Need to choose the simplest model that's not too simple
Principle referred to as Ockham's Razor
We should look for the simplest machine that can do the classification of data
We say that too complex machines over-fit the data
One way of formulating simplicity is Bias-Variance
d)
Bias-Variance
Bias-Variance analysis shows that the generalisation error can be seen as the sum of two terms:
A bias due to the model being too simple
A variance due to the model sensitivity to the data (over fitted)
More of a theoretical process
Notation Aside: 〈
...
e
...
e
...
e
...
Have to turn them into numbers
Binary categories => 0/1 encoding
Multiple categories lead to two possibilities:
n features where only one is 1 : used when there is no ordering in values
n values for 1 feature : used when there is ordering in values
c)
Normalising Data
Traditional vector length method can be used
Make each feature on a scale of 0 to 1
Normalise according to variance to the mean
x ki
Required for Support Vector Machines to work well
x ki − i
i
d)
Missing Data
Often the data sets will have some missing features
No one solution so it depends on the problem
– ignore missing examples
– replace missing features with mean of other datapoints
– flag examples and analyse later
– attempt to model the uncertainty probabilistically
e)
Feature Selection
Incorporating useless features can be counter productive
Often achieve better performance by removing features
However this can be difficult to work out in advance
Exists weighting schemes for features based on correlation
For audio and image signals the number of features are too large and MUST be reduced
Can reduce complexity by (for example) :
searching for features in images like lines
PCA
…
f)
Measuring Generalisation Performance
Want to measure the machine's performance on unseen data
Typically divide data in 3 sets:
– training set
– parameter set (no always needed)
– test set
These must be independent !!
Estimating generalisation performance is important for choosing parameters
Can be hard to split up data if its in limited supply
g)
Cross Validation
To estimate the generalisation error, perform the learning & testing multiple times on different parts
of the data set
Can be expensive to run the cross validation in terms of computation but not in terms of time and
gives fairly accurate results when there is little data
In K-fold cross-validation, the original sample is randomly partitioned into K subsamples
...
The cross-validation process is then
repeated K times (the folds), with each of the K subsamples used exactly once as the validation
data
...
The advantage of this method over repeated random sub-sampling is that all
observations are used for both training and validation, and each observation is used for validation
exactly once
...
h)
ROC Curves
Maps false positives again true positives
True Positives : where classifier is correct
False Positives : where classifier is wrong
A perfect classifier will have a flat (no gradient) curve
True Positive
True Positive
=
Positives
True Positive False Negative
False Positive
False Positive
False Negative Rate=1−specifity=
=
Negative
True NegativeFalse Positive
True Positive Rate=sensitivity=
6
a)
Principle Component Analysis
What is it
Uses eigenvector maths to find the principle components (directions) of data in order to reduce
feature size
Eigenvector for a matrix M are the vectors for which the matrix can be multiplied to which simple
scale them
...
A matrix will only scale its
eigenvectors
...
, x P − i
...
a Matrix formed of centred data points
P−1
C= XX T
The covariance matrix is symmetric semi-definite which gives it certain properties
X=
e)
Properties of covariance matrix
Quadratic form of a vector and matrix is defined as v T Mv
T
T
T
T
2
v Cv=v XX v =u u=∥u∥ ≥0 where u= X T v
Matrices with non-negative quadratic forms are known as positive semi-definite
Because its symmetric and square then for n features an nxn covariance matrix will be formed and n
real orthogonal eigenvectors with real eigenvalues will be found
A covariance matrix will have a 0 eigenvalue only if there is no variation in the direction of the
eigenvector
A covariance matrix will have 0 eigenvalues if there number of patterns is less than the number
dimensions
A covariance matrix formed of n+1 patterns with n dimensions will have no zero eigenvalues if the
patterns are linearly independent
Matrices with no 0 eigenvalues are full rank and are also positive definite so we would expect that
when P>n the covariance matrix will be positive definite
f)
Orthogonal Matrices
Matrix V is orthogonal iff V −1=V T
This means that V −1 V =V T V =Identity Matrix
g)
Matrix Decomposition
Let V be a matrix formed of eigenvectors if a matrix M
=diag 1,
...
e
...
Construct covariance matrix
2
...
Keep eigenvectors with largest eigenvalues (the principle components)
4
...
T
T
D=VS SV
So S T S becomes the diagonal matrix of eigenvalues of D
8
Matrix Math
A vector squared is its dotproduct with itself
v 2 =v⋅v=v T v
A transposed expression is the same as if each of its contents has been transposed
X T w− yT =w T X − y T
BAT =AT BT
f x
x1
∇ f x =
...
wikipedia
...
Note
the height of the functions is the weight
...
Illustration 3: Radial Basis
network
...
d)
Finding centers
Radial basis functions depend on the distance to centres
Finding these centres and how many are needed can be hard
They are normally found in clusters
e)
k-Means Clustering
Find k clusters in the data
Can be tricky to find the right k
Easy algorithm and popular
Takes a set of unlabelled data
D={x j } N
1
Algorithm:
1
...
Randomly assign each data point to a cluster C i
3
...
2
let x be the minima and h be our current guess
this says that if we're close enough to x (i
...
x-h is small) then higher terms can be ignored
Differentiating with respect to h gives the h which minimises this function
0= f ' x f ' ' x h
− f ' x
h=
f ' ' x
which suggests the iteration scheme
f 'x
x= x n−
f ' ' x
Converges quadratically if we start close enough to the minima
In higher dimensions:
x= x n−H −1 ∇ f x where H −1 is the hessian matrix of the function
c)
More gradient descent
x= x 0− ∇ f x 0 Need to choose a good learning rate/step size
Too small and it takes a lot of time
Too large and it can over shoot the minima
Illustration 5: When step size is too big, overshooting
One way is to find the which minimises
d)
f x 0− ∇ f x 0
Other algorithms
Conjuate gradient
Levenberg-Marquardt
12 Regularisation
a)
What is
One way to improve generalisation performance is to bias a learning machine to learn
simpler/smoother functions
This should help reduce the sensitivity of the learning machine on the learning data
To achieve this one can add regularisation terms that punish complex functions
b)
Example
Using linear-least squares regression
let be a matrix of input vectors, w be the vector of weights, y be a column vector of expected
outputs
E=∥T w− y∥2 vR w
R(w) is the regularisation function and v is its weight
E=∥T w− y∥2v∥w∥2 this is known as weight decay as it punishes functions with lots of
weights
...
This means although the result won't fit the test data so well it will be less sensitive
to the data
c)
Tikhonov Regularisation
More direct penalty for non-smooth functions
Looks at the second derivative of input patterns
P
N
2 f x p ; w
1
P∑∑
xi2
p=1 i=1
d)
2
(average second derivative over all input patterns)
Summary
In machine learning we care about unseen data performance
Over complicated machines don't give very good generalisation performance
Wish to tailor the complexity of the machine to the data set
One way to do this automatically is by introducing regularisation terms
13 Computational Learning Theory
a)
Classification
Can use different rules to categorise data
Most rules may be spurious (useless and false)
i
...
in separating red & green triangles & squares into 2 categories then the machine could
learn to separate on shape when its meant to separate on colour
Increasing the data set size will reduce the amount of spurious rules
But how much data is needed?
b)
PAC Learning
Probably Approximately Correct learning
Tries to answer the data-set size question
Only assumption is that the example data has the same probability distribution as the real data
More complex machines can find more complex rules so these need to be trimmed away
Probably : only as good as the examples shown
Approximately : want to give good predictions but not necessarily perfect
c)
Generalisation Error
The generalisation error for a hypothesis h(x) is
E h∣ p=P p h x !=c x where p is the probability distribution of x's and c(x) is the true
answer
Essentially its the probability of making an error
Input space
c(x)
h(x)
Overlap where prediction
is correct
d)
PAC-Learnable
A concept class C is PAC learn-able if for all concepts in it c ∈C and a probability distribution p
a learner choosing a hypothesis h ∈H can:
1
achieve a generalisation performance of E h∣ p for 0
2
1
with a probability of 1− and 0
2
using P training examples
1
1
where P is a polynomial in
and
The idea is to restrict the number of hypothesis's that can be learnt by eliminating those with a
generalisation error greater than
...
e)
Finite Consistent Hypothesis Spaces
Suppose the hypothesis space is finite and contains the true answer
Suppose our learning machine is able to correctly categorise all the training examples
E h∣ p=0
P
Consider examples D={ x k , c x k }
k =1
Consider a hypothesis with an errror E h∣ p this means that the probability of it getting all
the training examples correct is: 1−P
There will be k ∣H∣ hypothesises with E h∣ p
The probability of any of these being correct means marginalising over all the k hypothesis
P
P
− P
k 1− ∣H∣1− ≤∣H∣e
Using ≥∣H∣e− P the number of training examples needed to get a generalisation error better
than with a probability of 1− is given by:
− P
≥∣H∣e
ln −ln ∣H∣≥− P
ln ∣H∣−ln ≤ P
1
ln ∣H∣−ln ≤P
1
1
P ln∣H∣ln
Example:
for a hypothesis space of 2 30 and an error of 1% is required with a probability of
1
1
30
−6
1−10 then P= 0
...
e
...
∧x n
This gives a total of 2 n possible hypothesis
A more sensible hypothesis might combine different conjunctions
i
...
h x = x 1∧¬ x 4 ∧x 5∨ x 2∧¬ x 1∧¬ x 3
...
e
...
N
x
N
L x , = f x −∑ k g k x
k=1
this means that
N
∇ x f x =∑ k ∇ x g k x
k=1
Using the other conditions, one solves the simultaneous equations for the system
c)
Inequality constraints
min f xsubject to g x ≥0
x
Two things can happen:
– minimum of f(x) satisfies g x0 giving an unconstrained optimisation problem (finding
a point in a region)
– OR minimum of f(x) satisfies g x=0 giving a constrained problem (finding a point a
long some boundary)
Create L x , = f x− g x and minimise as normal
if =0 then solution is in the region else if 0 then it lies on a boundary created by the
function g x=0
Known as KKT conditions
15 Support Vector Machines
a)
What are they
Relatively new learning machine type
Get very competitive results on many problems
Many different forms (classification, regression etc
...
P
∥w∥
w
b
Dividing through by m, defining w ' =
and b ' =
we get the quadratic programming
m∥w∥
m
problem:
∥w '∥2
T
min
subjectto y k w ' x k −b ' ≥1 ∀ k =1
...
P
k
2 k , l=1
k=1
k =1
This is also a quadratic programming problem
However its a dual form and depends on the number of training examples (y's) rather than the
number of dimensions being searched
This is because the x T x l is a dot product and just a number
k
It can be replaced by a Kernel if a search in higher dimensional space is required
P
P
1
max ∑ k − ∑ k l y k y l K x k , x l
2 k , l=1
k=1
16 Kernel Trick
a)
Kernels
Functions of two variables that can be factorised
K x , y= x T y
Where x T = 1 x , 2 x ,
...
More on that later
Rather than working with our inputs, we can work in a transformed feature space which is extended
However, we don't have to explicitly calculate this transformed/higher feature space due to the
kernel factorisation
P
P
1
max ∑ k − ∑ k l y k y l K x k , x l
2 k , l=1
k=1
That is the kernel trick
b)
Factorisation
Sometimes using the factored kernel is easier:
2
Let x = x1, x 2 2 x 1 x 2
2,
Kernel is K x , y= x T y
2
K x , y= x1 y 2 x 2 y 22 x1 x 2 y 1 y 2
1
2 2
K x , y= x 1 y 1 x 2 y 2
T
2
2
K x , y= x y
In 2D this kernel projects the data into a 3D space and allows for a measure of distance to be found
in the 2D space
c)
Eigenfunctions and Kernels
Any function can be made from the sum of its eigen functions and values
Mercers theorem:
K x , y=∑ i i x i y
i
Let i x = i i x so
K x , y=∑ i x i y but for x to be real valued,
i
i
i≥0 ∀ i otherwise the 'distance' between points in the extended space can be negative
K x , y=∑ i x i y
i
This is the same as saying:
T
K x , y= x y
So if you like a kernel is the sum of multiple eigenfunctions and their eigenvalues
For SVMs to work the kernels must be projecting into euclidean space so that distances are always
positive
Therefore we must use Positive Semi-definite Kernels ( i≥0 ∀ i ) which can always be
decomposed into as sum of positive functions
K x , y=∑ i x i y
i
d)
Integratal of Kernels
K x , y=∑ i x i y And is positive semi-definite
i
∫∫ f x K x , y f y dx dy≥0
∫∫ f x ∑ i x i y f y dx dy
i
∑ ∫ f xi x dx ∫ i y f y dy
i
∫ f xi x dx=∫ i y f y dy therefore
2
2
∑ ∫ f x i x dx =∑ ∫ i y f y dy ≥0
i
e)
i
Combining Kernels
Adding
K 1 x , y and K 2 x , y are valid kernels
So is K 3 x , y =K 1 x , yK 2 x , y
Multiplied
K 3 x , y =K 1 x , y∗K 2 x , y is valid
n
K 2 x , y =K 1 x , y is valid too
Exponential
K 2 x , y =e K x , y is valid
1
An important kernel is :
2
K x , y =e
−∥x− y∥
2
As it a an infinite number of eigenvalues so if you like its projecting the data into an infinite
dimensional space
...
e
...
e
...
H n
To make the decision we generate some data D
P ( A∣ B)
P( D∣ H ) P( H )
P ( D)
i
...
the probability of the hypothesis occurring given the data is the probability of the data and
hypothesis occurring over the probability of the data occurring
P (H ∣ D)=
P ( H ∣ D) is the posterior probability (the probability of H after we know the data)
P ( D ∣ H ) is the likelihood of the data given the hypothesis's
P ( H ) is the prior probability (before we know the data)
P ( D) is the probability of the evidence
We want to know P (H ∣ D) i
...
the probability of what happened given some evidence
Bayesian converts this to a forward problem P ( D ∣ H ) where we calculate the likelihood of the
data using that hypothesis but we need to know a prior P (H )
P ( D)=∑ P( D , H i )=∑ P ( D∣ H i ) P( H i)
i
i
Example
Have a machine that produced binary digits with a given probability
...
D=[0,0,1,0,1]
H 1=0
...
6 the hypothesis of the probability of getting a 0 bit
1
1
P ( H 1)= ∣ P (H 2 )
equal probability of each hypothesis
2
2
P ( D ∣ H 1)=0
...
2∗0
...
2∗0
...
00064
P ( D ∣ H 2 )=0
...
6∗0
...
6∗0
...
03456
P (D)=P ( D∣ H 1 ) P (H 1 )+ P ( D ∣ H 2 ) P (H 2)=0
...
5+ 0
...
5=0
...
00064∗0
...
01818
P ( D)
0
...
03456∗0
...
98181
P (D)
0
...
6 for a 0
being produced
d)
Using densities
Probability densities can be used with this model too
This is where the probability of an event p taking a value x is
P ( p= x)= f ( x) where f(x) is a density function
P ( p= x ∣ D)= f ( x ∣ D)
P ( D ∣ x) f ( x )
f ( x ∣ D)=
P (D)
The probability of the data (evidence) is calculated by integrating over the whole probability
function
1
1
P ( D)=∫ P ( D∣ p= x) P ( p= x) dx=∫ P ( D∣ p=x) f ( x ) dx
0
0
e)
Chaining
The posterior of one calculation can be used as the prior of another
i
...
P ( X 1 ∣ x) f ( x )
f ( x ∣ X 1 )=
P ( X 1)
P ( X 1, X 2 ∣ x ) f ( x ∣ X 1)
f (x ∣ X 1, X 2)=
P ( X 1, X 2 )
…
P( X 1
...
X n−1 )
f ( x ∣ X 1
...
X n)
f)
MAP
Sometimes the full Bayes formula can be hard to use
Therefore Maximum a Posteriori techniques try to find the hypothesis which maximises the
posterior
This involves maximising the likelihood and the prior
g)
Maximum Likelihood
Let the prior be uniform then we want to maximise the likelihood
i
...
H =argmax P ( D ∣ h)
h
If there are a set of independent observations :
D=( X 1, X 2,
...
e
...
w n 〉
P 〈 w1
...
wn 〉 =
P 〈 w1
...
w n 〉=P 〈w1
...
wn 〉∣¬ Spam
P Spam can be estimated using statistics
However, calculating P 〈w1
...
Naïve Bayes makes an independence assumption:
n
P 〈w1
...
e
Title: Comp3008 Machine Learning
Description: Notes for the comp3008 machine learning course at Southampton University
Description: Notes for the comp3008 machine learning course at Southampton University