Journal of Machine Learning Research 15 (2014) 1929-1958
Submitted 11/13; Published 6/14
Dropout: A Simple Way to Prevent Neural Networks from
Overfitting
Nitish Srivastava                     nitish@cs.toronto.edu
Geoffrey Hinton                       hinton@cs.toronto.edu
Alex Krizhevsky                       kriz@cs.toronto.edu
Ilya Sutskever                        ilya@cs.toronto.edu
Ruslan Salakhutdinov                  rsalakhu@cs.toronto.edu
Department of Computer Science
University of Toronto
10 Kings College Road, Rm 3302
Toronto, Ontario, M5S 3G4, Canada
...
However, overfitting is a serious problem in such networks
...
Dropout is a technique for addressing this problem
...
This prevents units from co-adapting too much
...
At test time,
it is easy to approximate the effect of averaging the predictions of all these thinned networks
by simply using a single unthinned network that has smaller weights
...
We
show that dropout improves the performance of neural networks on supervised learning
tasks in vision, speech recognition, document classification and computational biology,
obtaining state-of-the-art results on many benchmark data sets
...
Introduction
Deep neural networks contain multiple non-linear hidden layers and this makes them very
expressive models that can learn very complicated relationships between their inputs and
outputs
...
This leads to overfitting and many
methods have been developed for reducing it
...
With unlimited computation, the best way to “regularize” a fixed-sized model is to
average the predictions of all possible settings of the parameters, weighting each setting by
its posterior probability given the training data.
...
Figure 1: Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right: An example of a thinned net produced by applying dropout to the network on the left. Crossed units have been dropped.
...
, 2011; Salakhutdinov and Mnih, 2008), but we
would like to approach the performance of the Bayesian gold standard using considerably
less computation
...
Model combination nearly always improves the performance of machine learning methods
...
Combining several models is most
helpful when the individual models are different from each other and in order to make
neural net models different, they should either have different architectures or be trained
on different data
...
Moreover, large networks normally require large amounts of
training data and there may not be enough data available to train different networks on
different subsets of the data
...
Dropout is a technique that addresses both these issues
...
The term “dropout” refers to dropping out units (hidden and
visible) in a neural network
...
The choice of which units to drop is random. In the simplest case, each unit is retained with
a fixed probability p independent of other units, where p can be chosen using a validation set
or can simply be set at 0.5, which seems to be close to optimal for a wide range of
networks and tasks. For the input units, however, the optimal probability of retention is
usually closer to 1 than to 0.5.
...
Figure 2: Left: A unit at training time that is present with probability p and is connected to
units in the next layer with weights w. Right: At test time, the unit is always present and
the weights are multiplied by p.
...
Applying dropout to a neural network amounts to sampling a “thinned” network from
it
...
A
neural net with n units can be seen as a collection of 2^n possible thinned neural networks
...
For each presentation of each training case, a new thinned network is sampled and
trained
...
At test time, it is not feasible to explicitly average the predictions from exponentially
many thinned models
...
The idea is to use a single neural net at test time without dropout
...
If a unit is retained with
probability p during training, the outgoing weights of that unit are multiplied by p at test
time as shown in Figure 2
...
By doing this scaling, 2^n networks with shared weights can be combined into
a single neural network to be used at test time
...
The idea of dropout is not limited to feed-forward neural nets
...
In this paper, we introduce
the dropout Restricted Boltzmann Machine model and compare it to standard Restricted
Boltzmann Machines (RBM)
...
This paper is structured as follows
...
Section 3 describes relevant previous work
...
Section 5 gives an algorithm for training dropout networks
...
Section 7 analyzes the effect of
dropout on different properties of a neural network and describes how dropout interacts with
the network’s hyperparameters
...
In Section 9
we explore the idea of marginalizing dropout
...
This includes a detailed analysis of the practical considerations
involved in choosing hyperparameters when training dropout networks
...
Motivation
A motivation for dropout comes from a theory of the role of sex in evolution (Livnat et al
...
Sexual reproduction involves taking half the genes of one parent and half of the
other, adding a very small amount of random mutation, and combining them to produce an
offspring
...
It seems plausible that asexual reproduction should be a better way to
optimize individual fitness because a good set of genes that have come to work well together
can be passed on directly to the offspring
...
However, sexual reproduction is the way most advanced organisms evolved
...
The ability of a set of genes to be able to work well with another random set of
genes makes them more robust
...
According to this theory, the role of sexual reproduction
is not just to allow useful new genes to spread throughout the population, but also to
facilitate this process by reducing complex co-adaptations that would reduce the chance of
a new gene improving the fitness of an individual
...
This should make each hidden unit more robust and drive it towards creating useful
features on its own without relying on other hidden units to correct its mistakes
...
One
might imagine that the net would become robust against dropout by making many copies
of each hidden unit, but this is a poor solution for exactly the same reason as replica codes
are a poor way to deal with a noisy channel
...
Ten conspiracies each involving five people is probably a
better way to create havoc than one big conspiracy that requires fifty people to all play
their parts correctly
...
Complex co-adaptations can be trained to work well
on a training set, but on novel test data they are far more likely to fail than multiple simpler
co-adaptations that achieve the same thing
...
Related Work
Dropout can be interpreted as a way of regularizing a neural network by adding noise to
its hidden units
...
(2008, 2010) where noise
is added to the input units of an autoencoder and the network is trained to reconstruct the
noise-free input
...
We also show that adding noise is not only useful for unsupervised feature
learning but can also be extended to supervised learning problems
...
While
5% noise typically works best for DAEs, we found that our weight scaling procedure applied
at test time enables us to use much higher noise levels
...
Since dropout can be seen as a stochastic regularization technique, it is natural to
consider its deterministic counterpart which is obtained by marginalizing out the noise
...
Recently, van der Maaten et al
...
However, they
apply noise to the inputs and only explore models with no hidden layers
...
Chen
et al
...
In dropout, we minimize the loss function stochastically under a noise distribution
...
Previous work of Globerson and
Roweis (2006); Dekel et al
...
Here, instead of a noise distribution,
the maximum number of units that can be dropped is fixed
...
4. Model Description
Consider a neural network with L hidden layers. Let l ∈ {1, . . . , L} index the hidden layers
of the network. Let z^(l) denote the vector of inputs into layer l and y^(l) denote the vector
of outputs from layer l. W^(l) and b^(l) are the weights and biases at layer l. The feed-forward
operation of a standard neural network (Figure 3a) can be described as (for l ∈ {0, . . . , L − 1}
and any hidden unit i)

    z_i^(l+1) = w_i^(l+1) · y^(l) + b_i^(l+1),
    y_i^(l+1) = f(z_i^(l+1)),

where f is any activation function, for example, f(x) = 1/(1 + exp(−x)).
Figure 3: Comparison of the basic operations of a standard network (a) and a dropout network (b).
...
With dropout, the feed-forward operation becomes (Figure 3b)

    r_j^(l) ∼ Bernoulli(p),
    ỹ^(l) = r^(l) ∗ y^(l),
    z_i^(l+1) = w_i^(l+1) · ỹ^(l) + b_i^(l+1),
    y_i^(l+1) = f(z_i^(l+1)).

Here ∗ denotes an element-wise product. For any layer l, r^(l) is a vector of independent
Bernoulli random variables, each of which has probability p of being 1. This vector is sampled
and multiplied element-wise with the outputs of that layer, y^(l), to create the thinned outputs
ỹ^(l). The thinned outputs are then used as input to the next layer. This amounts to sampling
a sub-network from a larger network. At test time, the weights are scaled as W_test^(l) = pW^(l),
as shown in Figure 2.
...
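To make the train/test asymmetry concrete, the following minimal NumPy sketch implements one dropout layer at training time and the corresponding weight-scaled layer at test time. It is illustrative only, not the paper's implementation; the ReLU activation, the layer sizes and p = 0.5 are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_layer_train(y_prev, W, b, p, rng):
        """Training time: thin the incoming activations with a Bernoulli(p)
        mask, then apply the usual affine map and nonlinearity."""
        r = rng.binomial(1, p, size=y_prev.shape)   # r^(l) ~ Bernoulli(p)
        y_thin = r * y_prev                         # element-wise thinning
        z = y_thin @ W + b
        return np.maximum(z, 0.0)                   # f = ReLU (illustrative choice)

    def dropout_layer_test(y_prev, W, b, p):
        """Test time: no units are dropped; the weights are scaled by p,
        i.e. W_test = p * W, so expected pre-activations match training."""
        z = y_prev @ (p * W) + b
        return np.maximum(z, 0.0)

    # toy usage: a batch of 4 inputs through one 8 -> 16 layer with p = 0.5
    x = rng.standard_normal((4, 8))
    W, b = 0.1 * rng.standard_normal((8, 16)), np.zeros(16)
    h_train = dropout_layer_train(x, W, b, p=0.5, rng=rng)
    h_test = dropout_layer_test(x, W, b, p=0.5)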
5. Learning Dropout Nets

5.1 Backpropagation
Dropout neural networks can be trained using stochastic gradient descent in a manner similar
to standard neural nets.
The only difference is that for each training case in a mini-batch,
we sample a thinned network by dropping out units
...
The gradients for each parameter are
averaged over the training cases in each mini-batch
...
Many methods have been used
to improve stochastic gradient descent such as momentum, annealed learning rates and L2
weight decay
...
One particular form of regularization was found to be especially useful for dropout—
constraining the norm of the incoming weight vector at each hidden unit to be upper
bounded by a fixed constant c
...
This
constraint was imposed during optimization by projecting w onto the surface of a ball of
radius c, whenever w went out of it
...
The constant
c is a tunable hyperparameter, which is determined using a validation set
...
It typically improves the performance of stochastic gradient descent
training of deep neural nets, even when no dropout is used
...
A possible justification is that constraining weight vectors
to lie inside a ball of fixed radius makes it possible to use a huge learning rate without the
possibility of weights blowing up
...
As the learning rate decays, the optimization takes shorter steps, thereby
doing less exploration and eventually settles into a minimum
...
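The sketch below combines the pieces described in this section for one mini-batch update of a single softmax layer: a fresh Bernoulli mask per training case, gradient averaging over the mini-batch, momentum SGD, and projection of each unit's incoming weight vector back onto a ball of radius c. It is a sketch under assumed hyperparameter values, not the authors' code.

    import numpy as np

    def sgd_step_with_dropout(W, b, vW, vb, X, T, p=0.5, lr=0.1, mom=0.95, c=3.0,
                              rng=np.random.default_rng(0)):
        """One momentum-SGD step for a softmax layer trained with dropout.
        Every row of X gets its own Bernoulli(p) mask; gradients are averaged
        over the mini-batch; columns of W (the incoming weights of each output
        unit) are projected onto the max-norm ball of radius c."""
        R = rng.binomial(1, p, size=X.shape)            # one thinned net per case
        Xd = R * X
        logits = Xd @ W + b
        logits -= logits.max(axis=1, keepdims=True)     # numerically stable softmax
        P = np.exp(logits); P /= P.sum(axis=1, keepdims=True)
        G = (P - T) / X.shape[0]                        # mean cross-entropy gradient
        gW, gb = Xd.T @ G, G.sum(axis=0)
        vW = mom * vW - lr * gW                         # momentum update
        vb = mom * vb - lr * gb
        W, b = W + vW, b + vb
        norms = np.linalg.norm(W, axis=0, keepdims=True)        # incoming-weight norms
        W = W * np.minimum(1.0, c / np.maximum(norms, 1e-12))   # max-norm projection
        return W, b, vW, vb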
5.2 Unsupervised Pretraining
Neural networks can be pretrained using stacks of RBMs (Hinton and Salakhutdinov, 2006),
autoencoders (Vincent et al
...
Pretraining is an effective way of making use of unlabeled data
...
Dropout can be applied to finetune nets that have been pretrained using these techniques
...
The weights obtained from pretraining
should be scaled up by a factor of 1/p
...
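As a small illustration of this 1/p rescaling, assuming the pretrained weights are stored as NumPy arrays keyed by layer name (the keys and values below are hypothetical):

    import numpy as np

    def rescale_pretrained_weights(pretrained, p_retain):
        """Scale each pretrained weight matrix by 1/p before dropout finetuning,
        so the expected input to a unit under dropout matches what it saw during
        pretraining. `pretrained` maps layer name -> W; `p_retain` maps layer
        name -> retention probability p."""
        return {name: W / p_retain[name] for name, W in pretrained.items()}

    # hypothetical usage
    pretrained = {"h1": np.ones((784, 500)), "h2": np.ones((500, 500))}
    scaled = rescale_pretrained_weights(pretrained, {"h1": 0.8, "h2": 0.5})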
We were initially concerned that the stochastic nature of dropout might wipe out the information in the pretrained weights
...
However, when the learning rates were chosen to be smaller, the information in the pretrained
weights seemed to be retained and we were able to get improvements in terms of the final
generalization error compared to not using dropout when finetuning
...
Experimental Results
We trained dropout neural networks for classification problems on data sets in different
domains
...
Table 1 gives a brief description of
the data sets
...
• TIMIT : A standard speech benchmark for clean speech recognition
...
• Street View House Numbers data set (SVHN) : Images of house numbers collected by
Google Street View (Netzer et al
...
• ImageNet : A large collection of natural images
...
• Alternative Splicing data set: RNA features for predicting alternative gene splicing
(Xiong et al
...
We chose a diverse set of data sets to demonstrate that dropout is a general technique
for improving neural nets and is not specific to any particular application domain
...
A more detailed
description of all the experiments and data sets is provided in Appendix B
...
6.1 Results on Image Data Sets
These data sets include different image types and training set sizes
...
6.1.1 MNIST
Method                                                        Unit Type   Architecture               Error %
Standard Neural Net (Simard et al., 2003)                     Logistic    2 layers, 800 units        1.60
SVM Gaussian kernel                                           NA          NA                         1.40
Dropout NN                                                    Logistic    3 layers, 1024 units       1.35
Dropout NN                                                    ReLU        3 layers, 1024 units       1.25
Dropout NN + max-norm constraint                              ReLU        3 layers, 1024 units       1.06
Dropout NN + max-norm constraint                              ReLU        3 layers, 2048 units       1.04
Dropout NN + max-norm constraint                              ReLU        2 layers, 4096 units       1.01
Dropout NN + max-norm constraint                              ReLU        2 layers, 8192 units       0.95
Dropout NN + max-norm constraint (Goodfellow et al., 2013)    Maxout      2 layers, (5 × 240) units  0.94
DBN + finetuning (Hinton and Salakhutdinov, 2006)             Logistic    500-500-2000               1.18
DBM + finetuning (Salakhutdinov and Hinton, 2009)             Logistic    500-500-2000               0.96
DBN + dropout finetuning                                      Logistic    500-500-2000               0.92
DBM + dropout finetuning                                      Logistic    500-500-2000               0.79

Table 2: Comparison of different models on MNIST.
The MNIST data set consists of 28 × 28 pixel handwritten digit images
...
Table 2 compares the performance of dropout
with other techniques
...
The best performing neural networks for the permutation invariant setting that do not use
dropout or unsupervised pretraining achieve an error of about 1.60% (Simard et al., 2003).
With dropout the error reduces to 1.35%. Replacing logistic
units with rectified linear units (ReLUs) (Jarrett et al., 2009) further reduces the error to
1.25%. Adding max-norm regularization reduces it to 1.06%. A neural net with 2 layers and
8192 units per layer gets down to 0.95% error. Note that this network has more than
65 million parameters and is being trained on a data set of size 60,000.
...
Dropout, on the other hand, prevents overfitting, even in this case.
...
Goodfellow et al. (2013) showed that the error can be further reduced to
0.94% by replacing ReLU units with maxout units. All dropout nets use p = 0.5 for hidden
units and p = 0.8 for input units.
...
More experimental details can be found in Appendix B.
...
Dropout nets pretrained with stacks of RBMs and Deep Boltzmann Machines also give
improvements as shown in Table 2. The DBM-pretrained dropout net achieves a test error of
0.79%, which is the best performance ever reported for the permutation invariant setting.
We demonstrate the effectiveness of dropout in that setting on more interesting data
sets
...
In order to test the robustness of dropout, classification experiments were
done with networks of many different architectures keeping all hyperparameters, in-
cluding p, fixed. Figure 4 shows the test
error rates obtained for these different ar-
chitectures as training progresses. The
same architectures trained with and with-
out dropout have drastically different test
errors, as seen by the two separate clus-
ters of trajectories. Dropout gives a huge
improvement across all architectures, with-
out using hyperparameters that were tuned
specifically for each architecture.

Figure 4: Test error for different architectures with and without dropout (x-axis: number of
weight updates). The networks have 2 to 4 hidden layers each with 1024 to 2048 units.

6.1.2 Street View House Numbers
, 2011) consists of
color images of house numbers collected by
Google Street View
...
The
part of the data set that we use in our experiments consists of 32 × 32 color images roughly
centered on a digit in a house number
...
For this data set, we applied dropout to convolutional neural networks (LeCun et al
...
The best architecture that we found has three convolutional layers followed by 2
fully connected hidden layers
...
Each convolutional layer was followed by a max-pooling layer. Appendix B.2 describes the
architecture in more detail.

Method                                                                   Error %
Binary Features (WDCH) (Netzer et al., 2011)                             36.7
HOG (Netzer et al., 2011)                                                15.0
Stacked Sparse Autoencoders (Netzer et al., 2011)                        10.3
KMeans (Netzer et al., 2011)                                              9.4
Multi-stage Conv Net with average pooling (Sermanet et al., 2012)         9.06
Multi-stage Conv Net + L2 pooling (Sermanet et al., 2012)                 5.36
Multi-stage Conv Net + L4 pooling + padding (Sermanet et al., 2012)       4.90
Conv Net + max-pooling                                                    3.95
Conv Net + max pooling + dropout in fully connected layers                3.02
Conv Net + max pooling + dropout in all layers                            2.55
Conv Net + maxout (Goodfellow et al., 2013)                               2.47
Human Performance                                                         2.0

Table 3: Results on the Street View House Numbers data set.
...
Dropout was applied to all the layers of the network with the probability of retaining a unit
being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the different layers of the network (going from the
input to convolutional layers to fully connected layers).
...
Max-norm regularization was
used for weights in both convolutional and fully connected layers
...
We find that convolutional nets outperform other
methods
...
The best performing convolutional nets that do not use dropout achieve an error rate of
3.95%. Adding dropout only to the fully connected layers reduces the error to 3.02%.
Adding dropout to the convolutional layers as well further reduces the error to 2.55%.
...
The additional gain in performance obtained by adding dropout in the convolutional
layers (3.02% to 2.55%) is worth noting
...
However, dropout in the lower layers still helps because it provides noisy inputs for the higher fully connected layers which prevents them
from overfitting
...
6.1.3 CIFAR-10 and CIFAR-100
Figure 5b shows some examples of images from this data
set
...
Without any data augmentation, Snoek et al. (2012) obtained an error rate of 14.98% on
CIFAR-10. Using dropout in the fully connected layers reduces that to 14.32% and adding
dropout in every layer further reduces the error to 12.61%. Goodfellow et al. (2013) showed
that the error is further reduced to 11.68% by replacing ReLU units with maxout units.
...
On CIFAR-100, dropout reduces the error from 43.48% to 37.20%.
...
No data augmentation was used for either data set (apart from the input dropout)
...
Figure 5: Example images from the data sets. Each row corresponds to a different category.

Method                                                          CIFAR-10   CIFAR-100
Conv Net + max pooling (hand tuned)                               15.60      43.48
Conv Net + max pooling (Snoek et al., 2012)                       14.98        -
Conv Net + max pooling + dropout fully connected layers           14.32      41.26
Conv Net + max pooling + dropout in all layers                    12.61      37.20
Conv Net + maxout (Goodfellow et al., 2013)                       11.68      38.57

Table 4: Error rates on CIFAR-10 and CIFAR-100.
...
6.1.4 ImageNet
...
Starting in 2010, as part of the Pascal Visual Object Challenge, an annual
competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has
been held
...
Since the number of categories is rather large, it is conventional to
report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test
images for which the correct label is not among the five labels considered most probable by
the model
...
ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so
most of our experiments were performed on this data set
...
Convolutional nets with dropout outperform other methods by a large
margin
...
Figure 6: Some ImageNet test cases with the most probable labels as predicted by the model
of Krizhevsky et al. (2012). The length of the horizontal bars is proportional to the
probability assigned to the labels by the model.
...
Model                                                   Top-1   Top-5
Sparse Coding (Lin et al., 2010)                        47.1    28.2
SIFT + Fisher Vectors (Sánchez and Perronnin, 2011)     45.7    25.7
Conv Net + dropout (Krizhevsky et al., 2012)            37.5    17.0

Table 5: Results on the ILSVRC-2010 test set.
...
Model                                                      Top-1 (val)  Top-5 (val)  Top-5 (test)
SVM on Fisher Vectors of Dense SIFT and Color Statistics       -            -          27.3
Avg of classifiers over FVs of SIFT and color statistics       -            -          26.2
Conv Net + dropout (Krizhevsky et al., 2012)                  40.7         18.2          -
Avg of 5 Conv Nets + dropout (Krizhevsky et al., 2012)        38.1         16.4         16.4

Table 6: Results on the ILSVRC-2012 validation/test set.
...
Since the labels for the test set are not available, we report our results on the test set for
the final submission and include the validation set results for different variations of our
model
...
While the best methods based on
standard vision features achieve a top-5 error rate of about 26%, convolutional nets with
dropout achieve a test error of about 16% which is a staggering difference
...
We can see that the model makes very
reasonable predictions, even when its best guess is not correct
...
6.2 Results on TIMIT
Next, we applied dropout to a speech recognition task
...
Dropout
neural networks were trained on windows of 21 log-filter bank frames to predict the label
of the central frame
...
Appendix B
...
Table 7 compares dropout neural
nets with other models
...
A 6-layer dropout neural net gives a phone error rate of 21.8%, compared to 23.4% for the
corresponding standard neural net.
...
A
4-layer net pretrained with a stack of RBMs gets a phone error rate of 22.7%. With dropout,
this reduces to 19.7%. Similarly, for an 8-layer net the error reduces from 20.5% to 19.7%.
...
Method                                                          Phone Error Rate %
NN (6 layers) (Mohamed et al., 2010)                            23.4
Dropout NN (6 layers)                                           21.8
DBN-pretrained NN (4 layers)                                    22.7
DBN-pretrained NN (6 layers) (Mohamed et al., 2010)             22.4
DBN-pretrained NN (8 layers)                                    20.7
mcRBM-DBN-pretrained NN (5 layers) (Dahl et al., 2010)          20.5
DBN-pretrained NN (4 layers) + dropout                          19.7
DBN-pretrained NN (8 layers) + dropout                          19.7

Table 7: Phone error rate on the TIMIT core test set.
6.3 Results on a Text Data Set
We used a subset of the Reuters-RCV1 data set which is a collection of
over 800,000 newswire articles from Reuters
...
The
task is to take a bag of words representation of a document and classify it into 50 disjoint
topics
...
Appendix B.5 describes the setup in more detail. A neural network without dropout obtained
an error rate of 31.05%. Adding dropout reduced the error to 29.62%.
...
6.4 Comparison with Bayesian Neural Networks
Dropout can be seen as a way of doing an equally-weighted averaging of exponentially many
models with shared weights.
On the other hand, Bayesian neural networks (Neal, 1996) are
the proper way of doing model averaging over the space of neural network structures and
parameters
...
Bayesian neural nets are extremely useful for
solving problems in domains where data is scarce such as medical diagnosis, genetics, drug
discovery and other computational biology applications
...
Besides, it is expensive to
get predictions from many large nets at test time
...
In this section, we report experiments that
compare Bayesian neural nets with dropout neural nets on a small data set where Bayesian
neural networks are known to perform well and obtain state-of-the-art results
...
The data set that we use (Xiong et al
...
The
task is to predict the occurrence of alternative splicing based on RNA features
...
Predicting the
...

Method                                                   Code Quality (bits)
Neural Network (early stopping) (Xiong et al., 2011)     440
Regression, PCA (Xiong et al., 2011)                     463
SVM, PCA (Xiong et al., 2011)                            487
Neural Network with dropout                              567
Bayesian Neural Network (Xiong et al., 2011)             623

Table 8: Results on the Alternative Splicing Data Set.
...
Given the RNA features, the task is to predict the
probability of three splicing related events that biologists care about
...
Appendix B
...
Table 8 summarizes the performance of different models on this data set
...
(2011) used Bayesian neural nets for this task
...
However, we see that dropout improves significantly
upon the performance of standard neural nets and outperforms all other methods
...
One way to prevent overfitting is to reduce the input dimensionality using PCA
...
However, with dropout
we were able to prevent overfitting without the need to do dimensionality reduction
...
This shows that dropout has a strong regularizing effect
...
6.5 Comparison with Standard Regularizers
Several regularization methods have been proposed for preventing overfitting in neural networks
...
Dropout can
be seen as another way of regularizing neural networks
...
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained using stochastic gradient descent with different regularizations
...
The values of different hyperparameters associated with each kind of regularization (decay
constants, target sparsity, dropout rate, max-norm upper bound) were obtained using a
validation set
...
7. Salient Features
In this section, we closely examine
how dropout affects a neural network
...
We see how dropout affects the sparsity of hidden unit activations
...
Method                                          Test Classification error %
L2                                              1.62
L2 + L1 applied towards the end of training     1.60
L2 + KL-sparsity                                1.55
Max-norm                                        1.35
Dropout + L2                                    1.25
Dropout + Max-norm                              1.05

Table 9: Comparison of different regularization methods on MNIST.

We
also see how the advantages obtained from dropout vary with the probability of retaining
units, size of the network and the size of the training set
...
7.1 Effect on Features
...
In a standard neural network, the derivative received by each parameter tells it how it
should change so the final loss function is reduced, given what all other units are doing
...
This may lead to complex co-adaptations
...
We hypothesize that for each hidden unit,
dropout prevents co-adaptation by making the presence of other hidden units unreliable
...
It must
perform well in a wide variety of different contexts provided by the other hidden units
...
Figure 7a shows features learned by an autoencoder on MNIST with a single hidden
layer of 256 rectified linear units without dropout
...
Figure 7b shows the features learned by an identical autoencoder which used dropout in the
hidden layer with p = 0.5.
...
However, it is apparent that the features
shown in Figure 7a have co-adapted in order to produce good reconstructions
...
On the other hand, in
Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the
image
...
7.2 Effect on Sparsity
...
Figure 8: Effect of dropout on sparsity. ReLUs were used for both models. Left (no dropout):
most units have a mean activation of about 2.0, and a large fraction of
units have high activation. Right (dropout with p = 0.5): most units have a mean activation
of about 0.7, and very few units have high activation.
...
Thus, dropout automatically leads to sparse representations
...
Figure 8a and Figure 8b compare the sparsity for
the two models
...
Moreover, the average activation of any unit across data cases should
be low
...
For each
model, the histogram on the left shows the distribution of mean activations of hidden units
across the minibatch
...
Comparing the histograms of activations we can see that fewer hidden units have high
activations in Figure 8b compared to Figure 8a, as seen by the significant mass away from
zero for the net that does not use dropout
...
The overall mean activation of hidden units is close to 2.0 for the autoencoder without
dropout but drops to around 0.7 when dropout is used
...
7.3 Effect of Dropout Rate
Dropout has a tunable hyperparameter p (the probability of retaining a unit in the network)
...
The comparison is
done in two situations
...
1. The number of hidden units is held constant.
...
2. The number of hidden units is changed so that the expected number of hidden units
that will be retained after dropout is held constant.
...
We use a 784-2048-2048-2048-10 architecture
...
Figure 9a shows the test error obtained as a function of p
...
It can be seen that this
has led to underfitting since the training error is also high
...
It becomes flat when 0.4 ≤ p ≤ 0.8 and then increases as p becomes close
to 1
...
Figure 9: Effect of changing dropout rates on MNIST. Left: keeping the number of hidden
units fixed. Right: keeping the expected number of hidden units (pn) fixed. The x-axis is the
probability of retaining a unit (p); the y-axis shows test and training classification error %.
...
This means that networks
that have small p will have a large number of hidden units
...
However, the test networks will be of different sizes
...
Figure 9b shows the test error obtained as a function of p
...
Compared to keeping the number of hidden units fixed, the magnitude of the error for small
values of p is reduced (for p = 0.1 it fell from 2.7% to 1.7%). Values of p that are close to
0.6 seem to perform best for this choice of pn, but our usual default value of 0.5 is close to
optimal.
7.4 Effect of Data Set Size
This
section explores the effect of changing the data set size when dropout is used with feedforward
networks. To see if dropout can help, we run classification experiments on MNIST
and vary the amount of data given to the network. The network was given
data sets of size 100, 500, 1K, 5K, 10K
and 50K chosen randomly from the MNIST
training set. Dropout with p = 0.5 was used at the hidden layers and p = 0.8
at the input layer. It can be seen that for extremely small data sets dropout does not give
any improvement. The model has enough parameters that it
can overfit on the training data, even with
all the noise coming from dropout. As the
size of the data set is increased, the gain
from doing dropout increases up to a point and then declines.

Figure 10: Effect of varying data set size (x-axis: data set size; y-axis: test classification
error % with and without dropout).
7.5 Monte-Carlo Model Averaging vs. Weight Scaling
The efficient test time procedure that we
propose is to do an approximate model com-
bination by scaling down the weights of the
trained neural network. An expensive but
more correct way of averaging the models
is to sample k neural nets using dropout for each test case and average their predictions.
As k → ∞, this Monte-Carlo model average
gets close to the true model average. It is interesting to see empirically how many sam-
ples k are needed to match the performance
of the approximate averaging method. We do this by classifying the MNIST test set,
computing the error for different values of k, so
we can see how quickly the error rate of the
finite-sample average approaches the error
rate of the true model average.

Figure 11: Monte-Carlo model averaging vs. weight scaling (x-axis: number of samples used
for Monte-Carlo averaging (k); y-axis: test classification error %).
Figure 11 shows the test error rate obtained for
different values of k
...
It can be seen that around k = 50, the Monte-Carlo
method becomes as good as the approximate method
...
This suggests that the weight scaling method is a fairly good approximation of the true
model average
...
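A minimal sketch of the two test-time procedures being compared is given below. A single softmax layer stands in for the full network, and the value k = 50 and the layer sizes are illustrative assumptions, not the paper's settings.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def predict_weight_scaling(X, W, b, p):
        """Approximate model average: one forward pass with weights scaled by p."""
        return softmax(X @ (p * W) + b)

    def predict_monte_carlo(X, W, b, p, k=50, rng=np.random.default_rng(0)):
        """Monte-Carlo model average: sample k thinned networks (fresh Bernoulli
        masks on the inputs) and average their predicted distributions."""
        acc = np.zeros((X.shape[0], W.shape[1]))
        for _ in range(k):
            mask = rng.binomial(1, p, size=X.shape)
            acc += softmax((mask * X) @ W + b)
        return acc / k

Around k = 50 samples the Monte-Carlo prediction typically agrees closely with the weight-scaled one, matching the behaviour reported above.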
Dropout Restricted Boltzmann Machines
Besides feed-forward neural networks, dropout can also be applied to Restricted Boltzmann
Machines (RBM)
...
8.1 Model Description
Consider an RBM with visible units v ∈ {0, 1}^D and hidden units h ∈ {0, 1}^F. It defines
the following probability distribution

    P(h, v; θ) = (1/Z(θ)) exp(v⊤W h + a⊤h + b⊤v),

where θ = {W, a, b} represents the model parameters and Z(θ) is the partition function.
Dropout RBMs are RBMs augmented with a vector of binary random variables r ∈
{0, 1}F
...
If rj takes the value 1, the hidden unit hj is retained, otherwise it is dropped from
the model
...
Z (θ, r) is the normalization constant
...
The distribution over h, conditioned on v and r, is factorial

    P(h|r, v) = ∏_{j=1}^{F} P(h_j | r_j, v),
    P(h_j = 1 | r_j, v) = 1(r_j = 1) σ(b_j + Σ_i W_{ij} v_i).
Figure 12: Features learned on MNIST by a standard RBM (a) and by a dropout RBM with
p = 0.5 (b). The features are ordered by L2 norm.
...
Conditioned on r, the distribution over {v, h} is same as the distribution that an RBM
would impose, except that the units for which rj = 0 are dropped from h
...
8.2 Learning Dropout RBMs
Learning algorithms developed for RBMs such as Contrastive Divergence (Hinton et al.,
2006) can be directly applied for learning Dropout RBMs
...
Similar to
dropout neural networks, a different r is sampled for each training case in every minibatch
...
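The sketch below shows one CD-1 update for a dropout RBM with a fresh mask r per training case. It is an illustrative implementation, not the authors' code; the learning rate, the sizes, and the choice to reuse the same mask in the negative phase are assumptions of this sketch.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_dropout_rbm(V, W, a, b, p=0.5, lr=0.01, rng=np.random.default_rng(0)):
        """One CD-1 step for a dropout RBM. A Bernoulli(p) mask R (one row per
        training case) multiplies the hidden probabilities, so dropped units are
        clamped to zero, as in P(h_j = 1 | r_j, v) = 1(r_j = 1) sigma(b_j + W_j.v)."""
        R = rng.binomial(1, p, size=(V.shape[0], W.shape[1]))
        ph = R * sigmoid(V @ W + b)                  # P(h = 1 | r, v); dropped units -> 0
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + a)                    # reconstruction of the visibles
        v1 = (rng.random(pv.shape) < pv).astype(float)
        ph1 = R * sigmoid(v1 @ W + b)                # same mask reused in negative phase
        W += lr * (V.T @ ph - v1.T @ ph1) / V.shape[0]
        a += lr * (V - v1).mean(axis=0)
        b += lr * (ph - ph1).mean(axis=0)
        return W, a, b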
8.3 Effect on Features
This section explores whether this effect transfers to Dropout RBMs as well
...
Figure 12b
shows features learned by a dropout RBM with the same number of hidden units
...
Figure 13: Effect of dropout (p = 0.5) on sparsity in RBMs. Left: The activation histogram for
a standard RBM shows that a large number of units have activations away from zero.
...
The features
learned by the dropout RBM appear qualitatively different in the sense that they seem to
capture features that are coarser compared to the sharply defined stroke-like features in the
standard RBM
...
8.4 Effect on Sparsity
Figure 13a shows the histograms of hidden unit activations and their means on
a test mini-batch after training an RBM
...
The histograms clearly indicate that the dropout RBMs learn much sparser representations
than standard RBMs even when no additional sparsity inducing regularizer is present
...
Marginalizing Dropout
Dropout can be seen as a way of adding noise to the states of hidden units in a neural
network
...
These models can be seen as deterministic versions of dropout
...
In this section, we briefly explore these
models
...
Marginalization in the context
of denoising autoencoders has been explored previously (Chen et al
...
The marginalization of dropout noise in the context of linear regression was discussed in Srivastava (2013)
...
van der Maaten et al
...
Wager et al
...
9.1 Linear Regression
Let X ∈ R^{N×D} be a data matrix of N data points and y ∈ R^N a vector of targets. Linear
regression tries to find a w ∈ R^D that minimizes

    ||y − Xw||^2.

When the input X is dropped out such that any input dimension is retained with probability
p, the input can be expressed as R ∗ X, where R ∈ {0, 1}^{N×D} is a random matrix with
R_{ij} ∼ Bernoulli(p) and ∗ denotes an element-wise product. Marginalizing the noise,
the objective function becomes

    minimize_w  E_{R∼Bernoulli(p)} [ ||y − (R ∗ X)w||^2 ].

This reduces to

    minimize_w  ||y − pXw||^2 + p(1 − p) ||Γw||^2,

where Γ = (diag(X⊤X))^{1/2}. Therefore, dropout with linear regression is equivalent, in
expectation, to ridge regression with a particular form for Γ. This form of Γ essentially scales
the weight cost for weight w_i by the standard deviation of the ith dimension of the data.
If a particular data dimension varies a lot, the regularizer tries to squeeze its weight more.

Another interesting way to look at this objective is to absorb the factor of p into w. This
leads to the following form

    minimize_{w̃}  ||y − X w̃||^2 + ((1 − p)/p) ||Γ w̃||^2,

where w̃ = pw. This makes the dependence of the regularization constant on p explicit.
For p close to 1, all the inputs are retained and the regularization constant is small
...
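The equivalence can be checked numerically. The sketch below averages the dropout objective over many sampled masks and compares it with the closed form above; the data sizes, p, and the random weight vector are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, p = 200, 5, 0.7
    X = rng.standard_normal((N, D))
    y = rng.standard_normal(N)
    w = rng.standard_normal(D)

    # Monte-Carlo estimate of E_R ||y - (R*X)w||^2 over Bernoulli(p) masks
    samples = [np.sum((y - ((rng.random(X.shape) < p) * X) @ w) ** 2)
               for _ in range(20000)]
    mc = np.mean(samples)

    # Closed form: ||y - pXw||^2 + p(1-p)||Gamma w||^2, Gamma = diag(X^T X)^(1/2)
    gamma = np.sqrt(np.diag(X.T @ X))
    closed = np.sum((y - p * X @ w) ** 2) + p * (1 - p) * np.sum((gamma * w) ** 2)

    print(mc, closed)   # the two values agree up to Monte-Carlo noise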
9.2 Logistic Regression and Deep Networks
However, Wang and Manning (2013) showed that in the context of dropout applied
to logistic regression, the corresponding marginalized model can be trained approximately
...
Their means and variances can be
computed efficiently
...
However, the assumptions involved in this technique become successively weaker as more
layers are added
...
Table 10: Comparison of classification error % with Bernoulli and Gaussian dropout. The
MNIST network has 2 layers with 1024 units each; the CIFAR-10 network has 3 convolutional
layers followed by 2 fully connected layers. For MNIST, the Bernoulli model uses p = 0.5 for
the hidden units and p = 0.8 for the input units. For CIFAR-10, we use
p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) going from the input layer to the
top. For the Gaussian model, σ² is set to (1 − p)/p using the p from the corresponding layer
in the Bernoulli model.
...
Multiplicative Gaussian Noise
Dropout involves multiplying hidden activations by Bernoulli distributed random variables
which take the value 1 with probability p and 0 otherwise
...
We
recently discovered that multiplying by a random variable drawn from N (1, 1) works just
as well, or perhaps better than using Bernoulli noise
...
That is, each hidden activation hi is perturbed to
hi + hi r where r ∼ N (0, 1), or equivalently hi r where r ∼ N (1, 1)
...
The expected value of the activations remains
unchanged, therefore no weight scaling is required at test time
...
Another way to achieve the same effect is to scale up the retained activations by multiplying
by 1/p at training time and not modifying the weights at test time
...
Therefore, dropout can be seen as multiplying hi by a Bernoulli random variable rb that
takes the value 1/p with probability p and 0 otherwise
...
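A short sketch of this equivalent formulation (often called "inverted" dropout; the tensor shapes and p are illustrative):

    import numpy as np

    def inverted_dropout(h, p, rng=np.random.default_rng(0)):
        """Multiply activations by a Bernoulli variable that is 1/p with
        probability p and 0 otherwise; the expectation is unchanged, so the
        weights are left untouched at test time."""
        mask = rng.binomial(1, p, size=h.shape)
        return h * mask / p

    h = np.ones((2, 4))
    h_train = inverted_dropout(h, p=0.5)   # entries are 0 or 2
    h_test = h                             # no scaling needed at test time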
For the Gaussian multiplicative noise, if we set σ² = (1 − p)/p, we end up multiplying
h_i by a random variable r_g, where E[r_g] = 1 and Var[r_g] = (1 − p)/p
...
However, given these first and second order moments, rg has the
highest entropy and rb has the lowest
...
For each layer, the value of σ in the Gaussian model was set to be √((1 − p)/p),
using the p from the corresponding layer in the Bernoulli model
...
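The sketch below draws the two kinds of multiplicative noise with matched first and second moments, as described above (σ² = (1 − p)/p); the sample size and p = 0.5 are illustrative only.

    import numpy as np

    def bernoulli_noise(shape, p, rng):
        """r_b: 1/p with probability p, 0 otherwise. E[r_b] = 1, Var[r_b] = (1-p)/p."""
        return rng.binomial(1, p, size=shape) / p

    def gaussian_noise(shape, p, rng):
        """r_g ~ N(1, (1-p)/p): same mean and variance as r_b, but higher entropy."""
        sigma = np.sqrt((1 - p) / p)
        return rng.normal(1.0, sigma, size=shape)

    rng = np.random.default_rng(0)
    p = 0.5
    rb = bernoulli_noise((100000,), p, rng)
    rg = gaussian_noise((100000,), p, rng)
    print(rb.mean(), rb.var())   # ~1.0, ~1.0  ((1-p)/p = 1 for p = 0.5)
    print(rg.mean(), rg.var())   # ~1.0, ~1.0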
Conclusion
Dropout is a technique for improving neural networks by reducing overfitting
...
Random dropout breaks up these co-adaptations by
making the presence of any particular hidden unit unreliable
...
This suggests that dropout is a general technique
and is not specific to any domain
...
Dropout considerably improved the
performance of standard neural nets on other data sets as well
...
The central idea of dropout is to take a large model that overfits easily and repeatedly
sample and train smaller sub-models from it
...
We developed Dropout RBMs and empirically showed that they have certain desirable properties
...
A dropout network
typically takes 2-3 times longer to train than a standard neural network of the same architecture
...
Each training case effectively tries to train a different random architecture
...
Therefore, it is not surprising that training takes a long time
...
This creates a trade-off between overfitting and training time
...
However, one way to obtain some of the benefits of dropout without stochasticity is to marginalize the noise to obtain a regularizer that does the same thing as the
dropout procedure, in expectation
...
For more complicated models, it is not obvious how to
obtain an equivalent regularizer
...
Acknowledgments
This research was supported by OGS, NSERC and an Early Researcher Award
...
Appendix A. A Practical Guide for Training Dropout Networks
Neural networks are infamous for requiring extensive hyperparameter tuning
...
In this section, we describe heuristics that might be useful for
applying dropout
...
A.1 Network Size
It is to be expected that dropping units will reduce the capacity of a neural network
...
Moreover, this set of pn units will be different each time and the units are not allowed to
build co-adaptations freely
...
If an n-sized layer is optimal for a standard neural net on any given task, a good dropout net
should have at least n/p units. We found this to
be a useful heuristic for setting the number of hidden units in both convolutional and fully
connected networks
...
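As a tiny worked example of this heuristic (the layer width n = 1024 is an arbitrary assumption):

    def dropout_layer_width(n_standard, p):
        """If an n-sized layer is good for a standard net, use at least n/p units
        when that layer is trained with dropout (retention probability p)."""
        return int(round(n_standard / p))

    print(dropout_layer_width(1024, 0.5))   # -> 2048
    print(dropout_layer_width(1024, 0.8))   # -> 1280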
A.2 Learning Rate and Momentum
Dropout introduces a significant amount of noise in the gradients compared to standard
stochastic gradient descent
...
In
order to make up for this, a dropout net should typically use 10-100 times the learning rate
that was optimal for a standard neural net
...
While momentum values of 0.9 are common for standard nets, with dropout we found that
values around 0.95 to 0.99 work quite a lot better.
Using high
learning rate and/or momentum significantly speed up learning
...
A.3 Max-norm Regularization
Though large momentum and learning rate speed up learning, they sometimes cause the
network weights to grow very large
...
This constrains the norm of the vector of incoming weights at each hidden unit to be bound
by a constant c
...
A.4 Dropout Rate
Dropout introduces an extra hyperparameter, the probability of retaining a unit p. This
hyperparameter controls the intensity of dropout
...
Typical values of p for hidden units are in the range 0.5 to 0.8.
...
For real-valued inputs (image
patches or speech frames), a typical value is 0.8.
...
For hidden layers, the choice of p is coupled
with the choice of number of hidden units n
...
Large p may not produce enough dropout to prevent
overfitting
...
Appendix B. Detailed Description of Experiments and Data Sets
...
The code for reproducing these results can be obtained from
http://www
...
toronto
...
The implementation is GPU-based
...
, 2012) to implement our networks
...
B.1 MNIST
The MNIST data set consists of 60,000 training and 10,000 test examples each representing
a 28×28 digit image
...
Hyperparameters were tuned on the validation set such that the best validation error was produced
after 1 million weight updates
...
This net was used to evaluate the performance on the test set
...
Therefore, once the hyperparameters were fixed, it made sense to combine the validation
and training sets and train for a very long time
...
Thus, there are six architectures in all
...
Dropout with p = 0.5 was used in all hidden layers and p = 0.8 in the input layer.
...
A final momentum of 0.95 was used.
...
To test the limits of dropout’s regularization power, we also experimented with 2 and 3
layer nets having 4096 and 8192 units
...
However, the three layer nets performed slightly worse than 2 layer ones with the same
level of dropout
...
B.2 Street View House Numbers
The training set consists of two parts—A standard labeled training set and another
set of labeled examples that are easy
...
Two-thirds of it were taken from the standard set (400 per class) and
one-third from the extra set (200 per class), a total of 6000 samples
...
(2012)
...
Other preprocessing techniques such as global or local contrast
normalization or ZCA whitening did not give any noticeable improvements
...
The convolutional layers have 96, 128 and 256 filters respectively
...
Each
max pooling layer pools 3 × 3 regions at strides of 2 pixels
...
All units use the
rectified linear activation function
...
Dropout was applied to all the layers of the network with the probability of retaining a unit
being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the different layers (going from the input to
convolutional layers to fully connected layers).
...
In addition, the max-norm constraint with c = 4 was used for all the weights
...
A final momentum of 0.95 was used in all the layers.
...
Since the training set was quite large, we did not combine the validation
set with the training set for final training
...
B.3 CIFAR-10 and CIFAR-100
The CIFAR-10 and CIFAR-100 data sets consist of 32 × 32 color images.
They have 10 and 100 image categories respectively
...
We used 5,000 of the training images for validation
...
The images were preprocessed by doing global contrast
normalization in each color channel followed by ZCA whitening
...
ZCA whitening means that we
mean center the data, rotate it onto its principal components, normalize each component
and then rotate it back
...
B.4 TIMIT
The open-source Kaldi toolkit (Povey et al., 2011) was used to preprocess the data into
log-filter banks.
...
Dropout neural networks were trained on windows of 21 consecutive frames
to predict the label of the central frame
...
The inputs were mean centered and normalized to have unit variance
...
Dropout nets used p = 0.8 in the input layers and p = 0.5 in the hidden layers.
...
Max-norm constraint with c = 4 was used in all the layers
...
Training used a final momentum of 0.95 with a high learning rate.
...
The learning rate was decayed as ε₀ (1 + t/T)⁻¹.
...
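For concreteness, this decay schedule can be written as below; the values ε₀ = 0.1 and T = 10000 are hypothetical, not the paper's settings.

    def decayed_learning_rate(t, eps0=0.1, T=10_000):
        """Hyperbolic decay eps_t = eps0 * (1 + t/T)^(-1), as used above."""
        return eps0 / (1.0 + t / T)

    print(decayed_learning_rate(0))        # 0.1
    print(decayed_learning_rate(10_000))   # 0.05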
The variance of each input unit for the
Gaussian RBM was fixed to 1
...
01)
...
B.5 Reuters
These classes are arranged in a tree hierarchy
...
The data was split into equal sized
training and test sets
...
However, the improvement was
not as significant as that for the image and speech data sets
...
B.6 Alternative Splicing
For each input, the target consists of 4
softmax units (one for tissue type)
...
For each softmax unit, the aim is to predict a distribution over
these 3 states that matches the observed distribution from wet lab experiments as closely
as possible
...
A two layer dropout network with 1024 units in each layer was trained on this data set
...
A value of p = 0.5 was used for the hidden layer and p = 0.8 for the input layer.
...
Max-norm
regularization with high decaying learning rates was used
...
(2011)
...
References

M. Chen, Z. Xu, K. Q. Weinberger, and F. Sha. Marginalized denoising autoencoders for
domain adaptation. In Proceedings of the 29th International Conference on Machine
Learning. ACM, 2012.

G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton. Phone recognition with the
mean-covariance restricted Boltzmann machine. In Advances in Neural Information
Processing Systems 23, 2010.

O. Dekel, O. Shamir, and L. Xiao. Learning to classify with missing and corrupted features.
Machine Learning, 2010.

A. Globerson and S. Roweis. Nightmare at test time: robust learning by feature deletion. In
Proceedings of the 23rd International Conference on Machine Learning, pages 353–360.
ACM, 2006.

I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks.
In Proceedings of the 30th International Conference on Machine Learning, pages 1319–1327.
ACM, 2013.

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural
networks. Science, 313(5786):504–507, 2006.

G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural
Computation, 18:1527–1554, 2006.

K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage
architecture for object recognition? In Proceedings of the IEEE International Conference
on Computer Vision (ICCV 2009). IEEE, 2009.

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report,
University of Toronto, 2009.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional
neural networks. In Advances in Neural Information Processing Systems 25, 2012.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.
Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation,
1(4):541–551, 1989.

Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, Z. Li, M.-H. Tsai, X. Zhou,
T. Huang, and T. Zhang. Large scale visual recognition challenge, 2010.

A. Livnat, C. Papadimitriou, N. Pippenger, and M. W. Feldman. Sex, mixability, and
modularity. Proceedings of the National Academy of Sciences, 107(4):1452–1457, 2010.

V. Mnih. CUDAMat: a CUDA-based matrix class for Python. Technical Report UTML
TR 2009-004, Department of Computer Science, University of Toronto, November 2009.

A. Mohamed, G. E. Dahl, and G. E. Hinton. Acoustic modeling using deep belief networks.
IEEE Transactions on Audio, Speech, and Language Processing, 2010.

R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural
images with unsupervised feature learning. In NIPS Workshop on Deep Learning and
Unsupervised Feature Learning 2011, 2011.

S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight-sharing. Neural
Computation, 4(4):473–493, 1992.

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann,
P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The Kaldi
speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and
Understanding. IEEE Signal Processing Society, 2011.

R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In Proceedings of the
International Conference on Artificial Intelligence and Statistics, 2009.

R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov
chain Monte Carlo. In Proceedings of the 25th International Conference on Machine
Learning. ACM, 2008.

J. Sánchez and F. Perronnin. High-dimensional signature compression for large-scale image
classification. In Proceedings of the 2011 IEEE Conference on Computer Vision and
Pattern Recognition, pages 1665–1672, 2011.

P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house
numbers digit classification. In International Conference on Pattern Recognition (ICPR
2012), 2012.

P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks
applied to visual document analysis. In Proceedings of the Seventh International Conference
on Document Analysis and Recognition, volume 2, pages 958–962, 2003.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine
learning algorithms. In Advances in Neural Information Processing Systems 25, pages
2960–2968, 2012.

N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th
Annual Conference on Learning Theory. Springer-Verlag, 2005.

N. Srivastava. Improving neural networks with dropout. Master's thesis, University of
Toronto, January 2013.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B (Methodological), 58(1):267–288, 1996.

A. N. Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 1943.

L. van der Maaten, M. Chen, S. Tyree, and K. Q. Weinberger. Learning with marginalized
corrupted features. In Proceedings of the 30th International Conference on Machine
Learning. ACM, 2013.

P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust
features with denoising autoencoders. In Proceedings of the 25th International Conference
on Machine Learning, pages 1096–1103. ACM, 2008.

P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising
criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.

S. Wager, S. Wang, and P. Liang. Dropout training as adaptive regularization. In Advances
in Neural Information Processing Systems 26, 2013.

S. Wang and C. Manning. Fast dropout training. In Proceedings of the 30th International
Conference on Machine Learning. ACM, 2013.

H. Y. Xiong, Y. Barash, and B. J. Frey. Bayesian prediction of tissue-regulated splicing using
RNA sequence and cellular context. Bioinformatics, 27(18):2554–2562, 2011.

M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural
networks. CoRR, abs/1301.3557, 2013.