Title: Reinforcement Learning
Description: Computational Science module for Neuroscience BSc at UCL. Lecture by Prof Neil Burgess. I got 69 in the module and a first-class degree in Neuroscience.

20th November

Reinforcement learning
Classical conditioning
Pavlov’s dog






Learning to expect things to happen, not learning to act
...

In the first phase, the stimulus always leads to reward: this is acquisition
...

Extinction learning: in the second phase, having learnt the association, the stimulus is presented without reward
...
This could be the animal learning a new response
...
There will be a weak response
...
In the brain there will be lots of different neurons representing reinforcement and lots of neurons representing the stimulus
...

If the stimulus is present, the connection weight should change
...
So the connection weight becomes its previous value plus a small positive constant multiplied by the size of the reward
...
So it becomes its previous value multiplied by a value that is less than one, so the connection weight decays
...
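Written out with symbols (ε for the small positive learning constant and λ for a decay factor less than one; these symbols are just shorthand for the rule described above, not notation from the lecture), the two cases are roughly:

w → w + ε × r             (stimulus present and reward delivered: the weight grows)
w → λ × w, with 0 < λ < 1 (stimulus present but no reward: the weight decays)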

Delta is the actual reward delivered minus the expectation of the reward, which is the net input strength, S × w
...
If stimulus not present, S=0
...


If r = wS, then δ = 0
...

If r > wS, then δ is positive, so the connection weight increases
...

So the connection weight estimates the reward an animal gets with a given stimulus
...
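As a rough sketch of this single-stimulus delta rule in code (the learning rate of 0.1, the 50-trial phases and the reward size of 1 are illustrative assumptions, not values from the lecture):

# Delta rule for one stimulus: delta = r - w*S, then w -> w + epsilon*S*delta
epsilon = 0.1        # small positive learning constant (assumed value)
w = 0.0              # connection weight from the stimulus to the reward prediction

# Acquisition: the stimulus (S = 1) is always followed by reward (r = 1)
for trial in range(50):
    S, r = 1, 1
    delta = r - w * S            # prediction error
    w += epsilon * S * delta
print("after acquisition, w is about", round(w, 2))   # approaches 1

# Extinction: the stimulus is presented without reward (r = 0)
for trial in range(50):
    S, r = 1, 0
    delta = r - w * S            # now negative, so the weight decays
    w += epsilon * S * delta
print("after extinction, w is about", round(w, 2))    # falls back towards 0

In both phases the weight moves towards the reward it should predict, which is the sense in which it estimates the reward for that stimulus.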

Multiple stimuli
What about when multiple stimuli are present? E.g.:
S1, S2 → r
How would animals respond to S1 or S2? How should the model be modified?
Now there are two inputs and two connection weights
...


First: a separate delta for each connection weight
δi = r – wi Si
Second: just one delta for everything
...


δi = δ = r -V; V = Σi wiSi
V is the expected reinforcement r given all stimuli
...
e
...
If the reward is present but S2 wasn't, then the connection weight from S2 to r won't increase, because of the delta rule
...
He used blocking and overshadowing
...

So using the second learning rule, with just one delta, w1 becomes strengthened during phase 1 of the experiment
...

So r = V, and δ = 0
...
This means there is a single delta for all of the connection weights
...
This implies that the learning of the association of the stimuli predicting the reward has been shared out between the stimuli, and they are not learning independently
...
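A sketch of how the single shared delta produces blocking; the two-phase design follows the description above, while the learning rate of 0.1 and the 100 trials per phase are assumptions for illustration:

# Shared delta for two stimuli: delta = r - V, with V = w1*S1 + w2*S2
epsilon = 0.1
w = [0.0, 0.0]                          # w1, w2

def update(S, r, w):
    V = w[0] * S[0] + w[1] * S[1]       # expected reinforcement given all stimuli
    delta = r - V                       # one delta shared by every connection weight
    for i in range(2):
        w[i] += epsilon * S[i] * delta
    return w

# Phase 1: S1 alone is paired with reward, so w1 grows towards 1
for trial in range(100):
    w = update([1, 0], 1, w)

# Phase 2: S1 and S2 are presented together with the same reward
for trial in range(100):
    w = update([1, 1], 1, w)

print([round(x, 2) for x in w])         # roughly [1.0, 0.0]: learning about S2 is blocked,
                                        # because r is already predicted and delta stays near 0

With the first rule (a separate delta for each connection weight), w2 would have grown in phase 2 just as w1 did in phase 1, so it is the shared delta that captures blocking.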


Second-order conditioning

The test shows that S2 → r
Stimulus two is indirectly associated with reward, so the Rescorla-Wagner rule doesn't work, as there is no reward in phase 2
...
The temporal credit assignment problem comes into play
...
...
V(t) is the sum, for all times τ ≥ t, of all the future rewards
...

So use the delta rule to ensure that this happens, i.e. modify connection weights to make V(t) closer to r(t) + V(t+1), i.e. use:

wi(t+1) = wi(t) + εSiδ(t);
δ(t) = [r(t) + V(t+1)] – V(t)
So δ is the difference between V(t) and the estimate of all future reward, r(t) + V(t+1)
...

Think back to the bug finding its way around: we've talked about when the bug reaches the reward, but for the rest of the time, when there is no reward, the above learning rule is key: the difference between the estimated value of the next time step, V(t+1), and the estimated value now, V(t)
...
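A minimal sketch of why this temporal-difference delta handles second-order conditioning (here the values are stored directly per stimulus rather than via explicit weights, and the value of 1 for S1 and the learning rate of 0.1 are assumptions for illustration):

# TD delta: delta(t) = r(t) + V(t+1) - V(t)
# Phase 1 is assumed to have already taught the animal that S1 predicts reward.
epsilon = 0.1
V_S1 = 1.0        # value of S1 after phase 1 (assumed)
V_S2 = 0.0        # value of S2 before phase 2

# Phase 2: S2 is followed by S1, with no reward delivered at all
for trial in range(20):
    r = 0.0
    delta = r + V_S1 - V_S2    # V(t+1) is the value of S1, standing in for the missing reward
    V_S2 += epsilon * delta

print(round(V_S2, 2))          # the value of S2 has grown despite no reward in phase 2

The Rescorla-Wagner delta (r – V) would stay at zero throughout phase 2, because r is always zero, which is why it cannot produce second-order conditioning.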

Time: expectation of reward
A model network needs a representation of time as an input
...
In experiments, the time is usually represented as each trial run for an animal
...


As time passes and things happen, the connection weights should change to estimate the value of the situation, in terms of how much reward we predict to get in the future
...


So in trial one, V(t) is zero, as there is no expectation of reward; before the reward arrives, δ is also zero, as r(t) is zero, V(t+1) is zero and V(t) is zero
...
So the input neuron that was active when the reward occurred would be active, and its connection weight to V(t) increases
...
δ(t) is positive just before the reward, because r(t) = 0, V(t+1) = 1 (from previous learning) and V(t) = 0
...
The actual reward is 1, and V(t) is 1 because we fully increased our estimate of the value of tR through the increase in connection weight, i.e. the reward is fully predicted
...


On trial 3, V(t) is positive for tR and tR-1
...


Many trials later, the expected future reward V(t) increases as soon as the CS occurs
...
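A small simulation of this trial-by-trial shift, assuming a 10-step trial with one input unit per time step from the CS onwards, the CS at step 2, the reward at step 7, and a learning rate of 0.2 (all of these numbers are illustrative, not from the lecture):

import numpy as np

T = 10                        # time steps per trial (assumed)
t_CS, t_R = 2, 7              # CS time and reward time within the trial (assumed)
epsilon = 0.2
w = np.zeros(T)               # one weight per time-stamped input unit

r = np.zeros(T)
r[t_R] = 1.0                  # reward of size 1 at time t_R

for trial in range(200):
    V = np.zeros(T + 1)       # V(T) = 0 beyond the end of the trial
    V[t_CS:T] = w[t_CS:T]     # value is only signalled from the CS onwards
    delta = np.zeros(T)
    for t in range(T):
        delta[t] = r[t] + V[t + 1] - V[t]
        if t >= t_CS:         # only the time-stamped units after the CS carry adjustable weights
            w[t] += epsilon * delta[t]

print(np.round(delta, 2))     # after many trials, delta is ~0 at the reward time and the remaining
                              # positive delta sits just before CS onset; printing delta on trial 1
                              # instead would show it at the reward time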


Dopamine’s role
Dopamine is a neuromodulatory molecule that is released from the axons of substantia nigra neurons
...
Involved in reward and addiction, and motor control
...
1997
If an animal is given an unexpected reward, with no CS, there is firing of dopaminergic neurons
...
This is similar to the delta signal described above
...
During the first trial, the reward was unexpected, so r(t) = 1, and delta is +1
...
When there is no reward, δ(t) = –1, because r(t) is zero, i.e. no reward, V(t+1) = 0, as there was no reward after the CS, but V(t) = 1, i.e. there was prediction of reward
...

Partial reinforcement task
P = 0.25: reward is presented 25% of the time with the CS
...

At p=1
...

At p = 0.75: when you hear the bell, you predict the reward with 75% probability, so when you do get the reward, it is a little bit more than you expected
...
At p = 0.25, only 25% of the time will the reward follow the CS, so there is only a little bit of a dopamine burst at presentation of the CS, but when the reward actually occurs, the dopamine firing is much greater, because it is unpredicted to a certain extent
...
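As a rough worked version of this, assuming the learned value of the CS settles at the reward probability p (which is what the simple delta model predicts; the exact numbers below are an illustration, not figures from the lecture):

At p = 0.75: V(CS) ≈ 0.75, so δ at the CS is large (≈ 0.75) and δ when the reward actually arrives is small (≈ 1 – 0.75 = 0.25).
At p = 0.25: V(CS) ≈ 0.25, so δ at the CS is small (≈ 0.25) and δ when the reward actually arrives is large (≈ 1 – 0.25 = 0.75).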
classical conditioning (rewards are just dependent on stimuli)

Consider simple bee foraging problem:




Choose between yellow and blue flowers
Each pays off probabilistically, with different amounts of nectar ry versus rb
Bees rapidly learn to choose richer colour

So how does an animal use reward to guide its actions?
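One minimal way to model this is a delta-rule estimate of the nectar for each colour plus a probabilistic (softmax) choice; the mean payoffs, learning rate and choice sharpness below are my assumptions, not details from the lecture:

import math, random

estimates = {"yellow": 0.0, "blue": 0.0}      # the bee's running nectar estimates
mean_nectar = {"yellow": 2.0, "blue": 0.5}    # assumed true mean payoffs r_y and r_b
epsilon, beta = 0.1, 2.0                      # learning rate and choice sharpness (assumed)

def choose():
    # softmax: choose probabilistically in favour of the richer estimate
    ey = math.exp(beta * estimates["yellow"])
    eb = math.exp(beta * estimates["blue"])
    return "yellow" if random.random() < ey / (ey + eb) else "blue"

for visit in range(200):
    colour = choose()
    r = random.gauss(mean_nectar[colour], 0.5)               # probabilistic nectar payoff
    estimates[colour] += epsilon * (r - estimates[colour])   # delta-rule update

print({c: round(v, 2) for c, v in estimates.items()})        # estimates head towards the true means
print(sum(choose() == "yellow" for _ in range(100)), "of 100 test choices are yellow")

Like the bees, the model rapidly comes to choose the richer colour on most visits.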
Modelling action choice
Critic: estimates the value of the current state
...

The temporal difference learning rule, signalled by dopamine, takes the actual reward and the estimate of reward and uses the difference between them as a learning signal to change the connection weights of the input neurons
...

“Actor-critic” architecture: use value function V to decide actions to maximise expected reward
...
At points B and C, it is easy to learn a turning direction associated with maximum reward, but what if you start at A? Turning left or right at A is not itself associated with any reward, but it does have an effect on future reward
...

So we need to estimate V(t) at different places; then we will know that turning left at A is associated with B, and B is associated with maximum reward
...


Input: could be place cells
...
Then if we know what the expected future reward is from different locations, we can then try to learn the other connection weights (on the right) in the actor, which tell us what to do
...
5 because half of the time the reward value is 5 and half of the time the reward value is 1
...
Once these connection weights have been learnt, the connection weight at A can be learnt by using V(t+1), which is 1
...


The length of the track comes into play: the animal might be less keen to walk further to get to point E, even though it has the biggest reward
...
Don’t worry about this though; this model doesn’t take this into account
...

The actor learns to act, using V(t) to calculate δ and δ to assess actions: δ(t) = r(t) + V(t+1) – V(t):
If left at A, δ(t) = 0 + 2...75 = 0...75 = – 0...
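Filling in the values usually used for this maze example (V(B) = 2.5, V(C) = 1, and hence V(A) = 1.75 under an initially random policy; these particular numbers are an assumption here, since the preview cuts the originals off), the calculation runs:

If left at A: δ(t) = 0 + V(B) – V(A) = 0 + 2.5 – 1.75 = 0.75, so the left turn at A is strengthened.
If right at A: δ(t) = 0 + V(C) – V(A) = 0 + 1 – 1.75 = –0.75, so the right turn at A is weakened.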

w’i → w’i + εSiδ(t);
Back to dopamine

Dual dopamine systems project the same signal to motivational (ventral [bottom] striatum) & motor (dorsal [top] striatum) areas, for state evaluation & policy improvement respectively? Policy is how you are going to act, the learning of the ‘actor’
...
Policy is another way of saying what the actor is going to do
...
Dorsal striatum is more to do with caudate and putamen, therefore projections to motor systems
...

So we expect lots of firing at reward during first trial, then this firing shifts to CS presentation and no firing at reward
...
So in fMRI studies, you can look for brain regions with activation patterns like this
...
2004)
...

The change in value V(t+1) – V(t) + any reward r(t) is used to evaluate any action o(t), so the output weights w’ can be modified as in Arp, using δ(t) = V(t+1) – V(t) + r(t) (if δ(t) > 0, action o(t) was good): w’ → w’ + εS(t)δ(t)
But how to learn V? A simple learning rule creates connection weights w so that V(t) = w
...
This is: w → w + εS(t)δ(t), i.e. you can also use δ to learn weights for the critic!
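Putting the two uses of δ together, here is a compact actor-critic sketch for a maze-like task; the state layout, the payoffs, the softmax policy and the learning rate are all illustrative assumptions rather than details from the lecture:

import math, random

# One delta signal trains both the critic's values V(state)
# and the actor's action preferences (its policy).
V = {"A": 0.0, "B": 0.0, "C": 0.0, "end": 0.0}           # critic
prefs = {("A", "left"): 0.0, ("A", "right"): 0.0}        # actor (only A needs a choice here)
epsilon = 0.1

def step(state, action):
    # assumed layout: left at A leads to B, right leads to C;
    # B then pays 5 or 0 (mean 2.5), C pays 2 or 0 (mean 1)
    if state == "A":
        return ("B", 0.0) if action == "left" else ("C", 0.0)
    if state == "B":
        return ("end", random.choice([5.0, 0.0]))
    return ("end", random.choice([2.0, 0.0]))

def policy_at_A():
    # softmax over the actor's preferences at A
    el = math.exp(prefs[("A", "left")])
    er = math.exp(prefs[("A", "right")])
    return "left" if random.random() < el / (el + er) else "right"

for trial in range(2000):
    state = "A"
    while state != "end":
        action = policy_at_A() if state == "A" else "go"  # B and C each have only one move here
        nxt, reward = step(state, action)
        delta = reward + V[nxt] - V[state]                # delta(t) = r(t) + V(t+1) - V(t)
        V[state] += epsilon * delta                       # critic update (tabular form of w -> w + εSδ)
        if state == "A":
            prefs[("A", action)] += epsilon * delta       # actor update (tabular form of w' -> w' + εSδ)
        state = nxt

print({s: round(v, 2) for s, v in V.items()})             # V(B) near 2.5; V(C) rougher, around 1
print({a: round(p, 2) for (_, a), p in prefs.items()})    # the preference for left at A wins out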

