Title: Probability
Description: These notes cover classical probability, the axioms of probability, discrete random variables, interesting problems and more. Probability is explained in simple, straightforward terms, making the mathematics of working out probabilities as accessible as possible.

Document Preview

Extracts from the notes are below; to see the PDF you'll receive, please use the links above.


Part IA — Probability
Based on lectures by R...

They are nowhere near accurate representations of what was actually lectured, and in particular, all errors are almost surely mine.
...
– Combinatorial analysis, permutations and combinations
– Axiomatic approach: axioms (countable case)
– Inclusion-exclusion formula
– Independence
– Relation between Poisson and binomial distributions
– Examples, including Simpson's paradox
– Functions of a random variable, indicator function, variance, standard deviation
– Generating functions: sums of independent random variables, random sum formula, moments
– Random walks: gambler's ruin, recurrence relations
– Mean time to absorption
– Combinatorial applications of generating functions
– Expectations; expectation of a function of a random variable
– Memoryless property of exponential distribution
– Simulation: generating continuous random variables, independent normal random variables
– Correlation coefficient, bivariate normal random variables
– Weak law of large numbers
– Moment generating functions and statement (no proof) of continuity theorem
– Examples, including sampling
...
Contents
0 Introduction
1 Classical probability
   1 Classical probability
   2 Counting
   3 Stirling's formula
2 Axioms of probability
   1 Axioms and definitions
   2 Inequalities and formulae
   3 Independence
   4 Important discrete distributions
3 Discrete random variables
   1 Discrete random variables
   2 Inequalities
   3 Weak law of large numbers
   4 Multiple random variables
   5 Probability generating functions
4 Interesting problems
5 Continuous random variables
   3 Jointly distributed random variables
   4 Geometric probability
   5 The normal distribution
   6 Transformation of random variables
   7 Moment generating functions
6 More distributions
   1 Cauchy distribution
   2 Gamma distribution
   3 Beta distribution*
   4 More on the normal distribution
7 Central limit theorem
8 Summary of distributions

0 Introduction

In everyday life we often encounter the term probability, and it is used in many different ways...

(ii) The probability that a selection of 6 numbers wins the National Lottery Lotto jackpot is 1 in 13983816, or roughly 7 × 10^-8.

(iii) The probability that a drawing pin will land 'point up' is 0...

(iv) The probability that a large earthquake will occur on the San Andreas Fault in the next 30 years is about 21%.

(v) The probability that humanity will be extinct by 2100 is about 50%.

The first two cases are things derived from logic... By definition, a fair coin is equally likely to land heads or tails...

The third is something probably derived from experiments... The fourth and fifth examples belong to yet another category that talks about beliefs and predictions...

In this course, we only consider classical probability... Afterwards, in Chapter 2, we will have a formal axiomatic definition of probability and formally study its properties...
1 Classical probability

Definition (Classical probability)...

A classical example is the problem of points... A and B play a game in which they keep throwing coins... Otherwise, B gets a point...

Now suppose A has got 8 points and B has got 7, but the game has to end because an earthquake struck... Someone must have won by the end of 19 rounds, i.e. after 4 more rounds... Otherwise, B wins... So A should get 11/16 of the prize...

Definition (Sample space)... We can list the outcomes as ω1, ω2, · · · ∈ Ω...

Definition (Event)...

Example... "Getting an odd number" and "getting 3" are two possible events...

Definition (Set notations)...
– "A or B" is the set A ∪ B...
– A and B are mutually exclusive or disjoint if A ∩ B = ∅...

Definition (Probability)... Let A ⊆ Ω be an event... Then the probability of A is

P(A) = (number of outcomes in A) / (number of outcomes in Ω) = |A| / N,

where N = |Ω|.

Here we are assuming that each outcome is equally likely to happen, which is the case in (fair) dice rolls and coin flips...

Example. Suppose r digits are drawn at random from a table of random digits from 0 to 9... Then |Ω| = 10^r... Let Ak = [largest digit drawn is at most k]. Then |Ak| = (k + 1)^r... So

P(Ak) = (k + 1)^r / 10^r.

Now let Bk = [largest digit drawn is k]... So |Bk| = |Ak| − |Ak−1| and

P(Bk) = ((k + 1)^r − k^r) / 10^r.

Counting

To find probabilities, we often need to count things...

Example... How many possible meal combinations are there? Clearly 6 × 7 × 6 = 252...

Suppose we have to make r multiple choices in sequence... Then the total number of choices is m1 × m2 × · · · × mr...

How many ways can 1, 2, · · · , n be ordered? The first choice has n possibilities, the second has n − 1 possibilities, etc., so there are n! orderings...

Sampling with or without replacement

Suppose we have to pick n items from a total of x items... Let X = {1, 2, · · · , x} be the items...

Definition (Sampling with replacement)... Then any sampling function f satisfies sampling with replacement...

When we sample without replacement, after choosing an item, we kill it with fire and cannot choose it again...

We can also have sampling with replacement, but we require each item to be chosen at least once...

Example... How many injective functions are there N → X? When we choose f(a), we have 4 options... When we choose f(c), we have 2 choices left... So in total there are 4 × 3 × 2 = 24 injective functions.

Example... We select one at random once and try to unlock... We have to fail the first r − 1 trials and succeed in the rth... So the probability of this is

(n − 1)^(r−1) / n^r.

Now suppose we are smarter and try without replacement... The probability of succeeding at the rth trial is then

((n − 1)(n − 2) · · · (n − r + 1)) / (n(n − 1) · · · (n − r + 1)) = 1/n.
Example (Birthday problem)... Suppose there are r people in a room, and let f(r) be the probability that at least two of them share a birthday...

We solve this by finding the probability of no match, 1 − f(r)... For nobody to have the same birthday, the first person can have any birthday... So

P(no match) = (365 · 364 · 363 · · · (366 − r)) / 365^r.

This gives f(22) = 0.475695 and f(23) = 0.5073...

While this might sound odd since 23 is small, this is because we are thinking about the wrong thing... With 23 people, we have 23 × 22/2 = 253 pairs, which is quite large compared to 365...
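To make the pairs intuition concrete, here is a small Python sketch (my own illustration, not part of the notes) that computes f(r) exactly from the no-match product and checks it against a quick Monte Carlo simulation; the function names are my own.

import random

def f_exact(r):
    """Probability that at least two of r people share a birthday (365 equally likely days)."""
    p_no_match = 1.0
    for i in range(r):
        p_no_match *= (365 - i) / 365
    return 1 - p_no_match

def f_simulated(r, trials=100_000):
    """Monte Carlo estimate of the same probability."""
    hits = 0
    for _ in range(trials):
        birthdays = [random.randrange(365) for _ in range(r)]
        if len(set(birthdays)) < r:
            hits += 1
    return hits / trials

print(f_exact(22), f_exact(23))   # ≈ 0.4757 and ≈ 0.5073
print(f_simulated(23))            # close to 0.507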
For example, if we
pick two representatives from a class, the order of picking them doesn’t matter
...
i
...
we might get (2, 5, 4) and (4, 2, 5)
...


6

1

Classical probability

IA Probability

– Re-number each item by the number of the draw on which it was first seen
...
This
happens if the labelling of items doesn’t matter
...
So we can rename (2, 5, 2) and (8, 5, 5) both as (1, 1, 2)
...
The most important ones are:
– Replacement + with ordering: the number of ways is xn
...

– Without replacement + without order:
we only care which items get

selected
...

– Replacement + without ordering: we only care how many times the item
got chosen
...

Say n = 6 and k = 3
...
So the number
of ways is n+k−1
k−1
...
The
 number of ways of picking such that k apples are chosen is, by
definition, nk
...
The number of ways of doing so
such that item
i is picked ni times is defined to be the multinomial coefficient

n

...
A multinomial coefficient is

  
 

n
n
n − n1
n − n1 · · · − nk−1
n!
=
···
=

...

Example
...

1

If we have a trinomial, then
n

(x + y + z) =

X




n
xn1 y n2 z n3
...
How many ways can we deal 52 cards to 4 player, each with a hand
of 13? The total number of ways is


52
52!
=
= 53644737765488792839237440000 = 5
...

13, 13, 13, 13
(13!)4
While computers are still capable of calculating that, what if we tried more
power cards? Suppose each person has n cards
...
We can use Stirling's formula to approximate it:

log n! ∼ n log n.

Proof... Note that log n! = Σ_{k=1}^{n} log k. Now we claim that

∫_1^n log x dx ≤ Σ_{k=1}^{n} log k ≤ ∫_1^{n+1} log x dx.

Divided by n log n, both sides tend to 1...
Now we prove Stirling's formula:

Theorem (Stirling's formula)...

n! ∼ √(2π) n^(n+1/2) e^(−n).

Proof
...


Write t = 1/(2n + 1)
...


We can simplifying by noting that
1
log(1 + t) − t = − t2 +
2
1
log(1 − t) + t = − t2 −
2

1 3
t −
3
1 3
t −
3

1 4
t + ···
4
1 4
t − ···
4

Then if we subtract the equations and divide by 2t, we obtain
dn − dn+1 =
<
=
=
=

1 2 1 4 1 6
t + t + t + ···
3
5
7
1 2 1 4 1 6
t + t + t + ···
3
3
3
1 t2
3 1 − t2
1
1
3 (2n + 1)2 − 1


1
1 1

12 n n + 1

By summing these bounds, we know that
d1 − dn <

1
12


1−

1
n



Then we know that dn is bounded below by d1 + something, and is decreasing
since dn − dn+1 is positive
...
We know A is a lower
bound for dn since (dn ) is decreasing
...
Then dn −dm < n1 − m
12
...
Hence we know that
A < dn < A +
9

1

...
To find A, we
have a small detour to prove a formula:
R π/2
Take In = 0 sinn θ dθ
...
We also know that
Z π/2
In =
sinn θ dθ
0


π/2
= − cos θ sinn−1 θ 0 +

Z

π/2

(n − 1) cos2 θ sinn−2 θ dθ

0

π/2

Z

(n − 1)(1 − sin2 θ) sinn−2 θ dθ

=0+
0

= (n − 1)(In−2 − In )
So

n−1
In−2
...
Then
In =

1
2
2
=
3

I2n =
I2n+1

3
2n − 1
(2n)! π
···
π/2 = n 2
4
2n
(2 n!) 2
4
2n
(2n n!)2
· ···
=
5
2n + 1
(2n + 1)!
·

So using the fact that In is decreasing, we know that
1≤

I2n
I2n+1

I2n−1
1
→ 1
...

= π(2n + 1) 4n+1
I2n+1
2
(n!)4
ne
e

Since the last expression is equal to 1, we know that A = log 2π
...
We use the 1/(12n) term from the proof above to get a better approximation:

√(2π) n^(n+1/2) e^(−n + 1/(12n+1)) ≤ n! ≤ √(2π) n^(n+1/2) e^(−n + 1/(12n)).
Example. Suppose we toss a coin 2n times...

Example. Suppose we draw 26 cards from 52... Using Stirling's formula, the probability works out to be approximately 0.2181...
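As a quick numerical sanity check of Stirling's formula, the following Python sketch (my own, not part of the notes) compares n! with the approximation and shows the ratio tending to 1.

import math

def stirling(n):
    """Stirling's approximation sqrt(2*pi) * n**(n + 1/2) * e**(-n)."""
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

for n in [1, 5, 10, 20, 50]:
    print(n, math.factorial(n) / stirling(n))   # ratio tends to 1 (about 1.0017 at n = 50)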
2 Axioms of probability

1 Axioms and definitions

So far, we have semi-formally defined some probabilistic notions
...
We were only allowed to have a finite
number of possible outcomes, and all outcomes occur with the same probability
...
For example,
we cannot use this to model a coin that gives heads with probability π −1
...
A probability space is a triple (Ω, F, P)
...

F has to satisfy the following axioms:
(i) ∅, Ω ∈ F
...

S∞
(iii) A1 , A2 , · · · ∈ F ⇒ i=1 Ai ∈ F
...
e
...

P
i

i

Items in Ω are known as the outcomes, items in F are known as the events, and
P(A) is the probability of the event A
...
e
...
However, if Ω is, say, R, we have to be a bit more careful
and only include nice subsets, or else we cannot have a well-defined P
...
Instead, in discrete cases,
we just specify the probabilities of each outcome, and use the third axiom to
obtain the full P
...
Let Ω = {ω1 , ω2 , · · · }
...
Let p(ωi ) = pi
...

ωi ∈A

This P(A) satisfies the above axioms, and p1 , p2 , · · · is the probability distribution
Using the axioms, we can quickly prove a few rather obvious results
...

11

2

Axioms of probability

IA Probability

(i) P(∅) = 0
(ii) P(AC ) = 1 − P(A)
(iii) A ⊆ B ⇒ P(A) ≤ P(B)
(iv) P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
...

(i) Ω and ∅ are disjoint
...
So P(∅) = 0
...

(iii) Write B = A ∪ (B ∩ AC )
...

(iv) P(A ∪ B) = P(A) + P(B ∩ AC )
...
Then the result follows
...
So we say that P is a
subadditive function
...
We say P is submodular
...

Definition (Limit of events)
...
Then we define the limit as
lim An =


[

n→∞

An
...
e
...


1

Theorem
...

n→∞

n→∞

Proof
...
In general,
B n = An \

n−1
[

Ai
...


2

Axioms of probability

IA Probability

Then
P(lim An ) = P
=P


[

!
Ai

1

[

!
Bi

1

=


X

P(Bi ) (Axiom III)

1

= lim

n→∞

n
X

P(Bi )

i=1
n
[

= lim P
n→∞

!
Ai

1

= lim P(An )
...


2
...
For any A1 , A2 , · · · ,
!


[
X
P
Ai ≤
P(Ai )
...

Proof
...

So we need a (not so) clever trick to make them disjoint
...


k=1

So we know that
[

Bi =

[

Ai
...
So our Axiom (iii) gives
!
!
[
[
X
X
P
Ai = P
Bi =
P (Bi ) ≤
P (Ai )
...

Example
...
Let
Ak = [kth toss head] and P(Ak ) = pk
...
What is the
probability that there are infinitely many heads?
13

2

Axioms of probability

IA Probability

S∞
The event “there is at least one more head after the ith coin toss” is k=i Ak
...
e
...

So the probability required is
!
!
∞ [



\
[
X
P
Ak = lim P
Ak ≤ lim
pk = 0
i→∞

i=1 k=i

k=i

i→∞

k=i

Therefore P(infinite number of heads) = 0
...
Is it possible to colour a complete n-graph (i
...
a graph
of n vertices with edges between every pair of vertices) red and black such that
there is no k-vertex complete subgraph with monochrome edges?
Erd¨
os said this is possible if
 
n 1−(k2)
2
< 1
...

Then the probability that at least one subgraph has monochrome edges is
 
[  X
k
n
P
Ai ≤
P(Ai ) =
2 · 2−(2)
...

If this probability is less than 1, then there must be a way to colour them in
which it is impossible to find a monochrome subgraph, or else the probability is

k
1
...

Theorem (Inclusion-exclusion formula)
...

Proof
...
n = 2 is proven above
...


i=2

Then we can apply the induction hypothesis for n − 1, and expand the mess
...

Example... Pick a permutation π of {1, 2, · · · , n} at random. If i ≠ π(i) for all i, we say we have a derangement... Let Ai be the event that π(i) = i.
Then

P(A1 ∪ · · · ∪ An) = Σ_k P(Ak) − Σ_{k1<k2} P(Ak1 ∩ Ak2) + · · ·
                  = n · (1/n) − (n choose 2) · (1/n)(1/(n − 1)) + (n choose 3) · (1/n)(1/(n − 1))(1/(n − 2)) − · · ·
                  = 1 − 1/2! + 1/3! − · · · + (−1)^(n−1) (1/n!)
                  → 1 − e^(−1).

So the probability of a derangement is 1 − P(∪ Ak) ≈ e^(−1) ≈ 0.37...
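A quick way to see the 1/e limit numerically is to sample random permutations and count derangements. The Python sketch below is my own illustration, not part of the notes.

import random

def is_derangement(perm):
    """True if no position is mapped to itself."""
    return all(i != x for i, x in enumerate(perm))

def derangement_probability(n, trials=200_000):
    items = list(range(n))
    count = 0
    for _ in range(trials):
        perm = items[:]
        random.shuffle(perm)
        count += is_derangement(perm)
    return count / trials

print(derangement_probability(10))   # close to 1/e ≈ 0.3679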

Recall that, from inclusion exclusion,
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(BC) − P(AC) + P(ABC),
where P(AB) is a shorthand for P(A ∩ B)
...

In general
Theorem (Bonferroni’s inequalities)
...


i1
If r is even, then
!
n
X
[
X
P
Ai ≥
P(Ai1 ) −
P(Ai1 Ai2 ) +
1

i1



i1
X

X

P(Ai1 Ai2 Ai3 ) + · · ·

i1
P(Ai1 Ai2 Ai3 · · · Air )
...
Easy induction on n
...
Let Ω = {1, 2, · · · , m} and 1 ≤ j, k ≤ m
...

Then
Ak ∩ Aj = {1, 2, · · · , min(j, k)} = Amin(j,k)
and
Ak ∪ Aj = {1, 2, · · · , max(j, k)} = Amax(j,k)
...

Now let 1 ≤ x1 , · · · , xn ≤ m be some numbers
...

i
So
max{x1 , x2 , · · · , xn } ≥

X

xi −

X
i1
15

min{x1 , x2 }
...
3 Independence

Definition (Independent events)
...

Otherwise, they are said to be dependent
...
For example,
if you roll two dice separately, the outcomes will be independent
...
If A and B are independent, then A and B C are independent
...

P(A ∩ B C ) = P(A) − P(A ∩ B)
= P(A) − P(A)P(B)
= P(A)(1 − P(B))
= P(A)P(B C )

This definition applies to two events
...
Roll two fair dice
...
Let A3 = [sum is odd]
...
However, the collection of all three are not independent,
since if A1 and A2 are true, then A3 cannot possibly be true
...
We define:
Definition (Independence of multiple events)
...


16

2

Axioms of probability

IA Probability

Example
...
We roll 4 dice
...

6 6
36

But

1
6= P(A12 )P(A13 )P(A23 )
...

P(A12 ∩ A13 ∩ A23 ) =

We can also apply this concept to experiments
...
Further suppose that these two
experiments are independent, i
...

P((αi , βj )) = pi qj
for all i, j
...

Now suppose A ⊆ Ω1 and B ⊆ Ω2 are results (i
...
events) of the two
experiments
...
Then the probability
X
X
X
P(A ∩ B) =
pi qi =
pi
qi = P(A)P(B)
...
We can generalize this to n
independent experiments, or even countably many infinite experiments
...
4 Important discrete distributions

We're now going to quickly go through a few important discrete probability distributions... The sample space is Ω = {ω1, ω2, · · · } and pi = P({ωi})...

Suppose we toss a coin... The Bernoulli distribution, denoted B(1, p), has P(H) = p and P(T) = 1 − p...

Suppose we toss a coin n times, each with probability p of getting heads... So

P(two heads) = (n choose 2) p^2 (1 − p)^(n−2).

We call this the binomial distribution and write it as B(n, p)...

Suppose we toss a coin with probability p of getting heads... We say it is memoryless because how many tails we've got in the past does not give us any information about how long I'll have to wait until I get a head...

Suppose we have an urn with n1 red balls and n2 black balls... The probability that there are k red balls among the n drawn is

P(k red) = (n1 choose k)(n2 choose n − k) / ((n1 + n2) choose n).

The Poisson distribution, denoted P(λ), is

pk = (λ^k / k!) e^(−λ)

for k ∈ N...

Suppose that an event happens at a rate of λ... As we take the limit n → ∞, we obtain the Poisson distribution... More precisely, suppose n → ∞ and p → 0 such that np = λ... Then

P(X = k) → (λ^k / k!) e^(−λ).

Proof...
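To see how good the approximation is for moderate n, here is a small Python check (my own sketch, not from the notes) comparing binomial and Poisson probabilities with np = λ.

from math import comb, exp, factorial

def binomial_pmf(n, p, k):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    return lam ** k * exp(-lam) / factorial(k)

n, lam = 1000, 2.0
p = lam / n
for k in range(6):
    print(k, binomial_pmf(n, p, k), poisson_pmf(lam, k))
# the two columns agree to about three decimal places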


2
...
Suppose B is an event with P(B) > 0
...

P(B)

We interpret as the probability of A happening given that B has happened
...

P(B)
P(B)

Example
...
Then
P(A1 ) = 1
...

and
P(A2 | A1 ) = 1
...

It is significantly bigger, albeit still incredibly tiny
...

If P(A | B) > P(A), then we say that B attracts A
...
We can also say A repels B if A attracts
BC
...

(i) P(A ∩ B) = P(A | B)P(B)
...

(iii) P(A | B ∩ C) =

P(A∩B|C)
P(B|C)
...

Proof
...
So we only prove (iv)
...

(i) Let A ⊆ B
...


= 1
...
Then
!
S
[
P( i Ai ∩ B)
P
Ai B =

P(B)
i
S
P ( i Ai )
=
P(B)
X P(Ai )
=
P(B)
X P(Ai ∩ B)
=
P(B)
X
=
P(Ai | B)
...
ASpartition of the sample space is a collection of disjoint
events {Bi }∞
i=0 such that
i Bi = Ω
...

The following result should be clear:
Proposition
...


i=1

Example
...
The gambler gets +1 for head, and
−1 for tail
...
Let
px = P(goes broke | starts with $x),
and B1 be the event that he gets head on the first toss
...
Then solving the recurrence
relation, we have
x
px = 1 −
...
Suppose Bi is a partition of the sample space, and
A and Bi all have non-zero probability
...

j P(A | Bj )P(Bj )
Note that the denominator is simply P(A) written in a fancy way
...
Suppose we have a screening test that tests whether a patient has a particular disease... Suppose that the test is not absolutely accurate, with a false-positive rate of

P(+ | D^C) = 0.01,

and suppose the disease is rare, with P(D) = 0.001...

So what is the probability that a person has the disease given that he received a positive result? By Bayes' theorem,

P(D | +) = P(+ | D)P(D) / (P(+ | D)P(D) + P(+ | D^C)P(D^C)),

and with these numbers this is less than 0.1, no matter how close P(+ | D) is to 1. Even if you get a positive result, since the disease is so rare, it is more likely that you don't have the disease and get a false positive...
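The following short Python sketch (my own illustration; the sensitivity value 0.98 is an assumed figure, not taken from the notes) computes the posterior probability directly from Bayes' theorem.

def posterior_given_positive(sensitivity, false_positive_rate, prevalence):
    """P(D | +) for a binary screening test, via Bayes' theorem."""
    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    return sensitivity * prevalence / p_positive

# assumed sensitivity 0.98; false-positive rate and prevalence as in the example
print(posterior_given_positive(0.98, 0.01, 0.001))   # ≈ 0.089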
Consider the two following cases:
(i) I have 2 children, one of whom is a boy
...

What is the probability that both of them are boys?
(i) P(BB | BB ∪ BG) =

1/4
1/4+2/4

= 31
...
Then
P(B ∗ B ∗ ∪ B ∗ B | BB ∗ ∪ B ∗ B ∗ ∪ B ∗ G) =
=

1
1
6
· 14
+ 2 · 14
· 14
1
6
1
+ 2 · 14
· 14
+ 2 · 14
·

1
14
1
14

·

1
14

1
2

13

...
So if we have the information that a boy
is born on a Tuesday, it is now less likely that there is just one boy
...


3 Discrete random variables

With what we’ve got so far, we are able to answer questions like “what is the
probability of getting a heads?” or “what is the probability of getting 10 heads
in a row?”
...
What does it even mean to take the average of a “heads” and
a “tail”?
To make some sense of this notion, we have to assign, to each outcome, a
number
...

Then on average, we can expect to get 0
...
This is the idea of a random variable
...
1

Discrete random variables

Definition (Random variable)
...
ΩX is usually a set of numbers, e
...
R or N
...
g
...

Definition (Discrete random variables)
...

Notation
...

i
...
the probability that the outcome is in T
...
If Ω is itself countable, then we can write this as
X
P(X ∈ T ) =

...
Let X be the value shown by rolling a fair die
...
We know that
P(X = i) =

1

...

Definition (Discrete uniform distribution)
...

Example
...

Then the sum can be represented by X + Y , with
ΩX+Y = {2, 3, · · · , 12}
...


22

3

Discrete random variables

IA Probability

Notation
...

We can also write X ∼ B(n, p) to mean
 
n r
P(X = r) =
p (1 − p)n−r ,
r
and similarly for the other distributions we have come up with before
...
The expectation (or mean) of a real-valued X is
equal to
X
E[X] =
pω X(ω)
...
Otherwise, we say the expectation doesn’t
exist
...


x∈ΩX

We are sometimes lazy and just write EX
...
Note that this definition
only holds in the case where the sample space Ω is countable
...
g
...

Example
...
Then
E[X] = 2 ·

1
2
1
+3·
+ · · · + 12 ·
= 7
...

However, it is possible for the expected value to be infinite:
Example (St
...
Suppose we play a game in which we keep
tossing a coin until you get a tail
...
The expected value is
E[X] =

1
1
1
· 2 + · 4 + · 8 + · · · = ∞
...

There are many ways to resolve this paradox, such as taking into account the
fact that the host of the game has only finitely many money and thus your real
expected gain is much smaller
...
We calculate the expected values of different distributions:

23

3

Discrete random variables

IA Probability

(i) Poisson P (λ)
...
Then
PX (r) =

λr e−λ

...

(ii) Let X ∼ B(n, p)
...

Given a random variable X, we can create new random variables such as
X + 3 or X 2
...

Then f (X) is a new random variable that maps ω 7→ f (X(ω))
...
if a, b, c are constants, then a+bX and (X −c)2 are random variables,
defined as
(a + bX)(ω) = a + bX(ω)
(X − c)2 (ω) = (X(ω) − c)2
...

(i) If X ≥ 0, then E[X] ≥ 0
...

(iii) If a and b are constants, then E[a + bX] = a + bE[X]
...
This is
true even if X and Y are not independent
...

Proof
...
Then
X
E[X] =
pω X(ω) ≥ 0
...
So
X(ω) = 0 for all ω
...


ω

(iv)
E[X+Y ] =

X

pω [X(ω)+Y (ω)] =

X

ω

X
pω X(ω)+
pω Y (ω) = E[X]+E[Y ]
...

This is clearly minimized when c = E[X]
...

An easy generalization of (iv) above is
Theorem
...

i=1

Proof
...


ω

Definition (Variance and standard deviation)
...

The standard deviation is the square root of the variance,
25

p
var(X)
...
If we have a
low variance, then the value of X is very likely to be close to E[X]
...

(i) var X ≥ 0
...

(ii) var(a + bX) = b2 var(X)
...

(iii) var(X) = E[X 2 ] − E[X]2 , also proven by expanding the definition
...
Let X ∼ B(n, p) be a binomial distribution
...
We also have
n
X

n!
pr (1 − p)n−r
r!(n

r)!
r=0

n 
X
n − 2 r−2
= n(n − 1)p2
p (1 − p)(n−2)−(r−2)
r

2
r=2

E[X(X − 1)] =

r(r − 1)

= n(n − 1)p2
...
So
var(X) = E[X 2 ] − (E[X])2 = np(1 − p) = npq
...
If X ∼ P (λ), then E[X] = λ, and var(X) = λ,
since P (λ) is B(n, p) with n → ∞, p → 0, np → λ
...
Suppose P(X = r) = q r p for r = 0, 1, 2, · · ·
...

p
= pq

26

3

Discrete random variables

IA Probability

Then
E[X(X − 1)] =


X

r(r − 1)pq r

0

= pq 2


X

r(r − 1)q r−2

0

d2 1
= pq 2 2
dq 1 − q
2pq 2
=
(1 − q)3
So the variance is
var(X) =

2pq 2
q
q
q2
+
= 2
...
The indicator function or indicator variable
I[A] (or IA ) of an event A ⊆ Ω is
(
1 ω∈A
I[A](ω) =
0 ω 6∈ A
This indicator random variable is not interesting by itself
...

It has the following properties:
Proposition
...


– I[AC ] = 1 − I[A]
...

– I[A ∪ B] = I[A] + I[B] − I[A]I[B]
...

These are easy to prove from definition
...

Example
...
Let N be the number of couples sitting
next to each other
...
Then
N=

n
X

I[Ai ]
...

n
1

27

3

Discrete random variables

IA Probability

We also have
2

E[N ] = E

X

2 
I[Ai ]



X
X
= E
I[Ai ]2 + 2
I[Ai ]I[Aj ]
i

i




= nE I[Ai ] + n(n − 1)E I[A1 ]I[A2 ]
2
n

We have E[I[A1 ]I[A2 ]] = P(A1 ∩ A2 ) =



1
1
n−1 n−1

+

n−2 2
n−1 n−1




...


we ultimately obtain var(N ) =
In fact, as n → ∞, N ∼ P (2)
...

!
n
n
X
X
[
P(Ai ) −
Ai =
P(Ai1 ∩ Aj2 ) +
P
1

i

i1
X

P(Ai1 ∩ Ai2 ∩ Ai3 ) − · · ·

i1
+ (−1)n−1 P(A1 ∩ · · · ∩ An )
...
Let Ij be the indicator function for Aj
...


i1 <···
Then
1−

n
Y

(1 − Ij ) = S1 − S2 + S3 · · · + (−1)n−1 Sn
...

1

We can extend the idea of independence to random variables
...

Definition (Independent random variables)
...
They are independent iff for any x1 , x2 , · · · , xn ,
P(X1 = x1 , · · · , Xn = xn ) = P(X1 = x1 ) · · · P(Xn = xn )
...
If X1 , · · · , Xn are independent random variables, and f1 , · · · , fn are
functions R → R, then f1 (X1 ), · · · , fn (Xn ) are independent random variables
...
Note that given a particular yi , there can be many different xi for which
fi (xi ) = yi
...
Then
X
P(f1 (X1 ) = y1 , · · · fn (Xn ) = yn ) =
P(X1 = x1 , · · · , Xn = xn )
x1 :f1 (x1 )=y1
·
·
xn :fn (xn )=yn

X

=

n
Y

P(Xi = xi )

x1 :f1 (x1 )=y1 i=1
·
·
xn :fn (xn )=yn

=

n
Y

X

P(Xi = xi )

i=1 xi :fi (xi )=yi

=

n
Y

P(fi (xi ) = yi )
...

Theorem
...

Proof
...
Then
"n
#
Y
X
X
E
Xi =
···
x1 x2 · · · xn × P(X1 = x1 , · · · , Xn = xn )
1

=

=

x1 ∈R1
xn ∈Rn
n
Y X

xi P(Xi = xi )

i=1 xi ∈Ri
n
Y

E[Xi ]
...
Let X1 , · · · Xn be independent random variables, and f1 , f2 , · · · fn
are functions R → R
...

Theorem
...


29

3

Discrete random variables

IA Probability

Proof
...


Corollary
...
Then
 X 
1
1
Xi = var(X1 )
...


var

1X
Xi
n



X 
1
var
Xi
n2
1 X
= 2
var(Xi )
n
1
= 2 n var(X1 )
n
1
= var(X1 )
n

=

This result is important in statistics
...

Example
...
e
...
Then
Y = X1 + X2 + · · · + Xn ∼ B(n, p)
...

Example
...
We can measure
the lengths, but is not accurate
...
Suppose
E[A] = a,

var(A) = σ 2

E[B] = b,

var(B) = σ 2
...

Then we estimate a and b by
a
ˆ=

X +Y ˆ X −Y
, b=

...
e
...
Also
var(ˆ
a) =

1
1
1
var(X + Y ) = 2σ 2 = σ 2 ,
4
4
2

and similarly for b
...


3
...
In particular, Chebyshev’s inequality will allow us to prove the
weak law of large numbers
...
A function f : (a, b) → R is convex if for all
x1 , x2 ∈ (a, b) and λ1 , λ2 ≥ 0 such that λ1 + λ2 = 1,
λ1 f (x1 ) + λ2 f (x2 ) ≥ f (λ1 x1 + λ2 x2 )
...


λ1 f (x1 ) + λ2 f (x2 )

x1

λ1 x 1 + λ2 x 2

x2

A function is concave if −f is convex
...
If f is differentiable and f 00 (x) ≥ 0 for all x ∈ (a, b), then it is
convex
...

Theorem (Jensen’s inequality)
...

This says that E[f (X)] ≥ f (E[X]) (where P(X = xi ) = pi )
...
e
...


31

3

Discrete random variables

IA Probability

Proof
...
It is true for n = 2 by the definition of convexity
...

p2 + · · · + pn


p2
pn
≤ p1 f (x1 ) + (p2 + · · · + pn )
f (x2 ) + · · · +
f (xn )
()
()
= p1 f (x1 ) + · · · + pn (xn )
...

Strictly convex case is proved with ≤ replaced by < by definition of strict
convexity
...
Given x1 , · · · , xn positive reals, then
Y

xi

1/n



1X
xi
...
Take f (x) = − log x
...

Take P(x = xi ) = 1/n
...
Since − log x is strictly convex, AM = GM
only if all xi are equal
...
For any two random variables X, Y ,
(E[XY ])2 ≤ E[X 2 ]E[Y 2 ]
...
If Y = 0, then both sides are 0
...
Let
w =X −Y ·

E[XY ]

...


32

3

Discrete random variables

IA Probability

Theorem (Markov inequality)
...

ε
Proof
...
We have
|X|

...

Take the expected value to obtain
P(|X| ≥ ε) ≤

E|X|

...
If X is a random variable with E[X 2 ] < ∞
and ε > 0, then
E[X 2 ]
P(|X| ≥ ε) ≤

...
Again, we have
x2

...

I[{|X| ≥ ε}] ≤

Note that these are really powerful results, since they do not make any
assumptions about the distribution of X
...

An important corollary is that if µ = E[X], then

P(|X − µ| ≥ ε) ≤ var(X) / ε².

3 Weak law of large numbers

Theorem (Weak law of large numbers). Let X1, X2, · · · be iid random variables, with mean µ and variance σ². Let Sn = X1 + · · · + Xn. Then for all ε > 0,

P(|Sn/n − µ| ≥ ε) → 0

as n → ∞...

Proof
...
For example, if
X1 = X2 = X3 = · · · = 1 or 0, each with probability 1/2
...

Example... Then

Sn/n = (number of heads) / (number of tosses).

This means that as we toss more and more coins, the proportion of heads will tend towards p...
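A minimal simulation (Python, my own sketch) illustrates the weak law: the probability that the proportion of heads is far from p shrinks as n grows.

import random

def prob_far_from_p(n, p=0.3, eps=0.05, trials=500):
    """Estimate P(|S_n / n - p| >= eps) for n Bernoulli(p) tosses."""
    far = 0
    for _ in range(trials):
        heads = sum(random.random() < p for _ in range(n))
        if abs(heads / n - p) >= eps:
            far += 1
    return far / trials

for n in [10, 100, 1000, 10000]:
    print(n, prob_far_from_p(n))   # tends to 0 as n increases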

Theorem (Strong law of large numbers)
...

P
n
We say
Sn
→as µ,
n
where “as” means “almost surely”
...
The proof is left for Part II because it is too hard
...
4 Multiple random variables

If we have two random variables, we can study the relationship between them
...
Given two random variables X, Y , the covariance is
cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
...

(i) cov(X, c) = 0 for constant c
...

(iii) cov(X, Y ) = cov(Y, X)
...

(v) cov(X, X) = var(X)
...

(vii) If X, Y are independent, cov(X, Y ) = 0
...

It is important to note that cov(X, Y ) = 0 does not imply X and Y are
independent
...

– Let (X, Y ) = (2, 0), (−1, −1) or (−1, 1) with equal probabilities of 1/3
...

However, cov(X, Y ) = E[XY ] − E[X]E[Y ] = 0 − 0 · 0 = 0
...
So cov(X, Y ) = 0
but X and Y are clearly not independent (they have to satisfy x2 + y 2 = 1)
...

For one, the covariance can (potentially) have dimensions, which means that the
numerical value of the covariance can depend on what units we are using
...
To solve these problems, we define
Definition (Correlation coefficient)
...


Proposition
...

Proof
...

Again, zero correlation does not necessarily imply independence
...
Given two random variables
X, Y , P(X = x, Y = y) is known as the joint distribution
...
We can
also consider different conditional expectations
...
Let X and Y be random variables (in
general not independent) with joint distribution P(X = x, Y = y)
...

y∈Ωy

The conditional distribution of X given Y is
P(X = x | Y = y) =

P(X = x, Y = y)

...

E[X | Y = y] =
x∈ΩX

We can view E[X | Y ] as a random variable in Y : given a value of Y , we return
the expectation of X
...
Consider a dice roll
...
Let X be the value of the roll
...

Example
...
Let Y = X1 + · · · + Xn
...

n−1
n
r p (1 − p)
So


r
r
Y
r
+0 1−
= =
...
If X and Y are independent, then
E[X | Y ] = E[X]
Proof
...
Since it is equally likely to be even or odd,
the expected value of the dice roll is 3
...
This is formally captured by

36

3

Discrete random variables

IA Probability

Theorem (Tower property of conditional expectation)
...

Proof
...

This is also called the law of total expectation
...
Then
X
E[X] =
E[X | Ai ]P(Ai )
...
5 Probability generating functions

Consider a random variable X, taking values 0, 1, 2, · · ·
...

Definition (Probability generating function (pgf))
...


0

This is a power series (or polynomial), and converges if |z| ≤ 1, since
X
X
|p(z)| ≤
pr |z r | ≤
pr = 1
...

This definition might seem a bit out of the blue
...

Example
...
e
...
So


1
1
1 − z6
p(z) = E[z X ] = (z + z 2 + · · · + z 6 ) = z

...
The distribution of X is uniquely determined by its probability
generating function
...
By definition, p0 = p(0), p1 = p0 (0) etc
...

In general,


di
= i!pi
...

Theorem (Abel’s lemma)
...

z→1

If p0 (z) is continuous, then simply E[X] = p0 (1)
...
What is important here is that even
if p0 (1) doesn’t exist, we can still take the limit and obtain the expected value,
e
...
when E[X] = ∞
...
For z < 1, we have
p0 (z) =


X


X

rpr z r−1 ≤

1

rpr = E[X]
...


z→1

On the other hand, for any ε, if we pick N large, then
N
X

rpr ≥ E[X] − ε
...

z→1

So E[X] ≤ lim p0 (z)
...

E[X(X − 1)] = lim p00 (z)
...
Same as above
...
Consider the Poisson distribution
...

r!

1 r −λ
λ e = eλz e−λ = eλ(z−1)
...

We have

d λ(z−1)
E[X] =
= λ,
e

dz
z=1
and
E[X(X − 1)] =


d2 λ(z−1)
= λ2
e

dx2
z=1

So
var(X) = E[X 2 ] − E[X]2 = λ2 + λ − λ2 = λ
...
Suppose X1 , X2 , · · · , Xn are independent random variables with pgfs
p1 , p2 , · · · , pn
...

Proof
...

Example
...
Then
p(z) =

n
X

r

P(X = r)z =

X n
r

r=0

pr (1 − p)n−r z r = (pz + (1 − p))n = (pz + q)n
...
But pz + q is the pgf of Y ∼ B(1, p)
...
e
...

Example
...

We can also do it directly:
P(X + Y = r) =

r
X

P(X = i, Y = r − i) =

i=0

r
X

P(X = i)P(X = r − i),

i=0

but is much more complicated
...

Example
...
One way
to do it is

39

3

Discrete random variables

IA Probability

We can do it recursively: suppose there are fn ways to tile a 2 × n grid
...
So
fn = fn−1 + fn−2 ,
which is simply the Fibonacci sequence, with f0 = f1 = 1
...

F (z) =
n=0

Then from our recurrence relation, we obtain
fn z n = fn−1 z n + fn−2 z n
...


n=2

Since f0 = f1 = 1, we have
F (z) − f0 − zf1 = z(F (z) − f0 ) + z 2 F (z)
...
If we write
α1 =


1
(1 + 5),
2

α2 =


1
(1 − 5)
...

α1 − α2
n=0
n=0
So
fn =

α1n+1 − α2n+1

...
A Dyck word is a string of brackets that match, such as (), or ((())())
...
There are 2 of length 4, (()) and
()()
...

Let Cn be the number of Dyck words of length 2n
...
Since the lengths of w1
and w2 must sum up to 2(n − 1),
Cn+1 =

n
X
i=0

40

Ci Cn−i
...


n=0

From (∗), we can show that
c(x) = 1 + xc(x)2
...
Then
Cn =

 
1
2n

...
For example, an insurance company may receive a random number of claims, each demanding a random amount of money... This can be answered using probability generating functions...

Let X1, X2, · · · be iid with pgf p(z) = E[z^X], and let N be a random variable independent of the Xi with pgf h(z)... What is the pgf of S = X1 + · · · + XN?

E[z^S] = E[z^(X1 + ··· + XN)] = E_N[E_X[z^(X1 + ··· + XN) | N]] = h(p(z)).

Then

E[S] = d/dz h(p(z)) |_(z=1) = h'(p(1)) p'(1) = E[N] E[X1].

To calculate the variance, use the fact that

E[S(S − 1)] = d²/dz² h(p(z)) |_(z=1)...
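Here is a small Python sketch (my own, not from the notes) that checks E[S] = E[N] E[X1] by simulation; the particular choices N ∼ Poisson and Xi geometric are illustrative assumptions only.

import math
import random

def sample_poisson(lam):
    """Sample Poisson(lam) by inverting its cdf."""
    u = random.random()
    k, p = 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def sample_geometric(q):
    """P(X = k) = (1 - q) * q**k for k = 0, 1, 2, ..."""
    x = 0
    while random.random() < q:
        x += 1
    return x

def sample_random_sum(lam, q):
    """S = X1 + ... + XN with N ~ Poisson(lam) and Xi iid geometric."""
    return sum(sample_geometric(q) for _ in range(sample_poisson(lam)))

lam, q, trials = 3.0, 0.4, 100_000
mean_s = sum(sample_random_sum(lam, q) for _ in range(trials)) / trials
print(mean_s, lam * q / (1 - q))   # simulated E[S] against E[N] * E[X1]; both ≈ 2.0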


4 Interesting problems

Here we are going to study two rather important and interesting probabilistic
processes — branching processes and random walks
...


4
...
At
the beginning, there is only one individual
...
In the next iteration, each offspring
will individually independently reproduce randomly according to the same
distribution
...

Consider X0 , X1 , · · · , where Xn is the number of individuals in the nth
generation
...

(iii) Suppose all offspring behave independently
...

It is useful to consider the pgf of a branching process
...
Then

X
n
F (z) = E[z Yi ] = E[z X1 ] =
pk z k
...

The main theorem of branching processes here is
Theorem
...


43

4

Interesting problems

IA Probability

Proof
...
Suppose
E[X1 ] =

X

kpk = µ

and
var(X1 ) = E[(X − µ)2 ] =

X
(k − µ)2 pk < ∞
...


Proof
...

To calculate the variance, note that
var(Xn ) = E[Xn2 ] − (E[Xn ])2
and hence
E[Xn2 ] = var(Xn ) + (E[X])2
We then calculate
E[Xn2 ] = E[E[Xn2 | Xn−1 ]]
= E[var(Xn ) + (E[Xn ])2 | Xn−1 ]
= E[Xn−1 var(X1 ) + (µXn−1 )2 ]
= E[Xn−1 σ 2 + (µXn−1 )2 ]
2
= σ 2 µn−1 + µ2 E[Xn−1
]
...

Of course, we can also obtain this using the probability generating function as
well
...

Let q be the probability that extinction eventually occurs
...


n=1

Since A1 ⊆ A2 ⊆ A3 ⊆ · · · , we know that

q = P(A) = lim P(An) = lim P(Xn = 0)...

So

F(q) = F(lim Fn(0)) = lim F(Fn(0)) = lim Fn+1(0) = q...

Alternatively, using the law of total probability,

q = Σ_k P(X1 = k) P(extinction | X1 = k) = Σ_k pk q^k = F(q),

where the second equality comes from the fact that for the whole population to go extinct, each individual population must go extinct...
However, if there are many fixed points, which should we pick?
Theorem
...
Write µ = E[X1 ]
...


45

4

Interesting problems

IA Probability

Proof
...
Then note
that 0 ≤ α ⇒ F (0) ≤ F (α) = α since F is increasing (proof: write the function
out!)
...
Continuing inductively, Fn (0) ≤ α for all n
...

n→∞

So q = α
...
We know that F′(z), F′′(z) ≥ 0 for z ∈ (0, 1) (proof: write it out again!)... Since F′(1) = µ ≤ 1, the curve must approach (1, 1) from above the F = z line...
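Numerically, q can be found by iterating q ← F(q) starting from 0, which converges to the smallest non-negative root of F(q) = q. The Python sketch below is my own illustration, with an offspring distribution chosen purely as an example.

def extinction_probability(pgf, iterations=1000):
    """Smallest root of F(q) = q, via the fixed-point iteration q_{n+1} = F(q_n), q_0 = 0."""
    q = 0.0
    for _ in range(iterations):
        q = pgf(q)
    return q

# example offspring distribution: 0, 1 or 2 children with probabilities 1/4, 1/4, 1/2,
# so F(z) = 1/4 + z/4 + z^2/2 and the mean number of children is 1.25 > 1
F = lambda z: 0.25 + 0.25 * z + 0.5 * z ** 2
print(extinction_probability(F))   # 0.5, the smaller root of F(z) = z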


4
...

Definition (Random walk)
...
Let Sn =
S0 + X1 + · · · + Xn
...

If p = q = 12 , we say it is a symmetric random walk
...
A gambler starts with $z, with z < a, and plays a game in which he
wins $1 or loses $1 at each turn with probabilities p and q respectively
...
So

pz = q pz−1 + p pz+1

for 0 < z < a, with p0 = 0 and pa = 1... Trying pz = t^z, we get

p t² − t + q = (pt − q)(t − 1) = 0,

noting that p = 1 − q... The general solution is pz = A + B(q/p)^z. Since p0 = 0, we get A = −B... So

pz = (1 − (q/p)^z) / (1 − (q/p)^a).

If p = q, then pz = A + Bz = z/a... Since pz + qz = 1, we know that the game will eventually end...

So if the odds are against you (i.e. the probability of losing is greater than the probability of winning), then no matter how small the difference is, you are bound to go bankrupt eventually...
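The formula for pz is easy to check by simulation. The Python sketch below (my own illustration) estimates the probability of reaching a before going bankrupt and compares it with (1 − (q/p)^z)/(1 − (q/p)^a).

import random

def reach_a_before_ruin(z, a, p, trials=100_000):
    """Estimate P(hit a before 0 | start at z) for a +1/−1 walk with up-probability p."""
    wins = 0
    for _ in range(trials):
        pos = z
        while 0 < pos < a:
            pos += 1 if random.random() < p else -1
        wins += (pos == a)
    return wins / trials

z, a, p = 10, 20, 0.45
q = 1 - p
theory = (1 - (q / p) ** z) / (1 - (q / p) ** a)
print(reach_a_before_ruin(z, a, p), theory)   # both ≈ 0.12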
We
first show that this is bounded
...
This happens with probability pa + q a ,
and takes a steps
...
So we must have
Dz ≤

a
pa + q a

This is a very crude bound, but it is sufficient to show that it is bounded, and
we can meaningfully apply formulas to this finite quantity
...

We first find a particular solution by trying Dz = Cz
...

So
C=

1
,
q−p

for p 6= q
...
Try Dz = tz
...
So the general solution is
 z
z
q
+

...

q − p q − p 1 − (q/p)a

If p = q, then to find the particular solution, we have to try
Dz = Cz 2
...
So
Dz = −z 2 + A + Bz
...

Using generating functions
We can use generating functions to solve the problem as well
...

We have the following conditions: U0,0 = 1; Uz,0 = 0 for 0 < z ≤ a;
U0,n = Ua,n = 0 for n > 0
...

Uz (s) =


X

Uz,n sn
...

Multiply by sn+1 and sum on n = 0, 1, · · ·
...

48

4

Interesting problems

IA Probability

We try Uz (s) = [λ(s)]z
...

Then
λ1 (s), λ2 (s) =



p
1 − 4pqs2

...

Since U0 (s) = 1 and Ua (s) = 0, we know that
A(s) + B(s) = 1
and
A(s)λ1 (s)a + B(s)λ2 (s)a = 0
...

λ1 (s)a − λ2 (s)a

Since λ1 (s)λ2 (s) = pq , we can “simplify” this to obtain
Uz (s) =

 z
q
λ1 (s)a−z − λ2 (s)a−z

...
We can apply the same method to find the generating
function for absorption at a, say Vz (s)
...
Hence the expected duration is Dz = Uz0 (1) + Vz0 (1)
...
5 Continuous random variables

1 Continuous random variables

So far, we have only looked at the case where the outcomes Ω are discrete
...
Then our sample space is
Ω = {ω ∈ R : 0 ≤ ω < 2π}
...


With continuous distributions, we can no longer talk about the probability of
getting a particular number, since this is always zero
...
42 radians
...
To capture the distribution of X, we want to define a function
f such that for each x and small δx, the probability of X lying in [x, x + δx] is
given by f (x)δx + o(δx)
...

Rb
Integrating this, we know that the probability of X ∈ [a, b] is a f (x) dx
...

Definition (Continuous random variable)
...

a

We call f the probability density function, which satisfies
– f ≥0
R∞
– −∞ f (x) = 1
...
Then we also have


[
P
[X = a] = 0,
a∈Q

since it is a countable union of probability 0 events (and axiom 3 states that the
probability of the countable union is the sum of probabilities, i
...
0)
...
The cumulative distribution
function (or simply distribution function) of a random variable X (discrete,
continuous, or neither) is
F (x) = P(X ≤ x)
...

In the case of continuous random variables, we have
Z x
F (x) =
f (z) dz
...
In general, F 0 (x) = f (x) whenever F is
differentiable
...

The cdf of a continuous random variable might look like this:
P

The cdf of a discrete random variable might look like this:
P

The cdf of a random variable that is neither discrete nor continuous might look
like this:
P

Note that we always have
P(a < x ≤ b) = F (b) − F (a)
...


Definition (Uniform distribution)
...

b−a

x

f (z) dz =
a

x−a
b−a

for a ≤ x ≤ b
...

Definition (Exponential random variable)
...

We write X ∼ E(λ)
...


a

Proposition
...
e
...

This means that, say if X measures the lifetime of a light bulb, knowing it has
already lasted for 3 hours does not give any information about how much longer
it will last
...

Proof
...

Similarly, given that, you have lived for x days, what is the probability of
dying within the next δx days?
P(x ≤ X ≤ x + δx)
P(X ≥ x)
−λx
λe
δx
=
−λx
e
= λδx
...

In general, we can define the hazard rate to be
h(x) =

f (x)

...

P(X ≥ x)
1 − F (x)

In the case of exponential distribution, h(x) is constant and equal to λ
...

However, we will (obviously) have to replace the sum with an integral
...
The expectation (or mean) of a continuous random
variable is
Z ∞
E[X] =
xf (x) dx,
−∞

provided not both

R∞
0

R0

xf (x) dx and

xf (x) dx are infinite
...
If X is a continuous random variable, then
Z ∞
Z ∞
E[X] =
P(X ≥ x) dx −
P(X ≤ −x) dx
...

Proof
...

0

We can similarly show that

R∞
0

P(X ≤ −x) dx = −

R0
−∞

yf (y) dy
...
Suppose X ∼ E(λ)
...

x

So

Z
E[X] =



e−λx dx =

0

1

...
The variance of a continuous random variable is
var(X) = E[(X − E[X])2 ] = E[X 2 ] − (E[X])2
...


5

Continuous random variables

IA Probability

Example
...
Then
b

Z

x

E[X] =
a

1
a+b
dx =

...

12
Apart from the mean, or expected value, we can also have other notions of
“average values”
...
Given a pdf f (x), we call x
ˆ a mode if
f (ˆ
x) ≥ f (x)
for all x
...
For example, in the
uniform distribution, all x are modes
...

2
−∞
x
ˆ
For a discrete random variable, the median is x
ˆ such that
P(X ≤ x
ˆ) ≥

1
,
2

P(X ≥ x
ˆ) ≥

1

...

Suppose that we have an experiment whose outcome follows the distribution
of X
...
The average of these results is known as the sample mean
...
If X1 , · · · , Xn is a random sample from some
distribution, then the sample mean is
n

X
¯= 1
X
Xi
...
2

Stochastic ordering and inspection paradox

We want to define a (partial) order on two different random variables
...

A simple definition might be expectation ordering, where X ≥ Y if E[X] ≥
E[Y ]
...
Instead, we can use the stochastic order
...
The stochastic order is defined as: X ≥st Y if
P(X > t) ≥ P(Y > t) for all t
...
Stochastic ordering implies expectation ordering,
since
Z ∞
Z ∞
X ≥st Y ⇒
P(X > x) dx ≥
P(y > x) dx ⇒ E[X] ≥ E[Y ]
...

Example (Inspection paradox)
...
Family i has Xi children at the school, where X1 , · · · , Xn are iid random
variables, with P (Xi = k) = pk
...

Now pick a child at random
...
Then
P(J = j, Xj = k)

...
The numerator is more complex
...
So
i6=j

"
P(XJ = k | J = j) = E

k+

nkpk
P

#

i6=j

Xi


...
So
"
P(XJ = k) = E

k+

nkpk
P

i6=j

#
Xi


...
So
"
#
nk
P(XJ = k)
P
=E

...
So the average value of the
family size of the child we picked is greater than the average family size
...

This means that if we pick the children randomly, the sample mean of the
family size will be greater than the actual mean
...


5
...
Two random variables X, Y have joint distribution F : R2 7→ [0, 1] defined by
F (x, y) = P(X ≤ x, Y ≤ y)
...
We say X1 , · · · , Xn are
jointly distributed continuous random variables and have joint pdf f if for any
set A ⊆ Rn
Z
P((X1 , · · · , Xn ) ∈ A) =
f (x1 , · · · , xn ) dx1 · · · dxn
...

Rn

Example
...

−∞

−∞

If F is differentiable, then
∂2
F (x, y)
...
If X and Y are jointly continuous random variables, then they are
individually continuous random variables
...
We prove this by showing that X has a density function
...

Definition (Independent continuous random variables)
...

If we let FXi and fXi be the cdf, pdf of X, then
F (x1 , · · · , xn ) = FX1 (x1 ) · · · FXn (xn )
and
f (x1 , · · · , xn ) = fX1 (x1 ) · · · fXn (xn )
are each individually equivalent to the definition above
...

Example
...

Then we can see that f (x1 , x2 ) = 1 · 1 = f (x1 ) · f (x2 )
...

On the other hand, if (Y1 , Y2 ) takes a random value from [0, 1] × [0, 1] with
the restriction that Y1 ≤ Y2 , then they are not independent, since f (x1 , x2 ) =
2I[Y1 ≤ Y2 ], which cannot be split into two parts
...
For independent continuous random variables Xi ,
Q
Q
(i) E[ Xi ] = E[Xi ]
P
P
(ii) var( Xi ) = var(Xi )

5
...

Example
...
What is the probability that |X − Y | ≤ `? By “at random”, we mean
f (x, y) =

1
,
L2

since each of X and Y have pdf 1/L
...

The total area of the white part is simply the area of a square with length L − `
...
So the desired probability is
Z
2L` − `2
f (x, y) dx dy =

...
Suppose we draw a random chord in a circle
...


57

5

Continuous random variables

IA Probability

(i) We randomly pick two end points over the circumference independently
...
For the
length of the chord to be longer than a side of the triangle, the other end
point must between the two other vertices of the triangle
...


(ii) wlog the chord is horizontal, on the lower side of the circle
...
Then the
probability of getting a long line is 1/2
...
Then
you get a long line if and only if the mid-point lies in the smaller circle
shown below
...


We get different answers for different notions of “random”! This is why when we
say “randomly”, we should be explicit in what we mean by that
...
A needle of length ` is tossed at random onto a
floor marked with parallel lines a distance L apart, where ` ≤ L
...
What is P(A)?
X

`
θ

L

Suppose that X ∼ U [0, L] and θ ∼ U [0, π]
...



5

Continuous random variables

IA Probability

This touches a line iff X ≤ ` sin θ
...

π
πL

P(X≤` sin θ)

Since the answer involves π, we can estimate π by conducting repeated experiments! Suppose we have N hits out of n tosses
...
Hence
2`
π
ˆ=

...


5
...
The normal distribution with parameters
µ, σ 2 , written N (µ, σ 2 ) has pdf


1
(x − µ)2
f (x) = √
exp −
,
2σ 2
2πσ
for −∞ < x < ∞
...
e
...

We usually write φ(x) for the pdf and Φ(x) for the cdf of the standard normal
...
This is partly due to the
central limit theorem, which says that if we have a large number of iid random
variables, then the distribution of their averages are approximately normal
...

We first have to show that this makes sense, i
...

Proposition
...
Substitute z =

1
2πσ 2

1

2

e− 2σ2 (x−µ) dx = 1
...


Then
Z ∞
1 2
1
√ e− 2 z dz
...

2

Z



We also have
Proposition
...

Proof
...

dx + √
2πσ −∞
2πσ −∞
The first term is antisymmetric about µ and gives 0
...
So we get µ
...

Proposition
...

Proof
...
Substitute Z = X−µ
σ
...

Then
Z ∞
2
1
z 2 e−z /2 dz
var(Z) = √
2π −∞

∞
Z ∞
2
1
1
−z 2 /2
= − √ ze
+√
e−z /2 dz

2π −∞
−∞
=0+1
=1
So var X = σ 2
...
UK adult male heights are normally distributed with mean 70” and
standard deviation 3”
...

What is P(Y > X), where X and Y are the heights of randomly chosen UK
and Netherlands males, respectively?
We have X ∼ N (70, 32 ) and Y ∼ N (71, 32 )
...




Y −X −1
−1

P(Y > X) = P(Y − X > 0) = P
>√
= 1 − Φ(−1/ 18),
18
18
since

(Y −X)−1

18

∼ N (0, 1), and the answer is approximately 0
...


60

5

Continuous random variables

IA Probability

Now suppose that in both countries, the Olympic male basketball teams are
selected from that portion of male whose hight is at least above 4” above the
mean (which corresponds to the 9
...
What is the
probability that a randomly chosen Netherlands player is taller than a randomly
chosen UK player?
For the second part, we have
R 75
R∞ R∞
φ (x) dx + x=75 y=x φY (y)φX (x) dy dx
x=74 X
R∞
R∞
,
P(Y > X | X ≥ 74, Y ≥ 75) =
φ (x) dx y=75 φY (y) dy
x=74 X
which is approximately 0
...
So even though the Netherlands people are only
slightly taller, if we consider the tallest bunch, the Netherlands people will be
much taller on average
...
6 Transformation of random variables

We will now look at what happens when we apply a function to random variables
...

Single random variable
Theorem
...

dy

Proof
...

dy

It is often easier to redo the proof than to remember the result
...
Let X ∼ U [0, 1]
...
Then
P(Y ≤ y) = P(− log X ≤ y)
= P(X ≥ e−y )
= 1 − e−y
...
So Y is exponentially
distributed with λ = 1
...
Let U ∼ U[0, 1]. If F is a strictly increasing distribution function, then X = F⁻¹(U) has distribution function F...

Proof...

This condition "strictly increasing" is needed for the inverse to exist...

This can also be done for discrete random variables P(X = xi) = pi by letting X = xj if

Σ_{i=0}^{j−1} pi ≤ U < Σ_{i=0}^{j} pi,

for U ∼ U[0, 1]...
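Here is a minimal Python sketch of inverse-transform sampling (my own illustration, not part of the notes): the continuous case uses the exponential distribution, whose cdf inverts in closed form, and the discrete case walks the cumulative probabilities exactly as above.

import math
import random

def sample_exponential(lam):
    """X = F^{-1}(U) with F(x) = 1 - exp(-lam * x), i.e. X = -log(1 - U) / lam."""
    return -math.log(1 - random.random()) / lam

def sample_discrete(values, probs):
    """Return values[j], where j is the first index whose cumulative probability exceeds U."""
    u = random.random()
    cumulative = 0.0
    for x, p in zip(values, probs):
        cumulative += p
        if u < cumulative:
            return x
    return values[-1]   # guard against floating-point rounding

samples = [sample_exponential(2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))            # ≈ 1/λ = 0.5
print(sample_discrete([1, 2, 3], [0.2, 0.5, 0.3]))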
Let
Y1 = r1 (X1 , · · · , Xn )
Y2 = r2 (X1 , · · · , Xn )

...

Yn = rn (X1 , · · · , Xn )
...

2
n
Let R ⊆ R such that P((X1 , · · · , Xn ) ∈ R) = 1, i
...
R is the set of all values
(Xi ) can take
...
Then there exists an inverse function

X1 = s1 (Y1 , · · · , Yn )
X2 = s2 (Y1 , · · · , Yn )

...

Xn = sn (Y1 , · · · , Yn )
...

∂si
Definition (Jacobian determinant)
...
Then the Jacobian determinant is
 ∂s1
∂s1 
· · · ∂y
∂y1
n
∂(s1 , · · · , sn )


...

J=
= det 
...


...
Then using results from IA Vector Calculus, we
get
Z
P((X1 , · · · , Xn ) ∈ A) =

f (x1 , · · · , xn ) dx1 · · · dxn
ZA
f (s1 (y1 , · · · yn ), s2 , · · · , sn )|J| dy1 · · · dyn

=
B

= P((Y1 , · · · Yn ) ∈ B)
...
(Y1 , · · · , Yn ) has density
g(y1 , · · · , yn ) = f (s1 (y1 , · · · , yn ), · · · sn (y1 , · · · , yn ))|J|
if (y1 , · · · , yn ) ∈ S, 0 otherwise
...
Suppose (X, Y ) has density
(
4xy 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
f (x, y) =
0
otherwise
We see that X and Y are independent, with each having
a density fp
(x) = 2x
...
Then we have X = U V and Y = V /U
...
So

g(u, v) = 4 uv

r

v 1
2v
=
,
u 2u
u

if (u, v) is in the image S, 0 otherwise
...

u

Since this is not separable, we know that U and V are not independent
...
Suppose
 
 
Y1
X1

...

Y = 
...
 = AX
Yn
Then X = A−1 Y
...
So |J| = | det(A−1 )| = | det A|−1
...

| det A|

63

5

Continuous random variables

IA Probability

Example
...
Suppose we want to find
the pdf of Y = X1 + X2
...
Then X1 = Y − Z and X2 = Z
...
Then
g(y, z) = f (y − z, z)
So

Z



Z



f (y − z, z) dz =

gY (y) =
−∞

f (z, y − z) dz
...
Then
Z ∞
g(y) =
f1 (z)f2 (y − z) dz
...

What if the mapping is not? There is no simple formula for that, and we have
to work out each case individually
...
Suppose X has pdf f
...
We have
Z b
Z −a
Z b
P(|X| ∈ x(a, b)) =
f (x) +
f (x) dx =
(f (x) + f (−x)) dx
...

Example
...

Let Y = min(X1 , X2 )
...

So Y ∼ E(λ + µ)
...
In general, we define the order
statistics as follows:
Definition (Order statistics)
...
e
...
This is the order statistics
...

64

5

Continuous random variables

IA Probability

Assume the Xi are iid with cdf F and pdf f
...

So the pdf of Yn is
d
F (y)n = nf (y)F (y)n−1
...

What about the joint distribution of Y1 , Yn ?
G(y1 , yn ) = P(Y1 ≤ y1 , Yn ≤ yn )
= P(Yn ≤ yn ) − P(Y1 ≥ y1 , Yn ≤ yn )
= F (yn )n − (F (yn ) − F (y1 ))n
...

∂y1 ∂yn
We can think about this result in terms of the multinomial distribution
...

Suppose that δ is sufficiently small that all other n − 2 Xi ’s are very unlikely
to fall into [y1 , y1 + δ) and (yn − δ, yn ]
...
We want exactly one Xi to fall
into thefirst and last bins, and n − 2 Xi ’s to fall into the middle one
...

The probability of each thing falling into the middle bin is F (yn ) − F (y1 ),
and the probabilities of falling into the first and last bins are f (y1 )δ and f (yn )δ
...

We can also find the joint distribution of the order statistics, say g, since it
is just given by
g(y1 , · · · yn ) = n!f (y1 ) · · · f (yn ),
if y1 ≤ y2 ≤ · · · ≤ yn , 0 otherwise
...

In the case of iid exponential variables, we find a nice distribution for the
order statistic
...
Let X1 , · · · , Xn be iid E(λ), and Y1 , · · · , Yn be the order statistic
...


...

65

5

Continuous random variables

IA Probability

These are the distances between the occurrences
...


...


...


...
Suppose that the pdf of Z1 , · · · , Zn is, say
h
...

with all Zi independent
...
7 Moment generating functions

If X is a continuous random variable, then the analogue of the probability
generating function is the moment generating function:
Definition (Moment generating function)
...

For those θ in which m(θ) is finite, we have
Z ∞
m(θ) =
eθx f (x) dx
...

We will assume the following without proof:
Theorem
...

Definition (Moment)
...

r

Theorem
...

n

θ=0
66

5

Continuous random variables

IA Probability

Proof
...

2!

So
m(θ) = E[eθX ] = 1 + θE[X] +
Example
...
Then its mgf is
Z ∞
Z
θX
θx
−λx
E[e ] =
e λe
dx = λ
0

θ2
E[X 2 ] + · · ·
2!



e−(λ−θ)x dx =

0

λ
,
λ−θ

where 0 < θ < λ
...

E[X] = m (0) =

2
(λ − θ) θ=0
λ
0

Also,
E[X 2 ] = m00 (0) =


2

= 2
...

λ2
λ
λ

Theorem
...

Proof
...


67

6

More distributions

6

IA Probability

More distributions

6
...
The Cauchy distribution has pdf
f (x) =

1
π(1 + x2 )

for −∞ < x < ∞
...
The distribution is a bell-shaped curve
...
The mean of the Cauchy distribution is undefined, while E[X 2 ] =

...

Z
E[X] =
0



x
dx +
π(1 + x2 )

Z

0

−∞

x
dx = ∞ − ∞
π(1 + x2 )

which is undefined, but E[X 2 ] = ∞ + ∞ = ∞
...
Let Z = X + Y
...

So 12 Z has a Cauchy distribution
...

By induction, we can show that n1 (X1 + · · · + Xn ) follows Cauchy distribution
...
Of course, this is because those theorems require
the random variable to have a mean, which the Cauchy distribution lacks
...

(i) If Θ ∼ U [− π2 , π2 ], then X = tan θ has a Cauchy distribution
...


68

6

More distributions

IA Probability

X = tan θ
θ
1

(ii) If X, Y ∼ N (0, 1), then X/Y has a Cauchy distribution
...
2

Gamma distribution

Suppose X1 , · · · , Xn are iid E(λ)
...
Then the mgf of Sn
is

n
h
i





n
λ
E eθ(X1 +
...

λ−θ
We call this a gamma distribution
...
The gamma distribution Γ(n, λ) has pdf
f (x) =

λn xn−1 e−λx

...

We now show that this is indeed the sum of n iid E(λ):
Z ∞
λn xn−1 e−λx
E[eθX ] =
dx
eθx
(n − 1)!
0

n Z ∞
λ
(λ − θ)n xn−1 e−(λ−θ)
=
dx
λ−θ
(n − 1)!
0
The integral is just integrating over Γ(n, λ − θ), which gives one
...

λ−θ

6
...
Let Y1 ≤ Y2 ≤ · · · ≤ Yn be the order
statistics
...

(i − 1)!(n − i)!


n
Note that the leading term is the multinomial coefficient i−1,1,n−1

...

This is the beta distribution: Yi ∼ β(i, n − i + 1)
...
The beta distribution β(a, b) has pdf
f (x; a, b) =

Γ(a + b) a−1
x
(1 − x)b−1
Γ(a)Γ(b)

for 0 ≤ x ≤ 1
...

Its moment generating function is
m(θ) = 1 +


X

k−1
Y

k=1

r=0

a+r
a+b+r

!

θk
,
k!

which is horrendous!

6
...
The moment generating function of N (µ, σ 2 ) is


1
E[eθX ] = exp θµ + θ2 σ 2
...

E[eθX ] =

Z



eθx √

−∞

Substitute z =

x−µ
σ
...

2πσ

Then
θX

Z



1 2
1
√ eθ(µ+σz) e− 2 z dz

−∞
Z ∞
2
1 2 2
1
1
θµ+ 2 θ σ
√ e− 2 (z−θσ) dz
=e

−∞
|
{z
}

]=

pdf of N (σθ,1)

=e

θµ+ 12 θ 2 σ 2


...
Suppose X, Y are independent random variables with X ∼ N (µ1 , σ12 ),
and Y ∼ (µ2 , σ22 )
...

(ii) aX ∼ N (aµ1 , a2 σ12 )
...

(i)
E[eθ(X+Y ) ] = E[eθX ] · E[eθY ]
1

2 2

2

1

= eµ1 θ+ 2 σ1 θ · eµ2 θ+ 2 σ2 θ
1

2

2

= e(µ1 +µ2 )θ+ 2 (σ1 +σ2 )θ
which is the mgf of N (µ1 + µ2 , σ12 + σ22 )
...
Write φ(x) = √12π e−x /2 for its pdf
...
So
1 2
1 1
P(X ≥ x) ≤ √ e− 2 x
...

2

6
...
Then their joint density is
g(x1 , · · · , xn ) =
=

n
Y
i=1
n
Y
1

φ(xi )
1 2
1
√ e− 2 xi


Pn 2
1
− 21
1 xi
e
(2π)n/2
1 T
1
=
e− 2 x x ,
(2π)n/2

=

where x = (x1 , · · · , xn )T
...
Suppose we are interested in
Z = µ + AX,
where A is an invertible n × n matrix
...
Then
X = A−1 (Z − µ)
and
|J| = | det(A−1 )| =
71

1
det A

6

More distributions

IA Probability

So



1
1
1
−1
T
−1
f (z1 , · · · , zn ) =
exp − (A (z − µ)) (A (z − µ))
2
(2π)n/2 det A


1
1
T −1
exp − (z − µ) Σ (z − µ)
=
2
(2π)n/2 det A


1
1
T −1

exp − (z − µ) Σ (z − µ)
...
We say
 
Z1

...
 ∼ M V N (µ, Σ) or N (µ, Σ)
...

What is this matrix Σ? Recall that cov(Zi , Zj ) = E[(Zi − µi )(Zj − µj )]
...

In the special case where n = 1, this is a normal distribution and Σ = σ 2
...
Then Σ = diag(σ12 , · · · , σn2 )
...

2πσi
1
So Z1 , · · · , Zn are independent, with Zi ∼ N (µi , σi2 )
...
However,
this is only true when Zi are multivariate normal
...

For these random variables that involve vectors, we will need to modify our
definition of moment generating functions
...


Bivariate normal
This is the special case of the multivariate normal when n = 2
...

The bivariate normal has

 2
σ1
ρσ1 σ2
Σ=

...

σ1 σ2

−ρσ1−1 σ2−1
σ2−2



The joint pdf of the bivariate normal with zero mean is


 2
1
2ρx1 x2
x1
x22
1
p
exp −

+
f (x1 , x2 ) =
2(1 − ρ2 ) σ12
σ1 σ2
σ22
2πσ1 σ2 1 − ρ2
If the mean is non-zero, replace xi with xi − µi
...

Nice and elegant
...
Let
Sn = X1 + · · · + Xn
...

n
Theorem (Central limit theorem)
...
Define
Sn = X1 + · · · + Xn
...



Note that the final term is the pdf of a standard normal
...

σ n
To show this, we will use the continuity theorem without proof:
Theorem (Continuity theorem)
...

We now provide a sketch-proof of the central limit theorem:
Proof
...


θ2
E[Xi2 ] + · · ·
2!

1
1
= 1 + θ2 + θ3 E[Xi3 ] + · · ·
2
3!

Now consider Sn / n
...
+Xn )/


θX1 / n



n

]


θXn / n

= E[e
] · · · E[e
]

√ n
= E[eθX1 / n ]

n
1 1
1
1
= 1 + θ2 + θ3 E[X 3 ] 3/2 + · · ·
2 n 3!
n
1

→ e2θ

2

as n → ∞ since (1 + a/n)n → ea
...

So the result follows from the continuity theorem
...
Also, sometimes the moment generating function is not defined
...

The proper proof uses the characteristic function
χX (θ) = E[eiθX ]
...

Let Xi ∼ B(1, p)... So E[Sn] = np and var(Sn) = np(1 − p), and by the central limit theorem

(Sn − np) / √(np(1 − p)) ≈ N(0, 1).

Example... Each of n passengers chooses one of two planes at random... Suppose each has s seats, and let S be the number of passengers choosing the first plane, so that the plane is over-booked if S > s, i.e. ...

F(s) = P(S > s) = P( (S − n/2) / (√n / 2) > (s − n/2) / (√n / 2) ).

Since

(S − n/2) / (√n / 2) ∼ N(0, 1)

approximately, we have

F(s) ≈ 1 − Φ( (s − n/2) / (√n / 2) )...

For the values of n and s in question, (s − n/2)/(√n/2) ≈ 2.34 and Φ(2.34) ≈ 0.99, so F(s) ≈ 0.01...
...
An unknown proportion p of the electorate will vote Labour
...
005
...
Then
P(|p0 − p| ≤ 0
...
005n)






 |Sn − np|
0
...
005 with probability ≥ 0
...
Then we want
0
...
975) = 1
...


(we use 0
...
95 since we are doing a two-tailed test) Since the
maximum possible value of p(1 − p) is 1/4, we have
n ≥ 38416
...
Instead, we go by
P(|p0 < p| ≤ 0
...
95
...

Example (Estimating π with Buffon’s needle)
...

X

`
θ

L

Suppose we toss the pin n times, and it hits the line N times
...
Write p0 for the actual proportion observed
...

p



p(1 − p)

...
Since p = 2`/πL,
this is maximized with ` = L
...

0,
2n

If we want to estimate π to 3 decimal places, then we need

P(|π̂ − π| ≤ 0.001) ≥ 0.95.

Using Φ⁻¹(0.975) ≈ 1.96 and working through the normal approximation, this requires n to be about 2 × 10^7. So we can obtain π to 3 decimal places just by throwing a stick 20 million times! Isn't that exciting?
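For completeness, here is a Python sketch of the Buffon's needle estimator (my own illustration, with ℓ = L = 1); it shows in practice how slowly the estimate converges.

import math
import random

def estimate_pi(n, l=1.0, L=1.0):
    """Buffon's needle with X ~ U[0, L] and θ ~ U[0, π]; a hit occurs iff X <= l*sin(θ)."""
    hits = 0
    for _ in range(n):
        x = random.uniform(0, L)
        theta = random.uniform(0, math.pi)
        if x <= l * math.sin(theta):
            hits += 1
    return 2 * l * n / (hits * L)

for n in [10_000, 1_000_000]:
    print(n, estimate_pi(n))   # the error shrinks only like 1/sqrt(n)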

8 Summary of distributions

Discrete distributions

Distribution | PMF | Mean | Variance | PGF
Bernoulli | p^k (1 − p)^(1−k) | p | p(1 − p) | q + pz
Binomial B(n, p) | (n choose k) p^k (1 − p)^(n−k) | np | np(1 − p) | (q + pz)^n
Geometric | (1 − p)^k p | (1 − p)/p | (1 − p)/p² | p / (1 − (1 − p)z)
Poisson P(λ) | (λ^k / k!) e^(−λ) | λ | λ | e^(λ(z−1))

Continuous distributions

Distribution | PDF | CDF | Mean | Variance | MGF
Uniform U[a, b] | 1/(b − a) | (x − a)/(b − a) | (a + b)/2 | (b − a)²/12 | (e^(θb) − e^(θa)) / (θ(b − a))
Normal N(µ, σ²) | (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)) | / | µ | σ² | exp(θµ + θ²σ²/2)
Exponential E(λ) | λ e^(−λx) | 1 − e^(−λx) | 1/λ | 1/λ² | λ/(λ − θ)
Cauchy | 1/(π(1 + x²)) | / | undefined | undefined | undefined
Gamma Γ(n, λ) | λ^n x^(n−1) e^(−λx) / (n − 1)! | / | n/λ | n/λ² | (λ/(λ − θ))^n
Multivariate normal | (1/((2π)^(n/2) √(det Σ))) exp(−½ (z − µ)^T Σ⁻¹ (z − µ)) | / | µ | Σ | /

