Statistics Using R
with Biological Examples
Kim Seefeld, MS, M.Ed.*
Ernst Linder, Ph.D.
University of New Hampshire, Durham, NH
Department of Mathematics & Statistics
*Also affiliated with the Dept. ...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
My goal is to
reach those with little or no training in higher level statistics so that they can do
more of their own data analysis, communicate more with statisticians, and
appreciate the great potential statistics has to offer as a tool to answer biological
questions
...
I hope it accomplishes this mission and
encourage its free distribution and use as a course text or supplement
...
I thank the Churchill group at the Jackson Labs for inviting me to Bar Harbor while I was writing the original manuscript of this book
...
I dedicate this work to all my students – past, present and future – both those
that I teach in the classroom and the ones I am “teaching” through my writings
...
K Seefeld, May 2007
...
The goal of this book is to serve as a primer to higher level statistics for
researchers in biological fields
...
Many
of the topics we have chosen (Markov Chains, multivariate analysis) are
considered advanced level topics, typically taught only to graduate level
students in statistics
...
In doing so we have, as much as possible, avoided complicated equations and mathematical language
...
We anticipate that this will inspire further
interest in statistical study as well as make the reader a more educated consumer
of the bioinformatics literature, able to understand and analyze the statistical
techniques being used
...
We (the authors) are both teachers who believe in learning by doing and feel
there would be little use in presenting statistical concepts without providing
examples using these concepts
...
It is not true, as often misperceived by
researchers, that computer programming languages (such as Java or Perl) or
office applications (such as spreadsheets or database applications) can replace a
dedicated statistical data analysis package.
The majority of functionality needed to perform
sophisticated data analysis is found only in specialized statistical software
...
R has been in active, progressive development by a team of top-notch
statisticians for several years
...
What is most amazing about R is that it is completely free, making it wonderfully accessible to students and researchers
...
One of the biggest growth areas in contributed
packages in recent years has come from bioinformatics researchers, who have
contributed packages for QTL and microarray analysis, among other
applications
...
Rather than learn multiple tools, students and
researchers can use one consistent environment for many tasks
...
The “disadvantage” of R is that there is a learning curve required to master its
use (however, this is the case with all statistical software)
...
In the beginning of the book we cover enough ground to get one up and
running with R
...
However, R is a fully
extensible system and as an open source project, users are welcome to contribute
code
...
Therefore R will
appeal to computer scientists interested in applying their skills to statistical data
analysis applications
...
The Basics of R (Ch 2 – 5)
This section presents an orientation to using R
...
Chapter 3 introduces how to work with data in R, including how to
manipulate data, how to save and import/export datasets, and how to get help
...
Chapter 5 covers basic exploratory data analysis and summary functionality and outlines the features of R’s graphics system
...
Probability Theory and Modeling (Ch 6-9)
These chapters are probably the most “theoretical” in the book
...
Chapters
6-8 cover probability theory, univariate, and multivariate probability
distributions respectively
...
Chapter 9 introduces Bayesian data
analysis, which is a different theoretical perspective on probability that has vast
applications in bioinformatics
...
Chapter 11 explains some popular algorithms – the Gibbs sampler and
the Metropolis Hastings algorithm – that use Markov chains and appear
extensively in bioinformatics literature
...
Inferential Statistics (Ch 13-15)
The topics in these chapters are the topics covered in traditional introductory
statistics courses and should be familiar to most biological researchers
...
Chapter 13
covers the basics of statistical sampling theory and sampling distributions, but
added to these basics is some coverage of bootstrapping, a popular inference
technique in bioinformatics
...
Regression and ANOVA
are covered in Chapter 15 along with a brief introduction to general linear
models
...
It is hoped that this book serves as a bridge to
enable biological researchers to understand the statistical techniques used in
these packages
...
2
The R Environment
This chapter provides an introduction to the R environment, including an
overview of the environment, how to obtain and install R, and how to work with
packages
...
As a project, R is part of the GNU free software project (www.gnu.org), an international effort to share software on a free basis, without license restrictions
...
The development and
licensing of R are done under the philosophy that software should be free and
not proprietary
...
Mainly, that “R is free software and comes with ABSOLUTELY
NO WARRANTY
...
There is no quality control team of a software company regulating R as a
product
...
The R project was started in 1995 by a group of statisticians at the University of Auckland and has continued to grow ever since
...
There are a lot of niches in
terms of R users, including: environmental statistics, econometrics, medical and
public health applications, and bioinformatics, among others
...
The main website for the R project is www.r-project.org
Rather than repeat its
contents here, we encourage the reader to go ahead and spend some time reading
the contents of this site to get familiar with the R project
...
The
next chapter briefly discusses this language and introduces how to work with
data objects using the S language
...
R is an interactive software application designed
specifically to perform calculations (a giant calculator of sorts), manipulate data
(including importing data from other sources, discussed in Chapter 3), and
produce graphical displays of data and results
...
It is not at all difficult to use, but it does take a little getting used to, and this and the three subsequent chapters are geared mainly toward getting the user acquainted with working in R
...
This can be done on
the Comprehensive R Archive Network, or CRAN, site, illustrated in Figure 2-1
...
The URL for this site is cran.r-project.org
This site will be referred to many times (and it links to the www.r-project.org site directly through the R homepage link on the left menu), and the user is advised to make a note of these URLs
...
On the top right of the page shown in Figure 2-1 is a section entitled “Precompiled Binary Distributions”; these are versions of R you can download that are already compiled into a program package.
In this section are links to download R for various operating systems. If you click on the Windows link, for example, you get the screen depicted in Figure 2-2
...
The current version of R is available for download as the file with a filename ending in *.exe
R is constantly being updated
and new versions are constantly released, although prior versions remain
available for download
...
Figure 2-3
After downloading, the program needs to be installed. This is done by double-clicking on the “*.exe” icon created by the download (where * is the filename of the current version) and following the series of instructions presented in dialog boxes, which include accepting the user license and choosing whether you want documentation installed
...
Installation for Mac and Linux systems follows similar steps
...
Exploring the Environment
When you start up R the screen will look like Figure 2-4
...
There is a main application window and within
it a console window
...
...
The Command Line
The command line is where you interact with R
...
At the command line you type in code telling R what to do
...
Try
typing some basic arithmetic tasks at the command line
...
Subsequent
lines after the first line are designated by a “+” symbol
...
The Menu Bar
The menu bar in R is very similar to that in most Windows based programs
...
Much of the
functionality provided by the menus is redundant with those available using
standard windows commands (CTRL+C to copy, for example) and with
commands you can enter at the command line
...
File
The file menu contains options for opening, saving, and printing R documents,
as well as the option for exiting the program (which can also be done using the
close button in the upper right hand corner of the main program window)
...
The next chapter discusses the different
save options available in some detail as well as what a workspace and a history
are in terms of R files
...
Edit
The edit menu contains the standard functionality of cut, copy and paste, and
select all
...
The “Data editor” option
allows you to access the data editor, a spreadsheet like interface for manually
editing data discussed in depth in the next chapter
...
Misc
The Misc menu contains some functionality not categorized elsewhere
...
This is your panic button should you have the misfortune of coding R into something where it gets stuck, such as programming a loop which has no end or encountering some other unforeseeable snag
...
Always try this
before doing something more drastic as it will often work
...
We
will discuss working with objects in the next chapter
...
Packages
The packages menu is very important, as it is the easiest way to load and install
packages to the R system
...
Windows
The Windows menu provides options for cascading, and tiling windows
...
Help
The Help menu directs you to various sources of help and warrants some
exploration
...
The next two options provide the FAQ (Frequently Asked Questions) HTML
documents for R and R for the operating system you are using
...
The FAQ documents provide answers to technical
questions and are worth browsing through
...
“R language (standard)” pops up the help dialog box shown in Figure 2-5
...
This can also be accomplished using
the help () command, as we will see in the next chapter
...
This
should be available off-line as part of the R installation
...
...
“Apropos” pops up a dialog box similar to the help box depicted in Figure 2-5, except that you only need to enter a partial search term to search the R documents
...
One of the most difficult tasks in R is finding documentation to help you
...
However, much of the documentation
is technical rather than tutorial, and geared more toward the programmer and
developer rather than the applied user
...
The Toolbar
Below the menu bar is the toolbar, depicted in Figure 2-5
...
If you scroll over the icons
with your mouse slowly you will get rollover messages about the feature of each
icon
...
Figure 2-5
Packages
The basic R installation contains the package base and several other packages
considered essential enough to include in the main software installation
...
Installing and loading
contributed packages adds additional specialized functionality
...
You only need to install a package once on your system, as it is saved locally, ready to be loaded whenever you want to use it
...
Installing Packages
In order to use an R package, it must be installed on your system
...
Most packages are available from the
CRAN site as contributed packages, and can be directly downloaded in R
...
In order to do this, select “Install package from CRAN” from the Packages menu
...
Some packages are not available directly from the CRAN site
...
To install them select the “Install package from local zip file”
option on the packages menu and R will put them in the appropriate directory
...
This makes R more efficient and uses less overhead than if all installed packages were loaded every time you use R, but it makes the user do a little more work
...
This produces another dialog box very similar to Figure 2-7,
only this time the list of packages includes only those packages which are already installed on your system.
Select the package to load and you should be all set to use the
features of that package during the current work session
...
Even if you open previous work in R, you will still need to reload any packages whose features you are using
...
S-Plus
The other major implementation of the S language, which is a popular statistical
application, is a commercial package called S-Plus
S-Plus is a product of Insightful Corporation (www.insightful.com)
There is a demo version of S-Plus available for
evaluation purposes
...
Most code written with R can be used with S-Plus and vice versa
...
S-Plus is a GUI based application
environment that has many features that allow for data analysis to be more menu
and pop-up dialog box assisted, requiring less coding by the user
...
R and Other Technologies
Although not a topic that will be covered in this book, it is of interest to note that
R is not an isolated technology, and a significant part of the R project involves
implementing methods of using R in conjunction with other technologies
...
For example, the
package ROracle provides functionality to interface R with Oracle databases,
and package XML contains tools for parsing XML and related files
...
...
These
include the ability to create and work with data objects, controlling the
workspace, importing and saving files, and how and where to get additional
help
...
R is both an
implementation of the S language that can be considered a language on its own,
and a software system
...
The S language (and R language, if you consider them distinct languages, which
is a debatable issue) was specifically designed for statistical programming
...
There is no
need to know anything about object-oriented programming, other than the
general idea of working with objects, in order to be an effective applied user of
R
...
The abstract concepts used in object-oriented languages can be confusing
...
Everything is
an object in R and using R is all about creating and manipulating objects
...
Data
objects are variable-named objects that you create to hold data in specified
forms, which are described in detail in this chapter
...
We
immediately start working with function objects in this chapter, but the next
chapter covers them in more depth, including some basics of writing your own
functions
...
In
essence, R is all about creating data objects and manipulating these data objects
using function objects
...
The form of
the data object you create depends on your data analysis needs, but R has a set
of standard data objects for your use
...
The different types of data objects
handle different modes of data (character, numeric, and logical T/F) and format
it differently
...
The second part of this
section deals with some general tasks of working with data objects that are of
general use
...
First they have a type (scalar,
vector, etc)
...
Finally, they generally are assigned a variable name (that you supply, using the
assignment operator)
...
But you should always
provide a descriptive variable name of a data object you are going to perform
further manipulations on
...
Types of Data Objects in R
Scalars
The simplest type of object is a scalar
...
To
create a scalar data object, simply assign a value to a variable using the
assignment operator “<-”
...
...
You can
manipulate scalar objects in R and perform all sorts of algebraic calculations
...
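A minimal sketch of creating and manipulating a scalar (the variable names and values here are illustrative, not from the original example):
> x<-5
> y<-2
> x*y+1
[1] 11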
Logical data can be
entered simply as T or F (no quotes)
...
> single<-'singleQuote'
> double<-"doubleQuote"
> single
[1] "singleQuote"
> double
[1] "doubleQuote"
#You will get an error if you enter character data with no quotes at all
> tryThis<-HAHA
Error: Object "HAHA" not found
The function “mode (variable name)” will tell you the mode of a variable
...
Vectors are the data objects
probably most used in R and in this book are used literally everywhere
...
...
Data values in a vector are all
the same mode, but a vector can hold data of any mode
...
Data can also be entered with the scan() function, which is a good way to enter data easily because you can paste in unformatted data values from other documents; a blank line ends the entry, and R reports how many values were read (for example, “Read 5 items”).
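A brief sketch of a scan() session (the values here are illustrative; the original session was not preserved in this extract):
> v<-scan()
1: 10 20 30
4: 40 50
6:
Read 5 items
> v
[1] 10 20 30 40 50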
Another way to make a vector is to make it out of other vectors:
> v1<-c(1,2,3)
> v2<-c(4,5,6)
You can perform all kinds of operations on vectors, a very powerful and useful
feature of R, which will be used throughout this book
If you operate on two vectors of unequal length, R recycles the values of the shorter vector. You will get a warning message, but R does let you perform the requested operation:
> x1<-c(1,2,3)
> x2<-c(3,4)
> x3<-x1+x2
Warning message:
longer object length
is not a multiple of shorter object length in: x1 + x2
> x3
[1] 4 6 6
You can also create a vector by joining existing vectors with the c () function:
> q<-c(v1,v2)
> q
[1] 1 2 3 4 5 6
In most cases character data is
used to describe the other data, and is not used in calculations
...
To store character data as
qualitative variables, a factor data type is used
...
You may create a factor by first creating a character vector, and then converting
it to a factor type using the factor () function:
> settings<-c("High","Medium","Low")
> settings<-factor(settings)
Notice that this creates “levels” based on the factor values (these are the values
of categorical variables)
...
In mathematics
matrices have many applications, and a good course in linear algebra is required
to fully appreciate the usefulness of matrices
...
Formatting data as matrices and arrays provides an
efficient data structure to perform calculations in a computationally fast and
efficient manner
...
Let’s declare a simple 2 by 2 matrix
...
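The original code for this example is not preserved here; a minimal sketch of declaring a 2 by 2 matrix with the matrix() function might look like this (the values are illustrative):
> m<-matrix(c(1,2,3,4),nrow=2,ncol=2)
> m
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Note that matrix() fills the matrix column by column by default.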
This book makes no assumption of knowledge of matrix mathematics, and when
matrices and arrays are used in applied examples, appropriate background
information will be provided
...
Specially structured matrices can also be easily created
...
Lists
Lists are the “everything” data objects
...
Lists are useful for organizing information
...
Note
that list values are indexed with double bracket sets such as [[1]] rather than
single bracket sets used by other data objects
...
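A short sketch of creating and indexing a list (the contents are illustrative, not from the original example):
> gene<-list(name="BRCA1",chromosome=17,exonCount=24)
> gene[[1]]
[1] "BRCA1"
> gene$chromosome
[1] 17
Named elements can also be accessed with the $ operator, as in the last line above.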
Data Frames
Data frames are versatile data objects you can use in R
...
Each column of the data
frame is a vector
...
However, different vectors can be of different modes
...
We will use data frames frequently in
this book
...
As an example, let’s enter some data (from www.ornl.gov) on genome size and estimated gene count for selected organisms:
> organism<-c("Human","Mouse","Fruit Fly", "Roundworm","Yeast")
> genomeSizeBP<-c(3000000000,3000000000,135600000,97000000,12100000)
> estGeneCount<-c(30000,30000,13061,19099,6034)
Now, with three vectors of equal length, we can join these in a data frame using the function data.frame(). Note that the format here is “column name”=“vector to add”, and the equals sign (not the assignment operator) is used.
...
Here, the variable names are used as column names, but
you could rename the columns with names other than the variable names if you
like
> comparativeGenomeSize<-data.frame(organism=organism,genomeSizeBP=genomeSizeBP,
+ estGeneCount=estGeneCount)
> comparativeGenomeSize
   organism genomeSizeBP estGeneCount
1     Human    3.000e+09        30000
2     Mouse    3.000e+09        30000
3 Fruit Fly    1.356e+08        13061
4 Roundworm    9.700e+07        19099
5     Yeast    1.210e+07         6034
This section discusses some common tasks to access and modify existing
data objects
...
Working with Vectors
In order to be able to work with a specific element in a data object, first you
need to be able to identify that specific element
...
Every element in a vector is assigned an index value in the order in which
elements were entered
...
To address a specific element, give the vector’s name followed by the index of that element in square brackets.
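For example (using a small illustrative vector, since the vector z used below was defined in a part of the text not preserved in this extract):
> v<-c(5,12,7,3)
> v[2]        # the second element
[1] 12
> v[c(1,4)]   # the first and fourth elements
[1] 5 3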
Two related functions are useful here: sort(z) returns the data values of z arranged in ascending order, while order(z) returns the index positions of the elements of z in that sorted order.
You may want to extract only certain data values from a vector
There are a couple of ways to do this. One is to directly identify specific elements by index and assign them to a new variable; another is logical extraction, selecting the elements that satisfy a condition.
...
Illustrating both of these:
> #extracting specific elements
> z3<-z[c(2,3)]
> z3
[1] 2 7
> #logical extraction, note syntax
> z100<-z[z>100]
> z100
[1] 2341 512 312 123
Sometimes, if you are coding a loop for example, you may need to know the
exact length of your vector
...
Working with Data Frames
Because a data frame is a set of vectors, you can use vector tricks mentioned
above to work with specific elements of each vector within the data frame
...
> x<-c(1,3,2,1)
> y<-c(2,3,4,1)
> xy<-data.frame(x,y)
To add rows or columns to an existing data frame, the functions rbind (for rows) and cbind (for columns) easily perform these tasks.
> #create and bind column z to xy
> z<-c(2,1,4,7)
> xyz<-cbind(xy,z)
> xyz
  x y z
1 1 2 2
2 3 3 1
3 2 4 4
4 1 1 7
> #create and bind new row w
> w<-c(3,4,7)
> xyz<-rbind(xyz,w)
> xyz
  x y z
1 1 2 2
2 3 3 1
3 2 4 4
4 1 1 7
5 3 4 7
There are many ways to work with data in data frames; only the basics have
been touched on here
...
...
There is a set of “is.[object type]()” functions for checking whether a data object is of a particular type or holds a particular mode of data. For example:
> x<-c(1,2,3,4)
> #checking data object type
> is.data.frame(x)
[1] FALSE
> #checking data mode
> is.character(x)
[1] FALSE
Sometimes you will want to convert a data object from one type to another. To do this R provides a series of “as.[type]()” functions.
Don’t forget to assign the new data object
to a variable (either overwriting the existing variable or creating a new one)
because otherwise the data object conversion will only be transient
...
> y<-as.matrix(x)
> y
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
You can also use the “as.[mode]()” functions to change the mode of the data in an object. For example, you may want to change a numeric vector to a character mode vector:
> z<-as.character(x)
> z
[1] "1" "2" "3" "4"
R is smart enough to try catching you if you attempt an illogical conversion, such as converting character data to numeric mode:
> words<-c("Hello", "Hi")
> words
[1] "Hello" "Hi"
> as.numeric(words)
[1] NA NA
Warning message:
NAs introduced by coercion
Missing Data
Anyone working with empirical data sooner or later deals with a data set that has
missing values
...
You
should encode missing data in R as NA and convert any data imports with
missing data in other forms to NA as well, assuming you are not using a
numerical convention (such as entering 0’s)
Computations on an object containing NA values carry the NA through. For example:
> missingData<-c(1,3,1,NA,2,1)
> missingData2<-missingData*2
> missingData2
[1] 2 6 2 NA 4 2
If you have a computation problem with an element of a data object and are not sure whether it is a missing value, the function is.na() will check for you:
> is.na(missingData[4])
[1] TRUE
Controlling the Workspace
This section describes some basic housekeeping tasks of listing, deleting, and
editing existing objects
...
Listing and Deleting Objects in Memory
When working in R and using many data objects, you may lose track of the
names of the objects you have already created
...
> ls()
[1] "q" "v1" "v2" "x1" "x2" "x3" "z"
> objects()
[1] "q" "v1" "v2" "x1" "x2" "x3" "z"
Sometimes you will want to remove specific objects from the workspace
...
> rm(q)
...
This
can be particularly helpful to edit imported files easily to correct entries or if you
have multiple data entries to edit beyond just simple editing of a particular entry
...
Figure 3-1
To use the data editor, call the data.entry() function with the object you want to edit as its argument:
> data.entry(x)
All changes made using the data editor are automatically saved when you close
the data editor
...
...
Save to File saves everything; savehistory saves the commands you have entered; and save.image saves just the objects in the workspace.
...
Save to file
Save to file is an option available under the file menu
...
This option produces a save as dialog box, making
saving the file in a specific directory easy
...
This method of saving is most familiar
and simplest, but sometimes you may not want to save everything, particularly
when you have large amounts of output and only want to save commands or
objects
You can save the commands from a session to an .Rhistory file using the savehistory function:
> x<-c(1,2,3,4,5)
> x
[1] 1 2 3 4 5
> savehistory(file="shortSession.Rhistory")
This saves an .Rhistory file in the main R directory (C:\Program Files\R\*, where * is the current version of R) unless another location is specified.
...
Figure 3-2
...
Saving the workspace can be performed at any time using the save.image() function:
> x<-c(1,2,3,4,5)
> x
[1] 1 2 3 4 5
> save.image()
The saved workspace file defaults to having no specific name (and will be overwritten the next time you save a workspace with no specific name), but you can give the file a specific name by supplying a file argument to save.image().
Note that R will automatically restore the latest saved workspace when you
restart R with the following message
[Previously saved workspace restored]
To intentionally load a previously saved workspace, use the load command (also available under the File menu as “Load Workspace”):
> load(".RData")
> x
[1] 1 2 3 4 5
Importing Files
We have seen that entering data can be done from within R by using the scan
function, by directly entering data when creating a data object, and by using the
data editor
...
It is of note here that there is a manual available for free on the R site and on the
R help menu (if manuals were installed as part of the installation) called “R Data
Import/Export” which covers in detail the functionality R has to import and
export data
...
This manual covers importing data from spreadsheets, databases, and network connections.
...
To do this, go to the File menu and choose the “Change dir” option, which produces the dialog box illustrated in Figure 3-3.
...
...
read.table()
The most convenient way to import data into R is to use the read functions, notably read.table().
This function will read in a flat data file created in ASCII text format (typically a file with the extension *.txt).
Figure 3-4 gives an example of what this format should look like in Notepad:
Figure 3-4
Using read.table() to import this file (header=TRUE indicates that the first line of the file contains column names):
> read.table("sizeTime.txt", header=TRUE)
Notably, the read functions can read in tab-delimited text files (extension *.txt) or comma-separated data files (extension *.csv).
Data from most spreadsheet programs can be saved in one or
both of these file types
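A minimal sketch of importing such files (the file names here are hypothetical; header=TRUE tells R that the first line holds column names):
> myData<-read.table("experiment.txt", header=TRUE)   # tab- or space-delimited text
> myData2<-read.csv("experiment.csv")                 # comma-separated values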
...
Package foreign
Also of note is an R package called foreign, which provides functions for reading data files created by other statistical packages (such as SAS, SPSS, Stata, and Minitab). Package foreign is available for download and installation from the CRAN site
...
Sometimes, for troublesome imports with formatting that R cannot read, using scan() or the data editor to enter your data may be a simple and easy solution
...
In particular, saving spreadsheet data in a text (comma- or tab-delimited) format is simple and useful
...
Getting Help
Virtually everything in R has some type of accessible help documentation
...
This section
gives some suggestions for where to look for help
...
This includes on-line documentation for the R
base packages, as well as on-line documentation for any loaded packages
...
For example to get help on function sum:
> help(sum)
This produces a help file as depicted in Figure 3-5
...
Figure 3-5
As an alternative to calling the help function you can just put a question mark in
front of the topic of interest:
> ?sum
If you don’t know the exact name of what you’re looking for, the apropos()
function can help
...
The result of calling apropos is that
you will get a list of everything that matches that clue, as in Figure 3-6 for
apropos(“su”), and you can then do a help search on your term of interest
...
Figure 3-6
Note that on-line help is especially useful for describing function parameters, which this book will not discuss in great detail because many functions in R have many optional parameters (read.table(), for example).
The reader should be comfortable with looking up
functions using help to get detailed parameter information to use R functions to
suit their needs
...
The R system itself
comes with several documents that are installed as part of the system
(recommended option)
...
In addition to manuals that cover R in general, most packages have their own
documents
...
These documents also generally
list the name and contact information for the author(s) of the package
Documentation for contributed packages can also be found on the CRAN site (cran.r-project.org) under the package name.
Books
There are several books written that are about or utilize R
...
These are included
in the resource section at the end of the book
...
These forums serve as “technical
support” since R is open source and has no formal support system
...
A guide to such lists is found at www.r-project.org, as depicted in Figure 3-7
...
Figure 3-7
...
This should empower the reader with a minimal set of programming
skills required to effectively use R
...
Variables and Assignment
The foundation of programming is using variables and assigning values to those
variables
...
Most of what
we did in the previous chapter involved creating data objects and assigning the
values and type of data object to a variable
...
Syntax
R is not incredibly fussy as far as syntax goes in comparison to other high-level
languages
...
As far as punctuation is
concerned, semicolons are required to separate statements if they are typed on a
single line, but are not required if statements are written on separate lines
...
> y <- 7
In R there is a great deal of room for personal style in terms of how you write
your code
...
This is
useful for making code look nice and easy to analyze
...
Another important thing to note is that R is always case sensitive
...
The
word MyVariable is not the same as myvariable, and the two would be treated
separately if used to hold variables as illustrated below:
> MyVariable<-5
> MyVariable
[1] 5
> myvariable
Error: Object "myvariable" not found
If you get the error object not found message (as above) and you are sure you
created a variable of that name, check the case (or do an ls() to check all current
objects in the workspace)
...
The usual algebraic rules
of precedence apply (multiplication and division take precedence over addition
and subtraction)
...
Do
not use bracket sets “[ ]” or “{}” as these are for other purposes in R
...
Some trivial examples are listed below:
> 2+4
[1] 6
> y<-0
> x<-4
> x*y^2
[1] 0
> x^4
[1] 256
> z<-5
> (x+y)*z/x
[1] 5
...
Table 4-3 lists the commonly used logical and
relational operators
...
Essentially flow control
of a program can be thought of as being in three layers – order (sequence of
code written), selection (use of logical and relational operators), and repetition
(or looping)
...
Table 4-3: Logical and Relational Operators
Operator   Functionality
&          And
|          Or
!          Not
==         Equal to
!=         Not equal to
<          Less than
>          Greater than
<=         Less than or equal to
>=         Greater than or equal to
...
The if statement can be used alone or with the else
statement
...
When the if statement is used alone and the condition in parentheses is not true, then nothing happens
...
Note that there is not a condition
in parenthesis after the else statement
...
if (condition is true)
then do this
else
do this
For example:
> q<-3
> t<-5
> # if else conditional statement written on one line
> if(q<t){w<-q+t} else {w<-q-t}
> w
[1] 8
Note the use of {} brackets around some of the code; the brackets group the statements belonging to each branch of the conditional. This code can also be written:
> if(q<t){
+ w<-q+t
+ }else{
+ w<-q-t}
This separates the code onto different lines, which is unnecessary for this simple
case but with longer code it becomes unwieldy to write all the code on one line
...
The logical operators, &, |, and ! can be used to add additional selection criteria
to a selection statement
...
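A small sketch of combining conditions with & in an if statement (the values are illustrative, not from the original text):
> a<-4
> b<-10
> if(a>0 & b>a){ratio<-b/a} else {ratio<-NA}
> ratio
[1] 2.5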
In R two common looping
expressions are while and for
...
while (condition controlling flow is true)
perform task
For example suppose we want to execute adding 1 to the variable x while x<=5:
> x<-0
> while(x<=5){x<-x+1}
> x
[1] 6
Note that x is 6 at the end because the condition is still true at the start of the iteration when x=5, so 1 is added one more time
...
A
counter variable (usually designated by a lowercase letter ”i”) is used to count
how many times the loop is executed:
for (i in start:finish)
execute task
As an example, if you want to initialize a vector with values you can loop
through it to assign the values:
> y<-vector(mode="numeric")
> for(i in 1:10){
+ y[i]<-i}
> y
[1] 1 2 3 4 5 6 7 8 9 10
For loops can also be nested (one inside the other) to initialize matrices or
perform similar tasks:
> z<-matrix(nrow=2,ncol=4)
> for(i in 1:2){
+ for(j in 1:4) z[i,j]<-i+j}
> z
     [,1] [,2] [,3] [,4]
[1,]    2    3    4    5
[2,]    3    4    5    6
Subsetting with Logical Operators
Although they are handy for doing simple repetitive tasks, for loops are not used
as often in R as they are in other languages and are not recommended because
they tend to be memory intensive, which can cause problems
...
Fortunately R
provides powerful alternatives to looping in the form of subsetting with logical
operators
...
...
For example:
> x <- 7 ; y <- 3
> x > y
[1] TRUE
> x < y
[1] FALSE
Logical operators applied to vectors, matrices, etc. work element-wise and return a logical vector of the same length.
For example:
> x <- 1:6
> y <- rep(4,6)
> x > y
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE
> x <= y
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
> x == y
[1] FALSE FALSE FALSE  TRUE FALSE FALSE
If we use the outcome of a logical vector statement to subset a vector, only the elements for which the outcome is TRUE will be selected, as shown below.
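For example, continuing with the vectors x and y defined above (these particular calls are an added illustration):
> x[x > y]
[1] 5 6
> x[x == y]
[1] 4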
Note that the result is a vector of elements
...
Functions are
called by another line of code that sends a request to the function to do
something.
The function call may or may not pass
arguments to the function
...
An alternative to this is to use an underscore, “functionName_function()”, which you will see as well, although this may be deprecated in the most recent versions
...
The word function is
always explicitly used to define the variable as a function object
...
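A minimal sketch of the syntax (the function and its name are illustrative, not from the original text): a function object is created by assigning the result of function() to a variable, and the last expression evaluated is returned.
> celsiusToFahrenheit<-function(tempC){
+ tempC*9/5+32}
> celsiusToFahrenheit(37)
[1] 98.6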
You should write functions when you have a repetitive task to do
...
Throughout this book functions are used extensively
...
Additional packages are
filled with additional functions, usually with related functions for related tasks
packaged together
...
Some selected mathematical functions for common
operations are listed in Table 4-2
...
Table 4-2: Selected Mathematical Functions
Function                 Operation Performed
sqrt(x)                  Square root of x
abs(x)                   Absolute value of x
sin(x), tan(x), cos(x)   Trigonometric functions
exp(x)                   Exponential of x
log(x)                   Natural logarithm of x
log10(x)                 Base 10 logarithm of x
ceiling(x)               Closest integer not less than x
floor(x)                 Closest integer not greater than x
round(x)                 Closest integer to each element of x
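A few illustrative calls to these functions (added here as an example; the outputs are standard R results):
> sqrt(16)
[1] 4
> log(100)
[1] 4.60517
> log10(100)
[1] 2
> ceiling(2.1)
[1] 3
> round(2.567)
[1] 3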
...
Chances are, if it’s a standard procedure, the
functionality has already been written
...
If you do write novel
functionality that may be of use to others, consider contributing it to the R
community as a small package
...
Let’s return to the data frame
from the last chapter with data on gene counts and genome sizes for selected
organisms:
comparativeGenomeSize
   organism genomeSizeBP estGeneCount
1     Human    3.000e+09        30000
2     Mouse    3.000e+09        30000
3 Fruit Fly    1.356e+08        13061
4 Roundworm    9.700e+07        19099
5     Yeast    1.210e+07         6034
Suppose we want to know roughly how many base pairs there are per gene in each organism. To do this, let’s write a function that takes as arguments the genome size and estimated gene count information, and calculates from this the estimated bp per gene:
> #function geneDensity simply calculates bp per gene
> geneDensity<-function(bp,genes){
+ bp/genes}
> #pass data frame columns to function geneDensity
> #storing results in variable bpPerGene
> bpPerGene<-geneDensity(comparativeGenomeSize$genomeSizeBP,
+ comparativeGenomeSize$estGeneCount)
> #result of function computation
> bpPerGene
[1] 100000.000 100000.000  10382.053   5078.800   2005.303
To round the numbers to the nearest integer we can call round, one of the simple
math functions from Table 4-2:
> bpPerGene<-round(bpPerGene)
> bpPerGene
[1] 100000 100000  10382   5079   2005
Next, to create a new data frame with all the information we can use techniques
of data frame manipulation from the previous chapter:
> comparativeGenomes<-cbind(comparativeGenomeSize,bpPerGene)
> comparativeGenomes
   organism genomeSizeBP estGeneCount bpPerGene
1     Human    3.000e+09        30000    100000
2     Mouse    3.000e+09        30000    100000
3 Fruit Fly    1.356e+08        13061     10382
4 Roundworm    9.700e+07        19099      5079
5     Yeast    1.210e+07         6034      2005
Package Creation
One of the most important features of R is that it allows the user to create
packages and contribute these to the R community for others to use
...
However,
the document “Writing R Extensions” covers the process of creating a package
(which is not technically difficult)
...
...
This is an overview chapter that presents
basics for topics that will be built on throughout this book
...
Much of this material should be familiar to anyone who studied introductory
statistics
...
Notice that the function name is usually what you would expect it to be, as R is designed to be rather intuitive
...
For functions not listed in table 5-1, try help and the expected name or using the
apropos() function with part of the expected name to find the exact function call
...
Most of the functions in table 5-1 do
not have additional parameters, and will work for a data vector, matrix or data
frame
...
Table 5-1: Some Data Summary Functions
Function name   Task performed
sum(x)          Sums the elements in x
prod(x)         Product of the elements in x
max(x)          Maximum element in x
min(x)          Minimum element in x
range(x)        Range (min to max) of the elements in x
length(x)       Number of elements in x
mean(x)         Mean (average value) of the elements in x
...
> x<-c(0.2,0.12,0.12,0.13,0.12,0.19, ...)
> sum(x)
[1] 2.34
> prod(x)
[1] 2.360122e-09
> max(x)
[1] 0.5
> range(x)
[1] 0.12 0.50
> length(x)
[1] 12
> mean(x)
[1] 0.195
> var(x)
[1] 0.01313333
> sd(x)
[1] 0.1146008
For example, the standard deviation (supposing we didn’t know it
had its own function) is the square root of the variance and can be calculated as:
> sd<-var(x)^0.5
> sd
[1] 0.1146008
The summary() Function
Suppose we have an enzyme that breaks up protein chains, which we’ll call
ChopAse
...
We do some assays on the same 200 amino acid protein chain and get the
following results for digestion time based on variety:
> ChopAse
varietyA timeA varietyB timeB varietyC timeC
  (six rows of assay results; the individual digestion times fall between roughly 0.11 and 0.15 and are summarized below)
Based on this data and the way it is presented, it is difficult to determine any useful information about the three varieties of ChopAse and how they differ in activity
...
The summary function simultaneously calls many of the descriptive functions
listed in Table 5-1, and can be very useful when working with large datasets in
data frames to present quickly some basic descriptive statistics, as in the
ChopAse example:
> summary(ChopAse)
 varietyA     timeA        varietyB     timeB         varietyC     timeC
 a:6      Min.   :0.120   b:6      Min.   :0.1200   c:6      Min.   :0.1100
          1st Qu.:0.120            1st Qu.:0.1300            1st Qu.:0.1200
          Median :0.125            Median :0.1333            Median :0.1233
          3rd Qu.:0.130            3rd Qu.:0.1375            3rd Qu.:0.1300
          Max.   :0.130            Max.   :0.1500            Max.   :0.1300
This gives some quick quantitative information about the dataset without having
to break up the data frame or do multiple function calls
...
For example, the typical digestion times appear to differ somewhat among the three varieties. This may be statistically significant, and the observation can be utilized for statistical testing of differences (such as those covered in Chapter 13)
...
Working with Graphics
Of course, it is often most interesting to present data being explored in a
graphical format
...
This
section focuses on understanding the graphics environment in R and how to
control features of the graphics environment
...
Graphics Technology in R
At startup, R initiates a graphics device driver
...
To open this window without
calling graphics, use the function call windows() (for Windows operating
systems, it is x11() on Unix, and macintosh() on Mac O/S)
...
Plotting commands fall into three groups: high-level plotting functions, low-level plotting functions, and graphical parameter functions. Let’s look at each of these types of plotting commands
...
The
simplest high level plotting function is plot(), as illustrated below
> x<-c(1,2,3,4,5,6,7,8)
> y<-c(1,2,3,4,5,6,7,8)
> plot(x,y)
This produces the simple graph in Figure 5-1
...
Plot is a generic function and
has a diverse set of optional arguments
...
Under the usage section is the list
of the function and most of the parameters, which are described in more detail in
the arguments section of help below
...
These parameter arguments are pretty standard for most of the high
level plotting functions in R
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
It also
changes the type of plot from the default of points, to both points and lines
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
ts.plot(x)      Plot of x with respect to time (the index values of the vector unless otherwise specified)
contour(x,y,z)  Contour plot of vectors x and y; z must be a matrix with rows corresponding to x and columns to y
image(x,y,z)    Same as a contour plot but uses colors instead of lines
persp(x,y,z)    3-d contour plot
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Low-Level Plotting Functions
Low-level plotting functions add additional information to an existing plot, such
as adding lines to an existing graph
...
For
example, adding titles can be done as arguments of a high-level function
(main=””, etc) or as a low-level plotting function (title(main=””), etc)
...
We could add to this plot in several
ways using low-level plotting functions
...
> text(4,6,label="Slope=1")
> title("X vs Y")
> lines(x,y)
This embellishes the plot in Figure 5-1 to become the plot in Figure 5-4
...
Note that most of these
work not just with plot () but with the other high-level plotting functions as well
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
title("")                Adds a main title to the plot; additional arguments can add subtitles
rug(x)                   Draws the data on the x-axis with small vertical lines
rect(x0,y0,x1,y1)        Draws a rectangle with specified limits (good for pointing out a certain region of the plot)
legend(x,y,legend=,...)  Adds a legend at coordinate x,y; see help(legend) for further details
axis()                   Adds an additional axis to the current plot
Graphical Parameter Functions
Graphical parameter functions can be categorized into two types: those that
control the graphics window and those that fine-tune the appearance of graphics
with colors, text, fonts, etc
...
par() is a very important graphics function, and it is well worth the time to read
the help document for par, pictured in Figure 5-5
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
You can place more than one plot in a graphics window with either the mfrow or mfcol parameter of the par function
...
Mfrow draws the plots in row order (row 1 column 1,
row 1 column 2, etc) whereas mfcol draws plots in column order (row 1 column
1, row 2 column 1)
...
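A brief sketch (the data here is illustrative, not from the original example): the following fills a window with two plots side by side, in row order.
> x<-1:10
> par(mfrow=c(1,2))      # one row, two columns of plots
> plot(x,x^2,main="Quadratic")
> plot(x,log(x),main="Logarithmic")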
Table 5-4 lists some of the basic graphical parameters
...
Let’s look at an example that utilizes some of par’s functionality using data from
NCBI on numbers of base pairs and sequences by year
...
Table 5-4: Selected Graphical Parameters
Parameter   Specification
bg          Specifies the (graphics window) background color
col         Controls the color of symbols, axes, titles, etc. (col.axis, col.lab, col.main, ...)
cex         Controls the size of text and symbols (cex.axis, cex.lab, cex.main, ...)

> #Set axis and label text sizes
> par(cex.axis=0.8,cex.lab=1)
> #Plot base pair data by year
> plot(NCBIdata$Year,MBP,xlab="Year",ylab="BP in Millions",
+ main="Base Pairs by Year")
> #Add line to plot, color blue
> lines(NCBIdata$Year,MBP,col="Blue")
> #Similarly, plot sequence data and add a line
> plot(NCBIdata$Year,ThousSeq,xlab="Year",ylab="Seq. in Thousands",
+ main="Sequences by Year")
> lines(NCBIdata$Year,ThousSeq,col="Blue")
> #Code for Figure 5-7
> par(mfcol=c(2,1),cex.axis=0.6,cex.lab=0.8)
> barplot(NCBIdata$BasePairs,names.arg=NCBIdata$Year,
+ col="grey",xlab="Years",ylab="Base Pairs",main="Base Pairs by Year")
> barplot(NCBIdata$Sequences,names.arg=NCBIdata$Year,
+ col="grey",xlab="Years",ylab="Sequences",main="Sequences by Year")
Figure 5-7
...
Most of
the graphical examples in this book – and there are many of them - will use the
simplest plotting code possible to illustrate examples, since our focus is on
understanding techniques of data analysis
...
R allows for the user to code virtually
every detail of a graph
...
With a little practice, you can code R to produce custom, publication quality
graphics to effectively illustrate almost any data analysis result
...
On the File menu there are many options for saving a
graphic, including different graphical formats (png, bmp, jpg) and other formats
(metafile, postscript,pdf)
...
Figure 5-8
Another option to save a graphic is to simply right mouse click on the graphic,
which will produce a pop up menu with options to copy or save the graphic in
various formats as well as to directly print the graphic
...
In the Windows environment the copy options are as a bitmap or metafile, and the save options are as metafile or postscript
...
This section presents some
selected graphics packages that may be of interest, all of which should be
available from the CRAN site
...
MIM
is a software package, which is useful for creating graphical models, including
directed and chain graphs
...
We will see in coming
chapters that these types of models, which are popular among computer
scientists, are useful in advanced statistical modeling, such as Bayesian statistics
and Markov Chain Monte Carlo modeling
...
It is also relatively simple for the
beginner to use and contains one function scatterplot3d() with many flexible
parameter options which create many different 3d plots, such as the demo plot in
Figure 5-9
...
Figure 5-9
grid
Grid is a sophisticated and very useful R package, which provides a “rewrite of
the graphics layout capabilities” (from package description on CRAN site)
...
Some added functionality available with this package includes:
allowing interactive editing of graphs, improving axis labeling capabilities, and
primitive drawing capabilities
...
This type of graphics is very
useful for applications in multivariate statistics as they allow for presenting
many graphs together using common x- and y-axis scales which is a visually
effective way for doing comparisons between groups or subgroups
...
Figure 5-10
presents one of the demonstrations of a type of plot available in the lattice
package
...
Figure 5-10
...
Probability is the very guide of life
Everything existing in the universe is the fruit of chance
...
Probability focuses on the description
and quantification of chance or random events
...
This chapter introduces some fundamentals of probability theory, beginning
with an overview of the two schools of thought concerning how to think about
probability
...
The contents of this chapter
serve as an important foundation for subsequent chapters
...
However it is important to note
early on that there are two major approaches to understanding probability and
statistics
...
...
In this view the probability of an event is
defined as the limiting proportion of times the event would occur given many
repetitions
...
Application of the frequentist approach is limited to scenarios where frequent
repetitions of the same random experiment are possible
...
The second way of thinking about probability allows the incorporation of an
investigator’s intuitive reasoning
...
These
so-called prior probabilities are then updated in a rational way after data are
collected
...
Bayes never published his work in his lifetime, but other mathematicians
followed up on his work, resulting in the publication and promotion of his ideas
...
Until
recently Bayesian statistics was largely theoretical and mathematically complex
...
Many of these algorithms and models have applications in bioinformatics, and
some of these will be introduced in this book
...
Understanding both, classical (or frequentist) and Bayesian statistics requires
knowledge of the theory and methods of probability and probability models
...
This chapter reviews the essentials of probability and serves as a conceptual
foundation for much of the material in the subsequent chapters of this section
and for various other chapters in this book
...
Chapter 9
specifically introduces Bayesian theory and approaches to modeling
...
For a more rigorous treatment, the reader is referred to a text on probability such as Sheldon Ross’s “A First Course in Probability” or other recommended references listed in the appendix
...
A keyword here is random
...
Randomness is an essential concept in most of
statistics
...
To analyze an experiment in terms of probability, first the set of outcomes of a
random experiment must be defined
...
This is best explained by an example
...
The set of all possible
outcomes is defined by the set of numbers that represent the number of dots on
the face of the die that turns up as a result of the die roll
...
This set, containing all possible
outcomes of the experiment, is known as the sample space
...
It is to events that probabilities are
assigned
...
An event
may be a single outcome, or a subset containing multiple outcomes
...
Because in this
example all possible outcomes are equally likely and there are six possibilities,
the probability for this event is 1/6
...
In this
case the probability of the event is ½
...
These basic
operations provide a logical framework for understanding how events relate in
the sample space
...
Set theory makes use of Venn diagrams, as depicted in Figure 6-1
...
...
In casual wording the union can be referred
to as “A or B occurs” or “at least one of the events A and B occurs”
...
The union of A and B is
depicted in Figure 6-2
...
In casual wording
an intersection can be referred to as “both events A and B occur”
...
Later in chapter 8
we will discuss joint probabilities and joint distributions where the concept of
intersection plays a key role
...
...
The complement of A consists of all
outcomes in the sample space that are not in A, as illustrated in Figure 6-4
...
Figure 6-4: Venn Diagram of the Complement of an Event
Disjoint Events
Two events, A and B, in the same sample space, are disjoint if their intersection
contains no elements
...
To put the definition in
plain English, disjoint events are events that have nothing in common with each
other
...
As a cautionary note, although the
terms disjoint and mutually exclusive mean the same thing, these terms should
not be interchanged with the term independence, which has a completely
different meaning, to be discussed in chapter 8
...
Figure 6-5: The Intersection of Disjoint Events is the Null Set (A ∩ B = ∅)
The Fundamental Rules of Probability
Some simple rules (axioms) are used in probability to guarantee a consistent
notion of how probability represents the chance of events related to random
experiments
...
Rule 1: Probability is always positive
Probability is always positive and the concept of negative probability does not
exist
...
For an event to have probability equal to zero, it means the event is impossible
...
Rule 2: For a given sample space, the sum of
probabilities is 1
For a simple example, consider the sample space of the experiment of picking a
nucleotide at random from the four possible nucleotides
...
Thus, the
sum of all probabilities for this sample space is one
...
For each event in a sample space where the outcomes have equal probability of
occurring, the probability of an event equals the frequency of the event over the
total events in the sample space
...
But the sum of events in the sample space is still 1; the probabilities of
individual nucleotides are changed to reflect the composition of events in the
sample space
...
Suppose in the 4-nucleotide example that only P (A) is known and we want to
calculate the probability of any other nucleotide (which is the event of the
complement of A)
If P(A) = 0.25, then P(Ac) = 1 − P(A) = 1 − 0.25 = 0.75.
Figure 6-8: The Complement of A is Everything in the Sample Space Except the Event A
Rule 3: For disjoint (mutually exclusive) events, P(A ∪ B) = P(A) + P(B)
In the case of disjoint events, there is no intersection of events, so the probability
of their unions is simply the sum of their probabilities
...
However, many sample spaces consist of many possible outcomes, say events E1, E2, E3, ..., En. In this case, if the outcomes are disjoint events, then the probability of their union can be represented by

P(E1 ∪ E2 ∪ E3 ∪ ... ∪ En) = P(E1) + P(E2) + P(E3) + ... + P(En)

Or, in shorthand notation,

P(∪ Ei) = Σ P(Ei), where the union and the sum run over i = 1, ..., n.
For events that are not disjoint, the calculation of P(A ∪ B) must take into account the intersection of the events, and is calculated as P(A ∪ B) = P(A) + P(B) − P(A ∩ B), subtracting the intersection (which otherwise would be double counted)
...
Counting
It is also helpful to review here the methods of counting, and how R can be used
as a tool for helping out with count calculations
...
The mathematical theory of
counting is called combinatorial analysis
...
For example, suppose the number of alleles possible for gene A is 3
...
And
suppose the number of alleles for gene B is 5 (outcomes of experiment 2)
Together, then, there are 3*5 = 15 possible combinations of one allele of gene A with one allele of gene B.
The fundamental principle of counting can be applied to multiple experiments
by extension of the two-experiment scenario
...
If experiment 1 has n1 possible outcomes, experiment 2 has n2, and so on up to experiment k with nk outcomes (n1, n2, n3, ..., nk), then together the k experiments have a total of n1*n2*n3*...*nk possible outcomes
...
Suppose for protein 1 there are 4 genetic alleles, for protein 2
there are 2 genetic alleles, for protein 3 there are 9 genetic alleles, for protein 4
there are 11 alleles, and for protein 5 there are 6 alleles
...
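The total number of combinations (one allele per protein) follows directly from the fundamental principle of counting; the computation below is added here as an illustration, using R as a calculator:
> 4*2*9*11*6
[1] 4752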
Permutations (Order is Important)
When picking distinct objects and arranging them there are two scenarios to
consider: order important and order unimportant
...
We can pick three letters and count the number of unique
ways we can arrange them (order important)
...
There are a
total of 3!=3*2*1=6 possible permutations of arranging three distinct letters in
groups of 3
...
Mathematically the formula for a permutation or an arrangement of r
out of n distinct objects (order is important) is:
Pn,r = n! / (n − r)!
Combinations (Order is Not Important)
What about the case where order is not important? For example, in the case of
the letters a, b, c what if all we want to know is how many ways we can select 2
out of 3 letters, regardless of order? In this case the answer is 3 because it
doesn’t matter whether the order of the letters is ab or ba or any of the other
combinations of two letters
...
Mathematically the formula for a combination of
selecting r out of n distinct objects (order unimportant) is:
Cn,r = n! / (r! (n − r)!)
...
Counting in R
In R, combinations can be calculated using the choose () function
...
choose(n,r), with parameters n and r, calculates Cn,r
...
This makes factorial calculations in R not so simple as coding
“n!”
...
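For example (these particular calls are added for illustration), choosing 2 letters out of 3, or 8 amino acids out of 20, with order unimportant:
> choose(3,2)
[1] 3
> choose(20,8)
[1] 125970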
The gamma function is a mathematical relationship
defined by the following formula:
Γ(x) = ∫₀^∞ t^(x−1) e^(−t) dt,   for x > 0
The Greek letter capital gamma is used in Γ(x )
...
However, the gamma function has a nice identity that can be evaluated for any positive integer n:
Γ(n) = (n − 1)!
In other words, gamma of n is equal to (n-1) factorial
...
For example, to calculate 4!, which is equal to 4*3*2*1, you can use gamma(5) in R, as below:
> gamma(5)
[1] 24
Gamma functions also handle larger factorial-based calculations. For example, to calculate the number of unique 8-mer peptide arrangements taken from the 20 amino acids (order is important here, so it is a permutation), simply translate the permutation formula into an equation of gamma functions as follows:
Pn,r = n!/(n − r)! = 20!/(20 − 8)! = 20!/12! = Γ(21)/Γ(13)
In R this can simply be calculated as:
> gamma(21)/gamma(13)
[1] 5079110400
Thus there are 5,079,110,400 possible permutations of 8 unique amino acids.
The gamma function is involved in several continuous probability distributions (discussed in the next chapter), including the gamma distribution, the beta distribution and the Chi-Square distribution.
Modeling Probability
Statistics in its mathematical formulation makes extensive use of models to represent relationships between variables. Perhaps the most familiar simple mathematical model is the linear relationship between an independent variable x and the dependent variable y, commonly written y = mx + b. Another way to represent this is to consider y as a function of x, written f(x), so that:
f(x) = mx + b
The outcomes of an experiment are also modeled mathematically, in a probability model. A probability model is a mathematical description of the distribution of data that are the outcomes of an experiment. The remainder of this chapter is devoted to several concepts important in understanding probability models: random variables, discrete versus continuous variables, probability distributions and densities, parameters, and univariate versus multivariate statistics.
To model these outcomes using mathematical functions, we use variables called "random variables". For example, consider the RNA sequence below:
AUGCUUCGAAUGCUGUAUGAUGUC
In this sequence there are 5 A's, 9 U's, 6 G's, and 4 C's, with a total of 24 residues. Because there are advantages to working with quantitative information, when the data are described qualitatively a random variable is used to assign a number to the non-numerical outcomes. By convention a capital letter, such as X, represents the random variable itself. A small letter represents an outcome of the random variable, so little x can be used here. Table 6-1 shows one way of assigning values of X to the residues in this sequence.
Table 6-1: Using Random Variable X to Quantitatively Model Residues in a Particular RNA Sequence

Residue   Value of X (=x)   P(X=x)
A         0                 5/24 = 0.208
C         1                 4/24 = 0.167
G         2                 6/24 = 0.25
U         3                 9/24 = 0.375
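As an added illustration (not in the original text), these proportions can be computed directly from the sequence in R:
> rna <- "AUGCUUCGAAUGCUGUAUGAUGUC"
> residues <- strsplit(rna, "")[[1]]
> table(residues) / length(residues)   # A 0.208, C 0.167, G 0.25, U 0.375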
If the experiment is to count the frequency of each nucleotide in another sequence of RNA, the values of the random variable will be the same, but the probabilities of the random variable assuming each value will be slightly different, reflecting the different trials of the experiment. Probability models simply use random variables to represent the outcome of an experiment, whether it is a simple experiment (as above) or a much more complicated experiment with many outcomes.
Discrete versus Continuous Random Variables
Random variables can be discrete or continuous. A discrete random variable can take on only a countable number of distinct values. Although many discrete random variables define sample spaces with finite numbers of outcomes, countable does not mean finite. Examples of experimental outcomes that are modeled with discrete random variables include the number of people standing in a line, the number of A's in a nucleotide sequence, and the number of mutations that occur during a certain time interval. A continuous random variable, in contrast, can take on any value in an interval of real numbers. Measurable quantities are generally modeled with continuous random variables. Because much of the information bioinformatics deals with is discrete data (sequence information is usually analyzed using discrete random variables), the emphasis of this book is on this type of data.
Probability Distributions and Densities
Now, with an understanding of the concept of a random variable, whether discrete or continuous, we can talk about probability models. A probability model fits the data and describes it. Often the data are fit to a distribution of known form (to be discussed in the next two chapters), such as a beta or gamma distribution; other times, in more complex scenarios, the data are fit to a distribution that is a mixture of known forms. In either case, the way probability is distributed over the values of the random variable is described by a mathematical function. This function is called a probability mass function in the case of a discrete random variable, or a probability density function in the case of a continuous random variable. In addition, all random variables (discrete and continuous) have a cumulative distribution function, or CDF.
In R a simple histogram (shown in Figure 6-9) can be used to model the probability distribution function for this example.
Figure 6-9: Probability distribution for the RNA residue example (y axis: Probability)
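A minimal sketch of code that could produce such a plot (assuming the probabilities from Table 6-1; an added illustration, not necessarily the author's original call):
> x <- c(0, 1, 2, 3)
> Prob <- c(0.208, 0.167, 0.25, 0.375)
> barplot(Prob, names.arg = x, xlab = "X", ylab = "Probability")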
The CDF gives the probability that the random variable takes a value less than or equal to x, that is, F(x) = P(X ≤ x). For example, if X equals 2, the CDF is the probability that X=2 or X=1 or X=0. For our RNA residue example, the calculations for the CDF are shown in Table 6-2.

Table 6-2: CDF for the RNA Residue Example

x    P(X=x)   F(x) = P(X ≤ x)
0    0.208    0.208
1    0.167    0.375
2    0.25     0.625
3    0.375    1
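As a quick check (an added illustration), the CDF column can be computed in R with cumsum():
> cumsum(c(0.208, 0.167, 0.25, 0.375))
[1] 0.208 0.375 0.625 1.000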
The CDF can be plotted in R as a step function; Figure 6-10 shows the plot produced by the following code:
> X<-c(0,1,2,3)
> CumProb<-c(0.208, 0.375, 0.625, 1)
> plot(X,CumProb,xlim=range(0,1,2,3,4), main="RNA Residue Analysis CDF",
xlab="X=", type="S")
Figure 6-10: RNA Residue Analysis CDF
In the case of a continuous random variable, the model for how probability is distributed is called a probability density function and is denoted by f(x). Probability is assigned to intervals of values rather than to individual values; there is no assigned probability that X=1 when X is a continuous random variable. Using a little calculus, the probability for a continuous random variable is:
P(a < X < b) = ∫ₐᵇ f(x) dx
This is also the "area under the curve" on the interval [a, b], which is part of the reason why the function is called a "density" function rather than a distribution function, as in the discrete case.
Continuous random variables also have a cumulative distribution function, F(x) = P(X ≤ x); here the CDF is a smooth curve (rather than the discrete case's step), and F(x) accumulates probability over the interval from negative infinity (or wherever x is defined) to the value of x. A CDF can also be built directly from data. For calculating the empirical CDF from n data values, we assign a probability of 1/n to each outcome and then plot the CDF of this set of probabilities. R includes functions that will easily generate an empirical CDF given any data vector, and also functions to create CDF plots easily. The data can simply be entered and the step-function plot generated, as depicted in Figure 6-11.
> x<-c(2,4,2,1,3,4,2,1,3,5)
Figure 6-11: Empirical CDF of the example data (y axis: f(x))
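One way to produce such a plot (a sketch using the ecdf() function from the stats package; not necessarily the author's exact call):
> x <- c(2,4,2,1,3,4,2,1,3,5)
> plot(ecdf(x), main = "Empirical CDF")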
Random variables define the data in a probability model; parameters mathematically frame how a probability model fits. Standard probability distributions come in families, and parameters represent characteristics of the model they are used in. Parameters are usually classified as different types, such as shape, scale and location, discussed below in more detail; in some cases parameters serve hybrid functionality. Changing the value of a parameter while utilizing the same mathematical model for the distribution will form different family members with distinctly appearing distributions. The idea is that you want a distribution that fits your data "just right", without a fit that is "overfit".
Shape Parameters
To look at shape parameters, it is best to illustrate with an example. Figure 6-12 shows several plots based on the gamma distribution (covered in the next chapter). The same data is used but modeled using different values for the shape parameter of this distribution (with all other parameters held constant). Families of distributions such as the gamma family are particularly useful in modeling applications, since they are flexible enough to model a variety of data sets by using different parameter values.
Figure 6-12: Altering a Shape Parameter with Other Parameters Held Constant
Scale Parameters
A second type of parameter is a scale parameter, which controls how stretched or compressed the distribution is along the x axis. As an illustration, the plot in Figure 6-13 is also of the gamma family of distributions and uses the same data as the prior illustration, this time varying the scale parameter. Although it may not initially appear so, the plots are the same shapes.
Location Parameters
A third type of parameter is the location parameter, which specifies where the graph lies along the horizontal (x) axis. To illustrate this, perhaps the easiest example uses the normal distribution, or the familiar bell curve, which has several nice properties. One of those nice properties is that the mean of the distribution corresponds to the location parameter (and the standard deviation corresponds to the scale parameter). Shifting the location parameter of the normal distribution shifts where the center of the distribution lies, as Figure 6-14 illustrates. All other parameters are constant.
Figure 6-14: The Effect of Changing the Location Parameter
Even before specific models are introduced, you may be wondering how to choose a model to use and how to pick values of parameters. As models are introduced it will become more apparent which models are used for which types of data. Sometimes the parameters can be determined from the data, as is the case for the mean and standard deviation of normally distributed data. Other times more sophisticated statistical methods are used, the techniques of which are beyond the scope of this book, and consultation with a statistician is advised.
Univariate versus Multivariate Models
So far we have considered one random variable at a time, but often many random variables are measured at the same time. To compare the two situations, suppose you are examining the annealing temperature of a DNA PCR primer. Modeling the annealing temperature alone is a univariate problem; modeling it jointly with other measured variables is a multivariate problem. Chapter 7 introduces some standard univariate models, and selected multivariate models are introduced in Chapter 8.
7
Univariate Distributions
Different random variables have different probability distributions. This chapter discusses some of the common discrete and continuous univariate distributions often encountered in modeling data in biological applications. Presented first are two discrete models frequently used in analyzing biological data: the binomial model and the related Poisson model.
A Bernoulli random variable arises in an experiment where there are only two outcomes, generally referred to as "success" and "failure". The probability of success is a value p, a proportion between 0 and 1, and the probability of failure is 1 − p. The probability distribution function of a Bernoulli random variable X is
P(X = x) = p^x (1 − p)^(1−x),  x = 0, 1
Let's consider the case of having a child and use a Bernoulli random variable to represent whether the child has blue eyes. Suppose the probability that the child has blue eyes is 0.16 (not empirically verified!), and this is the "success" outcome. Table 7-1 summarizes the distribution.
Table 7-1: Distribution of Outcomes of a Bernoulli Trial

Outcome         Random Variable X = x   P(X=x)   Probability of outcome
Blue eyes       1                       p        0.16
Not blue eyes   0                       1 − p    0.84
What about if you really want a blue-eyed child, so you have 10 children, and you want to know the probability that k out of the 10 have blue eyes? This is a more complicated question. To answer it we assume that the children's eye colors are independent of one another, each with the same probability of blue eyes, p = 0.16. Independence will be discussed in more detail in the next chapter. Using the Bernoulli probability distribution function equation above, we can extend it to work for more than one trial by changing the exponents to n = the number of trials and k = the number of successes, as follows:
p^k (1 − p)^(n−k)
Note this is not a probability distribution function anymore, as it only models the probability of a particular sequence of n=10 children, k of which have blue eyes. But when there are 10 children, the blue-eyed children can be any k of the 10 children.
To count the number of ways k successes can be arranged among n trials, we use the combination formula from earlier:
Cn,k = n!/(k!(n − k)!)
Remember that for a series of 10 children, one blue-eyed child could be positioned in 10 different ways (the blue-eyed child could be first of 10, second of 10, etc.). This symbol is commonly described as "n choose k" and is also called the "binomial coefficient" in some contexts.
Multiplying the number of arrangements by the probability of each particular arrangement gives the binomial distribution, P(X = k) = Cn,k p^k (1 − p)^(n−k). This is tedious to do by hand, and hence using R comes in handy. In R, all probability distributions (or densities in the case of continuous random variables) use the letter d as the first letter of the function name, followed by part or all of the name of the distribution; for the binomial the function is dbinom. Using some simple commands in R we can generate the probability values for the binomial distribution of the number of children in 10 with blue eyes, using p = 0.16:
> x<-0:10
> y<-dbinom(x,10,0.16)
> data.frame("Prob"=y,row.names=x)
           Prob
0  1.749012e-01
1  3.331452e-01
2  2.855530e-01
3  1.450428e-01
4  4.834760e-02
5  1.105088e-02
6  1.754108e-03
7  1.909233e-04
8  1.363738e-05
9  5.772436e-07
10 1.099512e-08
Thus, given p=0.16, the probability of zero children with blue eyes out of 10 children is 0.175; the probability of one child with blue eyes is 0.333; and so on.
Of course, writing out tables of probability as above is only practical for simple
scenarios and in most cases a graphical model for the probability distribution
will be used
...
16),,type='h',xlab="",ylab="Probability",
sub="Number of kids with blue eyes")
Figure 7-1 illustrates the graphic model of the probability distribution function
for this example
...
16
Note that this distribution is pretty skewed toward lower values of the random
variable (X=number of kids with blue eyes) because the value of p is 0
...
The
graph should seem reasonable given the value of p
...
83
Let’s re-run this example with probabilities that a child has blue eyes is 0
...
2, 0
...
8 and see how this changes the distribution
...
05),type='h',xlab="",ylab="Prob", sub="p=0
...
2),type='h',xlab="",ylab="Prob", sub="p=0
...
5),type='h',xlab="",ylab="Prob", sub="p=0
...
8),type='h',xlab="",ylab="Prob", sub="p=0
...
This should make sense because a higher p
makes it more likely a child will have blue eyes and therefore more children in
10 will have blue eyes, as represented by the shift of the graphical models with
higher p
...
5 that the distribution is symmetric
...
5 since it equally likely that success
or failure occurs
...
This is easy to do in R using the pbinom distribution function, which takes the
same parameters as the dbinom
...
84
type of the plot to ‘s’ for step and change the function used from dbinom to
pbinom:
>
>
>
>
>
par(mfrow=c(2,2))
plot(0:10,pbinom(0:10,10,0
...
05")
plot(0:10,pbinom(0:10,10,0
...
2")
plot(0:10,pbinom(0:10,10,0
...
5")
plot(0:10,pbinom(0:10,10,0
...
8")
Figure 7-3: Binomial CDFs Using Different Values of p
The pattern of cumulative probability for binomials produced using different
values of p is illustrated in Figure 7-3
...
05) the
cumulative probability reaches 1 quickly whereas a large value of p results in
the cumulative probability not reaching 1 until the higher range of values for the
random variable
...
For example,
suppose you want to know the probability that (exactly) 4 kids in 10 will have
blue eyes when p=0
...
Simply use the dbinom function in R as follows and it
calculates this value for you:
> dbinom(4,10,0
...
2050781
Thus, the chance of 4 in 10 kids with blue eyes is 0
...
5% with p=0
...
Several other distributions are related to the binomial and also
based on Bernoulli random variables (success or failure experiments)
...
The negative binomial
distribution (dnbinom, pnbinom in R) considers X as a measure of the number of
failures before the rth success
...
The example
used with the binomial could easily have been modeled using one of the related
distributions
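For instance (an added sketch, not from the original text), the probability that the first blue-eyed child arrives only after two children without blue eyes can be computed from the negative binomial with r = 1:
> dnbinom(2, size = 1, prob = 0.16)
[1] 0.112896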
...
In many cases when you
have data to model you have some choices how to model it based on choice of
random variable measurement outcome and choice of distribution used
...
Poisson
...
Mathematically the Poisson is
actually a limiting case of the binomial, the details of which will not be dealt
with here but can be found in most mathematical probability books
...
In general, the Poisson is used to model the counts of events occurring randomly
in space or time
...
In bioinformatics some examples
where the Poisson could be used include: to model the instances of mutation or
recombination in a genetic sequence, the distribution of errors produced in a
sequencing process, the probability of random sequence matches, or in counting
occurrences of rare DNA patterns
...
Note also that in
this equation, there is one parameter, the Greek letter lambda λ that defines the
distribution
...
Thus, lambda represents a rate
...
The relation holds when p is small (rare events) and the number of trials n are
large
...
86
Suppose we are using a new sequencing technique and the error rate is one
mistake per 10,000 base pairs
...
What is the probability of 0 mistakes using this technique?
Of 1 mistake, 4 mistakes? The Poisson model can be used to model these
probabilities
...
In this case
the “trial” can be considered an individual base pair, so n=2000 trials for a 2000
base pair sequence
...
To calculate lambda, multiply n*p, or
2000*(10,000) which results in a rate of 0
...
2
...
If we were using
5000 base pair regions to sequence at a time we would use a lambda of 0
...
To calculate the probability of one mistake in the 2000 base pair sequence, we
could do this by hand with the following equation:
p(X = 1) =
e −0
...
21
=0
...
In R the
dpois function is used to compute Poisson distributions, and has parameters
dpois (x, lambda) where x is the value or vector of values of the random variable
to be calculated and lambda is the parameter
...
We have a little bit of a problem in
that in this case, theoretically X can be anywhere from 0 (no sequence errors) to
2000 (every bp an error)
...
2 (also the mean or
expected number of errors) the number of sequence errors is not likely to exceed
10, so the following code is used to generate the table:
> x<-0:10
> y<-dpois(0:10,0
...
frame("Prob"=y,row
...
187308e-01
1 1
...
637462e-02
3 1
...
458205e-05
5 2
...
277607e-08
7 2
...
198290e-11
9 1
...
310351e-14
818, the chance of 1 error (X=1) is 0
...
in a 2000 base pair sequence
...
This can be viewed graphically as
illustrated in Figure 7-4
...
2), type='h', xlab="Sequence Errors",
ylab="Probability" )
Figure 7-4: Poisson Distribution, Lambda =0
...
The cumulative distribution in this case may be a bit more interesting
...
> plot(0:10,ppois(0:10,0
...
2
...
88
As is clear from the CDF graph in Figure 7-5, the number of sequence errors
using this method in a 2000 base pair sequence is highly unlikely to be more
than 3
...
Below R is used to find the probability of 1 or fewer, 2 or fewer and 3 or fewer
errors in this example
...
2)
[1] 0
...
2)
[1] 0
...
2)
[1] 0
...
5),xlab="",ylab="Prob",type='h',main="Lambda 0
...
Considering the relationship λ=n*p this should come as no surprise
...
89
that the plots above only consider X=x for the range of [0, 10] so in the case of
λ=5 there is more of the distribution shifted to the right
...
Let’s do this
with the range X=x from 0 to 20 for all the values of lambda used in the
previous plot
...
5),xlab="",ylab="Cum Prob", type='h', main="Lambda
0
...
Be
careful though– although the graph for lambda=5 appear to level off at 1 around
x=10 there is still some significant probability of obtaining a value of X higher
than 10, which can be analyzed by doing some additional calculations in R as is
done below:
> ppois(10,5)
[1] 0
...
9979811
> ppois(15,5)
[1] 0
...
90
This concludes discussion, for now, of discrete univariate probability
distributions
...
These distributions will be used in applications in later
chapters
...
The
normal is also the distribution that is used to model the distribution of data that
is sampled, as will be discussed later in this book under the topic of inferential
statistics
...
It is a ritual that all introductory statistics students are
saturated with details about the normal distribution, far more than will be
covered here
...
Mu and sigma serve as the parameters for the distribution
...
However, this is not necessarily true for
other distributions
...
One of the tricks with the normal distribution is that it is easily standardized to a
standard scale
...
This is useful if you have a bunch of different
X’s and want to put them all on the same Z system so you can compare them,
with a scoring system called Z-scores (see your favorite introductory statistics
book for further discussion)
...
91
R contains functionality for both the probability density function and cumulative
distribution function for the normal model in the base package
...
The parameters of dnorm are the data vector of
x’s, the mean and standard deviation
...
But what happens if
we change the scale parameter? Remember that increasing the scale parameter,
which is the standard deviation in the case of the normal, increases how spread
out the data are
...
The code below computers
some normal CDFs, using a few different scale paramters with the same mean of
0
...
93
Figure 7-10: Normal Distribution CDFs with Different Scale
Parameters
Notice in Figure 7-10 how increasing the scale parameter (standard deviation)
causes the CDF to increase much less precipitously, reflecting the less
concentrated (more spread out) data distribution from the probability density
...
For example suppose you take a
test and score 85, which sounds great until the professor announces the mean
score is 92 with a standard deviation of 8
...
1907870
This means you scored better than 19
...
You could have answered
this question looking at the other end of the distribution as well using 1-pnorm
...
809213
This means that 80
...
Knowing the mean and standard deviation of a normal distribution makes
such calculations easy
...
For example, suppose you chop up a piece of DNA with
an enzyme and get 20 fragments with sizes you measure on a gel (to the nearest
base pair) and you guess from the gel pattern that the size of the fragments is
normally distributed
...
94
> x<-c(321, 275, 345, 347, 297, 309, 312, 371, 330, 295, 299, 365,
+ 378, 387, 295, 322, 292, 270, 321, 277)
> mean(x)
[1] 320
...
16787
And now, knowing the mean and standard deviation, you have a probability
density function normal model for this data that you can use to perform further
statistical tests on
...
> par(mfrow=c(1,2))
> xx <- seq(200,450,by=
...
0
0
...
8
Restriction Fragment CDF
Cum Prob
0
...
000
f(x)
Restriction Fragments PDF
200
300
400
Frag Size in bp
200
300
400
Frag Size in bp
Figure 7-11: PDF and CDF for restriction fragment data
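A sketch of code that could produce the two panels in Figure 7-11 (assuming x holds the fragment sizes entered above; an illustration, not necessarily the author's exact code):
> xx <- seq(200, 450, by=.1)
> par(mfrow=c(1,2))
> plot(xx, dnorm(xx, mean(x), sd(x)), type='l', xlab="Frag Size in bp",
+   ylab="f(x)", main="Restriction Fragments PDF")
> plot(xx, pnorm(xx, mean(x), sd(x)), type='l', xlab="Frag Size in bp",
+   ylab="Cum Prob", main="Restriction Fragment CDF")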
In addition, you now have a model you can make inferences on
...
A simply query R computes this
...
01180460
Based on the above result, there is a 1
...
Another useful feature in R is that all distributions have an associated quantile
function built in, designated by a q before the distribution name code (qnorm,
qbinom, etc)
...
For example, suppose
Simply use the qnorm function in R to do this
...
10,mean(x),sd(x))
[1] 275
...
90,mean(x),sd(x))
[1] 365
...
Quantiles are
called quartiles for the 25%, 50% and 75% regions and also called percentiles on
standardized tests
...
Even though the normal is widely taught in statistics courses, and widely used in
many areas of statistics, it is not the only continuous probability distribution to
be familiar with for effective data analysis
...
For non-normally distributed continuous data modeling we now
turn to two important families of distributions, the gamma family and the beta
family
...
The
base distribution of the family is the gamma distribution, which provides a
versatile model for working with continuous data that may not be normally
distributed
...
The gamma distribution is
only defined for positive real numbers and it takes different forms depending on
the parameter values
...
Long gone are the days when anyone would
hand calculate f(x) values for this equation because computer packages such as
R are very happy to do the calculations for us
...
96
function of the parameter alpha is part of the equation
...
Like all the other distributions, the probability distribution for the gamma is
simple to use in R
...
First, let’s make a few graphs of changing the shape parameter, alpha, while
keeping the scale parameter, beta, is constant at 1:
> x<-seq(0,10,length=100)
> par(mfrow=c(2,2))
> plot(x,dgamma(x,shape=1,scale=1), type='l',xlab="x",
ylab="Prob",main="Shape 1")
> plot(x,dgamma(x,shape=2,scale=1), type='l',xlab="x",
ylab="Prob",main="Shape 2")
> plot(x,dgamma(x,shape=5,scale=1), type='l',xlab="x",
ylab="Prob",main="Shape 5")
> plot(x,dgamma(x,shape=10,scale=1), type='l',xlab="x",
ylab="Prob",main="Shape 10")
Note in Figure 7-12 that for shape=10 the distribution shifts toward the higher
end (and the graph doesn’t depict the entire distribution only the same x range as
the other graphs for comparison)
...
97
Next, let’s look at how the gamma changes when the shape parameter, alpha, is
held constant (at 2) and the scale parameter, beta, is changed
...
3,paste("Scale=",c(1,2,4,8)),lty=1:4)
0
...
0
0
...
3
Note that, although they might not look it, all of the graphs in Figure 7-13 are
the same shape
...
For the higher values the distribution extends beyond the x range shown in
the graph (but for comparison all the graphs are on the same x scale range)
...
Suppose you are measuring survival times of an enzyme in a solution (as
measured by some kind of assay for enzyme activity) and you get the following
data in hours: 4
...
4, 1
...
9, 2
...
4, 5
...
6, 2
...
25
...
98
Because you know you cannot always assume data is normally distributed
(although you often hope so) the first thing to do is to look at a plot of the data
...
It is a plot called a Q-Q plot and what it does is line quantiles of the
data against normal quantiles
...
All you have to do to run a Q-Q plot in R is enter the data and use the qqnorm
and qqline functions
...
75, 3
...
8, 2
...
2, 2
...
8, 2
...
4, 5
...
Figure 7-14: Q-Q plot of enzyme data
And you know, by looking at the squiggle pattern of the data that it does not
align nicely with the expected normal line
...
Knowing about the flexibility of the gamma distribution, you suspect there may
be a gamma distribution model for this data
...
Although the scale and location parameters for the
gamma model are not equal to the standard deviation and mean like in the
normal case, there is a mathematical relationship between the mean and standard
The mean of the gamma distribution is equal to alpha * beta and the variance
(=standard deviation squared) is related to alpha and beta by being equal to
alpha*beta2
...
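These two relationships can also be solved for the parameters directly in R (an added sketch; x is the vector of enzyme survival times entered above):
> m <- mean(x); v <- var(x)
> beta <- v / m    # scale, since variance = alpha*beta^2 and mean = alpha*beta
> alpha <- m / beta  # shape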
35
> var(x)
[1] 1
...
But based on this we know that:
Mean=3
...
35/β=α
and then substituting this into the variance equation for alpha allows you to
solve for beta:
3
...
6 (roughly)
and subsequently, you can solve for alpha
3
...
6) so alpha = 5
...
6 and scale (beta)=0
...
Note that
because we don’t have many data points (only 10) this might not be the best
possible fit, but since we only have such a limited amount of data it’s impossible
to assess the goodness of fit (collecting more data values would be better of
course)
...
However, there is a
requirement that both alpha and beta be positive values (so if you do the algebra
and get negative values for alpha and beta, you did something wrong)
...
100
f(x)
0
...
05 0
...
15 0
...
25 0
...
75, 3
...
8, 2
...
2, 2
...
8, 2
...
4, 5
...
6,scale=0
...
The CDF for the model used
in this example can simply be graphed in R as:
plot(x,pgamma(x,shape=5
...
6),type='l',ylab="P(X<=x)",
+ main="Gamma CDF Fit")
points(data,rep(0,n))
0
...
2 0
...
6 0
...
0
P(X<=x)
Gamma CDF Fit
0
2
4
6
8
x
Figure 7-16:Gamma CDF
It looks from the CDF plot in Figure 7-16 that there may still be some
probability density at x values higher than 5
...
6,scale=0
...
1263511
Thus, there 12
...
The Exponential Distribution
The exponential distribution, famous for modeling survival times (as in the case
with radioactive decay), is just a special case of the gamma distribution where
the shape parameter, alpha = 1
...
Although details will not be
discussed here, the R functions dexp (x, rate = ) is used with the rate parameter
lambda value to model the probability density, and the function pexp is used to
model the CDF
...
The Chi-Square distribution always uses a
value of beta=2 for the scale parameter and a value of alpha=k/2 for the shape
parameter where k is the number of “degrees of freedom”
...
In
R the probability density for this distribution is denoted as dchisq, and the
cumulative density is pchisq
...
The Beta Family
Like the gamma family, the beta family is group of distributions that use alpha
and beta parameters
...
The probability density function for the beta distribution is:
f(x) = [1/β(α, β)] x^(α−1) (1 − x)^(β−1),   0 < x < 1
And you ask, what is that β(α, β ) thing in the denominator? Well, that thing is
called the beta function, and the beta function is actually a ratio of gamma
functions, as follows:
β(α, β) = Γ(α)Γ(β) / Γ(α + β)
One very important thing to note – the range of x in the equation for the beta
probability density is clearly denoted as being between 0 and 1
...
The beta function is used to model data measured as proportions
...
Unfortunately the interpretation of the parameters with the beta are not as clear
as with the gamma, and with the beta distribution, the alpha and beta parameters
are sometimes referred to as the “shape 1” and “shape 2” parameters
...
As with the gamma,
the alpha and beta parameters have a relationship to the mean and variance of
the distribution, with the following mathematical formulas:
mean = α/(α + β)
variance = αβ/[(α + β)²(α + β + 1)]
The formula for the mean is not too bad, but remember to solve for alpha and
beta you need to solve for both – and algebraically it’s not quite so pretty to do
with the above relationships
...
Meanwhile, let’s learn how to
work with the dbeta density function and the pbeta cumulative distribution
functions in R
...
Assume we have already determined the parameters of the beta density that
models our data
...
A
graphical model of this density is made in R with the following command:
11 0
...
10 0
...
20 0
...
01 0
...
07 0
...
25 0
...
11 0
...
08
[16] 0
...
08 0
...
09 0
...
of acidic residues", ylab="f(x)",
+ main="Beta PDF for residue data")
> points(data,rep(0,n))
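Putting the pieces together, the density plot can be generated with a call along these lines (a sketch; the vector data and its length n are assumed to have been entered as above, with shape parameters 2 and 10 as used for the CDF below):
> x <- seq(0, 1, length=100)
> plot(x, dbeta(x, 2, 10), type='l', xlab="prop. of acidic residues",
+   ylab="f(x)", main="Beta PDF for residue data")
> points(data, rep(0, n))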
Figure 7-17: Beta PDF
Similarly a CDF plot can be generated using the pbeta function:
> plot(x,pbeta(x,2,10),xlab="prop
...
prob",
+ main="Beta CDF for residue data")
> points(data,rep(0,n)
The
interested reader is advised to consult a probability and/or mathematical
statistics book for further information
...
For example, there is such a thing as
an inverse gamma distribution, which can be computed using gamma
distribution functionality
...
The t distribution will be introduced in the inferential statistics
chapter and is the distribution which introductory statistics students are taught to
use in the context of hypothesis testing
...
Simulations
One of the greatest powers of using a computer lies in the ability to simulate
things
...
It’s like doing an experiment, only
technique in computational statistics
...
Every distribution mentioned in this
chapter has a corresponding function in R that starts with “r” for random
...
For example, let’s simulate 20 values generated from a standard normal
distribution
...
37100652 0
...
83283766 -1
...
64050749 0
...
56016713 -0
...
24318977 -0
...
Once you have simulated data you often want to look at them graphically
...
Let’s try the above simulation a
few more times, this time with 50, 100, 500, and 1000 values:
>
>
>
>
y1<-rnorm(50,mean=0,sd=1)
y2<-rnorm(100,mean=0,sd=1)
y3<-rnorm(500,mean=0,sd=1)
y4<-rnorm(1000,mean=0,sd=1)
The
distribution with N=1000 appears to approximate a continuous standard normal
distribution quite nicely
...
There will be much more discussion of
simulations in coming chapters
...
107
8
Probability and Distributions
Involving Multiple Variables
The previous two chapters have looked at the basic principles of probability and
probability distributions of one random variable
...
Expanded Probability Concepts
Conditional Probability
Conditional probability is a powerful concept that allows us to calculate the
probability of an event given that some prior event, which we have probability
information about, has occurred
...
Understanding conditional probability, as we will see in the next chapter, is an
essential foundation for Bayesian statistics
...
Let’s illustrate the use of conditional probability with an example from classical
genetics by considering the case of pea color as a trait encoded by one gene that
has two alleles
...
The recessive allele we will denote by y, which codes for
green pea color
...
108
Assuming they are diploid, peas can have a genotype that is homozygote (yy or
YY) or heterozygote (Yy or yY)
...
Peas of genotype yy are green, with a probability of ¼, and peas of
genotypes Yy, yY, and YY are yellow, and therefore the probability of a yellow
pea is 3/4
...
In terms of probability theory, we are restricting the sample
space for the event pea color to the event that the pea is yellow, and creating a
new sample space consisting only of yellow pea color
...
In conditional probability jargon, we are conditioning the event of
heterozygous pea on the event of yellow pea
...
Mathematically we can look at this example using language and notation from
probability theory
...
We can then use the joint probability of the event “heterozygous and
yellow” and divide this by the event of “yellow” to calculate the probability of
being heterozygous and yellow as follows:
P(yellow and heterozygous) = 2/4
P(yellow)=3/4
P(heterozygous|yellow) =
P(yellow and heterozygous)
=2/3
P( yellow)
The notation P(heterozygous|yellow) is standard notation for conditional
probability where the event being conditioned on comes after the “|” notation
...
Using Trees to Represent Conditional Probability
Often looking at a graphical illustration helps to understand a concept
...
Initial branches
of the tree depend on the stem and finer branches depend on the previous
branch
...
Figure 8-1: Conditional Probability Tree Diagram
Note that the second branch tree could have been written with branches for each
of the four genotypes but is drawn here using the more simplified two branch
version of heterozygous versus homozygous
...
Later in this chapter
joint probability will be explained in more detail
...
In the second branch, this stems from the fact that when we
condition on an event, we define a new conditional sample space and the
conditional probabilities obey the axioms and rules of probability within this
new (conditional) sample space
...
Independence
It often happens that knowledge that a certain event E has occurred has no effect
on the probability that some other event F occurs
...
This can be written mathematically as P(F | E) =P(F)
...
In fact each
equation implies the other
...
Here is an alternative way to define independence
...
It is important to note here that determining that events are independent is not
equivalent to determining that events are disjoint or mutually exclusive, which
was previously defined by two events having no common intersection (P
(E∩F)= ∅)
...
Three
events A, B, and C are independent if P(A∩ B∩ C)=P(A)P(B)P(C)
...
However, it is not always the case that the reverse is true, and it is
possible to have three events, A, B, and C where A and B are independent, B
and C are independent but A and C are not independent and therefore A, B, and
C are not mutually independent
...
Some
times it’s based on common logic
...
But
in general, you should not assume independence without good reason to do so
...
Although this issue is often debatable, assuming independence of
sequence elements is key in many data analysis algorithms commonly used
...
For example, suppose nucleotides in a DNA sequence are mutually independent
with equal probabilities (that is, P(A)=P(T)=P(C)=P(G)=1/4)
...
111
of observing a sequence ATCGA is simply P(A)P(T)P(C)P(G)P(A) or
(1/4)5=1/1024
...
However the
concept of independence can easily be applied to the case of nucleotides of
different frequencies
...
Then,
assuming independence, the probability of sequence ATCGA is (1/6)3(1/3)2
which calculates to 1/1944
...
In many instances, in sequence analysis or in analyzing other events, it is clear
events are not independent
...
For example, in analyzing the nucleotide sequence for a start codon
(ATG) or other sequence motif independence does not hold and the subsequent
nucleotides are dependent on the prior nucleotide within that sequence motif
...
Here we will look at the concept of joint probabilities,
which will serve as preparation for coverage of joint distributions later in this
chapter, but is also a subject of use on its own
...
Joint probability
was diagrammed with a Venn diagram in chapter 6 when we discussed the set
theory concept of intersection
...
Using this notation, P(A∩B) symbolizes the joint probability of
events A and B
...
For example, given a
particular DNA sequence we obtain the following (hypothetical) joint
probabilities for two adjacent nucleotide sites (Table 8-1):
2
0
...
1
T
0
0
...
1
0
...
1
0
0
...
1
0
0
Although not immediately obvious to the untrained eye, a skilled probability
practitioner can harvest from Table 8-1 a lot of useful information
...
For
example, in the first cell the entry is the joint probabilitiy of nucleotide A in
position 1 and nucleotide A in position 2
...
2
...
What if we want to know the probability of nucleotide A being at position 1
regardless of the nucleotide at position 2? This calculation is the column total of
the nucleotide A in position 1 column, or 0
...
This probability is called the
marginal probability
...
2
0
...
1
0
...
1
0
...
1
0
...
1
0
0
...
2
G
0
0
...
1
0
...
3
0
...
2
1
Nucleotide at position 2
Nucleotide at position 1
Marginal
probabilities for
columns
Calculating conditional probabilities from the information in the table is also a
breeze
...
P (T at P2 | G at P1)=
P(T at P2 ∩ G at P1) 0
...
2
Here, the joint probability, P(T at P2 ∩ G at P1), is obtained from the cell in
Table 8-2 containing the joint probability for nucleotide G at position 1 and T at
position 2, and P (G at P1) is the marginal probability of nucleotide G at position
1 obtained from the column total for that column in Table 8-2
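This bookkeeping is easy to script (an added sketch; the matrix below uses illustrative values in the spirit of Table 8-2, not necessarily the exact published entries):
> ## joint probabilities: rows = nucleotide at position 2, columns = nucleotide at position 1
> joint <- matrix(c(0.2, 0.1, 0.0, 0.1,
+                   0.0, 0.1, 0.1, 0.1,
+                   0.1, 0.0, 0.1, 0.0,
+                   0.1, 0.0, 0.0, 0.0),
+                 nrow=4, byrow=TRUE,
+                 dimnames=list(pos2=c("A","T","C","G"), pos1=c("A","T","C","G")))
> colSums(joint)                          # marginal probabilities for position 1
> joint["T","G"] / colSums(joint)["G"]    # P(T at P2 | G at P1)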
...
Note that
the Bk are usually the different outcomes of a sample experiment (all possible
events in a given sample space)
...
Exhaustive means the entire sample space or
union of all events, B1 ∪ B2 ∪… ∪ Bn = the sample space
...
114
The law of total probability is best illustrated using the “Pizza Venn Diagram”
in Figure 8-2, and can be summarized mathematically as follows:
P(A) = ∑ᵢ P(A | Bᵢ) P(Bᵢ), with the sum taken over i = 1, …, n
In the above formula and in Figure 8-2, A is the union of disjoint (mutually
exclusive) sets, A∩Bi, for all i
...
In this case we used the
formula
P(A) = ∑ᵢ P(A ∩ Bᵢ), where A is the nucleotide we are calculating the marginal probability of and the Bᵢ's are the four nucleotides we are calculating the marginal probability over.
...
Doing the math, P (A in P1) is 0
...
However, modeling probability for most real life
phenomena, and certainly most phenomena in bioinformatics, usually requires
modeling the distribution of more than one random variable
...
Examples of phenomena being modeled with more than one variable are abound
in scientific applications
...
In
biochemistry it may be possible to develop a statistical model predicting tertiary
protein structure using random variables to model factors such as percentages of
certain amino acid residues or motifs
...
Joint Distributions of Discrete Random Variables
Let’s revisit our example of joint probability of nucleotides in two positions
discussed earlier in this chapter
...
Let’s step this up and model the scenario using random
variables
...
We can re-write our familiar table as a joint probability mass function (pmf) of
two random variables (Table 8-3)
...
116
Table 8-3: Joint Probability Mass Function for Nucleotides at
Two Adjacent Sites
Y=Nucleotide at position
2
X= Nucleotide at position 1
A
T
C
G
A
0
...
1
0
0
...
1
0
...
1
C
0
...
1
0
G
0
0
...
The table now models
probability distributions for the two random variables
...
Although we will not do so here, extending joint distributions to include more
than 2 variables is pretty straightforward
...
We could use the random variable Z to model third nucleotide
...
We write this as P ((X=A)
∩(Y=T)∩(Z=G))
...
The only novel idea here is that
we are assigning a random variable to the marginal probability, creating a
marginal probability mass function for a discrete random variable
...
3
0
...
2
0
...
Table 8-5: Marginal Probability Mass Function for Y
Y=Nucleotide at
position 2
A
0
...
3
C
0
...
1
The marginal distribution can be denoted using mathematical shorthand
...
Note that the sum of probabilities for each marginal distribution
adds to 1 (obeying the law sum of all probabilities in a sample space sums to 1),
which should always be the case (and serves as a good check for determining if
you did the correct calculations)
...
For example, we may be interested in the distribution of second nucleotides (Y)
given that the first nucleotide is an A
...
118
In this formula, the numerator is the joint probability of the first nucleotide
being A and second nucleotide being nucleotide yi
...
For example, using previous data from Table 8-1, to calculate the probability
that the second nucleotide is an A (Y=A) conditioned on the probability that the
first nucleotide is an A we perform the following probability calculation:
P (Y=A|X=A)=
P(Y = A ∩ X = A) 0
...
66
P(X = A)
0
...
Table 8-6: Conditional Distribution of Y given X=A
A
0
...
33
G
Y=Nucleotid
e at position
2 GIVEN
X=A
0
Again, note that the sum of conditional probabilities adds to 1 and fulfills the
law of probability that the sum of probabilities of events in a sample space adds
to 1
...
Joint, Marginal and Conditional Distributions for
Continuous Variables
Because of the importance of discrete data in bioinfomatics and the relative
simplicity of working with discrete data, only discrete joint, marginal and
conditional distributions have been discussed in detail
...
Conceptually the joint, marginal, and
conditional distributions are the same as for discrete variables, but the
mathematics is more complicated
...
The joint pdf of two continuous
random variables X and Y is a two dimensional area A and can be evaluated by
But conceptually the
probability that (X, Y) lies in area A is equal to the volume underneath the
function f(x,y) over the area A
...
In the case of the
discrete random variable, this is done by summing over the other variable(s)
whereas in the case of a continuous random variable, this is done by integrating
over the other variable(s)
...
The conditional probability distribution for two continuous random variables
can also be calculated using some simple calculus
...
This can be written
mathematically where the conditional distribution is denoted by f x| y ( x | y )
...
The
analytical methods and calculus for performing such calculations can become
quite tedious
...
Many examples presented in this book will
Graphical models beyond two random variables are not very practical, so among
other applications, marginal and conditional distributions are often used to look
at graphics of higher dimensional distributions, as we shall in some examples in
the next section of this chapter
...
The distributions presented here are not should not seem entirely
novel, because they build on univariate distributions presented previously by
including more than one random variable in the model
...
These three distributions are selected because they are the key
multivariable distributions used in modeling data in bioinformatics
...
The multinomial is an extension of the
binomial distribution
...
Consider an experiment that models the outcome of n independent trials
...
The probability of any of the outcomes is constant, just as in
the binomial model the probability of success and probability of failure were
held constant for a particular model
...
If we count how many outcomes of each type occur, we have a set of r random
variables X1,X2,…,Xr
...
Note the sum of the values of the random variables is n, the total
number of trials, that is x1+x2+…+xr = n
...
⋅ prxr
This can be written as:
n
n!
x , x ,
...
x !
r
1 2
1 2
r
Combining these results produces the joint distribution of observed events (a
formula that directly parallels the binomial case of two possible outcomes
described in the previous chapter) under the multinomial model
...
, xr ) =
p1 p2 ⋅
...
xr
Among its many applications in bioinformatics, the multinomial model is
frequently used in modeling the joint distribution of the number of observed
genotypes
...
If we sample n diploid individual in the population and
record their genotype at that locus, a number of individuals will be of genotype
AA, which we can represent as just as nAA
...
To formalize this into a
probability model, we can use the random variable X to represent nAA, the
random variable Y to represent nAa, and the random variable Z to represent naa
...
The multinomial distribution formula represents the joint distribution of the
three genotypes is given below
...
As an example, suppose we have 20 individuals and genotype them and find that
nAA=4, nAa=14, and naa=2
...
2, PAa=0
...
1
...
However, using our
empirical parameters we can extend our data set by doing simulations of more
Later in chapter 13
we will do similar types of simulations using a technique called bootstrapping
...
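The simulation below relies on a helper function rmnomial(); one possible definition (an added sketch, not necessarily the author's original code) wraps base R's rmultinom(), which returns one column per draw:
> rmnomial <- function(N, n, p) {
+   t(rmultinom(N, size = n, prob = p))
+ }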
>
>
>
>
>
>
>
>
>
## Define N to be 10 trials
N<-10
## Define n to 20
n<-10
## Define our p vector containing empirical values
# pAA=0
...
7, paa=0
...
2,0
...
1)
## Call function with these parameters, store in results
results<-rmnomial(N,n,p)
This produces the following matrix of simulated values of the three random
variables we are modeling:
> results
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]
[7,]
[8,]
[9,]
[10,]
4
4
2
3
4
6
3
4
1
7
11
13
12
13
13
11
13
15
13
7
5
3
6
4
3
3
4
1
6
6
We could easily write some code to calculate proportions for the values for the
simulated random variable values:
20
0
...
10
0
...
20
0
...
15
0
...
05
0
...
55
0
...
60
0
...
65
0
...
65
0
...
65
0
...
25
0
...
30
0
...
15
0
...
20
0
...
30
0
...
7,0
...
You could write your own functions like the above to sample from multinomial
distributions, but there is a package called combinat that contains some prewritten functions to sample from multinomial distributions
...
Note that if you were interested in doing some statistical tests, you could
simulate values from distributions with alternative parameters, and then perform
tests to determine whether the empirical values differ from this theoretical
distribution
...
25, PAa=0
...
25
...
The marginal distributions for each of the random variables X, Y and Z can
easily be obtained from the multinomial
...
If we are only interested in the number of outcomes that result in the first type,
X, then we simply lump all the other types (Y and Z) into one category called
“other”
...
This
should ring a bell of familiarity, as it has now become a case of Bernoulli trials
...
Note that
the probability of “failure” = prob(“other”) = 1 - p1 = p2 +…+ pr (sum of all the
others)
...
pXj(xj) = C(n, xj) pj^(xj) (1 − pj)^(n − xj)
To examine this more concretely, let’s continue our example and do some
illustrative simulations
...
Instead of considering both Y and Z variables, let’s consolidate them into a
variable W, representing “everything else” which is not genotype AA, and make
W=X+Y=(nAa+naa) and pW=(1-pAa-paa)=0
...
We now have a binomial model
that models X (counting the trials of AA as “successes” and all other outcomes
as “failures”) alone
...
First let’s simulate 10,000 values using the multinomial model and all 3 random
variables
...
>
>
>
>
>
>
>
results<-rmnomial(10000,20,c(0
...
7,0
...
19913
> mean(W)
[1] 0
...
2 and the mean (the proportion of “successes” in
the binomial) and the mean for W is 0
...
These values should seem logical
...
>
hist(X,nclass=10,main="Simulated prop
...
125
Figure 8-3
For comparison, a quick simulation from the binomial of 10,000 values using
p=0
...
2
B<-rbinom(10000,20,0
...
AA using Binomial")
The histogram using the binomial simulation is shown in Figure 8-4
...
Figure8-4
Often
data are mathematically transformed to fit a normal model
...
Nonetheless, most of inferential
multivariate statistics utilize the multivariate normal
...
Because most of the mathematical aspects of dealing with the multivariate
normal involve advanced techniques of calculus and matrix algebra, we will
consider the details of only one limited example of the multivariate normal
...
Recall, from the previous chapter, the mathematical formula for the normal
distribution:
f(x) = [1/(σ√(2π))] e^(−(x − µ)²/(2σ²))
We can model their joint distribution of the two random variables X and Y by
simply looking at the product of the marginal distributions of their two marginal
distributions since they are independent variables
...
Let’s reduce
the case even further by considering only a standard bivariate normal
distribution
...
Thus, we can further re-write the equation above
considering X and Y as standard normal random variables, each having mean 0
and standard deviation 1:
To do this, we can write a function in
R to simulate draws from a bivariate standard normal
...
The results of the function call are stored in a variable matrix z
...
>
>
>
+
+
>
>
+
>
x<-seq(-2,2,length=20)
y<-x
bvn_function(x,y){
(1/2*pi)*exp(-0
...
Along the X-axis (front view of the plot) the perspective lines show a
pattern of the normal distribution reflecting that any “slice” of the cone would
have the shape of a normal distribution
...
128
Once again, we didn’t have to write a function, because random generations of
multivariate normal distributions of any dimension can also be done using the
package mvtnorm
...
Let’s use the package’s function rmvnorm
to generate 1000 values from a bivariate standard normal distribution:
> data<-rmvnorm(1000,mean=c(0,0))
> data
[,1]
[,2]
[1,] -0
...
5950708803
[2,] -1
...
3079036163
[3,] -1
...
9042616927
…
[999,] 2
...
7003787054
[1000,] -0
...
0056516042
In our result we have a matrix of data where the first column has values of
random variable X values from standard normal simulation, and the second
column is random variable Y values from a standard normal simulation
...
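The scatter plot in Figure 8-6 can be produced with a call along these lines (an added sketch):
> plot(data[,1], data[,2], xlab="X", ylab="Y")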
Figure8-6: Scatter plot of X and Y values from bivariate
standard normal simulation
If we want to look at marginal distributions of X and Y, we can break down the
data matrix into an X and Y matrix and look at the marginal distributions of X
and Y very easily
...
If we did a simulation of 1000 values from a standard
univariate normal distribution we would obtain a virtually identical result
...
Recall that the
beta distribution is the distribution often used to model data in the form of
proportions, i
...
values between 0 and 1
...
We now denote by X1…Xk a set of proportions noting that X1+…+Xk = 1 (and
each Xi > 0)
...
X K ) =
i =1
k
∏
Γ(α i )
k
∏
X iαi −1
i =1
i =1
Capitol sigma ∑ is the
symbol for addition (sum of), and the capitol pi, Π is the symbol for
multiplication
...
The constraint on the distribution is that the sum of proportions adds to one, that
k
is,
∑X
i
= 1 which should seem logical since when modeling the joint
i =1
proportions of nucleotides in a sequence, for example, the total of all proportions
is always 1
...
As a simple example where we might use the Dirichlet, let’s model the joint
proportions of purines (A or G) and pyrimidines (C and T) in a given sequence
...
Let’s use the arbitrary choice of alpha=1 as a
parameter for X1 and alpha=2 for X2
...
Mathematically, with k=2 since there
are two random variables being modeled, the joint distribution Dirichlet model
is:
2
Γ( ∑ α i )
2
i =1
f ( X1 , X 2 ) =
∏ Xi α i −1
2
∏ Γ(α i ) i =1
i =1
Simplifying the above expression by substituting in the alpha values and pi’s and
then performing addition and multiplication produces a formula that is not very
intimidating and just involves algebra and calculation of a few gamma
functions:
f (X 1 , X 2 ) =
Γ(3)
(p1)1−1 (p 2) 2 −1
Γ(1)Γ(2)
Note that since X2 = 1-X1 we could simply view this as a marginal distribution
of X1 because for any given X1, X2 would be completely determined
...
131
We note that this is simply the Beta distribution for X1
...
Therefore the joint Dirichlet distribution for two random
proportions X1,X2 (X1 + X2 = 1) is equivalent to the univariate Beta distribution
for X1 alone
...
Because calculating the
gamma functions in this equation can be computationally intense (even for quite
high powered computers), sometimes the distribution is evaluated by taking
logarithms (which make it computationally more efficient)
...
The right side of the Dirichlet model can be calculated using
logarithms like this:
k
k
=log ( Γ(
∑α ) )-log ( ∏
i
i =1
i =1
k
Γ(α i ) ) + log ( ∏ X i α i −1 )
i =1
Another computation quirk of the Dirichlet distribution is that it is simpler to
sample from the Dirichlet indirectly by using a method that draws k independent
gamma samples and then computing random proportions (Xi) as the value of
each sample divided by the sum of the k samples
...
,Xk have a Dirichlet distribution
...
To simulate from the Dirichlet in R you could write your own function
...
vector(sum)
}
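One possible version of such a function (an added sketch, not necessarily the author's exact code) draws k independent gamma variates per row and normalizes them so each row sums to 1:
> rDir <- function(n, alpha) {
+   k <- length(alpha)
+   g <- matrix(rgamma(n * k, shape = alpha, rate = 1), ncol = k, byrow = TRUE)
+   g / rowSums(g)
+ }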
Using this function to simulate values for n=20 with a vector of parameters
(alpha values) for 3 random variables where alpha=1 for all three variables
produces matrix of results where each column represents simulations for each
random variable:
x<-rDir(20,c(1,1,1))
> x
[,1]
[,2]
[1,] 0
...
005894e-01
[2,] 0
...
894494e-03
[3,] 0
...
437322e-02
[4,] 0
...
830641e-03
…
[,3]
0
...
66363492
0
...
47721011
Performing the same
computation as above using the function rdirichlet from this package produces
the following:
> x <- rdirichlet(20, c(1,1,1) )
> x
[,1]
[,2]
[,3]
[1,] 0
...
034630906 0
...
61271253 0
...
02801983
[3,] 0
...
180993424 0
...
77386208 0
...
22128695
…
Let’s return to our example of modeling the joint distribution of the proportion
of purines and pyrimidines and simulate 1000 values using the rdirichlet
function:
> x<-rdirichlet(1000,c(1,2))
If we look at the mean values simulated for p1 and p2, these are the simulated
proportions of x1 purines and x2 pyrimidines given that our parameters alpha=1
and alpha=2
...
3354725
> mean(x[,2])
[1] 0
...
If we plot the marginal of x[ ,1] we
get the following as depicted in Figure 8-8:
> hist(x[,1],nclass=20,main="Marginal of x[,1]")
Figure 8-8
It is important here to
grasp the basics of the Dirichlet and its mathematical model and how to draw
simulations from the Dirichlet using R
...
If you can understand the different models, what they are used for, how to
perform simulations and how to graphically analyze results, you should be
prepared to utilize these capabilities in some more advanced applications
...
We discussed frequentists statistics being
the historic mainstream type of statistics usually taught in introductory statistics
course and Bayesian being a school of statistical thought which build upon
frequentists methods but incorporates subjective, as well as objective, thinking
about probability
...
The coverage here is primarily conceptual and only serves as a brief
introduction to the extensive world of Bayesian statistics
...
These methods will be applied
in Chapters 11 and 12 in the study of Markov Chain methods
...
This book does not present a biased view of either approach being
superior to the other, but presents each view in terms of strengths and utilizes
the approach most likely to be of use in bioinformatics applications discussed
...
However, after Chapter 12 we
will return to working with frequentist statistics to study how to use R for
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Some topics discussed after
that will utilize a combination of frequentists and Bayesian thinking
...
Since probability is the language and foundation of
statistics, this difference affects not only how Bayesians think about and model
probability, but also how they make inferences about data as well
...
However we
are mainly concerned with Bayesian statistics in the context of working with
probability models and will only touch on the differences in making inferences
...
To a frequentist, this has an objective interpretation
...
This probability value is a
true, fixed measure of the chance that a coin lands on its head, mathematically
approximated by repetitions of an experiment
...
When we return to
inferential statistics in Chapter 13 this will be studied in more detail
...
To a Bayesian there is no such thing
as a fixed probability statement
...
For a Bayesian,
there are no true measures, only certain probability distributions associated with
their subjective beliefs
...
Using subjective knowledge, the Bayesian would first guesses on what the
probability of a parameter is (in this case, the outcome of a coin toss)
...
Then the
Bayesian would collect data (tossing the coin) and based on the data collected
would update the model
...
This process is the paradigm of Bayesian statistics
...
Perhaps the Bayesian has experienced in the past that coins do not
land on their heads and has previously experimented with unfair coins
...
136
Bayesian may have a different belief about a prior distribution from another
Bayesian’s belief of prior distribution for the same process
...
It is this incorporation of subjective belief into the
model that distinguishes the Bayesian from the frequentists
...
Here are some good reasons to use the Bayesian approach, presented here
briefly and without technical details
...
Anyone who has ever interpreted a confidence interval in frequentists
statistics course (discussed in Chapter 13) knows that frequentist’s ways of
interpreting things can become a bit confusing
...
•
Less Theoretical
...
The forms of distributions modeled may vary, but the
paradigm for working with them is essentially the same
...
The Bayesian approach is ideal for utilizing computational power
...
•
More Flexible
...
•
Deals Well with Missing Data Values
...
•
Effectively Models High-Dimensional Data
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
As we shall see with Bayes Theorem, the Bayesian approach involves
constantly updating the model based on new data
...
Artificial intelligence and other disciplines take
advantage of the Bayesian approach and utilize it in learning algorithms in
data mining and other applications
...
Algorithms based on Bayesian theory,
including Markov Chain Monte Carlo, can be carried out and produce
results with small samples very effectively
...
This may be viewed as nonscientific and biased and is the most frequent
source of dispute of Bayesians and Frequentists
...
•
Misunderstood
...
Therefore others may not understand the Bayesian model and may
misinterpret it
...
However, whether they are more complex than frequentist
statistics is debatable
...
The bases for all Bayesian models are
fundamentally the same algorithm, which may be simpler than Frequentists
models
...
Bayes’ rule is an
accounting identity that obeys the axioms of probability
...
138
uncontroversial, but it is this simple rule that is the source of the rich
applications and controversy over subjective elements in Bayesian statistics
...
C
B
A
Figure 9-1
However, the probability of A and B can also be rewritten as:
P(A∩B)=P(A|B)*P(A)
Or as P(A∩B)=P(B|A)*P(B)
We can rewrite the relationship between conditional and joint probability of A
and B given earlier as:
P (A|B)=
P(A ∩ B) P(B | A ) * P(A )
=
P(B)
P(B)
Where the relation
P (A|B)=
P( B | A ) * P( A )
P(B)
Is known as Bayes’ rule
...
This is because it shows how a conditional probability P(B|A) can
be turned into, or inverted, into a conditional probability P(A|B)
...
139
The denominator of Bayes’ rule P(B) is the marginal probability of event B, that
is the probability of event B over all possibilities of A where there is joint
probability
...
Sometimes Bayes’ rule is written using E and H, where E stands for “evidence”
and H stands for “hypothesis”
...
P(H|E) is the updated probability of belief in the hypothesis given
the evidence
...
This is where the usefulness of Bayes rule
and Bayesian statistics in learning comes from, and this idea is a foundation of
the usefulness of Bayesian statistics
...
In the first case, we will have
complete information about the joint probability of two events
...
Table 9-1 shows the joint probability of two events, event A being a membrane
bound protein and event B having a high proportion of hydrophobic (amino
acid) residues
...
140
or Ac), not being a membrane bound protein
...
Each cell represents the joint probability of two
events
...
To do this we
can apply Bayes’ rule in this form:
P(B|A)=
P(A | B)P(A ) P(A ∩ B)
=
P(B)
P(B)
P(A∩B) is the joint probability of A and B is simply found from the cell in the
joint probability table and is 0
...
P(B) can be calculated by the law of total
probability, calculating the P(B|A)P(A)+P(B|~A)P(~A) or since we have the
table by the sum of the row for event B, which is 0
...
Therefore
P(B|A)=
P(A ∩ B) 0
...
6
P(B)
0
...
Table 9-1
Type of Protein
Prop
...
3
0
...
1
0
...
Now let’s
consider a second example, where the computation would be quite difficult were
it not for Bayes formula
...
Suppose the prevalence of having the gene X is 1/1000 and the prevalence
of having a particular disease is 1%
...
141
We could use this information and try to produce a joint probability table for the
two events – having the gene and having the disease
...
From the information above we are given:
P(Gene)=1/1000
P(Disease)=1/100
And
P(Disease|Gene)=0
...
The P(Disease) can be assumed to be the
marginal probability of having the disease in the population, across those with
and those without the gene X
...
5 * (1/1000)
=0
...
Note that this is very different from the condition
we started with which was the probability that you have the gene, then there is a
50% probability you will have the disease
...
In such a sequential framework, the
posterior from one updating step becomes the prior for the subsequent step
...
However, since we
are interested in Bayes formula with respect to Bayesian statistics and
probability models, our discussion here will continue in this direction
...
142
understanding how these principles apply to distributions, here we first looked at
Bayes rule (above) but now, we will extend Bayes’ rule and apply it to working
with probability models (distributions)
...
P(H|E)=
P( E | H ) P( H)
P( E)
We have already discussed that this can be written as a proportional relationship
eliminating P(E), which is a constant for a given scenario, so:
P(H|E)∝ P(E | H )P(H )
Recall that P(H) represents the prior degree of belief in the hypothesis before the
evidence
...
Recall
that the Bayesian would view this as a probability distribution
...
We refer to the distribution modeled
with P(H) as a prior
...
All we
have is the Bayesian’s subjective belief of the proportion of times that the coin
will land on its head
...
Recall that the beta distribution is used to model probability distributions of
proportion data for one random variable
...
25)
...
Let’s go ahead in R and graph the prior distribution
...
P(H)")
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Is this an adequate prior? It is
centered on 0
...
It is not too precise and has a
reasonable spread, reflecting both some certainty on the part of the Bayesian that
the value is around 0
...
5, reflecting the Bayesian is fairly certain of his beliefs, but not
so certain as to use an overly precise (invariable prior)
...
The
purpose of a prior is to model the initial subjective belief of the Bayesian
regarding the parameter of interest, before data are collected
...
Figure 9-2
We now have the P(H) part of the model P(H|E)∝ P(E | H )P(H )
...
In order to get this, we need some evidence
...
In order to get data, we need to perform an experiment, so let’s collect
some evidence
...
Each trial of tossing the coin is a Bernoulli trial, and the combination of 10 trials
can be modeled using the binomial distribution, which is (from chapter 7):
n
p(k ) = P(k successes in n trials) = p k (1 − p) n − k
k
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
In frequentist statistics our estimate for the Binomial parameter p would
be p=0
...
5 *0
...
5),,type='h',xlab="",main="Data dist
...
Figure 9-3
Since in the Bayesian paradigm p is a random variable, there are infinitely many
plausible models for the data, namely one for each value of p
...
3
...
In
our case since the experiment resulted in k=5 the likelihood is
Likelihood(p) = P(5 successes in 10 trials | parameter p) =
10 5
5
p * (1-p)
5
Likelihoods will be discussed more later in this chapter, but let’s continue
working with our extension of Bayes’ rule to apply it to distributions
...
145
P(H|E)∝ P(E | H )P(H )
Now what we want is P (H|E), the updated probability of belief in the hypothesis
given the evidence
...
Recall our beta model for P(H)
f (x) =
1
x α −1 (1 − x ) β −1 , 0
we concluded that alpha=2, and beta=5 are adequate parameters for modeling
our P(H) with an Bayesian estimate of 0
...
Hopefully you recall a day in math
class sometime in secondary school when you encountered the rule for
multiplying exponents of the same base, aman=am+n
Let’s combine our models:
P(H|E)∝ P(E | H )P(H )
n k
1
p (1 − p) n − k
p α −1 (1 − p) β −1
β (α , β )
k
P(H|E) ∝
Multiplication produces the posterior:
n
1
k+α-1
n-k+β-1
(1-p)
β (α , β ) p
k
P(H|E) ∝
Combining constant (for this particular) terms and replacing them with c (c is a
normalizing constant which we will not be concerned with evaluating here):
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Thus our posterior distribution has parameters alpha=5+2=7 and beta=105+5=10, so we have a beta (7,10) distribution to model the posterior distribution
of P(H|E)
...
The plausibility is naturally represented by
the posterior probability density function of the Beta (7,10) distribution
...
> p <- (0:1000)/1000
> plot(p,dbeta(p,2,5),,ylim=range(0:4),type=”l”
+ ylab=”f(p)”, main="Comparing Prior and Posterior")
> lines(p,dbeta(p,7,10))
> legend(0,3,legend="Prior")
> legend(0
...
6,legend="Posterior")
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Note though that the
posterior is still not nearly centered on 0
...
We will return
to discussing how the likelihood and prior influence the posterior distribution
calculation later in this chapter
...
The readers are encouraged to do
this experiment on their own with plots resulting in R
...
5
...
0
0
...
4
0
...
8
1
...
Although the mathematical complexity of the applications of this algorithms are
broad and can be complex, this algorithm remains the same for the Bayesian
methods
...
148
computational complexity involved
...
The rest of this chapter discusses this algorithm further, taking a closer look at
the prior choices, what a likelihood function is, and how the posterior is
calculated
...
The use of a prior introduces
subjective information from the statistician into the model and there are not hard
and fast rules as to the “correct” prior to use as it is more of an educated
guessing game
...
In the coin-tossing example above, the prior was an estimate
lower than the empirical data suggested, so based on that you may argue that the
prior biased the data and the result would have been more accurate if no prior
were used in calculating the posterior
...
5 and the empirical data result for the proportion
of heads in a 10 toss trial is 0
...
Another counterargument is that there is always subjectivity in
any model – data can be collected through a variety of experiments with
different biases
...
Priors however are not just about introducing subjective knowledge into the
model; they are about introducing previous knowledge into the model
...
Previous knowledge may be entirely subjective – based on a feel or a
guess about what the outcome of a novel experiment may be, often in regard to
the potential outcome of an experiment that has never been done
...
Meta analysis models can be done in Bayesian
combining data from new experiments with data from old experiments
...
We will
only deal with the simplest models of standard form, merely touching on the
vast world of Bayesian models
...
However, in the Bayesian model the use of a
prior is key
...
Sometimes, as in the example above the choice of prior is simple because it
follows a convenient model pattern, as is the case with the beta prior for the
binomial model
...
Other times the basis
of a prior is made for mathematical reasons (beyond our discussion) in a
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Such priors are referred to as noninformative
...
Conjugate Priors
The type of prior we will use for all our examples in this and coming chapters
falls into the type of prior referred to as an informative prior or a conjugate
prior
...
This property is called conjugacy
...
This
is always the case given these models
...
The example used earlier in this chapter for a coin toss
can be applied to numerous scenarios
...
That’s OK
...
There is no requirement for conjugate pairs to be both
continuous or both discrete distributions
...
There are other conjugate pairs of prior-likelihood to yield a posterior of the
same form as the prior
...
Recall the form of the Poisson distributions (from chapter 7)
...
Let’s use the symbol theta (θ)
instead of lambda, where theta represents the rate parameter (all we are doing
here is changing the symbol used, a perfectly legal mathematical maneuver)
...
Note that the rate can be any positive (real) number
...
150
e−θθ xi
P(data|model)=P(x1,x2,…,xn|θ)= ∏
xi !
i =1
n
Leaving out the constant term (factorial in denominator):
∑xi -nθ
P(data|model)∝θ
e
Recall also the gamma model, a continuous model with great flexibility of
particular shape and location parameters, alpha and beta, used to model data
which are greater than zero and continuous in nature but do not follow the
normal distribution
...
According to prior knowledge, the rate per 1000
people at which this mutation occurs is 0
...
Based on this we can model this
with a gamma distribution with alpha parameter of 2 and beta parameter of 5
(the mean of which is 0
...
That is the
distribution of theta (here denoting the rate per 1000 people) for the prior is
gamma (2,5)
...
Since in our 10
samples we found a total of eight people with the mutation, Σxi is 8
...
151
updated parameters are alpha (post)=10 and beta (post)=15 with a gamma
(10,15) distribution for theta
...
001)
y<-dgamma(theta,2,5)
plot(theta,y,ylim=range(0:3),ylab="",xlab="theta")
y2<-dgamma(theta,10,15)
lines(theta,y2)
legend(0,2
...
6,2
...
8 per thousand) but still not
as high as the data would suggests
...
Working
with conjugate pairs greatly simplies the mathematics and modeling involved in
Bayesian data analysis
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
The study of
noninformative priors in Bayesian statistics is relatively complicated and an
active area of research
...
Basically, a noninformative prior is used to conform to the Bayesian model in a
correct parametric form
...
Noninformative priors may be special
distribution forms, or versions of common probability distributions that reflect
no knowledge about the distribution of the random variable(s) in the prior
model
...
If there is no specific prior information known about
the proportion of something (say a proportion of nucleotide in a sequence
motif), then we can use a noninformative form of the beta distribution
...
This is a special case of a
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
If you want
to use a Bayesian model and have proportion data to model with no prior
information, you should use a beta(1,1) prior
...
> x<-seq(0,1,by=
...
There is a difference between the likelihood and a
distribution model
...
With a distribution model we are asking what the probability of the sample
outcomes are given the parameter values
...
The parameter, p, is fixed, and the observations (x’s) are
variable
...
In essence, the interpretation of the likelihood is the
reverse of the distribution
...
154
trying to determine the parameters given the data
...
A popular estimate is the value that is
most likely, that is, the value that makes the likelihood function as big as
possible (which you could prove by taking the likelihood and finding its
maximum using calculus)
...
The interpretation is that the MLE is the
most plausible value of the parameter, in the sense that it produces the largest
possible probability of the observed data occurring
...
Obviously making such a “true” estimate is more of a frequentist than a
Bayesian thing, but it is of note here because it is an important use of the
likelihood function
...
In the examples used so far, the posterior distribution was quite simple
and the trick of using conjugate pairs resulted in a familiar posterior distribution
whose parameters could easily be determined by taking advantage of the
property of conjugacy
...
Let’s take a look at some of the issues involved in the posterior
outcome of the Bayesian model
...
Let’s investigate the role of the likelihood
and role of the prior in determining the posterior
...
It is pretty straightforward that the
more data there are, the more the data will influence the posterior and eventually
the choice of the prior will become irrelevant
...
In
this example we had a prior estimate for p of 0
...
We collected n=10 data values and obtained k=5
successes (coin landed on head) for the binomial likelihood model
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
With n=100 and x=50 we get alpha (new)=52 and beta (new)=55
...
001)
plot(p, dbeta(p, 2, 5), ylim = range(0:10), ylab = "f(p)",
type = "l", main = "Effect of n on Posterior")
lines(p, dbeta(p, 7, 10))
lines(p, dbeta(p, 52, 55))
legend(0, 4, "Prior")
legend(0
...
1")
legend(0
...
2")
Figure 9-8 shows the result of this analysis
...
5 and is also much more “precise” on this
value with less spread
...
Figure 9-8
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Let’s see what would happen in the example above if
we used the same data for n=10 but used instead a beta(1,1) noninformative
prior to start
...
Let’s analyze and graph this model and compare it
with the prior and first posterior in Figure 9-8
>
>
+
>
>
>
p <- seq(0
...
999, by = 0
...
5, 4, "Posterior")
legend(0, 3, "Prior")
Figure 9-9 shows the result of the posterior when a noninformative beta prior is
used
...
Posterior 1 in Figure 9-1 is updated with the same data as the
posterior distribution in Figure 9-9 except in Figure 9-9 a noninformative beta
prior is used
...
When the noninformative prior is used the data dominate the model, whereas
when the informative prior in 9-8 is used the informative prior influences the
posterior along with the data
...
157
The important point is that in Bayesian analysis the posterior distribution will
reflect with relative weights the data model and the prior model
...
This relative influence of data and prior knowledge is
important aspect of Bayesian analysis
...
Using analytical methods (the paper and pencil method) without
a computer it is nearly impossible to evaluate a posterior except in simple cases
...
We shall see that the focus of Markov Chains and other methods
are in fact tools for evaluating a posterior distribution of interest
...
In order to
make the models fully explicit we would need to add back in a constant term c:
(Note that the notation “c” is generic
...
Posterior =c * Likelihood * Prior
P(model|data) = c* P(data|model) * P(model)
Note that the constant terms, grouped as c, are called the normalizing constant
and this can be calculated so that the posterior distribution integrates to 1, but it
is only a scaling factor and does not affect the distribution otherwise, so it is
acceptable to ignore the c term in many calculations and only calculate it when
necessary
...
Bayesian
statistics can of course be utilized in models with multiple parameters, and those
are of primary interest in bioinformatics
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Nuisance
parameters can be eliminated easily in Bayesian models by calculating marginal
and conditional probability distribution for the main parameters Rather than
introduce techniques here for dealing with multiparameter situations we shall
only note that they exist and we will work with them in coming chapters when
we learn computationally intense algorithms that do most of the work for us in
evaluating such models to understand the parameter of interest
...
We also cover Markov chain Monte
Carlo methods, which are of growing use and popular in bioinformatics
applications
...
Also related are
artificial intelligence related subjects, logic, and of course, biology related areas
...
The appendix lists
some reference papers from the literature where Bayesian methods are used
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Markov chains have many applications in many areas of bioinformatics and
other fields
...
Ten years ago, if you brought up the term MCMC to a room of
biologists, few if any would have heard of such a technique
...
Understanding of the
mathematical techniques behind MCMC is still mostly the domain of
statisticians and mathematicians
...
References given in the appendix for further study are
provided for the interested reader
...
Before
introducing the basics of the MCMC technique, let’s discuss what the ultimate
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
The previous chapter introduced Bayesian
statistics in which the major goal is to determine a posterior distribution given a
likelihood data model and a prior distribution model
...
However, determining a posterior distribution is easier said then done and
traditional analytical methods (using advanced calculus, numerical methods, etc)
are often extremely difficult
...
The prior is:
k
Γ( ∑ α i )
k
i =1
f ( p1 , p 2 ,
...
X K ) =
X X
...
p k
1 2 K
The posterior distribution is:
Posterior ∝ Likelihood * Prior
k
n
X1 X 2
Xk
Posterior =
X X
...
p k
1 2 K
Γ(
∑α )
i
i =1
k
∏ pα
i
k
∏ Γ(α )
i −1
i =1
i
i =1
It is easy to evaluate the Dirichlet - that’s why it is used as a conjugate model
...
post = alpha +
Xi, beta
...
Perhaps we should
use another model to illustrate “difficult to evaluate”
...
In this example we consider what is
actually a rather simple case using a normal likelihood with parameters µ and
σ2
...
161
It turns out to be a bit easier for Bayesian computations to work with the inverse
of the variance parameter, the so-called precision parameter φ = 1/σ2
...
For
this example the likelihood function is, remembering that we use φ = 1/σ2:
n
φ
φ n / 2 − 2 ∑( X −µ )
lik( µ , φ ) =
e
n/2
( 2π )
2
i
i =1
The joint prior density of the parameters µ and φ is:
f ( µ , φ ) = f1 ( µ ) f 2 (φ ) =
1
−
2πτ
2
e
1
2τ 2
µ2
⋅
1
Γ (α ) β
φ α −1e
α
( β)
− φ
The posterior distribution is:
Posterior ∝ Likelihood * Prior
−
posterior = f ( µ , φ | X 1 ,
...
Even those in love with
multidimensional calculus are not very interested in performing such
evaluations
...
But, the
dilemma is, in bioinformatics and other areas working with high-dimensional
datasets, what we want is the information in the posterior distribution – the one
we can’t evaluate directly!
Do not despair, because there is a clever trick
...
MCMC techniques simulate a posterior that can be explored
...
A brief view of how MCMC works is shown in Figure 10-1
...
The approximation starts at the origin and
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
In the upper left corner, n=10, the samples are
sequentially joined
...
As the
iterations continue the process produces a more refined approximation to the
desired posterior distribution
...
Figure 10-1: Posterior MCMC Simulation of a Bivariate
Normal Distribution
By 1000 cycles the simulation in Figure 10-1 clearly resembles Figure 8-7,
which depicts a directly simulated bivariate normal distribution plotted in the
same way
...
Utilizing MCMC techniques require an understanding of Bayesian theory
(covered in Chapter 9), Stochastic and Markov Processes described in this
chapter, the algorithms covered in Chapter 11, and can be implemented using R
and the auxiliary software tools introduced in Chapter 12
...
163
the power of MCMC
...
The idea here is to introduce the reader to the power of this technique so that
he/she may understand the terminology in the literature, he/she may be able to
perform simple exploratory applications using R and auxiliary programs, and to
encourage the reader interested to collaborate with statisticians who can assist
them in their modeling and data analysis
...
Stochastic Modeling
As we have mentioned before, models are used to represent behaviors and
properties of real world systems
...
The use of statistical models in bioinformatics is rapidly growing,
but many other fields use such models as well
...
Models can be generally classified as stochastic or deterministic
...
A
deterministic model is a model where the input components of the model are not
random
...
A
deterministic model has an outcome that can be computed by direct calculation
or numerical approximation
...
The output
of a stochastic model is also random, and is a snapshot or estimate of the
characteristics of a model for a given set of inputs
...
Stochastic models work well in the Bayesian
paradigm
...
A deterministic model gives results of a single
scenario, whereas a stochastic model can be used to give a distribution of results
for a distribution of scenarios
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Recall that in probability, a single random phenomenon (such as the value of the
role of a die) is modeled using a random variable representing the outcome of an
experiment
...
In a stochastic process, the “game” changes in time or space in some way
...
Each random variable in
the sequence represents the current state of the process
...
In a
stochastic process the subsequently indexed random variables are not
independent and there is some pattern of dependency from Xn to Xn+1
...
Two major categories of
stochastic processes are arrival-time processes and Markov processes
...
We will not cover arrival times
proceses
...
Markov Processes
A Markov process is a stochastic process where the set of random variables
{X1, X2, X3…Xn…} models a certain process over time where it is assumed
these random variables represent measurements (or counts) that were taken at
regularly spaced time points
...
The next value (Xn+1) depends only on the current value (Xn)
...
Mathematically this is written in terms of conditional probability:
P(Xn+1=xn+1|Xn=xn, Xn-1=x n-1, … X0=x0)=P(X n+1=x n+1|Xn=xn)
This is known as the Markov condition
...
In Figure 10-2 the state of the sequence depends only on the time
interval prior to it (t=4 depends only on t=3, etc)
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
This is an alternative Markov process
model
...
Each state of the nucleotide {A, T, C, G} is sequentially dependent on the
nucleotide adjacent and immediately upstream of it, and only that nucleotide
...
5'
A TC G C C
X0
3'
X1
Figure 10-3: 5’ to 3’ Nucleotide Dependencies as a Markov
Process
Classifications of Stochastic Processes
Time (Position)– Discrete or Continuous?
This refers to the index of the random variables
...
If
the index is discrete then it’s discrete time/space
...
166
continuous time/space
...
State Space – Discrete or Continuous?
This refers to the space of values for the random variable
...
For example, a discrete state space is the set of
four nucleotides in a DNA sequence and modeled using random variables that
will be discrete random variables (since they can only take on four distinct
values)
...
Four Combinations of Possibilities
Of course, you can have any combination of time and state space
...
Towards the end of the chapter, we will discuss
briefly how the discrete models generalize to continuous models
...
Given that the system
is in state x (where x can be any integer value) at time n, the system will at time
n+1 move either up, or down one point (integer) usually with equal probability
...
The simple function below simulates a random walk with respect to the y-axis
using R
...
At any given time a
random step up or down one integer amounts to sampling with equal
probabilities from the numbers (+1,-1) and adding the result to the current state
...
The results of one
simulation are shown in Figure 10-4
...
167
20
10
0
-20 -10
y
0
20
40
60
80
100
Figure 10-4: Simple Random Walk
Likewise, it is simple to simulate a two dimensional random walk in R as coded
below
...
This code could
represent (among other things) an experiment of flipping two coins
...
For the x and y direction if
the coin is 1, you move + 1 in that direction, if the coin is -1 you move –1 in that
direction
...
The “cumsum” function accumulates
these movements
...
168
Figure 10-5: Two Dimensional Random Walk Outcomes
(same simulation)
Real world applications of random walks include fractals, dendritic growth (for
example, tree roots), models for stock options and portfolios, and gambling
game outcomes
...
Let’s
now explore Markov Chain models in more detail
...
Because some matrix math is required to compute Markov Chain
probabilities, we will first review how to do this using R
...
The first example is
very simple and will be presented mostly conceptually
...
The next section of
this chapter will investigate the mathematical details of Markov Chain models in
more depth
...
The numbers are called entries (or
elements) of the matrix
...
You have likely encountered matrices before in a math
course (although you may not remember them!)
...
169
For our purpose here, we will be concerned mostly with square matrices
...
And the only
operation we will be concerned with is multiplying matrices
...
To multiply matrix A by matrix B to create a new matrix
C, we multiply row 1 elements of A by corresponding elements of column 1 of
B and sum the results to get the first entry in C (c11)
...
a11* b11 + a12 * b21 a11* b12 + a12 * b22 c11 c12
C = AB =
=
a 21* b11 + a 22 * b21 a 21* b12 + a 22 * b22 c21 c22
In R, matrices are a distinct type of object with distinct behaviors declared using
the matrix function call
...
To declare a matrix
in R use the matrix function below with parameters data (for the data vector),
nrow (for number of rows) and ncol (for number of columns)
...
Data can be a
preexisting data vector, or can be entered as a parameter directly in the function
call
...
Thus, if 4 data values are
entered as a data parameter, the first two become the first column and the second
two values become the second column
...
matrix(x) will return true or false depending on whether
object x is a matrix or not
...
Let’s declare two square matrices and multiply them in R
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
At each time point the frog (who is disabled and
cannot swim but can only leap from pad to pad) is on one of the Lilly pads (state
space consisting of A and B)
...
What if we know the frog is on Lilly pad A at the initial time, and we want to
know the probability that he is on pad A in 10 minutes?
We can model this scenario as follows
...
8, the probability of
the frog going to pad B is 0
...
If the frog is on pad B, the
probability of staying on pad B is 0
...
4
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
6
B
0
...
2
0
...
We call this a transition matrix
...
The matrix below is for
transition probabilities for the first move given the initial state
0
...
2
Transition matrix for initial state =
0
...
6
To determine how he could be on pad A in 2 time periods given he is on pad A
in the initial state, we can observe that in order for this to happen, there are two
possibilities
...
The second is that the frog goes from pad A to pad B
and then back to pad A
...
Each transitional event is independent (frog can
go from A to A or A to B, these events are independent) so we can multiply
them
...
P(Frog on A after two time intervals)
= P(AA)P(AA)+P(AB)P(BA)
=(0
...
8)+(0
...
4)
=0
...
08=0
...
172
To get the next transition matrix (t=1 to t=2) we can use R to multiply the first
transition matrix by itself, using the logic described above
...
8,0
...
2,0
...
8 0
...
4 0
...
72 0
...
56 0
...
Figure 10-7: Transition Probabilities for Frog’s First Two
Moves
Let’s consider the longer-term behavior of the transition probability matrix
(which can be referred to as “the chain”)
...
) a peculiar thing happens…
...
688 0
...
624 0
...
6752 0
...
6496 0
...
173
> frog10
[,1]
[,2]
[1,] 0
...
3332984
[2,] 0
...
3334032
>
>
>
>
>
>
frog11<-frog10%*%frog1
frog12<-frog11%*%frog1
frog13<-frog12%*%frog1
frog14<-frog13%*%frog1
frog15<-frog14%*%frog1
frog15
[,1]
[,2]
[1,] 0
...
3333330
[2,] 0
...
3333340
>
>
>
>
>
>
frog16<-frog15%*%frog1
frog17<-frog16%*%frog1
frog18<-frog17%*%frog1
frog19<-frog18%*%frog1
frog20<-frog19%*%frog1
frog20
[,1]
[,2]
[1,] 0
...
3333333
[2,] 0
...
3333333
After around 20 intervals the transition matrix no longer changes
...
0
...
666
A
B
0
...
666
Figure 10-8: Transition Matrix for Frog’s First 20 Jumps
After 20 jumps the frog’s position is in a so-called stationary distribution and the
chain has converged (ended) with this stationary distribution
...
We specified only the transition probabilities for his moving from pad to pad
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
5 for starting on A or B)
...
> start0<-matrix(c(0
...
5),nrow=1,ncol=2)
> start0
[,1] [,2]
[1,] 0
...
5
> start1<-start0%*%frog1
> start1
[,1] [,2]
[1,] 0
...
4
> start2<-start1%*%frog2
> start2
[,1] [,2]
[1,] 0
...
344
> start3<-start2%*%frog3
> start3
[,1]
[,2]
[1,] 0
...
334016
By the third iteration, the starting probability distribution is converging to the
stationary distribution
...
1 for
the frog starting on pad A and p=0
...
1,0
...
1 0
...
44 0
...
6304 0
...
6643456 0
...
This can be interpreted as the process eventually “forgets” the
starting condition and no matter where it starts, converges to some stationary
distribution given the initial transition probabilities
...
175
probabilities and how to mathematically (using R) manipulate the matrix to
compute transition probabilities for k>1 subsequent states and understanding the
idea of convergence to a stationary state
...
Modeling A DNA Sequence with a Markov Chain
Specifying the Model
The first step in working with a Markov Chain model is to determine what the
model is
...
Each place in the sequence can be thought of as a state
space, which has four possible states {A, T, C, G}
...
If we follow Figure 10-3 and go in the 5’ to 3’ direction then X0=A, X1=T and
so forth
...
In other words, the nucleotide sequence is not a
series of independent nucleotides, but that each nucleotide is dependent on the
nucleotide immediately upstream
...
Perhaps we sequence a few 100 base pair stretches
from the DNA sample in question and determine the following for the
probability of nucleotides adjacent to each other
...
3
T 0
...
2
G 0
...
2
0
...
2
0
...
2 0
...
4 0
...
2 0
...
1 0
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Row 1, column 1
represents the probability of going from nucleotide A in position 0 (X0=A) to
nucleotide A in position 1 (X1=A) or P(X1=A| X0=A)
...
The transition matrix
contains all information needed to compute all the probabilities for future states
...
The
columns do not have to sum to 1 and usually do not
...
3
0
...
2
0
...
2
0
...
2
T
0
...
3
0
...
4
0
...
4
0
C
0
...
2
Figure 10-9
Determining Convergence and Stationary Distributions
Let’s go ahead and enter this model in R to calculate a stationary distribution
using R’s capabilities to multiply matrices:
> DNA
1,0
...
1,0
...
2,0
...
8,0
...
4,0
...
1,0
...
3,0
...
3 0
...
2 0
...
1 0
...
4 0
...
2 0
...
2 0
...
1 0
...
1 0
...
1567 0
...
2424
[2,] 0
...
3476 0
...
1572 0
...
2380
[4,] 0
...
3458 0
...
2497
0
...
2488
0
...
177
> DNA8<-DNA4%*%DNA4
> DNA8
[,1]
[,2]
[1,] 0
...
3497508
[2,] 0
...
3497695
[3,] 0
...
3497173
[4,] 0
...
3498298
> DNA16<-DNA8%*%DNA8
> DNA16
[,1]
[,2]
[1,] 0
...
3497689
[2,] 0
...
3497689
[3,] 0
...
3497689
[4,] 0
...
3497689
[,3]
0
...
2450017
0
...
2449286
[,4]
0
...
2496081
0
...
2496041
[,3]
0
...
2449923
0
...
2449923
[,4]
0
...
2496148
0
...
2496148
16
DNA16 (P ) appears to have converged and represents a stationary distribution
...
> DNA32<-DNA16%*%DNA16
> DNA64<-DNA32%*%DNA32
> DNA64
[,1]
[,2]
[1,] 0
...
3497689
[2,] 0
...
3497689
[3,] 0
...
3497689
[4,] 0
...
3497689
[,3]
0
...
2449923
0
...
2449923
[,4]
0
...
2496148
0
...
2496148
We can use the converged transition matrix to conclude that our stationary
distribution of nucleotides, which we represent with the letter pi, is:
π = [0
...
35 0
...
25]
We can interpret this as, given our initial distribution the posterior distribution
of nucleotides is 15% A, 35%T, 25%C and 25%G, regardless of position
...
In a sample of 1000 nucleotides from this sample,
we would expect 150 A, 350 T and 250 to be C or G
...
Statistical
comparisons of sequence elements can be made where the nucleotide
composition of different types of sequence elements is distinguishable
...
You could then perform statistical analysis to determine
if there is a significant difference in nucleotide composition in coding or noncoding regions
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
When
using MCMC, we are interested in chains that result in a stable and unique
stationary distribution
...
These terms and other important terms are explained below
...
Clearly in the
case of the Lilly pads there were only 2 states, and in the case of a DNA
sequence there are only 4 states
...
Finite does not
necessarily mean small, it only means the number of possible states is not
infinite
...
Aperiodicity
Aperiodicity means the chain is not periodic
...
Consider a three state chain {A, B, C} where the only possible -transitions are
A->B, B->C and C->A as in Figure 10-10
...
Again in time 6 and time 9 the chain will be in state A
...
Since the greatest common divisor of
these is 3, we say that this chain has period of 3
...
179
> periodic<-matrix(c(0,1,0,0,0,1,1,0,0),nrow=3,ncol=3)
> periodic
[,1] [,2] [,3]
[1,]
0
0
1
[2,]
1
0
0
[3,]
0
1
0
> periodic1<-periodic%*%periodic
> periodic1
[,1] [,2] [,3]
[1,]
0
1
0
[2,]
0
0
1
[3,]
1
0
0
> periodic2<-periodic%*%periodic1
> periodic2
[,1] [,2] [,3]
[1,]
1
0
0
[2,]
0
1
0
[3,]
0
0
1
> periodic3<-periodic2%*%periodic
> periodic3
[,1] [,2] [,3]
[1,]
0
0
1
[2,]
1
0
0
[3,]
0
1
0
If we keep going the chain will just rotate around and around without ever
converging to a unique distribution
...
Instead, we want chains that are
aperiodic
...
To illustrate
this with a counter example, consider the chain in Figure 10-11
...
If in a state space the transition
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
An example of what a stationary state transition matrix for a reducible state
space {A, B, C, D} would look like is:
0
0
...
6 0
0
0
...
7 0
0
0 0
...
6
0
0 0
...
7
In this case there are two stationary distributions
...
Therefore we desire our
systems to be IRREDUCIBLE when using them for determining a unique
posterior distribution
...
An
ergodic chain has a unique stationary distribution that is positive on all states
when the number of states is finite
...
Good mixing means the chain does not get “stuck” for periods
of time around one state
...
Reversibility
Reversibility is a special and somewhat mathematically magical property of a
Markov Chain
...
In other words, an ergodic chain can be
reversible, but does not have to be reversible
...
An ergodic Markov chain is reversible if it satisfies the detailed balance
equation:
π ( x ) p ( x , y ) = π ( y ) p ( y, x )
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
The transition probability
from x to y is p (x, y) and the transition probability of going from y to x is
p(y,x)
...
In essence the property of reversibility means that the probability of going
forward in time (or space) equals the probability of going back in time (or space)
for a given state space pair
...
Going back to the
frog example, what if the one-step transition matrix for the frog is as follows:
P0= Start
A 0
...
9
B 0
...
1
Although it takes a while (about 60 transitions, try it in R and check) the
stationary matrix for this system is:
n
P = Start
A 0
...
5
B 0
...
5
Thus the unique invariant distribution for states A and B is:
π = [0
...
5]
To check for reversibility we check the detailed balance equation:
π (A )p(A, B) = π (B)p(B, A)
0
...
9 =0
...
9
Therefore this system is reversible
...
The Stationary State
The stationary state is the gold in the mining process of Markov chain models
...
In the Bayesian sense, the stationary state
is the posterior distribution (or the posterior distribution of interest can be
determined using the stationary state)
...
182
We have seen that the existence of a stationary state is NOT guaranteed, but is
conditional on various properties of the Markov Chain
...
The convergence of Markov chains is a mathematical topic in and of itself, the
details of which are beyond our discussion
...
The package CODA
will be introduced in chapter 12 and contains functionality to perform some
convergence diagnostics
...
This is difficult to analyze graphically or to interpret, so ordinarily
we will be interested in analyzing individual variables and parameters
separately
...
Recall the discussion and
graphical illustrations in Chapter 8, which looked at these distributions for the
multivariate normal, multinomial, and Dirichlet distributions
...
Continuous State Space
So far we have considered discrete state spaces only – those where the states are
distinct such as {A, T, C, G}
...
In the case of continuous state space the transition matrix is replaced
with a transition density often referred to as the transition kernel
...
Transition probabilities are calculated with integrals, not sums
...
We
will work with continuous state space models in some of the examples used in
Chapters 11 and 12
Non-MCMC Applications of Markov
Chains in Bioinformatics
In this book, our primary interest is in working with probability models and
using Markov Chains for modeling so that we may utilize MCMC methods to
simulate posterior distributions in order to harvest results of interest
...
183
area where R and WinBugs can be very useful tools and applications presented
in this book will focus on this type of analysis
...
For these applications, the
interest is not in posterior distribution determination
...
Many books of interest are listed in the appendix for those
interested in further exploration of the use of Markov Chains in sequence
analysis; in particular the book by Durbin is devoted to this subject
...
However the
basic properties of Markov Chains introduced in this chapter apply to other
types of applications and should serve as a foundation for further study
...
This is especially true if you
come from a background using primarily analytical and more “exact” methods
of measuring and working with data
...
A major area of modern
statistics is to develop better models that more accurately reflect reality, an
ongoing process where models are iteratively refined
...
In statistics, models take
randomness and some inaccuracy into account
...
An emerging area of
bioinformatics involves modeling systems at a higher level (systems biology is
already emerging as a science in and of itself) and statistical models will play a
key role in these models
...
We focus on methods of
modeling and on how R can be used as a software tool to work with these
models
...
We will later work with
linear models and methods of organizing large data sets in meaningful ways
...
184
11
Algorithms for MCMC
This is a chapter introducing some commonly used algorithms that take
advantage of Monte Carlo methods to simulate distributions
...
These
algorithms can easily be used in R and are relatively mathematically simple yet
extremely versatile
...
However, we have also introduced the powerful techniques of Bayesian
statistics and the idea of a complicated high-dimensional posterior distribution
that is difficult to determine
...
Determining
solutions for such complex posterior distributions is beyond the limits of
traditional analytical mathematical methods, and hence the need for Monte
Carlo simulation
...
The goal now is to work with Bayesian methodology and Markov Chain
techniques to be able to simulate and study posterior distributions of interest
...
To do
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
WinBugs utilizes the Gibbs sampler, one of the algorithms covered in this
chapter
...
Therefore a focus of this chapter is to cover the theory of why the program
works
...
Monte Carlo Simulations of Distributions
Monte Carlo method in general means any technique for obtaining solutions to
problems using random numbers
...
Random numbers were first studied
in the context of games of chance and gambling
...
Before the availability of fast computers methods
of generating random numbers were unreliable or tedious
...
Monte Carlo techniques are now standard in
statistics and all sciences to solve many types of problems
...
This section explores two methods of simulating simple distributions
...
Inverse CDF Method
The inverse CDF method, or inverse cumulative distribution function, method is
the simplest method to simulate a probability distribution directly
...
A random number generator is an algorithm
that produces random sequences of the digits 0,1,2,…,9 – imagine repeated
shuffling of these digits and then drawing one digit at a time between shuffling
...
If many such numbers are generated the resulting sample is
that of a uniform distribution with range [0,1]
...
The inverse CDF method transforms a set of uniform(0,1) random draws so that
the resulting values represent random draws from any desired distribution
...
ecall
also from precalculus mathematics that if f is a 1 to 1 function with domain A
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
B
A
y=f(x)
x
Figure 11-1
Of course, for a probability distribution function we know that the range of
values in B is between 0 and 1 (obeying the probability theory that all
probability values are between 0 and 1)
...
In R the function runif can be used to draw a random
uniform sample and then to transform the sample by the inverse CDF method to
obtain a simulation of the desired distribution
...
Recall that for the exponential, the
CDF is
F(x)=1-e
-λx
Letting F(x)=u
u=1-e
-λx
Solving for x
1-u= e
-λx
log(1-u)=-λx
x=-
1
λ
log(1 − u )
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
We can call this function with the desired n, the number of values
to be randomly drawn, and parameter lambda desired
...
In fact R
utilizes the inverse CDF method internally to provide simulation of several
distributions
...
>
>
>
>
>
x1<-simExp(1000,2)
x2<-rexp(1000,2)
par(mfrow=c(1,2))
hist(x1,30,main="Inverse CDF")
hist(x2,30,main="Direct Simulation")
Figure 11-2
Rejection Sampling
The second method of using Monte Carlo methods to simulate a distribution that
we will look at is called rejection sampling
...
This is the case even with some univariate distributions such as
the beta and gamma
...
Rejection sampling is a technique of approximating a distribution using another
close distribution for the actual sampling and a rule to determine (accept or
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
The
distribution sampled from is sometimes called an envelope distribution, since it
must frame the distribution of interest and thus contain all values for the
distribution of interest
...
Let’s simulate via
rejection sampling a beta distribution with alpha=2 and beta=2
...
f (x) =
Γ(α + β ) α −1
x (1 − x ) β −1
Γ(α )Γ ( β )
For alpha=2, beta=2 this is
f ( x ) = 6 *x *(1-x)
We note from the plot of a beta (2,2) in Figure 11-3 that the maximum of the
beta curve is 1
...
Therefore we can use a multiple of the uniform pdf with y=1
...
Here
we simply sample from a uniform(0,1) ignoring the multiplicity factor 1
...
Figure 11-3
The method we will use to do the rejection sampling is as follows:
1
...
The
multiplier is M=1
...
Draw a sample u from the uniform 0 to 1
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Compute f(x) – that is beta(2,2) pdf evaluated at the value x
4
...
Otherwise reject x
and repeat steps 1 through 4
...
Here we give an example of 5000 random draws:
>
>
>
>
x <- runif(5000,0,1) # Step 1
u <- runif(5000,0,1) # Step 2
fx <- dbeta(x,2,2)
# Step 3
acceptx <- x[fx > 1
...
8))
> points(x,fx)
Figure 11-4 is convincing evidence that the rejection sampler works
...
0
0
...
0
Density
1
...
0
0
...
4
0
...
8
1
...
Since then, probably no other
algorithm has received so much attention and become so popular in statistical
applications
...
Basically, the Gibbs sampler is a tool for sampling from a multivariate
distribution, say of a m-dimensional set of variables U1,…,Um
...
190
iterating through the individual variables Ui by sampling from the full
conditionals f(Ui| U(-i) and then updating, and repeating this for all Ui
...
Here the full conditional is the
conditional distribution of a single parameter given the data and all the other
parameters
...
Indeed, many times this is the only way to look at such distributions
especially beyond 3 dimensional distributions since it impossible to visualize or
graph higher-dimensional distributions
...
A high dimensional posterior distribution can be broken down
into conditional distributions for each of the parameters of interest
...
For
example, let’s say we have a hypothetical distribution which has three
parameters that we will call A, B, and C
...
The Gibbs sampler
enables us to obtain samples from this distribution by simply using the three
full-conditional distributions: p(A|B,C), p(B|A,C) and p(C|A,B)
...
Ultimately we will not be studying the joint posterior distribution of
the sampled values but rather marginal distributions of individual parameters
since that is what we are mainly interested in
...
Essentially the Gibbs sampler is a Markov Chain process, illustrated in Figure
11-5 for a two parameter, x and y scenario
...
To illustrate use of the Gibb’s sampler we present two examples
...
In this case the posterior distribution of
interest is known, but all of the steps can be illustrated so this example is
presented as a review of probability and as an illustration of the Gibb’s sampler,
which can be easily followed
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
y2
yn
Figure 11-5
A Gibbs Example Using Simple Probability
To illustrate the Gibbs sampler, let’s do a simple experiment using only basic
probability distributions
...
3
0
...
1
2
B
2
0
...
1
0
...
The marginal
probabilities of A are illustrated in Table 11-2
...
5
0
...
2
The marginal probabilities of B are illustrated in Table 11-3
...
6
2
B
0
...
From the joint distribution the full
conditional distribution of B given A is given in Table 11-4
...
6
...
192
Table 11-4
A
1
3
1
0
...
667
0
...
4
0
...
5
Likewise, the full conditional distribution of A given B is presented in Table 115
...
5
...
Table 11-5
P(A|B)
1
3
1
0
...
333
0
...
5
0
...
25
Here is a little program in R, which uses the Gibb’s sampler to compute the
marginal distributions of A and B, by sampling from the full conditionals
...
>
>
>
>
>
>
>
>
#initialize a and b to (1,1) first row, first column
a<-1
b<-1
#initialize variables to hold results
margA<-NULL
margB<-NULL
joint<-matrix(c(0
...
2,0
...
1,0
...
1),nrow=2,ncol=3)
samples<-matrix(0,nrow=2,ncol=3)
Next, run a loop of 1000 runs which chooses values for the variable conditioned
on using the sample () function (not to be confused with the samples results
matrix variable!) and then stores the selection in the appropriate marginal results
variable using the append function
...
193
The samples matrix contains the counts of a’s and b’s from the sampler above:
> samples
[,1] [,2] [,3]
[1,] 288 206 117
[2,] 194
95 100
Converting the samples data to probabilities and comparing with the original
joint distribution shows the results obtained using the iterative Gibb’s sampling
from the conditional distributions reproduced the joint distribution
...
288 0
...
117
[2,] 0
...
095 0
...
3 0
...
1
[2,] 0
...
1 0
...
Compare these results to the marginal distributions in Tables 11-2 and
11-3
...
Dist of A")
> hist(margB,main="Marg
...
In chapter 8 we only considered the case where X and Y are
independent
...
That is, there is a statistical
relationship between X and Y
...
If rho=0 then the
variables are uncorrelated
...
194
and Y
...
If the correlation coefficient is considered, the joint probability distribution of
two normally distributed random variables can be written as:
0 1
(X,Y)~ N ,
0 ρ
ρ
1
This means X and Y are distributed normally with means 0 and 0 (the first
parameter matrix) and variances of σ2 and correlation of ρ between XY
...
>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
gibbsBVN_function(x,y, n, rho){
#create a matrix to store values
m<-matrix(ncol=2,nrow=n)
#store initial values in matrix
m[1,]<-c(x,y)
#sampling iteration loop
for (i in 2:n){
#rnorm takes sd not variance
#update x conditional on y
x<-rnorm(1,rho*y,sqrt(1-rho^2))
#update y conditional on x from above
y<-rnorm(1,rho*x,sqrt(1-rho^2))
#store values in matrix
m[i,]<-c(x,y)
}
m
}
(0)
This works because it is a Markov chain
...
If X =x0
(n)
2n
4n
then the distribution of X is N(ρ x0, 1-ρ )
...
Therefore after
enough runs, no matter where we start X and Y the marginal distributions of X
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Let’s run this function with different starting values for X and Y, using rho=0
...
Figure 11-6
Next, let’s see how the sampler behaves using different correlation coefficients
(rho values)
...
>
>
>
>
>
>
>
>
>
corr0<-gibbsBVN(0,0,1000,0)
corr3<-gibbsBVN(0,0,1000,0
...
5)
corr98<-gibbsBVN(0,0,1000,0
...
3")
plot(corr5[,1],corr5[,2],main="XYCorr=0
...
98")
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Let’s
plot marginal distributions for some of these distributions
...
3")
hist(corr3[,2],nclass=20,main="Y Marg, Corr=0
...
98")
hist(corr98[,1],nclass=20,main="Y Marg, Corr=0
...
This is not only a small study in how correlation coefficients affect the bivariate
normal, but also an important point about the Gibb’s sampler
...
Although
in this particular example we can look at the joint distribution, since it is only
two dimensional, in many cases of higher-dimensional distributions we cannot
study directly the joint distribution
...
Note in Figure 11-6 if
you look only at the X or Y axis and the distribution of data along one axis, it is
normally distributed and the joint distribution pattern and correlation of
variables do not matter when we are studying X or Y individually
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Our interest is in any
multivariate distribution of the n parameters θ1, θ2, θ3,… θn
...
To perform the Gibbs sample follow the following
algorithm:
(0)
(0)
(0)
(0)
1
...
Simulate a value for θ1
(1)
from the full conditional θ1|θ2 , θ3 ,… θn
3
...
…
...
Simulate a value for θn
6
...
(1)
(0)
(0)
(0)
(1)
(0)
(0)
(1)
(1)
from the full conditional θn|θ1 , θ2 ,… θn-1
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Usually the Gibbs sampler will be useful, but if the full conditional
distributions cannot be sampled from there needs to be an alternative
...
In this case a
more general algorithm, the Metropolis-Hastings algorithm can be of help
...
The
Gibb’s sampler is actually a special case of this algorithm where the sampled
value is always accepted
...
For both algorithms we will be concerned with two distributions
...
The proposal distribution is in essence a transition probability distribution
as we discussed in the context of Markov chains
...
Often there are many proposal distributions that can be used –
pick the simplest one! Sometimes the proposal distribution is called a jumping
or sampling distribution
...
This is our distribution π or stationary
distribution we get as a result of an ergodic Markov chain
...
The beauty of the Metropolis algorithm
lies in the fact that for the acceptance/rejection criterion we only need to
calculate ratios of the posterior densities, which is much easier than calculating
the posteriors densities themselves or simulating from the joint posterior
distributions
...
In general terms (which
will be translated into specific distributions later) what we do is:
1
...
Calculate the ratio of the posterior of the new value to the old value =
p (θ * | data )
p (θ old | data )
Note since p (θ | data ) ∝ f ( data | θ ) p (θ )
f (data | θ *) p (θ *)
this is equals R =
f (data | θ old ) p (θ old )
R=
3
...
Accept the new value θ* if u < min (1, R)
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
This can be thought
of as looking for hills on a flat landscape
...
Rule 1 – Moving UP a hill is always good
...
For example, in Figure 11-9 moving from a value of about 0
...
5 is good
...
5/0
...
Therefore
the min(1,R)=min(1,3)=1
...
Rule 2 – Going SLIGHTLY downhill is OK, as illustrated in Figure 11-10
...
4/1
...
82
...
82
...
82, so usually a slight downhill move will be accepted although not
always
...
200
Start
Stop
Figure 11-10
Rule 3 – Dropping off a cliff is generally not good
...
3/1
...
176
...
6% chance of
being this small
...
Start
Stop
Figure 11-11
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
The distributions we are interested in are more complex and ratios more
complex as well, but the general idea conveyed with the hill climber story is the
basis for how both the Metropolis and Metropolis-Hastings algorithms work
...
It is also of interest to note that many
related algorithms exist as well and developing more specific versions of these
algorithms is an active area of statistical research
...
, as in Figure 11-12
...
In a symmetric distribution the ratio going up or down the hill
doesn’t matter which side of the hill you are on, whereas in a non-symmetric
distribution the R values will be different for different sides of the hill
...
Figure 11-12
Because of the proposal distribution symmetry, with the Metropolis algorithm,
the acceptance ratio R depends only on the value of the ratio of target
distribution values
...
To make the general algorithm given earlier specific for the Metropolis
algorithm
1
...
202
2
...
That is:
R=
p(θ *)
p(θ )
3
...
Accept the new value if u < min (1, R)
...
The constant updating of the values of the posterior distribution is a Markov
chain that results in a stationary posterior distribution
...
We assume that we have one data outcome y=3 from a
binomial with n=5 trials
...
We have seen earlier that the posterior distribution in this case is
a beta distribution with parameters alpha=1+y=4, and beta = 1 + n –y = 3, and a
posterior mean of 4/7 =
...
Without knowing this posterior distribution, for
the Metropolis algorithm, we use the simple representation:
f (θ | Y = 3) ∝ Pr(Y = 3 | θ ) f (θ ) = θ 3 (1 − θ )2 ⋅1
First, initialize values
...
04 for the first run of the
Metropolis algorithm, an arbitrary starting value
...
04
Next run a loop sampling thetaStar from the uniform distribution, a symmetric
proposal distribution, and calculating the value of the ratio of
p(θ*|Y=6)/p(θ|Y=6)
...
Iterate through the loop the predetermined nchain
length
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
The first 100 runs are depicted
separately and you can see that from runs 0 to 20 the chain is more erratic in
behavior and takes about 20 cycles reach a more stable state
...
Typically we discard data from the burn in period –
often it is the first few hundred runs
...
In more complex scenarios it is hundreds of runs that compromise the
burn-in period
...
Time series plots also are good visual diagnostics of how well a chain mixes
...
There are no hang-ups where the
chain spends unusually long periods of time in areas that are substantially above
or below the mean value of the posterior distribution As a check we can plot a
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
0
0
...
0
Histogram of theta[101:1000]
0
...
4
0
...
8
1
...
Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm extends the use of the Metropolis algorithm
to situations where the proposal distribution is not symmetric
...
Using the Metropolis Hastings
algorithm the acceptance ratio is:
R=
p(θ *) q (θ | θ *)
p(θ ) q(θ * | θ )
The added term is the ratio of the proposal distribution q of theta (old value)
conditioned on the new value thetaStar over the value of q of thetaStar
conditioned on theta
...
We will not do a full example using the Metropolis Hastings algorithm here, but
examples are provided in many of the references listed in the appendix and other
examples abound in the research literature
...
205
Putting it All Together: Markov Chain
Monte Carlo
In Chapter 10, we learned that Markov chains which are ergodic (aperiodic and
irreducible) converge to a stationary equilibrium distribution
...
The two Markov chain methods discussed in this chapter are often the only way
we have of implementing many Bayesian methods to determine a posterior
distribution of interest
...
Gibbs versus Metropolis-Hastings
As mentioned previously, the Gibbs sampler is actually a special case of the
Metropolis-Hastings algorithm, a statement that may make no sense if no one
tells you that with the Gibbs sampler the proposal distribution is the target
distribution
...
The Gibbs sampler has many advantages over the Metropolis-Hastings sampler
in that it is computationally much simpler
...
This is because the Gibb’s only computes from the conditional distributions,
which is less computationally intense
...
The next chapter introduces WinBugs, which is a program you can use
in conjunction with R that utilizes the Gibbs sampler
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
These issues apply to both Gibbs and
Metropolis-Hastings algorithm
...
All of these issues are
areas of active statistical research
...
Mixing
refers to how well the chain moves, or speed at which chain forgets its past
...
It is generally
impossible to tell in advance how well a chain will mix
...
Rapid mixing can indicate that the chain requires a
shorter number of runs to reach a posterior distribution
...
Some
suggest many short runs (Gelfand and Smith, 1990), whereas others suggest
several long runs (Gelman and Rubin 1992), and others suggest one very long
run (Geyer 1992)
...
No official comment is made here about which method is correct,
but running more than one chain is generally a good idea
...
How long the burn in is varies tremendously
...
Therefore the run
length (number of times you run the chain) should be long enough to ensure than
chain has indeed reached a stationary state
...
If in doubt, do more runs
...
Many models will work, but some may be slow to converge
...
207
runs
...
This can be done using the
package CODA
...
208
12
MCMC using BRugs
This chapter applies the theories of MCMC and Gibbs sampling by introducing
the use of the package BRugs and associated package CODA
...
About BRugs
Package BRugs incoroporates what used to be a separate program (WinBUGS,
still available from the website http://www
...
cam
...
uk/bugs/
...
BUGS is an acronym for Bayesian analysis
Utilizing Gibbs Sampling
...
This version is also still available to interested
users
...
Genetics aplications include including linkage analysis, recreating
evolutionary trees, pedigree analysis, haplotype analysis, and other applications
...
Other applications of WinBUGS in bioinformatics are
fathomlessly possible, and those skilled in Bayesian modeling can take
advantage of WinBUGS to assist with MCMC analysis
...
Instead, convergence diagnostics are performed using
the CODA package, a package designed for use in S-Plus or R
...
209
ABO Blood Types Example
The Model
Different markers determine blood types in humans
...
Humans are typed according to one of the four
possible types: A, B, AB, and O
...
For the ABO blood type, there are three possible alleles, which
we will refer to as A, B, and O
...
In this system there is a pattern of co-dominance, where A and B are codominant
...
This means that if a person has the AB
allele combination, he/she will be AB blood type since both markers show up
(co-dominance)
...
Likewise if a person is type BO only marker B shows up
...
We call the allele combination a person has a genotype, with possible genotypes
for blood types being AA, AB, BB, AO, BO, and OO
...
If we consider the three possible alleles in the gene pool as proportions and
assume these are the only alleles in the gene pool, the sum of proportions is 1
...
p+ q + r =1
Mathematically we can relate the allele frequency to phenotype frequency using
the Hardy-Weinberg equilibrium for a three-allele locus:
2
(p+ q + r) =1
Doing the algebra:
2
2
2
p + 2pr + 2pq + q + 2qr + r = 1
Table 12-1 summarizes the relationship between alleles (p, q, r), genotype
frequencies, genotypes and phenotypes for the blood types
...
210
Table 12-1
Genotype
Frequencies
Genotype
Phenotype (Blood
Type)
2
AA
A
2pr
AO
2
BB
2qr
BO
2pq
AB
AB
2
OO
O
p
q
r
B
The easiest data to collect on blood types for a population under study is the
phenotype data
...
However, it is not technically or
mathematically easy to determine frequencies of the individual alleles
...
Therefore a Bayesian MCMC method works better to solve this type
of problem
...
Recall that the discussion in
Chapter 8 of using the multinomial distribution to model the distribution of
phenotype data, based on phenotype counts
...
Using Hardy-Weinberg equilibrium we can convert the
proportions of phenotypes to allele proportions in terms of p,q, and r
...
211
P (A, B, AB, O) α (p 2 + 2pr ) nA (q 2 + 2qr ) nB (2pq) nAB (r 2 ) nO
Our prior distribution of interest is the distribution of individual alleles A, B, O
that are modeled respectively with p, q, and r
...
For this
example we have no specific knowledge of the proportions so we will use a
noninformative Dirichlet prior assuming all proportions are equal (the
equivalent of a beta (1,1) distribution for all priors)
...
We note that the posterior is of the form of a high-order polynomial
in p,q,r which is actually a complicated mixture of several Dirichlet
distributions
...
The chain iterates like this updating each individual parameter for
that step by sampling from the full conditional distribution
...
If the chain is ergodic the chain will reach
a steady state distribution that is our posterior distribution of interest, the
posterior distribution of the alleles given the phenotype count data
...
212
Running the Model in BRugs
Create the Model
Open a new notepad document and enter the following model:
model BloodTypes {
##Prior for allele frequencies
#Gamma trick to generate uniform Dirichlet dist
a~dgamma(1,1)
b~dgamma(1,1)
o~dgamma(1,1)
#Scaled frequency
#frequence allele
p<-a/(a+b+o)
#frequency allele
q<-b/(a+b+o)
#frequency allele
r<-o/(a+b+o)
of alleles
A
prior
for each allele
B
O
##Multinomial likelihood models data
#A phenotype frequency
x[1]<-p*p+2*p*r
#B phenotype frequency
x[2]<-q*q+2*q*r
#AB phenotype frequency
x[3]<-2*p*q
#O phenotype frequency
x[4]<-r*r
n[1:4]~dmulti(x[ ],total)
Save this document as “bt
...
Create the Data Set
Likewise, create a small file “btData
...
#data
list(n=c(750,250,75,925),total=2000)
#inits
list(a=1,b=1,o=1)
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
To do this load the BRugs package and
type the following:
> modelCheck("bt
...
Note that the
data are given as a list
...
> modelData("btData
...
> modelCompile(numChains=2)
model compiled
Initialize Values
With a successfully compiled model, you are now ready to initialize values of
parameters in preparation for running the sampler
...
txt” containing the
following:
#inits
list(a=1,b=1,o=1)
Do the initializing in R with function modelInits
> modelInits("btInits
...
The
function samplesSet tells which parameters should be monitored and the
modelUpdate function runs the sampler the specified number of times:
> samplesSet(c("p","q","r"))
monitor set for variable 'p'
monitor set for variable 'q'
monitor set for variable 'r'
>
>
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Remember that our result of interest is the Dirichlet posterior joint
distribution of p, q, and r – the allele proportions for the blood type alleles given
the data
...
To produce such a plot, simply use function
samplesHistory() with the parameters of interest
...
Figure 12-1: Time Series of Parameter r
A time series trace gives quick visuals check for two things – how well the chain
mixes and whether the chain has converged
...
Note that a
time series trace is NOT a formal statistical analysis, but a visual check, and
although quite good at diagnosing good and bad runs and convergence, should
not be used as the sole diagnostic criteria
...
samplesDensity("r", mfrow=c(1,1))
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
These results can also be used in statistical inference testing
comparing parameters from different models, etc
...
Function samplesStats will give this information for all parameters of interest:
> samplesStats("*")
mean
sd MC_error val2
...
5pc start sample
p 0
...
007205 0
...
22100 0
...
24860
502
998
q 0
...
004643 0
...
07661 0
...
09471
502
998
r 0
...
007752 0
...
66640 0
...
69570
502
998
CODA
The most important diagnostic to do with MCMC output however is to make
sure the chains have really converged
...
CODA, the R package that is used in collaboration with BRugs
will utilize various convergence diagnostics techniques
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Often the reason for doing this is to fit a
mathematical model to the data, which can be either a probability model or some
type of predictive model, like a regression model, as discussed in Chapter 15
...
The value derived from fitting a useful model
is often the payoff of laborious experimentation and data collection
...
It should be noted here that
classical and not Bayesian techniques are presented in this and subsequent
chapters, but that there are parallel Bayesian methods
...
Therefore the emphasis here is on a review of topics and
how they work in R
...
Sampling Theory
Usually when we collect data we are “sampling from a population”
...
In practice, except in unusual circumstances, this
is impossible
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
In statistics, a good sample is
characterized by being random and therefore representative of the population at
hand
...
This means that whatever the ith outcome of the sampling process is, the next (ith
+ 1) outcome will have no dependence on the ith outcome
...
Outcomes are viewed as realizations of a random variable and therefore the
outcomes for any experiment share the same underlying probability distribution
...
i
...
”
which stands for independent and identically distributed
...
It should be intuitive that the larger the sample, the
better the sample in terms of using the data to characterize the underlying
population
...
Figure 13-1: Effect of Sample Size on Approximating the
Underlying Distribution
After a sample is collected, it is mathematically characterized by certain
functions of the sample data
...
Some statistics you are likely already familiar with and that were discussed in
Chapter 5 include mean, median, variance, etc
...
218
in great depth in most elementary statistics courses and books
...
A sample mean is simply the average of the data
...
R will calculate a
sample mean using mean(x) where x is the data vector
...
If you take repeated data
samples from the same underlying population and calculate sample means, the
distribution of the sample means will follow the central limit theorem,
commonly known as the law of averages
...
The law of averages holds for any underlying population distribution, even
though the data themselves may be far from normally distributed
...
25 (rate=lambda in R) (see Chapter 7) and calculate means of
each sample
...
> #First a graph of the distribution from which to sample from
> e<-seq(
...
1)
> plot(e,dexp(e))
Figure 13-2: Exponential with lambda=4
...
Note that because
generated data are all i
...
d
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
25),nrow=50)
> n20<-matrix(rexp(50*20,rate=0
...
25),nrow=50)
> n200<-matrix(rexp(50*200,rate=0
...
2,0
...
6,0
...
2,0
...
6,0
...
2,0
...
6,0
...
2,0
...
6,0
...
With this smallest sample size the calculated sample means do
not show a normal distribution tendency
...
With sample size of n=200 a normal
distribution pattern for the distribution of sample means is clear
...
220
Figure 13-3 also illustrates another important point about the distribution of a
sample mean
...
This is because the measure of variation, the standard deviation,
which when applied to repeat samples of sample means is called “standard error
of the mean”, is inversely proportional to the square root of the sample size
...
Error ( X ) =
σ
n
In other words if you want to increase precision of your sample mean, increasing
the sample size always works
...
We will discuss this issue again when we discuss
hypothesis testing in the next chapter
...
Mathematically (s) is the square root of the average of squared
deviations from the mean
...
s =
∑ (Xi − X )2
n −1
In addition to measuring the mean and variability of the data, we often like to
compute parameters of the underlying distribution, a process referred to as
parameter estimation
...
Statistics
include both parameters and other metrics of the data
...
In coming sections of this chapter we discuss two ways to estimate parameters –
point estimates and interval estimates
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
The criteria “degrees of
freedom” which is based in part on sample size, is part of the defining parameter
for plotting a sampling distribution
...
Every statistic has
it’s own sampling distribution – mean, mode, median, etc
...
These common distributions serve as the basis for many
statistical inference tasks
...
This distribution is the basis for t-testing, comparing means drawn from
different samples, and we use it in hypothesis testing
...
Recall from Chapter 7 the discussion of the normal distribution and that a
random variable (X) following the normal distribution may be transformed to a
Z-score with the following relationship where the distribution of Z becomes the
standard normal distribution with mean of 0 and standard deviation of 1
...
This estimated standard deviation is a random variable based on
sample size
...
A
standardized t-score takes the following form:
t=
X−µ
s
If the data values x1,…,xn follow a normal distribution, then we call the
distribution of the corresponding t scores a t-distribution with n-1 degrees of
freedom
...
The t distribution also applies to
the sampling distribution of sample means as follows:
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Note that µ
where
is both the mean of the data as well as the mean of the sampling distribution of
the sample mean
...
X is
The t-distribution is symmetric like the normal and has mean 0 but its standard
deviation is > 1
...
Remember, that by the law of averages, these values approximately follow a
normal distribution
...
> t<-(n200means-
4)/sd(n200means)
Next, let’s plot these t-scores and overlay a curve for a standard t-distribution
with n-1 degrees of freedom (where n=sample size)
...
> hist(t,prob=T,main="Standardized data wtih overlay of t-distribution")
> curve(dt(x,49),add=T)
Figure 13-4 shows the resulting plot
...
223
0
...
0
0
...
3
0
...
As an aside, the term “degrees of freedom” is a description of the number
of observations that are free to vary after the sample statistics have been
calculated
...
So the
degrees of freedom left over is n-1, used in the parameter for the t-distribution
...
It is a confusing concept, and pay attention to how it is calculated for a particular
distribution
...
In
statistics this is referred to as an asymptotic approximation
...
This is correlated with the
earlier discussions of when the sample size is smaller it is more variable and the
mean is less precise (review Figure 13-3)
...
1)
par(mfrow=c(2,2))
plot(x,dnorm(x),type='l',ylab="",main="df=2")
lines(x,dt(x,df=2),lty=2)
plot(x,dnorm(x),type='l',ylab="",main="df=5")
lines(x,dt(x,df=5),lty=2)
plot(x,dnorm(x),type='l',ylab="",main="df=10")
lines(x,dt(x,df=10),lty=2)
plot(x,dnorm(x),type='l',ylab="",main="df=20")
lines(x,dt(x,df=20),lty=2)
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
In each case the solid line is the
normal distribution and the dashed line is the t-distribution
...
Indeed when sample size is roughly 30 or so (a specific number
is subject to debate) we often use the normal distribution instead of the tdistribution because of this close approximation
...
It is important to review as well that as a probability distribution the area
under the curve for any t-distribution is always 1
...
The Chi-Square distribution
indirectly models the sample variance
...
225
The Chi-Square is a sampling distribution because it depends on the sample size
through n-1 degrees of freedom
...
In R the density curve of the Chi-Square distribution
is called by the following function:
dchisq(x, df)
Figure 13-6 plots the Chi Square distribution with 2, 4 and 9 degrees of freedom
...
Figure 13-6: Chi-Square Distribution with 2, 4, and 9 Degrees
of Freedom
The Chi-Square, like any other probability distribution, has an area under the
curve of 1
...
Computing specific values of the Chi-Square is
difficult to do directly, so to compute such values tables or computers are used
...
The F Distribution
The Chi-Square distribution is not so interesting in and of itself, although testing
using the Chi-Square has wide applications
...
226
Square distribution
...
Let U and V be independent random variables, where U has a
Chi-Square distribution with m degrees of freedom and V has Chi-Square
distribution with n degrees of freedom
...
At first glance, the F distribution may make no sense or seem useless
...
What the F ratio creates is a testable “signal to
noise” ratio
...
Computing the ratio of variability and modeling it with
an F distribution provides a criterion to determine if the experimental procedure
achieves a significant effect over the background noise level
...
In bioinformatics, F ratios are used
extensively in microarray data analysis
...
Meanwhile, let’s work with the F distribution using R
...
The formula for calling
the density curve of the F distribution in R is:
df(x, df1, df2)
Where df1 is the numerator degrees of freedom and df2 is the denominator
degrees of freedom
...
The following code produces the F distribution plots in Figure 13-7:
>
>
>
>
>
+
+
+
>
>
x <- seq(
...
005)
m <- c(1,5,10,30)
n <- c(1,5,10,30)
par(mfrow=c(4,4))
for (i in 1:4){
for (j in 1:4) {
plot(x,df(x,m[i],n[j]),type='l',ylab="f(x)",cex=
...
227
Figure 13-7: F distributions with varying df1, df2
Like for the Chi-Square, specific percentiles of the F distribution are difficult to
determine analytically
...
We will use this function extensively when we
discuss ANOVA
...
Parameter Estimation
Given our sample data we want to fit a model to the data
...
In order to do this, we
need to determine the best fitting parameters for the particular model we have in
mind
...
Parameter
estimates take two forms, point estimates and interval estimates
...
Interval estimates
have merit in quantifying how precise a parameter estimate is
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Parameter
estimation is a key topic in mathematical statistics and selected appendix
references are listed which cover this topic in mathematical detail for the
interested reader
...
Earlier in this chapter we calculated means of samples from
the exponential
...
We can also measure based
on the samples of data the standard error of the mean and model its distribution
as well (proportional to a Chi Square)
...
If in many repeated
(conceptual) samples this difference averages out to zero, we say that the point
estimate is unbiased
...
We will only work in detail with maximum
likelihood estimates specifically here, but a good mathematical statistics book
will cover the other methods in great depth
...
Table 13-1 lists some
common point estimates of parameters that can be calculated with simple
algebra or using basic functions in R
...
229
However what if we want to estimate more complicated parameters? For
example, calculate point estimates for the alpha and beta parameters of a gamma
distribution? Reviewing the formula in Chapter 7 for the gamma distribution
indicates that there exist no simple calculations of point estimates of these
population parameters
...
Often the maximum likelihood method (MLE)
works best
...
The principle of maximum likelihood estimation is
that somewhere on the range of parameter values, the likelihood function hits a
maximum value
...
Recall from calculus that a maximum value occurs when the first derivative of a
function is set to zero
...
Figure 13-8 illustrates the analytical MLE solution for a binomial
parameter
...
lik ( p ) = p x1 (1 − p )1− x1 p x2 (1 − p )1− x2 ⋅
...
A common use of maximum likelihood estimation in genetics is to estimate
parameters for the Hardy Weinberg equilibrium model for the distribution of
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
In a given population gene pool, for a
two allele gene locus, alleles A and a are at frequencies p and q, respectively
...
The likelihood for this
follows a multinomial probability distribution that we can model, where
parameter theta,θ=p (and therefore 1-θ=q, theta between 0 and 1) as:
L(θ)α(θ) 2nAA ( 2θ(1 − θ)) nAa (1 − θ) 2naa
Letting n1=nAA and n2=nAa and n3=naa and logarithms (does not matter which
base):
Log L(θ)=2n1log(θ) + n2log (2θ(1-θ)) + 2n3log(1-θ)
Eliminating the multiplicative term:
Log L(θ)= 2n1log(θ) + n2log (2) + n2log θ + n2log (1-θ) + 2n3log(1-θ)
To analytically solve for the maximum likelihood take the first derivative of this
(which we designate with a lowercase letter “l” prime):
l’(θ)=2n1/θ + n2/θ - n2 /(1-θ) - 2n3 /(1-θ)
To find the value of theta that maximizes the above we could solve it
analytically (which is not hard in this case, but with other models is often
analytically impossible) by setting l’(θ) =0 and solving for theta, or we can solve
it using R and finding the maximum value of the likelihood function over a grid
of theta values, which is very simple and illustrated below
...
Note that simply calculating theta from the proportion of AA does not work,
since this does not account for the information about theta contained in the rest
of the data
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
4562
> #MLE value for theta (corresponding vector index to lik[534])
> theta[534]
[1] 0
...
54,y=-2000,legend="MLE theta=0
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
There are a lot of R packages
with specific applications that utilize maximum likelihood estimation
...
Maximum likelihood methods are covered in
most mathematical statistics texts
...
That value in and of itself says nothing about how accurate or precise the
estimate is
...
Often it is good to use both a point estimate and an interval estimate for a
parameter under study
...
Other types of interval
estimates exist, such as predictive intervals and tolerance intervals
...
The classic and most common method of constructing a confidence interval
requires a point estimate of the parameter and follows the general formula:
point estimate +/- standard error of point estimate * test statistic
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
The width of the interval depends
on the standard error and the confidence level of the test statistic chosen
...
This is because the standard error is a measure of variation in
the date on which the point estimate is made
...
We define the level of a confidence interval to be 1-alpha,
which produces a (1-a)⋅100% confidence interval, where alpha is the error level
...
An example
below will demonstrate this
...
05 is the norm for most confidence
intervals
...
Although R doesn’t have a generic function to compute confidence intervals,
knowing the basic formula above and the specifics of what you’re making a
confidence interval of, it’s vary easy to construct confidence intervals in R
...
To do this we can
use R to first generate a random sample:
> #Random sample of 20 from standard normal
> x<-rnorm(20,0,1)
> #Computer mean and standard error
> xBar<-mean(x)
> s <- sd(x)
> se<- s/(19^
...
1041
> se
[1] 0
...
To do a 95% confidence interval, the alpha level is 0
...
We distribute this evenly between the upper and lower confidence
limits, so for a symmetric distribution like the t-distribution, we can use as a test
statistic the 0
...
> test<-qt(0
...
093024
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
6947
> clUpper<-xBar+se*test
> clUpper
[1] 0
...
695, 0
...
But what does this mean? The way a
confidence interval is interpreted is as follows: if we repeated this experiment
(drawing random samples of size 20 from a standard normal) and made
confidence intervals for each sample mean point estimate, then 95% of the time
the confidence interval derived would contain the true mean
...
Making confidence
intervals for other point estimates follows the same algorithm discussed here
...
Bootstrapping
To motivate the technique of bootstrapping, consider the following scenario
...
Because
you are wise and look at the distribution of data before performing further
analysis, you do a histogram and get the distribution depicted in Figure 13-10
...
235
Clearly this is not a nice distribution pattern in terms of being able to fit a
standard probability model
...
Hence, a more important objective of study is primarily to understand the
variability of the expression level for this particular gene
...
This is a case where the bootstrap can be a
useful technique
...
It is a simple, yet powerful, computational technique that in recent
years has shown explosive growth in applications
...
More applications of the bootstrap are sure to come, but here we focus
on the basics of the bootstrap and how to do basic bootstrapping in R
...
Parametric bootstrap techniques assume that the data are generated
from a standard parametric probability model (such as the normal, Poisson, and
other models discussed in earlier chapters)
...
Nonparametric bootstrap techniques are more
versatile and suit our example situation better
...
Nonparametric Bootstrapping
The beauty of the nonparametric bootstrap is that, since there are no
assumptions of the underlying model, you can apply it to any dataset
...
Our goal is to determine how
variable the expression of the gene is given the data
...
So how should we go about
determining the standard error?
The bootstrap estimates the standard error of a metric (such as a parameter
estimate, mean, median, etc) by repeatedly drawing bootstrap samples from the
original data
...
For each bootstrap sample,
the metric is measured
...
Then, the standard error of the metric under study is measured using the
observed variation of the bootstrap samples
...
Although it may sound too easy and simplistic, this is a very robust and
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
presents a schematic of the bootstrap process
...
The empirical probability distribution assigns an equal probability to
each of the data points, 1/n
...
Using the cdf is what allows us to not rely on
a particular probability model
...
From this package you can simply plot the
empirical cdf with the plot
...
Let’s do this for our
dataset, with the resulting plot in Figure 13-12
...
> #use package step fun to plot CDF
> plot
...
5,lty=2)
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
The
median provides a good central measure for a distribution of unknown form,
which is the case here
...
> hist(bootmedstar,nclass=40,xlab="median",main=
+ "Dist
...
238
Figure 13-13
Next, the easiest thing to do to get a sense of variability of the data is to simply
take the standard deviation of the bootstrap sample medians:
> sd(bootmedstar)
[1] 0
...
Such a measure could serve as a test statistic to compare with variability
of expression of other genes, or of the same gene under different conditions
...
There are many other uses of the bootstrap, but most involve the
theme of hunting for a measure of standard error or variability for a dataset with
an unknown distribution
...
Both packages are worth looking at and include various applications of the
bootstrap and further examples
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
This chapter continues on
the topic of statistical inference, discussing hypothesis testing and how to
perform hypothesis tests using R
...
Many
introductory statistics texts cover hypothesis testing in great depth and almost
any text should serve as a good reference to the interested reader
...
Hypothesis testing is the foundation of the
scientific method and therefore crucial to empirical data analysis
...
Luckily the
different tests follow the same basic theory
...
Basic Theory
Hypothesis testing begins with the formulation of a null hypothesis (H0) and an
alternative hypothesis (HA) testing the value of a statistic calculated from a
sample of data collected in an experiment
...
Often the null hypothesis is a clearly
specified situation, which can easily be translated to a probability distribution
that models a set of data
...
240
mean of a population is 0
...
In this case, this would be to show the mean is not zero
and differs significantly enough from zero based on a statistical test
...
In this case we would base our evidence
on the sample mean (remember this follows a normal approximation for large
samples according to the central limit theorem discussed in the last chapter)
When we use an alternative hypothesis that “the mean is different from zero” we
need to be a bit more specific
...
A one-sided alternative
hypothesis in this case may say the mean is greater than zero, or the mean is less
than zero, but not either case
...
Two-sided tests are generally more common
...
After (1) clearly defining the null and alternative hypotheses, (2) you collect
data
...
Let’s illustrate this with a simple example
...
Four proteins of known
crystal structure (empirically verified secondary structures) were used to test
various methods available on the Internet that allowed the user to enter primary
structures of proteins and then used algorithms to predict secondary structure
motifs, returning results to the user with a record of which amino acids
corresponded to which predicted secondary structure
...
The information in these tables is presented for
informational purposes to illustrate where the data came from (and a way of
doing an experiment using technology widely available)
...
rcsb
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
81-89,
96-112, 119-136
(68
...
26%)
75-81 (5
...
66%)
23-34(15
...
71%)
ICQFKLVLLGESAVGKSSLVLRFVKGQFHEYQESTIGAA
FLTQTVCLDDTTVKFEIWDTAGQERYHSLAPMYYRGAQA
AIVVYDITNTDTFARAKNWVKELQRQASPNIVIALAGNK
ADLASKRAVEFQEAQAYADDNSLLFMETSAKTAMNVNEI
FMAIAKKL (164)
Ubiquitin
Beta sheet
regions
12-18, 23-24,
29-35, 41-43,
45-48, 54-55,
67-73, 9197,104-112,
115-123
(44
...
bioch
...
edu-/o_fasta/chofas
...
ibcp
...
pl?page=/NPSA/npsa_gor4
...
embl-
Neural network based; combines
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
de/predictprotein/submit_def
...
Statistically we could come
up with several tests using data collected from this experiment
...
Step 1: Clearly define the null and alternative hypothesis
...
5)
...
Step 2: Collect the data
The total percentage of amino acids that followed correct secondary structural
motif by method is recorded in Table 14-3
...
Table 14-3
Protein
Ubiquitin
Ubiquitin
Ubiquitin
DeoxyHb
DeoxyHb
DeoxyHb
Rab5c
Rab5c
Rab5c
Prealbumin
Prealbumin
Prealbumin
Method
CF AVG
GOR
PHD
CF AVG
GOR
PHD
CF AVG
GOR
PHD
CF AVG
GOR
PHD
% Correct
0
...
645
0
...
472
0
...
879
0
...
604
0
...
449
0
...
780
Step 3: Calculate the appropriate test statistic
What is the appropriate test statistic here? Recall the discussion in Chapter 13
about the t-distribution being the sampling distribution of the mean for a small
size sample (which cannot be approximated as normal when n<30 roughly)
...
Recall that
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
So, under the null hypothesis with the theoretical mean, µ, equal to 0
...
5
s/ n
is distributed as a t-distribution with n-1 degrees of freedom
...
Let’s use R to calculate the test statistic:
> percentCorrect
[1] 0
...
645 0
...
472 0
...
879 0
...
604 0
...
449 0
...
780
> sampleMean<-mean(percentCorrect)
> sampleMean
[1] 0
...
5
> s<-sd(percentCorrect)
> s
[1] 0
...
5)
> testStatistic
[1] 3
...
What we want is whether this test statistic value falls within the range of our
hypothesized value (0
...
244
the null hypothesis
...
The test statistic tells us that that the sample average is about
3
...
5
...
The criterion
is called the alpha-level, or the type one error level
...
An ordinary alpha level for a
two-tailed t-test (which is what we are doing here) is 0
...
For a twotailed test this is split into 2
...
5% of the
upper end of the curve
...
10, 0
...
001
...
Let’s use an alpha level of 0
...
Let’s plot where our test statistic falls on the sampling
distribution under the null hypothesis:
>
>
>
>
x <- seq(-5,5,by=
...
2,legend="3
...
245
Based on the graph, it looks like our value is pretty extreme and may be
probable cause to doubt that the sample mean comes from the distribution of the
null hypothesized mean, but let’s be sure before making a decision
...
004413
This tells us that only 0
...
Given a cutoff alpha value of 5% (2
...
5% so we reach the
decision that we reject the null hypothesis and conclude that the true mean of
our data differs significantly from 0
...
Alternatively to determine if our value is too extreme we could have computed
the values of the t-distribution for the critical points of both tails of the
distribution using the qt function:
> alpha<-0
...
200985
> qt(1-alpha/2,df=n-1)
[1] 2
...
200985 or above 2
...
05 and we could reject any test statistic
more extreme than these values
...
If the p-value is less than the alpha-level the
experimenter set, then we reject the null hypothesis for a given test
...
Our p-value of 2 times 0
...
008826 is significant, but a value of 0
...
The next section discusses significance and
statistical decision making in more detail
...
Remember nothing in statistics is ever absolute and in any type of statistical
analysis there is always the randomness factor
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Whenever you perform a hypothesis test, there are four possible outcomes, listed
in Table 14-3
...
The other two outcomes are not “correct” but a result
of the statistical uncertainty inherent in a hypothesis test
...
To understand errors, imagine a production assembly line and the null
hypothesis (H0) is that there is nothing wrong with production (product failure
rate below a certain hypothesized criterion)
...
The
consequence of this is wrongly stopping production and monetary loss to the
company
...
If the alpha level for a test is 0
...
When we set
the alpha level we are setting what is an acceptable type I error rate for the
experiment
...
In this case production goes on even though there really does exist a higher than
acceptable product failure rate
...
The type II error rate is sometimes referred to as the beta (β) level
...
This is 1 minus the false negative rate or 1- β
...
247
introductory statistics books
...
First, an important thing to know about power is that as the sample size
increases, the power of a test increases
...
That is a test with an alpha level of 0
...
05
...
Do not confuse power with
significance levels!
For some tests, R provides some relief in computing power
...
t
...
prop
...
For example, power
...
test is specific for computing power related values
for the t-test (and has optional parameters for different versions of the t-test,
one-sided versus two sided, etc)
...
05
...
For
example, suppose we want to detect a difference of 0
...
5
...
For the type of test
we need to specify “one
...
Later on we will discuss other types of t tests,
such as two sample and paired for which power can also be calculated with this
command
...
t
...
5,sig
...
05, type=”one
...
level
power
alternative
=
=
=
=
=
=
20
0
...
05
0
...
sided
Note the power for the test above is only 0
...
Perhaps if we look for a less subtle difference, say delta=1, we should get a
higher power test, let’s see:
> power
...
test(n=20,delta=1,sig
...
05, type="one
...
248
sd
sig
...
05
0
...
sided
Indeed the power of this test is 0
...
86% of the time the test will
reject the null hypothesis when it should
...
5645 result to detect a delta of 0
...
Maybe we think it would be a good idea to use fewer samples, perhaps to save
money
...
t
...
level=0
...
sample")
One-sample t test power calculation
n
delta
sd
sig
...
05
0
...
sided
Reducing the sample size reduces the power of the test to 0
...
Maybe it pays to use n=20 as a sample size? To illustrate
the effect of significance level perhaps we can go with the n=10 sample size but
instead use a higher alpha level
...
t
...
level=0
...
sample")
One-sample t test power calculation
n
delta
sd
sig
...
1
0
...
sided
Using a higher alpha level does increase the power to
...
8975
...
Using the power
...
test function is a little tricky at first,
but if you become skilled at using it (and the help function provides more details
about the parameter options) it can be a valuable tool to use when determining
sample sizes and significance levels to use for a particular test you want to work
with
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Different tests differ in what is tested, in test
statistics and how they’re calculated and sampling distributions of the test
statistic (or other criteria for nonparametric tests)
...
The R
outputs for different tests show similar forms
...
Package ctest
Package ctest should be included with the basic installation of R (if not it is
available on the CRAN site)
...
6 has it preloaded)
...
Most of the tests discussed in the
remainder of this chapter are in this package
...
Use the help function with
the appropriate test name for further details about test functionality and
parameters
...
For version 1
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
For example, many of the microarray analysis packages
available for R come with specialized hypothesis test functionality
...
For most parametric tests the distribution of
the test statistic is one of the sampling distributions discussed in Chapter 13 –
the t-distribution, the Chi-Square distribution or the F-distribution
...
Recall from the previous chapter that using the student’s
t distribution is appropriate to model the mean when you are sampling from a
distribution that can be approximated as normal when the sample size becomes
large
...
There are three common forms of the t-test discussed here: the one sample t-test,
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
All three tests are called using the
function t
...
One Sample t-test
The one sample student’s t-test was illustrated earlier in our in depth analysis of
a hypothesis test, but let’s include an example here for comparison purposes to
the other two forms of the t-test and to show how R automates functionality for
this test
...
The test statistic for
the one-sample t-test, detailed earlier, is:
t=
X −µ
s/ n
Let’s consider a small dataset of gene expression data (available as geHT data in
package rbioinfo)
...
> geHTdata
names
1
Gene 1
2
Gene 2
3
Gene 3
4
Gene 4
5
Gene 5
6
Gene 6
7
Gene 7
8
Gene 8
9
Gene 9
10 Gene 10
c1
2650
1200
1541
1545
1956
1599
2430
1902
1530
2008
t1
3115
1101
1358
1910
2999
2710
2589
1910
2329
2485
c2
2619
1200
1401
1652
2066
1754
2789
2028
1660
2104
t2
2933
1309
1499
2028
2880
2765
2899
2100
2332
2871
c3
2331
1888
1256
1449
1777
1434
2332
1888
1501
1987
t3
2799
1901
1238
1901
2898
2689
2300
1901
2298
2650
c4
2750
1315
1625
1399
1999
1702
2250
2000
1478
2100
t4
3200
980
1421
2002
2798
2402
2741
1899
2287
2520
Let’s use a one-sample t-test to compare the hypothesis that the mean of the
control expression values is 2000:
> #create data vector of all control data
> controls<-c(geHTdata$c1,geHTdata$c2,geHTdata$c3,geHTdata$c4)
> #perform one-sample t test that true mean is 2000
> t
...
174, df = 39, p-value = 0
...
028 1989
...
375
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
03583 rejects the null hypothesis at a critical
value (alpha level) of 0
...
01
...
For convenience t
...
As is expected the hypothesized value of 2000 lies
outside the 95% confidence interval, since we rejected that value at a
significance level of 5%
...
Two sample t-test
The two-sample t-test is used to directly compare the mean of two groups (X
and Y)
...
The null hypothesis states that the means of two groups are equal,
or equivalently, the difference of the means is zero:
Ho: µ(X)=µ(Y), or µ(X) – µ(Y) = 0
The test statistic for the two-sample t-test used by default in R (for Welch’s test)
is:
t=
X −Y
2
2
s X sY
+
m n
where s2X is the sample variance of the X group, s2Y is the sample variance of
the Y group, and m and n are the sizes of groups X and Y
...
This is very simple in R by just entering the two vectors whose means
are being compared as parameters to function t
...
test(controls,treatments)
Welch Two Sample t-test
data: controls and treatments
t = -3
...
732, p-value = 0
...
6098 -188
...
375 2273
...
253
The p-value for this test very strongly rejects the null hypothesis of no
difference between the mean of the treatment group and control group,
indicating there are some genes which exhibit significantly different gene
expression levels, although the test does not provide specifics as to which genes
these are
...
The final variant of the t-test is called the paired t-test
...
The test statistic here is:
t=
d
sd / n
Where d is the average difference between the pairs, n is the number of pairs,
and sd is the sample variance of the paired differences
...
The example of paired data we will use here is the difference
between treatment and control on measures of the same gene
...
Suppose gene 4 and gene 9 are really the
same gene, so we can pool the data for these two genes:
> g4g9ctrl
[1] 1545 1652 1449 1399 1530 1660 1501 1478
> g4g9trt
[1] 1910 2028 1901 2002 2329 2332 2298 2287
Each of the 8 data values in the two data vectors with corresponding vector
indices created above is a data pair
...
Note that in parameters we
indicate “paired=TRUE” to perform the test
...
test(g4g9ctrl,g4g9trt,paired=TRUE)
Paired t-test
data: g4g9ctrl and g4g9trt
t = -9
...
127e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-768
...
8971
sample estimates:
mean of the differences
-609
...
254
The p-value for the test indicates there is a significant difference for this
particular gene in treatment conditions versus control, and we can reject the null
hypothesis of no difference
...
Recall that a Bernoulli trial models an experiment that has one of two discrete
outcomes
...
test,
“an exact test of a simple null hypothesis about the probability of success in a
Bernoulli experiment”
...
Suppose we want to test our assumption (null hypothesis) that
these proportions are correct given that we have empirical data from 900 plants,
625 of which have purple flowers, and the remainder (275) have white flowers
...
test function with parameters x=number of
successes (purple flowers), n=total in sample, and p=proportion of successes to
be tested (3/4)
...
test(x=625,n=900,p=3/4)
Exact binomial test
data: 625 and 900
number of successes = 625, number of trials = 900, p-value = 0
...
75
95 percent confidence interval:
0
...
724417
sample estimates:
probability of success
0
...
6944) as well as a
very small p-value, which significantly rejects the null hypothesis that the
proportion of successes can be modeled at 0
...
Scientifically this can be
interpreted as the relationship between dominant and recessive cannot be
explained by simple Mendelian ratios for this genotype
...
It is often useful to compute a confidence interval after
a null hypothesis has been rejected
...
663 and
0
...
75 lying outside the 95% confidence
interval
...
255
Comparing Variances
Sometimes we will be interested in testing whether two groups have equal
variances, as this is often an assumption when performing the t-test and other
statistical tests
...
Let’s do a
simple test to determine if the variances for the gene expression data for the
gene 4/gene 9 data (same gene) are the same under treatment or control
conditions
...
test and the desired
data vectors as parameters:
> var
...
2271, num df = 7, denom df = 7, p-value = 0
...
045468 1
...
2271085
This test uses an F-test comparing the ratio of the variances of the two groups to
a critical value on the F-distribution (discussed in Chapter 13 as the sampling
distribution for the ratio of variances)
...
The p-value result above
indicates it is OK to assume at an alpha=0
...
Note this holds only if you
assume equal variances for the pooled variance version
...
929
> var(g4g9trt)
[1] 37232
...
929/37232
...
2271086
Select Nonparametric Tests
Nonparametric tests make no or very minimal assumptions about the probability
density from which the data are derived
...
256
be approximated as normal, and when using non numerical (rank, categorical)
data
...
There are dozen’s of nonparametric tests in use
...
Test statistics for most
nonparametric tests are not from a standard distribution, but are instead
calculated from a test-specific test statistic and values for which are obtained
from a standard table or built into R
...
Instead of using a
mean value derived from a t-distribution they use a median value as a test
criteria
...
For small datasets or data not normally
distributed the median provides a good central measure of the data and the
Wilcoxon tests are a good alternative to the t-tests in these cases
...
This test determines whether the median of the empirical data differs
significantly from a hypothesized median
...
For example, let’s use the Wilcoxon test to test whether the median of the
protein prediction percent correct differs significantly from 0
...
test(percentCorrect,mu=0
...
02100
alternative hypothesis: true mu is not equal to 0
...
5
...
test function, like the t
...
Use help(wilcox
...
These tests work in a parallel
fashion to their t-test counterparts
...
It literally tests the “goodness” of the
density fit
...
257
calculates the frequencies in each bin, similar to a histogram construction
...
It then calculates a
test statistic that follows a chi square distribution, where oi are the observed
frequencies and ei are the expected frequencies values for n bins:
χ2 =
n (o − e ) 2
∑ i o i , df = n − 1
i
i =1
Note that this does use the Chi Square distribution but is a nonparametric test
because the sample is not from the Chi Square distribution but we are testing the
fitness of the distribution that the sample does come from
...
We think that that the proportion of
nucleotides A, T, C, G should be equally present in a given DNA sequence, with
proportion 0
...
This is modeled by a multinomial distribution (see
Chapter 8)
...
That is
our null hypothesis is that the data fit a multinomial distribution with equal
probability for each nucleotide
...
Genbank that connects to Genbank (http://www
...
nlm
...
gov/)
and downloads sequences into R for you
...
Let’s use this function to download a sequence into R, and then test that
sequence with a chi square goodness of fit test to see if it conforms to a
multinomial distribution
...
First we need to obtain the
sequence
...
Genbank function and references
> myoglobin<-read
...
[1059] "a" "a" "c" "a" "t" "c" "t" "c"
The goodness of fit test functionality is built into R as a default for function
chisq
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
test(obs,p=rep(1/4,4))
Chi-squared test for given probabilities
data: obs
X-squared = 12
...
005109
> # or even simpler using default p's
> chisq
...
79, df = 3, p-value = 0
...
5 level
> #of a df=n-1=3 Chi Square distribution is
> qchisq(0
...
814728
Based on result of a test statistic 12
...
81) we fail to accept our null hypothesis at an
alpha=0
...
Analyzing Contingency Tables
Contingency tables are a simple, yet powerful, method of analyzing count data
that fits into categories
...
This is also called categorical data analysis
...
The most typical case of a
contingency table is a 2by 2 table (2 rows, 2 columns), although the table size
can be extended to r (rows) by c (columns)
...
In a 2 x 2 contingency table there are four cells for the possible combinations of
two factors
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Therefore under the null hypothesis the expected
probability of an observation falling into category oij (i=roe, j=column) is just
the product of the marginal probabilities for that row and column
...
R1/N for example, is the marginal probability of being in row 1
...
To do this we will use
the Chi-Square test and Fisher’s exact test, both of which are available as
predefined R functions available as part of package ctest
...
Suppose we are doing a genetic
study and studying the effect on which of two alleles for a gene a person has and
the presence of a disease
...
The data for a 2 x 2 contingency analysis should be entered in the
format below, which works for both tests
...
The null hypothesis is that there is
no relation and the factors are independent
...
260
Chi-Square Test (for Independence)
The Chi-Square test for independence is similar to the goodness of fit test, but
uses for the expected values the values calculated from the marginal
probabilities of the data rather than expected values based on a distribution
...
The degrees of
freedom for this test are the number of rows minus 1 times the number of
columns minus 1, which is 1 in the case of a 2 x 2 table
...
test with the appropriately
formatted table as a parameter:
> chisq
...
6624, df = 1, p-value = 3
...
921e-9) we would strongly reject the null
hypothesis that the factors gene allele and disease are independent and conclude
that there is a significant relation between the disease and which allele of the
gene a person has
...
It answers the question “How
extreme is this table” out of all possible tables with the same sample size N
...
This type of test is referred to as a permutation test, and such tests
are becoming popular in bioinformatics applications because of their precision
...
261
R is one of the few statistical software packages, which easily computes such
test values
...
test and enter the data matrix as a
parameter
...
test(contingencyTestData)
Fisher's Exact Test for Count Data
data: contingencyTestData
p-value = 2
...
1195958 0
...
2105543
The p-value result here is slightly different from the Chi-Square test value, but
indicates the same result
...
Because the mathematical theory behind them is somewhat
complex, we will not cover them in depth here
...
This provides the “best test” providing the highest power
with the lowest error (for mathematical reasons beyond our scope here)
...
Likelihood ratio tests are common in other genetics applications as
well
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Now we talk about another type of
models, linear modeling
...
The most common types of linear models
are linear regression and ANOVA models, which will be introduced here
...
Let’s first look at ANOVA, then linear
regression, and then the end of the chapter discusses general linear models
...
This test, however, is limited because in many
cases we wish to compare more than two group means
...
Let’s return to the dataset illustrated in Table 14-3, made into a data frame in R
(available as protStruct)
...
Factors are non-numerical variables
...
For example, Method has 3 levels: CF AVG, GOR, and PHD,
representing the three different methods used to evaluate protein secondary
structure (see discussion in Chapter 14)
...
The function as
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
467
GOR
0
...
868
CF AVG
0
...
844
PHD
0
...
405
GOR
0
...
787
CF AVG
0
...
772
PHD
0
...
factor(Protein)
[1] TRUE
Suppose we want to test whether the percentages correct differ by method
(ignoring the protein factor for now)
...
Because we
have three groups, a two-sample t test is inadequate for this analysis
...
467
2
GOR
0
...
868
4 CF AVG
0
...
844
6
PHD
0
...
405
8
GOR
0
...
787
10 CF AVG
0
...
772
12
PHD
0
...
The way ANOVA does this is to break up the variability into
two categories – variability within groups, and variability between groups
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
467
0
...
405
0
...
645
0
...
604
0
...
868
0
...
787
0
...
664
Group Averages
0
...
716
0
...
Different factor levels are sometimes referred to as “treatments”
...
ANOVA partitions the observed variability into two components
...
This “within” variation is calculated by observing the variability of
the replicates within each level of the experimental factor
...
Changing the factor levels
causes the other component of variability
...
This type of variability is measured using group averages as
compared to the grand average
...
Note, we have not yet discussed how to calculate within, between and total
variation
...
The sum of distances of
data from any average of the data is always zero
...
Figure 15-1 shows the details of these calculations for the protein structure data
...
The sum of squares total is the sum
of the squared distances from each data point to the grand average
...
265
squares between is the sum of squared distances from the group average to the
grand average
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
267
Note that in Figure 15-1 the following relation holds:
Sum of Squares TOTAL
= Sum of Squares WITHIN + Sum of Squares BETWEEN
This can be simplified as:
SST = SSW + SSB
Thus the total variation is partitioned into two parts – the variation due to
treatment (SSB) and the variation due to random variation (SSW)
...
However, the values of the sum of squares in and of themselves are not
interesting and do not provide us within enough information to perform a
statistical inference test
...
Recall the brief discussion in Chapter 13 regarding using the F distribution as
the distribution of the ratio of variances which can be used to analyze signal to
noise ratios
...
The signal here is the between
variation which represents changes in the mean response due to treatments
(prediction methods in this case)
...
Therefore, what we want a test of is whether we have a real signal – a real
difference between the treatments, in relation the amount of noise we see in the
experimental data
...
Why is this? If you look at
the data in Figure 15-1 you may notice that for the sum of squares between there
are only three unique values
...
The SSW has more than 3
unique values
...
For the sum of squares
between, the degrees of freedom is the number of treatments (which we’ll
designate with a lowercase letter a) minus 1
...
Like the sum of
squares the degrees of freedom also have the relation:
Total df = Within df + Between df
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
In our example we have 11 total degrees of
freedom, 9 degrees of freedom within, and 2 degrees of freedom between
...
So we need to standardize our
values to make the comparison level
...
Mean squares are calculated by taking the sum of squares and dividing them by
their respective degrees of freedom
...
305
=0
...
048
=0
...
The test utilizes
the F distribution and takes for a test statistic the ratio of the MSB/MSE, our
“signal to noise” ratio
...
1525/0
...
6
...
In this case
these are 2 and 9 respectively
...
05:
> qf(0
...
256495
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
6 is more extreme on the F distribution than
the critical value of 4
...
The ANOVA Table
All of the computations done in the discussion above are easily summarized into
a convenient standardized format known as the ANOVA table, depicted in Table
15-2
...
Table 15-2
Source
Degrees of
Freedom
Sum of
Squares
Mean
Squares
F Ratio
Factor
a-1
SS(between)
MSB
MSB
MSE
Error
N-a
SS(error)
MSE
Total
N-1
SS(total)
Performing ANOVA using R
Performing ANOVA analysis using R is fairly simple
...
This function will be discussed later in
this chapter when it is used for regression
...
To do this, you can nest the function calls, as
demonstrated below for our sample data:
> anova(lm(Correct~Method,data=protStruct))
Analysis of Variance Table
Response: Correct
Df
Sum Sq Mean Sq F value
Pr(>F)
Method
2 0
...
152676 28
...
0001263 ***
Residuals 9 0
...
005342
--Signif
...
001 `**' 0
...
05 `
...
1 ` ' 1
R refers to the sum of squares between by the group name (method row) and the
sum of squares within as part of the “residuals” row
...
This simple
model can be built upon to create much more complicated models but they all
follow the same paradigm of partitioning the variance and are all run in R in a
similar fashion
...
270
now let’s consider another issue, determining which means are significantly
different
...
However, it does not determine which
groups differ
...
However, we don’t know the specifics
of whether PHD differs from GOR significantly, or GOR differs from CF AVG,
etc
...
t
...
This will produce a pairwise comparison of the
methods and produce a table of p-values for the significant differences of
specific pairs
...
> attach(prot
...
t
...
00115 PHD 0
...
05793
P value adjustment method: holm
Based on this output for this experiment, it can safely be concluded there is no
difference between the GOR and PHD methods but that both of these differ
significantly from the CF AVG method
...
In such situations the Bonferroni adjustment would require to select individual
alpha levels of 0
...
Even the Holm adjustment, which is less stringent, would still not work
for so many tests
...
It is called the False Discovery Rate (FDR) and was proposed by
Benjamini and Hochberg(1995)
...
So, a FDR rate of 0
...
Let’s see how a 5% FDR rate compares with the
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
> pairwise
...
test(Correct,Method,p
...
AVG GOR
GOR 0
...
00013 0
...
However the difference between the two methods would be dramatic if we
performed several hundred tests simultaneously
...
This can be done in
R using the stripchart function, which will plot the spread of data values for
each group as depicted in Figure 15-2
...
272
Figure 15-2
The addition of arrows (with flat heads at 90 degree angles) depicts the spread
range of the mean within 2 standard errors, which is usually about the
approximate range of a confidence interval
...
Figure 15-2 provides a visual confirmation of our earlier
conclusions
...
Suppose we want to
analyze not only whether the percentage correct (result) is affected by the
method of prediction used, but also whether the percentage correct is affected by
which protein is being tested
...
We call such an analysis a two factor ANOVA
...
Nesting lm inside ANOVA
creates a new ANOVA table as a result, this time with an additional line for the
additional factor
...
273
Analysis of Variance Table
Response: Correct
Df
Sum Sq Mean Sq F value
Pr(>F)
Method
2 0
...
152676 42
...
0002832 ***
Protein
3 0
...
008872 2
...
1583825
Residuals 6 0
...
003577
--Signif
...
001 `**' 0
...
05 `
...
1 ` ' 1
Based on this analysis the method factor is significant but the protein factor is
not
...
The relationship for total sum of squares
still holds:
SST = SSW + SSB
Only now we can split SSB into two components, so:
SST = SSW + [SS(method) +SS (protein)]
The F-values are the mean square for the sum of squares for that method divided
by the sum of squares error
...
The analysis of interactions is
not presented here, but this is a point of interest to note
...
Although the complexity of these models is beyond coverage here, all
of these rely on the basic analysis method discussed above
...
In particular,
books on the statistical discipline of experimental design have in-depth coverage
of various types of ANOVA models
...
Meanwhile let’s proceed to studying the second major form of linear model, the
regression model
...
274
Regression Analysis
Regression analysis is the statistical technique used to model relationships
between a set of input (or predictor) variables and one or more output (response)
variables
...
We discuss simple linear regression and how to perform this
analysis in some detail here
...
Correlations
Correlation metrics are measures of associations between variables
...
Two variables can correlate quite nicely, yet have no cause/effect
relationship
...
Variables measuring each of these
quantities would have a high correlation
...
Nevertheless looking at correlations does provide a useful preliminary analysis
for looking at relationships between variables
...
There are three most
commonly used correlation metrics: Pearson’s correlation coefficient,
Spearman’s rho, and Kendall’s tau
...
Spearman’s rho and Kendall’s tau are nonparametric correlation coefficients and
based on the ranks of the data
...
All three correlations measure the strength of a linear relationship
between the two variables
...
This usually does not
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
As an example, consider the small dataset geneCorrData
(this was available as part of the rbioinfo package which contains expression
measurements for a time course of two genes under the same treatment
condition in a prior release of R but has been retired)
...
06 -1
...
81 -1
...
48 -0
...
42 -0
...
30 -0
...
35 -0
...
31 -0
...
18 -0
...
20 0
...
11 -0
...
09 -0
...
16 0
...
45 0
...
53 0
...
67 0
...
80 0
...
87 1
...
92 1
...
276
Figure 15-3
The plot demonstrates visually that there is evidence of a relationship but does
not quantify it
...
To do this
we can use the R function cor
...
By default the test calculates Pearson’s
correlation coefficient, but other coefficients can be specified using the optional
method parameter
...
test(gene1,gene2)
Pearson's product-moment correlation
data: gene1 and gene2
t = 7
...
246e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0
...
9556856
sample estimates:
cor
0
...
test(gene1,gene2,method="spearman")
Spearman's rank correlation rho
data: gene1 and gene2
S = 192, p-value = 8
...
8018576
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
88 or 0
...
It says the association between the expression values for gene 1 (X values) and
gene 2 (Y values) is positive and the values are highly correlated (as one goes
up, so does the other)
...
A
simple linear regression model is required to do this
...
In this equation m is the
“slope” of the line (change in y over change in x) and b is the “intercept” of the
line where the y-axis is intersected by the line
...
Although the straight-line model is perfect for algebra class, in the real world
finding such a perfect linear relationship is next to impossible
...
In this case, the question
literally is “Where do you draw the line?”
...
Y=2X+1
25
20
Y
15
10
5
0
0
2
4
6
8
10
12
X
Figure 15-4
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
To study the relationship between X and Y, the simplest relationship
is that of a straight line, as opposed to a more complex relationship such as a
polynomial
...
Plotting X versus Y, as done in Figure 15-3 is a good first step to
determine if a linear model is appropriate
...
Determining if a straightline relationship between X and Y is appropriate is the first step
...
Fitting the Regression Model
Obviously, given Figure 15-3, you could simply take a ruler and draw in a line,
which, according to your subjective eye, best goes through the data
...
Therefore a more sophisticated method is needed
...
For any given value of X, the true mean value of Y depends on X, which
can be written µ y| x
...
2
...
3
...
The mean of Y is
determined by the straight line relationship which can be written as:
µ y| x = β 0 + β1X where the betas are the slope and intercept of the line
(note this closely resembles y=b+mx, the equation of a line)
...
The variance of the Yi’s is constant (homoscedasticity property)
...
The response Yi’s are normally distributed with a constant variance (see
property 4)
...
The actual observed values of Y differ from
the mean of Y ( µ y| x ) by an error term (reflecting random variation)
...
279
The errors, ε i’s, are the difference between the mean line and the actual Yi data
values
...
In order to fit a regression model, we have parameters to estimate
...
We have a special method for estimating these called the method of least
squares
...
The least square estimates minimize the sum of squares of the
vertical distances between the points and the line, which corresponds with
minimizing the squared residual values
...
For the estimates of the theoretical parameters, we use the convention of lower
ˆ
case b’s (an alternative notation uses hats on the Greek letters, such as β) to
denote that these are estimates based on the data not theoretical values
...
For b0 (intercept)
b0= y − b1x
Of course, performing such calculations with large datasets is tedious
...
Once again, as with ANOVA, we will use the linear model function lm
...
05541
0
...
280
In the above output R has calculated the b0 term (-0
...
9707)
...
As a challenge, you can verify these results using the formulas
given for the coefficients and the empirical data
...
Plotting the regression line makes the analysis more interesting
...
> plot(gene1,gene2)
> lines(gene1,fitted(geneLM))
To make things even more interesting we can add segments connecting the
empirical data to the regression line
...
> segments(gene1,fitted(geneLM),gene1,gene2,lty=2)
> title("Fitted Model with Residual Line Segments")
Figure 15-5 shows the resulting plot
...
ˆ
yi = -0
...
9707 * x i
Figure 15-5
Note that regression is a tool to understand existing causal relationships not to
create them
...
281
conclude two variables are causally related just because they can find a good
fitting regression model relating the variables
...
Supply these functions with the saved variable containing
the linear model object to obtain the output of results:
> fitted(geneLM)
1
2
3
4
5
6
-1
...
84167628 -0
...
46310317 -0
...
39515415
7
8
9
10
11
12
-0
...
23013511 -0
...
16218609 -0
...
09990299
13
14
15
16
17
18
0
...
45906209 0
...
72115116 0
...
83763519
> resid(geneLM)
1
2
3
4
5
4
...
783237e-01 1
...
689683e-02 -2
...
551542e-01 3
...
986489e-02 7
...
678139e-01
11
12
13
14
15
-3
...
701311e-05 8
...
490621e-01 -7
...
811512e-01 2
...
723648e-01
In fact, if you add the fitted values and the residual values together you get the
empirical Y data values (gene 2)
...
frame(sumFnR,gene2)
> compare
sumFnR
1
-1
...
02
3
-0
...
48
5
-0
...
24
7
-0
...
33
9
0
...
53
11 -0
...
10
13
0
...
11
15
0
...
34
17
1
...
21
gene2
-1
...
02
-0
...
48
-0
...
24
-0
...
33
0
...
53
-0
...
10
0
...
11
0
...
34
1
...
21
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
In R the lm function is more than just an ordinary function
...
Those familiar with objectoriented programming know about the principle of encapsulation, where
information is “hidden” inside an object
...
The
summary function can be used to extract details of a linear model object
...
Applying the summary function to our linear model object yields the following
output:
> summary(geneLM)
Call:
lm(formula = gene2 ~ gene1)
Residuals:
Min
1Q Median
-0
...
2196 -0
...
1492
Max
0
...
Error t value Pr(>|t|)
(Intercept) -0
...
07330 -0
...
461
gene1
0
...
12925
7
...
25e-06 ***
--Signif
...
001 `**' 0
...
05 `
...
1 ` ' 1
Residual standard error: 0
...
779,
Adjusted R-squared: 0
...
41 on 1 and 16 DF, p-value: 1
...
The next section is a
summary of the residuals
...
The distribution should be
centered around zero, so the median value should be near zero, and it is so for
our data
...
This is one indication of a
good model fit
...
This may or may not indicate a problem with the model fit, but
deserves further exploration
...
283
Figure 15-6
The other section of summary output that is of interest for testing simple linear
regression models is the coefficients section
...
Coefficients:
Estimate Std
...
05541
0
...
756
0
...
97070
0
...
511 1
...
In this case, with a p-value of 1
...
However, if this test
provides an insignificant p-value (0
...
The intercept
test is of little interest in practice
...
Both
are linear models that relate a response (Y) to input (X) through a linear
relationship
...
Both
ANOVA and regression take the form:
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
If we expand this to two-way ANOVA the
model is:
Yij = µ + αi + βj + ε ij
Now instead of one treatment effect we have two, alpha and beta, representing
the effect of the two groups (in our example earlier, protein and prediction
structure method)
...
In
regression the input data is continuous data, whereas with ANOVA we are
working with factor (discrete) variables
...
There
are dozens of variations of the linear model, which is a topic of advanced
statistics courses
...
It is of note though that the input variables themselves
need not be linear and can have exponents other than 1
...
In bioinformatics, the use of linear models occurs frequently in quantitative
genetics applications such as QTL mapping
...
R has extensive functionality for dealing with all types of linear and
nonlinear regression models and many packages are devoted to specific
advanced regression modeling applications
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
286
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
However there is a large body of work in statistics devoted to the
simultaneous analysis of several response variables
...
We have already introduced multivariate
probability models; in particular the multinomial distribution is a multivariate
model for discrete data
...
Multivariate statistics can be challenging
...
Another serious problem is the edge effect
issue
...
Another challenge is data and pattern description and visualization
...
It is likely that important features of multivariate data get lost in
these projections
...
Furthermore we will introduce some techniques of analyze multivariate data
which are becoming commonplace in biomedical literature
...
288
The Multivariate Normal Distribution
We first generalize the bivariate normal distribution, discussed in Chapter 8, to
the case of several variables
...
The probability density function for
higher dimensional normal probability models is fairly complex and requires the
notation of linear algebra
...
They are: µ1 and µ2, the mean of Y1 and mean of Y2,
respectively; σ1 and σ2, the standard deviation of Y1 and of Y2, respectively,
and the correlation coefficient ρ12 between Y1 and Y2
...
1 displays
random samples of size 500 from several different normal distributions whose
parameters are given below in Table 16
...
Table 16
...
1
µ1
µ2
σ1
σ2
ρ12
σ12
(a)
0
0
1
1
0
0
(b)
2
1
1
...
414
1
-0
...
707
(d)
2
1
1
...
9
1
...
1: Simulated Bivariate Normal Data
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
The bivariate means
determine the location of the center of the distribution
...
For example in plot (b), the
standard deviation of Y1 is 1
...
Finally the correlation describes the strength of a
linear association between the two variables
...
Plot (d)
depicts a strong positive correlation of 0
...
Notice the last column for the
parameter σ12 in the table
...
Note that
this matrix is symmetric: the off-diagonal elements above the diagonal
correspond to those below the diagonal
...
These commands
require as inputs a vector of the means of the variables, and the variancecovariance matrix
...
1 was generated using the
following commands:
>
>
>
>
>
library(mvtnorm)
mean4 <- c(2,1)
Sigma4 <- matrix(c(3,
...
5),
...
5),1),ncol=2)
mat4 <- rmvnorm(500,mean4,Sigma4)
plot(mat4,xlim=c(-8,8),ylim=c(-4,6),xlab="y1",ylab="y2")
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Although clearly not all continuous data have a multivariate
normal distribution, it is very difficult to work with other multivariate
continuous distributions, one exception being the Dirichlet distribution that is
used for modeling continuous fractions (see Chapter 8)
...
Geometrically speaking such transformations include translations, rotations,
stretching and contracting
...
Another nice feature of the multivariate normal distribution is that any marginal
distributions are again normal
...
4 −0
...
6 −1
...
4 1
...
8
−0
...
2 1
...
2
...
4,-
...
6,-1
...
4,1
...
8,-
...
2,1
...
2
...
5,
ρ13 = -0
...
3, ρ23 = 0
...
2, ρ34 = 0
...
We examine the two-
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
For example, scatterplots of the pairs (1,2), (1,4), and (2,4) are shown
in Figure 16
...
-1
0
1
2
3
4
5
-2
0
Y[, 1]
2
4
6
8
10
Y[, 2]
Figure 16
...
Two-dimensional marginal distributions:
Scatterplots
...
1
...
A general linear transformation of the above four
variables is defined by any arbitrary coefficients c, b1, b2, b3, b4 which then
define a new random variable U:
U = c + b1Y1 + b2Y2 + b3Y3 + b4Y4
We consider the following two transformations:
U1 = 3 + 2Y1 –2Y2 + 4Y3 – 4Y4
U2 = -6 –Y1 – Y2 + 6Y3 + 2Y4
By examining the histograms and the scatterplots (Figure 16
...
>
>
>
>
>
>
>
# Generate linear combinations
U1 <- 3 + 2*Y[,1] -2*Y[,2] + 4*Y[,3] - 4*Y[,4]
U2 <- -6 - Y[,1] - Y[,2] + 6*Y[,3] + 2*Y[,4]
# Create graphs
par(mfrow=c(1,3))
hist(U1,nclass=12);hist(U2,nclass=12)
plot(U1,U2)
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
4: Distributions of Linear Combinations of Normal
Variates
So far we have only examined simulated data, which by design are very well
behaved
...
Whenever
possible we will judge as to whether the data follow a multivariate normal
distribution
...
However they can be influenced
by individual extreme data values, so-called outliers, especially if the dataset is
small to moderately large
...
Multivariate Sample Statistics
A routine first step in multivariate data analysis involves calculating sample
statistics for the purpose of summarizing the data
...
It is best to
arrange the data into a data frame, instead of into a matrix
...
> Y <- as
...
frame(Y)
> # Provide summaries for each variable
> summary(Y)
Y1
Y2
Y3
Min
...
8827
Min
...
173
Min
...
062
1st Qu
...
2324
1st Qu
...
653
1st Qu
...
534
Median : 1
...
008
Median : 5
...
9506
Mean
: 3
...
943
3rd Qu
...
5636
3rd Qu
...
315
3rd Qu
...
311
Max
...
9096
Max
...
023
Max
...
927
Y4
Min
...
266
1st Qu
...
918
Median : 8
...
020
3rd Qu
...
316
Max
...
822
We note how close the medians and the means are to each other, indicating that
the data are symmetric
...
293
sample variances and covariances arranged in a variance-covariance matrix, and
the sample (Pearson) correlation coefficients arranged in a correlation matrix
...
9997 2
...
0116 3
...
9993 0
...
4111 -0
...
9756 4
...
6587 -1
...
4111 1
...
0466 1
...
8692 -1
...
7394 9
...
0000 0
...
2044 -0
...
4875 1
...
4119 -0
...
2044 0
...
0000 0
...
2851 -0
...
2835 1
...
We use
the symbol S (bold faced S) to denote the sample variance covariance matrix,
which is customary in multivariate statistics
...
Displaying Multivariate Data
The best way to familiarize oneself with what is in a data set is through
examining graphs
...
While high dimensionality (many variables) prevents us from clearly seeing all
features in lower dimensional plots, graphing is still a powerful method
particularly for identifying outliers, relationships between variables, and for
suggesting transformations to achieve distributions that are closer to a normal
...
(see Anderson
(1935)
...
We will
analyze the species “versicolor”
...
294
> data(iris)
> # first print the data
> print(iris)
1
2
3
…
...
49
50
51
52
...
…
...
Length Sepal
...
Length Petal
...
1
3
...
4
0
...
9
3
...
4
0
...
7
3
...
3
0
...
3
5
...
0
6
...
7
3
...
2
3
...
5
1
...
7
4
...
2
setosa
0
...
4 versicolor
1
...
7
6
...
8
2
...
3
2
...
1
6
...
1
1
...
5 virginica
1
...
2
5
...
4
3
...
4
5
...
3
1
...
versicolor <- iris[iris$Species=="versicolor",1:4]
> # Plot 4 histograms
> par(mfrow=c(2,2))
> for (i in 1:4) { hist(iris
...
versicolor)[i]) }
4
...
5
0 5
1
5
Sepal
...
Length
6
...
0
3
...
0
5
...
2
0
6 1
2
Petal
...
Length
2
...
0
1
...
8
Figure 16
...
5
...
R provides a convenient command, “pairs” that
produces all scatterplots of all pairs of variables in a data set arranged as a
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
6)
...
There is of course redundancy in this figure
...
n <- nrow(iris)
# Create labels
ir
...
labels[iris$Species=="versicolor"]<- "c"
ir
...
labels)
0
...
5
2
...
5
ss ss
s
ssssssssssssss s
s
s s ss
s
sss sss s
ssssssssss sss
ss sss s
sss
6
...
5
s
ss
s
s
ss sss
ss s
s
s
ss
vv
v vvv
vvvvv vv
vv
vvvvvvvvv
vv v v
vv
vv v
cv
v
vcc vvvv
ccc v v
ccc
cccccv v
ccc
ccccc
cc
ccccccc
cc
c
7
...
5
5
...
5
vv
v
c
v
c c vv vvv
cc v v vv
cc cv vvvv
c v
cccc v
cc v
v
c cccc vvvvv v
cccccc vvvv v
ccv c vv
cc c
c v
c c
c
vvvv
vvv v v
v vvv
v v v vvvvv
v vvv
v
c
v vv
v vvv vv
cccc ccvv
c
cc
c
cccc v
c cccc
c
cc
cc c
cc
Petal
...
5
6
s
s
ss
ss s
s
s
sss
s
s
sss
s
ss
ss
s
sss
ss
s
s
v v
v vv vv v v
v v
vv
v
v
v v
vv vvvvv vv v
vcvvvv v v
v vv c v
v v
v
ccc c v
cv
v c
ccccccccccc
cc vc cc
c c
c c c cc
c cc cc
cc
cc c
4
...
W idth
ss
s
ss
ss
ss
s
ss s
ss
ss
sss
s
ss
v
vv
vv
v
v
vvv
vv
c vvv
c
ccccvvvv
c vv
v
cccvvvvv
c vv v
c
cv v
vc
ccccv vv
cc v
cc cv
cc v
c cc c vv
cccc
cc c
ccc c
c v
1
...
5
4
s
sss
ss
vv
s
s
v
s ssss
s ss
s
v
sssss cvc v
s
vv vcv
s
s
c c vv
v v
ss ss
s
vc vcv
s sss ccccvvvccvv v
s s ccvc cccv v v
s s vc ccvcv v v
v c
cccvv
v v
c
c cc vv c
v
v c cv c v
cc v
s cc c ccc
c v
c
4
...
0
3
...
0
Sepal
...
0
2
2
...
W idth
s
s
ss
s
ss s
ss
sss s
sss
sss
s
1 2 3 4 5
6 7
Figure 16
...
versicolor)
Sepal
...
Width
Petal
...
Length Sepal
...
Length Petal
...
0000000
0
...
7540490
0
...
5259107
1
...
5605221
0
...
7540490
0
...
0000000
0
...
296
Petal
...
5464611
0
...
7866681
1
...
Length and Petal
...
Length and
Petal Width are most strongly correlated with respective correlations of 0
...
7867
...
5 5
...
5 6
...
5 7
...
Length
2
...
2 2
...
6 2
...
0 3
...
4
5
...
0
4
...
0
3
...
0
Petal
...
Width
5
...
0
4
...
0
3
...
0
1
...
2 1
...
6 1
...
5 5
...
5 6
...
5 7
...
Length
Petal
...
0 1
...
4 1
...
8
Petal
...
Length
3
...
2
3
...
8
2
...
4
2
...
0
Sepal
...
Width
1
...
2 1
...
6 1
...
5 5
...
5 6
...
5 7
...
4
3
...
0
2
...
6
2
...
2
2
...
Width
3
...
54
...
55
...
5
Petal
...
Software may also include interactive graphics that allow the
rotation of the point cloud which further enhances the subtle features in the data
...
For example plotting the four 3-d point clouds of
the iris versicolor data results in the following (Figure 16
...
Width
Figure 16
...
In particular, in
graph 1 (top left) there appear to be 4 points with low Sepal values slightly
separated from the rest of the data
...
Again
these patterns are subtle and would only be considered mild departures from
multivariate normality
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
In the special case of uncorrelated normal data, the football
becomes aligned with the axes, and in the special case of uncorrelated data and
equal variances among the variables, the football becomes a sphere
...
For one and two-variable data it is easy to
identify outliers in a scatterplot
...
For multivariate data with more than 3 variables,
there is the chance that lower-dimensional projections do not reveal all outliers
...
The onedimensional projections are represented as rugplots along the x and y axes
...
8: Masking of Outlier in Projections
One very simple classical method is to obtain for each data point xi, the unit-free
measure of how many standard deviations it lies away from the center of the
data, the multivariate mean
...
298
zi =
xi − µ X
σX
Typically we are only interested in the absolute value (positive number) of the tscore, in which case we can write
2
x −x
ui = t = i
=
sX
2
i
( xi − x )
1
x − x)
2 ( i
sX
For normally distributed data, ti2 has for large data approximately a chi-square
distribution with 1 degree of freedom
...
It turns out that the measure, di2 , which is a generalized squared distance, is:
di2 = ( xi − x ) ' S −1 ( xi − x )
This is a vector-matrix expression, and S-1 denotes the inverse matrix of the
sample variance-covariance matrix S, and (
...
Any p by p matrix, such as S defines the coefficients of a system of p
linear equations which is a transformation from one p-vector to another
...
For any p>2, such systems
invariably require computer software for quick calculations
...
The di2
are also called Mahalanobis distances, after the Indian statistician Mahalanobis
...
R provides the
Mahalanobis distances with the mahalanobis command for the rows of a data
matrix
...
9
...
versicolor,
+ mean(iris
...
versicolor))
> plot(dist2^
...
299
0
...
0 1
...
0 2
...
0 3
...
5
0
10
20
30
40
50
Index
Figure 16
...
Note that the square root of the 95th percentile of the chi squared distribution
with 4 degrees of freedom equals 3
...
We would use this number as a rough
cut-off to flag extreme observations that give an indication of departure from
normality
...
Principal Components
In order to effectively analyze multivariate data the scientist is often forced to
resort to what is known as dimension reduction techniques, the first we will
introduce is known as principal component analysis
...
The derived
variables, called principal components, are linear combinations of the original
variables, and are usually listed in the order of their respective importance
...
The aim is to discard
subsequent principal components after a large percentage of the variation (90%
– 99 %) has been explained by the first k principal components, where k < p
...
In
R we use the command princomp followed by the summary command to
obtain standard PC analysis output given below
...
The procedure resides in the package mva
...
iris <- princomp(iris
...
iris)
Importance of components:
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
1
Comp
...
3
Comp
...
6914597 0
...
23169066 0
...
7808176 0
...
08766635 0
...
7808176 0
...
98433102 1
...
10 visualizes the amount of the variance that is
explained by the four principal components
...
7% of the
variability, while using the first three PC’s would capture 98
...
> # Create screeplot of a princomp object
> screeplot(pc
...
3
0
...
0
0
...
4
pc
...
1
Comp
...
3
Comp
...
10: Screeplot of PC Analysis for Iris Versicolor
We note that all four original variables are size measurements (length and
width)
...
PC1 can be interpreted as a derived variable of size
...
Classification and Discriminant Analysis
The goal of classification is to develop a predictive model that is capable of
assigning membership to a class based on several measured characteristics
...
It is assumed that taking size measurements is much easier than
determining the exact species
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
As the name suggests the goal is to predict membership in
any one of two classes
...
Reaven and Miller (1979) examined the relationship between chemical
subclinical and overt nonketotic diabetes in 145 non-obese adult subjects
...
In addition, the relative weight
and fasting plasma glucose were measured for each individual in the study
...
http://lib
...
cmu
...
1
...
We attempt to derive a classification into the two categories normal (coded as 3
in the data) and subclinical diabetes (coded as 2) using the two variables glucose
intolerance, and insulin response
...
> glucose <- read
...
glucose
...
txt')
> # Only use the 6 relevant variables
...
intol','i
...
resist','rel
...
gluc','diagn')
> # Only use normal (diagn=3) and subclinical diab (diagn=2)
> gluc1 <- glucose[glucose$diagn != 1,]
> attach(gluc1)
> # Plot the data
> plot(gluc
...
respons,type='n')
> text(gluc
...
respons,labels=diagn,cex=0
...
11, we see that the responses of the
two groups are quite interspersed
...
The means of each group are
indicated as filled squares
...
302
2
110
2
3
2
100
90
i
...
7
0
...
9
1
...
1
1
...
intol
Figure 16
...
In the ideal case the observations are totally separated in
sample space, in which case we can easily draw a line, or a curve of separation
that classifies the data perfectly
...
Because of the
overlap there is some risk of misclassifying a future sample into group “2” when
it should be classified into group “3” and vice versa
...
Thus, the classification/misclassification probability is the
conditional probability of the classification multiplied by the prior probability
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Most often we consider a “0-1” cost function which
means that there is a cost of 1 when misclassification occurs
...
p ( A | X in A) =
p ( X in A | A) p A
p ( X in A | A) p A + p ( X in A | B ) pB
p ( B | X in A) =
p ( X in A | B ) pB
p ( X in A | A) p A + p ( X in A | B ) pB
Note that for binary classification, these two probabilities add to one
...
In particular expressions like
p(X in A|A) may involve complicated multivariate density functions
...
A
...
Fisher’s LDA is an
optimal classification rule if the data in each group follow a multivariate normal
distribution that have different mean vectors but with the same variancecovariance matrix
...
The package “MASS” contributed to the R project in conjunction with the text
“Modern Applied Statistics with S-Plus” (Venables and Ripley, 1997) provides
command lda linear discriminant analysis
...
Let’s first apply lda:
>
>
>
>
# Perform Linear Discriminant Analysis
library(MASS)
gluc
...
lda
Prior probabilities of groups:
2
3
0
...
6786
The prior probabilities are assumed according to the fraction of data that fall in
each class, unless otherwise specified
...
intol i
...
0558
99
...
9372
91
...
intol -5
...
304
i
...
07247
The coefficients of lda determine the linear transformation
...
An easier interpretation is the following: The separation
line crosses the line connecting the group means at the halfpoint, and has a slope
that is minus the ratio of the lda coefficients
...
573/0
...
9
...
11)
...
The command
predict(lda) provides a printable list with the following objects:
>
>
>
>
# Store the list of output
gluc
...
lda)
# class membership prediction
gluc
...
pred$posterior
2
3
1
0
...
9685
2
0
...
7135
3
0
...
5640
4………
...
pred$x
LD1
1
1
...
09097
3
-0
...
4…………
...
pred$class we notice that the first 3 observations are
classified as 3 because p(3|X) > 0
...
We could now count the
misclassifications by comparing the actual diagnosis, the diagn variable with
the predicted class
...
pred$class)
2 3
2 17 19
3 10 66
Thus the estimated misclassification errors are as follows
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
528, and P(X is in “2”| “3”) = 10 / 76 = 0
...
We
can obtain the classification probabilities by dividing the table by its row sums
as follows:
> tab <- table(gluc1$diagn,gluc
...
4722 0
...
1316 0
...
Classification for More Than Two Populations
The discussion of binary classification directly generalizes to more than two
populations without much difficulty
...
We now attempt to classify the full
diabetes data using all five variables and all three diagnoses (1: overt diabetic, 2:
chemical diabetic, 3: normal)
...
Linear discriminant analysis attempts to transform the data into one or two new
variables, so called canonical variates, so that the separation of diagnosis is
strongest in the first, and second strongest in the second
...
The following gives the R calculations
...
lda <- lda(glucose[,1:5],glucose$diagn)
gluc
...
2276 0
...
5241
Group means:
gluc
...
respons i
...
wt plas
...
9839
217
...
8 106
...
9
2
1
...
31
493
...
0
209
...
9372
91
...
0 172
...
0
Coefficients of linear discriminants:
LD1
LD2
gluc
...
3624357 -3
...
respons
0
...
036633
i
...
0125764 -0
...
wt
0
...
006173
plas
...
0042432 0
...
8812 0
...
306
The optimal classification results in two canonical variates (LD1 and LD2)
given by the coefficients above
...
Let’s examine class membership and misclassification error
...
pred <- predict(gluc
...
pred$class
[1] 3
[30] 3
[59] 3
[88] 2
[117] 1
Levels:
3
3
3
2
1
1
3
3
3
2
1
2
3
3
2
2
1
3
3
3
2
2
1
3
3
3
2
1
3
3
2
2
1
3
3
2
2
2
3
3
3
3
1
3
3
3
2
1
3
3
2
2
1
3
3
3
2
1
3
3
2
2
1
3
3
3
2
1
3
3
3
2
2
3
3
3
2
1
3
3
3
2
1
3
3
3
3
3
3
3
2
2
2
3
3
3
2
2
3
3
3
2
2
3
3
3
2
1
3
3
3
3
1
3
3
2
2
1
3
3
2
3
1
2
3
3
1
1
3
3
2
1
1
3
3
2
1
1
3
3
2
1
1
> tab <- table(glucose$diagn,gluc
...
8182 0
...
03030
2 0
...
86111 0
...
0000 0
...
96053
This classification has much lower misclassification error than the previous one
where we only used two response variables
...
039
...
12 plot (b) that the separation
into the three groups shows clearly when the data are represented in the two
canonical variates, whereas the groups appear interspersed, particularly groups 2
and 3, when represented in the first two variables glucose intolerance, and
insulin response, as shown in plot (a)
...
7
0
...
9
1
...
1
1
...
respons
300 350
(a)
3
3
3
3333
33 3
3
33 3
3
33
1 3 33 3
1 1
33
333 3
33
333
333 3 3
1 1 2 23333 33
33
1 1
33 3
233 33
33
11
1
22 223 33 33
3
11
23 3 3 3
1
2
1 1 2 2 23
1
22 2 2
22
23
1
22 3
2 22
1 1 22
12
2
3
2
2
22
2
22
2
1
1
1
1
-6
gluc
...
307
Figure 16
...
Any of the linear methods for multivariate
statistics are optimal when the data more or less follow a multivariate normal
distribution
...
This outlier
sensitivity can have a major impact on the misclassification error: Outliers can
have a pull-effect on the separation curve that tends to reduce their probability
of being misclassified
...
In order to adequately evaluate the classification procedure we need to have a
separate data set, the so-called test set, for which we evaluate how well the
classification rule works
...
The test set can be a
separated part of the dataset originally used which was not incorporated when
making the model or a different dataset
...
Classification Trees
Classification trees, or recursive binary partition schemes originated as tools for
decision making in the social sciences and were introduced to main-stream
statistics by Breiman et al
...
Classification trees essentially provide a
sequence of rectangular blocks of the data space, where one of the groups is
assigned to each block
...
Classification trees are
similar to the widely used “keys” in botany for plant identification
...
Let’s use this methodology on the cancer microarray gene
expression data, available at http://www-stat
...
edu/ElemStatLearn ,the
website for the textbook by Hastie et al
...
This data has been fully
preprocessed
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
We selected a set of 35 genes
that exhibited minimal expression of at least –0
...
To make the
results more understandable we only consider the 6 most expressive genes,
which corresponds to a minimal expression level over the 64 cancer samples of
–0
...
We thus work with 6 variables and attempt to partition into 8 different cancer
types
...
They are assigned in this order
in the following output
...
red <- MAdat[(rowmin > -
...
red <- MAdat
...
frame(t(MAdat
...
factor(names(MAdat
...
data
...
We use an abbreviated model formula: cancers ~
...
> # Classification Tree via Recursive Partitioning
> library(tree)
> MA
...
,data=Genedat)
We can get an overall summary of the tree procedure:
> summary(MA
...
, data = Genedat)
Variables actually used in tree construction:
[1] "X5680" "X4831" "X2838" "X3234"
Number of terminal nodes: 8
Residual mean deviance: 2
...
439 = 25 / 57
And we can plot the tree, which uses these commands
...
13
...
tr)
> # Add labels to the splits
> text(MA
...
309
X5680 < -0
...
055
COLON
X3234 < -0
...
105039
X2838 < 0
...
12498
BREASTOVARIAN
X5680 < 0
...
13 Classification Tree for Gene Expression Data
Hastie et al
...
We can
assess the predictive power of a tree using cross-validation as described
previously
...
tree
...
Classification trees have their natural counterpart in regression analysis where
the variable to be predicted is continuous
...
The commercially available computer software CART (Classification and
Regression trees) is a full-fledged stand-alone package for tree-based analyses
...
Clustering Methods
Clustering of data is usually a first step in organizing a multivariate dataset
...
Let’s say you purchased a new bookshelf
that holds 100 of your books
...
In such a way you create
groups, or clusters of books, where books within a cluster are similar
...
Or you may use quantitative characteristics such as book size,
thickness, etc
...
e
...
The final outcome will
depend on what “measures of similarity” or “dissimilarity” you have used
...
310
A basic reality of clustering is the fact that for even a small number of items to
be grouped, there are thousands of possible ways to cluster – that’s why
“cleaning house” can be so frustrating! In fact the number of possible clusters
of k items is 2k – 1, a number that grows exponentially with k
...
But
even with today’s high-speed computers, it is still a computational challenge to
repeatedly check and sort billions of possible clusters
...
There are two broad categories of clustering
techniques
...
In the second, which can
be called non-hierarchical clustering, the number (k) of clusters is predetermined
and one simply finds the best possible partition of the samples into k clusters
...
We first need to define what we use as measure of similarity, or dissimilarity
...
We are looking at six
genes and the expression of these genes under treatment and control conditions
...
Gene
Treatment
Control
G1
0
...
5
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
2
0
...
3
0
...
9
0
...
5
0
...
3
-0
...
6 -0
...
2 0
...
2 0
...
6 0
...
0
treatement
The data are plotted in Figure 16
...
4
1
6
3
2
5
-0
...
4 -0
...
0 0
...
4 0
...
8
control
Figure 16
...
For continuous data
dissimilarity is measured using traditional distance metrics, the most obvious
one being the Euclidean (geometric) distance
...
+ ( x p − y p ) =
p
∑ (x − y )
i
2
i
i =1
In our case, we only have two variables, so p=2, and for genes 4 and 6 the
distance is
d (G 4, G 6) =
2
( 0
...
3) + ( 0
...
5) )
2
= 0
...
49 = 0
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
5,
...
4,
...
5,-
...
5,
...
3,
...
5,
...
4243
0
...
4123
0
...
9220 0
...
0000 0
...
8062 1
...
0198 1
...
9000 0
...
281
Two other popular distance metrics used for clustering are the city block, or
Manhattan distance (or L1 norm) and the max component distance
...
Any
recognizable abbreviation is sufficient
...
> # Manhattan Distances
> dist(genes,'manh')
1
2
3
4
5
2 0
...
3 0
...
7 1
...
8
5 1
...
0 0
...
7
6 1
...
4 0
...
3 1
...
3
3 0
...
4
4 0
...
7 0
...
0 0
...
8 1
...
0 1
...
9 0
...
The major advantage of this type of
clustering comes from computational efficiency
...
As the name
suggests, the aim is to optimally assign n data points to k clusters (k < n) where
optimal means that each point belongs to the cluster whose mean it is closest to
with respect to a chosen dissimilarity metric
...
Mathematically, K-means clustering works in a three steps algorithm:
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
Calculated the centroid (=mean
value or center) coordinates for every cluster selected
•
Second, check through the data, assigning each data item to the cluster
whose centroid is nearest (based on a distance metric)
...
•
Third, repeat the second step until no more reassignment takes place
...
In some versions the new
centroid coordinates are calculated only after all data points have been checked
and possibly reassigned
...
> clus <- kmeans(genes, 2, 20)
> # Note: the second number (20) denotes the number of iterations
> clus
$cluster
[1] 2 1 2 2 1 2
$centers
[,1] [,2]
1 0
...
15
2 0
...
50
$withinss
[1] 0
...
85
$size
[1] 2 4
> plot(genes, pch = clus$cluster,xlab='control',ylab='treatment')
> # Plot Cluster Centers
> points(clus$centers,pch=8)
Cluster 2
Cluster 1
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
15
...
( ∗ denotes
cluster means)
We can see by eye from that all 6 genes are closest (in Euclidean distance) to
their respective cluster means
...
Furthermore, the within cluster means that are used as centroids are equally sensitive to outliers
...
The R package cluster is a suite of robust
clustering methods for the most part based on the text by Kaufman and
Rousseeuw (1990)
...
This package is highly recommended as an add-on to the
standard mva package
...
The median is
insensitive, or robust, to outliers, no matter how extreme an outlier
...
e
...
For clustering of p-dimensional data
we consider so-called medoids as centers for the clusters
...
Medoids can be
defined for any distance metric, but usually the Manhattan metric is chosen in
order to reduce the influence of potential outliers
...
The input for the command can
be either a data matrix or, instead, a distance matrix
...
Standardizing
variables is highly recommended when the variables represent measurements
that do not have a common natural scale
...
Hierarchical Clustering
In most large data in bioinformatics, such as in gene expression microarrays,
there is no a-priori information as to what the number of clusters should be
...
315
clustering calculations from small to large number of clusters
...
Such a restriction results in what is called
hierarchical clustering
...
There are two ways to do hierarchical clustering: (1) from small to large (socalled agglomerative), and (2) from large to small (so-called divisive)
...
Clusters are then successively combined until ultimately all data reside in a
single big cluster
...
The question
of course arises as to when to stop the procedure
...
Therefore it is best to stop before branches in successive
splits become crowded over short distances
...
At any given step in the algorithm, all pairs of
dissimilarities between clusters are examined, and the pair of clusters with
smallest dissimilarity will be joined to form a single cluster
...
These are
commonly called the methods of linkage
...
In the average linkage
method the distance between two clusters is the average of the dissimilarities
between the points in one cluster and the points in the other cluster
...
In the complete linkage method,
we use the largest dissimilarity between a point in the first cluster and a point in
the second cluster
...
The following graph illustrates the
first three different linkage methods
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
16: Linkage Methods for Hierarchical Clustering
We examine the different linkage methods for the sleep data
...
We note
that agnes is virtually identical to the command hclust of the mva package but
is more convenient to use when standardization of the data is necessary
...
The
variable with the largest variance will have the largest impact on the clustering
...
The plot command provides two graphs, (1) banner plot, (2)
dendrogram
...
It is obtained using the
which=2 option
...
We display dendrograms of the clustering that
results from the three linkage methods average, complete, and single
...
In order to avoid
cluttering we create simple numberings for labels
...
names(sleep1) <- 1:n
cl1 <- agnes(sleep1[,-1],method='aver',metric='eucl',stand=T)
cl2 <- agnes(sleep1[,-1],method='comp',metric='eucl',stand=T)
cl3 <- agnes(sleep1[,-1],method='sing',metric='eucl',stand=T)
plot(cl1,which=2,main="Average Linkage")
plot(cl2,which=2,main="Complete Linkage")
plot(cl3,which=2,main="Single Linkage")
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
91
16
48
30
1
11
25
35
15
19
2
5
40
41
45
9
42
17
22
3
47
14
33
7
39
23
21
36
12
8
28
34
10
27
26
43
38
29
44
37
4
6
18
20
31
46
24
32
13
2
0
Height
4
6
Complete Linkage
Agglomerative Coeff icient = 0
...
0
0
...
8
Figure 16
...
Agglomerative Clustering of Sleep Data using
Three Different Linkage Methods
Divisive Hierarchical Clustering
Divisive clustering basically works in the opposite direction of agglomerative
clustering
...
In subsequent steps clusters are divided into subclusters
...
At any given stage, which cluster should be
divided next? The R command diana of the library cluster chooses at each
stage the cluster with the largest diameter, which is the largest dissimilarity
between any two of its observations
...
This point will
initiate the new cluster split
...
The command diana works just as the
other commands in the cluster library, such as agnes: It takes as input a data
matrix or data frame, or instead a dissimilarity matrix, and has options for
metric, and for standardization
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
In the past 5 years since the inception of this book (which
originated as a contract in the bioinfomatics series at a major publisher, which
had financial problems and cancelled the series) the number of R applications
has exploded
...
There are some books out there already, mostly for
microarray applications
...
The reader should install interesting packages
and run the examples in them to explore functionality of R of interest
...
319
Resources for Further Study
This is a list of some of the reference material I used in preparing this
manuscript
...
P
...
Springer
...
Assumes no advanced
mathematics beyond algebra
...
J
...
Pinheiro, D
...
Bates
...
An
advanced title covering liner and nonlinear mixed effect models
...
B
...
Ripley and V
...
Venables
...
A more advanced book using S
...
Includes some coverage of R but more specific to SPlus
...
B
...
Ripley and V
...
Venables
...
For more advanced
users interested in programming
...
The Cartoon Guide to Statistics L
...
Smith
...
A
beginner’s introduction to statistical concepts
...
J
...
Philips
...
H
...
Nonmathematical conceptual introduction to statistics
...
320
Mathematical Statistics and Data Analysis, 2nd Ed
...
A
...
Duxbury
...
Mathematically advanced
(calculus based)
...
J
...
Devore
...
Written at the level of (very) elementary calculus
...
Probability and Statistical Inference, 6th Ed
...
V
...
Tanis
...
Probability Theory
A First Course in Probability, 6th Ed
...
Ross
...
Easy to understand
comprehensive elementary text on probability models and theory
...
W
...
Wiley
...
Schaum's Outline of Probability
...
Lipschutz
...
An
inexpensive review of probability
...
A
...
B
...
S
...
CRC
Press
...
Mathematically advanced
...
J
...
CRC
Press
...
Statistics: A Bayesian Perspective
...
A
...
Duxbury
...
Markov Chains
Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference
...
Gamerman
...
Short but complete coverage of the theory of MCMC
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
W
...
Gilks, D
...
Spiegelhalter and S
...
CRC Press
...
Monte Carlo Statistical Methods
...
P
...
Casella
...
Graduate-level text includes extensive coverage of Markov Chains
...
Practical Nonparametric Statistics, 3rd Ed
...
J
...
Wiley
...
Written at the level of algebra and very
understandable
...
Design and Analysis of Experiments
...
D
...
Montgomery
...
Wiley
...
Design of Experiments: Statistical Principles of research Design and Analysis,
2nd Ed
...
O
...
Duxbury
...
Regression and Linear Modeling
Applied Linear Statistical Models
...
Advanced undergraduate level
text contains coverage of regression, ANOVA, experimental design and liner
models
...
Kleinbaum et al
...
Generalized Linear Models, 2nd Ed
...
A
...
McCullagh
...
Graduate level text on GLMs
...
322
Multivariate Statistics
Applied Multivariate Statistical Analysis, 5th Ed
...
A
...
W
...
Pearson Education
...
Written at a reasonable mathematical level
...
T
...
Tibshirani, and J
...
Springer
...
Contains numerous examples
using gene expression data
...
Multivariate Statistical Analysis: A Conceptual Introduction
...
K
...
Radius
...
Using Multivariate Statistics, 4th Ed
...
G
...
S
...
Perason
Education
...
Bioinformatics
Bioinformatics: Sequence and Genome Analysis
...
Cold Spring Harbor
Press
...
Bioinformatics for Dummies
...
M
...
Notredame
...
Non-technical introduction to bioinformatics
...
C
...
Jambeck
...
A Primer of Genome Science
...
Gibson and S
...
Muse
...
Covers the
empirical side of bioinformatics, including genome projects, sequencing,
microarrays, and proteomics
...
Statistical Methods in Bioinformatics: An Introduction
...
J
...
R
...
Springer
...
Quantitative Genetics
Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics
...
Sorensen and D
...
Springer
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
A
...
Paterson (Ed
...
Comprehensive coverage of QTL analysis
...
T
...
Speed (Ed)
...
Collection of essays by microarray authorities
...
G
...
Springer
...
Copyright May 2007, K Seefeld
Permission granted to reproduce for nonprofit, educational use
...
325