Title: Molecular Biology
Description: concepts of molecular biology and it will be helpful


BASICS ON MOLECULAR BIOLOGY

Cell – DNA – RNA – protein
Sequencing methods
Arising questions: how to handle the data and make sense of it
Lectures in the next two weeks: sequence alignment and genome
assembly

Cells






Fundamental working units of every living system
...

Prokaryotes and Eukaryotes are descended from primitive cells and the results of
3
...


Prokaryotes and Eukaryotes


According to the most recent
evidence, there are three
main branches to the tree of
life



Prokaryotes include Archaea
(“ancient ones”) and bacteria



Eukaryotes form the domain
Eukarya, which includes plants,
animals, fungi and certain
algae
The lecture on phylogenetic trees
covers this topic in more detail


All Cells have common Cycles

• Born, eat, replicate, and die

Common features of organisms


Chemical energy is stored in ATP



Genetic information is encoded by DNA



Information is transcribed into RNA




There is a common triplet genetic code
– some variations are known, however
Translation into proteins involves ribosomes



Shared metabolic pathways



Similar proteins among diverse groups of organisms


All Life depends on 3 critical molecules
• DNAs (Deoxyribonucleic acid)
– Hold information on how the cell works

• RNAs (Ribonucleic acid)
– Act to transfer short pieces of information to different parts of the cell
– Provide templates for protein synthesis

• Proteins
– Form enzymes that send signals to other cells and regulate gene
activity
– Form body’s major components

DNA structure


DNA has a double helix structure;
each nucleotide is composed of
– a sugar molecule
– a phosphate group
– and a base (A, C, G, T)

By convention, we read DNA
strings in the direction of
transcription: from the 5’ end to
the 3’ end
5’ ATTTAGGCC 3’
3’ TAAATCCGG 5’


DNA is contained in chromosomes

In eukaryotes, DNA is packed into linear chromosomes

In prokaryotes, DNA is usually contained in a single, circular
chromosome

http://en
...
org/wiki/Image:Chromatin_Structures
...
wikipedia
...
It is usually only a single strand
...


tRNA linear and 3D view:

http://www
...
ucsf
...
gif

DNA, RNA, and the Flow of Information
Replication

Transcription

“The central dogma”

Translation
Is this true?


Denis Noble: The principles of Systems Biology illustrated using the virtual heart
http://velblod
...
net/2007/pascal/eccs07_dresden/noble_denis/eccs07_noble_psb_01
...
wikimedia
...
png

Amino acids


How does DNA/RNA code for protein?






The DNA alphabet contains four
letters but must specify a protein,
or polypeptide, sequence written
in an alphabet of 20 letters
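A quick count shows why the genetic code uses triplets: with four letters there are 4^k distinct words of length k, and 3 is the smallest k for which 4^k reaches 20. The sketch below only does this arithmetic; the function name `min_codon_length` is made up for illustration.

```python
def min_codon_length(alphabet_size: int = 4, targets: int = 20) -> int:
    """Smallest word length k such that alphabet_size**k >= targets."""
    k = 1
    while alphabet_size ** k < targets:
        k += 1
    return k


for k in (1, 2, 3):
    # Number of distinct DNA words of length k: 4, 16, 64
    print(k, 4 ** k)

# Shortest word length that can name 20 different amino acids
print(min_codon_length())
```

Words of length 2 give only 16 combinations, which is too few for 20 amino acids, while triplets give 64, leaving room for several codons per amino acid.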
...




Proteins do all essential work for the cell
– build cellular structures
– digest nutrients
– execute metabolic functions
– mediate information flow within a cell and among cellular communities
...



Genes


“A gene is a union of genomic sequences encoding a coherent set of
potentially overlapping functional products”



A DNA segment whose information is expressed either as an RNA
molecule or protein
5’ … a t g a g t g g a … 3’
3’ … t a c t c a c c t … 5’
   (transcription)
   (translation)
MSG …
   (folding)
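The figure's example can be traced in a few lines, as a minimal sketch: the top strand atgagtgga is taken to be the coding strand, so the mRNA has the same letters with U in place of T, and the three codons translate to M, S and G. The codon table here is deliberately partial (only the three codons in the example), and the function names are illustrative.

```python
# Partial codon table: only the three codons needed for the slide's example.
# A complete table has 64 entries.
CODON_TABLE = {"AUG": "M", "AGU": "S", "GGA": "G"}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a coding (5'->3') DNA strand: same letters, T -> U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon, starting at its first base."""
    usable = len(mrna) - len(mrna) % 3          # ignore a trailing partial codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[c] for c in codons)


mrna = transcribe("atgagtgga")
print(mrna)             # AUGAGUGGA
print(translate(mrna))  # MSG
```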


http://fold
...






Prokaryotes are typically haploid:
they have a single (circular)
chromosome
DNA is usually inherited vertically
(parent to daughter)
Inheritance is clonal
– Descendants are faithful copies
of the ancestral DNA
– Variation is introduced via
mutations, transposable
elements, and horizontal transfer
of DNA
Chromosome map of S
...
mgc
...
cn/ShiBASE/circular_Sd197
...
Proc
...
9x10^9 bases)



Reads have to be assembled!




Problems







Sanger sequencing error rate per base varies from 1% to 3%
Repeats in DNA
– For example, the ~300-base-long Alu sequence is repeated over a million times in
the human genome
– Repeats occur at different scales
What happens if the repeat length is longer than the read length?
Shortest superstring problem
– Find the shortest string that “explains” the reads
– Given a set of strings (reads), find a shortest string that contains all of them
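To make the shortest superstring problem concrete, here is a brute-force sketch that tries every order of the reads and merges neighbours with the largest possible overlap. It is only feasible for a handful of reads, since the number of orders grows factorially, and it is not the method used by real assemblers. The toy reads and function names are hypothetical, not taken from the lecture.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(reads: list[str]) -> str:
    """Try every read order, merging each read with maximal overlap."""
    best = None
    for order in permutations(reads):
        merged = order[0]
        for read in order[1:]:
            if read in merged:          # already explained by the string so far
                continue
            merged += read[overlap(merged, read):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


# Hypothetical toy reads, not taken from the lecture.
print(shortest_superstring(["GCAT", "CATT", "ATGC"]))  # ATGCATT
```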

Sequence assembly and combination locks




What do sequence assembly and opening keypad locks have in common?

Whole-genome shotgun sequence


Whole-genome shotgun sequence assembly starts with a large sample of
genomic DNA
1
...

3
...



Sample is randomly partitioned into inserts of length > 500 bases
Inserts are multiplied by cloning them into a vector which is used to infect
bacteria
DNA is collected from bacteria and sequenced
Reads are assembled
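From the point of view of the assembler, the steps above amount to drawing many short substrings from unknown positions in the genome. The sketch below simulates only that sampling step, under simplifying assumptions: reads have a fixed length, come from one strand only, contain no errors, and start at uniformly random positions. The cloning and sequencing chemistry is not modelled, and `sample_reads` is an illustrative name.

```python
import random


def sample_reads(genome: str, read_len: int, coverage: float, seed: int = 0) -> list[str]:
    """Draw fixed-length reads from uniformly random start positions."""
    rng = random.Random(seed)
    # Coverage = (number of reads * read length) / genome length
    n_reads = round(coverage * len(genome) / read_len)
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


# The example sequence from the assembly example in these notes.
genome = ("CAGCGCGCTGCGTGACGAGTCTGACAAAGACGGTATGCGCATCG"
          "TGATTGAAGTGAAACGCGATGCGGTCGGTCGGTGAAGTTGTGCT")
reads = sample_reads(genome, read_len=8, coverage=2.0)
print(len(reads), reads[:3])
```

The read order returned here is the sampling order, which carries no information about position; recovering the positions is the job of the assembly step.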

Assembly of reads with Overlap-Layout-Consensus algorithm







Overlap
– Finding potentially overlapping reads
Layout
– Finding the order of reads along DNA
Consensus (Multiple alignment)
– Deriving the DNA sequence from the layout
Next, the method is described at a very abstract level, skipping a lot of details

Finding overlaps


First, pairwise overlap alignment of
reads is resolved



Reads can be from either DNA strand:
The reverse complement r* of each
read r has to be considered

acggagtcc
    agtccgcgctt

     r1
5’ … a t g a g t g g a … 3’
3’ … t a c t c a c c t … 5’
     r2

r1: tgagt, r1*: actca
r2: tccac, r2*: gtgga
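The reverse complement used for r1* and r2* can be computed by complementing every base and reversing the result, so that the new string again reads 5’ to 3’. A minimal sketch, with an illustrative function name:

```python
# Map each base to its Watson-Crick partner (lower and upper case).
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(read: str) -> str:
    """Complement every base, then reverse, so the result reads 5'->3'."""
    return read.translate(COMPLEMENT)[::-1]


print(reverse_complement("tgagt"))  # actca  (r1*)
print(reverse_complement("tccac"))  # gtgga  (r2*)
```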

Example sequence to assemble
5’ – CAGCGCGCTGCGTGACGAGTCTGACAAAGACGGTATGCGCATCG
TGATTGAAGTGAAACGCGATGCGGTCGGTCGGTGAAGTTGTGCT - 3’

• 20 reads (Read* is the reverse complement of Read):

 #   Read       Read*         #   Read       Read*
 1   CATCGTGA   TCACGATG     11   GGTCGGTG   CACCGACC
 2   CGGTGAAG   CTTCACCG     12   ATCGTGAT   ATCACGAT
 3   TATGCGCA   TGCGCATA     13   GCGCTGCG   CGCAGCGC
 4   GACGAGTC   GACTCGTC     14   GCATCGTG   CACGATGC
 5   CTGACAAA   TTTGTCAG     15   AGCGCGCT   AGCGCGCT
 6   ATGCGCAT   ATGCGCAT     16   GAAGTTGT   ACAACTTC
 7   ATGCGGTC   GACCGCAT     17   AGTGAAAC   GTTTCACT
 8   CTGCGTGA   TCACGCAG     18   ACGCGATG   CATCGCGT
 9   GCGTGACG   CGTCACGC     19   GCGCATCG   CGATGCGC
10   GTCGGTGA   TCACCGAC     20   AAGTGAAA   TTTCACTT

Finding overlaps


Overlap between two reads can
be found with a dynamic
programming algorithm
– Errors can be taken into account

Dynamic programming will be
discussed more during the next
two weeks

Overlap scores are stored in the
overlap matrix
– Entries (i, j) below the diagonal
denote overlap of read ri and rj*

Overlap(1, 6) = 3
6    ATGCGCAT
1         CATCGTGA

Overlap(1, 12) = 7
1    CATCGTGA
12    ATCGTGAT

Overlap matrix (excerpt):
       6    12
1      3     7
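The two overlap values on this slide can be reproduced with a much simpler stand-in for the dynamic programming algorithm: try every overlap length from longest to shortest and accept the first one whose suffix and prefix differ in at most a given number of positions. This is a sketch, not the lecture's algorithm; it handles substitutions only, not insertions or deletions. Read 1 is taken as CATCGTGA, the reverse complement of its listed Read* TCACGATG.

```python
def overlap(a: str, b: str, max_mismatches: int = 0) -> int:
    """Longest k such that the last k bases of a match the first k bases of b
    with at most max_mismatches differing positions."""
    for k in range(min(len(a), len(b)), 0, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


reads = {1: "CATCGTGA", 6: "ATGCGCAT", 12: "ATCGTGAT"}
print(overlap(reads[6], reads[1]))    # 3: suffix CAT of read 6 = prefix of read 1
print(overlap(reads[1], reads[12]))   # 7: suffix ATCGTGA of read 1 = prefix of read 12

# A hypothetical copy of read 1 with one wrong base (G -> C) no longer has an
# exact overlap of 7, but still overlaps by 7 when one mismatch is tolerated.
print(overlap("CATCGTCA", reads[12], max_mismatches=1))  # 7
```

With mismatches allowed, very short overlaps match almost trivially, so a practical version would also require a minimum overlap length.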

Finding layout & consensus


Method extends the assembly
greedily by choosing the best
overlaps



Both orientations are considered



Sequence is extended as far as
possible

Layout (figure labels: consensus sequence, ambiguous bases):

7*      GACCGCAT
6=6*    ATGCGCAT
14          GCATCGTG
1            CATCGTGA
12            ATCGTGAT
19        GCGCATCG
13*   CGCAGCGC
      ----------------
           CGCATCGTGAT

Finding layout & consensus


We move on to next best
overlaps and extend the
sequence from there



The method stops when there are
no more overlaps to consider



A number of contigs is produced



Contig stands for contiguous
sequence, resulting from merging
reads


2            CGGTGAAG
10         GTCGGTGA
11        GGTCGGTG
7     ATGCGGTC
      ---------------
      ATGCGGTCGGTGAAG
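The greedy extension described on these two slides can be sketched as follows: repeatedly merge the pair of sequences with the largest overlap until no overlaps remain, and return whatever contigs are left. This is a simplified illustration under stated assumptions: overlaps must be exact, reverse complements are ignored, and the consensus is taken directly from the merged strings rather than from a multiple alignment. Function names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str]) -> list[str]:
    """Repeatedly merge the pair with the largest overlap; return the contigs."""
    contigs = list(dict.fromkeys(reads))      # drop exact duplicates, keep order
    while len(contigs) > 1:
        k, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(contigs)
            for j, b in enumerate(contigs)
            if i != j
        )
        if k == 0:                            # no more overlaps to consider
            break
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)] + [merged]
    return contigs


# Reads 2, 10, 11 and 7 from the layout above.
print(greedy_assemble(["CGGTGAAG", "GTCGGTGA", "GGTCGGTG", "ATGCGGTC"]))
# ['ATGCGGTCGGTGAAG']
```

Reads without any overlap are returned as separate contigs, which is how the method ends up producing several contigs rather than one sequence.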

Whole-genome shotgun sequencing: summary

Figure labels: Original genome sequence; Reads; Non-overlapping read; Overlapping reads => Contig



Ordering of the reads is initially unknown



Overlaps resolved by aligning the reads



In a 3 x 10^9 bp genome with 500 bp reads and 5x coverage, there are ~10^7 reads and
~10^7 (10^7 - 1) / 2 = ~5 x 10^13 pairwise sequence comparisons
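The estimate can be checked with a few lines of arithmetic. The slide rounds the read count to a power of ten; plugging in the numbers without rounding gives 3 x 10^7 reads and roughly 4.5 x 10^14 unordered pairs. Either way, comparing every read against every other read is far too expensive. The function name below is illustrative.

```python
def read_pairs(genome_size: float, read_len: int, coverage: float) -> tuple[float, float]:
    """Return (number of reads, number of unordered read pairs)."""
    n_reads = coverage * genome_size / read_len
    return n_reads, n_reads * (n_reads - 1) / 2


# Values from the slide: 3 x 10^9 bp genome, 500 bp reads, 5x coverage
n, pairs = read_pairs(3e9, 500, 5)
print(f"{n:.1e} reads, {pairs:.1e} pairs")

# The slide's rounded figure: n(n - 1)/2 with n = 10^7
print(f"{1e7 * (1e7 - 1) / 2:.1e}")
```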


Repeats in DNA and genome assembly
Two instances of the same repeat


Repeats in DNA cause problems in
sequence assembly






Recap: if repeat length exceeds read length, we might not get the correct
assembly
This is a problem especially in eukaryotes
– ~3
...

2
...
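A small constructed example shows why repeats longer than the reads are a problem. Two different genomes that contain the same repeat three times, with the segments between the copies swapped, produce exactly the same set of short reads, so no assembler can tell them apart. Once the reads are long enough to span the repeat, the two genomes become distinguishable. The repeat, the genomes and the read lengths are all hypothetical, and reads are idealised as error-free substrings covering every position.

```python
from collections import Counter


def kmers(genome: str, k: int) -> Counter:
    """Multiset of all length-k substrings, i.e. error-free reads of length k."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


R = "ACGT"                                          # hypothetical repeat, length 4
genome_1 = "TT" + R + "CC" + R + "GA" + R + "TT"
genome_2 = "TT" + R + "GA" + R + "CC" + R + "TT"    # middle segments swapped

print(genome_1 != genome_2)                         # True
print(kmers(genome_1, 4) == kmers(genome_2, 4))     # True: reads cannot tell them apart
print(kmers(genome_1, 7) == kmers(genome_2, 7))     # False: reads span the repeat
```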


...
wikipedia
...
, A Whole-Genome Assembly of Drosophila,
Science 24, 2000
Genome size 120 Mbp

Sequencing of the Human Genome


The (draft) human genome was
published in 2001



Two efforts:
– Human Genome Project (public
consortium)
– Celera (private company)



HGP: BAC-by-BAC approach



Celera: whole-genome shotgun
sequencing

HGP: Nature 15 February 2001
Vol 409 Number 6822

Celera: Science 16 February 2001
Vol 291, Issue 5507


Next-gen sequencing: 454



Sanger sequencing is the prominent first-generation sequencing method
Many new sequencing methods are emerging



Genome Sequencer FLX (454 Life Science / Roche)
– >100 Mb / 7
...
5% accuracy / base in a single run
– >99
...

DNA fragments and agarose beads carrying
oligonucleotides complementary to the
adapters at the fragment ends are mixed in an
approximately 1:1 ratio
...

The resulting beads are decorated with
approximately 1 million copies of the original
single-stranded fragment, which provides
sufficient signal strength during the
pyrosequencing reaction that follows to detect
and record nucleotide incorporation events
...


Next-gen sequencing: Illumina Solexa



Illumina / Solexa Genome Analyzer
– Read length 35 - 50 bp
– 1-2 Gb / 3-6 day run
– > 98
...
99% accuracy / consensus with 3x coverage

The Illumina sequencing-by-synthesis
approach
...
The cluster strands are extended
by one nucleotide
...
Once imaging is
completed, chemicals that effect cleavage of
the fluorescent labels and the 3’-OH blocking
groups are added to the flow cell, which
prepares the cluster strands for another round
of fluorescent nucleotide incorporation
...
94% accuracy / base
– >99
...
In a manner similar to Roche/454 emulsion PCR amplification, DNA
fragments for SOLiD sequencing are amplified on the surfaces of 1-µm magnetic
beads to provide sufficient signal during the sequencing reactions, and are then
deposited onto a flow cell slide
...
Each ligation step
is followed by fluorescence detection, after which a regeneration step removes
bases from the ligated 8mer (including the fluorescent group) and concomitantly
prepares the extended primer for another round of ligation
...
Because each fluorescent group on a ligated 8mer identifies a
two-base combination, the resulting sequence reads can be screened for
base-calling errors versus true polymorphisms versus single base deletions by aligning
the individual reads to a known high-quality reference sequence
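The idea behind two-base encoding can be illustrated with a small sketch. Each pair of adjacent bases is mapped to one of four colors, so every base contributes to two neighbouring colors. A true single-base change in the DNA therefore alters two adjacent colors, while an isolated color change is more likely a measurement error. The numeric base codes and color labels 0-3 below are an assumption made for illustration (the text above only says that each fluorescent group identifies a two-base combination), and the sequences are hypothetical.

```python
# Assumed 2-bit codes; the color of a base pair is the XOR of its two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq: str) -> list[int]:
    """Encode each pair of adjacent bases as one of four colors (0-3)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


reference = "ATGGCA"
variant = "ATCGCA"      # hypothetical single-base change (third base G -> C)

ref_colors = to_colors(reference)
var_colors = to_colors(variant)
changed = [i for i, (x, y) in enumerate(zip(ref_colors, var_colors)) if x != y]

print(ref_colors)   # [3, 1, 0, 3, 1]
print(var_colors)   # [3, 2, 3, 3, 1]
print(changed)      # [1, 2]: two adjacent colors differ
```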