Title: bioinformatics
Description: Abstract: A flood of data means that many of the challenges in biology are now challenges in computing. Bioinformatics, the application of computational techniques to analyse the information associated with biomolecules on a large-scale, has now firmly established itself as a discipline in molecular biology, and encompasses a wide range of subject areas from structural biology, genomics to gene expression studies. In this review we provide an introduction and overview of the current state of the field. We discuss the main principles that underpin bioinformatics analyses, look at the types of biological information and databases that are commonly used, and finally examine some of the studies that are being conducted, particularly with reference to transcription regulatory systems



Review Paper

N
...
Luscombe,
D
...
Gerstein

Review

Department of Molecular Biophysics
and Biochemistry
Yale University
New Haven, USA

What is bioinformatics? An
introduction and overview
Abstract: A flood of data means that many of the challenges in biology are now challenges
in computing
...

In this review we provide an introduction and overview of the current state of the field
...


Introduction
Biological data are being produced
at a phenomenal rate [1]
...
On average,
these databases are doubling in
size every 15 months [2]
...
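As a rough illustration of what a 15-month doubling time implies, the sketch below computes the expected growth factor after a given number of months. It assumes idealised exponential growth at a constant rate; the function name `growth_factor` is illustrative and is not taken from the review.

```python
def growth_factor(months, doubling_time=15.0):
    """Factor by which a database grows in `months`, assuming it doubles
    every `doubling_time` months (idealised exponential growth)."""
    return 2.0 ** (months / doubling_time)


if __name__ == "__main__":
    # Five years at a 15-month doubling time corresponds to four doublings.
    print(growth_factor(60))
```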

influenzae genome [4], complete
sequences for over 40 organisms
have been released, ranging from
450 genes to over 100,000
...

Yearbook of Medical Informatics 2001

Bioinformatics - a definition¹
(Molecular) bio – informatics: bioinformatics is conceptualising biology in
terms of molecules (in the sense of physical chemistry) and applying
"informatics techniques" (derived from disciplines such as applied maths,
computer science and statistics) to understand and organise the information
associated with these molecules, on a large scale
...

¹ As submitted to the Oxford English Dictionary

As a result of this surge in data,
computers have become indispensable
to biological research
...
Bioinformatics,
the subject of the current review, is
often defined as the application of
computational techniques to understand
and organise the information associated
with biological macromolecules
...
At the same time,
there have been major advances in the
technologies that supply the initial data;
Anthony Kerlavage of Celera recently
cited that an experimental laboratory
can produce over 100 gigabytes of
data a day with ease [5]
...

Aims of bioinformatics
The aims of bioinformatics are threefold
...
While data-curation is an
essential task, the information stored
in these databases is essentially useless until analysed
...

The second aim is to develop tools and
resources that aid in the analysis of
data
...
This needs more than
just a simple text-based search and
programs such as FASTA [8] and
PSI-BLAST [9] must consider what
comprises a biologically significant
match
...
The third aim is to
use these tools to analyse the data and
interpret the results in a biologically
meaningful manner
...
In bioinformatics, we can now
conduct global analyses of all the
available data with the aim of uncovering common principles that apply
across many systems and highlight
novel features
...
We focus on
the first and third aims just described,
with particular reference to the keywords underlined in the definition: information, informatics, organisation, understanding, large-scale and
practical applications
...


“…the INFORMATION
associated with these
molecules…”
Table 1 lists the types of data that are
analysed in bioinformatics and the range
of topics that we consider to fall within
the field
...
We also give approximate
values describing the sizes of data being
discussed
...
Raw DNA
sequences are strings of the four base-letters comprising genes, each typically
1,000 bases long
...
5 billion
bases in 8
...
At the next
level are protein sequences comprising
strings of 20 amino acid-letters
...
Sources of data used in bioinformatics, the quantity of each type of data that is currently
(August 2000) available, and bioinformatics subject areas that utilise this data
...
2 million sequences
(9
...
6 million –
3 billion bases each)

Characterisation of repeats
Structural assignments to genes
Phylogenetic analysis
Genomic-scale censuses
(characterisation of protein content, metabolic pathways)
Linkage analysis relating specific genes to diseases

Gene expression

largest: ~20 time
point measurements
for ~6,000 genes

Correlating expression patterns
Mapping expression data to sequence, structural and
biochemical data

11 million citations

Digital libraries for automated bibliographical searches
Knowledge databases of data from literature

Other data
Literature

Metabolic pathways

Pathway simulations

bacterial protein containing approximately 300 amino acids
...

There are currently 13,000 entries in
the Protein Data Bank, PDB, most
of which are protein structures
...


sequences
...


“… ORGANISE the information on a LARGE SCALE …”

Scientific euphoria has recently
centred on whole genome sequencing
...
6 million bases
in Haemophilus influenzae to 3 billion
in humans
...

We can now measure expression levels
of almost every gene in a given cell
on a whole-genome level although
public availability of such data is still
limited
...
Currently the largest
dataset for yeast has made approximately 20 time-point measurements
for 6,000 genes [10]
...


Redundancy and multiplicity of data
A concept that underpins most
research methods in bioinformatics is
that much of this data can be grouped
together based on biologically meaningful similarities
...
Genes can be clustered into those
with particular functions (eg enzymatic
actions) or according to the metabolic
pathway to which they belong [12],
although here, single genes may actually
possess several functions [13]
...
At a structural level, we
predict there to be a finite number of
different tertiary structures – estimates
range between 1,000 and 10,000 folds
[14,15] – and proteins adopt equivalent
structures even when they differ
greatly in sequence [16]
...


What is apparent from this list is the
diversity in the size and complexity of
different datasets
...
This
is partly related to the greater complexity and information-content of individual
structures compared to individual

There are common terms to describe
the relationship between pairs of
proteins or the genes from which they
are derived: analogous proteins have
related folds, but unrelated sequences,
while homologous proteins are both
sequentially and structurally similar
...
Among homologues,
it is useful to distinguish between
orthologues, proteins in different
species that have evolved from a
common ancestral gene, and
paralogues, proteins that are related by
gene duplication within a genome [19]
...

An important concept that arises
from these observations is that of a
finite “parts list” for different organisms
[21,22]: an inventory of proteins
contained within an organism, arranged
according to different properties such
as gene sequence, protein fold or
function
...
As the number of
different fold families is considerably
smaller than the number of gene
families, categorising the proteins by
fold provides a substantial simplification of the contents of a genome
...
As such, we expect
this notion of a finite parts list to become
increasingly common in future
genomic analyses
...
Below, we discuss the major
databases that provide access to the
primary sources of information, and
also introduce some secondary databases that systematically group the
data (Table 2)
...

Table 2
...


Database
Protein sequence
(primary)

URL

SWISS-PROT
PIR-International

www
...
ch/sprot/sprot-top
...
mips
...
mpg
...
bioinf
...
ac
...
ncbi
...
nih
...
fcgi?db=Protein

Protein sequence (secondary)
PROSITE
PRINTS
Pfam

www
...
ch/prosite
www
...
man
...
uk/dbbrowser/PRINTS/PRINTS
...
sanger
...
uk/Pfam/

Macromolecular
structures
Protein Data Bank (PDB)
Nucleic Acids Database (NDB)
HIV Protease Database
ReLiBase
PDBsum
CATH
SCOP
FSSP

www
...
org/pdb
ndbserver
...
edu/
www
...
gov/CRYS/HIVdb/NEW_DATABASE
www2
...
ac
...
html
www
...
ucl
...
uk/bsm/pdbsum
www
...
ucl
...
uk/bsm/cath
scop
...
cam
...
uk/scop
www2
...
ac
...
ncbi
...
nih
...
ebi
...
uk/embl
www
...
nig
...
jp

Genome sequences
Entrez genomes
GeneCensus
COGs

www
...
nlm
...
gov/entrez/query
...
mbb
...
edu/genome
www
...
nlm
...
gov/COG

Integrated databases
InterPro
Sequence retrieval system (SRS)
Entrez

www
...
ac
...
expasy
...
ncbi
...
nih
...
Primary databases contain
over 300,000 protein sequences and
function as a repository for the raw
data
...

Composite databases such as OWL
[24] and the NRDB [25] compile and
filter sequence data from different
primary databases to produce combined non-redundant sets that are more
complete than the individual databases
and also include protein sequence data
from the translated coding regions in
DNA sequence databases (see
below)
...
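The idea of compiling a non-redundant set from several primary databases can be sketched as follows. This is only a minimal illustration that treats two entries as redundant when their sequences are identical; the actual filtering rules used by OWL and the NRDB are not described here, and the function name and record layout are hypothetical.

```python
def merge_non_redundant(*databases):
    """Merge (identifier, sequence) records from several sources, keeping
    the first entry seen for each distinct sequence.

    Exact sequence identity (ignoring case) is an assumed redundancy
    criterion chosen for simplicity.
    """
    merged = {}
    for records in databases:
        for identifier, sequence in records:
            key = sequence.upper()
            if key not in merged:
                merged[key] = identifier
    # Sequences are returned in upper case, in order of first appearance.
    return [(identifier, sequence) for sequence, identifier in merged.items()]
```

Keeping the first identifier seen is an arbitrary choice; a real composite database would also need to reconcile annotations from the duplicate entries.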
One of the most
popular is PROSITE [26], a database
of short sequence patterns and profiles
that characterise biologically significant
sites in proteins
...
Motifs
are usually separated along a protein
sequence, but may be contiguous in
3D-space when the protein is folded
...
Finally, Pfam [28] contains
a large collection of multiple sequence
alignments and profile Hidden Markov
Models covering many common protein
domains
...

These different secondary databases
have recently been incorporated into a
single resource named InterPro [29]
...
The Protein Data
Bank, PDB [6,7], provides a primary
archive of all 3D structures for
macromolecules such as proteins,
RNA, DNA and various complexes
...
As the information provided in individual PDB
entries can be difficult to extract,
PDBsum [30] provides a separate Web
page for every structure in the PDB
displaying detailed structural analyses,
schematic diagrams and data on interactions between different molecules in
a given entry
...
All
comprise a hierarchical structural
taxonomy where groups of proteins
increase in similarity at lower levels
of the classification tree
...
These
include the Nucleic Acids Database,
NDB [34], for structures related to
nucleic acids, the HIV protease
database [35] for HIV-1, HIV-2 and
SIV protease structures and their
complexes, and ReLiBase [36] for
receptor-ligand complexes
...
The
GenBank [2], EMBL [37] and DDBJ
[38] databases contain DNA sequences for individual genes that encode
protein and RNA products
...

As whole-genome sequencing is
often conducted through international
collaborations, individual genomes are
published at different sites
...

In addition to providing the raw
nucleotide sequence, information is
presented at several levels of detail
including: a list of completed genomes,
all chromosomes in an organism,
detailed views of single chromosomes
marking coding and non-coding regions,
and single genes
...
For example,
annotations for single genes include
the translated protein sequence,
sequence alignments with similar genes
in other genomes and summaries of
the experimentally characterised or
predicted function
...
The database allows
building of phylogenetic trees based on
different criteria such as ribosomal
RNA or protein fold occurrence
...
The COGs database [20] classifies proteins encoded
in 21 completed genomes on the basis
of sequence similarity
...
The most straightforward
application of the database is to predict
the function of uncharacterised proteins
through their homology to characterised
proteins, and also to identify phylogenetic patterns of protein occurrence
– for example, whether a given COG
is represented across most or all
organisms or in just a few closely
related species
...

These experiments measure the
amount of mRNA or protein products
that are produced by the cell
...
The first method
measures relative levels of mRNA
abundance between different samples,
while the last two measure absolute
levels
...
For yeast, the Young [10],
Church [47] and Samson datasets [48]
use the GeneChip method, while the
Stanford cell cycle [49], diauxic shift
[50] and deletion mutant datasets [51]
use the microarray
...
For
humans, the main application has been
to understand expression in tumour
and cancer cells
...


The technologies for measuring
protein abundance are currently limited
to 2D gel electrophoresis followed by
mass spectrometry [54]
...
At present, data
from these experiments are only
available from the literature [56,57]
...
For instance, the 3D coordinates
of a protein are more useful if combined
with data about the protein’s function,
occurrence in different genomes, and
interactions with other molecules
...
Unfortunately, it is not
always straightforward to access and
cross-reference these sources of information because of differences in
nomenclature and file formats
...
At a more advanced
level, there have been efforts to
integrate access across several data
sources
...
Another is the Entrez facility
[39], which provides similar gateways
to DNA and protein sequences,
genome mapping data, 3D macromolecular structures and the PubMed
bibliographic database [60]
...


gene products, and large-scale analyses
of gene expression levels
...


“…UNDERSTAND and
organise the information…”

Other subject areas we have included
in Table 1 are development of digital
libraries for automated bibliographical
searches, knowledge bases of biological
information from the literature, DNA
analysis methods in forensics, prediction
of nucleic acid structures, metabolic
pathway simulations, and linkage analysis
– linking specific genes to different
disease traits
...
As shown in Table 1, the
broad subject areas in bioinformatics
can be separated according to the sources
of information that are used in the studies
...
For protein sequences, analyses include developing
algorithms for sequence comparisons
[63], methods for producing multiple
sequence alignments [64], and searching
for functional domains from conserved
sequence motifs in such alignments
...
These studies have led to molecular simulation topics in which structural
data are used to calculate the energetics
involved in stabilising macromolecular
structures, simulating movements within
macromolecules, and computing the
energies involved in molecular docking
...
Research includes
characterisation of protein content and
metabolic pathways between different
genomes, identification of interacting
proteins, assignment and prediction of

In addition to finding relationships
between different proteins, much of
bioinformatics involves the analysis of
one type of data to infer and understand
the observations for another type of
data
...
These methods,
especially the former, are often based on
statistical rules derived from structures,
such as the propensity for certain amino
acid sequences to produce different
secondary structural elements
...
Combined with similarity
measurements, these studies provide us
with an understanding of how much
biological information can be accurately
transferred between homologous
proteins [71]
...
The first is represented by the vertical axis in
the figure and outlines a possible approach
to the rational drug design process
...

Starting with a gene sequence, we can
determine the protein sequence with
strong certainty
...
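The step from gene sequence to protein sequence can be sketched as a codon-by-codon lookup. The example below assumes the standard genetic code, a coding sequence that is already in frame on the sense strand, and no introns; the function name `translate` is illustrative only.

```python
BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame coding sequence into one-letter amino acid
    codes, stopping at the first stop codon. Trailing bases that do not
    form a full codon are ignored; non-ACGT characters raise KeyError."""
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[start:start + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)
```

For real genomic DNA the reading frame, strand and exon boundaries must be established first, which is where much of the uncertainty in gene finding lies.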

Geometry calculations can define the
shape of the protein’s surface and
molecular simulations can determine the
force fields surrounding the molecule
...
In practice, the
intermediate steps are still difficult to
achieve accurately, and they are best
combined with experimental methods to
obtain some of the data, for example
characterising the structure of the protein
of interest
...
Initially,
simple algorithms can be used to compare the sequences and structures of a
pair of related proteins
...
Using this data, it is also
possible to construct phylogenetic trees
to trace the evolutionary path of proteins
...
Comparisons become more
complex, requiring multiple scoring
schemes, and we are able to conduct
genomic scale censuses that provide
comprehensive statistical accounts of
protein features, such as the abundance
of particular structures or functions in
different genomes
...

Fig
...
Paradigm shifts during the past couple of decades have taken much of biology away from the laboratory bench and have allowed the
integration of other scientific disciplines, specifically computing
...
The
vertical axis demonstrates how bioinformatics can aid rational drug design with minimal work in the wet lab
...
From there, we can determine the structure using structure prediction techniques
...
Finally docking algorithms can provide predictions of the ligands that will bind on the protein surface, thus paving the way for
the design of a drug specific to that molecule
...
Initially with a pair of proteins, we can make comparisons between the sequences and structures
of evolutionarily related proteins
...
Using multiple
sequences, we can also create phylogenetic trees to trace the evolutionary development of the proteins in question
...
Alignments now become more
complex, requiring sophisticated scoring schemes and there is enough data to compile a genome census – a genomic equivalent of a population
census – providing comprehensive statistical accounting of protein features in genomes
...
Briefly, for data
organisation, the first biological
databases were simple flat files
...
In
sequence analysis, techniques include
string comparison methods such as
text search and 1-dimensional alignment algorithms
...
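A 1-dimensional alignment can be illustrated with a small dynamic-programming routine. The sketch below computes a global alignment score in the style of Needleman-Wunsch; the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for simplicity, whereas real protein comparisons use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of strings a and b under a
    simple linear gap penalty (illustrative scoring scheme)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Only the score is returned here; recovering the alignment itself requires keeping the full matrix and tracing back from the final cell.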
3D
structural analysis techniques include
Euclidean geometry calculations
combined with basic application of
physical chemistry, graphical representations of surfaces and volumes,
and structural comparison and 3D
matching methods
...
In many of these areas,
the computational methods must be
combined with good statistical analyses
in order to provide an objective measure
for the significance of the results
...
In this section, we focus on the
studies that have contributed to our
understanding of transcription regulation in different organisms
...

We start by considering structural
analyses of how DNA-binding proteins
recognise particular base sequences
...
Finally, we provide an overview
of gene expression analyses that have
been recently conducted and suggest
future uses of transcription regulatory
analyses to rationalise the observations
made in gene expression experiments
...

Structural studies
As of August 2000, there were 379
structures of protein-DNA complexes
in the PDB
...

A structural taxonomy of DNA-binding proteins, similar to that
presented in SCOP and CATH, was
first proposed by Harrison [72] and
periodically updated to accommodate
new structures as they are solved [73]
...
Assembly
of such a system simplifies the
comparison of different binding
methods; it highlights the diversity of
protein-DNA complex geometries
found in nature, but also underlines the
importance of interactions between α-helices and the DNA major groove,
the main mode of binding in over half
the protein families
...
This
provides compact frameworks that
present the α-helix on the surfaces of
structurally diverse proteins
...

Although there are exceptions, the
former typically approach the DNA
from a single face and slot into the
grooves to interact with base edges
...

Focusing on proteins with α-helices,
the structures show many variations,
both in amino acid sequences and
detailed geometry
...
While achieving
a close fit between the α-helix and
major groove, there is enough flexibility
to allow both the protein and DNA to
adopt distinct conformations
...

They are commonly inserted in the
major groove sideways, with their
lengthwise axis roughly parallel to the
slope outlined by the DNA backbone
...

Given the similar binding orientations,
it is surprising to find that the interactions
between each amino acid position along
the α-helices and nucleotides on the
DNA vary considerably between
different protein families
...
The rules of
interactions are based on the simple
premise that for a given residue position
on α-helices in similar conformations,
small amino acids interact with
nucleotides that are close in distance
and large amino acids with those that
are further [76,77]
...
When considering
these interactions, it is important to
remember that different regions of the
protein surface also provide interfaces
with the DNA
...
Such analyses
are based on the premise that a
significant proportion of specific DNA-binding could be rationalised by a
universal code of recognition between
amino acids and bases, ie whether
certain protein residues preferably
interact with particular nucleotides
regardless of the type of protein-DNA
complex [79]
...

Results showed that about 2/3 of all
interactions are with the DNA
backbone and that their main role is
one of sequence-independent stabilisation
...
Such preferences were
explained through examination of the
stereochemistry of the amino acid side
chains and base edges
...
These results
suggested that universal specificity,
one that is observed across all protein-DNA
complexes, indeed exists
...

Armed with an understanding of
protein structure, DNA-binding motifs
and side chain stereochemistry, a major
application has been the prediction of
binding either by proteins known to
contain a particular motif, or those with
structures solved in the uncomplexed
form
...
In a different approach,
molecular simulation techniques have
been used to dock whole proteins and
DNAs on the basis of force-field
calculations around the two molecules
[84,85]
...
Comparisons between
bound and unbound nucleic acid
structures show that DNA-bending is
a common feature of complexes formed
with transcription factors [74, 86]
...
Therefore, it is now
clear that detailed rules for specific
DNA-binding will be family specific,
but with underlying trends such as the
arginine-guanine interactions
...
Identification of transcription
factors in genomes invariably depends
on similarity search strategies, which
assume a functional and evolutionary
relationship between homologous
proteins
...
coli, studies have so far
estimated a total of 300 to 500
transcription regulators [87] and
PEDANT [88], a database of automatically assigned gene functions,
shows that typically 2-3% of
prokaryotic and 6-7% of eukaryotic
genomes comprise DNA-binding
proteins
...

Nonetheless, they already represent a
large quantity of proteins and it is clear
that there are more transcription
regulators in eukaryotes than other
species
...

From the conclusions of the structural
studies, the best strategy for characterising DNA-binding of the putative
transcription factors in each genome is
to group them by homology and analyse
the individual families
...
Of even
greater use is the provision of structural
assignments to the proteins; given a
transcription factor, it is helpful to know
the structural motif that it uses for
binding, therefore providing us with a
better understanding of how it recognises the target sequence
...
These studies have shown that
prokaryotic transcription factors most
frequently contain helix-turn-helix
motifs [87,92] and eukaryotic factors
contain homeodomain type helix-turn-helix,
zinc finger or leucine zipper motifs
...
A study by Huynen and van
Nimwegen [93] has shown that members of a single family have similar
functions, but as the requirements of
this function vary over time, so does
the presence of each gene family in the
genome
...
The structural families
described above were expanded to
include proteins that are related by
sequence similarity, but whose
structures remain unsolved
...

Amino acid conservations were
calculated for the multiple sequence
alignments of each family [94]
...
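One simple way to quantify conservation in a multiple sequence alignment is sketched below. The measure used here, the fraction of sequences that share the most common residue in each column, is an assumption made for illustration; the measure used in the cited study [94] is not given in this extract.

```python
from collections import Counter


def column_conservation(alignment):
    """Return one score per alignment column: the fraction of sequences
    carrying the most common residue. Gaps ('-') never count as the
    conserved residue. Illustrative measure only."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # column made up entirely of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores
```

A score of 1.0 marks an absolutely conserved position, which is the kind of position the text associates with backbone-contacting residues.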

Residues that contact the DNA backbone are highly conserved in all protein
families, providing a set of stabilising
interactions that are common to all
homologous proteins
...
First, protein
families that bind non-specifically
usually contain several conserved base-contacting residues; without exception,
interactions are made in the minor
groove where there is little discrimination between base types
...
The second class
comprises families whose members all
target the same nucleotide sequence;
here, base-contacting positions are
absolutely or highly conserved allowing
related proteins to target the same
sequence
...
Here
protein residues undergo frequent
mutations, and family members can
be divided into subfamilies according
to the amino acid sequences at base-contacting positions; those in the
same subfamily are predicted to bind
the same DNA sequence and those
of different subfamilies to bind
distinct sequences
...
The combined analysis of
sequence and structural data described
by this study provided an insight into
how homologous DNA-binding
scaffolds achieve different specificities
by altering their amino acid sequences
...

Therefore, the relative abundance of
transcription regulatory families in a
genome depends, not only on the
importance of a particular protein
function, but also on the adaptability
of the DNA-binding motifs to
recognise distinct nucleotide
sequences
...

Given the knowledge of the transcription regulators that are contained
in each organism, and an understanding
of how they recognise DNA
sequences, it is of interest to search for
their potential binding sites within
genome sequences [95]
...
Additional
sites are found by conducting word-matching searches over the entire
genome and scoring candidate sites by
similarity [96-99]
...
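A word-matching search of this kind can be sketched as a sliding window that scores each position by its similarity to a consensus. The scoring used below (number of identical positions) and the threshold parameter are assumptions for illustration; published methods typically use position-specific weight matrices rather than a single consensus string.

```python
def find_sites(sequence, consensus, min_matches):
    """Slide `consensus` along `sequence` and return (start, matches)
    for every window with at least `min_matches` identical positions.
    Start positions are 0-based. Only one strand is searched."""
    sequence = sequence.upper()
    consensus = consensus.upper()
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        matches = sum(1 for x, y in zip(window, consensus) if x == y)
        if matches >= min_matches:
            hits.append((start, matches))
    return hits
```

A full genome scan would also search the reverse complement and would need a statistical estimate of how many hits are expected by chance.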
The consensus
search approach is often complemented
by comparative genomic studies
searching upstream regions of
orthologous genes in closely related
organisms
...
coli DNA-regulatory motifs
are conserved in one or more distantly
related bacteria [100]
...

However, initial studies in S
...

While the 5 base-pair GATA
consensus sequence is found almost
everywhere in the genome, a single
isolated binding site is insufficient to
exert the regulatory function [101]
...
An initial study has
used this observation to predict new
regulatory sites by searching for over-represented oligonucleotides in non-coding regions of yeast and worm
genomes [102,103]
...
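The search for over-represented oligonucleotides can be illustrated by counting every word of length k and comparing each count with a background expectation. The sketch below assumes a uniform background in which all 4^k words are equally likely, which is a deliberate simplification; the cited studies [102,103] use their own statistics, which are not described in this extract. The function names and the `min_ratio` parameter are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count all overlapping words of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def over_represented(sequence, k, min_ratio=2.0):
    """Return {word: observed/expected} for words whose observed count is
    at least `min_ratio` times the count expected under a uniform
    background (an assumed, simplistic null model)."""
    counts = kmer_counts(sequence, k)
    windows = sum(counts.values())
    expected = windows / 4 ** k
    return {
        word: n / expected
        for word, n in counts.items()
        if n / expected >= min_ratio
    }
```

On short sequences almost every observed word exceeds a uniform expectation, so in practice the background is estimated from the genome itself and the results are assessed for statistical significance.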

Generally, binding sites are assumed to
be located directly upstream of the
regulons; however there are different
problems associated with this assumption depending on the organism
...
It is often difficult to predict
the organisation of operons [104],
especially to define the gene that is
found at the head, and there is often a
lack of long-range conservation in gene
order between related organisms [105]
...

Despite these problems, these
studies have succeeded in confirming
the transcription regulatory pathways
of well-characterised systems such as
the heat shock response system [99]
...

Gene expression studies
Many expression studies have so
far focused on devising methods to
cluster genes by similarities in
expression profiles
...
Briefly, the most
common methods are hierarchical
clustering, self-organising maps, and
K-means clustering
...
In
contrast, the self-organising map [109,
110] and K-means methods [111]
employ a “top-down” approach in which
the user pre-defines the number of
clusters for the dataset
...
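The "top-down" character of K-means, in which the number of clusters is fixed in advance, can be seen in a minimal sketch. The version below assumes Euclidean distance between expression profiles and takes the starting centroids from the caller; both choices, and the function name, are illustrative rather than those of any cited study.

```python
import math


def kmeans(profiles, initial_centroids, iterations=10):
    """Cluster expression profiles (equal-length lists of numbers) into
    len(initial_centroids) groups. Returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in profiles
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(profiles, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

The result depends on the starting centroids, which is one reason the choice of cluster number and initialisation matters when interpreting expression clusters.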

Given these methods, it is of interest
to relate the expression data to other
attributes such as structure, function
and subcellular localisation of each
gene product
...
In
yeast, shorter proteins tend to be more
highly expressed than longer proteins,
probably because of the relative ease
with which they are produced [112]
...
Turning to protein
structure, expression levels of the TIM
barrel and NTP hydrolase folds are
highest, while those for the leucine
zipper, zinc finger and transmembrane
helix-containing folds are lowest
...
This is also reflected
in the relationship with subcellular
localisations of proteins, where
expression of cytoplasmic proteins is
high, but nuclear and membrane
proteins tend to be low [114,115]
...
Conventional
wisdom is that gene products that
interact with each other are more likely
to have similar expression profiles than
if they do not [116,117]
...
While
expression profiles are similar for gene
products that are permanently associated, for example in the large ribosomal
subunit, profiles differ significantly for
products that are only associated
transiently, including those belonging
to the same metabolic pathway
...
In general, it has been
shown that different cell lines (eg
epithelial and ovarian cells) can be
distinguished on the basis of their
expression profiles, and that these
profiles are maintained when cells are
transferred from an in vivo to an in
vitro environment [120]
...
Comparative analysis can be
extended to tumour cells, in which the
underlying causes of cancer can be
uncovered by pinpointing areas of
biological variations compared to
normal cells
...
One of the
difficulties in cancer treatment has
been to target specific therapies to
pathogenetically distinct tumour types,
in order to maximise efficacy and
minimise toxicity
...
Although the distinction between
different forms of cancer – for example
subclasses of acute leukaemia – has
been well established, it is still not
possible to establish a clinical diagnosis
on the basis of a single test
...

As the approach does not require prior
biological knowledge of the diseases, it
may provide a generic strategy for
classifying all types of cancer
...
However, analysis in this area
is still limited to preliminary analyses of
expression levels in yeast mutants lacking
key components of the transcription
initiation complex [10,122]
...

Finding Homologues
As described earlier, one of the
driving forces behind bioinformatics is
the search for similarities between
different biomolecules
...

The most obvious is transferring information between related proteins
...

Specifically with structural data,
theoretical models of proteins are
usually based on experimentally solved
structures of close homologues [123]
...
Where biochemical or
structural data are lacking, studies could
be made in low-level organisms like
yeast and the results applied to
homologues in higher-level organisms
such as humans, where experiments
are more demanding
...
Homologue-finding is extensively used to confirm
coding regions in newly sequenced
genomes and functional data is frequently transferred to annotate individual genes
...

Ironically, the same idea can be
applied in reverse
...
On a smaller scale, structural
differences between similar proteins
may be harnessed to design drug
molecules that specifically bind to one
structure but not another
...
Figure 2
outlines the commonly cited approach,
taking the MLH1 gene product as an
example drug target
...
Through
linkage analysis and its similarity to
mmr genes in mice, the gene has
been implicated in nonpolyposis colorectal cancer [126]
...
Sequence search techniques
can then be used to find homologues in
model organisms, and based on
sequence similarity, it is possible to
model the structure of the human
protein on experimentally characterised
structures
...

Large-scale censuses
Although databases can efficiently
store all the information related to
genomes, structures and expression
datasets, it is useful to condense all this
information into trends
and facts that users can readily understand
...

This enables one to see whether they
are unusual in any way
...
For example,
are specific protein folds associated
with certain phylogenetic groups?
How common are different folds
within particular organisms? And to
what degree are folds shared between
related organisms? Does this extent of
sharing parallel measures of
relatedness derived from traditional
evolutionary trees? Initial studies show
that the frequency of folds differs
greatly between organisms and that
the sharing of folds between organisms
does in fact follow traditional
phylogenetic classifications [21,41]
...
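
The census questions above can be made concrete with a small sketch that computes how often each fold occurs within an organism and how much two organisms' fold repertoires overlap. It assumes each organism is represented by a list of fold assignments, one per protein. The organism names, fold lists and the choice of the Jaccard index as the sharing measure are illustrative assumptions, not the method of the cited studies.

```python
from collections import Counter


def fold_frequencies(assignments):
    """Map each organism to the relative frequency of every fold among its proteins."""
    frequencies = {}
    for organism, folds in assignments.items():
        counts = Counter(folds)
        total = sum(counts.values())
        frequencies[organism] = {fold: n / total for fold, n in counts.items()}
    return frequencies


def shared_fold_fraction(assignments, org_a, org_b):
    """Jaccard index of the fold sets of two organisms (0 = disjoint, 1 = identical)."""
    folds_a, folds_b = set(assignments[org_a]), set(assignments[org_b])
    union = folds_a | folds_b
    return len(folds_a & folds_b) / len(union) if union else 0.0


# Hypothetical toy assignments: one fold label per protein.
assignments = {
    "organism_1": ["TIM barrel", "Rossmann fold", "Rossmann fold", "ferredoxin-like"],
    "organism_2": ["TIM barrel", "Rossmann fold", "immunoglobulin-like"],
}
print(fold_frequencies(assignments)["organism_1"])
print(shared_fold_fraction(assignments, "organism_1", "organism_2"))
```

Pairwise sharing values of this kind could then be compared with distances from a traditional evolutionary tree to see whether the two measures of relatedness agree.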

Fig
...
Above is a schematic outlining how scientists can use bioinformatics to aid rational drug discovery
...
Through linkage analysis and its similarity to mmr genes in mice, the gene has been
implicated in nonpolyposis colorectal cancer
...
Sequence search techniques can be used to find homologues in model organisms, and based on sequence
similarity, it is possible to model the structure of the human protein on experimentally characterised structures
...


As we discussed earlier, one of the
most exciting new sources of genomic
information is the expression data
...
Further genomic scale data
that we can consider in large-scale
surveys include the subcellular
localisations of proteins and their interactions with each other [127-129]
...

Further applications in medical
sciences
Most recent applications in the
medical sciences have centred on
gene expression analysis [130]
...
Identification of genes that are expressed
differently in affected cells provides
a basis for explaining the causes of
illnesses and highlights potential drug
targets
...
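
A minimal sketch of how differently expressed genes might be flagged is given below: it compares expression levels in affected and normal cells using a log2 fold change and reports genes whose change exceeds a threshold. The gene names, expression values, pseudocount and threshold are hypothetical. A real analysis would also need replicate measurements, normalisation and a statistical test, all of which this sketch omits.

```python
import math


def log2_fold_changes(affected, normal, pseudocount=1.0):
    """log2(affected / normal) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((affected[gene] + pseudocount) / (normal[gene] + pseudocount))
        for gene in affected
        if gene in normal
    }


def differentially_expressed(affected, normal, threshold=1.0):
    """Genes whose absolute log2 fold change is at least the threshold, sorted by name."""
    changes = log2_fold_changes(affected, normal)
    return sorted(gene for gene, change in changes.items() if abs(change) >= threshold)


# Hypothetical expression levels (arbitrary units) for three made-up genes.
affected_cells = {"gene_a": 799.0, "gene_b": 99.0, "gene_c": 24.0}
normal_cells = {"gene_a": 99.0, "gene_b": 99.0, "gene_c": 99.0}
print(differentially_expressed(affected_cells, normal_cells))
```

A threshold of 1.0 corresponds to a two-fold change in either direction, so both over-expressed and under-expressed genes are reported as candidates.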
Given
a lead compound, microarray experiments can then be used to evaluate
responses to pharmacological intervention [135,136], and also provide
early tests to detect or predict the
toxicity of trial drugs
...

A typical scenario for a patient may
start with post-natal genotyping to
assess susceptibility or immunity to
specific diseases and pathogens
...

Regular lifetime screenings could lead
to guidance on nutritional intake and
early detection of any illnesses [137]
...
Given
the present rate of development, such
a scenario in healthcare appears to be
possible in the not too distant future
...
Originally developed for the
analysis of biological sequences, bioinformatics now encompasses a wide
range of subject areas including structural biology, genomics and gene expression studies
...
In
particular, we discussed the types of
biological information and databases
that are commonly used, examined
some of the studies that are being
conducted – with reference to transcription regulatory systems – and finally
looked at several practical applications
of the field
...
First is
that of comparing and grouping the
data according to biologically meaningful similarities and second, that of
analysing one type of data to infer and
understand the observations for another
type of data
...
As a
result, bioinformatics has not only
provided greater depth to biological
investigations, but added the dimension
of breadth as well
...

Acknowledgements

We thank Patrick McGarvey for comments
on the manuscript
...


References
1
...
It’s sink or swim as a tidal
wave of data approaches
...

2
...

GenBank
...

3
...
The SWISS-PROT
protein sequence database and its
supplement TrEMBL in 2000
...

4
...
Whole-genome random sequencing
and assembly of Haemophilus influenzae
Rd
...

5
...
The Economist 26 June
1999
...
Bernstein FC, Koetzle TF, Williams GJ,
Meyer EF, Jr
...


18
...


20
...


22
...
A computer-based
archival file for macromolecular structures
...

Berman HM, Westbrook J, Feng Z, Gilliland
G, Bhat TN, Weissig H, et al
...
Nucleic Acids Res
2000;28(1):235-42
...
Improved tools
for biological sequence comparison
...

Altschul SF, Madden TL, Schaffer AA,
Zhang J, Zhang Z, Miller W, et al
...
Nucleic
Acids Res
...

Holstege FC, Jennings EG, Wyrick JJ, Lee TI,
Hengartner CJ, Green MR, Golub TR,
Lander ES, Young RA
...

Cell 1998;95(5):717-728
...
A
DNA structural atlas for Escherichia coli
...

Kanehisa M, Goto S
...
Nucleic
Acids Res 2000;28(1):27-30
...
Moonlighting proteins
...

Chothia C
...
One thousand families
for the molecular biologist [news]
...

Orengo CA, Jones DT, Thornton JM
...
Nature 1994;372(6507):631-4
...
How different amino
acid sequences determine similar protein
structures: the structure and evolutionary
dynamics of the globins
...

Russell RB, Saqi MA, Sayle RA, Bates PA,
Sternberg MJ
...
J Mol
Biol 1997;269(3):423-39
...
Recognition of analogous
and homologous protein folds—assessment
of prediction success and associated
alignment accuracy using empirical
substitution matrices
...

Fitch WM
...
Syst Zool
1970;19:99-110
...
A
genomic perspective on protein families
...

Gerstein M, Hegyi H
...
FEMS Microbiol Rev
1998;22(4):277-304
...
From genes to protein

23
...


25
...


27
...


29
...


31
...


33
...


35
...


37
...
TIBtech 2000;18:34-39
...
PIR: a new resource for bioinformatics
...

Bleasby AJ, Akrigg D, Attwood TK
...
Nucleic Acids Res
1994;22(17):3574-3577
...
Construction of
validated, non-redundant composite protein
sequence databases
...

Hofmann K, Bucher P, Falquet L, Bairoch A
...

Nucleic Acids Res 1999;27(1):215-219
...

PRINTS-S: the database formerly known
as PRINTS
...

Bateman A, Birney E, Durbin R, Eddy SR,
Howe KL, Sonnhammer EL
...
Nucleic Acids
Res 2000;28(1):263-266
...

PRINTS prepares for the new millennium
...

Laskowski RA, Hutchinson EG, Michie
AD, Wallace AC, Jones ML, Thornton
JM
...
TIBS 1997;22(12):488-490
...
Assigning genomic
sequences to CATH
...

Lo Conte L, Ailey B, Hubbard TJ, Brenner
SE, Murzin AG, Chothia C
...

Nucleic Acids Res 2000;28(1):257-259
...
Touring protein fold
space with Dali/FSSP
...

Berman HM, Olson WK, Beveridge DL,
Westbrook J, Gelbin A, Demeny T, et al
...
A
comprehensive relational database of three-dimensional structures of nucleic acids
...

Vondrasek J, Wlodawer A
...
TIBS
1997;22(5):183
...
Databases for protein-ligand
complexes
...

Baker W, van den Broek A, Camon E,
Hingamp P, Sterk P, Stoesser G, et al
...

Nucleic Acids Res 2000;28(1):19-23
...
Okayama T, Tamura T, Gojobori T, Tateno
Y, Ikeo K, Miyazaki S, et al
...
Bioinformatics
1998;14(6):472-8
...
Schuler GD, Epstein JA, Ohkawa H, Kans
JA
...
Methods Enzymol
1996;266:141-62
...
Tatusova TA, Karsch-Mizrachi I, Ostell
JA
...

Bioinformatics 1999;15(7-8):536-43
...
Lin J, Gerstein M
...
Genome Res
2000;10(6):808-18
...
Eisen MB, Brown PO
...
Methods
Enzymol 1999;303:179-205
...
Cheung VG, Morley M, Aguilar F, Massimi
A, Kucherlapati R, Childs G
...
Nat Genet 1999;21(1
Suppl):15-9
...
Duggan DJ, Bittner M, Chen Y, Meltzer P,
Trent JM
...
Nat Genet 1999;21(1
Suppl):10-4
...
Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart
DJ
...
Nat Genet 1999;21(1):20-24
...
Velculescu VE, Zhang L, Zhou W, Traverso G, St
Croix B, Vogelstein B, Kinzler KW
...
1999
...
Roth FP, Hughes JD, Estep PW, Church GM
...
Nat
Biotechnol 1998;16(10):939-45
...
Jelinsky SA, Samson LD
...
Proc Natl Acad Sci U S A
1999;96(4):1486-91
...
Cho RJ, Campbell MJ, Winzeler EA,
Steinmetz L, Conway A, Wodicka L, et al
...
Mol Cell
1998;2(1):65-73
...
DeRisi JL, Iyer VR, Brown PO
...
Science
1997;278(5338):680-6
...
Winzeler EA, Shoemaker DD, Astromoff A,
Liang H, Anderson K, Andre B, et al
...
cerevisiae
genome by gene deletion and parallel analysis
...

52
...
Molecular
portraits of human breast tumours
...

53
...

Molecular classification of cancer: class
discovery and class prediction by gene
expression monitoring
...

54
...
2D protein
electrophoresis: can it be perfected? Curr
Opin Biotechnol 1999;10(1):16-21
...
Pandey A, Mann M
...
Nature 2000;405
(6788):837-46
...
Futcher B, Latter GI, Monardo P,
McLaughlin CS, Garrels JI
...
Mol Cell Biol
1999;19(11):7357-68
...
Gygi SP, Rist B, Gerber SA, Turecek F,
Gelb MH, Aebersold R
...
Nat Biotechnol
1999;17(10):994-9
...
Gerstein M
...
Nature Struct Biol
2000;7:960-3
...
Etzold T, Ulyanov A, Argos P
...
Methods Enzymol
1996;266:114-28
...
Wade K
...
Aviat Space
Environ Med 2000;71(5):559
...
Zhang MQ
...

Comput Chem 1999;23(3-4):233-50
...
Boguski MS
...
Science
1999;286(5439):453-5
...
Miller C, Gurd J, Brass A
...
Bioinformatics
1999;15(2):111-21
...
Gonnet GH, Korostensky C, Benner S
...
J Comput
Biol 2000;7(1-2):261-76
...
Orengo CA, Taylor WR
...
Methods Enzymol
1996;266:617-35
...
Orengo CA
...

Protein Sci 1999;8(4):699-715
...
Russell RB, Sternberg MJ
...
How good are we? Curr Biol
1995;5(5):488-90
...
Martin AC, Orengo CA, Hutchinson EG,
Jones S, Karmirantzou M, Laskowski RA,
et al
...
Structure
1998;6(7):875-84
...
Hegyi H, Gerstein M
...
J Mol Biol 1999;288(1):
147-64
...
Russell RB, Sasieni PD, Sternberg MJE
...
Binding site
similarity in the absence of homology
...

71
...

Assessing annotation transfer for genomics:
quantifying the relations between protein
sequence, structure and function through
traditional and probabilistic scores
...

72
...
A structural taxonomy of
DNA-binding
domains
...

73
...
An overview of the structures
of protein-DNA complexes
...

74
...
Protein-DNA interactions:
A structural analysis
...

75
...
Binding geometry
of alpha-helices that recognize DNA
...

76
...
Protein-DNA interactions: a 3D analysis of alpha-helix-binding in the major groove
...

77
...

DNA recognition code of transcription
factors
...

78
...
DNA recognition by a beta-sheet
...

79
...

Sequence specific recognition of double
helical nucleic acids by proteins
...

80
...
A framework for the DNA-protein recognition code of the probe helix
in transcription factors: the chemical and
stereochemical rules [see comments]
...

81
...
Comprehensive analysis of hydrogen
bonds in regulatory protein DNA-complexes: in search of common principles
...

82
...
Protein-DNA interactions: a 3D
analysis of amino acid-base interactions
...

83
...
A role for CH
...
J
Mol Biol 1998;277(5):1129-40
...
Sternberg MJ, Gabb HA, Jackson RM
...
Curr Opin Struct
Biol 1998;8(2):250-6
...
Aloy P, Moont G, Gabb HA, Querol E,

86
...


88
...


90
...


92
...


94
...


96
...


98
...


Aviles FX, Sternberg MJ
...

Proteins 1998;33(4):535-49
...
DNA bending: the prevalence
of kinkiness and the virtues of normality
...

Perez-Rueda E, Collado-Vides J
...
Nucleic
Acids Res 2000;28(8):1838-47
...
MIPS: a database
for genomes and protein sequences
...

Salgado H, Santos-Zavaleta A, Gama-Castro S, Millan-Zarate D, Blattner FR,
Collado-Vides J
...
0):
transcriptional regulation and operon
organization in Escherichia coli K-12
...

Wingender E, Chen X, Hehl R, Karas H,
Liebich I, Matys V, et al
...
Nucleic Acids Res
2000;28(1):316-9
...

Advances in structural genomics
...

Aravind L, Koonin EV
...
Nucleic Acids Res
1999;27(23):4658-70
...
The
frequency distribution of gene family sizes
in complete genomes
...

Luscombe NM, Thornton JM
...
Manuscript in preparation
...
Prediction of function in DNA
sequence analysis
...

Robison K, McGuire AM, Church GM
...
J
Mol Biol 1998;284(2):241-54
...
Prediction of transcriptional
regulatory sites in the complete genome
sequence of Escherichia coli K-12
...

Mironov AA, Koonin EV, Roytberg MA,
Gelfand MS
...

Nucleic Acids Res 1999;27(14):2981-9
...

Prediction of transcription regulatory sites
in Archaea by a comparative genomic
approach
...


100
...

Conservation of DNA regulatory motifs
and discovery of new motifs in microbial
genomes [In Process Citation]
...

101
...

Saturation mutagenesis of the UASNTR
(GATAA) responsible for nitrogen
catabolite
repression-sensitive
transcriptional activation of the allantoin
pathway genes in Saccharomyces
cerevisiae
...

102
...
Zinc fingers in
Caenorhabditis elegans: finding families
and probing pathways
...

103
...

Extracting regulatory sites from the
upstream region of yeast genes by
computational analysis of oligonucleotide
frequencies
...

104
...
Operons in
Escherichia coli: genomic analyses and
predictions
...

105
...
Metabolism and evolution of Haemophilus
influenzae deduced from a whole-genome
comparison with Escherichia coli
...

106
...
Cluster analysis and display of genome-wide expression patterns
...

107
...
Large-scale
temporal gene expression mapping of
central nervous system development
...

108
...
Broad patterns of
gene expression revealed by clustering
analysis of tumor and normal colon tissues
probed by oligonucleotide arrays
...

109
...

Interpreting patterns of gene expression
with self-organizing maps: methods and
application to hematopoietic differentiation
...

110
...
Analysis of gene expression
data using self-organizing maps
...

111
...
Systematic determination of genetic network architecture
...


112
...
Analysis of the
yeast transcriptome with structural and
functional categories: characterizing
highly expressed proteins
...

113
...
The current
excitement in bioinformatics, analysis of
whole-genome expression data: how does
it relate to protein structure and function
...

114
...
A Bayesian
System Integrating Expression Data with
Sequence Patterns for Localizing Proteins:
Comprehensive Application to the Yeast
Genome
...

115
...
Genome-wide analysis relating expression level
with protein subcellular localisation
...

116
...
Detecting
protein function and protein-protein
interactions from genome sequences
...

117
...
Protein function in the postgenomic era
...

118
...

Relating whole-genome expression data
with protein-protein interactions
...

119
...
Medicine
...
Science
2000;289(5485):1670-2
...
Ross DT, Scherf U, Eisen MB, Perou
CM, Rees C, Spellman P, et al
...
Nat Genet
2000;24(3):227-35
...
Perou CM, Jeffrey SS, van de Rijn M,
Rees CA, Eisen MB, Ross DT, et al
...
Proc Natl Acad Sci U S A
1999;96(16):9212-7
...
Livesey FJ, Furukawa T, Steffen MA,
Church GM, Cepko CL
...
Curr Biol
2000;10(6):301-10
...
Sali A, Blundell TL
...
J Mol Biol 1993;234(3):779-815
...
Jones DT, Taylor WR, Thornton JM
...

Nature 1992;358(6381):86-9
...
Kok K, Naylor SL, Buys CH
...
Adv Cancer Res 1997;71:27-92
...
Syngal S, Fox EA, Eng C, Kolodner RD,
Garber JE
...
J Med
Genet 2000;37(9):641-645
...
Uetz P, Giot L, Cagney G, Mansfield
TA, Judson RS, Knight JR, et al
...

Nature 2000;403(6770):623-7
...
Ross-Macdonald P, Sheehan A, Friddle
C, Roeder GS, Snyder M
...

Methods Enzymol 1999;303:512-32
...
Mewes HW, Heumann K, Kaps A, Mayer
K, Pfeiffer F, Stocker S, et al
...
Nucleic Acids Res
1999;27(1):44-8
...
Murray-Rust P
...
Curr Opin Biotechnol
1994;5(6):648-53
...
Friend SH
...
BMJ 1999;319(7220):1306-7
...
Tamayo P, Slonim D, Mesirov J, Zhu Q,
Kitareewan S, Dmitrovsky E, Lander ES,
Golub TR
...


134
...


136
...


138
...
Proc Natl
Acad Sci U S A 1999;96(6):2907-12
...

Distinctive gene expression patterns in
human mammary epithelial cells and
breast cancers
...

Hiltunen MO, Niemi M, Yla-Herttuala S
...

Curr Opin Lipidol 1999;10(6):515-9
...
High throughput analysis of
gene expression in the human brain
...

Debouck C, Metcalf B
...
Annu Rev
Pharmacol Toxicol 2000;40:193-207
...
Genomic medicine and the
future of health care
...

Ohlstein EH, Ruffolo RR, Jr
...

Drug discovery in the next millennium
...


Address of the authors:
Nicholas M
...
gerstein@yale