Title: Basic File Formats in Bioinformatics
Description: Basic File Formats in Bioinformatics - Genbank, Fasta
File Formats in Bioinformatics
Introduction
• Storing biological data digitally requires specialised file formats to
represent the biological information of molecular sequences and
structures
...
For example, a web page is written in
an
...
HTML files contain special tags that tell the
browser what each block of text is, and how to display it on the page
...
Plain text formats
• Early databases stored sequence data in a file
...
• More common file types include csv and tsv
...
• Markdown is a markup language; like HTML, it includes headers and paragraphs
...
md in a source file from GitHub or Public Data repositories
...
The GenBank file format is quite flexible and allows
annotations, comments, and references to be included within
the file
...
Genbank files often have the file extension '
...
genbank'
...
Sample EMBL Format
ID AB000263 standard; RNA; PRI; 368 BP
...
XX
SQ Sequence 368 BP;
acaattggccc………………………………………………
...
• Text format file with extension
...
• Originated from the sequence alignment software called
FASTP
...
The identifier – comments, annotations
2
...
Identifier
It is preceded by a ">"
...
Major database sequence identifiers are :
GenBank/EMBL/DDBJ
gi|gi_number|*|accession
...
NCBI refseq
ref|accession|locus
SWISS-PROT
sp|accession|locus
PDB
pdb|entry|chain
2) Sequence
• Contains the raw sequence
• Standard nucleic acid and amino acid IUB/IUPAC codes are used
...
fna - nucleic acid
...
faa - amino acids
...
Sample Fasta format protein sequence
>gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor
MVLIRVIANLLILQLSNAQKSSELVIGGDECNITEHRFLVEIFNSSGLFCGGTLIDQEWVLSAAHCDMRN
MRIYLGVHNEGVQHADQQRRFAREKFFCLSSRNYTKWDKDIMLIRLNRPVNNSEHIAPLSLPSNPPSVGS
VCRIMGWGTITSPNATFPDVPHCANINLFNYTVCRGAHAGLPATSRTLCAGVLQGGIDTCGGDSGGPLIC
NGTFQGIVSWGGHPCAQPGEPALYTKVFDYLPWIQSIIAGNTTATCPP
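The two parts of a FASTA record described above (the identifier line starting with ">" and the sequence lines that follow) can be read with a few lines of Python. This is only a minimal sketch: the function name parse_fasta is an illustrative assumption, not part of any library, and the example sequence is cut to the first line of the sample above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Header taken from the sample above; sequence shortened to its first line.
example = """>gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor
MVLIRVIANLLILQLSNAQKSSELVIGGDECNITEHRFLVEIFNSSGLFCGGTLIDQEWVLSAAHCDMRN
"""

for header, sequence in parse_fasta(example):
    identifier = header.split()[0]      # text before the first space
    fields = identifier.split("|")      # pipe-delimited database fields
    print(fields, len(sequence))
```

Splitting the identifier on "|" recovers the database fields listed earlier, such as the gi number and the SWISS-PROT accession.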
FastQ Format – Delivered from Sequencers
1
...
The second line is the raw sequence
...
The third line starts with ‘+’ and can have the
same sequence identifier appended
4
...
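A FASTQ record spans four lines, so a reader can walk the file four lines at a time. The sketch below is hedged: the notes above only show the second and third lines, so the "@" prefix on the first line and the quality string on the fourth line follow the common FASTQ convention rather than this text, and the Phred offset of 33 is an assumption that does not hold for every sequencer output. The name parse_fastq is hypothetical.

```python
def parse_fastq(text, phred_offset=33):
    """Parse FASTQ text into a list of dicts (id, sequence, quality scores).

    Assumes exactly four lines per record and a Phred+33 quality encoding.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError("line 1 of a record should start with '@'")
        if not plus.startswith("+"):
            raise ValueError("line 3 of a record should start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character encodes one score as ASCII code minus offset.
        scores = [ord(ch) - phred_offset for ch in quality]
        records.append({"id": header[1:], "sequence": sequence, "quality": scores})
    return records


# Hypothetical record, for illustration only.
for record in parse_fastq("@read1\nACGT\n+read1\nII!!\n"):
    print(record["id"], record["sequence"], record["quality"])
```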
BAM is a binary file format,
while the SAM file format contains the same information but is text-based
...
Both the BAM and SAM formats contain not only the sequence data
for next-generation sequencing reads, but can also store
alignment data of those reads to a
reference sequence
...
Alignment – read name, sequence, quality, alignment information
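Because SAM is text-based, one alignment line can be split on tabs to recover the read name, sequence, quality and alignment information. The sketch below assumes the eleven mandatory columns of the SAM specification; the function name parse_sam_line and the example read are hypothetical.

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Parse one SAM line; return None for header lines (starting with '@')."""
    if line.startswith("@"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Anything after column 11 is an optional TAG:TYPE:VALUE field.
    record["tags"] = columns[len(SAM_FIELDS):]
    # Two commonly used bits of the FLAG column.
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


# Hypothetical alignment line, for illustration only.
example = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
print(parse_sam_line(example))
```

BAM holds the same fields in binary form, so it needs a dedicated reader rather than plain string splitting.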
...
• It consists of one line per feature, each containing 3-12
columns of data
• User defined sequence features as well as graphical
representations of features
...
3 required fields :
• Name of the chromosome or scaffold
• Start position of the feature in standard chromosomal coordinates
• End position of the feature in standard chromosomal coordinates
Chr1 21345679 21356739
9 optional fields :
• Label to be displayed under the feature
• A score between 0 and 1000
• Strand : defined as + (forward) or – (reverse)
• Thick start : starting position at which the feature is drawn thickly (for example, the start codon in gene displays)
• Thick end : ending position at which the feature is drawn thickly
• itemRgb : determines the color of the data contained in the BED line
• blockCount : the number of sub-elements (e
...
exons) within the feature
• blockSizes : the size of these sub-elements
• blockStarts : the start coordinate of each sub-element
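The three required and nine optional fields map naturally onto a small parser that accepts anything from 3 to 12 columns. This is a minimal sketch: the function name parse_bed_line is hypothetical, the column names follow the UCSC BED naming, and whitespace splitting is an assumption that fails if a label contains spaces.

```python
# Column names in BED order: 3 required fields, then 9 optional ones.
BED_FIELDS = ("chrom", "chromStart", "chromEnd", "name", "score", "strand",
              "thickStart", "thickEnd", "itemRgb", "blockCount",
              "blockSizes", "blockStarts")
INT_FIELDS = {"chromStart", "chromEnd", "score", "thickStart", "thickEnd",
              "blockCount"}
LIST_FIELDS = {"blockSizes", "blockStarts"}


def parse_bed_line(line):
    """Parse one BED feature line into a dict keyed by field name."""
    columns = line.split()
    if not 3 <= len(columns) <= len(BED_FIELDS):
        raise ValueError("a BED line has between 3 and 12 columns")
    record = {}
    for name, value in zip(BED_FIELDS, columns):
        if name in INT_FIELDS:
            record[name] = int(value)
        elif name in LIST_FIELDS:
            # Comma-separated lists; a trailing comma is tolerated.
            record[name] = [int(item) for item in value.split(",") if item]
        else:
            record[name] = value
    return record


# The three-column example from the notes above.
print(parse_bed_line("Chr1 21345679 21356739"))
```

Only the fields present on the line appear in the result, so a three-column line yields just the chromosome, start and end.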
Protein Structure File Format
• PDB - the PDB file format stores sequence information
but, more importantly, 3-dimensional structure information
...
PDB files are plain text files, so they can be viewed with a
text editor, and often have the file extension '
...
• HEADER: contains name and source of the protein,
resolution, description of experimental conditions and
details, names of the authors, crystallographic parameters
including R-factor, sequence information, secondary
structure information, literature citations etc
...
• ATOM: this part contains the atomic coordinates and the B-factor values for atoms that are part of the protein chain
• HETATM: coordinates of cofactor molecules, substrates,
other groups that are not covalently bound to the protein
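ATOM and HETATM records use fixed column positions, so the coordinates and B-factor can be read by slicing each line. The sketch below assumes the standard PDB column layout; the function name parse_atom_line and the example atom values are hypothetical, and the example line is built with a format string so the columns are guaranteed to line up.

```python
def parse_atom_line(line):
    """Parse an ATOM or HETATM record; return None for other record types."""
    record_type = line[0:6].strip()
    if record_type not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record_type,
        "serial": int(line[6:11]),             # columns 7-11
        "atom_name": line[12:16].strip(),      # columns 13-16
        "residue_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],                  # column 22
        "residue_number": int(line[22:26]),    # columns 23-26
        "x": float(line[30:38]),               # columns 31-38
        "y": float(line[38:46]),               # columns 39-46
        "z": float(line[46:54]),               # columns 47-54
        "occupancy": float(line[54:60]),       # columns 55-60
        "b_factor": float(line[60:66]),        # columns 61-66
    }


# Hypothetical atom, formatted into the fixed-width layout.
example = "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
    "ATOM", 1, " N", "MET", "A", 1, 27.340, 24.430, 2.614, 1.00, 9.67)

print(parse_atom_line(example))
```

Splitting on whitespace is avoided on purpose: adjacent fixed-width fields can touch without a separating space, which would merge two values into one token.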
...
The MDL mol
file contains information regarding 2D (and possibly 3D) molecular
structure, such as atom type and atom connectivity