Bioinformatics Tools Stuart M. Brown, Ph.D. Dept of Cell Biology NYU School of Medicine
Overview This lecture will summarize a huge amount of bioinformatics material that is usually presented as a full 12 week course. Data management and analysis of sequences from the HGP A quick look at GenBank and ENTREZ. Gene finding and translation
Similarity searching and alignment (BLAST) Protein structure and function Data Management and Analysis The Human Genome Project has generated huge quantities of DNA sequence data. This data will lead to many medical advances. But a great deal of analysis and research
will be needed. Access to the Data Organize the genome data & provide access for scientists Use the Internet The data is public, so anyone can access it.
GenBank All Genome Project data is stored in a database called GenBank managed by the National Center for Biotechnology Information (NCBI) The NCBI is a branch of the National Library of Medicine, which is part of the NIH (National Institutes of Health). http://ncbi.nlm.nih.gov
GenBank Sections In addition to DNA sequences of genes, GenBank has a number of other sections including:
Protein sequences (translated from DNA)
Short RNA fragments (ESTs)
Cancer Genome Anatomy Project (CGAP): gene expression profiles of normal, pre-cancer, and cancer cells from a wide variety of tissue types
Single Nucleotide Polymorphisms (SNPs), which represent genetic variations in the human population
Online Mendelian Inheritance in Man (OMIM), a database of human genetic disorders
Finding Genes GenBank contains approximately 13 billion bases in 12 million sequence
records (as of August 2001). These billions of G, A, T, and C letters would be almost useless without descriptions of what genes they contain, the organisms they come from, etc. All of this information is contained in the "annotation" part of each sequence record.
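To make the idea of an "annotation" part concrete, here is a minimal sketch of reading the descriptive fields from a record laid out in the GenBank flat-file style. The record itself (accession TOY00001, its definition line, and its sequence) is invented for illustration and is not a real GenBank entry, and the parser assumes the usual layout in which field names occupy the first 12 columns. The function name parse_annotation is hypothetical.

```python
# Toy record in GenBank flat-file layout. The accession, description, and
# sequence are invented for illustration; this is not a real GenBank entry.
TOY_RECORD = """\
LOCUS       TOY00001                  24 bp    DNA     linear   PRI 01-AUG-2001
DEFINITION  Hypothetical example gene, partial cds.
ACCESSION   TOY00001
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
ORIGIN
        1 atggaccagc tggccaagac cacc
//
"""


def parse_annotation(record_text):
    """Return (annotation dict, sequence string) for one flat-file record."""
    annotation = {}
    sequence_parts = []
    in_sequence = False
    current_key = None
    for line in record_text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("ORIGIN"):      # sequence data starts on the next line
            in_sequence = True
            continue
        if in_sequence:
            # Sequence lines: a position number followed by blocks of bases.
            sequence_parts.extend(line.split()[1:])
            continue
        # Assumed layout: field name in columns 1-12, value from column 13.
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            current_key = key
            annotation[current_key] = value
        elif current_key:
            # Continuation line: append to the previous field.
            annotation[current_key] += " " + value
    return annotation, "".join(sequence_parts)


annotation, sequence = parse_annotation(TOY_RECORD)
print(annotation["ACCESSION"], annotation["ORGANISM"], len(sequence))
```

Without the annotation dictionary, the 24 letters of sequence would carry no information about which organism or gene they belong to.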
Entrez is a Tool for Finding Sequences NCBI has created a Web-based tool called Entrez for finding sequences in GenBank. Each sequence in GenBank has a unique accession number. Entrez can also search for keywords such as gene names, protein names, and the
names of organisms or biological functions. Entrez has links to Medline Entrez is much more than just a tool for finding sequences by keywords. It contains links to PubMed/Medline. Entrez also contains all known protein sequences and 3-D protein structures.
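Entrez searches can also be run from a program rather than a browser. The sketch below only builds the request URLs and does not contact NCBI. It assumes NCBI's E-utilities web interface (the esearch.fcgi and efetch.fcgi endpoints), which is not mentioned in the slides. The function names and the example search term are illustrative choices.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: NCBI E-utilities base address (not part of the original slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(database, term, max_results=20):
    """URL for a keyword search that returns the IDs of matching records."""
    query = urlencode({"db": database, "term": term, "retmax": max_results})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(database, record_id, return_type="fasta"):
    """URL that retrieves one record (by ID or accession number) as text."""
    query = urlencode({"db": database, "id": record_id,
                       "rettype": return_type, "retmode": "text"})
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Example: search the nucleotide database by keyword and organism name.
search = esearch_url("nucleotide", "hemoglobin AND Homo sapiens[Organism]", 5)
print(search)
```

A keyword search returns record identifiers, and a second request with one of those identifiers (or an accession number) retrieves the sequence record itself.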
Entrez is Internally Cross-linked DNA and protein sequences are linked to other similar sequences Medline citations are linked to other citations that contain similar keywords 3-D structures are linked to similar structures
These relationships might include genes in a multi-gene family, related journal articles, or other proteins in the same biochemical pathway This potential for horizontal movement through the linked databases makes Entrez a dynamic tool. You can start with only a vague set of
keywords or a sequence from the laboratory and rapidly access a set of relevant literature and related database sequences. Similarity Searching There are a variety of computer programs that are used for making comparisons between DNA
sequences. The most popular is known as BLAST (Basic Local Alignment Search Tool). BLAST is free at the NCBI website. BLAST Searches GenBank The NCBI BLAST web server lets you compare your query sequence to various
sections of GenBank:
nr = non-redundant (main sections)
month = new sequences from the past few weeks
ESTs
human, Drosophila, yeast, or E. coli genomes
proteins (by automatic translation)
This is a VERY fast and powerful computer.
BLAST is Complex Similarity searching relies on the concepts of alignment and distance between pairs of sequences. Distances can only be measured between aligned sequences (match vs. mismatch at each position). A similarity search is a process of testing the best alignment of a query sequence
with every sequence in a database.
Search with Protein not DNA
1) 4 DNA bases vs. 20 amino acids: less random similarity
2) Can have varying degrees of similarity between different AAs: # of mutations, chemical similarity, PAM matrix
3) Protein databanks are much smaller than DNA databanks.
BLAST has Automatic Translation
BLASTX makes automatic translation (in all 6 reading frames) of your DNA query sequence to compare with protein databanks. TBLASTN makes automatic translation of an
entire DNA database to compare with your protein query sequence Only make a DNA-DNA search if you are working with a sequence that does not code for protein.
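The six-reading-frame translation that BLASTX applies to a DNA query can be sketched in a few lines. This is a minimal illustration using the standard genetic code, not BLAST's own implementation. The function names are chosen here for clarity, and incomplete or ambiguous codons are shown as X by assumption.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate three frames on each strand: +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# A 60-base DNA sequence used as the example query.
query = "atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct"
for name, protein in six_frame_translation(query).items():
    print(name, protein)
```

Only one of the six frames usually gives a long stretch without stop codons, which is why a protein-level search can recover the correct reading frame automatically.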
Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
           ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
           || || ||||| ||  ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
Understand the Statistics! BLAST produces an E-value for every
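match This is related to, but not the same as, the P value in a statistical test: it is the number of matches with a score at least this good that would be expected by chance in a database of that size (for small values, E and P are nearly equal). A match is generally considered significant if the E-value < 0.05 (smaller numbers are more significant) Very low E-values (e-100) are homologs or identical genes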
Moderate E-values are related genes Long regions of moderate similarity are more important than short regions of high identity. BLAST is Approximate BLAST makes similarity searches very quickly because it takes shortcuts. looks for short, nearly identical words (11 bases)
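The word-matching shortcut can be illustrated with a toy seed finder: index every 11-base word of the subject sequence, then look up each word of the query. This sketch shows only the seeding idea, using exact word matches. It omits the extension and scoring steps that BLAST performs afterwards, and the function names are hypothetical.

```python
def index_words(sequence, word_size=11):
    """Map every word of length word_size to the positions where it starts."""
    index = {}
    for start in range(len(sequence) - word_size + 1):
        index.setdefault(sequence[start:start + word_size], []).append(start)
    return index


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    subject_index = index_words(subject, word_size)
    seeds = []
    for q_start in range(len(query) - word_size + 1):
        word = query[q_start:q_start + word_size]
        for s_start in subject_index.get(word, []):
            seeds.append((q_start, s_start))
    return seeds


# Two 60-base sequences that differ at four positions.
query = "atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct"
subject = "atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct"

seeds = find_seeds(query, subject)
print(len(seeds), "seed words; first seed at", seeds[0] if seeds else None)
```

Only stretches with at least 11 consecutive identical bases produce a seed, so the first 20 positions of this pair, which contain three mismatches, yield none. This is the kind of similarity the shortcut can miss.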
It also makes errors: misses some important similarities, makes many incorrect matches, easily fooled by repeats or skewed composition. Bad Genome Annotation Gene finding is at best only 90% accurate. New sequences are automatically annotated
with BLAST scores. Bad annotations propagate. It's going to take us 10-20 years or more to sort this mess out! Protein Function The ultimate goal of the HGP is to identify all of the genes and determine their functions
Genes function by being translated into proteins: structural, enzymes, regulatory, signalling. Translation Once we have found the DNA sequence of a gene,
we can decode the amino acid sequence of the corresponding protein. The Genetic Code is actually quite simple. Chemical Properties Some chemical properties of a protein can be calculated from its amino acid sequence: molecular weight charge/pH
hydrophobicity Patterns in Proteins Conserved Domains Proteins are built out of functional units known as domains (or motifs). These domains have conserved sequences, often much more similar than their respective proteins.
Exon splicing theory (W. Gilbert) Exons correspond to folding domains which in turn serve as functional units. Unrelated proteins may share a single similar exon (e.g., ATPase or DNA binding function). Simple Structures Some motifs form structures that can be recognized as simple sequence patterns:
transmembrane domains, coiled coils, helix-turn-helix, signal peptides. Functional Motifs Other functional portions of proteins can be recognized by their sequence, even if their 3D structure is not known. There are many databases of protein
motifs/domains: ProSite, Pfam, ProDom, etc. Tools for Finding Motifs Define a motif from a set of known proteins that share a similar sequence and function. A pattern is a list of amino acids that can occur at each position in the motif. A profile is a matrix that assigns a value to every amino acid at every position in the
motif. An HMM (hidden Markov model) is a more complex profile based on pairs of amino acids. Protein 3-D Structure Structure = Function Proteins function by 3-D interactions with other molecules (i.e. physical chemistry).
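Returning to the motif tools described above, the difference between a pattern and a profile can be sketched as follows. The pattern part converts a small subset of PROSITE-style syntax into a regular expression, using the N-glycosylation site pattern N-{P}-[ST]-{P} as the example. The profile part builds a log-odds matrix from aligned motif instances. The aligned instances, the test sequence, the uniform background frequency, and the pseudocount are all illustrative assumptions, and the function names are hypothetical.

```python
import math
import re

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2) -> x{2}
        element = element.replace("x", ".")                     # x = any residue
        parts.append(element)
    return "".join(parts)


def build_profile(aligned_motifs, pseudocount=1.0):
    """One log-odds score per amino acid per position (uniform background)."""
    profile = []
    for pos in range(len(aligned_motifs[0])):
        column = [motif[pos] for motif in aligned_motifs]
        total = len(column) + pseudocount * len(AMINO)
        profile.append({
            aa: math.log(((column.count(aa) + pseudocount) / total)
                         / (1.0 / len(AMINO)))
            for aa in AMINO
        })
    return profile


def score_segment(profile, segment):
    """Sum the per-position scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Pattern: a residue either matches at a position or it does not.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
toy_sequence = "MKNGSALLNPTQ"          # invented sequence for illustration
pattern_hits = [m.group() for m in re.finditer(regex, toy_sequence)]

# Profile: every residue gets a graded score at every position.
toy_motifs = ["NGSA", "NKTL", "NVSG"]  # invented aligned motif instances
profile = build_profile(toy_motifs)
print(regex, pattern_hits, round(score_segment(profile, "NGSA"), 2))
```

A pattern gives a yes/no answer, while a profile ranks candidate segments, so a profile can still detect a motif instance that carries an unusual residue at one position.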
So for a protein, 3-D structure is function. But we can't accurately determine 3-D structure from gene sequence. Structure Prediction Predicting a protein's 3-D structure from its amino acid sequence is incredibly complex. proteins are polypeptides (long chains of amino acids)
can fold and rotate around bonds within each amino acid as well as the bonds between them it is not possible to evaluate every possible folding pattern for an amino acid sequence Secondary Structure The local structure of the amino acids in a protein can also be predicted to some
extent. Each amino acid has a tendency to form either an alpha helix or a beta sheet.
         ....,....1....,....2....,....3....,....4....,....5....,....6
AA      |MMSGAPSATQPATAETQHIADQVRSQLEEKYNKKFPVFKAVSFKSQVVAGTNYFIKVHVG|
PHD sec | HHHHHHHHHHHHHHHH EEEEEEEEEEEEE EEEEEEEE |
Rel sec |999997899667599999999989997655877843368889999999233399999658|
detail: prH prE prL subset: SUB sec sec
        |103021343252044604644672424555547615444425212186671016926120|
        |.......e..e..eeb.ebbeeb.e.beeeeeee.eebeb.e....bbbb...bb.b...|
Threading Rather than computing a 3-D structure from scratch, it may be possible to find a similar structure. Must have ~25% aa sequence identity. Uses a process called threading to create a
new structure based on a known structure. This still requires HUGE amounts of computer power. Protein Data Bank There is a database of all known protein structures called the PDB. These have been determined by X-ray crystallography and/or NMR.
Anyone can download and view these structures with a PDB viewer program. RasMol RasMol is the simplest PDB viewer. http://www.umass.edu/microbio/rasmol/ It can work together with a web browser to let you view the structure of any sequence found with Entrez that has a known 3-D structure.
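A viewer such as RasMol starts by reading atom coordinates from the structure file. The sketch below shows that first step for the fixed-column ATOM records of the PDB file format. The three atoms and their coordinates are toy values generated in the code, not taken from a real structure, and the helper names are hypothetical.

```python
import math


def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build one fixed-column ATOM line (coordinates land in columns 31-54)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(pdb_text):
    """Read atom name, residue, chain, and x/y/z from each ATOM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atoms.append({
            "name": line[12:16].strip(),        # columns 13-16
            "res_name": line[17:20].strip(),    # columns 18-20
            "chain": line[21],                  # column 22
            "res_seq": int(line[22:26]),        # columns 23-26
            "xyz": (float(line[30:38]),         # columns 31-38
                    float(line[38:46]),         # columns 39-46
                    float(line[46:54])),        # columns 47-54
        })
    return atoms


# Toy structure: invented coordinates, for illustration only.
TOY_PDB = "\n".join([
    format_atom(1, " N  ", "MET", "A", 1, -1.200, 0.800, 0.000),
    format_atom(2, " CA ", "MET", "A", 1, 0.000, 0.000, 0.000),
    format_atom(3, " CA ", "SER", "A", 2, 3.800, 0.000, 0.000),
    "END",
])

alpha_carbons = [atom for atom in parse_atoms(TOY_PDB) if atom["name"] == "CA"]
ca_distance = math.dist(alpha_carbons[0]["xyz"], alpha_carbons[1]["xyz"])
print(len(alpha_carbons), round(ca_distance, 2))
```

Keeping only the alpha-carbon atoms gives the backbone trace that many viewers draw as a simplified chain.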
Gene Finding & Translation How can we find genes on chromosomes? Genome project data is just huge chunks of DNA. Does automatic annotation work? Raw Genome Data:
Finding Genes is Not Easy Perhaps 1% of human DNA encodes functional genes. Genes are interspersed among long stretches of non-coding DNA. Repeats, pseudo-genes, and introns confound matters
Pattern Finding Tools It is possible to use DNA sequence patterns to predict genes: promoters, translational start and stop codons (ORFs), intron splice sites, codon usage. Similarity to Known Genes
It is also possible to scan new DNA sequence for known genes. Can look for annotated genes/proteins, or just for RNAs (ESTs).
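The simplest of the pattern-based signals listed earlier, an open reading frame bounded by start and stop codons, can be sketched as below. This is a deliberately minimal illustration. It scans only the forward strand, ignores introns and the other signals (promoters, splice sites, codon usage), and the minimum-length threshold and function name are assumptions chosen for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start and end are 0-based slice positions, so dna[start:end] includes the
    stop codon. min_codons counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None                   # look for the next ORF
    return orfs


# Toy sequence: a short ORF (ATG GCT AAA TAG) flanked by two bases on each side.
toy_dna = "CCATGGCTAAATAGCC"
orfs = find_orfs(toy_dna)
print(orfs, [toy_dna[start:end] for start, end, _ in orfs])
```

In real genomic DNA a much larger minimum length is needed, because short ORFs occur by chance in non-coding sequence, and introns split most human genes so that an ORF scan alone cannot find them.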