Thinking about sequences

Why do we care about sequences?
  Reliable source of testable hypotheses for biological function
    Doolittle RF, Hunkapiller MW, Hood LE et al. (1983). Simian sarcoma virus onc gene, v-sis, is derived from the gene (or genes) encoding a platelet-derived growth factor. Science 1983 Jul 15;221(4607):275-7.
    Peltonen L, McKusick VA (2001). Genomics and medicine. Dissecting human disease in the postgenomic era. Science 2001 Feb 16;291(5507):1224-9.
  Intrinsic interest in how life describes itself
    Koonin EV (2000). How many genes can make a cell: the minimal-gene-set concept. Annu. Rev. Genomics Hum. Genet. 2000. 1:99-116.
    Hutchison CA, Peterson SN, Gill SR, Cline RT, White O et al. (1999). Global transposon mutagenesis and a minimal Mycoplasma genome. Science 1999 Dec 10;286(5447):2165-9.
    Miki R, Kadota K, Bono H, Mizuno Y, Tomaru Y et al. (2001). Delineating developmental and metabolic pathways in vivo by expression profiling using the RIKEN set of 18,816 full-length enriched mouse cDNA arrays. Proc Natl Acad Sci U S A 2001 Feb 27;98(5):2199-2204.

Things to review today:
  the nature of the substrates we want to analyse
  history and current toolkit
  what the tools do
  general findings emerging from sequence analysis
  pitfalls of doing or reading sequence analysis
  what is likely to be completed, or just start up, relatively soon

One basic caveat
  sequence analysis suggests experiments; it doesn't replace them

The substrates to be analysed
  conceptual protein sequences (ORFs or CDSes):
    Numbers for genomes:
      150-300   minimal viable cell?
      Bacteria:
        468     Mycoplasma genitalium
        1604    Mycobacterium leprae
        4289    Escherichia coli
      Archaea:
        1750    Methanococcus jannaschii
        2493    Archaeoglobus fulgidus
        2977    Sulfolobus solfataricus
      Eukarya:
        6294    Saccharomyces cerevisiae
        >14600  Drosophila melanogaster
        19308   Caenorhabditis elegans
        25598   Arabidopsis thaliana
        >31000  Homo sapiens
  functional RNAs (not yet systematically searchable)
    Eddy SR (1999). Noncoding RNA genes.
    Curr Opin Genet Dev 1999 Dec;9(6):695-9.
    let-7 is conserved in metazoa
      let-7: Pasquinelli AE, Reinhart BJ, Slack F, Martindale MQ, Kuroda MI et al. (2000). Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA. Nature 2000 Nov 2;408(6808):86-9.
    Xist, Tsix, roX1, and roX2 are crucial to dosage regulation
      Noncoding RNA genes in dosage compensation and imprinting. Cell 2000 Sep 29;103(1):9-12.
  cis-regulatory sequences:
    Hughes JD, Estep PW, Tavazoie S, Church GM (2000). Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 2000 Mar 10;296(5):1205-14.
    McGuire AM, Hughes JD, Church GM (2000). Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Res 2000 Jun;10(6):744-57.
    Bussemaker HJ, Li H, Siggia ED (2001). Regulatory element detection using correlation with expression. Nat Genet 2001 Feb;27(2):167-71.

Development of tools
  pre-history (1953-1989)
    Creighton, T.E. (1993). Proteins: structures and molecular properties. 2nd ed. W.H. Freeman and Company: New York.
    Lipman DJ, Pearson WR (1985). Rapid and sensitive protein similarity searches. Science 1985 Mar 22;227(4693):1435-41.
    Gribskov M, McLachlan AD, Eisenberg D (1987). Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A 1987 Jul;84(13):4355-8.
  the 1990s:
    BLAST
      Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). Basic local alignment search tool. J Mol Biol 1990 Oct 5;215(3):403-10.
      Altschul SF, Boguski MS, Gish W, Wootton JC (1994). Issues in searching molecular sequence databases. Nat Genet 1994 Feb;6(2):119-29.
    matrix scanning for the masses
      Tatusov RL, Altschul SF, Koonin EV (1994). Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc Natl Acad Sci U S A 1994 Dec 6;91(25):12091-5.
      Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997 Sep 1;25(17):3389-402.
    genomic sequencing
      first whole genome 1995
      since then ~30 microbial genomes completed
        Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT et al. (2001). The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001 Jan 1;29(1):22-8.
      multicellular organismal genomes
        Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD et al. (2000). The genome sequence of Drosophila melanogaster. Science 2000 Mar 24;287(5461):2185-95.
        The Arabidopsis Genome Initiative (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000 Dec 14;408(6814):796-815.
        Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC et al. (2001). Initial sequencing and analysis of the human genome. Nature 2001 Feb 15;409(6822):860-921.
        Venter JC, Adams MD, Myers EW, Li PW, Mural RJ (2001). The sequence of the human genome. Science 2001 Feb 16;291(5507):1304-51.
  now:
    extracting genes from raw DNA: nontrivial
      Gopal S, Schroeder M, Pieper U, Sczyrba A, Aytekin-Kurban G et al. (2001). Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome. Nat Genet 2001 Mar;27(3):337-40.
      Shoemaker DD, Schadt EE, Armour CD, He YD, Garrett-Engele P, McDonagh PD et al. (2001). Experimental annotation of the human genome using microarray technology. Nature 2001 Feb 15;409(6822):922-7.
    primary pairwise comparisons: routine now but still key
      Thompson JD, Higgins DG, Gibson TJ (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994 Nov 11;22(22):4673-80.
      Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997).
      The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 1997 Dec 15;25(24):4876-82.
    multiple comparisons: motifs and superfamilies
      InterPro
        Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E et al. (2000). InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 2000 Dec;16(12):1145-50.
      CDD -- some of the InterPro databases, plus NCBI staff
    tools for defining one's own motifs:
      ClustalW
      HMMs
        Eddy SR (1996). Hidden Markov models. Curr Opin Struct Biol 1996 Jun;6(3):361-5.
        Clarke ND, Berg JM (1998). Zinc fingers in Caenorhabditis elegans: finding families and probing pathways. Science 1998 Dec 11;282(5396):2018-22.
    automated annotation (InterPro)
    defining RNA genes
      Lowe TM, Eddy SR (1999). A computational screen for methylation guide snoRNAs in yeast. Science 1999 Feb 19;283(5405):1168-71.
    cis-regulatory pathways
      Clarke ND, Berg JM (1998). Zinc fingers in Caenorhabditis elegans: finding families and probing pathways. Science 1998 Dec 11;282(5396):2018-22.
    Gibbs sampling
      Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993 Oct 8;262(5131):208-14.
      Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM (1999). Systematic determination of genetic network architecture. Nat Genet 1999 Jul;22(3):281-5.
    profiles to identify enzyme substrates in whole genome
      Yaffe MB, Leparc GG, Lai J, Obata T, Volinia S, Cantley LC (2001). A motif-based profile scanning approach for genome-wide prediction of signaling pathways. Nat Biotechnol 2001 Apr;19(4):348-53.
    analysis of variation in residues
      Lichtarge O, Bourne HR, Cohen FE (1996). An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996 Mar 29;257(2):342-58.
      Sowa ME, He W, Slep KC, Kercher MA, Lichtarge O, Wensel TG (2001).
      Prediction and confirmation of a site critical for effector regulation of RGS domain activity. Nat Struct Biol 2001 Mar;8(3):234-7.
      Landgraf R, Xenarios I, Eisenberg D (2001). Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol 2001 Apr 13;307(5):1487-502.
      Lockless SW, Ranganathan R (1999). Evolutionarily conserved pathways of energetic connectivity in protein families. Science 1999 Oct 8;286(5438):295-9.
    analysis of variation as opposed to conservation: SNPs and mutations
      Sunyaev S, Ramensky V, Koch I, Lathe W 3rd, Kondrashov AS, Bork P (2001). Prediction of deleterious human alleles. Hum Mol Genet 2001 Mar 15;10(6):591-7.
      Ng PC, Henikoff S (2001). Predicting deleterious amino acid substitutions. Genome Res 2001 May;11(5):863-74.
      Schaner P, Richards N, Wadhwa A, Aksentijevich I, Kastner D, Tucker P, Gumucio D (2001). Episodic evolution of pyrin in primates: human mutations recapitulate ancestral amino acid states. Nat Genet 2001 Mar;27(3):318-21.
    one reason why SNPs in humans matter: modifiers of disease genes
      Dipple KM, McCabe ER (2000). Phenotypes of patients with "simple" mendelian disorders are complex traits: thresholds, modifiers, and systems dynamics. Am J Hum Genet 2000 Jun;66(6):1729-35.
      Dipple KM, McCabe ER (2000). Modifier genes convert "simple" Mendelian disorders to complex traits. Mol Genet Metab 2000 Sep-Oct;71(1-2):43-50.
    experimental testing of predicted ancestral functions
      Jermann TM, Opitz JG, Stackhouse J, Benner SA (1995). Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 1995 Mar 2;374(6517):57-9.
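
The pairwise comparisons listed above all rest on dynamic programming: fill a 2-D table of best scores for every pair of prefixes, then trace a single best path back from the far corner. What follows is a minimal sketch of that idea only, not the code of BLAST, ClustalW, or any HMM package. It is a plain Needleman-Wunsch global alignment. The function name global_alignment and the match, mismatch, and gap values are assumptions chosen for illustration; real tools use amino acid similarity matrices and more elaborate gap penalties.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment (illustrative scoring, not a real matrix).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the end of both sequences to recover one best path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(global_alignment("ACGT", "AGT"))
```

With these toy scores, aligning ACGT against AGT places a single gap opposite the C. The table costs time and memory proportional to the product of the two lengths, which is why database search tools fall back on heuristics such as word seeding.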

What the most commonly used tools (BLAST, ClustalW, HMM) do
  all use dynamic programming -- in brief, find the single best path through the 2-D space of all possible alignments by starting with good values near the end and working back
  all use similarity scores for amino acid pairs
  BLAST:
    build a table of "words" that have >=SCORE_1
    scan database for presence of words
    extend matches from words until they fall below SCORE_2
    generate E-values with Karlin-Altschul statistics, based upon random matches having an extreme value distribution
  ClustalW:
    generate tree that has best least-squares fit to overall distances (~= % identity)
    then iteratively align pairwise seqs., closest first
    use different matrices
  HMM:
    treat all real sequences as having been generated from a hidden model of the ideal sequence in which successive residues are independent of earlier ones (Markov)
    not efficient for protein database searches, but rigorous and good for fuzzy sequences (short DNA, RNA genes)
  For a rigorous mathematical description of not merely what the tools do from a naive biologist's standpoint, but how they work, the most useful overview is probably:
    Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press.

Some general findings that have emerged
  All genomes are a mix of conserved and unique genes unless they are small indeed; most genomes, on examination, have ~40% of their genes with no obvious similarity to other genes. [For individual genomes, see the relevant sequence papers.]
    Fischer D, Eisenberg D (1999). Finding families for genomic ORFans. Bioinformatics 1999 Sep;15(9):759-62.
  The minimal gene set for cellular life itself is probably quite small, between 150 and 300 genes (Koonin, Hutchison et al.)
  The strongly conserved core set of genes for free-living microbes consists of 2885 genes (in last compilation by NCBI)
  Gene sets for "simple" creatures are surprisingly big; the gene set for humans is difficult to decipher but is definitely not >10x that of worms
  Orthologs and paralogs are both abundant; deciding which is which can be nontrivial but is important for functional analysis (e.g. in cell cycle)
    Murray, A.M. and Marks, D. (2001). Can sequencing shed light on cell cycling? Nature 409, 844-846.
  Organismal complexity can reside not just in gene number but in the complexity and variation of individual gene products: expanded CDSes and alternate splicing. The human genome is a case in point.

Some specific findings about individual human genes required for health
  Individual gene analysis can be a powerful tool for dissecting important genes in humans by identifying subtle motifs, which in turn can uncover potential functions of the protein product
    BRCA1 and BRCA2
    Werner's syndrome
    Luhn K, Wild MK, Eckhardt M, Gerardy-Schahn R, Vestweber D. (2001). The gene defective in leukocyte adhesion deficiency II encodes a putative GDP-fucose transporter. Nat Genet 2001 May;28(1):69-72.
  It can also be useful in cloning genes by identifying strong candidates
    familial non-polyposis colon cancer
  At other times it can be completely stuck
    Major case in point -- many human tumor suppressors: Fanconi anemia, menin, ST7
    Youssoufian, H. (2001). Fanconi anemia and breast cancer: what's the connection? Nat. Genet. 27, 352-353.
    Zenklusen JC, Conti CJ, Green ED (2001). Mutational and functional analyses reveal that ST7 is a highly conserved tumor-suppressor gene on human chromosome 7q31. Nat Genet 2001 Apr;27(4):392-8.

What you can do with all this
  Generally, identify candidate genes -- examples:
    Vaccine targets
      Pizza M, Scarlato V, Masignani V, Giuliani MM, Arico B et al. (2000).
      Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing. Science 2000 Mar 10;287(5459):1816-20.
    X-ray crystallography targets
      Mallick P, Goodwill KE, Fitz-Gibbon S, Miller JH, Eisenberg D (2000). Selecting protein targets for structural genomics of Pyrobaculum aerophilum: validating automated fold assignment methods by using binary hypothesis testing. Proc Natl Acad Sci U S A 2000 Mar 14;97(6):2450-5.
  Given some gene that one has cloned, very rapidly define a possible biochemical function that is specifically testable (out of thousands)
    Specific cases: many human disease genes in the last decade
  Cluster genes for correlated function:
    Rosetta stone and phylogenetic pattern allow one to take proteins with no obvious homology and link them to proteins whose function is well known, or to proteins with a defined location in the cell
      Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D (1999). A combined algorithm for genome-wide prediction of protein function. Nature 1999 Nov 4;402(6757):83-6.
      Marcotte EM, Xenarios I, van Der Bliek AM, Eisenberg D (2000). Localizing proteins in the cell from their phylogenetic profiles. Proc Natl Acad Sci U S A 2000 Oct 24;97(22):12115-20.
  As mentioned, map function onto specific residues or analyse natural variation

Pitfalls
  1. Blind faith in something you care about
  2. Statistics, or, rather, neglect of statistics
  3. Not filtering your data set
  4. Not dissecting your sequence and checking each subregion
  5. Expecting too much from a public server or from manual examination
  6. Not realizing that what is trendoid now is not what you should be planning to do for >5 years

[Note: the following advice has no warranty.]

What is going to be finished soon, maybe
  Motifs and homologies in proteins

What has not yet really been explored thoroughly are:
  functional nonprotein RNA
    let-7, hints of others
    work by Gold suggests that many proteins can interact with many RNAs
  variation within proteins
    clearly relevant to medicine (modifiers in humans)
    SNPs probably = most non-disease human variation
  curator-level annotation
    Gene ontology, first proposed by Ashburner and others.
      Ashburner M, Ball CA, Blake JA, Botstein D, Butler H et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000 May;25(1):25-9.
  automatic literature scanning
    Jenssen TK, Laegreid A, Komorowski J, Hovig E (2001). A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 2001 May;28(1):21-8.
    Marcotte EM, Xenarios I, Eisenberg D (2001). Mining literature for protein-protein interactions. Bioinformatics 2001 Apr;17(4):359-363.

How you might prepare to do this, maybe
  Have serious understanding of both the biology and the math [note: this means "be more educated than me", but it's also "be more educated than most current practitioners"]
  Don't get locked into only computing; be able to test hypotheses
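
As one small, testable illustration of the phylogenetic-pattern idea mentioned under "What you can do with all this", here is a minimal sketch that links proteins whose presence/absence patterns across genomes match. It is a toy under stated assumptions, not the published Marcotte et al. method: the protein names, the five-genome profiles, and the helper names hamming and link_by_profile are all hypothetical, and a real analysis would also discard uninformative profiles (present everywhere or nowhere) and assess significance.

```python
from itertools import combinations


def hamming(p, q):
    """Number of genomes in which two presence/absence profiles disagree."""
    return sum(x != y for x, y in zip(p, q))


def link_by_profile(profiles, max_diff=0):
    """Pair up proteins whose profiles differ in at most max_diff genomes.

    profiles maps a protein name to a tuple of 0/1 flags, one per genome.
    """
    pairs = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        if hamming(prof_a, prof_b) <= max_diff:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical toy data: 1 = a homolog is present in that genome, 0 = absent.
toy_profiles = {
    "protA": (1, 0, 1, 1, 0),
    "protB": (1, 0, 1, 1, 0),
    "protC": (0, 1, 0, 0, 1),
    "protD": (1, 0, 1, 0, 0),
}

if __name__ == "__main__":
    print(link_by_profile(toy_profiles))              # exact matches only
    print(link_by_profile(toy_profiles, max_diff=1))  # allow one disagreement
```

Any link this produces is a hypothesis about shared function, to be tested at the bench rather than taken on faith.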