Thoughts about analysing cis-regulatory sequences
by Erich Schwarz, 10/22/02

Given the prospect of much new non-coding DNA sequence data arriving soon, what do we *do* with it?

Cis-regulatory sequences can be analysed by applying a mixture of whatever computer programs are freely available to deal with them, and then comparing the results of these programs to see which are most helpful. Such programs have three different goals: phylogenetic footprinting; detection of statistically significant (overrepresented) motifs; and detection of two or more motifs in a cluster. I have attached a list of all the software I currently know about that is published, could reasonably be applied to C. elegans, and is publicly available (ideally open-source; less ideally, available as a free academic binary).

Phylogenetic footprinting and motif location are different from each other and may have distinct flaws. There have been few studies of cis-regulation (other than Mathieu Blanchette's work, or Martha Kirouac's unpublished work) in which both methods have been applied in parallel, and it is thus not clear whether these methods will in fact complement one another. For instance, echinoderm cis-regulatory regions have binding sites that are fairly degenerate (i.e., with 2 base mutations in a 7 bp core site), yet have identical regulatory function even in trans-species assays (C.T. Brown, pers. comm.).

One way to deal with this is to search for "clusters" of two or more motifs in a limited region of the genome that have the same distribution as motifs in another region that is known to have specific regulatory properties. It seems to me that these clusters may depend much more on regional statistics than on exact alignment between conserved sequences and statistically overrepresented motifs.
For instance, one can imagine that a promoter might depend both on a few specific elements (that might show up in seqcomp as conserved 20-bp blocks) and on a few other elements that act on the general vicinity of the promoter (by binding many motifs that are overrepresented locally yet not strongly conserved in their primary sequence or exact alignment in the genomic region). The former group might correspond to dedicated transcription factors, and the latter to general activators of "chromatin". However, if one defines a "cluster" as simply meaning that within a region of (e.g.) 4 kb, several distinct traits are all found to coexist, one can use different means to detect these traits and the traits can be different in character.

Another point is that any individual motif may have very weak specificity, yet collectively four or five motifs could generate strong specificity. One of the motifs in Martha's analyses is found in ~1000 other promoter regions (data not shown). However, if 2 other motifs of equally poor specificity can be defined for the genes in question, their *collective* occurrence should nevertheless be about 2.5 promoters/genome (taking roughly 20,000 promoters per genome and independent occurrence of the three motifs: 20,000 x (1/20)^3 = 2.5), which is highly specific and testable.

SOFTWARE LIST:

1. Programs for phylogenetic footprinting

1a. seqcomp (Family Relations, mussa)

At present this is the only program I know of that actually works well in extracting small regions of similarity from large, generally weakly conserved cis-regulatory regions (by comparison with the ClustalW program, at any rate). It has been used in C. elegans by Martha Kirouac on several cis-regulatory sequences active during vulval development (submitted). The Family Relations interface has been available for some time; the mussa interface for multiple alignments was extensively developed this summer by Nora Mullaney and Tristan de Buysscher.
Source code:
http://woldlab.caltech.edu/~tristan/seqcomp.22.08.2002.tar.gz
http://family.caltech.edu/FR/dist/FR-0.7.tar.gz
ftp://tenaya.caltech.edu/pub/caltech/mussa-source-27aug2002.tar.gz

References:
Brown, C.T., Rust, A.G., Clarke, P.J., Pan, Z., Schilstra, M.J., De Buysscher, T., Griffin, G., Wold, B.J., Cameron, R.A., Davidson, E.H., and Bolouri, H. (2002). New computational approaches for analysis of cis-regulatory networks. Dev. Biol. 246, 86-102.
http://woldlab.caltech.edu/~tristan/mussa/mussa.html

1b. FootPrinter

FootPrinter uses a known phylogenetic tree for a set of sequences to define motifs that have undergone minimal divergence between those sequences. The basic reasoning is that, given a known evolutionary tree for a set of sequences, motifs are most convincing if they would require the minimum amount of mutation between the species to have the pattern we see now. It is not graphically oriented, but perhaps might be a useful cross-check for graphical searches such as FR/mussa.

Source code:
http://www.cs.washington.edu/homes/blanchem/www_software/FootPrinter2.0.tar.gz
http://abstract.cs.washington.edu/~blanchem/cgi-bin/FootPrinter.pl

References:
Blanchette, M. and Tompa, M. (2002). Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 12, 739-748.
Blanchette, M., Schwikowski, B., and Tompa, M. (2002). Algorithms for phylogenetic footprinting. J. Comput. Biol. 9, 211-223.

2. Programs for extracting individual statistically significant motifs, or for searching sequences with them

2a. MEME/MAST

MEME extracts motifs (highly conserved regions) in groups of related DNA (or protein) sequences. These can then be used to search sequence databases with MAST. In C. elegans, it has been used to analyse genes coexpressed in the body wall mechanoreceptors.

Source code:
ftp://ftp.sdsc.edu/pub/sdsc/biology/meme

References:
Bailey, T.L. and Elkan, C. (1994).
Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28-36.
Bailey, T.L. and Gribskov, M. (1998). Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14, 48-54.
Zhang, Y., Ma, C., Delohery, T., Nasipak, B., Foat, B.C., Bounoutas, A., Bussemaker, H.J., Kim, S.K., and Chalfie, M. (2002). Identification of genes expressed in C. elegans touch receptor neurons. Nature 418, 331-335.

2b. Consensus

The Consensus software suite extracts motifs from DNA (or proteins); it has been used in two different C. elegans studies. One was predictive: a motif was extracted from muscle-specific genes and then successfully used to predict the transcriptional specificity of others. The other was detection of a novel motif from the promoters of heat-shock genes detected in a microarray.

Source code:
ftp://ftp.genetics.wustl.edu/pub/stormo/Consensus

References:
Hertz, G.Z. and Stormo, G.D. (1999). Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15, 563-577.
GuhaThakurta, D., Schriefer, L.A., Hresko, M.C., Waterston, R.H., and Stormo, G.D. (2002). Identifying muscle regulatory elements and genes in the nematode Caenorhabditis elegans. Pac. Symp. Biocomput. 2002, 425-436.
GuhaThakurta, D., Palomar, L., Stormo, G.D., Tedesco, P., Johnson, T.E., Walker, D.W., Lithgow, G., Kim, S., and Link, C.D. (2002). Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using microarray gene expression and computational methods. Genome Res. 12, 701-712.

2c. AlignACE/CompareACE/ScanACE

This set of programs uses Gibbs sampling to identify overrepresented sequences in DNA. It has been used by Martha Kirouac to extract possible sites for trans-regulatory factor binding from the large, and rather opaque, motifs provided by phylogenetic footprinting alone (submitted).
While it has not been otherwise used much in C. elegans, it has seen extensive use for cis-regulatory regions of S. cerevisiae. The source code is not freely available, except by special request (e.g., in order to compile the program on an unusual computing platform like LinuxPPC). Binaries for Linux, however, are freely available.

Binaries:
http://atlas.med.harvard.edu/download/index.html
http://atlas.med.harvard.edu/download/extra.html

References:
Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. (1998). Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16, 939-945.
Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. (2000). Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205-1214.
http://atlas.med.harvard.edu/download/alignace_notes.html

2d. Improbizer and Motif Matcher

These programs look for motifs that occur in DNA improbably often (hence "improbizer"), and then scan DNA for any motifs found. Improbizer has been used by Jim Kent at UCSC to identify motifs in C. elegans introns. It is available as part of a big general source package ("jksrc"), which has the improbizer code in the directory "hg/geneBounds/motifSig". There are compilation problems, but I expect they will be easy to fix.

Source code:
http://www.soe.ucsc.edu/~kent/src/jksrc447.zip

Reference:
http://www.cse.ucsc.edu/~kent/improbizer/index.html

2e. Yeast Motif Finder (YMF) and FindExplanators

YMF identifies overrepresented motifs; FindExplanators extracts a short list of genuinely distinct motifs from a larger list of automatically generated motifs. The latter program can be used on motifs from any source, and addresses a statistical problem not often discussed in these analyses.
Source code:
The code is freely available:
http://www.cs.washington.edu/homes/saurabh/ymf/ymf2.0.tar.gz
http://www.cs.washington.edu/homes/saurabh/ymf/explanators1.0.tar.gz
but must be accessed through a licensing portal:
http://abstract.cs.washington.edu/~blanchem/cgi-bin/YMF.pl

References:
Sinha, S. and Tompa, M. (2000). A statistical method for finding transcription factor binding sites. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 344-354.
Blanchette, M. and Sinha, S. (2001). Separating real motifs from their artifacts. Bioinformatics 17 Suppl. 1, S30-S38.

3. Programs for detecting clusters or signatures of multiple motifs

3a. CoBind

This program is explicitly designed to allow searches for joint cis-regulatory motifs (target sites) of cooperatively binding factors. It, or something like it, may well be needed to boost the signal-to-noise ratio of individual motifs. For instance, scanning the genome with a single one of Martha's motifs, using ScanACE, gave ~1000 hits -- despite the fact that the individual motif in question was very likely to be valid. The combination of several motifs with "soft" specificity may be a recurrent theme in the regulatory regions we dissect.

Reference:
GuhaThakurta, D. and Stormo, G.D. (2001). Identifying target sites for cooperatively binding factors. Bioinformatics 17, 608-621.

Source code:
http://ural.wustl.edu/~dg/Co-Bind/Co-Bind_src_release_06.01.tar.gz
http://ural.wustl.edu/~dg/Co-Bind/parse_results.perl
http://ural.wustl.edu/~dg/Co-Bind/sort_results.perl

3b. cooccur_scan.pl

This program has been successfully used to predict novel enhancers in the Drosophila genome.

Reference:
Halfon, M.S., Grad, Y., Church, G.M., and Michelson, A.M. (2002). Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome Res. 12, 1019-1028.

Source code:
http://arep.med.harvard.edu/Halfon_Grad_etal/cooccur_program

3c. Worm Enhancer

This program also searches for clusters of binding sites in the genome. It is only available as a Web server. While not yet used on C. elegans, it was used successfully on the Drosophila genome to predict new targets of regulatory genes. A related approach (cis-analyst) is only available for Drosophila genomic DNA through a Web interface, but corroborates the possible usefulness of this approach.

Web page (no source code):
http://wormenhancer.org/Main

References:
Markstein, M., Markstein, P., Markstein, V., and Levine, M.S. (2002). Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl. Acad. Sci. 99, 763-768.
Berman, B.P., Nibu, Y., Pfeiffer, B.D., Tomancak, P., Celniker, S.E., Levine, M., Rubin, G.M., and Eisen, M.B. (2002). Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl. Acad. Sci. 99, 757-762.
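
Appended note: a toy illustration of the "cluster" idea

To make the cluster argument in these notes a little more concrete, here is a minimal Python sketch. It is not the algorithm used by CoBind, cooccur_scan.pl, or Worm Enhancer; it only illustrates two ideas from the text: that several weakly specific motifs can be jointly specific, and that a "cluster" can be defined as a limited window in which all motifs of interest co-occur, even when individual sites are degenerate. The function names (expected_joint_hits, find_hits, find_clusters), the forward-strand-only scan, the toy sequences, and the total of 20,000 promoters are all assumptions made for illustration. The 20,000 figure is not stated in the notes; it is simply the total that reproduces the quoted estimate of about 2.5 promoters/genome if the motifs occur independently.

```python
from fractions import Fraction


def expected_joint_hits(n_promoters, hits_per_motif):
    """Expected number of promoters containing every motif.

    Assumes the motifs occur independently of one another, so the
    per-motif fractions (hits / n_promoters) simply multiply.
    """
    joint = Fraction(n_promoters)
    for hits in hits_per_motif:
        joint *= Fraction(hits, n_promoters)
    return float(joint)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(seq, site, max_mismatches=0):
    """Start positions (0-based, forward strand only) where `site`
    matches `seq` with at most `max_mismatches` substitutions."""
    k = len(site)
    return [
        i
        for i in range(len(seq) - k + 1)
        if hamming(seq[i:i + k], site) <= max_mismatches
    ]


def find_clusters(seq, sites, window, max_mismatches=0):
    """Return (start, end) spans of at most `window` bases that contain
    at least one hit for every site in `sites`.

    Each span is anchored at a hit; overlapping spans are not merged.
    """
    # Collect every hit as (start, end, index of the site it matches).
    hits = sorted(
        (pos, pos + len(site), idx)
        for idx, site in enumerate(sites)
        for pos in find_hits(seq, site, max_mismatches)
    )
    wanted = set(range(len(sites)))
    clusters = []
    for i, (start, _, _) in enumerate(hits):
        # Hits at or to the right of the anchor that end inside the window.
        inside = [h for h in hits[i:] if h[1] <= start + window]
        if {idx for _, _, idx in inside} == wanted:
            clusters.append((start, max(end for _, end, _ in inside)))
    return clusters


if __name__ == "__main__":
    # Three motifs, each found in ~1000 promoters, with an ASSUMED
    # total of 20,000 promoters.
    print(expected_joint_hits(20000, [1000, 1000, 1000]))

    # Toy sequence: a one-mismatch copy of GATTACA, then an exact TTGCGCA.
    toy = "CCCC" + "GATAACA" + "CCCCC" + "TTGCGCA" + "CCCC"
    print(find_clusters(toy, ["GATTACA", "TTGCGCA"], window=30, max_mismatches=2))
```

With the assumed total of 20,000 promoters, expected_joint_hits(20000, [1000, 1000, 1000]) gives 2.5, which matches the estimate quoted earlier in these notes. Allowing up to 2 mismatches in a 7 bp site mirrors the degree of degeneracy mentioned for echinoderm binding sites; a real scan would also need to check the reverse strand and use a proper background model.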