Does the unknown sequence contain an ORF closely related to group II intron RTs?
If a BLAST search identifies group II intron RTs as the closest relatives of an unknown ORF, then the unknown sequence probably contains a group II intron or a fragment of one.
Is the unknown sequence >80% identical to another group II intron DNA sequence?
If the DNA sequence is >80% identical, the boundaries are straightforward to determine by aligning the unknown sequence with the known intron. If the alignment breaks off abruptly before the end of the intron, then the unknown sequence is a fragment. A fragment is also indicated if the ORF lacks parts of RT domains 0-7 or domain X (the Zn domain is optional). Fragments are in fact more numerous than full-length introns, so it is not unusual for parts of the intron to be missing. Also, many full-length introns contain frameshifts or stop codons in their ORFs and are probably nonfunctional.
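The >80% check can be sketched in a few lines once the unknown sequence has been aligned to a known intron. This is only an illustration: the function names are made up here, the two inputs are assumed to be rows of an existing pairwise alignment with "-" as the gap character, and identity is counted over every column in which at least one sequence has a base (other conventions exist and give somewhat different numbers).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of a pairwise alignment.

    Assumption: identity = matching columns / columns in which at least
    one row has a base. This is one convention among several.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    columns = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # column with gaps in both rows carries no information
        columns += 1
        if a == b:
            matches += 1
    return 100.0 * matches / columns if columns else 0.0


def exceeds_identity_cutoff(aligned_a: str, aligned_b: str,
                            cutoff: float = 80.0) -> bool:
    """True if the two aligned rows are more than `cutoff` percent identical."""
    return percent_identity(aligned_a, aligned_b) > cutoff
```

For example, `exceeds_identity_cutoff(unknown_row, known_row)` would tell you whether the boundaries can reasonably be read off the alignment with the known intron.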
What is the closest relative of the intron?
The closest relative should be used as a guide to model the structure and boundaries of an unknown intron. For instance, if the closest relative has a standard IIA RNA structure, then the unknown intron probably also has a IIA structure with similar motifs. The closer the relationship between the ORFs, the more similar the intron structures should be. If there are no close relatives, well, good luck!
Can intron domains 5 and 6 be located?
Domain 5 is the only motif of the RNA structure that is highly conserved in sequence. Using the consensus structures for domain 5 (this web site or Toor et al., 2001), search for a match to domain 5 in the vicinity of the stop codon of the ORF. It usually lies within 40 bp of the stop codon, and usually downstream. The “AGC” and “CGC” sequences should be invariant for the structural types. If a domain 5 can be located, then there should be a reasonable domain 6 directly downstream (follow the consensus structures for the structural subclasses): a hairpin of variable length that contains a bulged A, which is the branch site, and that finally ends in AC, AT, or ACC.
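As a rough first pass, the invariant “AGC”/“CGC” trinucleotides can be used as seeds to list places worth inspecting near the stop codon. The sketch below is only that: the function name, the default pattern `[AC]GC`, and measuring distance from the first base of the stop codon are assumptions made for illustration. A seed hit is not a domain 5; each one still has to be checked by eye against the consensus structures.

```python
import re


def find_motif_near_stop(seq: str, stop_index: int,
                         motif: str = r"[AC]GC", window: int = 40):
    """List motif hits within `window` bases of the stop codon.

    seq        : DNA or RNA sequence (U is treated as T)
    stop_index : 0-based index of the first base of the stop codon
    motif      : regular expression for the seed (assumed default: AGC or CGC)

    Returns (position, offset) tuples, where offset = position - stop_index.
    Downstream hits (offset >= 0) come first, nearest first, because
    domain 5 is usually downstream of the stop codon.
    """
    dna = seq.upper().replace("U", "T")
    # A lookahead is used so that overlapping seeds are all reported.
    pattern = re.compile(f"(?=(?:{motif}))")
    hits = []
    for m in pattern.finditer(dna):
        offset = m.start() - stop_index
        if abs(offset) <= window:
            hits.append((m.start(), offset))
    hits.sort(key=lambda h: (h[1] < 0, abs(h[1])))
    return hits
```

Passing a fuller regular expression as `motif` would narrow the list, but that pattern has to come from the consensus structures themselves and is not supplied here.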
Identification of the 5’ end.
Identifying the 5’ boundary can be hard and is less reliable. The consensus sequence for the 5’ end is GUGYG, but there are often many candidate GUGYGs in the expected start region. The 5’ end is usually located 400-600 bp upstream of the start codon of the ORF. The best bet is to locate the boundary based on a closely related intron. Otherwise, domain I has to be folded into a reasonable structure based on the consensus structures. The easiest way to do this is to remove the ORF sequence from the intron and subject the remainder to RNA folding via MFOLD (http://www.bioinfo.rpi.edu/applications/mfold/old/rna/). The optimal folding is usually not entirely correct, but by scanning through the suboptimal foldings, a structure can often be located that obeys most of the structural rules and is mostly right. This requires judgment and practice, and if you get to this stage, maybe just send it to us and we’ll try to finish it for you.
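Listing the candidate GUGYG sites in the expected region is easy to script, even though choosing among them is not. The sketch below assumes 0-based coordinates, takes Y as C or U (T in DNA), and measures distance from the first base of the candidate to the first base of the start codon. The function name and parameters are illustrative, not part of any existing tool.

```python
import re


def find_5prime_candidates(seq: str, orf_start: int,
                           min_dist: int = 400, max_dist: int = 600):
    """List GUGYG sites lying min_dist..max_dist bases upstream of the ORF.

    seq       : DNA or RNA sequence (U is treated as T)
    orf_start : 0-based index of the first base of the ORF start codon

    Returns (position, distance) tuples in 5'-to-3' order, where
    distance = orf_start - position.
    """
    dna = seq.upper().replace("U", "T")
    hits = []
    # Y = C or T; the lookahead lets overlapping sites (e.g. GTGTGTG) be found.
    for m in re.finditer(r"(?=GTG[CT]G)", dna):
        distance = orf_start - m.start()
        if min_dist <= distance <= max_dist:
            hits.append((m.start(), distance))
    return hits
```

Each position returned is only a candidate start. The candidates still have to be compared with a closely related intron, or tested by folding domain I, before one is accepted as the boundary.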