- The search patterns used by the Sequence Manipulation Suite are not case sensitive. The following is a simple search pattern that will find all occurrences of the sequence fragment GGAT (and ggat):
ggat
The above will match GGAT but not GGAA.
- Sequences containing residues that vary at a particular position can be matched using square brackets. The following pattern will find all occurrences of GGAT, GGAC, and GGAA:
gga[tca]
The above will match GGAT but not GGAG.
- To represent a completely variable residue in a pattern, use the . character. The following pattern will find all occurrences of GCA followed by any single residue, followed by TTT:
gca.ttt
The above will match GCAATTT but not GCAAATTT.
- To indicate that a residue can be repeated one or more times in a sequence, use the + character. The following pattern will find all occurrences of MVV followed by one or more R residues:
MVVR+
The above will match MVVRR but not MVVDR.
- To indicate that a residue can be repeated zero or more times in a sequence, use the * character. The following pattern will find all occurrences of MD followed by zero or more K residues, followed by an L:
MDK*L
The above will match MDL but not MDVL.
- To indicate that a residue can be repeated a specific number of times, use curly braces. The following pattern will find all occurrences of an M residue, followed by between one and four L residues, followed by a G residue:
ML{1,4}G
The above will match MLLG but not MLLLLLG.
- The special characters, square brackets, and curly braces in the above examples allow repeated residues to be found. You can find repeated sub-sequences using regular parentheses in combination with the +, *, and {} characters. The following pattern will find all occurrences of two to five TNT sequences in a row, followed by one or more KM repeats:
(TNT){2,5}(KM)+
The above will match TNTTNTTNTKM but not TNTTNKM.
- To restrict matches to the beginning of a sequence, use the ^ character. For example, the following pattern will find GACCCT only if no more than three residues separate it from the sequence start:
^.{0,3}GACCCT
The above will find GACCCT in the sequence ATCGACCCT but not in the sequence AATCGACCCT.
- To restrict matches to the end of a sequence, use the $ character. For example, the following pattern will find LVL only if it is located at the end of a sequence:
LVL$
The above will find LVL in the sequence KMHLVL but not in the sequence LVLD.
- To find variable sequences, you can also use the | character to separate patterns for the different versions of the sequence segment you want to find. For example, to find all occurrences of MML, MAL, and MAD you could use the following:
MML|MAL|MAD
The above will match MML but not MMK.
- Other examples:
atg(...)+(tag|taa|tga)
The above will match open reading frames that start with atg, continue with one or more complete codons, and end with tag, taa, or tga.
[VILMFWC]{10,}
The above will match stretches of proteins containing ten or more hydrophobic residues.
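The patterns above can be tried outside the Suite with any regular expression engine that supports the same constructs. The following is a minimal sketch in Python, assuming that the standard re module with the IGNORECASE flag behaves like the Suite's case-insensitive search for these patterns; the Suite's own engine may differ in details. The names EXAMPLES, find_all, and check_examples are illustrative and are not part of the Suite.

```python
import re

# (pattern, sequence that should match, sequence that should not match),
# copied from the examples above.
EXAMPLES = [
    ("ggat", "GGAT", "GGAA"),
    ("gga[tca]", "GGAT", "GGAG"),
    ("gca.ttt", "GCAATTT", "GCAAATTT"),
    ("MVVR+", "MVVRR", "MVVDR"),
    ("MDK*L", "MDL", "MDVL"),
    ("ML{1,4}G", "MLLG", "MLLLLLG"),
    ("(TNT){2,5}(KM)+", "TNTTNTTNTKM", "TNTTNKM"),
    ("^.{0,3}GACCCT", "ATCGACCCT", "AATCGACCCT"),
    ("LVL$", "KMHLVL", "LVLD"),
    ("MML|MAL|MAD", "MML", "MMK"),
]


def find_all(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match.

    Assumption: re.IGNORECASE stands in for the Suite's
    case-insensitive matching.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return [(m.start(), m.group(0)) for m in regex.finditer(sequence)]


def check_examples():
    """Return (pattern, hit_found, miss_found) for every example."""
    return [
        (pattern, bool(find_all(pattern, hit)), bool(find_all(pattern, miss)))
        for pattern, hit, miss in EXAMPLES
    ]


if __name__ == "__main__":
    for pattern, hit_found, miss_found in check_examples():
        status = "ok" if hit_found and not miss_found else "UNEXPECTED"
        print(f"{pattern:20} {status}")

    # Case is ignored, so both GGAT and ggat are reported.
    print(find_all("ggat", "ccGGATccggat"))

    # Open reading frame: atg, whole codons, then a stop codon.
    print(find_all("atg(...)+(tag|taa|tga)", "ccatgaaatttgactaacc"))

    # Ten or more hydrophobic residues in a row.
    print(find_all("[VILMFWC]{10,}", "KKVILMFWCVILKK"))
```

Note that the + and * characters are greedy in this engine, so a pattern such as atg(...)+(tag|taa|tga) extends to the last in-frame stop codon it can reach, not the first one.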
2.304-Fri May 5 17:09:48 2006