This section presents analyses of selected organisms. The organisms were chosen to maximize the taxonomic coverage of fully sequenced genomes with a reasonable level of annotation.
These data were analyzed and discussed in the work we submitted to Genome Research. They serve both as a reference for users of the AAstretch programs and as a basis for evaluating the
results we obtained on the evolutionary conservation of compositional biases associated with "imperfect" poly-Qs.
As can be seen by opening the file in a simple text editor or, better, in a spreadsheet program such as Microsoft Excel or LibreOffice Calc, the AAstretch program
reports a number of attributes that are useful for a correct interpretation of the results.
Two different files are provided, one for the protein-based analysis and the other for the analysis of the coding sequences (we call it the synchronized analysis). The formats of the
output files are essentially identical, apart from the fact that the statistics in the latter are based on codons rather than on residues.
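To illustrate what codon-based statistics mean in practice, here is a minimal Python sketch that splits a coding sequence into codons and reports the percentage of each codon. It is not the AAstretch implementation: the function names `split_codons` and `codon_percentages` are hypothetical, and the sketch assumes that the sequence is in frame starting at its first nucleotide.

```python
from collections import Counter

def split_codons(cds):
    """Split a coding sequence into consecutive 3-nucleotide codons.

    Assumption: the sequence is in frame from its first nucleotide.
    An incomplete trailing codon, if any, is dropped.
    """
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3].upper() for i in range(0, usable, 3)]

def codon_percentages(cds):
    """Return the percentage of each codon found in the coding sequence."""
    codons = split_codons(cds)
    if not codons:
        return {}
    counts = Counter(codons)
    return {codon: 100.0 * n / len(codons) for codon, n in counts.items()}

# Glutamine is encoded by CAG and CAA, so a poly-Q coding stretch
# can be summarized by the share of each of the two codons.
print(codon_percentages("CAGCAGCAACAG"))
```

In a residue-based analysis the same stretch would simply count as four Q residues; the codon view additionally distinguishes CAG from CAA.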
The first section of the output lists all the configuration settings of the AAstretch program in two columns. It is followed by a number of columns with specific attributes, which are briefly
described here (see the manual on the download page for a longer explanation):
- Name: the name of the sequence (protein or coding) containing the stretch. This is the header of the FASTA-formatted sequence as available in the Organism page and written from
genomic data by the AAprepare program. It contains the Ensembl protein id, the Ensembl transcript id, the Ensembl gene id and the description of the gene.
- Len: the length of the sequence containing the localized stretch.
- Stretches_tot: the number of stretches found in the same sequence (stretches are reported one per line, so multiple stretches in a protein appear close together but on different lines).
- Stretch_seq: the sequence of the localized stretch
- Stretch_len: the length of the localized stretch
- Q%: the percentage of Q residues in the stretch. In all the following examples glutamine was chosen for the analysis, but any other residue can be used.
- pureQ_len: the length of the longest pure poly-Q repeat inside an imperfect polyQ stretch. Our definition of polyQ stretch tolerates insertions, so this kind of information is very useful.
- pureQ_ratio: the ratio between the length of the longest pure polyQ and the length of the full imperfect stretch.
- start: the position of the first residue of the stretch within the sequence.
- stop: the position of the last residue of the stretch within the sequence.
- Position%: the position of the stretch within the sequence expressed as a percent of the full length of the sequence. This is useful for localizing the stretch in a consistent manner across
the different sequences.
- lf_seq: the sequence of the N-terminal (left) flank of the stretch. Here the flanks were defined as 30 residues long, but this length can be changed in the configuration of the AAstretch program.
- rf_seq: the sequence of the C-terminal (right) flank of the stretch. See lf_seq for details
- go_func: the Gene Ontology function codes and terms associated with the gene where the stretch has been localized.
- go_proc: the Gene Ontology biological process codes and terms associated with the gene where the stretch has been localized.
- go_comp: the Gene Ontology cellular component codes and terms associated with the gene where the stretch has been localized.
- omim: only for human (and if any), the code and description of the record in the Online Mendelian Inheritance in Man (OMIM) database, indicating inheritable diseases associated with the gene.
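As a reading aid for the attributes listed above, the following Python sketch recomputes some of them from a sequence and the coordinates of a stretch. It is an illustration under stated assumptions, not the AAstretch code: the function name `describe_stretch` is hypothetical, start and stop are assumed to be 1-based and inclusive, and Position% is assumed to be the start position divided by the sequence length (the manual should be consulted for the exact definitions).

```python
import re

def describe_stretch(sequence, start, stop, residue="Q", flank=30):
    """Recompute a few per-stretch attributes for one localized stretch.

    Assumptions (not confirmed by the AAstretch documentation):
    start and stop are 1-based and inclusive, and Position% is
    computed from the start position of the stretch.
    """
    stretch = sequence[start - 1:stop]
    # Longest uninterrupted run of the chosen residue inside the stretch.
    runs = re.findall(re.escape(residue) + "+", stretch)
    pure_len = max((len(run) for run in runs), default=0)
    return {
        "Len": len(sequence),
        "Stretch_seq": stretch,
        "Stretch_len": len(stretch),
        "Q%": 100.0 * stretch.count(residue) / len(stretch),
        "pureQ_len": pure_len,
        "pureQ_ratio": pure_len / len(stretch),
        "start": start,
        "stop": stop,
        "Position%": 100.0 * start / len(sequence),  # assumed definition
        # Flanks are truncated when the stretch lies near a sequence end.
        "lf_seq": sequence[max(0, start - 1 - flank):start - 1],
        "rf_seq": sequence[stop:stop + flank],
    }

# Toy protein with an imperfect poly-Q stretch (one H insertion).
example = describe_stretch("MKTQQQQHQQLLA", start=4, stop=10)
for name, value in example.items():
    print(name, value, sep="\t")
```

In this toy example the imperfect stretch QQQQHQQ is 7 residues long and its longest pure poly-Q run is 4 residues, which shows why pureQ_len and pureQ_ratio are reported alongside Stretch_len.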
Some examples of the AAstretch results are presented here.