AAstretch Project Homepage

From an Italian-French Collaboration

AAstretch Project - Imperfect homopeptidic repeats in genomes

This section shows analyses of selected organisms. The organisms were chosen to maximize the taxonomic coverage of fully sequenced genomes with a reasonable annotation rate. These data were analyzed and discussed in the work we submitted to Genome Research; they serve both as a reference for users of the AAstretch programs and as a basis for evaluating the results we obtained on the evolutionary conservation of compositional biases associated with the "imperfect" poly-Qs.
As one can see by opening a file in a simple text editor or, better, in a spreadsheet program such as Microsoft Excel or LibreOffice Calc, the AAstretch program emits a number of fields useful for a correct interpretation of the results.
Two different files are provided for each organism: one for the protein-based analysis and one for the analysis of the coding sequences (we call it the synchronized analysis). The formats of the two output files are essentially identical, apart from the fact that statistics in the latter are based on codons rather than on residues.
The first section of the output lists all the configuration settings of the AAstretch program in two columns. It is followed by a number of columns with specific attributes, which are briefly described here (see the manual on the download page for a longer explanation):

Name: the name of the sequence (protein or coding) containing the stretch. This is actually the header of the FASTA-formatted sequence as available in the Organisms page and written from genomic data by the AAprepare program. It contains the Ensembl protein id, the Ensembl transcript id, the Ensembl gene id and the description of the gene.
Len: the length of sequence containing the localized stretch
Stretches_tot: the number of stretches found in the same sequence (stretches are emitted one per line, so multiple stretches in a protein will stay close but on different lines)
Stretch_seq: the sequence of the localized stretch
Stretch_len: the length of the localized stretch
Q%: the percentage of Q residues in the stretch. In all the following examples glutamine was chosen for the analysis, but any other residue can be used.
pureQ_len: the length of the longest pure poly-Q repeat inside an imperfect polyQ stretch. Our definition of polyQ stretch tolerates insertions, so this kind of information is very useful.
pureQ_ratio: the ratio between the length of the longest pure polyQ and the length of the full imperfect stretch.
start: the position of the first residue of the stretch within the sequence.
stop: the position of the last residue of the stretch within the sequence.
Position%: the position of the stretch within the sequence expressed as a percent of the full length of the sequence. This is useful for localizing the stretch in a consistent manner across the different sequences.
lf_seq: the sequence of the N-terminal (left) flank of the stretch. Here the flanks were defined as 30 residues long, but this can be changed in the configuration of the AAstretch program.
rf_seq: the sequence of the C-terminal (right) flank of the stretch. See lf_seq for details.
go_func: the Gene Ontology function codes and terms associated with the gene where the stretch has been localized.
go_proc: the Gene Ontology biological process codes and terms associated with the gene where the stretch has been localized.
go_comp: the Gene Ontology cellular component codes and terms associated with the gene where the stretch has been localized.
omim: only for human (and if any), the code and description of the record in the Online Mendelian Inheritance in Man (OMIM) database, indicating inheritable diseases associated with the gene.
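To make the per-stretch attributes above concrete, here is a minimal Python sketch that recomputes several of them for a toy protein sequence. It is not the AAstretch implementation: the function names are hypothetical, start and stop are assumed to be 1-based and inclusive, and Position% is assumed to be the start position divided by the sequence length. The manual on the download page remains the reference for the exact definitions.

```python
def longest_run(stretch: str, residue: str = "Q") -> int:
    """Length of the longest uninterrupted run of `residue` in `stretch`."""
    best = current = 0
    for aa in stretch:
        current = current + 1 if aa == residue else 0
        best = max(best, current)
    return best


def stretch_attributes(sequence: str, start: int, stop: int,
                       residue: str = "Q", flank_len: int = 30) -> dict:
    """Recompute some output columns for one stretch.

    Assumptions (not confirmed by the AAstretch documentation):
    `start` and `stop` are 1-based inclusive positions, and
    Position% is derived from `start`.
    """
    stretch = sequence[start - 1:stop]
    pure_len = longest_run(stretch, residue)
    return {
        "Len": len(sequence),
        "Stretch_seq": stretch,
        "Stretch_len": len(stretch),
        "Q%": 100.0 * stretch.count(residue) / len(stretch),
        "pureQ_len": pure_len,
        "pureQ_ratio": pure_len / len(stretch),
        "Position%": 100.0 * start / len(sequence),
        # Flanks are truncated when the stretch lies near a sequence end.
        "lf_seq": sequence[max(0, start - 1 - flank_len):start - 1],
        "rf_seq": sequence[stop:stop + flank_len],
    }


# Toy sequence: an imperfect poly-Q (one H insertion) at positions 4 to 10.
example = stretch_attributes("MKTQQQQHQQLLA", start=4, stop=10)
for name, value in example.items():
    print(f"{name}\t{value}")
```

In this toy case the imperfect stretch is QQQQHQQ, so pureQ_len is 4 and pureQ_ratio is 4/7: the single histidine insertion is tolerated in the stretch but breaks the pure run.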

Some examples of AAstretch results are presented here.

Arabidopsis thaliana:	protein results	coding sequence results
Aspergillus niger:	protein results	coding sequence results
Caenorhabditis elegans:	protein results	coding sequence results
Canis familiaris:	protein results	coding sequence results
Danio rerio:	protein results	coding sequence results
Dictyostelium discoideum:	protein results	coding sequence results
Drosophila melanogaster:	protein results	coding sequence results
Gallus gallus:	protein results	coding sequence results
Homo sapiens:	protein results	coding sequence results
Mus musculus:	protein results	coding sequence results
Oryza sativa:	protein results	coding sequence results
Pan troglodytes:	protein results	coding sequence results
Plasmodium falciparum:	protein results	coding sequence results
Pongo pygmaeus:	protein results	coding sequence results
Saccharomyces cerevisiae:	protein results	coding sequence results
Schizosaccharomyces pombe:	protein results	coding sequence results
Xenopus tropicalis:	protein results	coding sequence results
