From an Italian-French Collaboration
AAstretch Project - Imperfect homopeptidic repeats in genomes



This section shows analyses of selected organisms. The organisms were chosen to maximize the taxonomic coverage of fully sequenced genomes with a reasonable annotation rate. These data were analyzed and discussed in the work we submitted to Genome Research. They serve both as a reference for users of the AAstretch programs and as a basis for evaluating the results we obtained on the evolutionary conservation of compositional biases associated with the "imperfect" poly-Qs.
As one can see by opening a file in a simple text editor or, better, in a spreadsheet program such as Microsoft Excel or LibreOffice Calc, the AAstretch program outputs several fields that are useful for a correct interpretation of the results.
Two files are provided for each organism: one for the protein-based analysis and one for the analysis of the coding sequences (we call it the synchronized analysis). The formats of the two output files are essentially identical, apart from the fact that statistics in the latter are based on codons rather than on residues.
The first section of the output lists all the configuration settings of the AAstretch program in two columns. It is followed by a number of columns with specific attributes, which are briefly described here (see the manual on the download page for a longer explanation):
  • Name: the name of the sequence (protein or coding) containing the stretch. This is the header of the FASTA-formatted sequence as available on the Organisms page and written from genomic data by the AAprepare program. It contains the Ensembl protein ID, the Ensembl transcript ID, the Ensembl gene ID and the description of the gene.
  • Len: the length of the sequence containing the localized stretch
  • Stretches_tot: the number of stretches found in the same sequence (stretches are emitted one per line, so multiple stretches in a protein will stay close but on different lines)
  • Stretch_seq: the sequence of the localized stretch
  • Stretch_len: the length of the localized stretch
  • Q%: the percentage of Q residues in the stretch. In all the following examples glutamine was chosen for the analysis, but any other residue can be used.
  • pureQ_len: the length of the longest pure poly-Q repeat inside an imperfect polyQ stretch. Our definition of polyQ stretch tolerates insertions, so this kind of information is very useful.
  • pureQ_ratio: the ratio between the length of the longest pure polyQ and the length of the full imperfect stretch.
  • start: the position of the first residue of the stretch within the sequence.
  • stop: the position of the last residue of the stretch within the sequence.
  • Position%: the position of the stretch within the sequence expressed as a percent of the full length of the sequence. This is useful for localizing the stretch in a consistent manner across the different sequences.
  • lf_seq: the sequence of the N-terminal (left) flank of the stretch. Here the flanks were defined as 30 residues long, but this length can be changed in the configuration of the AAstretch program.
  • rf_seq: the sequence of the C-terminal (right) flank of the stretch. See lf_seq for details
  • go_func: the Gene Ontology function codes and terms associated with the gene where the stretch has been localized.
  • go_proc: the Gene Ontology biological process codes and terms associated with the gene where the stretch has been localized.
  • go_comp: the Gene Ontology cellular component codes and terms associated with the gene where the stretch has been localized.
  • omim: only for human (and if any), the code and description of the record in the Online Mendelian Inheritance in Man (OMIM) database, indicating inheritable diseases associated with the gene.
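To make the stretch-level attributes above more concrete, the following minimal Python sketch recomputes Stretch_len, Q%, pureQ_len and pureQ_ratio from a Stretch_seq value. It is not the AAstretch implementation: the function names, the example stretch, and the choice to express pureQ_ratio as a fraction between 0 and 1 are assumptions made for illustration. The Position% helper additionally assumes that the percentage is taken from the start position, which the description above does not specify.

```python
from itertools import groupby


def stretch_stats(stretch_seq, residue="Q"):
    """Recompute stretch attributes from a stretch sequence (illustrative only)."""
    stretch_len = len(stretch_seq)
    if stretch_len == 0:
        raise ValueError("stretch_seq must not be empty")

    # Q%: share of the chosen residue in the whole (imperfect) stretch.
    residue_pct = 100.0 * stretch_seq.count(residue) / stretch_len

    # pureQ_len: longest uninterrupted run of the chosen residue.
    pure_len = max(
        (len(list(run)) for char, run in groupby(stretch_seq) if char == residue),
        default=0,
    )

    return {
        "Stretch_len": stretch_len,
        "Q%": residue_pct,
        "pureQ_len": pure_len,
        # Assumption: ratio reported as a fraction, not as a percentage.
        "pureQ_ratio": pure_len / stretch_len,
    }


def position_percent(start, seq_len):
    """Assumption: Position% is the start position relative to the sequence length."""
    return 100.0 * start / seq_len


# Hypothetical imperfect polyQ stretch with two insertions (H and P).
stats = stretch_stats("QQQQHQQQPQQ")
```

For the hypothetical stretch used here, the longest pure run is the leading QQQQ, so pureQ_len is 4 while Stretch_len is 11. Passing a different `residue` argument mirrors the fact that any residue can be chosen for the analysis.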
Some examples of AAstretch results are presented here.

Arabidopsis thaliana: protein results   coding sequence results
Aspergillus niger: protein results   coding sequence results
Caenorhabditis elegans: protein results   coding sequence results
Canis familiaris: protein results   coding sequence results
Danio rerio: protein results   coding sequence results
Dictyostelium discoideum: protein results   coding sequence results
Drosophila melanogaster: protein results   coding sequence results
Gallus gallus: protein results   coding sequence results
Homo sapiens: protein results   coding sequence results
Mus musculus: protein results   coding sequence results
Oryza sativa: protein results   coding sequence results
Pan troglodytes: protein results   coding sequence results
Plasmodium falciparum: protein results   coding sequence results
Pongo pygmaeus: protein results   coding sequence results
Saccharomyces cerevisiae: protein results   coding sequence results
Schizosaccharomyces pombe: protein results   coding sequence results
Xenopus tropicalis: protein results   coding sequence results



 

Copyright © 2011 Matteo Ramazzotti