Dear Araport team,
Greetings! Could you please answer the following questions:
1. What is the best way to obtain a breakdown of the counts for each category and sub-category of transposons that contribute to the total count of 3,901 transposable element genes? For example, can I grep for a list of TE keywords / names from the GFF or GTF file? Or is there another way you think would work?
Currently, I am trying a rather old classification system from Wicker et al. 2007, and the total is woefully short of 3,901. Hence this request for help with breaking up the 3,901 count into types and sub-types of TEs.
2. Will your answer to question #1 above also work to identify the count contributions from each TE category and sub-category, not only for the 3,901 transposon genes but also for all 35,090 TEs (I obtained this count from a simple grep for 'transposable_element' on Araport11_GFF3_genes_transposons.201606.gff)?
Thanks very much!
1. Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, et al. A unified classification system for eukaryotic transposable elements. Nat Rev Genet. 2007 Dec;8(12):973-82.
Please find our detailed response below
The focus of the Araport11 release was primarily on improving the protein-coding fraction (identifying novel genes, enforcing gene splits/merges) and the non-coding gene fraction (exhaustive annotation of the various non-coding RNA classes). We relied on published datasets to enrich these particular classes of gene features. All responses below are provided in light of this premise of the Araport project efforts.
Yes, the Araport11 annotation is based on the TAIR10 assembly. You should be able to download the genome assembly FASTA files from here: https://www.araport.org/downloads/TAIR10_genome_release/assembly
In Araport11, we have inherited the annotated Transposable Elements (TE), as-is, without any updates and/or modifications. As of TAIR8, TE annotations provided by Hadi Quesneville were combined with pre-existing annotations to create a composite set of Arabidopsis transposons. These have been assigned a unique identifier (e.g. AT3TE53245) that indicates their relative position on the chromosome. In total, there are 31,189 TE features annotated on the genome.
Regarding Transposable Element genes (TE genes): as of TAIR8, the TEs have been associated with overlapping TE genes, e.g. genes AT3G32022, AT3G32024, AT3G32026, AT3G32027 and AT3G32028 are associated with transposon AT3TE53245. This is explained in the following README (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR8_genome_release/Readme-tr...). In total, there are 3,901 TE genes annotated on the genome.
Using ThaleMine, you can retrieve the relationship between TEs and TE genes explained above by executing the following query (you can copy and paste this query into ThaleMine here: https://apps.araport.org/thalemine/importQueries.do?query_builder=yes):
<query model="genomic" view="TransposableElement.synonyms.value TransposableElement.primaryIdentifier TransposableElement.chromosome.primaryIdentifier TransposableElement.chromosomeLocation.start TransposableElement.chromosomeLocation.end TransposableElement.overlappingFeatures.primaryIdentifier" sortOrder="TransposableElement.synonyms.value ASC" > <join path="TransposableElement.overlappingFeatures" style="OUTER"/> <constraint path="TransposableElement.overlappingFeatures" type="TransposableElementGene" /> </query>
Once you click Submit, the query builder page will appear. On this page, click on Show Results and the query will be run, retrieving the data of interest in 6 columns: TE Name, TE Identifier, TE Chromosome, Chromosome Start, Chromosome End, Overlapping TE genes. On the results page, you can filter the table to your liking (e.g. by a particular set of TE families) or add/remove columns. Alternatively, you can click the Export button and download the complete dataset in tab- or comma-delimited format.
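If you would rather tally the exported file programmatically than filter it in the browser, a minimal Python sketch along the following lines may help. It assumes a tab-delimited export with the six columns in the order listed above, and that a TE without an overlapping TE gene shows up with an empty value or the text "NO VALUE" in the last column; both are assumptions about the export format, so please check them against your own file. The function name `summarize_export` and the file name `thalemine_export.tsv` are hypothetical.

```python
import csv
import io
import re
from collections import defaultdict

# TE identifiers look like AT3TE53245; anything else (e.g. a header row) is skipped.
TE_ID = re.compile(r"^AT.TE\d+$")

def summarize_export(handle):
    """Return {family: (number of TEs, number of distinct overlapping TE genes)}.

    Assumed column order: TE name (family), TE identifier, chromosome,
    start, end, overlapping TE gene.
    """
    tes = defaultdict(set)
    genes = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not TE_ID.match(row[1]):
            continue
        family, te_id = row[0], row[1]
        gene_id = row[5].strip() if len(row) > 5 else ""
        tes[family].add(te_id)
        # Assumption: missing genes are exported as "" or "NO VALUE".
        if gene_id and gene_id != "NO VALUE":
            genes[family].add(gene_id)
    return {fam: (len(ids), len(genes[fam])) for fam, ids in tes.items()}

# Demo on a single row built from a TE and TE gene pair in the TAIR10 GFF3.
# For a real export: with open("thalemine_export.tsv", newline="") as fh: ...
sample = io.StringIO("ATLINE1A\tAT1TE01405\tChr1\t432995\t433837\tAT1G02228\n")
for family, (n_te, n_gene) in sorted(summarize_export(sample).items()):
    print(family, n_te, n_gene)
```

Because a TE overlapping several TE genes appears on several rows, the sketch counts distinct identifiers rather than rows.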
Please let us know if you are unable to retrieve this data.
Hope this helps!
On Behalf of the Araport Team
Seeking some more clarifications
Thanks a lot for your reply, Vivek, especially the ThaleMine query I was able to use successfully.
Executing the ThaleMine query you constructed resulted in a TSV file with 32,456 lines. Making the text entries in the 1st column unique gave 320 TE sub-family names. I used this list in a grep loop as follows:
for i in $(cat Transposon-Gene-ClassificationFamilies.txt); do echo $i; grep 'transposable_element' Araport11_GFF3_genes_transposons.201606.gff | grep -w "$i" | wc -l; done
This resulted in 31,198 cases of 'transposable_element' spread across these 320 TE sub-families, which is slightly different from the 31,189 you referred to in your reply and that also appears in the Araport11 notes. Do you know of a simple reconciliation between these two numbers (differing by 9)? Or do you see what I may be doing wrong?
When I tried grep 'transposable_element' Araport11_GFF3_genes_transposons.201606.gff | grep 'gene' | wc -l, this gave 3,901 cases of TEs also annotated as genes, i.e. "transposable_element_gene" in the GFF file. But it is unclear how I can use the names of the 320 TE sub-families to get their sub-counts for "transposable_element_gene", which is a subset of "transposable_element". Or do you recommend a new ThaleMine query construct for this purpose? If yes, could you please help with that construct?
For overlaps between TEs and genes that got re-categorized into the "transposable_element_gene" category using a % overlap criterion, was it based on gene length, or TE length, or the shorter / longer locus?
Finally, I want to make sure I am understanding the "transposable_element_gene" concept. So these are loci that are erstwhile "gene" annotations whose genomic coordinates overlap with those of erstwhile or more recent "transposable_element" annotations, and so binned under a new category called "transposable_element_genes". Correct?
Thanks, in advance.
Follow-up response to your questions
Apologies for the delay in responding to your queries.
To help with your analysis, I've extracted just the TAIR10 Transposable Elements (TE) and Transposable Element Genes subset and prepared a separate (cleaned) file for your use (Filename: TAIR10_GFF3_transposons.gff3). In addition, there is a file listing all the TAIR10 TEs, their coordinates on the genome, the TE family name and the super-family name (Filename: TAIR10_Transposable_Elements.txt). Finally, I've included the README file from the TAIR FTP site, which describes the logic behind the relationship between TEs and TE genes.
All files I've referenced above should be available for download here: https://www.araport.org/downloads/TAIR10_genome_release/annotation/gff/t...
Now, regarding your queries, here is how you should be able to generate some easy answers:
Following is a snippet from the GFF3 file mentioned above:
###
Chr1 TAIR10 transposable_element 432995 433837 . - . ID=AT1TE01405;Name=AT1TE01405;Alias=ATLINE1A
Chr1 TAIR10 transposon_fragment 432995 433837 . - . ID=AT1TE01405:transposon_fragment:1;Parent=AT1TE01405
###
Chr1 TAIR10 transposable_element_gene 433031 433819 . - . ID=AT1G02228;Note=transposable_element_gene;Name=AT1G02228;Derives_from=AT1TE01405
Chr1 TAIR10 mRNA 433031 433819 . - . ID=AT1G02228.1;Parent=AT1G02228;Name=AT1G02228.1;index=1
Chr1 TAIR10 exon 433031 433819 . - . ID=AT1G02228:exon:1;Parent=AT1G02228.1
###
As you can see above, the TE AT1TE01405 belongs to the ATLINE1A transposon family (as denoted by the Alias attribute in the 9th column). And, the relationship of the TE gene AT1G02228 to the TE AT1TE01405 is signified by the Derives_from relationship (see attribute in the 9th column). This information should help you classify the TE gene features.
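As a rough illustration of how the Alias and Derives_from attributes could be used together, here is a minimal Python sketch that counts TEs per family and then assigns each TE gene to the family of the TE it derives from. It assumes a standard tab-delimited GFF3 file, one Alias value per TE, and one Derives_from value per TE gene; the function names are hypothetical, and TE genes whose parent TE cannot be found are reported as "unassigned". Matching the feature type in column 3 exactly keeps transposable_element and transposable_element_gene apart, whereas a plain grep for 'transposable_element' matches both (note that 31,189 + 3,901 = 35,090, the total quoted in the original question).

```python
from collections import Counter

def parse_attributes(column9):
    """Turn 'ID=x;Name=y' into {'ID': 'x', 'Name': 'y'}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs

def count_by_family(gff_lines):
    """Return (TE counts per family, TE gene counts per family)."""
    te_family = {}         # TE identifier -> family name (Alias)
    te_counts = Counter()
    gene_parents = []      # Derives_from value of every TE gene
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        feature_type, attrs = cols[2], parse_attributes(cols[8])
        if feature_type == "transposable_element":
            family = attrs.get("Alias", "unknown")
            te_family[attrs.get("ID")] = family
            te_counts[family] += 1
        elif feature_type == "transposable_element_gene":
            gene_parents.append(attrs.get("Derives_from"))
    # Resolve parents after the full pass, so feature order does not matter.
    gene_counts = Counter(te_family.get(p, "unassigned") for p in gene_parents)
    return te_counts, gene_counts

# Demo on the two relevant records from the snippet above (tabs restored).
# For the real file: with open("TAIR10_GFF3_transposons.gff3") as fh: ...
sample = [
    "Chr1\tTAIR10\ttransposable_element\t432995\t433837\t.\t-\t.\t"
    "ID=AT1TE01405;Name=AT1TE01405;Alias=ATLINE1A",
    "Chr1\tTAIR10\ttransposable_element_gene\t433031\t433819\t.\t-\t.\t"
    "ID=AT1G02228;Note=transposable_element_gene;Name=AT1G02228;"
    "Derives_from=AT1TE01405",
]
te_counts, te_gene_counts = count_by_family(sample)
print(dict(te_counts))       # {'ATLINE1A': 1}
print(dict(te_gene_counts))  # {'ATLINE1A': 1}
```

Summing the values of each counter gives totals that can be compared against the 31,189 TEs and 3,901 TE genes mentioned earlier.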
Regarding the % overlap criterion: a gene previously classified as protein-coding or as a pseudogene was re-classified as a TE gene based on the extent of its overlap with a TE feature (as annotated by Hadi Quesneville).
Finally, yes, your understanding of TE genes is correct (e.g. a gene encoded within a transposable element, such as a helicase or transposase). A variety of evidence was used to classify such gene features, primarily the Quesneville TE annotation as well as sequence similarity hits to Repbase and GenBank, to ensure that these genes are indeed like other known TE genes.
All of the data you are looking at above has been unchanged since the TAIR8 release (i.e. there have been no recent community-driven efforts to enrich the existing Arabidopsis thaliana transposon annotations).
Hope this helps with your analysis!
On behalf of the Araport Team
P.S. I deleted your other duplicate question, so that we can maintain the thread here.