Why do some Araport11 transcripts have the same transcript identifier as TAIR10 but differ in the gene model structure?

Detailed Answer

The Araport11 annotation update pipeline (for detailed methods, please see: https://www.araport.org/data/araport11/methods) utilized RNA-seq based transcript assemblies (evidence) generated from 113 public SRA datasets (grouped into 11 tissue/organ sets) which were compared against the TAIR10 transcripts (reference).

The comparison suggested the following types of modifications:

  • Transcript evidence suggested a novel alternatively spliced variant -> Transcript was added to the locus and assigned a new isoform identifier based on the TAIR10 locus identifier
  • Transcript evidence agreed with one or more of the reference transcripts and suggested an extension of the UTR -> Transcripts were updated in place and retained the TAIR10 isoform identifier
  • Transcript evidence agreed with one or more of the reference transcripts and suggested no modifications -> Transcripts remained unchanged and retained the TAIR10 isoform identifier
  • Transcript evidence overwhelmingly disagreed with the reference transcript, suggesting modification of the coding region -> Transcripts were updated in place and retained the TAIR10 isoform identifier
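
The four cases listed above can be read as a small decision rule. The following is only an illustrative sketch, not the actual Araport11 pipeline code: the `Evidence` categories, the `assign_identifier` function, and the "next free suffix" numbering are assumptions made for this example.

```python
from enum import Enum


class Evidence(Enum):
    """Hypothetical labels for the four comparison outcomes listed above."""
    NOVEL_SPLICE_VARIANT = "novel alternatively spliced variant"
    UTR_EXTENSION = "agrees with reference, extends UTR"
    NO_CHANGE = "agrees with reference, no modification"
    CDS_CHANGE = "disagrees with reference, modifies coding region"


def assign_identifier(locus, existing_isoforms, evidence, matched_isoform=None):
    """Return (transcript_id, model_changed) for one evidence transcript.

    locus             -- locus identifier, e.g. "AT1G07350"
    existing_isoforms -- isoform numbers already present at the locus
    matched_isoform   -- reference isoform number the evidence was matched to
                         (not needed for a novel splice variant)
    """
    if evidence is Evidence.NOVEL_SPLICE_VARIANT:
        # Assumption: a new isoform takes the next unused numeric suffix.
        next_suffix = max(existing_isoforms, default=0) + 1
        return f"{locus}.{next_suffix}", True
    if matched_isoform is None:
        raise ValueError("matched_isoform is required unless the variant is novel")
    # All other cases keep the TAIR10 identifier; only NO_CHANGE leaves
    # the gene model structure untouched.
    return f"{locus}.{matched_isoform}", evidence is not Evidence.NO_CHANGE
```

Under these assumptions, a UTR extension or a coding-region change returns the same identifier as TAIR10 with `model_changed` set to `True`, which is exactly the situation the question asks about.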

An example gene locus which demonstrates the updates made via the Araport11 pipeline is AT1G07350 (SR45a). In TAIR10, this locus had 2 splice variants, .1 and .2. Following the Araport11 annotation update, the RNA-seq based evidence added 4 novel transcript variants, .3, .4, .5 and .6, while also updating the existing isoforms by extending the UTRs on the 5' and 3' ends.

Refer to the JBrowse view for AT1G07350: http://www.araport.org/locus/AT1G07350/browse?tracks=TAIR10_genes,Arapor...

Apart from the automated RNA-seq based annotation update, several genes were manually curated based on other evidence (community datasets such as Direct RNA sequencing by Duc et al. 2013, Alternative Splicing variants by Marquez et al. 2012) by an in-house curator, in order to correct any inaccuracies in the existing TAIR10 annotation or those introduced by the automated update process.

An example of such an updated gene locus is AT1G01020 (ARV1). In TAIR10, this locus had 2 splice variants, .1 and .2. Following the Araport11 annotation update, the RNA-seq based evidence added 4 novel transcript variants, .3, .4, .5 and .6, while also updating the existing isoforms by extending the UTRs.

However, in this case, the 3' UTR extension of isoform AT1G01020.1 was manually trimmed off based on available evidence.

Refer to the JBrowse view for AT1G01020: http://www.araport.org/locus/AT1G01020/browse?tracks=TAIR10_genes,Arapor...

Short Answer

Gene models in the Araport11 release have been updated based on a large collection of RNA-seq datasets from NCBI (https://www.araport.org/data/araport11/methods). The general rule was to retain the transcript identifiers if the annotation structure update did not alter the translated protein (CDS). In other words, gene models with extended or trimmed UTRs will retain the same transcript identifiers. New transcript IDs were only instantiated if the evidence supported a novel splicing variant.

I don't see why you don't give the models new IDs...

There are three really big problems with doing this the way you are at the moment.

1) Naively, people will compare the expression of a transcript from an experiment using the TAIR10 annotation with one using the Araport annotation and conclude that the expression has changed, without realising that the gene model has changed, because the identifier is the same.

2) Once they are aware, in order to make a sensible comparison people need to create a mapping between Araport models and the corresponding identical TAIR10 models (unless you provide one).

3) Non-coding genes don't have a CDS so, by definition, any change to the gene model does not impact the coding sequence and will thus keep the same identifier, even for wholesale changes.
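
For anyone who needs to build that mapping themselves, one rough way to flag identifiers whose structure differs between two annotations is to compare exon coordinates per transcript. This is a minimal sketch, assuming both annotations are available as GFF3 text with `exon` features that carry a `Parent` attribute naming the transcript; the function names are made up for this example.

```python
from collections import defaultdict


def exon_structures(gff3_text):
    """Map transcript ID -> sorted tuple of (seqid, start, end, strand) exons."""
    exons = defaultdict(list)
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # An exon shared by several transcripts lists comma-separated parents.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return {tid: tuple(sorted(parts)) for tid, parts in exons.items()}


def changed_shared_ids(old_gff3, new_gff3):
    """Return transcript IDs present in both annotations whose exons differ."""
    old = exon_structures(old_gff3)
    new = exon_structures(new_gff3)
    return sorted(tid for tid in old.keys() & new.keys() if old[tid] != new[tid])
```

Calling `changed_shared_ids(tair10_text, araport11_text)` on the contents of the two files would list the identifiers that are shared but no longer describe the same exon structure. It only looks at exons, so a change confined to CDS feature boundaries would need a similar pass over `CDS` features.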

Point 1 is particularly important because in many cases the bulk of the expression measured by RNA-seq can reside in the UTRs rather than the CDS. I really don't see the advantage of keeping the existing IDs over giving new IDs to any transcript model that changes and (if you are so inclined) noting that the CDS is the same as the TAIR10 model X.Y in the descriptive fields at the end of the GFF/GTF file formats. This is what those fields are there for, after all! The advantage of this is that if you have the same ID in TAIR10 and Araport for a transcript, you then know that the transcript model is the same.
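
To make the suggestion concrete, here is a rough sketch of what such a record could look like in GFF3, using the standard `Note` attribute in the ninth column. The helper `mrna_line`, the wording of the note, and the placeholder identifiers are all assumptions for illustration and do not reflect anything Araport actually emits.

```python
def mrna_line(seqid, source, start, end, strand, transcript_id, gene_id,
              same_cds_as=None):
    """Build one tab-separated GFF3 mRNA line.

    same_cds_as -- optional TAIR10 transcript ID whose CDS is unchanged;
                   recorded as free text in the Note attribute.
    """
    attrs = [f"ID={transcript_id}", f"Parent={gene_id}"]
    if same_cds_as:
        attrs.append(f"Note=CDS identical to TAIR10 model {same_cds_as}")
    return "\t".join([
        seqid, source, "mRNA", str(start), str(end),
        ".",      # score: not applicable
        strand,
        ".",      # phase: only meaningful for CDS features
        ";".join(attrs),
    ])


# Placeholder locus, coordinates and IDs, invented purely for the example.
print(mrna_line("Chr1", "Araport11", 50, 400, "+", "GENE1.3", "GENE1",
                same_cds_as="GENE1.1"))
```

With a scheme like this, an identifier shared between the two annotations would always mean an identical model, and the link back to the older model would still be recoverable from the attributes.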

I think Araport are causing problems for everyone with their current approach.