I've been investigating the gene models for CYCd7;1, and Araport contains a new model that is absent from TAIR10 (not unusual in itself).
However, in this case the missing exon cuts the protein's functional domain in half, which looks incorrect. The RNA-seq data appear to show that the leaf has too few reads to call the exon, but is such limited data the only reason it was left out? Can you confirm that intron-spanning reads were found showing that the CDS skips that exon? Is there a link to the source data for individual genes, or were all the datasets used generically?
Adding to this concern is the short version of the Carpel gene, which appears to be over-truncated at a few small regions of low coverage, although the model is clearly still consistent.
I find this worrisome as a major advantage of Araport is the newer gene models, but without further evidence I'm not certain how they are derived.
Screenshot of the gene region: http://imgur.com/p5oKj3s
New gene models added as splice variants to existing gene loci are based on tissue-specific RNA-seq evidence
All new splice variant models annotated in Araport11 have passed the criteria adopted in our TopHat + Trinity + PASA (annotation) pipeline. Firstly, we only used uniquely mapped reads with two or fewer mismatches. Secondly, during the transcript assembly steps, we used a hybrid approach combining de novo and genome-guided Trinity assemblies to generate a contig set, with a minimum length cutoff of 183 bp (which is the size of 95% of the exons in TAIR10). Thirdly, the assembled transcripts were compared against the reference annotation using PASA to incorporate various structural changes (gene merges, gene splits, UTR extensions, CDS changes, and alternative splice variants). In this step, PASA required canonical splice sites and 8 bp of perfect matches on either side of a splice site to minimize incorrect transcript structures. Only the transcript models that fulfilled all three criteria were annotated as novel transcript variants.
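To make the three criteria concrete, here is a minimal Python sketch of them as simple predicates. It is an illustration only and not the actual TopHat, Trinity, or PASA code: the `Read` and `Junction` classes, their field names, and all function names are hypothetical, and the set of splice-site dinucleotide pairs treated as canonical is an assumption, since the exact set is not stated above.

```python
from dataclasses import dataclass

MAX_MISMATCHES = 2        # "two or fewer mismatches"
MIN_CONTIG_LENGTH = 183   # minimum assembled contig length, in bp
MIN_ANCHOR_MATCH = 8      # perfectly matching bases required on each side of a junction

# Assumption: which donor/acceptor pairs count as canonical is not given in the text.
CANONICAL_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


@dataclass
class Read:
    """Hypothetical summary of one aligned read."""
    num_hits: int     # number of genomic locations the read aligns to
    mismatches: int   # mismatches in the reported alignment


@dataclass
class Junction:
    """Hypothetical summary of one splice junction in an assembled transcript."""
    donor: str                 # first two bases of the intron
    acceptor: str              # last two bases of the intron
    left_anchor_matches: int   # perfect matches immediately upstream of the junction
    right_anchor_matches: int  # perfect matches immediately downstream of the junction


def read_passes(read: Read) -> bool:
    """Criterion 1: uniquely mapped with at most two mismatches."""
    return read.num_hits == 1 and read.mismatches <= MAX_MISMATCHES


def contig_passes(length_bp: int) -> bool:
    """Criterion 2: assembled contig meets the minimum length cutoff."""
    return length_bp >= MIN_CONTIG_LENGTH


def junction_passes(junction: Junction) -> bool:
    """Criterion 3: canonical splice site with 8 bp perfect matches on both sides."""
    pair = (junction.donor.upper(), junction.acceptor.upper())
    return (
        pair in CANONICAL_PAIRS
        and junction.left_anchor_matches >= MIN_ANCHOR_MATCH
        and junction.right_anchor_matches >= MIN_ANCHOR_MATCH
    )


def transcript_passes(contig_length_bp: int, junctions: list) -> bool:
    """A contig qualifies only if its length and every junction pass."""
    return contig_passes(contig_length_bp) and all(
        junction_passes(j) for j in junctions
    )
```

Note that nothing in this sketch looks at read depth or expression level: a junction either satisfies the checks or it does not.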
Regarding this new variant AT5G02110.2, the TopHat-aligned RNA-seq reads supported the junction connecting the first and second exons of this variant. This splicing pattern is present in pollen, leaf, light-grown seedling and receptacle samples (as seen in this JBrowse view: https://apps.araport.org/jbrowse/?data=arabidopsis&loc=Chr5%3A416566..41...). The green lines denote the intron-spanning reads, the span of each green bar corresponds to the intron, and the numbers underneath the bars are the raw read counts.
Only the reads in the leaf samples passed all three filters and were assembled into a transcript variant consistent with AT5G02110.2. However, the supporting reads were actually found in multiple tissues.
We recognize that the criteria in our pipeline are qualitative rather than quantitative. In other words, we did not set a minimum expression level as a requirement for annotating a splice variant. In the example you have noted, AT5G02110.2 is likely a minor isoform in most tissues, but it is supported, and thus annotated by Araport.
As a post-processing step, we have generated RNA-seq based expression values, at the gene and transcript levels, using the 113 RNA-seq datasets used by the Araport11 structural annotation pipeline. This information complements the data in the JBrowse tracks and will be available in the next release of ThaleMine, v1.9.0 (which will be released shortly at https://apps.araport.org/thalemine/). Users will then have this additional evidence to define the major and minor isoforms in their tissues of interest.
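As a rough illustration of how transcript-level expression values could be used to tell major from minor isoforms, here is a minimal Python sketch. The function names, the input layout (a dictionary of tissue to per-transcript values), and the numbers in the usage example are all hypothetical placeholders; they are not ThaleMine's data format and not real measurements.

```python
def isoform_fractions(tissue_values):
    """Return each transcript's share of the summed expression in one tissue.

    tissue_values: {transcript_id: expression_value} (hypothetical layout).
    """
    total = sum(tissue_values.values())
    if total == 0:
        # No expression detected: report zero share for every transcript.
        return {tid: 0.0 for tid in tissue_values}
    return {tid: value / total for tid, value in tissue_values.items()}


def major_isoform_by_tissue(expression):
    """Pick the most highly expressed transcript in each tissue.

    expression: {tissue: {transcript_id: expression_value}} (hypothetical layout).
    Tissues with no detectable expression map to None.
    """
    result = {}
    for tissue, values in expression.items():
        if not values or max(values.values()) == 0:
            result[tissue] = None
        else:
            result[tissue] = max(values, key=values.get)
    return result


# Usage with invented placeholder values (not real data):
example = {
    "leaf": {"GENE.1": 9.0, "GENE.2": 1.0},
    "pollen": {"GENE.1": 2.0, "GENE.2": 6.0},
}
majors = major_isoform_by_tissue(example)
leaf_shares = isoform_fractions(example["leaf"])
```

With values like these, a variant could be the minor isoform in one tissue and the major one in another, which is why per-tissue values are more informative than a single pooled figure.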
Please refer to a recently released version of our submitted manuscript (currently under review), describing the Araport11 reannotation process. It can be accessed here: http://dx.doi.org/10.1101/047308