I'm trying to figure out how the ~38% figure for loci with splice isoforms was calculated.
I downloaded the Araport11 pre-release-3 annotation GTF file, extracted all the unique transcript IDs (Gene.XX), and counted how many transcripts occur for each unique gene (Gene). I then filtered for genes with a count of more than 1. However, I get only 31.8% of genes with more than one transcript, i.e. alternatively spliced. Where am I going wrong?
The 38% of Araport11 loci with AS variants corresponds to the "protein-coding gene" fraction
The figure quoted for the Araport11 Pre-Release 3 dataset (dated 2015-12-02) pertains to the protein-coding gene fraction of the Arabidopsis genome annotation. Across the 5 nuclear + 2 organelle chromosomes, there are 27,667 protein-coding gene loci. Within this set, 16,969 loci (61.3%) contain a single transcript, while 10,698 loci (38.7%) have 2 or more transcript variants.
Unfortunately, it appears that the version of the GTF file you worked with contained an extended set of gene loci (protein-coding + non-coding). As a result, your calculation yielded a lower fraction (31.8%) of loci with 2 or more transcript variants.
We have corrected this issue by updating the GTF file, which now matches the GFF3 file. Please re-download the file from the Araport11 downloads area (https://www.araport.org/downloads/Araport11_PreRelease_20151202/annotation) and use it for your analysis.
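For anyone who wants to repeat the count on the updated file, here is a minimal Python sketch of the approach described in the question: collect the unique transcript IDs per gene and report the share of genes with more than one. It assumes the GTF attribute column carries `gene_id "..."` and `transcript_id "..."` pairs in the usual GTF style; the function name `multi_transcript_fraction` is made up for illustration and is not part of any Araport tool.

```python
import re
from collections import defaultdict

# Assumed attribute syntax (standard GTF style): key "value";
GENE_RE = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')


def multi_transcript_fraction(gtf_lines):
    """Return (multi, total, fraction) for genes with 2+ unique transcripts.

    gtf_lines is any iterable of GTF text lines, e.g. an open file handle.
    """
    transcripts_by_gene = defaultdict(set)
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GTF record
        gene = GENE_RE.search(fields[8])
        transcript = TRANSCRIPT_RE.search(fields[8])
        if gene and transcript:
            # A set keeps each transcript ID once, however many
            # exon/CDS rows mention it.
            transcripts_by_gene[gene.group(1)].add(transcript.group(1))

    total = len(transcripts_by_gene)
    multi = sum(1 for ids in transcripts_by_gene.values() if len(ids) > 1)
    fraction = multi / total if total else 0.0
    return multi, total, fraction
```

Passing an open handle to the downloaded GTF file to this function returns the three values directly. With the counts quoted above, 10,698 of 27,667 loci gives the 38.7% figure; a file that also lists non-coding loci enlarges the denominator and lowers the percentage.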
We apologize for the confusion and inconvenience caused.
Thank you very much!
Vivek @ Araport