Collection and processing of RNA-seq libraries
The Illumina-based RNA-seq data from NCBI short read archive (SRA) were partitioned into bins according to their tissue or organ of origin. In total, we identified 113 RNA-seq datasets (SRA accessions) using untreated or mock treated wild-type Col-0 plants for further analysis.
All RNA-seq reads were processed through a quality check and trimming pipeline using FastQC and Trimmomatic respectively to remove residual adapters, low-quality sequences (the default criteria for Trimmomatic-0.32), and reads below 36 bp. Sequences from each library were aligned onto TAIR10 genome using TopHat (v2.0.12) (Trapnell, C. et al., 2009). The maximum and minimum intron size is set to 2,000 and 50 bases respectively to accommodate the compact nature of the Arabidopsis genome. Burrows-Wheeler Aligner (0.7.12-r1039) was used to estimate the mean (-r) and standard deviation (--mate-std-dev) of the inner distance between pair-ended reads. The libraries with a low proportion of mapped reads (less than 70% of input) were removed. We used featureCounts (v1.4.6) and edgeR (v3.8.5) to explore the correlation of gene expression levels between samples, and only kept the ones clustering with others with the same tissue/organ origin for transcript assembly.
Generation of tissue-specific transcriptomes
To maximize the utilization of the reads, we adopted a hybrid approach, processing the reads using a combination of de novo and genome-guided Trinity (release 20140413). All reads within the same tissue/organ group were assembled using Trinity RNA-seq de novo assembly protocol (Haas et al. 2013). For genome-guided assembly, only uniquely mapped reads are used by Trinity to guide the de novo assembly of overlapping reads at each locus. Both procedures enforce two cutoff values: a minimum contig length of 183 bp as 95% of Arabidopsis transcripts are above this length, and a maximum intron length of 2000 bp which is the 99th percentile for intron sizes in annotated Arabidopsis gene models.
In order to instantiate tissue-specific transcript isoforms and avoid possible chimeras between tissues, each tissue bin is assembled independently using the following protocol. The separate de novo and genome-guided results are collapsed, mapped back to the reference genome and merged into compatible transcript clusters by PASA (Haas et al. 2003) using its alignment assembly module. These transcript clusters are sub-assembled into compatible transcript structures, which adhere to the following criteria: maximum intron length of 2000 bp, ≥ 90% of transcript aligned at ≥ 95% identity, both sides of the splice boundary are supported by at least 8 bp and the splice sites are canonical.
Fig 1: An overview of the Araport11 pipeline. The RNA-seq reads from SRA were grouped into 11 tissue/organ types and assembled by Trinity using de novo and genome-guided mode in parallel. For the genome-guided assembly, we used TopHat to align the reads against TAIR10 genome and exposed the mapping coverage and predicted splice junctions on JBrowse. A hybrid assembly consisting of de novo and genome-guided assemblies was used to reconstruct the transcriptome for each tissue. To update current annotation, TAIR10 genome was supplemented with novel (non-overlapping with TAIR10 loci) transcripts from NCBI and MAKER-P assemblies and this augmented TAIR10 was used as the reference for transcript mapping. PASA annotation update was run for each group separately (11 times in total) to avoid constituting hybrid transcripts from different tissue/organ. The 11 transcriptomes were consolidated using a custom Python script to collapse the isoforms that only differed only in their UTRs. Over 700 Uniprot protein records that are inconsistent with TAIR10 gene models were evaluated, filtered, and appended to the updated gene models. Additional novel transcripts were extracted from PASA assemblies and the literature and are used to identify novel loci. The Araport11 protein-coding genes, including updated structural and functional annotation were assigned new locus identifiers, will be committed to GenBank after release for community review. Key: Source data: orange; evidence JBrowse tracks: purple; final annotation: green.
Integration of gene models from independent sources
In addition to using RNA-seq information, we integrated gene models from UniProt, the NCBI RefSeq group (Pruitt et al. 2012), the MAKER-P pipeline (Campbell et al. 2014), and published work (Wang et al. 2014, Vidal et al. 2013). Protein-coding genes proposed by RefSeq and MAKER-P pipelines not found with the TAIR10 annotation were added to the TAIR10 gene set and this augmented set, TAIR10 plus, was used as the reference in structural annotation update (see below).
As part of the UniProt Plant Protein Annotation Program (Schneider et al. 2005), the consortium provided a set of 747 curated protein records that are in conflict with the TAIR10 structural annotation (Communication with C. O’ Donovan). The inconsistency could be classified into modified gene structure involving alternative initiations/exons/splice junctions, frameshifts and/or sequence problems. There were 221 cases that would require modifications to the underlying genome and were not incorporated for this release. We manually evaluated the remaining cases based on RNA-seq, EST, paired-end analysis of transcription start sites (PEAT, Morton et al. 2014), and direct RNA sequencing (DRS, Duc et al. 2013) data sets. Overall, we altered 233 existing TAIR10 gene structures and added 112 novel alternative variants based on the UniProt data.
Structural annotation refinement
The assembled transcripts from each tissue bin were independently compared against the TAIR10 plus annotation to incorporate alignment evidence, correct exon boundaries, modify untranslated regions, and identify splicing variations supported by transcript alignments. A union of the 11 independently generated PASA annotation update result sets was created using a custom Python script to collapse the isoforms sharing the same transcript structure within a given locus while allowing for variation in the terminal UTR length. The longest transcript was retained as the representative for each locus.
The PASA updates also suggested 13 cases for merging of existing TAIR10 loci in the Araport11. For these merged loci, the new identifier directly inherits its identifier from the locus which contributes to the majority of the gene length, following the locus nomenclature guidelines.
Functional annotation of newly instantiated gene loci and updates to TAIR10 annotation were based upon the JCVI pipeline that uses a weighted keyword (WK) approach (Hoover et al. in preparation). Briefly, each predicted protein was searched against a collection of databases (Priam, Uniref100, PFAM/TIGRFAM, CAZY, CDD) and motif finders (TMHMM, InterPro). Meaningful keywords were extracted from the definition lines of sets of best matches from each database. A set of heuristic rules were then used to score each candidate definition line and the highest scoring line was assigned to the query protein. Newly instantiated genes were given functional assignments from the JCVI pipeline. For those genes already having a TAIR10 functional assignment a set of rules was devised that retained the TAIR10 annotation wherever possible but substituted the JCVI annotation where it appeared to be more accurate or informative. The TAIR annotation was retained in 21,690 loci. The JCVI pipeline updated 6,875 loci including 1,162 new loci, 4,418 loci for which TAIR had simply used its computational description for the functional assignment and 1,255 proteins with a “domain of unknown function” (DUF) notation. For this last class, we merged the JCVI annotation with the TAIR-sourced DUF identifiers.
Novel transcribed region
The novel transcribed regions in Araport11 originate from literature (Wang, H., et al., 2014, Vidal, E. A., et al., 2013) and the RNA-seq data mapped to the intergenic region of the augmented TAIR10 (TAIR10 plus) annotation. These novel transcripts were tagged as full length transcripts, allowing PASA to consider them as evidence for novel gene instantiation, and running through the PASA alignment assembly and annotation comparison steps. PASA assembled 30,527 transcripts after these procedures. To determine whether any of the novel transcribed regions could encode peptides similar to known genes, we first performed blastx using the 30,527 transcript sequences against the UniRef90 protein sequences of ten plant species (Oryza sativa, Glycine max , Medicago truncatula, Populus trichocarpa, Chlamydomonas reinhardtii, Brassica rapa, Mimulus guttatus, Solanum lycopersicum, Zea mays, Brachypodium distachyon). 6,033 transcripts have hits against the reference (E-value below 0.0001). Next, we ran blastp using the peptide translated from PASA-proposed coding sequences for these 30,527 transcripts against the same reference. Among the 1,530 queries with hits against the reference (E-value below 0.0001), 1,515 overlap with the 6,033 transcripts that have hits in blastx. We took the union of these results, removed a small set that mapped to TAIR10 non-coding RNA loci and Araport11 intragenic regions. Finally, we clustered the remaining transcripts that overlap on the same strand into a single novel transcribed region. Overall, there are 481 novel transcribed regions in Araport11 Pre-release 2 (10/2015) and can be visualized on JBrowse.
Assignment of locus and isoform identifiers
Following the integration of new gene models and updates as described above, new loci were assigned AGI identifiers that followed the established convention (At#g#####) reflecting their genomic location and using a “blacklist” to avoid reusing any previously used and retired loci). On 29 occasions, deviations from a strictly monotonic order were necessary due to the high gene density and identifier usage in that region. Isoform identifiers within each locus were re-assigned by a custom Python script taking into account the relationship between a PASA updated, UniProt supplemented or unmodified isoform and the reference (TAIR10) transcript set. The original isoform identifiers were retained when applicable while novel variant transcript structures were assigned new suffixes.
In summary, this near-final release of the Araport11 protein coding gene set contains 27,667 loci with 48,389 transcripts. Of TAIR10 protein-coding gene models, 71.7% (25,376/35,385) have been updated among which 4.6% (1,191) and 95.4% (24,185) have altered CDS and UTR sequences respectively. The number of genes in Araport11 with splice variants (10,698; 38.7%) of the genes is much higher than reported in TAIR10 (20.8%). The functional annotation of over 5000 of the TAIR10 loci has been updated. In addition, a total of 719 new loci and 13,556 new splice isoforms were added in this current release.
|Araport11 Pre-release 3
|Number of protein coding loci||27,416||27,667|
|Number of transcripts including isoforms||35,385||48,389|
|Number of TAIR10 transcripts with altered CDS||1,191 (4.6%)|
|Number of TAIR10 transcripts with altered UTRs||24,185 (95.4%)|
|Number of loci with splice isoform||5,665 (18%)||10,698 (38.7%)|
|Number of novel loci||719|
Baerenfaller, K., Grossmann, J., Grobei, M. A., Hull, R., Hirsch-Hoffmann, M., Yalovsky, S., ... & Baginsky, S. (2008). Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science,320(5878), 938-941.
Lister, R., O'Malley, R. C., Tonti-Filippini, J., Gregory, B. D., Berry, C. C., Millar, A. H., & Ecker, J. R. (2008). Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell, 133(3), 523-536.
Filichkin, S. A., Priest, H. D., Givan, S. A., Shen, R., Bryant, D. W., Fox, S. E., ... & Mockler, T. C. (2010). Genome-wide mapping of alternative splicing in Arabidopsis thaliana. Genome research, 20(1), 45-58.
Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., ... & Regev, A. (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature protocols, 8(8), 1494-1512.
Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9), 1105-1111.
Haas, B. J., Delcher, A. L., Mount, S. M., Wortman, J. R., Smith Jr, R. K., Hannick, L. I., ... & White, O. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic acids research, 31(19), 5654-5666.
Pruitt, K. D., Tatusova, T., Brown, G. R., & Maglott, D. R. (2012). NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic acids research, 40(D1), D130-D135.
Campbell, M. S., Law, M., Holt, C., Stein, J. C., Moghe, G. D., Hufnagel, D. E., ... & Yandell, M. (2014). MAKER-P: A tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant physiology, 164(2), 513-524.
Wang H., You C., Chang F., Wang Y., Wang L. Qi J., Ma H. (2014). Alternative splicing during Arabidopsis flower development results in constitutive and stage-regulated isoforms. Frontiers in Genetics, 5:25. doi: 10.3389/fgene.2014.00025
Vidal E. A., Moyano T. C., Krouk G., Katari M. S., Tanurdzic M., McCombie W. R., Coruzzi G. M., Gutierrez R. A. (2013). Integrated RNA-seq and sRNA-seq analysis identifies novel nitrate-responsive genes in Arabidopsis thaliana roots. BMC Genomics, 14:701. doi:10.1186/1471-2164-14-701
Schneider, M., Bairoch, A., Wu, C. H., & Apweiler, R. (2005). Plant protein annotation in the UniProt Knowledgebase. Plant physiology, 138(1), 59-66.
Morton, T., Petricka, J., Corcoran, D. L., Li, S., Winter, C. M., Carda, A., ... & Megraw, M. (2014). Paired-end analysis of transcription start sites in Arabidopsis reveals plant-specific promoter signatures. The Plant Cell, 26(7), 2746-2760.
Duc, C., Sherstnev, A., Cole, C., Barton, G. J., & Simpson, G. G. (2013). Transcription termination and chimeric RNA formation controlled by Arabidopsis thaliana FPA. PLoS genetics, 9(10), e1003867.