RFC001: Arabidopsis Data API controlled vocabulary draft specification


The Araport team has developed a lightweight controlled vocabulary for use in newly created web service APIs. The objective is to minimize confusion among services when users make related queries. We would like the community to review the initial draft before we begin using the specification to develop production web services.

The current draft can be found at the Araport GitHub.

Comments are open until May 30, 2015.


Make this extensible to other genomes? The draft controlled vocabulary is specific to Arabidopsis thaliana. The limitation is built into the "locus" specification, which only takes values that start with AT. Do we want a vocabulary that could be extended to other genomes? The vocabulary could be made extensible by requiring a parameter that specifies the reference sequence, e.g. ref=ATC0. Other parts of the spec could then incorporate species-specific logic, e.g. locus must start with AT if ref==ATC0.
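A minimal sketch of how such reference-specific logic might look, assuming a lookup table keyed by the ref value. The names LOCUS_RULES and validate_locus are hypothetical and not part of the draft; only the ATC0 example and the AT prefix rule come from the comment above.

```python
# Hypothetical table: each reference code maps to the prefix its loci must carry.
# Only the "ATC0" -> "AT" rule comes from the discussion; an entry for any
# other genome would be an assumption.
LOCUS_RULES = {
    "ATC0": "AT",
}


def validate_locus(ref: str, locus: str) -> bool:
    """Return True if `locus` satisfies the prefix rule registered for `ref`."""
    if ref not in LOCUS_RULES:
        raise ValueError(f"unknown reference sequence: {ref!r}")
    return locus.startswith(LOCUS_RULES[ref])


if __name__ == "__main__":
    print(validate_locus("ATC0", "AT1G01010"))
    print(validate_locus("ATC0", "XX1G01010"))
```

Supporting another genome would then mean adding one entry to the table rather than changing the "locus" definition itself.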

Make locus ID extensible? The draft controlled vocabulary limits locus to strings like AT1G01010 and the letter G is required. To be forward looking, could this be made flexible to handle loci of non-coding genes, intergenic regions, telomeres, and centromeres?

Clarify positional coordinates? The draft controlled vocabulary limits start and stop to integer values. It could clarify the usual issues with coordinate systems: counting starts at 1, not 0, so start>0 and stop>0 always; start<=stop always, even on the reverse strand; the way to specify a single base is with start==stop; the way to specify a position between bases is to give the 2-base range that includes the upstream and downstream bases.
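The coordinate rules listed above can be sketched as a small validator. This is only an illustration under the stated rules; the function names validate_range, single_base and between_bases are hypothetical and do not appear in the draft.

```python
from typing import Tuple


def validate_range(start: int, stop: int) -> int:
    """Check a 1-based, inclusive range and return its length in bases."""
    if start < 1 or stop < 1:
        raise ValueError("counting starts at 1, so start and stop must be > 0")
    if start > stop:
        raise ValueError("start must be <= stop, even on the reverse strand")
    return stop - start + 1


def single_base(position: int) -> Tuple[int, int]:
    """A single base is expressed with start == stop."""
    validate_range(position, position)
    return (position, position)


def between_bases(upstream: int) -> Tuple[int, int]:
    """A position between two bases is the 2-base range spanning both neighbors."""
    validate_range(upstream, upstream + 1)
    return (upstream, upstream + 1)


if __name__ == "__main__":
    print(single_base(100))
    print(between_bases(100))
```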

URL per term? Araport could back every controlled vocabulary term with a URL. The page at the URL would explain the term. The URL itself would provide a unique identifier for the concept. An identifier like https://www.araport.org/cv/locus_id would be a valid data type that could be specified in JSON files and data format specifications. (An alternative is a prefix like the one in the GO term GO:0007623, but a prefix alone is not globally unique.)

Include the AGI list? The controlled vocabulary could list every AGI identifier. AGI identifiers are terms like AT1G01010 that each define one locus of the Arabidopsis thaliana Col-0 TAIR10 genome sequence. Although these identifiers follow a pattern that can be defined by a regular expression, only certain combinations of digits are valid. Thus, it would help parsers to have the canonical list available. (The draft already includes accessions for every line of Arabidopsis thaliana, so including every locus is consistent and complete.)
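To illustrate why a pattern alone is not enough, here is a rough sketch that separates "well formed" from "actually listed". The regular expression is an assumed AGI-style pattern and may differ from the one in the draft. CANONICAL_LOCI is a one-item stand-in for the real list, and check_locus is a hypothetical helper name.

```python
import re

# Assumed AGI-style pattern: AT, a chromosome code (1-5, C or M), G, five digits.
# The expression used in the draft may differ.
AGI_PATTERN = re.compile(r"AT[1-5CM]G\d{5}")

# Stand-in for the canonical list; a real parser would load the full set.
CANONICAL_LOCI = frozenset({"AT1G01010"})


def check_locus(locus_id: str) -> dict:
    """Report whether an ID matches the pattern and whether it is in the list."""
    return {
        "well_formed": AGI_PATTERN.fullmatch(locus_id) is not None,
        "listed": locus_id in CANONICAL_LOCI,
    }


if __name__ == "__main__":
    # The second ID matches the pattern but is absent from the stand-in list.
    for candidate in ("AT1G01010", "AT1G99999", "locus42"):
        print(candidate, check_locus(candidate))
```

With the full list available, a parser could reject an ID that is well formed but not listed, which the pattern alone cannot do.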

Organism? The controlled vocabulary should specify how to describe a species, for instance for the purpose of homology. Use the NCBI taxon ID.

Locus or locus_id? The RFC proposes "locus" but Araport already has examples that use "locus_id". I favor "locus_id" because it provides a good example for other data types, e.g. "pathway_id", "kegg_id", and "uniprot_id".

Add score? Consider adding a generic "score" label. This would correspond to the score column in a GFF file. Most computational predictions have an overall score. Even for predictions that provide several scores (e.g. BLAST), it would help displayers to be able to rank and color predictions by their overall score. The data type should allow floating point numbers. There could be optional additional fields to help displayers describe the score: "score_short_name" for a column heading and "score_description" for pop-up text.
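As a sketch of what this could look like to a displayer, the records below carry the proposed "score", "score_short_name" and "score_description" fields. The "name" field, the numeric values and the rank_by_score helper are made up for illustration and are not part of the draft.

```python
# Illustrative records only: the "name" values and the numbers are invented.
predictions = [
    {"name": "hit_a", "score": 12.5,
     "score_short_name": "Score",
     "score_description": "Overall score reported by the prediction tool"},
    {"name": "hit_b", "score": 87.0,
     "score_short_name": "Score",
     "score_description": "Overall score reported by the prediction tool"},
    {"name": "hit_c", "score": 40.25,
     "score_short_name": "Score",
     "score_description": "Overall score reported by the prediction tool"},
]


def rank_by_score(records):
    """Return records sorted from highest to lowest overall score."""
    return sorted(records, key=lambda r: float(r["score"]), reverse=True)


if __name__ == "__main__":
    ranked = rank_by_score(predictions)
    # The optional fields give the displayer a column heading and pop-up text.
    print(ranked[0]["score_short_name"], "-", ranked[0]["score_description"])
    for record in ranked:
        print(record["name"], record["score"])
```

Sorting here assumes that a higher score is better. A tool whose scores run the other way (an E-value, for example) would need the ordering stated somewhere, which may be worth a further optional field.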

Dear Araport team,


I have a question related to the new smallRNA annotation. Where are the .gff files for those features? I downloaded all the files from the FTP site but couldn't find them. Could you give me a hint? Thank you so much.