About SUBA5

SUBA5 provides a powerful tool to investigate subcellular localisation in Arabidopsis through the unification of disparate datasets and through the provision of web services. Users can construct queries to filter SUBA data or interrogate their protein sets for protein localisation and protein location relationships in the Arabidopsis model plant. SUBA5 houses large-scale proteomic data, fluorescent protein (FP) tagging localisation data and protein-protein interaction (PPI) data, as well as PPI localisation data, from subcellular compartments of Arabidopsis.

SUBA5 also contains pre-compiled bioinformatic predictions for protein subcellular localisations and a consensus classifier that takes predictive and experimental information into account. The SUBA5 search interface and SUBA5 toolbox provide flexible options for refining or interrogating protein data sets by location, expected abundance, interactions, coexpression, protein properties and bibliographic information. The database has a web-accessible interface that allows advanced combinatorial queries on the data as well as downloads for downstream applications.

Experimental Data in SUBA5

SUBA5 experimental data

The current version of SUBA5 is built on the TAIR10 Arabidopsis proteome. Methodology for data collection and curation is described in the database issue of Nucleic Acids Res. An overview of the data in SUBA5 as of August 2022 is shown below.

Experimental data: localisations
          11 compartments    incl. suborganellar categories
MS/MS     81680              121429
GFP       5835               6687
PPI       828                889
TOTAL     88343              129005

PPI-pairs
all       distinct
85576     79319

Data Coverage
          no. of proteins      no. of studies
MS/MS     14399                219
GFP       3576                 1849
PPI       8432                 1055
external  -                    342
TOTAL     18851 (of 35388)     3230
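
The localisation totals and the overall protein coverage in the overview above can be checked with a few lines of Python. This is only an arithmetic sketch: the figures are copied from the overview, while the variable names are illustrative and not part of SUBA5.

```python
# Figures copied from the SUBA5 overview (August 2022).
# Each tuple: (localisations in 11 compartments,
#              localisations incl. suborganellar categories)
localisations = {
    "MS/MS": (81680, 121429),
    "GFP": (5835, 6687),
    "PPI": (828, 889),
}

# Column totals, to compare with the TOTAL row (88343 and 129005).
total_11 = sum(counts[0] for counts in localisations.values())
total_sub = sum(counts[1] for counts in localisations.values())

# Share of the proteome with at least one experimental record.
proteins_with_data = 18851
proteome_size = 35388
coverage = proteins_with_data / proteome_size

print(total_11, total_sub)
print(f"coverage: {coverage:.1%}")
```

The coverage figure counts MS/MS, GFP and PPI records together, so it is higher than a figure based on FP and MS data alone.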


Computational Resources in SUBA5

SUBA5 predictors

SUBA5 contains 22 predictors which use distinct training data sets, input variables and prediction methods. These have been reviewed and compared for their contribution to the SUBAcon call. Predictors vary in their accuracy for each subcellular compartment. Using this table you can find the most useful single predictor for your chosen compartment. The classifier SUBAcon achieves the highest accuracy across all 10 subcellular categories.
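
Choosing the most useful single predictor for a compartment amounts to a lookup over a per-compartment accuracy table. The sketch below illustrates the idea only: the predictor names and accuracy values are made-up placeholders, not SUBA5 figures, and `best_predictor` is a hypothetical helper, not a SUBA5 function.

```python
# Hypothetical accuracy table: placeholder names and values,
# NOT taken from SUBA5 or from any published comparison.
accuracy = {
    "predictor_A": {"plastid": 0.80, "nucleus": 0.60},
    "predictor_B": {"plastid": 0.70, "nucleus": 0.75},
}


def best_predictor(table, compartment):
    """Return the predictor with the highest accuracy for one compartment."""
    # Keep only predictors that report a score for this compartment.
    candidates = {
        name: scores[compartment]
        for name, scores in table.items()
        if compartment in scores
    }
    if not candidates:
        raise KeyError(f"no predictor covers {compartment!r}")
    return max(candidates, key=candidates.get)


print(best_predictor(accuracy, "plastid"))
print(best_predictor(accuracy, "nucleus"))
```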


SUBA5 tool box

The SUBA5 tool box is an interactive analysis centre that contains the Multiple Marker Abundance Profiling (MMAP) tool, the PPI Adjacency Tool (PAT) and the Coexpression Adjacency Tool (CAT). Linking the SUBAcon data to protein abundance or to protein-protein relationships enables users to interpret their data spatially. By entering a list of AGIs, users can estimate the relative abundance and purity of protein samples, refine PPI data sets and generate spatial co-expression lists.


SUBA5 consensus (SUBAcon) locations

Abundant experimental data from fluorescent protein (FP) tagging or mass spectrometry (MS) are available for Arabidopsis, yet they only cover ~40% of the proteome. For the remaining 60% of proteins, many computational tools have been developed to predict proteome-wide subcellular location. None of these approaches is error-free, so results are often contradictory. To help unify the multiple data types contained in SUBA5, we have developed the SUBcellular Arabidopsis consensus (SUBAcon) algorithm, a naive Bayes classifier that integrates 22 computational prediction algorithms, experimental FP and MS localisations, protein-protein interaction and co-expression data to derive a consensus call and probability. SUBAcon classifies protein location in Arabidopsis more accurately than single predictors. SUBAcon is a useful tool for recovering proteome-wide subcellular locations of Arabidopsis proteins.
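
To illustrate how a naive Bayes classifier can turn conflicting evidence into a consensus call and probability, here is a minimal generic sketch. It is not the SUBAcon implementation: the two evidence sources, the two locations and every probability below are hypothetical placeholders, and `naive_bayes_consensus` is an illustrative name.

```python
import math


def naive_bayes_consensus(priors, likelihoods, observations):
    """Combine independent evidence into a posterior over locations.

    priors:       {location: P(location)}
    likelihoods:  {source: {true_location: {called_location: P(call | true)}}}
    observations: {source: called_location}
    Assumes every probability used is non-zero (log of 0 is undefined).
    """
    log_post = {}
    for location, prior in priors.items():
        total = math.log(prior)
        # Naive Bayes assumption: sources are independent given the location.
        for source, call in observations.items():
            total += math.log(likelihoods[source][location][call])
        log_post[location] = total
    # Normalise relative to the largest value for numerical stability.
    peak = max(log_post.values())
    weights = {loc: math.exp(value - peak) for loc, value in log_post.items()}
    norm = sum(weights.values())
    return {loc: weight / norm for loc, weight in weights.items()}


# Placeholder inputs: two locations, one predictor and one MS experiment
# that disagree with each other. Values are invented for illustration.
priors = {"plastid": 0.5, "mitochondrion": 0.5}
likelihoods = {
    "predictor_A": {
        "plastid": {"plastid": 0.8, "mitochondrion": 0.2},
        "mitochondrion": {"plastid": 0.3, "mitochondrion": 0.7},
    },
    "ms_experiment": {
        "plastid": {"plastid": 0.9, "mitochondrion": 0.1},
        "mitochondrion": {"plastid": 0.1, "mitochondrion": 0.9},
    },
}
observations = {"predictor_A": "mitochondrion", "ms_experiment": "plastid"}

posterior = naive_bayes_consensus(priors, likelihoods, observations)
call = max(posterior, key=posterior.get)
print(call, round(posterior[call], 2))
```

In this toy case the more reliable source outweighs the weaker one, so the consensus follows the MS experiment while still reporting a probability below 1.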


Arabidopsis SUbcellular REference (ASURE)

ASURE is the reference data set used for training SUBAcon, and its construction is described in our SUBAcon report. ASURE contains 5,426 proteins, of which 3,497 (64%) have been independently experimentally localised. Because experimental (FP, MS) data were introduced in the SUBAcon classification algorithm, the assembly of ASURE subcellular proteomes used additional inclusion criteria for curated ASURE proteins, such as protein function and evidence from orthologues in other species obtained from cropPAL. ASURE showed a discrepancy of less than 1% compared to the high-confidence Arabidopsis plastid proteome and to the peer-reviewed reference sets used for training the predictors MultiLoc2, EpiLoc and YLoc. ASURE is a searchable high-confidence subproteome reference standard that can be accessed through the ASURE portal.