
Tutorial collection

A step-by-step tutorial explaining how to use SUBA is available here:

SUBAcon


Supplementary Information

An explanation of the theory and data behind SUBAcon


Maximum Entropy Justification for Naive Bayes

Theory

Gold Standard

We define our event space as the cross product of all the predictors' outputs, viz.:

$$\begin{split}\mathcal{X} = \underset{L \text{ times}}{ \underbrace{A \otimes B \otimes C \otimes ... \otimes Z}}\end{split}$$

and a particular event is labeled as, for example, \(1_A, \dots, L_B\). A pictorial representation of what we mean by this is given at right.

The features are assumed to have finite domains (the i-th feature has \(k_i\) values), and are often called nominal. (A nominal feature can be transformed into a numeric one by imposing an order on its domain.)

We define \(\delta_{i_A}(\mathcal{X})\) as the function that picks out the i-th column (predictor) and the A-th category, e.g.

$$\begin{split} \delta_{i_A}(\mathcal{X}) = \begin{cases} 1& \text{ if } \mathcal{X} = \underset{\phantom{XXXXXXXX} i^{th} \text{position}}{C\otimes D\otimes ... A ... \otimes Z} \\ 0& \text{ otherwise } \end{cases}\end{split}$$

Note that this set of functions forms a basis of functions over the domain \(\mathcal{X}\), such that any function defined on this domain can be represented as a linear combination of our "basis" functions: \(f(\mathcal{X}) = \sum_{i,A} a_{i,A} \delta_{i_A}(\mathcal{X})\).

This is a little like dummy coding for regression analysis.
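As a small illustration (using pandas; the predictor names and values below are made up, not SUBAcon data), the \(\delta_{i_A}\) functions are exactly the columns of a one-hot ("dummy") encoding of each predictor's output:

import pandas as pd

# Hypothetical predictor outputs for three proteins.
predictions = pd.DataFrame({
    "predotar": ["mitochondrion", "plastid", "none"],
    "targetp":  ["secretory", "plastid", "none"],
})

# Each column delta_{i_A} is 1 when predictor i outputs category A, else 0.
deltas = pd.get_dummies(predictions)
print(deltas.columns.tolist())
# ['predotar_mitochondrion', 'predotar_none', 'predotar_plastid',
#  'targetp_none', 'targetp_plastid', 'targetp_secretory']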

NB: In what follows it does not matter if the \(\delta\) characteristic functions are multiplied by a constant factor \(g_{i_A}\), since these factors can be absorbed into a redefinition of the Lagrange multipliers \(h_{i_A}\) (see below). This means that even if we "quantize" a metric into categories or features with different "areas", the frequencies will simply reflect that difference in size.

For the example where each element of \(\mathcal{X}\) is a position on a protein strand, and at each position one could have the nominal values {A,B,C,D,E,F,G,...,Z}, then

$$\begin{split}\mathcal{X} = \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \dots \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right)\end{split}$$

In our case \(\mathcal{X} = v_1 \otimes v_2 \otimes \dots \otimes v_n\), where the domain of each \(v_i\) is the set of all subsets of a fixed set; each subset can be represented as a bitmap (0,1,0,1,0), etc. For a continuous underlying state the deltas become characteristic functions, but importantly they must define the meaning of a priori equally likely outcomes.

Although it seems that any type of function, not just a characteristic function, could be used, we need to be able to evaluate the partition function with it, and this is only simple when we have a characteristic function.

Predictors


Simple categorical predictors

Our assumption here is that we can estimate \(\langle \delta_{i_A} (\mathcal{X}) \rangle\) using our gold standard, and the Maximum Entropy probability distribution is then given by:

$$\begin{split} \frac{1}{Z} \exp\{ \sum_{i_A}^K h_{i_A}\delta_{i_A} (\mathcal{X}) \}\end{split}$$

where K is the number of features for predictor i (in reality K depends on i, but carrying that in the notation would become too cumbersome). The normalisation term (partition function) is:

$$\begin{split} Z =& \sum_\mathcal{X} \exp\left\{ \sum_{i_A}^K h_{i_A}\delta_{i_A} (\mathcal{X}) \right\} \\ =& \prod_i^L \left( \sum_A e^{h_{i_A}}\right)\end{split}$$

because we have \(\sum_{A}^{K} \delta_{i_A} (\mathcal{X}) = 1 \quad \forall i\) (i.e. our categorical classifiers give only one category per prediction).

It is easy to show that to satisfy \(\partial \ln Z / \partial h_{i_A} = f_{i_A}\) we need \(e^{h_{i_A}} = f_{i_A}\), and this also implies that \(Z = \prod_i 1 = 1\).

The probability is thus:

$$\begin{split} P( \mathcal{X} | c) \sim& \prod_{i_A} f_{c,i_A}^{\delta_{i_A} (\mathcal{X})} \\ \implies \log P( \mathcal{X} | c) \sim& \sum_i \sum_A \delta_{i_A}(\mathcal{X}) \log f_{c,i_A} = \sum_i \delta_i(\mathcal{X}) \cdot\log f_{c,i}\end{split}$$

The matrix \(f_{c,A}\) is what we calculate with \(Y^\top X\) (see below and in the code).
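As a sketch of how this log-score could be evaluated with numpy (the numbers below are made up; this is not the SUBAcon code itself):

import numpy as np

# Hypothetical frequency matrix f[c, A]: rows = gold-standard categories,
# columns = the predictor's features, each row normalised to sum to 1.
f = np.array([[0.7, 0.2, 0.1],    # say, mitochondrion
              [0.1, 0.8, 0.1],    # say, plastid
              [0.2, 0.2, 0.6]])   # say, secretory

# One-hot vector delta_i(X) for this predictor's output (feature 1 observed).
delta = np.array([0, 1, 0])

# log P(X | c) ~ sum_A delta_A * log f_{c,A}, one log-score per category c.
log_scores = np.log(f) @ delta
print(log_scores)                 # shape (number of categories,)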

Multi-location predictor

For a categorical predictor as above, which recognises K features, the \(\mathcal{X}\) vector can take only K values, viz.: (1,0,0,...,0), (0,1,0,...,0), ..., (0,0,0,...,1). If we now allow \(\mathcal{X}\) to be any vector of 0s and 1s, then the computation for Z is

$$\begin{split} Z = \sum_{\mathcal{X}} \exp \left\{\sum_{i_A} h_{i_A} \delta_{i_A}(\mathcal{X})\right\} = \sum_{\mathcal{X}_1=0}^1\dots\sum_{\mathcal{X}_{K}=0}^1 \prod_{i_A} e^{\left( h_{i_A} \delta_{i_A}(\mathcal{X}) \right)} = \prod_{i_A}^{K} \left[ e^{h_{i_A}} + 1\right]\end{split}$$

then the solution of \(\partial \ln Z/\partial h_{i_A} = \langle \delta_{i_A}(\mathcal{X})\rangle \equiv f_{i_A}\) is \(e^{h_{i_A}}/(e^{h_{i_A}}+1) = f_{i_A}\), or \(h_{i_A} = \ln (f_{i_A}/(1-f_{i_A}))\), and \(Z = \prod_{i_A} (1/(1-f_{i_A}))\). So the probability is

$$\begin{split} P(\mathcal{X}) = & \prod_i \prod_{i_A}^K f_{i_A}^{\delta_{i_A}(\mathcal{X})}\left(1-f_{i_A}\right)^{1- \delta_{i_A}(\mathcal{X})} \\ \equiv& \prod_i \prod_{i_A}^K \left[ f_{i_A} {\delta_{i_A}(\mathcal{X})} + \left(1-f_{i_A}\right)\left({1- \delta_{i_A}(\mathcal{X})}\right) \right]\end{split}$$

The latter representation is usually the way it appears in textbooks, or in, say, scikit-learn as Bernoulli naive Bayes.

This is normalized since:

$$\begin{split} \sum_\mathcal{X} P(\mathcal{X}) = \sum_{\mathcal{X}}\prod_i \prod_{i_A}^K f_{i_A}^{\delta_{i_A}(\mathcal{X})}\left(1-f_{i_A}\right)^{1-\delta_{i_A}(\mathcal{X})} = \prod_i \prod_{i_A}^K \sum_{\mathcal{X}_{i_A}=0}^1 f_{i_A}^{\delta_{i_A}(\mathcal{X})}\left(1-f_{i_A}\right)^{1-\delta_{i_A}(\mathcal{X})} =\prod_i \prod_{i_A}^K 1 = 1\end{split}$$

NOTE: One can also split this kind of predictor up into K mini-predictors, each with two states (yes or no) for its feature, so that the number of "features" each mini-predictor has is 2. This is used in the Laplace correction for estimating frequencies.
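A minimal numpy sketch of this multi-location (Bernoulli) case, with illustrative frequencies only:

import numpy as np

# Hypothetical f[c, A]: frequency with which feature A is "on" for category c.
# Unlike the categorical case the rows are not normalised, since each bit is
# treated independently.
f = np.array([[0.9, 0.1, 0.2],
              [0.1, 0.7, 0.3]])

x = np.array([1, 0, 1])           # observed bit vector delta_{i_A}(X)

# log P(X | c) = sum_A [ x_A log f_{c,A} + (1 - x_A) log(1 - f_{c,A}) ]
log_p = np.log(f) @ x + np.log(1 - f) @ (1 - x)
print(log_p)                      # one log-likelihood per category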

The Experimental Predictors

Here we assume that for a particular subcellular location there is a constant (category-dependent) probability that a paper will be published saying the protein is in location Y. We include in this the probability, in any particular time interval, of no publication.

We divide time into N slices (N is arbitrary, but large enough that at most one publication falls in any interval).

The Maximum Entropy probability for a given count of publications \(n_f\), if we know the subcellular location and have estimated the expected number of publications \(\langle n_{c,f} \rangle\), is given by [1]

$$\begin{split}P (n_f| c) \sim & \prod_f \left[ \frac{\langle n_{c,f}\rangle}{N} \right]^{n_{f}}\end{split}$$

We now assume the number of time intervals N is large so the overwhelming majority contain only "no publication" events.

$$\begin{split} P (n_f| c) =& N^{-N}\prod_{f} \langle n_{c,f} \rangle^{n_{f}} \left( N - \sum_{f}\langle n_{c,f} \rangle \right)^{N - \sum_{f} n_{f}} \\ =& N^{-N}\prod_{f} \langle n_{c,f} \rangle^{n_{f}} \, N^{N-M} \left( 1 - \sum_{f}\langle n_{c,f} \rangle/N \right)^{N - M} \\ \overset{N\rightarrow \infty}{\approx}& N^{-M} \prod_{f} \langle n_{c,f} \rangle^{n_{f}} e^{-\sum_{f}\langle n_{c,f} \rangle} \\ =& N^{-M} \prod_{f} f_{c,f}^{\,n_{f}} \, \langle M_c \rangle^{M} e^{-\langle M_c \rangle} \quad \text{where}\; M = \sum_{f} n_{f} \;\text{and}\; \langle M_c\rangle = \sum_{f} \langle n_{c,f} \rangle\end{split}$$

and \(f_{c,f} = \langle n_{c,f} \rangle/\langle M_c\rangle\). The effect of this is to give a multinomial distribution modulated by a Poisson distribution (in \(\langle M_c\rangle\)) that accounts for the varying number of publications. The dependency on N drops out with normalisation.
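A sketch of the resulting log-likelihood for a single category, with made-up expected counts (the \(N^{-M}\) factor is the same for every category and cancels on normalisation):

import numpy as np

expected = np.array([4.0, 0.5, 1.5])   # hypothetical <n_{c,f}> for one category c
n        = np.array([3, 0, 2])         # observed publication counts per location

M    = n.sum()                         # total papers observed for this protein
M_c  = expected.sum()                  # expected total papers, <M_c>
f_cf = expected / M_c                  # multinomial frequencies f_{c,f}

# log P(n | c) ~ multinomial term + Poisson term in <M_c>
log_p = n @ np.log(f_cf) + M * np.log(M_c) - M_c
print(log_p)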

SUBAcon Data

The data discussed here is available as a download. To use this data you need access to a MySQL database (for example on AWS):

gzcat subacon.sql.gz | mysql -u <user> -p -h <host> database

The basic SUBAcon type is the MySQL SET type, usually applied to a column called suba_location.

`suba_location` set('cell plate','cytoskeleton','cytosol','endoplasmic reticulum',
'extracellular','golgi','mitochondrion','nucleus','peroxisome',
'plasma membrane','plastid','unclear','vacuole','endosome')

The data is stored in MySQL as a bitmap, so that (1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12) = 0x1238 is the bitmap of the subset ('endoplasmic reticulum','extracellular','golgi','plasma membrane','vacuole'), which we name "secretory". We also fold cytoskeleton and cytosol together and name them simply cytosol. There are no entries in the database with 'cell plate' or 'endosome' set, and we filter out all entries with only 'unclear' set; these SET members exist for historical reasons.
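For illustration, the bitmap convention can be sketched in a few lines of Python (this mirrors the MySQL SET above; it is not part of the download):

# The first SET member corresponds to bit 0, the second to bit 1, and so on.
SET_MEMBERS = ['cell plate', 'cytoskeleton', 'cytosol', 'endoplasmic reticulum',
               'extracellular', 'golgi', 'mitochondrion', 'nucleus', 'peroxisome',
               'plasma membrane', 'plastid', 'unclear', 'vacuole', 'endosome']

SECRETORY = (1 << 3) | (1 << 4) | (1 << 5) | (1 << 9) | (1 << 12)   # == 0x1238
CYTOSOL   = (1 << 1) | (1 << 2)                                     # cytoskeleton + cytosol

def decode(mask):
    """Return the SET members present in a suba_location bitmap."""
    return [name for bit, name in enumerate(SET_MEMBERS) if mask & (1 << bit)]

print(hex(SECRETORY))   # 0x1238
print(decode(0x1238))   # ['endoplasmic reticulum', 'extracellular', 'golgi',
                        #  'plasma membrane', 'vacuole']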

Feature counts

Each predictor has a set of features that it "predicts". In this case we are working off the same data structure so that "features" and "categories" look like the same thing.

For example, Predotar has a feature covering (endoplasmic reticulum, extracellular, golgi). The feature counts are obtained with:

SELECT  IFNULL(p.suba_location,'no location') as suba_location, COUNT(*) as num
        from tair10 left outer join pred_predotar as p using(locus)
        GROUP BY p.suba_location
        ORDER BY p.suba_location

The mapping between the features and the categories can be shown as a Sankey diagram, generated with:

SELECT concat('predotar-',feature) as `From`,category as `To` ,
    sum(if(gs.suba_location & categories.mask and
    (pred.suba_location & features.mask
    OR (pred.suba_location is NULL and features.mask is NULL)),1,0)
    ) as weight
    FROM gold_std as gs
      LEFT OUTER JOIN pred_predotar as pred
    USING(locus)
    JOIN
    (
        (select _latin1'peroxisome' as category , 1<<8 as mask)
        union all (select 'cytosol',    1<<1 | 1<<2)
        union all (select 'secretory',   0x1238 << 0)
        union all (select 'plastid',         1<<10)
        union all (select 'nucleus',         1<< 7)
        union all (select 'mitochondrion',   1<< 6)
    ) AS  categories
    JOIN
    (
        (select _latin1'peroxisome' as feature , 1<<8 as mask)
        union all (select 'cytosol',    1<<1 | 1<<2)
        union all (select 'secretory',   0x1238 << 0)
        union all (select 'plastid',         1<<10)
        union all (select 'nucleus',         1<< 7)
        union all (select 'mitochondrion',   1<< 6)
        union all (select 'none',NULL)
    ) as features
   GROUP BY category,feature HAVING weight > 0;

A full list is given in :ref:`categorical-predictor-graphs`.

Categorical Predictors

From each predictor we generate indicator matrices, something like:

SELECT locus,
    case when p.suba_location & (1<< 6)  then 1 else 0 end as 'mito',
    case when p.suba_location & (1<< 10) then 1 else 0 end as 'plastid',
    case when p.suba_location & 0x1238   then 1 else 0 end as 'secretory',
    case when p.suba_location is NULL or
              p.suba_location & ~(1<<6 | 1<<10 | 0x1238) then 1 else 0 end as 'none',

    '|' as `|`,
    case when gs.suba_location & (1<<1 | 1<<2)   then 1 else 0 end as 'cytosol',
    case when gs.suba_location & (1<< 6)         then 1 else 0 end as 'mito',
    case when gs.suba_location & (1<< 7)         then 1 else 0 end as 'nucleus',
    case when gs.suba_location & (1<< 8)         then 1 else 0 end as 'perox',
    case when gs.suba_location & (1<< 10)        then 1 else 0 end as 'plastid',
    case when gs.suba_location & (0x1238)        then 1 else 0 end as 'secretory'

FROM gold_std as gs join pred_predotar as p using(locus);

That creates (effectively) a pair of matrices X and Y. We then compute \(Y^\top X\). (We use the names X and Y to be consistent with scikit-learn, e.g. see the count() functions in the code.)
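In numpy terms this is just a matrix product (a sketch with tiny made-up indicator matrices, not the real data):

import numpy as np

# Rows = proteins; X columns = predictor features, Y columns = gold-standard categories.
X = np.array([[1, 0, 0],     # protein 1: predicted 'mito'
              [0, 1, 0],     # protein 2: predicted 'plastid'
              [1, 0, 0]])    # protein 3: predicted 'mito'
Y = np.array([[1, 0],        # protein 1: gold standard mitochondrion
              [0, 1],        # protein 2: gold standard plastid
              [0, 1]])       # protein 3: gold standard plastid

counts = Y.T @ X             # categories x features co-occurrence counts
print(counts)
# [[1 0 0]
#  [1 1 0]]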

\(Y^\top X\) is equivalent to the categories x features matrix (for example, from the data for Predotar):

SELECT
    category,
    sum( case when feat & (1<< 6)  then 1 else 0 end) as mitochondrion,
    sum( case when feat & (1<< 10) then 1 else 0 end) as plastid,
    sum( case when feat &  0x1238  then 1 else 0 end) as secretory,
    sum( case when feat is NULL    then 1 else 0 end) as `none`
FROM (SELECT -- join "gold standard" to "predictor" table so we can match predictions
          gs.locus,         -- the protein
          gs.suba_location, -- gold standard for the protein
          pred.suba_location as feat  -- the predicted location or "feature"
         FROM gold_std as gs
          LEFT OUTER JOIN pred_predotar as pred
         ON gs.locus = pred.locus
        ) AS gs2pred
    JOIN ( -- MySQL attempt at a literal table of categories
      -- for latin1 charset for string literal
      (select _latin1'peroxisome' as category , 1<<8 as mask)
      union all (select 'cytosol',    1<<1 | 1<<2)
      union all (select 'secretory',   0x1238 << 0) -- ensure "numeric" context with zero shift (ugh!)
        -- see http://dev.mysql.com/doc/refman/5.0/en/hexadecimal-literals.html
      union all (select 'plastid',         1<<10)
      union all (select 'nucleus',         1<< 7)
      union all (select 'mitochondrion',   1<< 6)

      ) AS  categories

ON (gs2pred.suba_location & categories.mask)
GROUP BY category ORDER BY category;

For each row of this matrix the columns are summed and each cell is divided by its row total to make a frequency matrix \(f_{c,A}\). This is the data that goes into the predictor algorithm.
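A sketch of that normalisation in numpy (the optional +1 is the Laplace correction mentioned above; the counts are made up):

import numpy as np

counts = np.array([[40.,  2.,  3.],   # categories x features count table
                   [ 5., 60.,  1.],
                   [ 2.,  1., 30.]])

counts += 1                            # Laplace correction (optional)

# Divide each cell by its row total to obtain the frequency matrix f_{c,A}.
f = counts / counts.sum(axis=1, keepdims=True)
print(f.sum(axis=1))                   # each row now sums to 1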

Experimental Data

The GFP and MSMS tables are simple lists of unique triples (locus, location, PMID), where location is one of the locations in the suba_location set. We first convert using:

SELECT
     gs.locus,
     gs.suba_location,
     sum( case when location in ('cytosol','cytoskeleton') then 1 else 0 end) as cytosol,
     sum( case when location = 'mitochondrion' then 1 else 0 end) as mitochondrion,
     sum( case when location = 'nucleus'       then 1 else 0 end) as nucleus,
     sum( case when location = 'peroxisome'    then 1 else 0 end) as peroxisome,
     sum( case when location = 'plastid'       then 1 else 0 end) as plastid,
     sum( case when location in ('vacuole','golgi','endoplasmic reticulum',
           'plasma membrane','extracellular')  then 1 else 0 end) as secretory

FROM gold_std as gs
JOIN src_gfp using(locus) GROUP BY locus,gs.suba_location;

This gives a table of paper counts for each locus. (Note it is not problematic to group by gs.suba_location since locus already uniquely identifies a row in gold_std.)

This gives the \(Y^\top X\) table (for GFP) as:

SELECT
     cat as category,
     sum( case when location in ('cytosol','cytoskeleton') then 1 else 0 end) as cytosol,
     sum( case when location = 'mitochondrion' then 1 else 0 end) as mitochondrion,
     sum( case when location = 'nucleus'       then 1 else 0 end) as nucleus,
     sum( case when location = 'peroxisome'    then 1 else 0 end) as peroxisome,
     sum( case when location = 'plastid'       then 1 else 0 end) as plastid,
     sum( case when location in ('vacuole','golgi','endoplasmic reticulum',
           'plasma membrane','extracellular')  then 1 else 0 end) as secretory

FROM gold_std as gs
    JOIN src_gfp  as  src

    USING(locus)

    JOIN (          (select 'peroxisome' as cat, 1<<8     as mask)
          union all (select 'cytosol',           1<<1 | 1<<2 )
          union all (select 'secretory',         1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12)
          union all (select 'mitochondrion',     1<< 6)
          union all (select 'nucleus',           1<< 7)
          union all (select 'plastid',           1<< 10)
          )  as cats

    ON( gs.suba_location & cats.mask)

GROUP BY cat ORDER BY cat

and for MS/MS as

SELECT
     cat as category,
     sum( case when location in ('cytosol','cytoskeleton') then 1 else 0 end) as cytosol,
     sum( case when location = 'mitochondrion' then 1 else 0 end) as mitochondrion,
     sum( case when location = 'nucleus'       then 1 else 0 end) as nucleus,
     sum( case when location = 'peroxisome'    then 1 else 0 end) as peroxisome,
     sum( case when location = 'plastid'       then 1 else 0 end) as plastid,
     sum( case when location in ('vacuole','golgi','endoplasmic reticulum',
           'plasma membrane','extracellular')  then 1 else 0 end) as secretory

FROM gold_std as gs
    JOIN src_msms  as  src

    USING(locus)

    JOIN (          (select 'peroxisome' as cat, 1<<8     as mask)
          union all (select 'cytosol',           1<<1 | 1<<2 )
          union all (select 'secretory',         1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12)
          union all (select 'mitochondrion',     1<< 6)
          union all (select 'nucleus',           1<< 7)
          union all (select 'plastid',           1<< 10)
          )  as cats

    ON( gs.suba_location & cats.mask)

GROUP BY cat ORDER BY cat

The PPI predictor

Protein-Protein Interaction (PPI) information is taken from .. (??) plus data curated by our researchers.

We use the following SQL to create a feature list from PPIs. The src_ppi table is a list of AGI pairs (locusA, locusB) and a PMID. We turn this into a predictor by joining locusA of src_ppi with gold_std and then predicting locusB's location to be that of locusA. We use the publication counts exactly as for the GFP/MSMS predictors.

SELECT locus,
       sum(if( suba_location & (1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12) ,num,0)) as `secretory`,
       sum(if( suba_location & (1<<1 | 1<<2) ,num,0)) as `cytosol`,
       sum(if( suba_location & (1<< 6) ,num,0)) as `mitochondrion`,
       sum(if( suba_location & (1<< 7) ,num,0)) as `nucleus`,
       sum(if( suba_location & (1<< 8) ,num,0)) as `peroxisome`,
       sum(if( suba_location & (1<<10) ,num,0)) as `plastid`
FROM (
        SELECT ppi.locusB AS locus, -- locusB is assumed to be in the same compartment as locusA
                gs.suba_location AS suba_location, -- gold standard location of locusA
                -- number of distinct papers for this (locusB, location) pair
                count(DISTINCT ppi.paper) AS num
            FROM gold_std as gs INNER JOIN src_ppi as ppi ON ppi.locusA = gs.locus
            WHERE ppi.locusA != ppi.locusB GROUP BY ppi.locusB, gs.suba_location
            ) as pred_src_ppi

GROUP by pred_src_ppi.locus

Atted Predictor

The Atted data is a set of quadruplets (locus, locus_coex, rank, average). The following SQL converts the Atted data into a list of publication weights for each locus, which we can then join against the gold standard:

SELECT
     locus,
       sum(if( suba_location & (1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12) ,num,0)) as `secretory`,
       sum(if( suba_location & (1<<1 | 1<<2) ,num,0)) as `cytosol`,
       sum(if( suba_location & (1<< 6) ,num,0)) as `mitochondrion`,
       sum(if( suba_location & (1<< 7) ,num,0)) as `nucleus`,
       sum(if( suba_location & (1<< 8) ,num,0)) as `peroxisome`,
       sum(if( suba_location & (1<<10) ,num,0)) as `plastid`

FROM (SELECT -- join "gold standard" to "predictor" table so we can match predictions
           gs.locus,         -- the protein
           gs.suba_location, -- gold standard for the protein
           pred_src_att.suba_location as att,  -- what atted "predicts"
           pred_src_att.num                    -- "weight" attached to this protein
          FROM gold_std as gs
           INNER JOIN

           ( -- this is the predictor table "pred_src_att"
             -- we join to the gold standard on locus and then match gs.suba_location to locus_coex
             SELECT att.locus_coex AS locus, -- locus_coex is in same compartment as locus!
                  gs2.suba_location AS suba_location, -- gold standard says locus_coex is here
                  count(DISTINCT att.rank,att.ave) AS num -- we have this many ranks locus_coex,suba_location
             FROM gold_std as gs2 INNER JOIN atted as att ON att.locus = gs2.locus
             WHERE att.locus != att.locus_coex
             AND
             (  ( gs2.suba_location & (1<<1 | 1<<2)>0  and rank < 50 and ave > .7)   -- cytosol/cytoskeleton
             or ( gs2.suba_location & (1<<10)  >0      and rank < 50 and ave > .8)   -- plastid
             or ( gs2.suba_location & (0x1238) >0      and rank < 50 and ave > .5)   -- secretory
             or ( gs2.suba_location & (1<<7)  >0       and rank < 50 and ave > .5)   -- nucleus
             or ( gs2.suba_location & (1<<6)  >0       and rank < 50 and ave > .6)   -- mito
             or ( gs2.suba_location & (1<<8)  >0       and rank < 50 and ave > .5)   -- perox
             )
             AND rank >= 0 -- avoid bad rows

             GROUP BY att.locus_coex, gs2.suba_location
             ) AS pred_src_att

          ON gs.locus = pred_src_att.locus
         ) AS gs2pred

GROUP by gs2pred.locus

This renders a table in the same form as the GFP/MSMS tables above, and we can then compute with it exactly as for those predictors.

The final counts table is then computed as:

SELECT
     category,
     sum( case when find_in_set('cytosol',att)>0 or find_in_set('cytoskeleton',att)>0
                                                       then num else 0 end)+1 as cytosol,
     sum( case when find_in_set('mitochondrion',att)>0 then num else 0 end)+1 as mitochondrion,
     sum( case when find_in_set('nucleus',att)>0       then num else 0 end)+1 as nucleus,
     sum( case when find_in_set('peroxisome',att)>0    then num else 0 end)+1 as peroxisome,
     sum( case when find_in_set('plastid',att)>0       then num else 0 end)+1 as plastid,
     sum( case when att & 0x1238                       then num else 0 end)+1 as secretory

FROM (SELECT -- join "gold standard" to "predictor" table so we can match predictions
           gs.locus, -- the protein
           gs.suba_location, -- gold standard for the protein
           pred_src_att.suba_location as att,  -- what atted "predicts"
           pred_src_att.num -- number of unique papers attached to this protein
          FROM gold_std as gs
           INNER JOIN ( -- this is the predictor table "pred_src_att"
             -- we join to the gold standard on locus and then match gs.suba_location to locus_coex
             SELECT att.locus_coex AS locus, -- locus_coex is in same compartment as locus!
                  gs2.suba_location AS suba_location, -- gold standard says locus_coex is here
                  count(DISTINCT att.rank,att.ave) AS num -- we have this many ranks locus_coex,suba_location
             FROM gold_std as gs2 INNER JOIN atted as att ON att.locus = gs2.locus
             WHERE att.locus != att.locus_coex
             AND rank >= 0 -- avoid bad rows
             AND rank < 5
             AND ave > .8
             GROUP BY att.locus_coex, gs2.suba_location
             ) AS pred_src_att
          ON gs.locus = pred_src_att.locus
         ) AS gs2pred
    JOIN ( -- MySQL attempt at a literal table of categories
       (select _latin1'peroxisome' as category ,1<<8 as mask) -- for latin1 charset for string literal
       union all (select 'cytosol',    1<<1 | 1<<2)
       union all (select 'secretory',   0x1238 << 0) -- ensure "numeric" context with zero shift (ugh!)
       -- see http://dev.mysql.com/doc/refman/5.0/en/hexadecimal-literals.html
       union all (select 'plastid',         1<<10)
       union all (select 'nucleus',         1<< 7)
       union all (select 'mitochondrion',   1<< 6)
       ) AS  categories

    WHERE gs2pred.suba_location & categories.mask
GROUP BY category ORDER BY category

References


  1. E.T. Jaynes, "Probability Theory: The Logic of Science", Cambridge University Press (2004), ISBN 0-521-59271-2. Section 22.5, page 638.