Planet Bioclipse

chem-bla-ics : Comparing sets of identifiers: the Bioclipse implementation

Sat, 2016-05-21 11:22
The problem
That sounds easy: take two collections of identifiers, put them in sets, determine the intersection, done. Sadly, each collection uses identifiers from different databases. Worse, a single set can contain identifiers from multiple databases. Mind you, I'm not going full monty, though some chemistry will be involved at some point. Instead, this post is really based on identifiers.
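To make the problem concrete, here is a minimal Groovy sketch. The identifiers "DbA:1", "DbA:2" and "DbB:x" are hypothetical placeholders, and the assumption that "DbA:1" and "DbB:x" denote the same compound is purely for illustration:

```groovy
// Hypothetical identifiers: "DbA:1" and "DbB:x" are assumed to denote the same compound.
def collection1 = ["DbA:1", "DbA:2"] as Set
def collection2 = ["DbB:x", "DbA:2"] as Set

// Plain set intersection compares the identifiers literally,
// so the equivalent pair from two different databases is missed.
def common = collection1.intersect(collection2)
println common
```

Only the literally shared identifier ends up in the result, which is why an identifier mapping service is needed.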

The example
Data set 1:

Data set 2: all metabolites from WikiPathways. This set has many different data sources, and seven provide more than 100 unique identifiers. The full list of metabolite identifiers is here.

The goal
Determine the intersection of two collections of identifiers from arbitrary databases, ultimately using scientific lenses. I will develop at least two solutions: one based on Bioclipse (this post) and one based on R (later).

Needs
First of all, we need something that links IDs in the first place. Not surprisingly, I will be using BridgeDb (doi:10.1186/1471-2105-11-5) for that, but for small molecules alternatives exist, like the Open PHACTS IMS based on BridgeDb, the Chemical Translation Service (doi:10.1093/bioinformatics/btq476) or UniChem (doi:10.1186/s13321-014-0043-5, doi:10.1186/1758-2946-5-3).

The Bioclipse implementation
The first thing we need to do is read the files. I saved them with a .csv extension even though they are tab-separated, because Bioclipse will then open them in its matrix editor (yes, I think .tsv needs to be linked to that editor, which does not seem to be the case yet). Reading the human metabolites from WikiPathways is done with this code (using Groovy as scripting language):

// Read the WikiPathways human metabolite identifiers (tab-separated: syscode, id)
file1 = new File(
  bioclipse.fullPath(
    "/Compare Identifiers/human_metabolite_identifiers.csv"
  )
)
set1 = new java.util.HashSet()
file1.eachLine { line ->
  def fields = line.split(/\t/)
  // skip incomplete lines and the header line
  if (fields.size() < 2 || fields[0] == "syscode") return
  def (syscode, id) = fields
  set1.add(bridgedb.xref(id, syscode))
}

You can see that I am using the BridgeDb functionality already, to create Xref objects. The code skips the header line (any line with "syscode" in the system code column). Because the set compares Xref objects with their equals() and hashCode() methods, I only have unique cross references in the resulting set.
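The deduplication can be tried outside Bioclipse with a small stand-in. In the sketch below, SimpleXref and parseXrefs are hypothetical names, not part of BridgeDb or Bioclipse; the class only mimics the idea of a (system code, identifier) pair with value-based equality:

```groovy
import groovy.transform.EqualsAndHashCode

// Hypothetical stand-in for a BridgeDb Xref: a (system code, identifier) pair
// whose equals() and hashCode() are generated from both fields.
@EqualsAndHashCode
class SimpleXref {
  String syscode
  String id
  String toString() { "${syscode}:${id}" }
}

// Parse tab-separated "syscode<TAB>id" lines, skipping the header and incomplete lines.
def parseXrefs(String text) {
  def xrefs = new HashSet()
  text.eachLine { line ->
    def fields = line.split(/\t/)
    if (fields.size() < 2 || fields[0] == "syscode") return
    xrefs.add(new SimpleXref(syscode: fields[0], id: fields[1]))
  }
  return xrefs
}

// The ChEBI line is deliberately repeated to show that the set keeps it once.
def sample = "syscode\tid\nCe\tCHEBI:15904\nCe\tCHEBI:15904\nCa\t25513-46-6\n"
def parsed = parseXrefs(sample)
println parsed.size()
```

Without value-based equals() and hashCode(), the repeated line would produce two distinct objects in the set.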

Reading the other identifier set is a bit trickier. First, I manually changed the second column to use the BridgeDb system codes. The list is short, and this saves me from making mappings in the source code. One thing I decided to do in the source code is normalize the ChEBI identifiers (something that many of you will recognize):

// Read the second identifier set (tab-separated: name, syscode, id)
file2 = new File(
  bioclipse.fullPath("/Compare Identifiers/set.csv")
)
set2 = new java.util.HashSet()
file2.eachLine { line ->
  def fields = line.split(/\t/)
  // skip incomplete lines and the header line
  if (fields.size() < 3 || fields[1] == "syscode") return
  def (name, syscode, id) = fields
  // normalize ChEBI identifiers so they always carry the "CHEBI:" prefix
  if (syscode == "Ce" && !id.startsWith("CHEBI:")) {
    id = "CHEBI:" + id
  }
  set2.add(bridgedb.xref(id, syscode))
}
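
The ChEBI normalization can also be pulled out into a small function, which makes it easy to check on its own. This is just a sketch; normalizeId is a hypothetical helper name, and it only covers the "Ce" system code used above:

```groovy
// Hypothetical helper: add the "CHEBI:" prefix when a bare ChEBI number is given.
// "Ce" is the system code this post uses for ChEBI; other identifiers pass through unchanged.
def normalizeId(String syscode, String id) {
  if (syscode == "Ce" && !id.startsWith("CHEBI:")) {
    return "CHEBI:" + id
  }
  return id
}

println normalizeId("Ce", "15904")        // bare number gets the prefix
println normalizeId("Ce", "CHEBI:30089")  // already prefixed, left alone
println normalizeId("Ca", "25513-46-6")   // not ChEBI, left alone
```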

Then, the naive approach that does not take into account identifier equivalence makes it easy to list the number of identifiers in both sets:

intersection = new java.util.HashSet()
intersection.addAll(set1)
intersection.retainAll(set2)

println "set1: " + set1.size()
println "set2: " + set2.size()
println "intersection: " + intersection.size()

This reports:

set1: 2584
set2: 6
intersection: 3

With the following identifiers in common:
[Ce:CHEBI:30089, Ce:CHEBI:15904, Ca:25513-46-6]
Of course, we want to use the identifier mapping itself. So, we first compare identifiers directly and, if they do not match, use BridgeDb and a metabolite identifier mapping database (get one here):

mbMapper = bridgedb.loadRelationalDatabase(
  bioclipse.fullPath(
    "/VOC/hmdb_chebi_wikidata_metabolites.bridge"
  )
)

intersection = new java.util.HashSet();
for (id2 in set2) {
  if (set1.contains(id2)) {
    // OK, direct match
    intersection.add(id2)
  } else {
    mappings = bridgedb.map(mbMapper, id2)
    for (mapped in mappings) {
      if (set1.contains(mapped)) {
        // OK, match via a mapped identifier
        intersection.add(id2)
      }
    }
  }
}

This gives five matches:
[Ch:HMDB00042, Cs:5775, Ce:CHEBI:15904, Ca:25513-46-6, Ce:CHEBI:30089]
The only metabolite it did not find in any pathway is the one identified with a KEGG identifier, homocystine. I just added this compound to Wikidata, which means that the next release of the metabolite mapping database will recognize this compound too.
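
The two-step lookup (direct match first, mapped identifiers second) can be sketched without Bioclipse or a BridgeDb database. In the sketch below, the equivalents map is a toy stand-in for the mapping database, and all identifiers are hypothetical placeholders rather than real database entries:

```groovy
// Toy equivalence table standing in for a BridgeDb mapping database.
// All identifiers are hypothetical placeholders, not real database entries.
def equivalents = [
  "DbB:x": ["DbA:1"],   // assumed to map onto an identifier present in set1
  "DbC:9": ["DbD:7"]    // assumed to map onto an identifier absent from set1
]

def set1 = ["DbA:1", "DbA:2"] as Set
def set2 = ["DbA:2", "DbB:x", "DbC:9"] as Set

def intersection = new LinkedHashSet()
for (id2 in set2) {
  if (set1.contains(id2)) {
    // direct match
    intersection.add(id2)
  } else if ((equivalents[id2] ?: []).any { set1.contains(it) }) {
    // match via a mapped identifier
    intersection.add(id2)
  }
}
println intersection
```

As in the Bioclipse script, the result holds the identifiers from the second set, so an identifier found only through a mapping is reported under its original name.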

The R and JavaScript implementations
I will soon write up the R version in a follow-up post (but I have to finish grading student reports first).