Feed aggregator

chem-bla-ics : Migrating pKa data from DrugMet to Wikidata

Planet Bioclipse - Sun, 2016-03-27 17:23
In 2010 Samuel Lampa and I started a pet project, collecting pKa data: he was working on an RDF extension of MediaWiki and I like consuming RDF data. We started DrugMet. By the time you read this post, this MediaWiki installation may already be down, which is why I am migrating the data to Wikidata. Why? Because data curation takes effort, I like to play with Wikidata (see this H2020 proposal by Daniel Mietchen et al.), I like Open Data, and it is still much needed.

We opted for a page with a minimal amount of information, to maximize the speed at which we could add data. However, when it came to semantics, we tried to be as explicit as possible and, for example, used the CHEMINF ontology. So, each page collected:
  1. InChIKey (used to show images)
  2. the paper it was collected from (identified by a DOI)
  3. the value, and where possible, the experimental error
A page typically looks something like this:

While not used on all pages, at some point I even started using templates, and I used these two, for molecules and papers:
    {{Molecule
      |Name=
      |InChIKey=
      |DOI=
      |Wikidata=
    }}
    {{Paper
      |DOI=
      |Year=
      |Wikidata=
    }}
These templates, as well as the above screenshot, already contain a spoiler, but more about that later. Using MediaWiki functionality it was now easy to make lists, e.g. for all pKa data (more spoilers):

I find a database like this very important. It does not capture all the information it should be capturing, though, as is clear from the proposal some of us worked on a while back. However, this project was put on hold; I don't have time for it anymore, and it is not core enough to our department to spend time on writing grant proposals for it.

But I still do not want to let this data get lost. Wikidata is something I have started using, as it is a machine-readable CCZero database with an increasing amount of scientific knowledge. More and more people are working on it, and you must absolutely read this paper about this very topic (by a great team you should track, anyway). I am using it myself as a source of identifier mappings and more. So, migrating the previously collected data to Wikidata makes perfect sense to me:

  1. if a compound is missing, I can easily create a new one using Bioclipse
  2. if a paper is missing, I can easily create a new one using Magnus Manske's QuickStatements
  3. Wikidata has a pretty decent provenance model
I can annotate data with the data source (paper) it came from and also experimental conditions:


In fact, you'll note that the book is a separate Wikidata entry in itself. Better even, it's an 'edition' of the book. This is the whole point we make in the above linked H2020 proposal: Wikidata is not a database specific to one domain; it works for any (scholarly) domain, and seamlessly links all those domains.

Now, to keep track of what data I have migrated, I am annotating DrugMet entries with links to Wikidata: everything with a Wikidata Q-code is already migrated. The above pKa table already shows Q-identifiers, but I also created them for all data sources I have used (three of them are two books and one old paper without a DOI):


I still have quite a number of entries to do, but all the protocols are set up now.

On the downstream side, Wikidata is also great because of its SPARQL endpoint. Something I could not work out some weeks ago, I did manage yesterday (after some encouragement from @arthursmith): list all pKa statements, including the literature source if available:
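As a rough illustration of what such a query can look like, here is a minimal sketch wrapped in Python. It is not necessarily the exact query from the post: it assumes that P1117 is the Wikidata property for pKa, that P248 ('stated in') points a reference to its source, and that P356 holds the DOI. The function names pka_query and request_url are made up for this example.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def pka_query(limit=None):
    """Build a SPARQL query listing pKa statements with optional sources.

    Assumptions: P1117 = pKa, P248 = 'stated in', P356 = DOI.
    """
    query = """
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pr: <http://www.wikidata.org/prop/reference/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?compound ?pKa ?source ?doi WHERE {
  ?compound p:P1117 ?statement .
  ?statement ps:P1117 ?pKa .
  OPTIONAL {
    ?statement prov:wasDerivedFrom/pr:P248 ?source .
    OPTIONAL { ?source wdt:P356 ?doi . }
  }
}
""".strip()
    if limit is not None:
        # Cap the number of rows, handy while experimenting.
        query += f"\nLIMIT {int(limit)}"
    return query


def request_url(query):
    """Return the GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

Going through the statement node (p: and ps:) rather than the direct wdt: property is what makes the reference, and thus the paper, reachable; the OPTIONAL blocks keep pKa values that have no source yet.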

If you run that query on the Wikidata endpoint, you get a table like this:


Here we see experimental data from two papers: 10.1021/ja01489a008 and 10.1021/ed050p510. This can all be displayed in a much fancier way, such as histograms or tables with 2D drawings of the chemical structures, but I leave that to the reader.

chem-bla-ics : Re: How should we add citations inside software?

Planet Bioclipse - Sun, 2016-03-27 12:01
In practice, many authors cite webpages for the software, and some even just list the name. I do not understand why scholars do not en masse look up the research papers that are associated with the software. As a reviewer of research papers I often have to advise authors to revise their manuscript accordingly, but I think this is something that should be caught by the journal itself. Fact is, not all reviewers seem to check this.

At some point in the future, if publishers also take this seriously, we will have citation metrics for software like we have for research papers and increasingly for data (see also this brief idea). You can support this by assigning DOIs to software releases, e.g. using ZENODO. This list on our research group's webpage shows some of the software releases:


My advice for citing software thus goes a bit beyond what is traditionally requested from authors:

  1. cite the journal article(s) for the software that you use
  2. cite the specific software release version using ZENODO (or compatible) DOIs

This tweet gives some advice about citing software, triggering this blog post:

Should you cite software? Most probably yes! #collabw16 https://t.co/rol6KT0vhW pic.twitter.com/Xk4o8UwU51 - dimazest (@dimazest) March 23, 2016

Citations inside software
Daniel Katz takes it a step further and asks how we should add citations inside software. After all, software reuses knowledge too and stands on algorithmic shoulders, and this can be a lot. This is something I can relate to: if you write a cheminformatics software library, you use a ton of algorithms, all of which are written up somewhere. Joerg Wegner did this too in his JOELib, and we adopted this idea for the Chemistry Development Kit.

So, the output looks something like:


(Yes, I spot the missing page information. But rather than missing information, it's more that this was an online only journal, and the renderer cannot handle it well. BTW, here you can find this paper; it was my first first author paper.)

However, at a Java source code level it looks quite different:


The build process is taking advantage of the JavaDoc taglet API and uses a BibTeXML file with the literature details. The taglet renders it to full HTML as we saw above.
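The mechanism is easy to mimic outside JavaDoc. The following Python sketch only illustrates the idea and is not the CDK taglet: it assumes a hypothetical inline tag of the form {@cite KEY}, and a small dictionary with a made-up placeholder entry stands in for the BibTeXML file.

```python
import re
from html import escape

# Hypothetical stand-in for the BibTeXML file: citation key -> details.
# The entry below is a placeholder, not a real reference.
BIBLIOGRAPHY = {
    "EXAMPLE2016": {
        "author": "A. Author",
        "title": "An example article",
        "journal": "Journal of Examples",
        "year": "2016",
    },
}

# Hypothetical inline tag syntax: {@cite KEY}
CITE_TAG = re.compile(r"\{@cite\s+([^}\s]+)\}")


def render_citations(doc_comment, bibliography=BIBLIOGRAPHY):
    """Replace every {@cite KEY} tag with an HTML-formatted reference."""

    def render(match):
        key = match.group(1)
        entry = bibliography.get(key)
        if entry is None:
            # Make unknown keys visible instead of failing the build.
            return f"[{escape(key)}: not in bibliography]"
        safe = {field: escape(value) for field, value in entry.items()}
        return "{author}, <i>{title}</i>, {journal}, {year}".format(**safe)

    return CITE_TAG.sub(render, doc_comment)
```

The point of the design is that the source code only carries a short key, while the full literature details live in one place and are expanded at documentation build time.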

Bioclipse does not use this in the source code, but does have the equivalent of a CITATION file: the managers, that extend the Python, JavaScript, and Groovy scripting environments with domain specific functionality (well, read the paper!). You can ask in any of these scripting languages about citation information:

    > doi bridgedb

This will open the webpage of the cited article (which sometimes opens in Bioclipse, sometimes in an external browser, depending on how it is configured).

At a source code level, this looks like:


So, here are my few cents. Software citation is important!

chem-bla-ics : Adding disclosures to Wikidata with Bioclipse

Planet Bioclipse - Sun, 2016-03-20 17:05
Last week the huge, bi-annual ACS meeting took place (#ACSSanDiego), during which new drugs (or drug leads) are commonly disclosed. This time too, like this one tweeted by Bethany Halford:

CORRECTION This is @genentech's Btk inhibitor. I got a pyridine N in the wrong spot. MEDI #ACSSanDiego #sigh pic.twitter.com/TscaMzPTlW - Bethany Halford (@beth_halford) March 17, 2016

Because getting this information out in the open is important, I think it's a good idea to add them to Wikidata (see doi:10.3897/rio.1.e7573). So, with Bioclipse (doi:10.1186/1471-2105-8-59) I redrew the structure:


I previously blogged about how to add chemicals to Wikidata, but I realized that I also wanted to use Bioclipse to automate this process a bit. So, I wrote this script to generate the SMILES, InChI, and InChIKey, double check that the compound is not already in Wikidata (using the Wikidata SPARQL endpoint), and look up the PubChem compound identifier (example SMILES).

smiles = "CCCC"

mol = cdk.fromSMILES(smiles)
ui.open(mol)

inchiObj = inchi.generate(mol)
inchiShort = inchiObj.value.substring(6)
key = inchiObj.key // key = "GDGXJFJBRMKYDL-FYWRMAATSA-N"

sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound WHERE {
  ?compound wdt:P235 "$key" .
}
"""

if (bioclipse.isOnline()) {
  results = rdf.sparqlRemote(
    "https://query.wikidata.org/sparql", sparql
  )
  missing = results.rowCount == 0
} else {
  missing = true
}

formula = cdk.molecularFormula(mol)

// Create the Wikidata QuickStatement,
// see https://tools.wmflabs.org/wikidata-todo/quick_statements.php

item = "LAST" // set to Qxxxx if you need to append info,
              // e.g. item = "Q22579236"

pubchemLine = ""
if (bioclipse.isOnline()) {
  pcResults = pubchem.search(key)
  if (pcResults.size == 1) {
    cid = pcResults[0]
    pubchemLine = "$item\tP662\t\"$cid\""
  }
}

if (!missing) {
  println "===================="
  println "Already in Wikidata as " + results.get(1,"compound")
  println "===================="
} else {
  statement = """
    CREATE
    
    $item\tDen\t\"chemical compound\"
    $item\tP233\t\"$smiles\"
    $item\tP274\t\"$formula\"
    $item\tP234\t\"$inchiShort\"
    $item\tP235\t\"$key\"
    $pubchemLine
  """

  println "===================="
  println statement
  println "===================="
}

The output of this script is a QuickStatement for Magnus Manske's tool: a list of commands to create and edit entries in Wikidata. IMPORTANT: it is not meant to automate editing Wikidata! I only automate creating the input, which I carefully check (e.g. checking that all stereochemistry is defined). Note how Bioclipse opens up the structure in a viewer with ui.open(). You need to enable the tool first, but if you have an account, this is not too hard. Of course, the advantage is that it is a lot quicker. I have a similar script to create QuickStatements starting with only a ChEMBL identifier.
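The last step of the script, turning identifiers into QuickStatements, does not depend on Bioclipse and can be sketched in plain Python. This is an illustration under assumptions: the function name quickstatements and its parameters are made up here, and the identifiers are expected to have been computed elsewhere (for example with the script above).

```python
def quickstatements(smiles, formula, inchi, inchikey,
                    pubchem_cid=None, item="LAST"):
    """Return tab-separated QuickStatements commands for one compound.

    With item="LAST" a new item is created first; pass an existing
    Q-code instead to append statements to that item.
    """
    # Drop the "InChI=" prefix, like the substring(6) call in the script.
    if inchi.startswith("InChI="):
        inchi = inchi[len("InChI="):]

    lines = []
    if item == "LAST":
        lines.append("CREATE")

    rows = [
        ("Den", "chemical compound"),  # English description
        ("P233", smiles),              # SMILES
        ("P274", formula),             # chemical formula
        ("P234", inchi),               # InChI
        ("P235", inchikey),            # InChIKey
    ]
    if pubchem_cid is not None:
        rows.append(("P662", str(pubchem_cid)))  # PubChem compound ID

    lines += [f'{item}\t{prop}\t"{value}"' for prop, value in rows]
    return "\n".join(lines)
```

Unlike the script above, this sketch only emits CREATE when a new item is requested, so passing an existing Q-code yields statements that can be pasted without editing. The output still needs the same manual check before it goes into the tool.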

The QuickStatement for GDC-0853 looks like:

    CREATE
    
    LAST Den "chemical compound"
    LAST P233 "O=C1C(=CC(=CN1C)c2ccnc(c2CO)N4C(=O)c3cc5c(n3CC4)CC(C)(C)C5)Nc6ncc(cc6)N7CCN(C[C@@H]7C)C8COC8"
    LAST P274 "C37H44N8O4"
    LAST P234 "1S/C37H44N8O4/c1-23-18-42(27-21-49-22-27)9-10-43(23)26-5-6-33(39-17-26)40-30-13-25(19-41(4)35(30)47)28-7-8-38-34(29(28)20-46)45-12-11-44-31(36(45)48)14-24-15-37(2,3)16-32(24)44/h5-8,13-14,17,19,23,27,46H,9-12,15-16,18,20-22H2,1-4H3,(H,39,40)/t23-/m0/s1"
    LAST P235 "WNEODWDFDXWOLU-QHCPKHFHSA-N"
    LAST P662 "86567195"


The first line creates a new Wikidata item, while the next ones add information about this compound. GDC-0853 is now also Q23304817. The label I added manually afterwards. Note how the Bioclipse script found the PubChem identifier, using the InChIKey. I also use this approach to add compounds to Wikidata that we have in WikiPathways.

chem-bla-ics : Adding chemical compounds to Wikidata

Planet Bioclipse - Sat, 2016-02-27 14:03
Adding chemical compounds to Wikidata is not difficult. You can store the chemical formula (P274), (canonical) SMILES (P233), InChIKey (P235) (and InChI (P234), of course), as well as various database identifiers (see what I wrote about that here). It also allows storing the provenance, and has predicates for that too.

So, to enter a new structure for a compound, you should enter the compound information into Wikidata. Of course, make sure to create the needed accounts, particularly one for Wikidata (create account) (I am not sure if the next step needs a more general Wikimedia account too).

Entering the research paper
Magnus Manske pointed me to this helper tool. If you have the DOI of the paper, it is easy to add a new paper. This is what the tool shows for doi:10.1128/AAC.01148-08 (but no longer when you try!):


You need permission to run this script; the tool will alert you about that and give instructions on how to get permission. After I clicked Open in QuickStatements, I got this output, showing me that an entry in Wikidata was created for this paper:


Later, I can use the new Q-code (Q22309806) as the source for statements about the compound (formula, etc).

Draw your compound and get an InChIKey
The next step is to draw a compound and get an InChIKey. This can be done with many tools, including Bioclipse. Rajarshi opted for alternatives:

@collabchem @egonwillighagen OSRA or https://t.co/ZIQdgrYsmr? - Rajarshi Guha (@rguha) January 27, 2016

Then check if the compound is not already in Wikidata. You can use this SPARQL query for that using the InChIKey of the compound (it's for acetic acid, so it will be found):


For convenience, here is the copy/pastable SPARQL:
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound WHERE {
?compound wdt:P235 "QTBSBXVTEAMEQO-UHFFFAOYSA-N" .
}
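If you want to run this check for more than one compound, the query can be generated from the InChIKey. The Python sketch below is one way to do that; the helper names are made up for this example, and it assumes the endpoint returns the standard SPARQL JSON results layout (results, then bindings).

```python
import re

# An InChIKey has three upper-case blocks of 14, 10 and 1 letters.
INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def inchikey_lookup_query(inchikey):
    """Build the SPARQL query that finds items with this InChIKey (P235)."""
    if not INCHIKEY.fullmatch(inchikey):
        # Refuse anything that is not a plain InChIKey, so that no
        # stray quotes can end up inside the query string.
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?compound WHERE {\n"
        f'  ?compound wdt:P235 "{inchikey}" .\n'
        "}"
    )


def compound_uris(result_json):
    """Extract the ?compound values from a SPARQL JSON result document."""
    bindings = result_json["results"]["bindings"]
    return [row["compound"]["value"] for row in bindings]
```

An empty list from compound_uris means the compound is not in Wikidata yet; for the acetic acid key used above, the list should contain the entity for Q47512.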
Entering the compound
If the compound is not already in Wikidata, it is time to add it. The minimal information you should provide is the following:
  • mark the new entry as 'instance of' (P) 'chemical compound' (Q)
  • the chemical formula and SMILES (use as reference the paper)
    • add the reference to the paper you entered above
  • add the InChIKey and/or InChI
The first step is to create a new Wikidata entry. The Create new item menu in the left side panel can be used, showing a page like this:


As a label you can use the name used in the paper for the compound, even if a code, and as description 'chemical compound' will do for now; it can be changed later.
Feel free to add as much information about the compound as you can find. There are some chemically rich entries in Wikidata, such as that for acetic acid (Q47512).