Planet Bioclipse
A blog dedicated to Bioclipse - a workbench for life science
BioclipseBlog
2010-07-09T13:31:12Z
This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to address and possibly solve problems in the area of chemistry, biochemistry and related fields.
The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc., is that chemblaics only uses open source software, making experimental results reproducible and validatable. And this is a big difference!
chem-bla-ics
2010-07-31T04:15:02Z
How to use GitHub for [CDK|Bioclipse] code review
Triggered by posts in the past three days, I thought about writing up a short tutorial on how to perform code review for existing code on GitHub. The tutorial is written with CDK and Bioclipse source code in mind, but it will work for any project hosted on GitHub. Even if your project is not hosted there, you could consider putting up a copy yourself. This example demonstrates the procedure on CDK functionality in Bioclipse, in the bioclipse.cheminformatics repository.

Step 1: find the class you want to review
Use the GitHub web interface to browse your way to the source code of the class you want to review, for example SmartsMatchingHelper.java.

Step 2: identify something you would like to comment on
The next step is to do some actual code reviewing. For example, we might want to ask something about how parseProperty() works. Now, this page on GitHub does not provide the means to leave comments; instead, you comment on commits.

Step 3: find the last commit that touched the line you would like to comment on
Git has a blame option (also called annotate) which shows, for each line, who last changed it. The GitHub web page makes this functionality available with the 'blame' link just above the first line of the source code. This link leads to a page with a new column on the left side showing commit hashes, the name of the commit author, and the first few characters of the commit message. For the code we want to comment on, this column shows that commit 3ce78ba5 is the one we are interested in.

Step 4: look up the line again and add a comment
On the web page for the commit found in the previous step, scroll down to the line you want to comment on. If you hover over that line, a blue comment bubble shows up on the left side. Clicking that blue comment icon gives you a dialog where you can enter your comment. The 'Add Line Note' button confirms and saves your comment.

Step 5: inform the committer about your review
The next step is to inform the commit author. GitHub actually helps here and should send a message. But it would certainly not hurt if you filed a bug report or sent an email.

Now, I should only convert this into a screencast...
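The blame lookup of step 3 can also be done on a local clone. The following is only a rough sketch, assuming the text printed by `git blame --line-porcelain <file>` has been captured into a string; the function names parseBlame and commitForLine are made up for this illustration and are not part of git or GitHub.

```javascript
// Parse the text printed by `git blame --line-porcelain <file>`.
// Returns one entry per source line with the commit that last touched it.
// (parseBlame and commitForLine are hypothetical helper names.)
function parseBlame(porcelain) {
  const entries = [];
  let current = null;
  for (const row of porcelain.split("\n")) {
    // Header row: <40-hex commit> <original line> <final line> [<group size>]
    const header = /^([0-9a-f]{40}) (\d+) (\d+)/.exec(row);
    if (header && current === null) {
      current = { commit: header[1], line: Number(header[3]), author: "", summary: "" };
    } else if (current && row.startsWith("author ")) {
      current.author = row.slice("author ".length);
    } else if (current && row.startsWith("summary ")) {
      current.summary = row.slice("summary ".length);
    } else if (current && row.startsWith("\t")) {
      // A TAB-prefixed row carries the source text and closes the entry.
      current.text = row.slice(1);
      entries.push(current);
      current = null;
    }
  }
  return entries;
}

// Look up the blame entry for a given line number of the current file.
function commitForLine(entries, lineNumber) {
  return entries.find((entry) => entry.line === lineNumber);
}
```

With the commit hash in hand, the commit page on GitHub is where the line comment of step 4 is added.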
2010-05-10T10:29:00Z
Egon Willighagen
Bioclipse 2.4 released
The Bioclipse team is proud to announce the release of Bioclipse version 2.4. The release contains various new features in cheminformatics and drug discovery, including improved QSAR functionality, site-of-metabolism prediction, semantic web functionality, browsing of large compound collections, and editing of chemical structures, as well as numerous bug fixes.

Bioclipse 2.4 is available for 32 and 64 bit versions of Mac OS X and Linux, and for the 32 bit version of Windows (Bioclipse for 64 bit Windows is currently unavailable, but will be provided as soon as a native Standard InChI is available for 64 bit Windows).

Download Bioclipse
Getting started guide

Planet Bioclipse integrates blogs related to Bioclipse.
Bioclipse 2.4.0.RC3 is here
A new release candidate is out and can be found at the usual site for release candidates: http://pele.farmbio.uu.se/bioclipse-devel/

I have a good feeling about this one. I think Bioclipse 2.4 is really close now.

And as usual, if you download and try the release candidate, you already know that we love to get bug reports in our Bugzilla. :)
Final version of degree project report
The last administrative details of my thesis project are now finished, and the report is available in final form for download as PDF on this page (no 14 in the list), or via this direct link. (The title of the project was "SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking").
2010-07-05T20:01:21Z
Samuel Lampa
A small but wonderful add-on
Look at the following scenarios:
(a)
> var camk = chembl.MossGetProtFamilyCompAct("camk", "IC50")
> chembl.MoSSViewHistogram(camk)
> var camkBounds = chembl.MossSetActivityBound(camk, 1,1000000)
> camkBounds.getRowCount()
2565
> chembl.MossSaveFormat("/ChEMBL-MoSS/Rapport/CAMKIC501", camk)
(b)
> var camk=chembl.MossGetProtFamilyCompActBounds("CAMK","IC50",1, 1000000)
> camk.getRowCount()
2565
> chembl.MossSaveFormat("/ChEMBL-MoSS/Rapport/CAMKIC502", camk)
(a)+(b) Scripts taken from the context of retrieving molecules for molecular substructure mining. (a) Collects the compounds that bind to proteins from the CAMK family with activity type IC50. The activities of the compounds are inspected in a histogram, and the bounds are then set to include only molecules with activities between 1 and 1,000,000. Finally, the result is saved to a file in the input format that MoSS supports.
(b) Let's say you have worked with this set a couple of times and know exactly which parameters you want. Then the script in (b) removes unnecessary steps in retrieving the molecules by simply adding the lower and upper values to the query directly. Again, the result is saved to a MoSS input file.
Small step but wonderful when you run scripts all day!
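To make the bounding step concrete, here is a minimal sketch in plain JavaScript of what filtering a result table by an activity span could look like. It is not the implementation behind chembl.MossSetActivityBound; the function name filterByActivity, the array-of-rows layout with a header row, the "actval" column name, and the inclusive bounds are all assumptions made for this illustration.

```javascript
// Keep only the rows whose activity value lies within [lower, upper].
// `matrix` is an array of rows; the first row is the header.
// (Hypothetical helper, not the Bioclipse manager method.)
function filterByActivity(matrix, lower, upper) {
  const [header, ...rows] = matrix;
  const col = header.indexOf("actval");
  if (col === -1) throw new Error('no "actval" column in header');
  const kept = rows.filter((row) => {
    const value = Number(row[col]);
    // Rows with a non-numeric activity value are dropped.
    return Number.isFinite(value) && value >= lower && value <= upper;
  });
  return [header, ...kept];
}
```

Doing the filtering inside the query, as in script (b), avoids transferring rows that would be thrown away afterwards.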
The ChEMBL-MoSS interaction in Bioclipse
There are two ways of accessing the ChEMBL-MoSS feature in Bioclipse: through JavaScript and through a wizard. I will present both ways here!

In both cases I work with an example of accessing molecules for the kinase protein family tyrosine kinase, also known as TK. I want to look at the compounds that bind to any protein in this family with the activity type Ki, and also to specify the activity span my molecules should be in.

Starting off with the wizard: when it is first opened, only one box is accessible, the one for protein families. When a family is selected, a SPARQL query is run against the endpoint and returns the available activities for that family. Selecting a preferred activity triggers another SPARQL query, which updates the table with compounds (limited to 50; the "add all" button will of course add them all =).

Now I would like to collect only the active compounds, so I first look at the graph displaying the activities. When I know which activity span I would like to work with, I update the table using the lower and upper boxes and simply press "update table". When I now press finish, a file that MoSS supports is produced.

JavaScript

Performing almost the same task in JavaScript looks as follows.

> var tkki = chembl.MossGetProtFamilyCompAct("tk","ki",50)
> tkki.getRowCount()
50

Here I collect 50 compounds from the TK family with the activity type Ki.

> var tkki = chembl.MossGetProtFamilyCompAct("tk","ki")
> tkki.getRowCount()
976

Here I do the same thing as above without a limit, which returns 976 compounds, the same number that was returned when "add all" was pushed in the wizard.

> var tkkiActBound = chembl.MossSetActivityBound(tkki, 1,15000)
> tkkiActBound.getRowCount()
850
> tkkiActBound
[["actval","smiles"],
["160","Cc1nc(N)sc1c2ccnc(Nc3cccc(c3)[N+](=O)[O-])n2"],
["700","CCOc1nc(cc(N)c1Cl)C(=O)NCc2ccc(cc2)S(=O)(=O)C"],
["10000","CS(=O)(=O)Nc1cc2OCCCCCOc3nc(NC(=O)Nc2cc1Cl)cnc3C#N"],
["10000","OCCCOc1cc2OCCCCCOc3nc(NC(=O)Nc2cc1Cl)cnc3C#N"],
["10000","OCCCc1cc2OCCCCCOc3nc(NC(=O)Nc2cc1Cl)cnc3C#N"],
["19.4","Cc1cc(cc2nnc(Nc3ccc(OCCN4CCCC4)cc3)nc12)c5c(Cl)cccc5Cl"],
["950","COc1cc2ncc(C#N)c(N[C@@H]3C[C@H]3c4ccccc4)c2cc1OC"],
["10000","N#Cc1cnc2ccc(cc2c1N[C@@H]3C[C@H]3c4ccccc4)c5ccc(CN6CCOCC6)cc5"],
["10000","N#Cc1cnc2ccc(cc2c1N[C@@H]3C[C@H]3c4ccccc4)c5cccc(CN6CCOCC6)c5"],
…

With the specification of an activity span between 1 and 15000 nM, the number of compounds is reduced to 850 (as in the wizard). If I write the name of the variable, a string matrix displays all the information. But in order to work with MoSS, it has to be saved in a certain format. That's why we save the matrix to a file, just as we did when we pressed finish in the wizard.

> chembl.saveMossFormat("/chembl/Script/tkki",tkkiActBound)

Taken from the produced file(s) (they are exactly the same):

1,0,Cc1nc(N)sc1c2ccnc(Nc3cccc(c3)[N+](=O)[O-])n2
2,0,CCOc1nc(cc(N)c1Cl)C(=O)NCc2ccc(cc2)S(=O)(=O)C
3,0,CS(=O)(=O)Nc1cc2OCCCCCOc3nc(NC(=O)Nc2cc1Cl)cnc3C#N
4,0,OCCCOc1cc2OCCCCCOc3nc(NC(=O)Nc2cc1Cl)cnc3C#N
………
848,0,Clc1cc2NC(=O)Nc3cnc(C#N)c(OCCCCOc2cc1NCc4cncs4)n3
849,0,OC[C@@H](NC(=O)c1cc(c[nH]1)c2[nH]ncc2c3cccc(Cl)c3)c4ccc(F)c(Cl)c4
850,0,FC(F)(F)c1cccc(c1)c2nnc3ccc(NC4CCNCC4)nn23

With this shown I will soon let you know what MoSS can do with the saved data!
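The saved file shown above has one line per compound: a running number, a 0, and the SMILES. A minimal sketch of that conversion in plain JavaScript could look like the following. It is not the code behind chembl.saveMossFormat; the function name toMossLines and the array-of-rows layout are assumptions, and the middle field is written as a constant 0 simply because that is what the sample output shows.

```javascript
// Turn a result table (header row first) into MoSS-style input lines of
// the form "<running number>,0,<SMILES>".
// (Hypothetical helper; the constant 0 mirrors the sample file above.)
function toMossLines(matrix) {
  const [header, ...rows] = matrix;
  const col = header.indexOf("smiles");
  if (col === -1) throw new Error('no "smiles" column in header');
  return rows.map((row, index) => `${index + 1},0,${row[col]}`);
}
```

Joining the returned lines with newline characters gives the text that would be written to the file.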
Bioclipse 2.4.0.RC1 is out
A new release candidate is out and can be found at the usual site for release candidates: http://pele.farmbio.uu.se/bioclipse-devel/

What is new?
Among the main news are:
New molecules table.
We are now using Java 1.6, Eclipse 3.5.2 and CDK 1.3.5.

And of course bug fixes and probably a lot of other stuff which I don't know about. If you download and try the release candidate, you already know that we love to get bug reports in our Bugzilla.
A moss-chembl application
After a month of traveling I'm now back to devote my time to what's left of my project, which is about 8-9 weeks. My work is progressing; much of my time goes into human-computer interaction, but I am also advancing the SPARQL queries and testing them for accuracy.

MoSS, as I have probably mentioned a couple of times before, is molecular substructure mining software produced by Christian Borgelt, http://www.borgelt.net/moss.html. I implemented that application for Bioclipse in 2008, http://wiki.bioclipse.net/index.php?title=MoSS_in_Bioclipse, and I'm now making use of my own application.

As my ChEMBL work is coming along, I'm at the moment working on a specific workflow, "from ChEMBL to MoSS". With the functionality of SPARQL I am now, via Java methods, accessing compounds from various kinase protein families. A method could look something like this:

    public IStringMatrix MossProtFamilyCompounds(String fam, String actType)
            throws BioclipseException {
        // SPARQL query: SMILES of all molecules with an activity of the given
        // type measured against a target in the given protein family.
        String sparql =
            "PREFIX chembl: " +
            "PREFIX bo: " +
            "SELECT DISTINCT ?smiles where{ " +
            " ?target a chembl:Target;" +
            " chembl:classL5 ?fam. " +
            " ?assay chembl:hasTarget ?target . " +
            " ?activity chembl:onAssay ?assay ;" +
            " chembl:type ?actType ; " +
            " chembl:forMolecule ?mol ." +
            " ?mol bo:smiles ?smiles. " +
            // Case-insensitive exact match on family and activity type.
            " FILTER regex(?fam, " + "\"^" + fam + "$\"" + ", \"i\")." +
            " FILTER regex(?actType, " + "\"^" + actType + "$\"" + ", \"i\")." +
            " }";
        // Send the query to the remote SPARQL endpoint.
        IStringMatrix matrix = rdf.sparqlRemote("http://rdf.farmbio.uu.se/chembl/sparql", sparql);
        return matrix;
    }

Inside this Java method there is a SPARQL query, held in a string named sparql. It is possible to run a query like this thanks to the rdf project done by Egon. I use that feature when I call rdf.sparqlRemote; what that command basically does is send my query, as a String, to the SPARQL endpoint (URL). So for this to work an internet connection must exist. I will try to find something that can check whether such a connection exists, to improve the use of the application (no connection -> no search).

The compounds are saved into a file supported by MoSS. This makes it possible for MoSS to run on the compounds drawn from the ChEMBL database. A JavaScript environment is also available.

The pictures show (top) the moss-chembl wizard and (bottom) the moss wizard.

The moss-chembl application is dynamic, which means that you can search for the compounds you want and look at them directly. This eases the work a lot! It should also be mentioned that, at the moment, the compounds are only those that bind to a protein in a kinase family.

When a preferred data set is chosen, MoSS reads in the data and you are then able to perform substructure mining on it!

Next problem to manage: visualization...
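The method above pastes the family name and activity type straight into the regular expression of the FILTER clauses. As a rough sketch of how such a clause could be assembled with the user input escaped first, here is a plain JavaScript version. The helper names are made up for this illustration, and whether the Bioclipse manager needs this is an assumption; the only point is that characters such as a dot or a double quote in the input would otherwise change the meaning of the query.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Escape a value for use inside a double-quoted SPARQL string literal.
function escapeSparqlString(text) {
  return text.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Build a case-insensitive exact-match FILTER for a SPARQL variable,
// mirroring the FILTER regex(...) clauses in the Java method.
// (Hypothetical helper name.)
function exactMatchFilter(variable, value) {
  const pattern = "^" + escapeRegex(value) + "$";
  return `FILTER regex(${variable}, "${escapeSparqlString(pattern)}", "i")`;
}
```

For example, exactMatchFilter("?fam", "tk") yields the same clause that the Java method builds for the family name "tk".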
Prolog query much faster when mimicking SPARQL
I reported earlier that Jena/SPARQL outperformed Prolog for a lookup query with some numerical value comparison. It later turned out that the results were flawed, and finally that Prolog was indeed the fastest as soon as one turned to datasets with more than a few hundred peaks. The Prolog program I was using was rather complicated, with recursive operations on double lists etc.

Then, a week or so ago, in order to highlight differences in expressivity between Prolog and SPARQL, I tried to implement a Prolog query that mimicked the structure of the SPARQL query I used as closely as possible. Interestingly, it turned out that this Prolog query can be optimized to become blazing fast by reversing the order of the shift values to search for, so that the largest values are searched for first. With this optimization the query outperforms both the SPARQL query and the Prolog code used earlier. See figure 1 for results (the new Prolog query is named "SWI-Prolog Minimal"). It appears that the querying time does not even increase with the number of triples in the RDF store!

Figure 1: Spectrum similarity search comparison: SWI-Prolog vs. Jena

The explanation seems to stem from the fact that larger NMR shift values are in general more unique than smaller values (see the histogram of shift values in the full data set in figure 2). Thus, by testing for the largest value first, the query is much less prone to get stuck in false leads. (Well, looking at the histogram, it appears that one could in fact do even better sorting than just from larger to smaller, like testing for values around 100 before values around 130 etc.)

Figure 2: Histogram of NMR Shift values in 25000 spectrum dataset

(The new Prolog code is included in the full post. The SPARQL query and the earlier Prolog code can be found as attachments to the blog post.)
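The idea of testing the most selective value first is not specific to Prolog. The sketch below illustrates it in plain JavaScript; it is not the Prolog query from the post, and the function name, the data layout (objects with a shifts array) and the tolerance parameter are assumptions made for the illustration.

```javascript
// Return the spectra that contain every query shift within `tolerance`.
// The query shifts are tested from largest to smallest, on the assumption
// (taken from the post) that large shift values are rarer and therefore
// shrink the candidate list fastest.
function findMatchingSpectra(spectra, queryShifts, tolerance) {
  const ordered = [...queryShifts].sort((a, b) => b - a);
  let candidates = spectra;
  for (const shift of ordered) {
    candidates = candidates.filter((spectrum) =>
      spectrum.shifts.some((peak) => Math.abs(peak - shift) <= tolerance)
    );
    if (candidates.length === 0) break; // nothing left to test
  }
  return candidates;
}
```

The result is the same for any order of the query shifts; only the amount of work differs, because every later test runs on the candidates that survived the earlier ones.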
2010-05-01T20:20:32Z
Samuel Lampa
Correction of flawed results: Close competition between Jena and Prolog
UPDATE 29/3: See new results here.

I reported in a previous blog post (with a bit of surprise) that Jena clearly outperformed SWI-Prolog for an NMR spectrum similarity search run inside Bioclipse. I have now realized that these previous results were indeed flawed, for a number of reasons.
2010-03-24T03:50:12Z
Samuel Lampa
Querying multiple SPARQL endpoints from single query, with Jena SERVICE extension
Egon pointed to an interesting blog post about a feature that is available as an extension to Jena, the semantic web framework available in Bioclipse. It makes it very easy to query multiple SPARQL endpoints from a single SPARQL query (using the SERVICE keyword), and to use variables bound from one endpoint when querying the next.
This is very useful in general. I was also thinking of the specific scenario (along the lines we have partly been thinking already) of using multiple Semantic MediaWikis as community-maintained databanks, for querying back into Bioclipse. Being able to use multiple MediaWiki installs is very useful because it is hard to incorporate a really efficient access restriction system in MediaWiki (due to the nature of how it works, with template calls and all), so it is better to be able to have separate wikis for content which needs special restrictions.
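As a small illustration of what such a query looks like, the following plain JavaScript sketch assembles a SPARQL string with one SERVICE block per endpoint. The function name federatedQuery and the way the patterns are passed in are assumptions for this example; the SERVICE <endpoint> { ... } form itself is the syntax the keyword uses.

```javascript
// Assemble a SPARQL query with one SERVICE block per endpoint.
// A variable used in several blocks is what joins the results.
// (Hypothetical helper; patterns are passed in as ready-made strings.)
function federatedQuery(variables, services) {
  const blocks = services.map(
    ({ endpoint, pattern }) => `  SERVICE <${endpoint}> {\n    ${pattern}\n  }`
  );
  return `SELECT ${variables.join(" ")} WHERE {\n${blocks.join("\n")}\n}`;
}
```

A call with two endpoints whose patterns both mention, say, ?smiles returns a query in which the value bound at the first endpoint constrains the match at the second.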
2010-03-16T16:28:27Z
Samuel Lampa
Screencast: Experimental Prolog integration in Bioclipse
I wanted to test out some screencasting, so I chose to demo the (still experimental) SWI-Prolog integration in Bioclipse. The screencast shows how Prolog code (or a "Prolog knowledge base") can conveniently be stored inside Bioclipse's JavaScript environment (in a JS variable), loaded into the Prolog engine, and then queried, all from the JS environment. Finally, the results can be returned to the JavaScript environment as well, for further processing or output. Note that this is still at the experimental stage, so things are a bit rough around the edges!
2010-03-03T19:51:13Z
Samuel Lampa
Chemical ring and cage structures need rules (seemingly)
During my short stay at EBI, Egon had kindly arranged an opportunity to talk to Janna from the Steinbeck group, about some work she has done on searching for cage structures in molecules, using Prolog, so we met over a coffee, together with Nico who's now at the Steinbeck group (visiting).
Both Nico and Janna kindly gave a lot of good advice about research in general (as I'm currently looking into possibly doing a PhD somewhere), which I highly appreciated. :)
And we talked some about the original topic too :), that is the cage structure problem. The cage structure problem is kind of an extension to the problem of expressing rings, which has previously been reported as a problem for OWL-DL. So because of this, it is interesting that Janna came up with a working solution, using Prolog.
As a highlighting example from the DL-side, Michel Dumontier has done some work on representing molecules, including rings. But they also had to use rules, not plain OWL.
So that seems to be the general conclusion: in order to express ring structures (or extensions of them, such as cage structures), you'll need to use rules in some way.
Unfortunately my project is now running out of time, so I might not have much time to look more into this topic as part of my project :(. I will see if I can include this as a part of another course I still have to finish ("knowledge based systems in bioinformatics"), but that remains to be seen.
2010-03-01T22:31:23Z
Samuel Lampa
I just found an awesome way to merge git patches with opendiff
Have you ever been in the situation where someone sent you a patch and you want to take some things from it, but probably not all, and you definitely want to check each change before accepting it? If you have worked with git, of course you have. There are many ways of doing this. But maybe you have a favorite merge tool? For me this is opendiff, which shows my old version to the left, the new version I got from someone to the right, and the current merge of them at the bottom. I can also edit the bottom one and write whatever I want there. In other words, an awesome tool for manual merging and for adopting some, but not all, of the changes that have been sent to you. It took me a while (and some help from #git on freenode) to figure out how to use opendiff for this, but in the end it was well worth it. This is how you can do it:
1. Create a new branch and apply the patch. Copy the hash of the resulting commit (#hash below).
2. Check out the branch you want to apply the merged commit to.
3. Write: git difftool -t opendiff #hash
This opens the opendiff tool comparing your original with the new version, and when saving you can overwrite your original files and commit as usual.
The things you can do with a wizard . . .
Now I have started to get a feeling for SPARQL, but have you? Well, I do not want to force anyone to learn new languages all the time, so I began to develop a wizard. This wizard is far from done, but it does manage some functions at the moment, which is really cool. As you write an id or keyword, SPARQL queries against http://rdf.farmbio.uu.se/chembl/snorql/ are run on the fly, returning the values to the wizard. If you change your search, the old data is deleted and the new data displayed.

A search may now be done with keywords, SMILES or ChEBI id to find information about compounds. This search will expand as I implement biological networking to other knowledge bases (http://chebi.bio2rdf.org/sparql as an example). If the checkbox for target is checked, a search with protein ids, keywords, EC numbers etc. takes place instead. As you write, the table fills up with various data depending on what you search for.

The upper picture searches for targets that have some connection to sodium channels. The bottom picture searches for a ChEBI id from a SMILES. Unfortunately I don't yet know how to distinguish between strings written in the box, so the line has to end with a # at the moment. Working on solving that...
Interesting SPARQL queries for QSAR and PCM data!
The following two SPARQL queries are really interesting for QSAR projects and proteochemometric (PCM) projects. By accessing ChEMBL data via RDF with SPARQL I can easily retrieve the data necessary to build up these kinds of projects.

For a QSAR project the following query could be used:

var forQSAR = "\
PREFIX chembl: \
PREFIX blueobelisk: \
SELECT DISTINCT ?act ?ass ?conf ?mol ?SMILES ?val ?unit WHERE { \
 ?act chembl:type \"IC50\" ; \
 chembl:onAssay ?ass; \
 chembl:forMolecule ?mol;\
 chembl:standardValue ?val;\
 chembl:standardUnits ?unit.\
 ?mol blueobelisk:smiles ?SMILES. \
 ?ass chembl:hasTarget ; \
 chembl:hasConfScore ?conf. \
}";

Since I run my queries through Bioclipse, the SPARQL query is stored in a variable, which makes the following calls easier:

var qsar = rdf.sparqlRemote("http://rdf.farmbio.uu.se/chembl/sparql", forQSAR)
chembl.saveCsv("/QSAR/q",qsar)

The query returns unique ids for activities (?act), molecules (?mol) and assays (?ass), SMILES (?SMILES) for the molecules, values (?val) and units (?unit) for the activities, and confidence values (?conf). And it is really easy to expand the query to return more data!

The query for PCM returns unique ids for targets (?target), molecules (?mol) and PubMed entries (?pubmed), SMILES (?SMILES), protein sequences (?seq), various classifications (?l4, ?l5, ?l6), activity types (?type) and activity values (?val). The activities are narrowed down to include only IC50 and Ki, and the ion channels should only be Na (the last two lines in the query). The query looks like the following:

var kic50na ="\
PREFIX chembl: \
PREFIX blueobelisk: \
SELECT DISTINCT ?type ?target ?pubmed ?l4 ?l5 ?l6 ?mol ?SMILES ?val ?seq \
WHERE {\
 ?act chembl:type ?type;\
 chembl:onAssay ?ass;\
 chembl:forMolecule ?mol;\
 chembl:standardValue ?val.\
 ?ass chembl:hasTarget ?target;\
 chembl:extractedFrom ?journal.\
?ass chembl:hasTargetCount 1 .\
?journal ?pubmed.\
 ?mol blueobelisk:smiles ?SMILES.\
 ?target a ;\
 chembl:classL3 \"VGC\" ;\
 chembl:classL4 ?l4 ;\
 chembl:classL5 ?l5 ;\
 chembl:classL6 ?l6 ;\
 chembl:sequence ?seq.\
FILTER regex(?l6, \"NA\")\
FILTER (?type = \"Ki\" || ?type = \"IC50\")\
}";

One problem encountered here was that assays are not always specified for one target but for many, which led to the same information being returned for different targets. This was solved by Egon, who created ?ass chembl:hasTargetCount 1. That line says that an assay should contain only one target, which gives accurate data for PCM.
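To see what the single-target restriction achieves, here is a small sketch in plain JavaScript that does the equivalent filtering on rows that have already been fetched. The row layout with assay and target fields and the function name singleTargetRows are assumptions made for this illustration; in the query itself the restriction is expressed with ?ass chembl:hasTargetCount 1.

```javascript
// Keep only the rows whose assay is linked to exactly one target.
// Rows from multi-target assays would otherwise repeat the same
// measurement once for every target.
// (Hypothetical helper operating on already-fetched rows.)
function singleTargetRows(rows) {
  const targetsPerAssay = new Map();
  for (const { assay, target } of rows) {
    if (!targetsPerAssay.has(assay)) targetsPerAssay.set(assay, new Set());
    targetsPerAssay.get(assay).add(target);
  }
  return rows.filter(({ assay }) => targetsPerAssay.get(assay).size === 1);
}
```

Doing this in the query, as the post does, is preferable because the rows that would be discarded are never transferred.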
Review of Towards pharmacogenomics knowledge discovery with the semantic web
Jonathan Alvarsson and I wrote a review of the article Towards pharmacogenomics knowledge discovery with the semantic web. Have a look!
Background presentation
I gave this presentation to the department last week. It is basically a presentation about the background and progress of the project. Enjoy!

Pharmaceutical Knowledge retrieval through Reasoning of ChEMBL RDF
Bioclipse is finalist for the Eclipse Community Awards 2010
Bioclipse is one of three finalists for the Eclipse Community Awards 2010 in the category Best RCP Application. I am looking forward to going to EclipseCon; it will be my first visit. I have also submitted a poster, and hope it will be among the selected ones (see the submission abstract).

UPDATED: My poster is now accepted!

I also constructed a screencast for the Community Awards submission.
Update post
I have so many half-finished sub-projects that I don't have anything interesting to blog about, hence this update post!

My sub-projects:
Looking into other syntax languages, especially Manchester OWL syntax. In Journal Club we read the article Towards pharmacogenomics knowledge discovery with the semantic web and encountered the Manchester syntax. I will blog about it when I'm done.
Speaking of Journal Club, I'm also writing a review together with Jonathan. And I have to find time to read the next article...
Moss Manager needs to be rearranged, since the net.bioclipse.rdf plug-in no longer returns lists of ArrayList. It now, amazingly, returns string matrices, which will make things so much easier, especially when I'm only interested in the SMILES part of the SPARQL outcome.
I'm also working on a presentation that I'm going to give on Thursday 4/3. I will try to put it up here afterwards. It's about the background and status of this project. (I spent hours on creating a Gantt chart in Excel... well, I'm not friends with Excel anymore.)
And last, I'm trying to structure a new Bioclipse plug-in for retrieval of drug/compound, target and other valuable information, i.e. querying ChEMBL in an effective and powerful way with SPARQL.

Lesson to learn from this post: use TODO lists =0)




