## Thursday, May 21, 2009

### Bioclipse beta5: really the last one now

Bioclipse beta 5 was just released by Ola, and the team had some bad days over a problem that surfaced after the merge of an important branch involving the managers we use to allow scripting of Bioclipse.

In the end, Jonathan found a workaround for the problem, even though we still have no clue what the exact cause was. Additionally, Arvid implemented one of the last missing features of the JChemPaint editor: the ability to draw bonds in any arbitrary direction, and the ability to create a new bond to an already existing atom. This really seems to be the last beta before the 2.0 release candidate. So, head over to SourceForge, as it is now time to report those smaller things you would like to see improved.

The beta has many really nice features, and we will have much to write about in later blogs. One thing I particularly like is the support for (really) large SD files; the above screenshot shows an 800 MB file with StarLite structures, though we also tried files larger than 1 GB. There is a 2D-Structure tab, which will zoom in on the structure in a regular JChemPaint editor.
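To give an idea of why files larger than memory are manageable at all: a minimal sketch, not the Bioclipse implementation, of the record-by-record approach to SD files, assuming a well-formed file with the standard `$$$$` record separator:

```python
# Sketch only: stream an SD file one record at a time instead of loading
# the whole (possibly multi-gigabyte) file into memory.

def iter_sdf_records(path):
    """Yield one SD-file record (molfile plus property block) at a time."""
    record = []
    with open(path) as handle:
        for line in handle:
            if line.strip() == "$$$$":  # end-of-record marker in SD files
                yield "".join(record)
                record = []
            else:
                record.append(line)
```

With an iterator like this, a viewer only ever has to materialize the record currently on screen.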

For the Bioclipse scripting, I can just encourage you to browse this blog for example scripts.

There are many extensions currently being developed, around the globe, which will extend the basic Bioclipse workbench towards particular use cases. While surely these will get blogged about in detail later, I do want to briefly mention them. In the works are features for: QSAR, Decision Support, Speclipse (NMR and MS spectrum handling), Resource Description Framework, a StructureDatabase, Metabolomics, Medea (MS spectrum and fragmentation prediction), XMPP, and much more.

The focus of Bioclipse 2.1 will be on bioinformatics: sequence handling, BLAST, better PDB/CIF support for protein structures, and who knows what more.

## Monday, May 18, 2009

### Open Data: license, rights, aggregation, clean interfaces?

A recent post by Cameron on his visit last week with Nico, Peter and Jim discussed Open Data licensing. This led to an interesting discussion on these matters, and to my questions about why people care so much about public-domain-only data (or data licensed with the PDDL or CC0).

Open licensing for data has not matured as much as it has for software, and international law seems more confusing on these issues. I guess that is because data aggregation has been around since well before the computer era. The PDDL and CC0 both try to overcome this fuzziness. But there is another issue we need to keep in mind: a lot of useful data was aggregated and made Open before these licenses came about and uses, for example, the GNU FDL, such as the NMRShiftDB.

#### Rights

Right now, there are two Open Data camps, much like the BSD-vs-GPL wars in Open Source: one believes in waiving any rights on the data, arguing that facts are free; the other believes that data must be protected so it is not swallowed by big companies and lost to the community (e.g. the WolframAlpha arrangements are suspect).

Of course, both camps are not that far apart, and both believe Open is important. Interestingly, there are some noteworthy differences with the Open Source wars. I see parallels between the two that highlight an important difference: Open Source has algorithms (uncopyrightable) and implementations (copyrightable); Open Data has data (uncopyrightable) and aggregations (copyrightable). Open Source talks mostly about the implementation, not the algorithm; it's Open Source, not Open Algorithms, after all. In cheminformatics it is often the case that the algorithms are not even specified, and only the source truly exists.

However, Open Data, by its name, does not make that distinction. Data is fairly cheap, and its acquisition can be automated and computerized; aggregation, on the other hand, requires human involvement: curation, thinking about data models, and so on. This is where the added value lies. Consider the difference between an assigned NMR spectrum and the raw data returned by the spectrometer.

It is this added value that people want to protect, not the data itself. I think.

#### Aggregation

One important argument that tends to show up when people argue for the PDDL and CC0 is that they make data aggregation easier. This is most certainly true: if you can do whatever you like with a blob of data, that includes aggregating it with any other blob of data. Copyleft licenses, like the GNU FDL, however, require the aggregation to carry a compatible license too. It is the license incompatibilities that make this impossible. Or ... ?

Open Source has matured to such a point that it is fairly clear what the intended behaviour is regarding derivatives. An aggregation of software (typically referred to as a distribution) is only a derivative under certain conditions. This makes it possible to run proprietary software on top of GNU/Linux, which uses the GNU GPL but does not require software running on top of it to be GPL too. Unless... unless no clear, well-defined interface has been used, indicating a strong dependency. Now, surely, these things have not been confirmed to match actual law in court, but the intentions are clear.

#### Clean Data Interfaces?

Now, if we translate this to Open Data, would there be an equivalent of a clean interface? Can we build a data distribution with data under various licenses? I think we can! I am not a lawyer, though, so please consider this an invitation to discuss these matters...

Let's start simple... if I put a GNU FDL image in this blog by linking to it with an open, free, clean HTML interface (`<img src=""/>`), would that make my blog GNU FDL too? I don't think so. Surely, I would need to list the copyright owner, and would actually be required to include the GNU FDL in my blog too, but I hope that linking to the license text would suffice. (Let's skip fair use for the moment, and assume the use goes beyond fair use.) Question: am I not using a clean interface, and does that not keep the image's license from infecting my blog?

A more difficult example: consider rdf.openmolecules.net, which certainly aggregates facts, including data from the NMRShiftDB and DBPedia. I am using unique identifiers here, the NMRShiftDB compound ID and the DBPedia URL, which surely is GNU FDL, and use these to make an owl:sameAs statement. Again, please do not consider fair use, which this certainly is. But let's say I put some more DBPedia and NMRShiftDB data into this aggregation. The GNU FDL data on rdf.openmolecules.net would be separate RDF blocks, with proper dc:license and dc:author annotation. But each block would be part of a larger aggregation. The clean interface here is the Resource Description Framework.
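As a sketch of what such per-block annotation could look like, here is a small Python snippet that emits a Turtle block; all URIs are my own placeholders, and the dc:license/dc:author property choices simply mirror the annotations mentioned above, not the real service's output:

```python
# Hypothetical sketch: each aggregated RDF block carries its own
# provenance, so a larger aggregation can mix blocks under different
# licenses while each block declares the terms it was taken under.

OWL = "http://www.w3.org/2002/07/owl#"
DC = "http://purl.org/dc/elements/1.1/"

def annotated_block(subject, same_as, license_uri, author):
    """Return a Turtle block linking two identifiers with owl:sameAs,
    annotated with the license and author the data was taken under."""
    return (
        "@prefix owl: <%s> .\n"
        "@prefix dc: <%s> .\n"
        "<%s> owl:sameAs <%s> ;\n"
        "  dc:license <%s> ;\n"
        '  dc:author "%s" .\n'
    ) % (OWL, DC, subject, same_as, license_uri, author)

block = annotated_block(
    "http://rdf.openmolecules.net/?id=example",  # placeholder URI
    "http://dbpedia.org/resource/Methane",       # placeholder URI
    "http://www.gnu.org/licenses/fdl-1.3.html",
    "DBPedia",
)
```

The point of the sketch: the license annotation travels with the block, not with the aggregation as a whole.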

This second case does not affect only my rdf.openmolecules.net website; bio2rdf.org, for example, is in the same situation, and aggregates and distributes DBPedia's GNU FDL data (e.g. hexinanose). Does that make the whole of the bio2rdf database GNU FDL? They too use RDF as a clean interface.

#### Call for Discussion

Despite what one of the two camps would like to see, the mere fact that making data aggregations adds value will keep copyleft licenses around, and instead of trying to convince everyone of the virtues of PDDL- and CC0-like licenses, we should think about to what extent it really matters.

I can do my data analysis with data sources under various licenses. I can search and retrieve data from various sources with various licenses. What obstacles really exist that prevent us from doing science? Do the data interfaces we have now not provide enough technical means to address the license incompatibilities? They have in Open Source; why would that not apply to Open Data too?

## Friday, May 15, 2009

### ChemSpider and the RSC: where next?

Last Monday the CHMINF-L brought me the news that ChemSpider was acquired by the RSC (not the press release), followed by Twitter (my Twitter post) and FriendFeed (see this series).

Reading blogs used to be the way to get the news, but this has changed. Still, blogging gives more freedom and more space, and blog posts did soon follow. Chris was the first to blog about it:
> This is great news and I’m confident that it will be a move to even more openess in chemistry and cheminformatics. It will also allow the RSC to use Tony fantastic tools for even more semantic markup of articles. I’m looking forward to talking to everyone about the implications. For now, congratulations, Tony, and congratulations, RSC, for this great deal.
I think Tony himself was next:
> This is good for us for a number of reasons. Specifically we will no longer have to deal with our very significant resource limitations but more than that it lends credence and validation to the work that we have been doing over the past 2 years. It seems so long ago now but ChemSpider was first unveiled to the world at the ACS Spring meeting 2007. What began then only as a hobby project is now being recognized by the community as one of the primary resources for internet chemistry.
His network, and his insight into the required data curation, are what I think made ChemSpider a success.

Later views followed from Peter, Rich and Neil. I have only congratulations, which I hereby join in, and expect that only the future will tell us if our cheers are correct.

#### Where next?
As Tony indicated, the deal will practically mean better support for ChemSpider in terms of computing power, making it easier for them to make upgrades, hence better uptime, etc. It may, indeed, also mean more data, provided from the RSC archives, as suggested by Neil. More practically, I can imagine seeing Project Prospect contributing InChI-DOI links to ChemSpider very soon.

And this brings me to the two recommendations I have for ChemSpider at this moment:

1. now linked to a publisher, and with both text-mining efforts and expertise, focus on these InChI-DOI links, and, in particular, on those InChI-DOI links that involve papers describing measured properties of the molecules;

2. with the increased support, finish the Open Data work by making it easy for people to download the ChemSpider-OpenData subset. This, I believe, is crucial for wider adoption in the Open Data community, as Open Data that is practically impossible to download easily is not Open enough. Previous priorities may have focused on setting up a viable commercial alternative, but with the RSC backing, this can no longer be a reason not to do it.

Once more, congratulations to the ChemSpider-team and the involved RSC people, and very much looking forward to seeing how this will change chemistry for the better!

## Monday, May 11, 2009

### Which feature must I install for org.eclipse.zest?

Dear lazyweb!

I have been trying to figure out which Eclipse 3.4 feature I must install from the update site to get the org.eclipse.zest plugin in my environment.

I installed the Zest feature (which I am going to use to visualize an RDF network), but my workspace still complained that I did not have the plugin.

Maybe I should rerun Set Target Platform for our product, but I and others in the Bioclipse development community have been wondering: how can one know which feature to install via Software Updates... to get a particular plugin on one's machine?

Looking forward to hearing from you,

Kind regards,

Egon

### PubChem-CDK

PubChem-CDK is a project that runs CDK code on the PubChem data. As we speak, a Groovy script reads about 100 PubChem Compound XML entries per second into the database. Mind you, not the SDF they distribute, which uses a custom extension to overcome the limits of the real MDL SDF format.
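For illustration, reading such an XML dump at a steady rate without exhausting memory calls for a streaming parser. A minimal Python sketch, where the element names are my reading of the PC-Compound schema and should be checked against the actual files (the original script is Groovy, not this):

```python
import xml.etree.ElementTree as ET

def iter_cids(source):
    """Stream a PubChem Compound XML dump, yielding one CID per record.
    Each parsed subtree is cleared afterwards, so memory use stays flat
    no matter how large the dump is."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip any XML namespace
        if tag == "PC-Compound":
            for child in elem.iter():
                if child.tag.rsplit("}", 1)[-1] == "PC-CompoundType_id_cid":
                    yield int(child.text)
                    break
            elem.clear()  # free the subtree we just processed
```

The same pattern extends naturally to pulling atoms and bonds out of each record before clearing it.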

Right now, it has run the atom type perception algorithm on about 1M compounds, and has pretty good coverage of the organic chemistry domain. I will analyze the results statistically soon, but will likely use this data first to add some missing atom types to CDK 1.2.x. BTW, did you know only three carbon atoms failed? A C4- (CID:156031), a C3+ (CID:161072), and a C2+ (CID:161073). Would your cheminformatics library know what their properties are?

It is a really nice way of browsing PubChem, BTW. For example, did you know there are several boron compounds that have the substructure [N+]-[B+]-[N+]? Yes, three positive charges, right next to each other. For example (CID:3612285):

Well, neither did I. How was it synthesised? What are its spectral properties? How do they stabilise it? What magic counter ion? PubChem, unfortunately, does not link to the primary literature, and there is no free source for that available. A failure in chemistry. The source points to ChemDB, but the entry in that database does not shed light on this either.

Anyway, more on this later. Much more, as I plan to run many CDK algorithms on this data set.

## Friday, May 08, 2009

### Nomination of the CDK for a SF Community Award

Just hit the below icon, and use 140 characters to explain why you think the CDK should be nominated. Please select Best Project for Academia, and we might stand a chance:

## Thursday, May 07, 2009

### /me is having Bioclipse/XMPP/RDF fun

Johannes asked me what the Lipinski Rule of Five for farnesol is, in reply to the matching XMPP cloud service. Thanks to DBPedia for providing a machine-readable form of the Wikipedia entry:

Here's the solution (yes, suboptimal, but we were hacking on XMPP support in Bioclipse anyway), which shows the structure in JChemPaint and Jmol as a bonus (gist:107507):
```javascript
// Today, Johannes challenged me to use Bioclipse and XMPP to calculate the Lipinski Rule of Five for
// http://en.wikipedia.org/wiki/Farnesol
query = "Farnesol"

// Zero: clear the console
js.clear();
js.print("Query: " + query + "\n");

// One: connect to the XMPP hive, and make contact with the CDK descriptor service here in Uppsala
xmpp.connect();
var service = xmpp.getService("descriptor.ws1.bmc.uu.se");
service.discoverSync(5000);
service.getFunctions();
var func = service.getFunction("LipinskiRuleOfFive");

// Two: take advantage of RDF, DBPedia
store = rdf.createStore()
rdf.importURL(store, "http://dbpedia.org/data/" + query + ".rdf")
rdf.importURL(store, "http://dbpedia.org/data/" + query + "/section1/Chembox_Identifiers.rdf")

// Three: run the SPARQL query and extract the SMILES from the List<List<String>>, and remove
// the '@en' suffix
var sparql = "PREFIX dbprop: <http://dbpedia.org/property/> SELECT ?o WHERE { ?s dbprop:smiles ?o }"
smiles = rdf.sparql(store, sparql).get(0).get(0)
smiles = smiles.substring(0, smiles.length()-3)

// Four: create a CML document
propane = cdk.fromSMILES(smiles);
js.print("Molecule SMILES: " + smiles + "\n");

// Five: call the function
result = func.invokeSync(propane.getCML(), 900000);
cmlReturned = xmpp.toString(result);

// Six: tune the CML so that the Bioclipse CML reader is happy
cmlReturned = cmlReturned.replace("xsd:int", "xsd:integer")

// Seven: extract the Lipinski Rule of Five score
propertyList = cml.fromString(cmlReturned);
value = propertyList.getPropertyElements().get(0).
  getScalarElements().get(0).getValue()
js.print("Lipinski Rule of Five: " + value + "\n")

// Eight: while at it, let's create a 2D and open in JChemPaint
service = xmpp.getService("cdk.ws1.bmc.uu.se");
service.discoverSync(5000);
service.getFunctions();
func = service.getFunction("generate2Dcoordinates");
mol = cdk.fromSMILES(smiles)
result = func.invokeSync(mol.getCML(), 900000);
cmlReturned = xmpp.toString(result);
mol2d = cdk.fromCml(cmlReturned);
ui.open(mol2d)

// Nine: oh, and a 3D model in Jmol
func = service.getFunction("addExplicitHydrogens");
result = func.invokeSync(mol.getCML(), 900000);
mol = cdk.fromCml(xmpp.toString(result));
func = service.getFunction("generate3Dcoordinates");
result = func.invokeSync(mol.getCML(), 900000);
mol3d = cdk.fromCml(xmpp.toString(result));
file = "/Virtual/foo.cml";
ui.remove(file)
cdk.saveCML(mol3d, file);
ui.open(file)
```