Friday, December 30, 2016

"10 everyday things on the web the EU Commission wants to make illegal" #03

The third example in this series is not too hard to explain.

03. Posting a blog post to social media

Because many of you are familiar with blogging and many of you blog yourself, you know what this one is about. The way I understood it, this will become illegal, and I would just have to figure out if and how much I need to pay Kerstin to share this wonderful story about clinical trials in Wiki{pedia|data} on Google+:

As with all original examples, Julia's post provides a lot of legal detail, which I reshare here for this item, because you may initially think this is just about news from newspapers, but here too, wording matters:

And while I argued a long time ago that there are many kinds of blogging (it's just a medium, like paper), many blogs can certainly be considered of a journalistic nature. In fact, some people even use their blog to get press tickets for scientific conferences (but that's another story ;).

Well, if you are still reading this series, maybe it is time to head over to the website.

"10 everyday things on the web the EU Commission wants to make illegal" #01

OK, after moving on to the second example, I realized the subtle difference with the first: I had mixed up examples 01 and 02, and the previous post was really discussing Julia's second example. Example 01 is really about snippets of publications, like quotes. Now, before you argue that quoting is legal, realize that this depends on specifics of various jurisdictions, and, as Julia writes:

"[..] in many EU countries, sharing an extract without further commenting on its substance is not covered by that exception".

So, I hope this post provides enough commenting and substance. But that clearly does not apply to the modern way of disseminating science via Twitter.

01. Sharing what happened 20 years ago

Anyway, now I have a kickstart for the first example too: both tweets were actually about news from close to twenty years ago, as both publications are about 20 years old! So, take the first tweet with the title of the Nature News article, but now with a quote.

This will be illegal for commercial entities, and possibly for me too: there is no significant commenting. It means that covering the news of the past will be practically illegal, or at least very hard, for some, where "some" is ill-defined, because the proposal is very unclear about who can and who cannot.

Oh, and if you're not already freaked out: it's retroactive. That is, happy cleaning up the past 20 years of dissemination you did, figuring out where this example applies. Nice excuse to not do research!

Wednesday, December 28, 2016

"10 everyday things on the web the EU Commission wants to make illegal" #02

OK, let me first say that I hope Julia Reda does not consider herself a publisher. If you did not get the above joke, then continue reading.

This post is an attempt to translate the proposed copyright laws to, well, research. The choice of words is critical here. I deliberately did not say: academic, university, scholarly. I hope my hesitation will become more clear after reading this post too. I am not a lawyer (IANAL), but it is also important to realize that the exact meaning of law often only becomes clear when tried in court, where judges will create de facto examples of what really is allowed and not, following the intention of the law. I am a researcher and a teacher. I implement this by being a strong proponent of Open Science.

This post is about the proposed clean-up of the European copyright situation, or so it was meant to be. Practice shows differently, unfortunately. The problem is what I see happening around me (and have written and spoken about). I see a huge gap between how the previous generation of scholars thinks about research dissemination and copyright, and how modern society sees this. And having read more about it than I should as a scientist, I cannot unsee all the contradictions in there.

This post will, therefore, take the 10 example activities that will soon be illegal, if we vote badly in upcoming elections and don't follow Julia's knowledge, as a starting point to highlight some of the problems I expect will happen, based on observations from doing research in the European Community.

Before I start off, one more disclaimer. The proposal is not hard to read, but like other legal works, it uses a specific language. Words that have a common or scientific meaning may have a different meaning in law. So, when this post talks about a "hyperlink", I may get the legal meaning wrong. I strongly rely here on more legally knowledgeable people, like Julia. But as she also indicated in her #33C3 talk, legal definitions can be tweaked by newer laws. Two terms that are critical here, not well-defined (IMHO) but central in the proposal, are: commercial (see e.g. Breaking News: CC-NC only for personal use!) and publisher. But that's part of the problem with this proposal.

Finally, what is critical, we must not let ourselves be deluded: law only exists as a formal way to agree on things. Increasingly, very sadly, it is being used to force people into criminality.

02.a Tweeting a creative news headline

I will actually split this up into two examples, one of which will be illegal, the other also, but depending on how far the term "publisher" extends. That is, are press outlets the only intended copyright holders here, or also scientific journal publishers?

This tweet reposts a news item from Nature News of about 18 years ago. This will be illegal for commercial websites. So, how does that affect me as a scholar? If I do this on my personal behalf (my social accounts are not Maastricht University accounts), it probably still affects me. As Julia points out, first of all, Twitter is commercial, and they may or may not pay Springer+Nature for being allowed to tweet this....

WTF? Ho, ho, ho... you're not saying that tweeting the title of an article is illegal???

Actually, yes, that's exactly what this proposal is saying. So, let me continue. If Twitter does not pay Springer+Nature for the right to tweet this, I may have to. May, because it depends on a court to formally decide if I am commercial or not, if I ever get challenged.

It's weird, isn't it? I'm making a free advertisement here, and I may need to pay money to have the right to do so.

However, and this is also critical, this applies to commercial entities. Some argue that some universities are commercial; what about SMEs? What about H2020 projects, where SMEs often make up a significant part of the project? Are they commercial? Can a project like eNanoMapper still make such tweets, or would that be illegal? Who knows, but even if it probably would not be, will they take the risk? Can they afford to? How much will it cost to make a decision? They will likely not bother and just not do it, inhibiting scholarly dissemination.

02.b Tweeting a creative news headline

Well, OK, I cannot copy/paste the title of the article and still do the advertisement. But I stress that this practice is very common among scholars; it's one of the foundations of #altmetrics.

Now, the above example used Nature News, but what about Nature itself? Or Cell by Elsevier? This is where my legal knowledge fails. At this moment I am not sure whether scholarly journals are the rights owners this proposal has in mind, but I doubt that the owners and legal departments of the big scientific publishers will say otherwise.

So, will the next tweet still be legal?

I honestly do not know, but my current guess is this will be illegal for commercial entities.

OK, to not make this post too long, I will save the next example for a next post. To be continued!

Friday, December 23, 2016

Facts, Data and Open Data

Source, CCZero.
I was recently asked about my experiences around data sharing, and in particular the legal aspects of it. Because whether we like it or not (I think "we" generally do not like it and I see many scholars ignore it), society has an impact on scholarly research. Particularly, copyright and intellectual property (IP) laws make research increasingly expensive. I wrote up the following aspects related to that discussion. I am not a lawyer, and these laws are different in each country (think about facts, governmental output, etc). Your mileage may vary.

#1 Don't give away your copyright to any single other party

Scholars are accustomed to this. For a very long time we would freely give our research IP to publishers. By selling that IP, publishers would fund the knowledge dissemination (often with huge profits). But institutes are starting to think about this, and are backtracking on it. Bottom line: do not give away your copyright.

The importance of this is that you would lose all control over the data. You would no longer be able to give your data to others, because it is no longer yours. Also, you could never repurpose the data anymore, because it is no longer yours. Instead, give others the rights to work with the data, by removing copyright or by giving people a suitable license (see the next point).

#2 The three pillars of Open: the rights to (re)use, modify, and redistribute

Really, these three points are critical: they give anyone the rights to work with the data.

(Re)use is clear.

The right to modify is critical because it is needed for changing the format in which the data is shared (e.g. create ISATab-Nano) but also for data curation!

Redistribute is the right that anyone needs to make your data available to others. In fact, all those EULAs (end-user license agreements) that all of us sign when creating an online account give Google, Facebook, etc, etc the right to reshare (some of) the data you share with them. Clearly, without this right, ECHA, eNanoMapper, CEINT, etc cannot reshare the data with others.

#3 Copyright

Copyright law around data is very complex. For example, there are huge differences between law in European countries and in the USA. The latter, for example, has the concept of "public domain" that many European countries do not have (though we still happily use that term here too). In Europe, databases have database rights. Facts are excluded, but I have yet to find a clear statement of what a "fact" is. But a collection of facts is the outcome of a creative process (like any EC FP7 or H2020 project) and hence has copyright.

For starting projects, the consortium agreement (CA) defines how this is dealt with. And like you can give the copyright of a research paper to a publisher, a CA can define that all partners of a project have shared IP. That ensures they can all use it, but it also means it becomes really hard to share it outside the consortium. Instead, my recommendation is to keep the IP with the data creator, and make it available within the consortium with a license. Or just waive the copyright. Copyright with one legal department can already be complicated, and if you have multiple legal departments discussing IP, it certainly does not become easier.

Of course, consensus among all partners is best. I also stress that laws are just tools. Any partner can give others more rights without problems. They cannot hide behind laws. Ideally, each project proposal starts with a formal consensus on how data will be made available. Solve that before you get the money. But I will write more about that later during these holidays.

#4 Licenses and waivers

The open source community realized these issues decades ago. First with source code, leading to Open Source Initiative (OSI)-approved licenses, providing the aforementioned rights. For source code, there are also so-called waivers. The difference with licenses is that the latter give you specific rights, while a waiver "waives" away any rights any law (from any jurisdiction) might automatically give. For the three "pillars" the outcome is the same: you will have those three rights. In case of a waiver, you just get any right you can think of too, whereas a license is limited to those rights specified in the license.

Now, these ideas developed in the open source community found their way to the "Open Access" (OA, for documents) community and the "Open Data" community in the last 10 years. Some lobbying forces managed to clutter the definition of Open Access, which is why the community talks about green OA and gold OA. The first is not really Open and does not give you all three rights. Gold Open Access does. A green OA article you cannot reshare.

For data there are basically two options:

  • licenses: Creative Commons (CC) license
  • waiver: CCZero (not a license)

For the first option, the licenses: the CC licenses come in various flavors, and this is implemented with "clauses". For example, there is an "attribution" clause, which creates the CC-BY license as you know it from gold Open Access journals. This clause gives you the three rights, but also requires you to cite where you got the data.

A second CC clause is the ND (No Derivatives) clause, which defines that no one can make derived products. Effectively, it removes one of the three rights. It exists with the idea that some things are not meant to change. Think for example about the JRCNMxxxx codes for nanomaterials. No one should be changing them, because it would defy the purpose of the definition of those codes.

A third CC clause is the NC (Non-Commercial) clause. This clause specifies that you can only use the data for non-commercial purposes. Some publishers use it in their implementation of "Open Access", which basically says that only some people get the three basic rights. Now, who "some" is, is not clearly defined. Not legally, not practically. No one really knows when something is commercial and when not. Some legal experts have argued that some American universities are commercial enterprises (source needed). In Europe, SMEs are clearly commercial entities.

A final CC clause is the SA (Share Alike) clause, which requires that people redistributing your data also make it available under the same license. In the open source community this is referred to as "copyleft", and it has upsides and downsides.

I stress that in the case of licenses, no IP is reassigned and the producers of the data remain owners of the IP.

At a recent NanoSafety Cluster meeting I gave a presentation about these matters and the slides are available here.

Sunday, December 18, 2016

The SWAT4LS poster about eNanoMapper

SWAT4LS was once again a great meeting. I doubt I will find time soon enough to write up notes, but at least I can post the eNanoMapper poster I presented, which is available from F1000Research:

Willighagen E, Rautenberg M, Gebele D et al. Answering scientific questions with linked European nanosafety data [v1; not peer reviewed]. F1000Research 2016, 5:2848 (poster) (doi: 10.7490/f1000research.1113520.1)

Sunday, November 13, 2016

OpenTox Euro 2016: "Data integration with identifiers and ontologies"

Results from a project by MSP students.
J. Windsor et al. (2016): Volatile Organic Compounds: A Detailed Account of Identity, Origin, Activity and Pathways. Figshare.
A few weeks ago the OpenTox Euro 2016 meeting was held in Rheinfelden at the German/Swiss border (which allowed me a nice stroll across the Rhine into Switzerland and by a nice x-mas countdown clock). The meeting was co-located with eNanoMapper-hosted meetings, where we discussed, among other things, the nanoinformatics roadmaps that outline where research in this area should go.

There were many interesting talks, around various data initiatives, adverse outcome pathways (AOPs) and their links to molecular initiating events (MIEs), and ontologies (like the AOP ontology talk by ). In fact, I quite enjoyed the discussion with Chris Grulke about ontologies during the panel discussion. Central was the question of where the border lies between data and ontological concepts. Some slides are available via Lanyrd.

During the Emerging Methods and Practice session hosted by Ola Spjuth, I presented the work at the BiGCaT department into identifier mapping and the use of ontologies for linking data sets.

The presentation integrates a lot of things I have been working on in the last few years, and please note the second slide with all people I have worked with on things presented in these slides.

Recent presentation: "Open Access: a practical perspective"

Source: Wikimedia Commons
For a local grant acquisition course I recently gave a presentation about Open Access (OA). My interest in OA started from my Open Science background, where lack of access to literature was a serious problem. Journals were invented to make knowledge dissemination easier, but many publishers are stuck with outdated technologies that keep their knowledge dissemination from catching up with the 21st century. BTW, OA to me is the kind that actually really helps knowledge dissemination and allows:
  1. download and use (text mining!)
  2. modification (format change!)
  3. redistribution (allow others to read it too! share your modifications!)
There are several stories around showing that fast knowledge exchange saves lives (is there an overview of well-documented examples?). Honestly, I would be surprised if people did not also die because of disseminated knowledge, but then it is from misuse of knowledge, and not because of knowledge denied. And this is what access to knowledge can mean:
It shows that you can get far with access to the right knowledge (here in the form of data). This must be a right every human has. In fact, it is, in part, but as so often, legal wording complicates things. Wikipedia has a good overview. Like with free speech, it tries to find a balance between the rights of all people: the right of one cannot restrict the rights of others. Well, I don't know if "cashing in" is a human right, but surely many people believe so.

And not every human has the opportunity that Pepke had. Access to knowledge is a serious problem. A problem I am facing every week myself, and that is while I find myself at a relatively well-equipped Maastricht University Library. A recent study found that even researchers at my university find Sci-Hub an important resource, as can be seen in the slides below. I do not encourage Sci-Hub. The legal basis is unclear, but at least it has not been found illegal at this moment (as far as I could keep up with the process). And there are many alternatives, which I blogged about earlier.

Fact is, we have a knowledge dissemination issue. And that was the main message of my presentation. It is easy to solve as an author: don't give away your IP to publishers, and choose an Open Access license for your work (the gold OA version, as green OA is like the Rolex you buy for 10 euros at the black market).

And I'll end with this quote from John Oliver:

"Knowledge dissemination: a topic you know so little about, you think the best kind of dissemination is a Nature journal ReadCube."

Pepke, S., Steeg, G. V., Sep. 2016. Comprehensive discovery of subsample gene expression components by information explanation: therapeutic implications in cancer. bioRxiv, 043257+.

Friday, November 11, 2016

New paper: "SPLASH, a hashed identifier for mass spectra"

I'm excited to have contributed to this important (IMHO) interoperability paper around metabolomics data: "SPLASH, a hashed identifier for mass spectra" (doi:10.1038/nbt.3689, readcube:msZj). A huge thanks to all involved in this great collaborative project! The project is fully open source and coordinated by Gert Wohlgemuth, the lead author on this paper. It provides an implementation of the algorithm in various programming languages, and I'm happy that the splash functionality is available in the just released Bioclipse 2.6.2 (taking advantage of the Java library). An R package by Steffen Neumann is also available.
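The value of such a hashed identifier is that the same spectrum always yields the same short string, so independent databases can be linked on it without any coordination. Here is a minimal Python sketch of that general idea; to be clear, this is not the actual SPLASH algorithm (use the open source libraries for that), and the toy_spectral_hash name and the canonicalization choices are my own:

```python
import hashlib

def toy_spectral_hash(peaks, prefix="toyhash10"):
    """Sketch of a hashed spectrum identifier: canonicalize the peak
    list, then hash it. NOT the real SPLASH algorithm, just the concept."""
    # Canonical form: peaks sorted by m/z, fixed decimal precision, so
    # the same spectrum always yields the same string regardless of
    # the order in which peaks were recorded.
    canonical = ";".join(
        f"{mz:.4f}:{intensity:.4f}" for mz, intensity in sorted(peaks)
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:10]}"

# A made-up (m/z, intensity) peak list:
spectrum = [(273.0, 999.0), (289.0, 57.0), (290.0, 14.0)]
print(toy_spectral_hash(spectrum))
```

Because the peak list is canonicalized before hashing, two databases storing the same spectrum in different peak orders still compute the same identifier, which is exactly what makes cross-database linking work.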

This new identifier greatly simplifies linking between spectral databases and will in the end contribute to a Linked Data network. Furthermore, journals can start adopting this identifier and list the 'splash' for mass spectra in documents, allowing for simplified dereplication and finding additional information about spectra.

There are several databases that have adopted the SPLASH already, such as MassBank, HMDB, MetaboLights, and the OSDB published in JCheminf recently (doi:10.1186/s13321-016-0170-2).

Screenshot snippet of a spectrum in the OSDB.

PS. I personally don't like the idea of ReadCubes (which I may blog about at some point) and how they have been pitched as a "legal" way of sharing papers, but this journal does not have a gold Open Access option, unfortunately.

Wohlgemuth, G., Mehta, S. S., Mejia, R. F., Neumann, S., Pedrosa, D., Pluskal, T., Schymanski, E. L., Willighagen, E. L., Wilson, M., Wishart, D. S., Arita, M., Dorrestein, P. C., Bandeira, N., Wang, M., Schulze, T., Salek, R. M., Steinbeck, C., Nainala, V. C., Mistrik, R., Nishioka, T., Fiehn, O., Nov. 2016. SPLASH, a hashed identifier for mass spectra. Nature Biotechnology 34 (11), 1099-1101.

Sunday, October 16, 2016

New paper: "XMetDB: an open access database for xenobiotic metabolism"

Back in 2013 at the OpenTox conference in Mainz I spoke with Ola, Patrik, and Nina. They were working on a database for CYP metabolism, XMetDB, which I joined on the spot. The database has Open Data, an Application Programming Interface (API), is Open Source, and contains a good amount of experimental detail, like the specific enzyme involved and the actual atom mapping of the reaction. A few weeks ago, the paper describing the database was published in the Journal of Cheminformatics (doi:10.1186/s13321-016-0161-3). It's not perfect, but we hope it is a seed for more to follow.

The data, it turns out, is really hard to come by. While I was adding data to the database for the best-selling drugs, it was hard to find publications where a human experiment was done (many studies use rat microsome experiments). Not only does that make it hard to identify the specific CYP enzyme, it is also not the human homologue. BTW, since the background of this paper is to create a knowledge base for computational prediction of CYP metabolism, ideally we would even have a specific protein sequence, including any missense SNPs affecting the 3D structure of the enzyme.

However, even for the (at least then) best-selling drug aripiprazole, literature was really hard to find! There is a lot of literature just copy/pasting knowledge from other papers, and those other "papers" may in fact be the information sheet you get when you buy the actual drug. Alternatively, personal communication and conference posters get cited as primary literature too. All this only stresses the importance of a database like this.

At this moment the project is stalled. None of the currently involved groups has funding for continued development. I guess collaborations are welcome! ChEMBL 22 now has metabolism data for compounds, but I have not explored yet whether it has all the details for the transformations needed for XMetDB. At the very least, it may serve as a source of primary literature references.

Spjuth, O., Rydberg, P., Willighagen, E. L., Evelo, C. T., Jeliazkova, N., Sep. 2016. XMetDB: an open access database for xenobiotic metabolism. Journal of Cheminformatics 8 (1). doi:10.1186/s13321-016-0161-3

Friday, September 30, 2016

NanoSafety Cluster presentation: Open Data & NSC Activities

Two weeks ago (already!), the NanoSafety Cluster (NSC) organized two meetings. First, on Wednesday afternoon, there was the NSC half-yearly meeting. Second, on Thursday and Friday, in the beautiful Visby on Gotland, the 2nd NanoSafety Forum for Young Scientists took place. I ran an experiment there, which I will blog about later. Here, please find the slides of my presentation about Open Data that I gave on Wednesday:

Oh, and I also presented a few slides about the Working Group 4 activities:

Monday, September 12, 2016

Metabolite identifier mapping databases

Caffeine metabolites. Source: Wikimedia.
If you want to map experimental data to (digital) biological pathways, you need to know which measured datum matches which metabolite in the pathways (that also applies to transcriptomics and proteomics data, of course). However, if a pathway does not use identifiers from a single database, or your analysis platform outputs data with CAS registry numbers, then you need something like identifier mapping. In Maastricht we use BridgeDb for that, and I develop the metabolite identifier mapping databases, which provide the mapping data to BridgeDb, which performs the mapping.

However, identifier mapping for metabolites is non-trivial, and I won't go into details in this post. The mapping databases that I have been releasing under the CCZero waiver on Figshare combine several data sources. When I took over the building of these databases, they used data from the Human Metabolome Database (doi:10.1093/nar/gks1065). They still do. However, I added ChEBI (doi:10.1093/nar/gkv1031) and Wikidata as data sources. The latter I need to support identifiers from, for example, KNApSAcK (doi:10.1093/pcp/pct176).
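Conceptually, such a mapping database is a big lookup table from (system code, identifier) pairs to equivalent identifiers in other systems, which BridgeDb then answers queries against. A tiny Python sketch of that idea, with made-up example rows (this is not the BridgeDb API, and the map_metabolite helper and the specific identifier pairs are illustrative only):

```python
# Hypothetical mapping table keyed by (system code, identifier). The
# system codes follow the BridgeDb-style convention used in the release
# statistics (Ch = HMDB, Ce = ChEBI, Wd = Wikidata, ...); the rows are
# made-up examples, not verified mappings.
MAPPINGS = {
    ("Ch", "HMDB0000058"): {("Ce", "CHEBI:17489"), ("Wd", "Q170282")},
}

def map_metabolite(source_system, identifier, target_system):
    """Return all identifiers in target_system equivalent to the input."""
    equivalents = MAPPINGS.get((source_system, identifier), set())
    return {ident for system, ident in equivalents if system == target_system}

print(map_metabolite("Ch", "HMDB0000058", "Ce"))
```

The real databases, of course, hold tens of thousands of such rows (see the numbers below) and are queried through BridgeDb rather than a Python dictionary, but the lookup semantics are the same.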

So, this weekend I released a new mapping database, based on HMDB 3.6, ChEBI 142, and data from Wikidata from September 7. Here are the total numbers of identifiers and the changes compared to the June release for the supported identifier databases:

Number of ids in Kd (KEGG Drug): 2013 (unchanged)
Number of ids in Cks (KNApSAcK): 4357 (unchanged)
Number of ids in Ik (InChIKey): 52337 (unchanged)
Number of ids in Ch (HMDB): 41520 (6 added, 0 removed -> overall changed +0.0%)
Number of ids in Wd (Wikidata): 22648 (195 added, 10 removed -> overall changed +0.8%)
Number of ids in Cpc (PubChem-compound): 30699 (154 added, 36 removed -> overall changed +0.4%)
Number of ids in Lm (LIPID MAPS): 2611 (unchanged)
Number of ids in Ce (ChEBI): 131580 (4 added, 6 removed -> overall changed -0.0%)
Number of ids in Ck (KEGG Compound): 15968 (unchanged)
Number of ids in Cs (Chemspider): 24948 (10 added, 2 removed -> overall changed +0.0%)
Number of ids in Wi (Wikipedia): 4906 (unchanged)
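For transparency, the "overall changed" percentage in these reports is simply the net change relative to the current total. A quick check reproduces the Wikidata and PubChem lines above:

```python
def overall_change(total_now, added, removed):
    """Net change in identifier count as a percentage of the current total."""
    return 100.0 * (added - removed) / total_now

print(f"Wd:  {overall_change(22648, 195, 10):+.1f}%")   # Wd:  +0.8%
print(f"Cpc: {overall_change(30699, 154, 36):+.1f}%")   # Cpc: +0.4%
```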

An overview of recent releases (I'm trying to keep a monthly schedule) can be found here, and the version I released this weekend has doi:10.6084/m9.figshare.3817386.v1.

Friday, September 09, 2016

Doing science has just gotten even harder

Annotation of licenses of life science
databases in Wikidata.
Those following me on Twitter may have seen the discussion this afternoon. A weird law case went to the European court, which sent out its ruling today. And it's scary, very scary. The details are still unfolding and several media had written about it earlier. It's worth checking out for everyone doing research in Europe, particularly if you are a chem- or bioinformatician. I may be wrong in my interpretation, and I hope I am; I hope even more to be proven wrong soon, but fear that will not be soon at all. The initial reporting I saw was in a Dutch news outlet, but Sven Kochmann pointed me to this press release from the Court of Justice of the European Union. Worth reading. I will need to write more about this soon, to work out the details of why this may turn out disastrous for European research. For now, I will quote this part of the press release:
    Furthermore, when hyperlinks are posted for profit, it may be expected that the person who posted such a link should carry out the checks necessary to ensure that the work concerned is not illegally published.
I stress this is only part of the full ruling, because the verdict is based on a combination of arguments. What this argument does, however, is turn around an important principle: you have to prove you are not violating copyright.

Now, realize that in many European Commission funded projects, with multiple partners, sharing IP is non-trivial, ownership even less (just think about why traditional publishers require you to reassign copyright to them! BTW, never do that!), etc, etc. A lot of funding actually goes to small and medium sized companies, who are really not waiting for more complex law, nor more administrative work.

A second realization is that few scientists understand, or want to understand, copyright law. The result is hundreds of scholarly databases which do not define who owns the data, nor under what conditions you are allowed to reuse it, or share, or reshare, or modify it. Yet scientists do. So, not only do these databases often not specify the copyright/license/waiver (CLW) information, they certainly don't really tell you how they populated their database. E.g. how much they copied from other websites, under the assumption that knowledge is free. Sadly, database content is not. Often you don't even need to wonder about it, as it is evident, or even proudly stated, that they used data from another database. Did they ask permission for that? Can you easily look that up? Because, per the above quoted argument, you are now only allowed to link to that database once you have figured out that the data is legal. And believe me, that is not cheap.

Combine that, and you have this recipe for disaster.

A community that knows these issues very well, is the open source community. Therefore, you will find a project like Debian to be really picky about licensing: if it is not specified, they won't have it. This is what is going to happen to data too. In fact, this is also basically why eNanoMapper is quite conservative: if it does not get clear CLW information by the rightful owner (people are more relaxed with sharing data from others, than their own data!), it is not going to be included in the output.
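That conservative policy is easy to state as code: no clear CLW statement from the rightful owner, no inclusion. A hypothetical sketch (the record fields and the accepted-license list are my own, not eNanoMapper's actual implementation):

```python
# Hypothetical inclusion filter: only pass records whose rightful owner
# supplied clear CLW (copyright/license/waiver) information. The field
# names and the accepted set below are illustrative only.
ACCEPTED_CLW = {"CC0", "CC-BY", "CC-BY-SA"}

def include(record):
    """Debian-style rule: unspecified or unconfirmed licensing is excluded."""
    return record.get("clw") in ACCEPTED_CLW and record.get("owner_confirmed", False)

records = [
    {"id": "A", "clw": "CC0", "owner_confirmed": True},
    {"id": "B", "clw": None},                               # unspecified: excluded
    {"id": "C", "clw": "CC-BY", "owner_confirmed": False},  # not from the owner: excluded
]
print([r["id"] for r in records if include(r)])  # ['A']
```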

IANAL, but I don't have to be to see that this will only complicate matters, and the last thing that will do is help the Open Data efforts of the European Commission.

I have yet to figure out what this means for my Linked Data work. Some databases do great work and have very clear CLW information. Think ChEMBL, WikiPathways, and also Open PHACTS did a wonderful job in tracking and propagating this CLW information. On the other hand, Andra Waagmeester did an analysis of database license information of life sciences databases and note the number of 'free content' and 'proprietary' databases (top right figure), which are the two categories of databases where the CLW info is not really clear. How large the problem is with illegal content in those databases (e.g. text mined from literature, screenscraped from another database), who knows, but I can tell you this is not insignificant, unless you think it's 99%.

At the same time, of course, the solution is very simple. Only use and link to websites with clear CLW information and good practices. That rules out many of the current databases, though, and also supplementary information, where, even more than in databases, the rules of copyright are ignored by scientists.

And, honestly, I cannot help but wonder what all the publishers will now do with all the articles published in the past 20 years with hyperlinks in them. I hope for them these don't link to illegal material. Worse, per the above quoted argument, they will have to make sure that none(!) of those hyperlinks points to material with unclear copyright.

I'll end this post with a related Dutch law (well, at least related for the sake of this post). If you buy second-hand goods and the price is less than something like 1/3rd of the new price, you must demand the original receipt of the first purchase. If it is not provided, you are legally assumed to realize the goods are probably stolen. How would that translate to this situation? If the linked scientific database costs less than 1/3rd of the commercial alternative, may you assume it is illegal? Fortunately, this argumentation does not apply.

Problem is, there are enough "smart" people that misuse weird laws and rulings like this to make money. Think of the patent trolls, or about this:
What can possibly go wrong?

Friday, September 02, 2016

Elsevier launches

Elsevier (RELX Group) has seen a lot of publicity this week again. After the patent on peer review earlier this week, today I learned from Max Kemman about the website. This is great! Finding data (think FAIR, doi:10.1038/sdata.2016.18) is hard. Elixir Europe aims at fixing this, and working on open standards to have data explain itself, e.g. adoption of But an entry point that finds information is still very much welcome. Like the search interface for eNanoMapper that indexes information from multiple data sources (well, two at this moment, including the server).

For scientific information such a search engine doesn't really exist; we have to make do with tools like Google Scholar and Google Images. Both are pretty brilliant and allow you to filter on things, besides your regular keyword search. Of course, what we really need is an ontology-backed search, which Google seamlessly integrates under the hood.

Now, particularly for my teaching roles, I am frequently looking for material for slides, to support my message. Then, Google Images is great, as it allows me to filter for images that I am allowed to use, reuse, and even modify (e.g. highlight part of the image). Now, I know that some jurisdictions (like the USA) have more elaborate rules about fair use in education, but these rights are too often challenged, and money, DRM, etc. limit them. Let alone the scary, proposed European legislation (follow Julia Reda!).

So, I very much welcome this new effort! Search engines have a better track record than catalogs, like the Open Knowledge Foundation's DataHub. Of course, some repositories are getting so large, like FigShare, to a large extent by very active population by publishers like PLOS, that they may soon become a single point of entry.

Anyway, Elsevier is looking for peer review, which I give them for free (like I gave them free peer reviews until they crossed an internal, mental line; see The Cost of Knowledge). I can only hope that I am not violating their patent. Oh, and please don't look at the HTML of the website. You would certainly be violating their Terms of Use. They really need to talk to their lawyers; they're making a total mess of it.

Saturday, August 27, 2016

cAMP as a signalling compound?

cAMP. Picture from Wikipedia.
Maastricht University gives me the opportunity to study how chemical differences between individuals affect the metabolism, particularly for humans (I'm a chemist working in biology). Reading biological literature and text books sometimes makes my jaw drop. Biology is beautifully complex and sometimes just doesn't make sense at all.

So, in my WTF-moment of the day, I was reading about various RNAs, then nucleotides, etc, and got to cAMP. This, and I know that from WikiPathways too, can act as a secondary signalling compound: a membrane receptor passes the signal on to cAMP. But then? I mean, one single molecule, supposed to give a variety of signals. How?? How can it be selective? How is the hormone-specific signal not lost when passing through the cytoplasm?? Or is it just a general "ALERT ALERT, SOMETHING OUTSIDE HAPPENED"?

Back to the book.

Monday, August 15, 2016

Alzheimer’s disease, PaDEL-Descriptor, CDK versions, and QSAR models

A new paper in PeerJ (doi:10.7717/peerj.2322) caught my eye for two reasons. First, it's nice to see a paper using the CDK in PeerJ, one of the journals of an innovative, gold Open Access publishing group. Second, that's what I call a graphical abstract (shown on the right)!

The paper describes a collection of Alzheimer-related QSAR models. It primarily uses fingerprints, calculated with the PaDEL-Descriptor software (doi:10.1002/jcc.21707). I just checked the (new) PaDEL-Descriptor website and it still seems to use CDK 1.4. The page has the note "Hence, there are some compatibility issues which will only be resolved when PaDEL-Descriptor updates to CDK 1.5.x, which will only happen when CDK 1.5.x becomes the new stable release." and I hope Yap Chun Wei will soon find time to make this update. I had a look at the source code, but with no NetBeans experience and no install instructions, I was unable to compile it. AMBIT is already up to speed with CDK 1.5, so the migration should not be too difficult.

Mind you, PaDEL is used quite a bit, so the impact of such an upgrade would be substantial. The Wiley webpage for the article mentions 184 citations, Google Scholar counts 369.

But there is another thing. The authors of the Alzheimer paper compare various fingerprints and the predictive powers of models based on them. I am really looking forward to a paper where the authors compare the same fingerprint (or set of descriptors) but with different CDK versions, particularly CDK 1.4 against 1.5. My guess is that the models based on 1.5 will be better, but I am not entirely convinced yet that the increased stability of 1.5 is actually going to make a significant impact on the QSAR performance... what do you think?
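To make concrete what such a comparison entails: the overlap between two fingerprints is typically measured with the Tanimoto coefficient. A minimal Python sketch, with made-up bit sets rather than real CDK output:

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    if not fp1 and not fp2:
        return 1.0  # two empty fingerprints are trivially identical
    return len(fp1 & fp2) / len(fp1 | fp2)

# Hypothetical 'on' bits for the same molecule from two CDK versions:
bits_cdk14 = {1, 5, 8, 12}
bits_cdk15 = {1, 5, 8, 17}
print(tanimoto(bits_cdk14, bits_cdk15))  # 3 shared bits out of 5 in the union
```

If CDK 1.4 and 1.5 really perceive aromaticity and atom types differently, similarities like this one should drift away from 1.0, and that drift is what would feed into the model comparison.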

Simeon, S., Anuwongcharoen, N., Shoombuatong, W., Malik, A. A., Prachayasittikul, V., Wikberg, J. E. S., Nantasenamat, C., Aug. 2016. Probing the origins of human acetylcholinesterase inhibition via QSAR modeling and molecular docking. PeerJ 4, e2322+. 10.7717/peerj.2322

Yap, C. W., May 2011. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. Journal of Computational Chemistry 32 (7), 1466-1474. 10.1002/jcc.21707

Saturday, August 13, 2016

The Groovy Cheminformatics scripts are now online

My Groovy Cheminformatics with the Chemistry Development Kit book has sold more than 100 copies now. An older release can be downloaded as CC-BY from Figshare and was "bought" 39 times. That does not really make a living, but it does allow me to financially support CiteULike, for example, where you can find all the references I use in the book.

The content of the book is not unique. The book exists for convenience: it explains things around the APIs, and gives tips and tricks. In the first place for myself, to help me quickly answer questions on the cdk-user mailing list. This list is a powerful source of answers, and the archive covers 14 years of user support.

One of the design goals of the book was to have many editions, allowing me to keep all scripts updated. In fact, all scripts in the book are run each time I make a new release of the book, and, therefore, with each release of the CDK that I make a book release for. That also explains why a new release of the book currently takes quite a bit of time: there are so many API updates at the moment, as you can read about in the draft CDK 3 paper.

For a long time I also had the plan to make the scripts freely available. However, I never got around to making the website to go with that. I have given up on the idea of a website and now use GitHub. So, you can now, finally, find the scripts for the two active book releases on GitHub. Of course, without the explanations and context; for that you need the book.

Happy CDK hacking!

Sunday, July 17, 2016

Use of the BridgeDb metabolite ID mapping database in PathVisio

A long time ago Martijn van Iersel wrote a PathVisio plugin that visualizes the 2D chemical structures of metabolites in pathways as found on WikiPathways. Some time ago I tried to update it to a more recent CDK version, but did not have enough time then to get it going. However, John May's helpful DepictionGenerator made it a lot easier, so I set out this morning to update the code base to use this class and CDK 1.5.13 (well, strictly speaking it's running a prerelease (snapshot) of CDK 1.5.14). With success:

The released version is a bit more tweaked and shows the 2D structure diagram filling more of the Structure tab. I have submitted the plugin to the PathVisio Plugin Repository.

Now, you may know that these GPML pathways only contain identifiers, and no chemical structures. But this is where the metabolite identifier mapping database helps (doi:10.6084/m9.figshare.3413668.v1): it contains SMILES strings for many of the compounds. It does not yet contain SMILES strings from Wikidata, but I will start adding those in upcoming releases too. The current SMILES strings come from HMDB.

To show how all this works, check out the PathVisio screenshot below. The selected node in the pathway has the label 'uracil', and the leftmost front dialog was used to search the metabolite identifier mapping database; it found many hits in HMDB and Wikidata (middle dialog). The Wikidata identifier was chosen for the data node, allowing PathVisio to "interpret" the biological nature of that node in the pathway. Along with many mapped identifiers (see the Backpage on the right), this also provides a SMILES string that is used by the updated ChemPaint plugin.

Sunday, July 10, 2016

Setting up a local SPARQL endpoint

... has never been easier, and I have to say, with Virtuoso it already was easy.

Step 1: download the jar and fire up the server
OK, you do need Java installed, and for many this is still the case, despite Oracle doing their very best to totally ruin it for everyone. But seriously, visit the Blazegraph website (@blazegraph), download the jar, and type:

$ java -jar blazegraph.jar

It will give some output on the console, including the link to a webpage with a SPARQL endpoint, upload form, etc.

That it tracks past queries is a nice extra.

Step 2: there is no step two

Step 3: OK, OK, you also want to run a SPARQL query from the command line
Now, I have to say, the webpage does not have a "Download CSV" button for the SPARQL results. That would be great, but doing it from the command line is not too hard either.

$ curl -i -H "Accept: text/csv" --data-urlencode \
    query@query.rq http://localhost:9999/blazegraph/namespace/kb/sparql

But it would be nice if you would not have to copy/paste the query into a file, or go to the command line in the first place. Also, I had some trouble finding the correct SPARQL endpoint URL, as it seems to have changed at least twice in recent history, given the (outdated) documentation I found online (common problem; no complaint!).
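If you script this, e.g. from Python, you can at least keep the query in your code instead of in a separate file. A minimal sketch with only the standard library; note that the endpoint path used here (/blazegraph/namespace/kb/sparql) is an assumption that, as said, has changed between versions:

```python
from urllib.parse import urlencode
from urllib.request import Request

def sparql_request(endpoint, query):
    """Build an HTTP POST request asking a SPARQL endpoint for CSV results."""
    data = urlencode({"query": query}).encode("utf-8")
    return Request(endpoint, data=data, headers={"Accept": "text/csv"})

req = sparql_request(
    "http://localhost:9999/blazegraph/namespace/kb/sparql",  # assumed path
    "SELECT * WHERE { ?s ?p ?o } LIMIT 10",
)
# urllib.request.urlopen(req).read() would then return the CSV results.
```

The same Accept-header trick as in the curl call above does the content negotiation.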

HT to Andra who first mentioned Blazegraph to me, and the Blazegraph team.

Friday, July 08, 2016

Metabolomics 2016 Write up #1: some interesting bits

A good conference needs some time to digest. A previous supervisor advised me that a conference visit of 5 days takes 5 full days to follow up on everything. I think he was right, though few of us actually block our schedules to make time for that. Anyway, I started following up on things last weekend, resulting in a first two blog posts:
The second was pretty much how I have been blogging a lot: it's my electronic lab notebook. The first is about how people can link out to WikiPathways; that post explains how to create links between identifiers and pathways.

But there was a lot of very interesting stuff at Metabolomics 2016. I hope to be blogging about more things, but please find some initial coverage in the slides of a presentation I gave yesterday at our department:

Also check the Twitter hashtag #metsocdublin2016.

Saturday, July 02, 2016

Harmonized identifiers in the WikiPathways RDF

Biological knowledge should not only be captured
in nice graphics, but should also be machine readable.
Public domain image from Wikipedia.
WikiPathways describes biological processes. Entities in these processes are genes, gene products (like miRNAs and proteins), and metabolites. The pathways do not describe what these entities are, but only provide identifiers in external databases, allowing you to study the identity in those databases. Therefore, for metabolites you will not find chemical graphs but identifiers from HMDB, CAS, KEGG, ChEBI, and others.

To ensure experimental data can be mapped to these pathways, independent of whatever identifiers are used, BridgeDb was developed. WikiPathways uses a BridgeDb webservice, Open PHACTS embeds BridgeDb technologies in their Identifier Mapping Service (particularly developed by Carole Goble's team), and PathVisio uses local BridgeDb ID mapping files.

The WikiPathways SPARQL endpoint is not using the Open PHACTS IMS; instead, Andra introduced harmonized identifiers and provides these as additional triples in the WikiPathways RDF. For example:

SELECT DISTINCT ?gene (fn:substring(?ensId,32) as ?ensembl)
WHERE {
  ?gene a wp:GeneProduct ;
    wp:bdbEnsembl ?ensId .
}

Now, the gene resource IRIs actually use the Ensembl identifier when available, so this query returns redundant information, but there are other harmonized identifiers available:

SELECT DISTINCT ?type ?pred
WHERE {
  ?entity a ?type ; ?pred [] .
  FILTER (regex(str(?pred),'bdb'))
}

That results in a table like this:

Therefore, for these databases it is easy to make links between those identifiers and the pathways in which entities with those identifiers are found. For example, to create a link between Ensembl identifiers and pathways, we could do something like:

SELECT ?pathwayRes (str(?wpid) as ?pathway)
  (str(?title) as ?pathwayTitle)
  (fn:substring(?ensId,32) as ?ensembl)
WHERE {
  ?gene a wp:GeneProduct ;
    dcterms:identifier ?id ;
    dcterms:isPartOf ?pathwayRes ;
    wp:bdbEnsembl ?ensId .
  ?pathwayRes a wp:Pathway ;
    dcterms:identifier ?wpid ;
    dc:title ?title .
}

I am collecting a number of those queries in the WikiPathways help wiki's page with many example SPARQL queries. For example, check out the federated SPARQL queries listed there.
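The general shape of such a federated query is a SERVICE clause that sends part of the pattern to a remote endpoint. A hedged sketch, where the remote endpoint URL and the ex: predicates are placeholders, not taken from the wiki:

```sparql
SELECT ?pathway ?gene ?disease WHERE {
  # local part, evaluated on the WikiPathways endpoint
  ?gene a wp:GeneProduct ;
    dcterms:isPartOf ?pathway .
  # remote part, e.g. a gene-disease association resource
  SERVICE <http://sparql.example.org/sparql> {
    ?assoc ex:gene ?gene ;
      ex:disease ?disease .
  }
}
```

The join on ?gene is what makes the mashup: only genes that occur in both stores survive.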

Two Apache Jena SPARQL query performance observations

Doing searches in RDF stores is commonly done with SPARQL queries. I have been using this with the semantic web translation of WikiPathways by Andra to find common content issues, though sometimes combined with some additional Java code. For example, find PubMed identifiers that are not numbers.
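As a sketch of what such a curation query looks like (the exact class and predicate names for literature references are from memory and may differ in the current WikiPathways vocabulary):

```sparql
SELECT DISTINCT ?pubmed WHERE {
  ?citation a wp:PublicationReference ;
    dcterms:identifier ?pubmed .
  # keep only identifiers that are not plain numbers
  FILTER (!regex(str(?pubmed), "^[0-9]+$"))
}
```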

Based on Ryan's work on interactions, a more complex curation query I recently wrote, in reply to issues that Alex ran into with converting pathways to BioPAX, is to find interactions that convert a gene into another gene. Such interactions occurred in WikiPathways because graphically you do not see the difference. I originally had this query:

SELECT (str(?organismName) as ?organism) ?page
       ?gene1 ?gene2 ?interaction
WHERE {
  ?gene1 a wp:GeneProduct .
  ?gene2 a wp:GeneProduct .
  ?interaction wp:source ?gene1 ;
    wp:target ?gene2 ;
    a wp:Conversion ;
    dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page ;
    wp:organismName ?organismName .
} ORDER BY ASC(?organism)

This query properly found all gene-gene conversions to be fixed. However, it was also horribly slow in my JUnit/Apache Jena set up. The query runs very efficiently on the Virtuoso-based SPARQL endpoint. I had been trying to speed it up in the past, but without much success. Instead, I ended up batching the testing on our Jenkins instance. But this got a bit silly, with at some point subsets of fewer than 100 pathways.

Observation #1
So, I turned to Twitter, and quite soon got three useful leads. The first two suggestions did not help, but helped me rule out the problem. Of course, there is literature about optimizing, like this recent paper by Antonis (doi:10.1016/j.websem.2014.11.003), but I haven't been able to convert this knowledge into practical steps either. After ruling out these options (though I kept the sameTerm() suggestion), I realized it had to be the first two triples with the variables ?gene1 and ?gene2. So, I tried using FILTER there too, resulting in this query:

SELECT (str(?organismName) as ?organism) ?page
       ?gene1 ?gene2 ?interaction
WHERE {
  ?interaction wp:source ?gene1 ;
    wp:target ?gene2 ;
    a wp:Conversion ;
    dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page ;
    wp:organismName ?organismName .
  FILTER (!sameTerm(?gene1, ?gene2))
  FILTER EXISTS {?gene1 a wp:GeneProduct}
  FILTER EXISTS {?gene2 a wp:GeneProduct}
} ORDER BY ASC(?organism)

That did it! The time to run the query halved. Not so surprising, in retrospect, but it all depends on the SPARQL engine: which parts does it run first? Apparently, Jena's SPARQL engine starts at the top. This seems to be confirmed by the third comment I got. However, I always understood engines can also start at the bottom.

Observation #2
But that's not all. This speed-up made me wonder something else. The problem clearly seems to be the engine's approach to running parts of the query. So, what if I remove further choices in what to run first? That leads me to a second observation: it helps significantly if you reduce the number of subgraphs it later has to "merge". Instead, if possible, use property paths. That, again, about halved the runtime of the query. I ended up with the query below, which, obviously, no longer gives me access to the pathway resources, but I can live with that:

SELECT (str(?organismName) as ?organism)
       ?gene1 ?gene2 ?interaction
WHERE {
  ?interaction wp:source ?gene1 ;
    wp:target ?gene2 ;
    a wp:Conversion ;
    dcterms:isPartOf/foaf:page ?pathway ;
    dcterms:isPartOf/wp:organismName ?organismName .
  FILTER (!sameTerm(?gene1, ?gene2))
  FILTER EXISTS {?gene1 a wp:GeneProduct}
  FILTER EXISTS {?gene2 a wp:GeneProduct}
} ORDER BY ASC(?organism)

I'm hoping these two observations may help others using Apache Jena for unit and integration testing of RDF generation too.

Loizou, A., Angles, R., Groth, P., Mar. 2015. On the formulation of performant SPARQL queries. Web Semantics: Science, Services and Agents on the World Wide Web 31, 1-26.

Saturday, June 25, 2016

New Paper: "Using the Semantic Web for Rapid Integration of WikiPathways with Other Biological Online Data Resources"

Andra Waagmeester published a paper on his work on a semantic web version of WikiPathways (doi:10.1371/journal.pcbi.1004989). The paper outlines the design decisions, shows the SPARQL endpoint, and gives several example SPARQL queries. These include federated queries, like a mashup with DisGeNET (doi:10.1093/database/bav028) and EMBL-EBI's Expression Atlas. That results in nice visualisations like this:

If you have the relevant information in the pathway, these queries can help a lot in understanding what is biologically going on. And, of course, they are used for exactly that a lot.

Press release
Because press releases have become an interesting tool in knowledge dissemination, I wanted to learn what it involves to get one out. This involved the people at PLOS Computational Biology and the press offices of the Gladstone Institutes and our Maastricht University (press release 1, press release 2 EN/NL). There is one thing I learned in retrospect, and I am pissed with myself that I did not think of this: you should always have a graphic supporting your story. I have been doing this for a long time on my blog now (sometimes I still forget), but did not think of it for the press release. The press release was picked up by three outlets, though all basically ran it as we presented it to them.

But what makes me appreciate this piece of work, and WikiPathways itself, is how it creates a central hub of biological knowledge. Pathway databases capture knowledge that is not easily embedded in generally structured (relational) databases. As such, expressing this in the RDF format seems simple enough. The thing I really love about this approach is that your queries become machine-readable stories, particularly when you start using human-readable variants of SPARQL for this. And you can share these queries with the online scientific community with, for example, myExperiment.

There are two ways in which I have used SPARQL on WikiPathways data for metabolomics: 1. curation; 2. statistics. Data analysis is harder, because in the RDF world scientific lenses are needed to accommodate the chemical structural-temporal complexity of metabolites. For curation, we have long used SPARQL in unit tests to support the curation of WikiPathways. Moreover, I have manually used the SPARQL endpoint to find curation tasks. But now that the paper is out, I can blog about this more. For now, many example SPARQL queries can be found in the WikiPathways wiki. It features several queries showing statistics, but also some for curation. This is an example query I use to improve the interoperability of WikiPathways with Wikidata (also for BridgeDb):

SELECT ?metabolite
WHERE {
  ?metabolite a wp:Metabolite .
  OPTIONAL { ?metabolite wp:bdbWikidata ?wikidata . }
  FILTER (!BOUND(?wikidata))
}

Feel free to give this query a go at the WikiPathways SPARQL endpoint!

This paper completes a nice triptych of WikiPathways papers in the past 6 months. Thanks to the whole community and the very many contributors! All three papers are linked below.

Waagmeester, A., Kutmon, M., Riutta, A., Miller, R., Willighagen, E. L., Evelo, C. T., Pico, A. R., Jun. 2016. Using the semantic web for rapid integration of WikiPathways with other biological online data resources. PLoS Comput Biol 12 (6), e1004989+.
Bohler, A., Wu, G., Kutmon, M., Pradhana, L. A., Coort, S. L., Hanspers, K., Haw, R., Pico, A. R., Evelo, C. T., May 2016. Reactome from a WikiPathways perspective. PLoS Comput Biol 12 (5), e1004941+.
Kutmon, M., Riutta, A., Nunes, N., Hanspers, K., Willighagen, E. L., Bohler, A., Mélius, J., Waagmeester, A., Sinha, S. R., Miller, R., Coort, S. L., Cirillo, E., Smeets, B., Evelo, C. T., Pico, A. R., Jan. 2016. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Research 44 (D1), D488-D494.