Hacking on the Wikipedia APIs for Health Tech

Recently I wrote about my work hacking on the PubMed API, which I hope is helpful to people. Now I will cover some of the revelations I have had while working with DocGraph on the Wikipedia APIs.

This article will presume some knowledge of the basic structure of open medical data sets, but we have recently released a pretty good tool for browsing the relationships between the various data sets: DocGraph Linea (that project was specifically backed by Merck, both financially and with coding resources, and they deserve a ton of credit for it working as smoothly as it does).

Ok, here are some basics to remember when hacking on the Wikipedia APIs if you are doing so from a clinical angle. Some of this will apply to Wikipedia hacking in general, but much of it is specifically geared towards understanding the considerable clinical content that Wikipedia and its sister projects possess.

First, there is a whole group of editors who might be interested in collaborating with you at WikiProject Medicine. (There is also a WikiProject Anatomy, which ends up being strongly linked to clinical topics for obvious reasons.) In general you should think of a WikiProject as a group of editors with a shared interest in a topic who collectively adopt a group of articles. Lots of behind-the-scenes things on Wikipedia take place on talk pages, and the connection between WikiProjects and specific articles is one of them. You can see the connection between WikiProject Medicine and the Diabetes article, for instance, on the Diabetes talk page.

WikiProject Medicine maintains an internal work list that is the best place to understand the fundamental quality levels of all of the articles that they oversee. You can see the summary of this report embedded in the project page and also here. There is a quasi-API for this data: using the quality search page, you can get articles that are listed as “C quality” but are also “High Priority”.

Once a clinical article on Wikipedia has reached a state where the Wikipedian community (“Wikipedian” is the nickname for Wikipedia contributors and editors) regards it as either a “good” article or a “featured” article, it can generally be considered highly reliable. To prove this, several prominent healthcare Wikipedians converted the “Dengue fever” Wikipedia article into a proper medical review article, and then got that article published in a peer-reviewed journal.

All of which is to say: the relative importance and quality of Wikipedia articles is something that is mostly known and can be accessed programmatically if needed. For now, “programmatically” means parsing the HTML results of the quality search engine above; I have a request in for a “get JSON” flag… which I am sure will be added “real soon now.”

The next thing I wish I had understood about Wikipedia articles is the degree to which they have been pre-datamined. Most of the data linking for Wikipedia articles started life as “infoboxes” which are typically found at the top right of clinically relevant articles. They look like this:

[Images: the ethanol infobox and the diabetes infobox]

The Diabetes infobox contains links to ICD-9 and ICD-10 codes as well as MeSH. Others will have links to SNOMED or CPT as appropriate. The ethanol article has tons of stuff in it, but for now we can focus just on the ATC code entry. Not only does it have the codes, but they correctly link to the relevant pages on the WHO website.

An infobox is a template on Wikipedia, which means it is a special kind of markup that can be found inside the wikitext for a given article. Later we will show how we can download the wikitext. But for now, I want to assure you that the right way to access this data is through Wikidata; parsing wikitext is not something you need to do in order to get at it. (This sentence would have saved me about a month of development time, if I had been able to read it.)

For instance, here is how we can get the ATC codes for ethanol via the Wikidata API:
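A minimal sketch of that lookup in Python, assuming Q153 is the Wikidata item for ethanol and P267 is the “ATC code” property (both IDs are worth double-checking before you rely on them):

    import requests

    # Ask Wikidata for the ATC-code claims on the ethanol item.
    # Q153 (ethanol) and P267 (ATC code) are assumptions -- verify them.
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetclaims",
            "entity": "Q153",     # ethanol
            "property": "P267",   # ATC code
            "format": "json",
        },
    )
    claims = resp.json().get("claims", {}).get("P267", [])
    atc_codes = [
        c["mainsnak"]["datavalue"]["value"]
        for c in claims
        if "datavalue" in c["mainsnak"]
    ]
    print(atc_codes)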

Most of this data mining is found in the Wikidata project. Let's have a brief 10,000-foot tour of the resources that it offers. First, there are several clinically relevant data points that it tracks. These include ATC codes, which are the WHO-maintained codes for medications. (It should be noted that recent versions of RxNorm can link ATC codes to NDC codes, which are maintained by the US FDA and are being newly exposed by the openFDA API project.)

I pulled all of the tweets I made from Wikimania about this into a Storify.

Other things you want to do in no particular order:

Once you have the wikitext, it's pretty easy to mine it for PMIDs so that you can use the PubMed API. I used regular expressions to do this, which does occasionally miss some PMIDs. I think there is an API way to do this perfectly, but I cannot remember what it is…
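Here is a rough sketch of that approach: pull the wikitext down with the MediaWiki parse API and then regex out the PMIDs. The article title and the pattern are just examples, and the pattern will miss unusually formatted citations:

    import re
    import requests

    # Fetch the raw wikitext for one article via the MediaWiki API.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "parse",
            "page": "Diabetes mellitus",  # example clinical article
            "prop": "wikitext",
            "format": "json",
            "formatversion": "2",
        },
    )
    wikitext = resp.json()["parse"]["wikitext"]

    # Citation templates usually carry the PMID as "pmid = 12345678".
    pmids = sorted(set(re.findall(r"pmid\s*=\s*(\d+)", wikitext, re.IGNORECASE)))
    print(pmids)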

That's a pretty good start. Let me know if you have any questions. I will likely expand on this article when I am not sleepy…


Hacking on the Pubmed API

The PubMed API is pretty convoluted. Every time I try to use it, I have to relearn it from scratch.

Generally, I want to get JSON data about an article using its PubMed ID, and I want to do searches programmatically… These are pretty basic and pretty common goals…

The PubMed API is an old-school RESTish API that has hundreds of different purposes and options. Technically the PubMed API is the Entrez Programming Utilities (the “E-utilities”), and instructions for using it begin and end with the Entrez Programming Utilities Help document. Here are the things you probably really wanted to know…

How to search for articles using the PubMed API

To search PubMed you need to use the eSearch API.

The documentation gives an example query, which returns XML by default. The first thing we want to do is not have this thing return XML, but JSON instead. We do that by adding a GET variable, retmode=json, to the URL.
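Assuming the standard eSearch endpoint and a placeholder search term, the JSON version of the URL looks something like this:

    https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&term=YOUR+SEARCH+TERMS+HERE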


Ahh… that's better… Now let's get more IDs in each batch of results…
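Along the same lines, adding retmax looks something like this:

    https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=1000&term=YOUR+SEARCH+TERMS+HERE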


Breaking this down…


The base URL (the eutils.ncbi.nlm.nih.gov/entrez/eutils/ part of the examples above) is kind of the entry point for the whole system…


The esearch.fcgi part is the actual function that you will be using…

Adding db=pubmed tells the API that you want to search PubMed.


Next you want to set the “return mode” to retmode=json so that JSON is returned.


And then you want to add retmax=1000 to get at least 1000 results at a time… The documentation says that you can ask for 100,000, but I get a 404 if I go over 1000.


Finally, there is the term argument, which in the examples above is the placeholder term=YOUR+SEARCH+TERMS+HERE.


db and term are separated using the classic GET variable layout (the query string starts with a ? and additional variables are separated by &s). If that sounds strange to you, I suggest you learn a little more about how GET variables work in practice.

Now, about the YOUR SEARCH TERMS HERE placeholder: that is a URL-encoded string of arguments that makes up the search string for PubMed. URL encoding is (to trivialize the explanation somewhat) how you make sure that there are no spaces or other strangeness in a URL. There are handy tools for getting data into and out of URL encoding if you do not know what that is…

Thankfully the search terms are well defined, though not anywhere near the documentation for the API. The simplest way to understand the very advanced search functionality in PubMed is to use the PubMed advanced query builder, or you can do a simple search and then pay close attention to the box labeled “Search details” in the right sidebar. For instance, I did a simple search for “Breast Cancer” and then enabled filters for the Article Type “Review” and the Journal Category “Core Clinical Journals”… which results in search text that looks like this:

("breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]) AND (Review[ptyp] AND jsubsetaim[text])

Let's break that apart into a more readable layout…

("breast neoplasms"[MeSH Terms] 
  OR ("breast"[All Fields] 
        AND "neoplasms"[All Fields]) 
  OR "breast neoplasms"[All Fields] 
  OR ("breast"[All Fields] 
        AND "cancer"[All Fields]) 
  OR "breast cancer"[All Fields]) 
AND (Review[ptyp] 
  AND jsubsetaim[text])

How did I get this from such a simple search? PubMed is using MeSH terms to map my search to what I “really wanted”. MeSH stands for “Medical Subject Headings”; it is an ontology built specifically to make this kind of task easier.

After that, it just tacked on the filter constraints that I manually set.

Now all I have to do is run my search parameters through a handy URL encoder.


Let's put retmode=json ahead of term= so that we can easily paste the encoded term onto the back of the URL.
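Rather than paste it by hand, here is a short Python sketch that does the encoding and assembles the final URL (the endpoint and parameter names are the ones discussed above):

    from urllib.parse import quote_plus

    BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

    # The "Breast Cancer" review-article query from the Search details box.
    term = ('("breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] '
            'AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] '
            'OR ("breast"[All Fields] AND "cancer"[All Fields]) '
            'OR "breast cancer"[All Fields]) '
            'AND (Review[ptyp] AND jsubsetaim[text])')

    # URL-encode the term and tack it onto the back of the URL.
    url = BASE + "?db=pubmed&retmode=json&retmax=1000&term=" + quote_plus(term)
    print(url)  # the resulting link is very long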


The resulting link is really long… I know it looks silly; let's move on.

To save you (well, mostly me at some future date) the trouble of cutting and pasting, here is the trunk of the URL that is just missing the URL-encoded search term.
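Assuming the standard E-utilities host, that trunk is:

    https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=1000&term=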


At the time of writing, the PubMed GUI returns 2622 results for this search, and so does the API call… which is consistent and a good check to indicate that I am on the right track. Very satisfying.

The JSON that I get back has a section that looks like this:

    "esearchresult": {
        "count": "2622",
        "retmax": "20",
        "retstart": "0",
        "idlist": [

With this result it is easy to see why you want to set retmax… getting 20 at a time is pretty slow… But how do you page through the results to get the next 1000? Add the retstart variable.
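For example, something like this (still with a placeholder term) should return results 1000 through 1999:

    https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=1000&retstart=1000&term=YOUR+SEARCH+TERMS+HERE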


If you need more help, here is the link to the full documentation for the eSearch API again…


How to download data about specific articles using the PubMed API

There are two stages to downloading the specific articles. First, to get article metadata you want to use the eSummary API, using the IDs from the idlist JSON element above… you can call it like this:
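Something along these lines, with a placeholder standing in for a real PMID from your idlist:

    https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&id=YOUR_PMID_HERE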


This will return a lovely JSON summary of the article. Technically, you can get more than one ID at a time by separating them with commas, like so…
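For example, again with placeholder PMIDs:

    https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&id=FIRST_PMID,SECOND_PMID,THIRD_PMID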


This summary is great, but it will not get the abstracts, if and when they are available. (It will tell you whether an abstract is available, however…) In order to get the abstracts you need to use the eFetch API.


Unlike the other APIs, there is no JSON retmode; the default is XML, but you can get plain text using retmode=text. So if you want structured data here, you must use XML. Why? Because. That's why. This API will take a comma-separated ID list too, but I cannot see how to separate the plain-text results easily, so if you are using the plain text (which is fine for my current purposes) it is better to call it one ID at a time.
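A sketch of the kind of call I mean, asking for the plain-text abstract of a single article (the PMID is again a placeholder; rettype=abstract is what requests the abstract text):

    https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=abstract&retmode=text&id=YOUR_PMID_HERE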




Codapedia launched

I heard about Codapedia during my annual tour of the floor looking for FOSS-related projects that I had not heard about before.

www.codapedia.com is among a new breed of ‘medical wikis,’ designed to support group editing like a standard wiki, but also to be more reliable and authoritative. The parallels to Medpedia are obvious. Like Medpedia, there is some vetting that goes on before an article is posted; in this respect it is similar to the Google concept of a Knol. The content of the site is licensed under the GNU Free Documentation License.

The site was set up by Greenbranch Publishing, the main publisher of the paper resource The Journal of Medical Practice Management. They sell books, journals, and audio content. This is not the organization's first foray into new media; they have run a podcast site called SoundPractice.net since 2005.

Rather than go into further detail, I will just let you listen to the podcast I did with Nancy Collins; most of the links that she mentions are included above.

Codapedia launch interview Nancy Collins (mp3)

Codapedia launch interview Nancy Collins (ogg)

Here is a shot of the Codapedia booth. Nancy is on the left; it would be nice if someone could leave the name of the woman on the right in the comments… (I forgot to ask.)