Hacking on the Pubmed API

The PubMed API is pretty convoluted. Every time I try to use it, I have to relearn it from scratch.

Generally, I want to get JSON data about an article using its PubMed ID, and I want to do searches programmatically… These are pretty basic and pretty common goals…

The PubMed API is an old-school RESTish API that has hundreds of different purposes and options. Technically the PubMed API is part of the Entrez Programming Utilities (E-utilities), and instructions for using it begin and end with the Entrez Programming Utilities Help document. Here are the things you probably really wanted to know…

How to search for articles using the PubMed API

To search PubMed, you need to use the eSearch API.

Here is the example they give…

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science%5bjournal%5d+AND+breast+cancer+AND+2008%5bpdat%5d 

The first thing we want to do is have this thing return JSON instead of XML. We do that by adding a GET variable, retmode=json. The new URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science%5bjournal%5d+AND+breast+cancer+AND+2008%5bpdat%5d&retmode=json

Ahh… that's better… Now let's get more ids in each batch of the results…

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science%5bjournal%5d+AND+breast+cancer+AND+2008%5bpdat%5d&retmode=json&retmax=1000

Breaking this down…

http://eutils.ncbi.nlm.nih.gov/entrez/

is kind of the entry point for the whole system..

/eutils/esearch.fcgi

is the actual function that you will be using…

This tells the API that you want to search pubmed.

db=pubmed

Next you want to set the “return mode” so that JSON is returned.

retmode=json

And then you want to add the retmax to get up to 1000 results at a time… The documentation says that you can get 100,000 but I get a 404 if I go over 1000

retmax=1000

The term argument

term=YOUR SEARCH TERMS HERE

db and term are separated by the classic GET variable layout (the list starts with a ? and each variable is then separated by a &). If that sounds strange to you, I suggest you learn a little more about how GET variables work in practice.
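Here is a minimal sketch of putting those GET variables together in Python rather than by hand. The function name build_esearch_url is made up for illustration, and the sketch assumes the https form of the same endpoint. Note that urlencode writes spaces as + (like the example URL at the top), which is an equally valid encoding.

```python
from urllib.parse import urlencode

# Assumption: the https form of the eSearch endpoint used in this post.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=1000):
    """Return an eSearch URL for a raw (not yet URL-encoded) search term."""
    params = {
        "db": "pubmed",      # search PubMed
        "retmode": "json",   # ask for JSON instead of the default XML
        "retmax": retmax,    # maximum number of ids per response
        "term": term,        # urlencode takes care of the encoding
    }
    # urlencode joins the pairs with & and percent-encodes the values.
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url("science[journal] AND breast cancer AND 2008[pdat]")
print(url)
```

Keeping term last in the dictionary means the long, ugly part of the URL stays at the end, which makes the result easier to eyeball.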

Now about the “YOUR SEARCH TERMS HERE”: that is a URL-encoded string holding the search query for PubMed. URL encoding is (as something of a trivialized explanation) how you make sure that there are no spaces or other strangeness in a URL. Here is a handy way to get data into and out of URL encoding if you do not know what that is..

Thankfully the search terms are well defined, but not anywhere near the documentation for the API. The simplest way to understand the very advanced search functionality on PubMed is to use the PubMed advanced query builder, or you can do a simple search and then pay close attention to the box labeled “search details” on the right sidebar. For instance, I did a simple search for “Breast Cancer” and then enabled filters for Article Type of Review Articles and Journal Categories of “Core Clinical Journals”, which results in a search text that looks like this:

("breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]) AND (Review[ptyp] AND jsubsetaim[text])

Let's break that apart into a readable syntax display…

("breast neoplasms"[MeSH Terms] 
  OR ("breast"[All Fields] 
        AND "neoplasms"[All Fields]) 
  OR "breast neoplasms"[All Fields] 
  OR ("breast"[All Fields] 
        AND "cancer"[All Fields]) 
  OR "breast cancer"[All Fields]) 
AND (Review[ptyp] 
  AND jsubsetaim[text])

How did I get this from such a simple search? PubMed is using MeSH terms to map my search to what I “really wanted”. MeSH stands for “Medical Subject Headings” and is an ontology built specifically to make this task easier.

After that, it just tacked on the filter constraints that I manually set.

Now all I have to do is use my handy URL encoder to get the following URL-encoded version of my search parameters.

(%22breast%20neoplasms%22%5BMeSH%20Terms%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22neoplasms%22%5BAll%20Fields%5D)%20OR%20%22breast%20neoplasms%22%5BAll%20Fields%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22cancer%22%5BAll%20Fields%5D)%20OR%20%22breast%20cancer%22%5BAll%20Fields%5D)%20AND%20(Review%5Bptyp%5D%20AND%20jsubsetaim%5Btext%5D)
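If you would rather not depend on a web page for the encoding step, here is a small sketch using Python's standard library. The shortened search term is just for illustration. Passing safe="()" is my assumption about how to match the style of the string above, which leaves parentheses alone while encoding spaces, quotes and square brackets.

```python
from urllib.parse import quote, unquote

# A shortened version of the search text, for illustration only.
term = ('("breast neoplasms"[MeSH Terms] OR "breast cancer"[All Fields]) '
        'AND (Review[ptyp] AND jsubsetaim[text])')

# Encode everything except letters, digits, a few punctuation marks and
# the parentheses we explicitly mark as safe.
encoded = quote(term, safe="()")
print(encoded)

# unquote reverses the encoding, which is a cheap sanity check.
round_trip = unquote(encoded)
print(round_trip == term)
```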

Let's put the retmode=json ahead of the term= so that we can easily just paste this onto the back of the URL. We get the following result.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=1000&term=(%22breast%20neoplasms%22%5BMeSH%20Terms%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22neoplasms%22%5BAll%20Fields%5D)%20OR%20%22breast%20neoplasms%22%5BAll%20Fields%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22cancer%22%5BAll%20Fields%5D)%20OR%20%22breast%20cancer%22%5BAll%20Fields%5D)%20AND%20(Review%5Bptyp%5D%20AND%20jsubsetaim%5Btext%5D)

I wish that my css could handle these really long links better… but oh well. I know it looks silly, lets move on.

To save you (well mostly me at some future date) the trouble of cut and pasting here is the trunk of the url that is just missing the url encoded search term.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&term=

At the time of the writing, the PubMed GUI returns 2622 results for this search, and so does the API call… which is consistent and a good check to indicate that I am on the right track. Very satisfying.

The JSON that I get back has a section that looks like this:

    "esearchresult": {
        "count": "2622",
        "retmax": "20",
        "retstart": "0",
        "idlist": [
            "25081398",
            "25056393",
            "25055284",
            "25055283",
            "24956046",
            "24926080",
            "24912480",
            "24890451",
            "24889167",
            "24880509",
            "24878027",
            "24849143",
            "24838656",
            "24830599",
            "24792660",
            "24792659",
            "24792658",
            "24792657",
            "24792656",
            "24792655"
        ],
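Pulling the ids out of that response takes only a few lines. This is a sketch against a trimmed copy of the fragment above (I added the outer braces so that it parses), and parse_esearch is a name I made up. Notice that count comes back as a string, so it needs converting before you do arithmetic with it.

```python
import json

# Trimmed copy of the response shape shown above, wrapped in braces
# so that it is valid JSON on its own.
sample = """
{"esearchresult": {"count": "2622", "retmax": "20", "retstart": "0",
  "idlist": ["25081398", "25056393", "25055284"]}}
"""


def parse_esearch(body):
    """Return (total_hits, list_of_pubmed_ids) from an eSearch JSON body."""
    result = json.loads(body)["esearchresult"]
    # count is a quoted string in the response, so convert it to an int.
    return int(result["count"]), result["idlist"]


total, ids = parse_esearch(sample)
print(total, ids)
```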

With this result it is easy to see why you want to set retmax… getting 20 at a time is pretty slow… But how do you page through the results to get the next 1000 results? Add the retstart variable

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=1000&retstart=1000&term=(%22breast%20neoplasms%22%5BMeSH%20Terms%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22neoplasms%22%5BAll%20Fields%5D)%20OR%20%22breast%20neoplasms%22%5BAll%20Fields%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22cancer%22%5BAll%20Fields%5D)%20OR%20%22breast%20cancer%22%5BAll%20Fields%5D)%20AND%20(Review%5Bptyp%5D%20AND%20jsubsetaim%5Btext%5D)
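The paging can be sketched as a loop over retstart values. The helper name page_urls is hypothetical, the endpoint is the https form of the one above (my assumption), and the total would come from the count field of the first response.

```python
from urllib.parse import urlencode

# Assumption: the https form of the eSearch endpoint used in this post.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def page_urls(term, total, page_size=1000):
    """Return one eSearch URL per page needed to cover `total` results."""
    urls = []
    # retstart is the zero-based offset of the first id in each page.
    for retstart in range(0, total, page_size):
        params = {
            "db": "pubmed",
            "retmode": "json",
            "retmax": page_size,
            "retstart": retstart,
            "term": term,
        }
        urls.append(ESEARCH_URL + "?" + urlencode(params))
    return urls


# 2622 results at 1000 per page means three requests.
urls = page_urls("breast cancer AND Review[ptyp]", total=2622)
for u in urls:
    print(u)
```

With 2622 results the offsets are 0, 1000 and 2000, and the last page simply comes back short.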

If you need more help, here is the link to the full documentation for eSearch API again…

 

How to download data about specific articles using the PubMed API

There are two stages to downloading the specific articles. First, to get article meta-data you want to use the eSummary API… using the ids from the idlist json element above… you can call it like this:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&rettype=abstract&id=25081398

This will return a lovely json summary of this abstract. Technically, you can get more than one id at a time, by separating them with commas like so…

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&rettype=abstract&id=25081398,24792655
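Since an idlist can hold up to 1000 ids, it probably makes sense to ask for summaries in batches. Here is a sketch that builds the comma-separated URLs. The name esummary_urls and the default batch size of 200 are my own assumptions, not something the eSummary docs quoted here dictate, and the parameters simply mirror the URLs above.

```python
from urllib.parse import urlencode

# Assumption: the https form of the eSummary endpoint used in this post.
ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_urls(pmids, batch_size=200):
    """Return eSummary URLs, each covering up to batch_size PubMed ids."""
    urls = []
    for i in range(0, len(pmids), batch_size):
        batch = pmids[i:i + batch_size]
        params = {
            "db": "pubmed",
            "retmode": "json",
            "rettype": "abstract",
            "id": ",".join(batch),  # several ids, separated by commas
        }
        # safe="," keeps the commas readable instead of turning them into %2C.
        urls.append(ESUMMARY_URL + "?" + urlencode(params, safe=","))
    return urls


summary_urls = esummary_urls(["25081398", "24792655"])
print(summary_urls[0])
```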

This summary is great, but it will not get the abstracts, if and when they are available. (it will tell you if there is an abstract available however…) In order to get the abstracts you need to use the eFetch API

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=text&rettype=abstract&id=25081398

Unlike the other APIs, there is no JSON retmode; the default is XML, but you can get plaintext using retmode=text. So if you want structured data here, you must use XML. Why? Because. That's why. This API will take a comma-separated id list too, but I cannot see how to separate the plaintext results easily, so if you are using the plaintext (which is fine for my current purposes), it is better to call it with a single id at a time.
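The one-id-at-a-time approach can be sketched like this. The function names and the half-second pause are my assumptions (the pause is only there to be polite to the server), and the endpoint is the https form of the one above. fetch_abstracts actually goes out to the network, so it is defined here but not called.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the https form of the eFetch endpoint used in this post.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmid):
    """Return the eFetch URL for the plaintext abstract of one PubMed id."""
    params = {"db": "pubmed", "retmode": "text", "rettype": "abstract", "id": pmid}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_abstracts(pmids, delay=0.5):
    """Fetch plaintext abstracts one id at a time; returns {pmid: text}."""
    abstracts = {}
    for pmid in pmids:
        with urlopen(efetch_url(pmid)) as response:
            abstracts[pmid] = response.read().decode("utf-8")
        time.sleep(delay)  # hypothetical pause between requests
    return abstracts


print(efetch_url("25081398"))
```

Calling fetch_abstracts(["25081398", "24792655"]) would then give you one plaintext blob per id, with no need to work out where one abstract ends and the next begins.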

 

 

 

Healthcare IT reading list

My Programmable Self Behavior Change Reading list has been one of my most popular posts.

I still think any Health IT expert should be well-versed in behavior change science, since so many healthcare issues boil down to behavior change problems… either for patients or providers or both.

But the other day, I was having drinks during HIMSS with Keith Toussaint, Matt Burton (both Health IT rock stars at Mayo Clinic) and Sulie Anna Tay (a rising star at Cisco). Soon talk turned to “have you read this, have you read that” (you know how those conversations usually play out) and we started creating a “Required Reading List for Health IT”. I forgot about it until today, when I needed to find some references in one of the books… and realized I had left the project undone. So here is my required reading list for Health IT and healthcare reform, in no particular order:

 

I think it's important to listen to Alex Drane on end of life issues. And read Atul Gawande on the same topics in Letting Go.

 

 

 

I hate to humblebrag so I will just be plain: David Uhlman and I wrote what is probably the most popular book on Health IT, Hacking Healthcare.

 

EHR Vulnerability Reporting issues

For those who actually bother to read to the bottom of my bio, I was actually in Internet Security before going into Health IT. I spoke at DefCon and everything.

During my career in Health IT I have had to report a security vulnerability to an EHR developer once, and it was such a painful process that I basically just gave up.

My poor friend Josh Mandel and his group at SMART found an XSLT vulnerability in an HL7 provided file that is a part of essentially every modern EHR system (the standard, if not the file itself, is mandated by Meaningful Use).

They have had a horrible time trying to get the attention of the major EHR vendors, with less than 10% paying any real attention.

I am saddened, but not at all surprised. I will write more later…

-FT

How to submit prior art on the Medicity Direct Patent

Recently Medicity has tried to patent the concept of a HISP. Please join me in submitting prior art to prevent this undermining of everything that the Direct Project stands for.

Groklaw shows the way

Here is a specific page that I had some trouble with and the right answers for it…

The Patent number in question is 61/443,549

The confirmation number is: 9529

The first named inventor is: Alok Mathur, Alpharetta, GA (US)

The date of file is: 02-16-2011

The strange string they are going to ask you in the middle appears to be: 201161443549

Read Groklaw carefully because the form is massively unnecessarily complex. (Because that is how the government rolls)..

The following prior art exists for their claims:

* Conversion of encrypted payload content, perhaps CCDs, into HL7 2.3 transactions sent to an EMR over TCP/IP ports

Of course, converting to HL7 v2 is not actually a good idea in 99% of the cases, but it was always part of the original vision of the Direct Project

http://wiki.directproject.org/page/diff/Direct+Project+FAQ/122979323

 

Just search this page for HL7 to find Arien discussing the need for HL7 2.x interoperability

or you can read about how we dithered over 2.x versions of HL7

http://wiki.directproject.org/share/view/21291669

I will not dignify the fact that they note that this happens over TCP/IP with a comment. Really, you are going to use the network's protocol for that?

Are you sure you do not want to use UDP? Or perhaps IPX? Wow. Innovation. <- (sarcasm, see note for USPTO employees below)

* Conversion of encrypted payload content, perhaps HL7 v3, into rendered PDF formatted reports that are automatically printed to a local printer device per the provider’s workflow preferences.

* Construct of a standard Direct compliant outbound S/MIME transaction with CCD attachments by converting native PDF or HL7 v2.x formats and contents.

This of course makes Direct look like a fax machine. Which is a -huge- step backwards. But generally, converting between different healthcare interop standards has been done for quite some time.

A main goal of the HISP is to convert between various formats. We spent months talking about the particularly difficult conversions, i.e. Direct to IHE

http://wiki.directproject.org/Threat+Model+-+Direct+to+and+from+XDR

As far as I know the central advantage of a PDF is that you can print with it.

Here is Keith Boone discussing the issue on his blog

http://motorcycleguy.blogspot.com/2010/11/converting-from-hl7-version-2-message.html?showComment=1337606813597#c3948689104995255223

http://wiki.directproject.org/Session+Notes+6

 

This is 2 months too late but shows that we were including printers as possible devices to send Direct messages to.

The second set of claims is particularly annoying to me because I got involved in Direct specifically because it was not possible to do coordination of care without an underlying point to point messaging infrastructure.

  • Sharing of virtual care team records across disparate networks

  • Dynamic updates to disparate patient records using encrypted serialized patient objects across disparate networks

  • Sharing of application context within applications across disparate networks

  • Sharing of user context within applications across disparate networks

  • Establishing long-term patient and provider object-level communication across disparate networks.

It's late, so my patience for this is wearing thin. Email handles “sharing PHI across disparate networks”. The whole fucking point of Direct is that it is -just- email.

So everywhere that Medicity is saying “share (PHI Type here) across disparate networks” they are full of shit. This is the problem that Direct itself solves.

Then the question becomes. “Hey, now that we have this amazing capacity to share PHI across disparate networks, what specifically should we share?”

Hmm… perhaps we should use this to keep patient records in sync… no shit.

(In case you cannot tell, the preceding text is sarcasm. I am saying this because someone from the USPTO might be reading this, and I am not sure you would have picked up on that. Working at the USPTO might be the kind of job where you lose your sense of humor. I am just saying.)

The whole concept of a HISP is that it sits on the edge of the Direct network and integrates the local environment into Direct.

Medicity has a HISP product. It does things that HISPs do.

They do not deserve a patent for concepts that are -both- obvious and well described by the Direct community during the -entire- process of developing Direct. The fact that the US government did not dictate what a HISP should do does not mean that it was not discussed carefully, completely and commonly by everyone working on this project.

The “HISP as a bridge concept” is something that I had a hand in creating. I do not appreciate my own work being co-opted and abused in this fashion. I am requesting that Medicity withdraw this patent application, and consider… I don’t know… competing for Direct HISP business, instead of applying for bullshit patents on ideas that were created as part of an Open Source project.

-FT

 

 

 

 

 

 

 

 

About to have a call with the National Health Service

I am about to have a call with a group of people who work with the UK National Health Service.
I know for a fact that the people on the call are doing serious, thoughtful work on behalf of their government.

In contrast, my government just started paying the electricity bill again.

It is fairly hard to describe accurately how I feel about going into a call like this. Happily I have Reddit/Imgur to help!!

DocGraph Journal releasing new data sets

If you are near Boston, you should consider trying to make it out to Strata RX.

If you live anywhere on the West Coast, it's not too late to sign up for Health 2.0

The DocGraph Journal will again be releasing very significant data sets at these conferences. For many purposes, this data release is even more important than the original DocGraph Data Set. Attendees to the conference will be able to get early access to the new data set.

We will also be announcing two new projects that we will be crowdfunding with Medstartr.

There is a third data set that we will be announcing as part of our presentation to the RWJF Pioneer Pitch Day. The pitch from DocGraph about hacking the “medical translation” beat out more than 500 other proposals. We have been keeping this proposal under pretty tight wraps. We are going to build a prototype, from scratch at the Health 2.0 Code-a-thon coming up next weekend. It would be a great irony (and honor) to win that code-a-thon, because I actually proposed this particular skunkworks plan at a previous Health 2.0 Hackathon, and I could not convince any developers to work on it.

DocGraph Journal is also asking Knight News Foundation to fund our next-generation Open Data infrastructure.

In all of this, we are soft launching the DocGraph Journal itself as a backbone supporter of the DocGraph project! We have been preparing for all of these projects for months now and we are very happy to launch what we think is going to be the first Open Source Healthcare Data Journal. Our basic business plan with the Journal:

  • Create completely new, uncomfortably relevant healthcare data.
  • Open Source that data.
  • Learn with our community how to leverage those data sets using the latest Big Data methods.
  • Which we can use to move power into the hands of good doctors and empowered patients.
  • Change the World. Profit. In that order.

Wish us luck!

-FT

 

 

 

Simpler Direct Directories

Alan Viars is making the case for simpler direct directories.

He has allowed me to republish some of his ideas here!!

A couple of weeks ago, I attended ONC’s Direct Bootcamp in Crystal City, VA. A hot topic at the two-day conference was the notion of a “Provider Directory” that incorporates Direct email addresses.

I also read that HHS/CMS intends to revamp the National Plan and Provider Enumeration System (NPPES). This is the system that manages National Provider Identifiers or (NPIs). Every individual provider and provider organization has one of these numbers, sort of like a tax ID for providers. A common complaint I hear is that it contains information that is often out of date and/or incorrect.

So what, you might ask, does the NPPES have to do with the Direct Project? Having worked with the NPPES data and having some background with Direct, the idea of “killing two birds with one stone” has captured my imagination. (Nerdy and wonky I know.) This is an opportunity for government efficiency by consolidating systems. Efficiency can only be achieved if the new system is simple, however. Too often in health information technology, consultants and vendors introduce complexity for complexity’s sake. After all, complexity is good for the bottom line for many companies because it means more billable hours and more services sold. Sadly, I see this sort of thing all the time. As an American and a taxpayer it ticks me off.(See footnote)

To illustrate what I mean by “simple”, I’ve built a prototype web service application that illustrates my vision of a combined NPPES and Direct email Provider Directory. Before I outline that technical proposal, however, I’d like to point out how adding some other data fields to NPPES could result in an empowering service for patients, providers, and payers.

The whole article is worth a read. The man makes a good case.

 

Novice EHR Development is now unethical

The original Hippocratic Oath states:

I will not use the knife, not even on sufferers from stone, but will withdraw in favor of such men as are engaged in this work.

One modern version reads:

I will not be ashamed to say “I know not,” nor will I fail to call in my colleagues when the skills of another are needed for a patient’s recovery.

The idea here is that a doctor needs to recognize when another practitioner has a skill that they do not, and that they must refrain from “practice” when another person has demonstrable expertise in that area of practice.

It is now 2013. It is time for doctors to stop “writing their own EHR” from scratch. They need to bow out of this in favor of people who have developed expertise in the area.

I just found out about another doctor who has decided to write his own EHR, because he has not been able to find one that supports his new direct pay business model adequately. In the distant past I encountered a doctor who believed that his “Microsoft Word Templates” qualified as an EHR system. This is a letter to any doctor who feels like they are comfortable starting from-scratch software development for an EHR in 2013 or later.

You might believe yourself to be an EHR expert.

Are you sure about that? Are you sure that you are not just an EHR expert user?

This difference is not unlike your relationship with your favorite thoracic surgeon. Or for that matter, your relationship with the person who built your car. The fact that you are capable of expertly evaluating and using EHR products does not mean you are qualified to build one. Just like the fact that you are qualified to treat a patient who has recently had heart surgery or to discern when a patient might need heart surgery does not make you qualified to perform that heart surgery. Similarly, the fact that you can drive, or even repair your automobile, does not provide you with the expertise you need to build a car from scratch.

The ethical situation that you are putting yourself in by developing your own EHR is fairly tenuous. Performing heart surgery without being a heart surgeon, building and driving your own car without being an automotive engineer and a doctor coding their own EHR system from scratch all have the same fundamental problem: You might be smart enough to pull it off, but if you don’t you can really mess up another person’s life. Make no mistake, you can kill someone with a shoddy EHR just as easily as by performing medical procedures that you are not qualified for or by driving a car that is not road-safe.

It is not that heart surgeons, automotive engineers and EHR developers are not going to kill people with faulty performance. All experts are fallible. But they will kill far fewer people than you would, performing outside your expertise.

I can understand your feelings of frustration. You likely have totally different goals in mind than the average third-party-payer oriented EHR system has. You are right to be frustrated with the shackles that those systems have placed on you. But you are very wrong to presume that it is ethical for you to do “amateur hour” on your own.

You presume that because you can see the problems with EHR developer performance, that this makes you qualified to build a better EHR. You are utterly and unequivocally wrong about this. Sometimes, EHRs have features that are designed for clinical CYA, basically over-documentation for the sake of unethical defensive medicine. Sometimes EHR systems are designed to be glorified practice management systems, designed mostly to ensure that doctors maximize their paycheck at the expense of patient care. Sometimes EHR design decisions have no rationale behind them at all… they are frequently the result of original design whims that are hard to correct in subsequent editions of an EHR product.

But sometimes a feature that frustrates you is precisely what makes that EHR safe for patients. I can promise you that you cannot tell the difference between flaws and features without looking carefully at both the internals of the EHR system and all of the clinical workflows it is exercised in. What you think of as a flaw might be a software crumple zone.

Happily, you get to have your cake and eat it too. There are several Open Source EHR systems that are already Meaningful Use certified. You can use these Open Source EHR systems for nothing, and for very little money you can even get Meaningful Use credit for using these systems. Given this, you have no excuse to continue to develop a new EHR.

Open Source gives you the right to change what you need to, in order to get the functionality that you want.. and more importantly can connect you with experienced health IT developers, who can serve as a gut check for you as you consider how to implement the features that you need for whatever clinical variation you are interested in implementing.

This is very like the person who orders a “kit car” to build in their garage. They get to -feel- like they are building the car, and indeed they get lots of options normal car owners do not. But in the end, they are able to build a car safely because someone else, someone with specific expertise, has made sure that design of the kit car is fundamentally sound.  You can always shoot yourself in the foot with kit cars and Open Source.. but you have the power you need without being in over your head.

The development of mature EHR systems has been very similar to the development of surgical methods. Primitive EHR systems and primitive surgical procedures were both deadly. In both cases, medical science has already sacrificed thousands of people to the “cause” of learning how to do these things right. In 1850, it would have been entirely appropriate for any doctor to “dabble” with creating their own surgical methods. Even as recently as 2000, it would have been appropriate for you to “dabble” with the creation of your own EHR system. (eMDs was started by a doctor dabbling in 1996. eClinicalWorks was started in a similar fashion in 1999). But those days are over.

A doctor developing a new EHR system from scratch, by themselves, without extensive Health IT programming experience is in over their head. If they continue to develop an EHR, even after being warned of the dangers here, then this is hubris.

Ask yourself: Are you absolutely sure that this action is not a fundamental violation of the oath that you took when you became a doctor?

I want to be clear: I have worked on or around the development of EHR systems for more than a decade, and I would not presume to write a new EHR system without a team of programmers and years of funding. It's not that I think that “a doctor” is not qualified to undertake this task. No single person is.

I wrote a book designed to ensure that novice programmers had basic training in complex Health IT principles. Programmers can be guilty of hubris too, and I consistently advocate for a “clinical pair programming” approach. David Uhlman (my co-author) and I wrote the book because too many people assume that Health IT is easy, and they wonder why things in the industry are so “primitive”. The book is intended to teach clinicians and programmers alike humility when approaching clinical information systems, both as users and as developers. FSM knows that I have been dangerously arrogant regarding clinical information systems, and I have made and will make serious mistakes. But there comes a point where making the same mistakes that others have made, and written about, becomes unethical. I think we have reached this point with EHR systems.

Some people took offense that I should link to my own book at the end of this article, so instead I have included some of the reference materials that I use frequently. This is a good sampling of the kind of context that really should be required of any modern Health IT developer.

Begin with Information and Medicine by Marsden S Blois. Then move on to Principles of Health Interoperability HL7 and SNOMED by Tim Benson and The CDA Book by Keith Boone. Finally, you should read about what can go wrong in Health IT by studying EHR generated errors with Clinical Information Systems, Overcoming Adverse Consequences by Dean E Sittig and Joan S. Ash.

These are the books that I refer to when I get stuck on something. I wish I could just hold it all in my head, and in many ways my book is just the cliff notes I need for myself. If you know of other books that should be on the “Health IT required reading list” please leave them in the comments…

-FT