Better NDC downloads from the FDA

Recently, the FDA's Division of Drug Information (Center for Drug Evaluation and Research) dramatically improved how data downloads from their NDC search tool work, in response to some complaints they received from… someone. Most notably, they:

  • Added the NDC Package Code (the NDC-10 code with the dashes) to each row as a distinct field. This is the only field that is unique per row!
  • Added the ability to download the results in plain CSV. (Previously you could only get Microsoft Excel files, which are a proprietary format.)

 

NDC search and data download improvements

This makes the download functionality much more useful, and IMHO, that improvement makes the search tool itself much more worthwhile.

Data hounds like me just download the entire NDC database, which is already available as open data. But those files use non-standard data formats and require special ETL processing to work with conveniently. Now you can make useful subsets of the NDC data and then download those subsets in an open standard. Those CSV files will make working with the data in spreadsheets (other than Excel) and automated import into databases much easier.
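If you want to go straight from one of those subset downloads to a queryable database, the whole thing fits in a few lines of Python. This is just a minimal sketch: the file name and the exact column headers are assumptions, so match them to whatever the search tool actually gives you.

```python
import csv
import sqlite3

CSV_FILE = "ndc_subset.csv"  # hypothetical name for your downloaded subset

# Read the whole subset; utf-8-sig tolerates a leading BOM if one is present.
with open(CSV_FILE, newline="", encoding="utf-8-sig") as f:
    reader = csv.DictReader(f)
    rows = list(reader)
    columns = reader.fieldnames

# Create a table whose columns mirror whatever headers the download contains.
con = sqlite3.connect("ndc_subset.db")
col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
con.execute(f'CREATE TABLE IF NOT EXISTS ndc ({col_defs})')

placeholders = ", ".join("?" for _ in columns)
con.executemany(
    f'INSERT INTO ndc VALUES ({placeholders})',
    ([row[c] for c in columns] for row in rows),
)
con.commit()
print(f"Loaded {len(rows)} rows with columns: {columns}")
```

From there, ad hoc SQL against the subset is one `sqlite3 ndc_subset.db` away.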

Especially given my recent rant about using simple download formats, I think it is really important to recognize the folks at the FDA who work every day to make medication information a little more useful to the public.

Thank you!

-FT

 

Open Data Frustrations

First, let me say that I applaud and salute anyone who releases open data about anything as relevant as healthcare data. It is a tough and thankless slog to properly build, format and document open data files. Really, if you work on this please know that I appreciate you. I value your time and your purpose in life.

But please get your shit together.

Get your shit together

Please do not make your own data format standards. Please use a standard that does not require me to buy expensive proprietary software to read it. The best open standards have RFCs. Choose one of those.

And most of all: if a comma-delimited file will work for your data, just use CSV. If you are thinking, “but what if I have commas in my data?”… well you are just wrong. CSV is an actual standard (RFC 4180). It has ways to escape commas and, most importantly, you do not need to think about that. All you need to do is use the CSV export functionality of whatever you are working with. It will automatically do the right thing for you.
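For the skeptics, here is a tiny round-trip demonstration using Python's standard csv module (the values are made up). The quoting and escaping happen without you thinking about it, which is exactly the point.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ndc_package_code", "proprietary_name"])
writer.writerow(["12345-678-90", 'Acme "Extra Strength" Elixir, Cherry'])  # commas and quotes inside

print(buffer.getvalue())
# ndc_package_code,proprietary_name
# 12345-678-90,"Acme ""Extra Strength"" Elixir, Cherry"

buffer.seek(0)
rows = list(csv.reader(buffer))
assert rows[1][1] == 'Acme "Extra Strength" Elixir, Cherry'  # round-trips intact
```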

You are not doing yourself any favors creating a fixed-width file structure. Soon you will find that you did not really account for how long last names are. Or the people at the FDA will add another digit to sort out NDC codes… or whatever. CSV files mean that you do not have to think about how many characters your data fields use. More importantly, it means that I do not need to think about it either.
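To make that concrete, here is a toy comparison. The column widths below are invented, but that is exactly the problem: somebody has to invent them, document them, and re-verify them every year.

```python
# Fixed-width: the offsets live in your code and in every consumer's code.
line = "SMITH     WI2010"          # last_name(10) + state(2) + year(4)
record = {
    "last_name": line[0:10].strip(),
    "state": line[10:12],
    "year": line[12:16],
}
print(record)  # {'last_name': 'SMITH', 'state': 'WI', 'year': '2010'}

# CSV: nobody hard-codes widths, the reader just follows the header row.
import csv, io
csv_file = io.StringIO("last_name,state,year\nVAN DER BERG-JOHNSON,WI,2010\n")
print(next(csv.DictReader(csv_file)))  # a long name costs nothing
```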

You might be thinking “We should use JSON for this!” or “XML is an open standard”. Yes, thank you for choosing other good open formats… but for very large data sets, you probably just want to use a CSV file. The people at CMS thought JSON would be a good standard to use for the Qualified Health Plan data… and they did in fact design the standard so you could keep the JSON files to a reasonable size. But the health insurance companies have no incentive to make their JSON files a reasonable size, and so they have multi-gigabyte JSON files. That is hugely painful to download and it is a pain to parse.
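If you have never had to deal with a multi-gigabyte JSON file, here is roughly what the asymmetry looks like in practice. This is only a sketch: the file names are hypothetical, and it assumes the JSON is one top-level array and that you are willing to pull in the third-party ijson streaming parser.

```python
import csv
import ijson  # third-party streaming JSON parser: pip install ijson

def count_json_plans(path):
    """Stream one huge top-level JSON array without loading it into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in ijson.items(f, "item"))

def count_csv_plans(path):
    """CSV needs no special tooling; every reader is already row-at-a-time."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))

print(count_json_plans("qhp_plans.json"))  # hypothetical multi-gigabyte file
print(count_csv_plans("qhp_plans.csv"))    # the same data as CSV
```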

Just use CSV.

I was recently working with the MAX Provider Characteristics files from Medicaid. Here are the issues I had:

  • They have one zip file from 2009 that extracts into a directory with the same name as the zip file itself. That means the zip file will not unpack cleanly, because extraction tries to write to a directory with the same name as the original file. I have to admit, I am amazed that this mistake is even possible.
  • In 2009, the zip files made subdirectories. In 2010 and 2011 they dumped to the current directory, tar-bomb style. (Either way is fine; just pick one.)
  • Sometimes the file names of the ‘.txt’ files are ALL CAPS and sometimes not, even within the same year’s data.
  • Sometimes the state codes are upper case like ‘WI’ and ‘WV’, sometimes they are mixed case like ‘Wy’ and ‘Wa’, sometimes they are lowercase like ‘ak’ and ‘al’. Of course, we also have ‘aZ’. (See the normalization sketch after this list.)
  • Usually the naming structure is StateCode.year.maxpc.txt, like GA.2010.maxpc.txt. Except for that one time when they wrote it FL.Y2010.MAXPC.TXT.
  • The actual data in the files is in fixed-width format. Each year, you have to confirm that all of the field lengths are unchanged in order to ensure that your parser will continue to work.
  • They included instructions for importing the data files in SAS, the single most expensive data processing tool available. Which is, of course, what they were using to export the data.
  • They did not include instructions for any of the most popular programming languages. SAS does not even make the top-20 list.
  • There are multiple zip files, each with multiple files inside. We can afford a download that is over 100 MB in size. Just make one. single. CSV file. Please.
  • Sometimes the files end in .txt. Other times they just end in a ‘.’ (period).
  • The files are not even plain text files; they have some cruft at the beginning that causes them to be interpreted as binary files.
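Here is a sketch of the kind of normalization code every consumer of these files ends up writing. The patterns are guesses based on the examples above, not a complete inventory of the mess.

```python
import re

def normalize_name(filename):
    """Map e.g. 'FL.Y2010.MAXPC.TXT' or 'aZ.2011.maxpc.' to ('FL', '2010')."""
    base = filename.strip().rstrip(".")  # some files end in a bare '.'
    match = re.match(r"^([A-Za-z]{2})\.Y?(\d{4})\.maxpc(\.txt)?$", base, flags=re.IGNORECASE)
    if not match:
        raise ValueError(f"Unexpected file name: {filename!r}")
    return match.group(1).upper(), match.group(2)

for name in ["GA.2010.maxpc.txt", "FL.Y2010.MAXPC.TXT", "aZ.2011.maxpc."]:
    state, year = normalize_name(name)
    print(name, "->", f"{state}.{year}.maxpc.txt")
```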

Now how does that make me feel as someone trying to make use of these files? Pretty much like you might expect.

I love open data in healthcare. But please, please, start using simple, easy-to-use data standards. Get your shit together. I spend too much time hacking on ETL; I need to focus on things that change the world. And guess what… you need me to focus on those things too.

So if you are reading this (and you might very well be, because I specifically referred you to this rant), please do the right thing:

  1. Use an open standard for your data
  2. Use CSV if you can
  3. Are you ABSOLUTELY SURE that you cannot use CSV?
  4. Use JSON if you cannot use CSV
  5. Use XML if you cannot use CSV or JSON
  6. Make your data and file naming consistent, so that a machine can process it.

Thank you.

 

 

Google Intrusion Detection Problems

(Update: Less than four hours after tweeting out this blog post, we got our access turned back on. So the Google support team is definitely listening on social media. We very much appreciate that, because it resolves the immediate crisis. We are still concerned by the “auto-off” trend and the missing button, and we will be working to make sure there is a better long-term solution. Will update this post as appropriate moving forward.)

So today our Google Cloud Account was suspended. This is a pretty substantial problem, since we had committed to leveraging the Google Cloud at DocGraph. We presumed that Google Cloud was as mature and battle tested as other carrier grade cloud providers like Microsoft, Rackspace and Amazon. But it has just been made painfully clear to us that Google Cloud is not mature at all.

Last Thursday, we were sent a message titled “Action required: Critical Problem with your Google Cloud Platform / API Project ABCDE123456789”. Here is that message.

[Screenshot: the “Action required” email]

Which leads to our first issue: Google refers to the project by its ID, and not its project name. It took us a considerable amount of time to figure out what they were talking about when they said “625834285688”. We probably lost a day trying to figure out what that meant. This is the first indication that they would be communicating with us in a manner that was deeply biased towards how they view their cloud service internally, totally ignoring what we were seeing from the outside. While that was the first issue, it was nowhere near the biggest.

The biggest issue is that it was not possible to complete the “required action”. That’s right, Google threatened to shut our cloud account down in 3 days unless we did something… but made it impossible to complete that action.

Note that they do not actually detail the action needed in the “action required” email. Instead they refer to an FAQ, where you find these instructions:

[Screenshot: the “Request an appeal” instructions from the FAQ]

So we did that… and guess what, we could not find the blue “Request an appeal” button anywhere. So we played a little “Where’s Waldo” on the Google Cloud console.

  • We looked where they instructed us to.
  • We looked at the obvious places.
  • We looked at the not-obvious places.

As far as we can tell, there was no “Request an appeal” button anywhere in our interface. Essentially, this makes actually following the request impossible.

So we submitted a support request saying “Hey you want us to click something, but we cannot find it” and also “what exactly is the problem you have with our account in any case?”

However, early yesterday morning, despite us reaching out to their support services to figure out what was going on, Google shut our entire project down. Note that we did not say “shut down the problematic server” or even “shut down all your servers”. Google Cloud shut down the entire project. Although we use multiple Google Cloud APIs, we thought it made sense to keep everything we were doing on the same project. For those wondering, that is a significant design flaw, since Google has fully automated systems that can shut down entire projects and that apparently cannot be manually overridden. (Or at least, they were not manually overridden for us.)

We have lost access to multiple critical data stores because Google has an automated threat detection system that is incapable of handling false positives. This is the big takeaway: it is not safe to use any part of Google Cloud Services, because their threat detection system has a fully automated allergic reaction to anything it has not seen before, and it is capable of taking down all of your cloud services, without limitation.

So how are we going to get out of this situation? Google offers support solutions where you can talk to a person if you have a problem. We view it as problematic that interrupting an “allergic reaction” is treated as a “support issue”. However, we would be willing to purchase top-tier support in order to get this resolved quickly. But there does not appear to be an option to purchase access to a human to get this resolved. Apparently, we should have thought about that before our project was suspended.

Of course, we are very curious as to why our account was suspended. As data journalists, we are heavy users of little-known web services. We suspect that one of our API client implementations looked to Google’s threat detection algorithms like it was “hacking” in one way or another. There are other, much less likely explanations, but that is our best guess as to what is happening.

But we have no idea what the problem is, because Google has given us no specific information about where to look for it. If we were actually doing something nefarious, we would know which server was the problem. We would know exactly how we were breaking the rules. But because we are (in one way or another) a false positive in their system, we have no idea where to even start looking for the traffic pattern that Google finds concerning.

Now when we are logged in, we simply see an “appeal” page that boldly asserts “Project is in violation of Google’s Terms of Service”. There is no conversation capacity, and filling out the form appears to simply loop back to the form itself.

It hardly matters: Google’s support system is so completely broken that this issue represents a denial-of-service attack vector. The simplest way to take down any infrastructure that relies on Google would be to hack a single server, and then send really obvious hack attempts outbound from that server. Because Google ignores inbound support requests and has a broken “action required” mechanism, the automated systems will take down an entire company’s cloud infrastructure, no matter what.

Obviously, we will give Google a few hours to see if they can fix the problem and we will update this article if they respond in a reasonable timeframe, but we will likely have to move our infrastructure to a Cloud provider that has a mature user interface and support ticketing system. While Google Cloud offers some powerful features, they are not safe to use until Google abandons its “guilty until proven innocent, without an option to prove” threat response model. 

-FT

 

 

 

 

 

Mourning Jess Jacobs

Yesterday, Jess Jacobs died.

Like so many others on Twitter, I knew Jess just well enough to be profoundly frustrated as I watched helplessly while the healthcare system failed her again and again. Today, the small world of Internet patient advocates mourns for her across blogs and Twitter. The world of people who are trying to fix healthcare from underneath is small, and relationships that form on social media around a cause can be intense. There is nothing like an impossible, uphill battle to make lasting friendships. Now this community is responding to the loss of not only one of our own, but one of our favorites.

Is the NSA sitting on medical device vulnerabilities?

Today is not a fun day to read Slashdot if you care about healthcare cybersecurity. First, it highlights how the DEA is strong-arming states into divulging the contents of their prescription databases.

Second, and even more troubling, was the claim that the NSA was looking to exploit medical devices. The story was broken by Intercept reporter Jenna McLaughlin. Since then, the article has been picked up by The Verge, whose title is even more extreme: “The NSA wants to monitor pacemakers and other medical devices”. Jenna did not specifically mention where she heard the comments, but her Twitter feed gave me a hint.

The comments were from NSA deputy director Richard Ledgett, who is the same guy that countered the TED Talk from Snowden with his own. He was speaking at the Defense One Tech Summit. It is incredibly hard to find, but his comments are available on video; he goes on almost exactly at the 3-hour mark. I tried to embed the talk below, YMMV.


In one sense this has been blown out of proportion. Patrick Tucker is the moderator/interviewer here, and he is the one that is pressing Ledgett on the issue of biomedical devices. Start at 3:15 for the discussion on medical devices.

Ledgett insists that targeting medical devices is not a priority for the NSA. But the troubling thing is his answer to the first question:

Question: “What is your estimation of their security?”

Answer: “Varies a lot depending on the device and the manufacturer.”

The problem with this is that I know of no examples of the NSA releasing data on insecure medical devices. In fact, the FDA has recently released information about specific medical devices that were insecure, without giving credit to the NSA.

This means that the NSA is investigating the security of medical devices, but then not releasing that information to the public. Ironically, it is a quantified-self device that best illustrates the point here. Ledgett specifically highlights Fitbit, which I know had some pretty strange wireless behavior (that many regarded as insecure, in its early versions). So we know they have looked at one specific device, but there has been no release of information from the NSA on that device. At least I cannot find any.

If indeed the NSA is investigating medical devices, and is not releasing all of that information to the FDA, device manufacturers and the public, then that is a huge problem.

I am still thinking about this, but it does not look good.

I suppose I should also mention that I ran across the interesting fact that Osama Bin Laden was using a CPAP machine.

Update: I have submitted a FOIA request for access to vulnerabilities about “healing devices” and it has been denied.

Clinton's Server and Politifact

Most of the time that I spend as a security-wonk is focused on email security. This is due almost entirely to my involvement as one of the architects of the Direct Project, which is a specification for using secure encrypted email in healthcare settings.

Which is why I was surprised by a recent analysis from Politifact evaluating something that Hillary Clinton said about her email servers. I should mention that I am apolitical. I care, but both US parties fail to resonate with me. So I have no reason to pick one side of this debate over the other. I am interested in the implications and perceptions of Hillary’s email system, however, because it is very revealing of basic attitudes about email systems.

For those that do not know, Politifact is an organization that evaluates the veracity of specific statements that politicians make. Given my attitude about politics, you can understand why I am a fan of such a service. The statement that Hillary Clinton made that Politifact was evaluating was that “my predecessors did the same thing” regarding her email practices.

Politifact said:

And there’s a big difference between a private account, which is generally free and simple to start, and a private server, which requires a more elaborate setup…. The unorthodox approach has opened up questions about her system’s level of security.

later concluding:

This is a misleading claim chiefly because only one prior secretary of state regularly used email, Colin Powell. Powell did use a personal email address for government business, however he did not use a private server kept at his home, as Clinton did.

We rate this claim Mostly False.

The central assumption that Politifact is making is that Clinton’s email server was fundamentally less secure than using a service. Specifically, Colin Powell used AOL. In fact, for the average person, you are probably better off using a service like AOL. But Hillary Clinton and Colin Powell are hardly the average person. There are considerable advantages to having your own email server and your own domain, if you are particularly concerned with security.

First, all of the email services are constantly the targets of hackers. If someone broke into AOL, they could find Colin Powell’s account as a side-effect of the overall hack. It would be a bonus for hacking a system that is already regarded as a high-value target. Second, it is still relatively easy to spoof email. That means that it is fairly simple for someone to send emails pretending to be a particular person on a public email service. So if I had wanted to pretend to be Colin Powell, it would have been a little easier to get away with it, given that he was using an email service. It would be much easier to set up specific defenses (there is not that much you can do without encryption of some kind) to combat spoofing on your own server.
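To make that concrete, domain-level anti-spoofing controls like SPF and DMARC are exactly the kind of defense you only get to configure when you control the domain. Here is a rough sketch of checking whether a domain publishes them, assuming the third-party dnspython library; it is illustrative, not a complete anti-spoofing audit.

```python
import dns.resolver  # third-party: pip install dnspython (2.x for resolve())

def txt_records(name):
    """Return the TXT records published at a DNS name, or [] if there are none."""
    try:
        return [rdata.to_text().strip('"') for rdata in dns.resolver.resolve(name, "TXT")]
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        return []

domain = "example.com"  # substitute the domain you actually control
spf = [r for r in txt_records(domain) if r.startswith("v=spf1")]
dmarc = [r for r in txt_records("_dmarc." + domain) if r.startswith("v=DMARC1")]
print("SPF:", spf or "none published")
print("DMARC:", dmarc or "none published")
```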

Unless Colin Powell had some special relationship with AOL (which is actually a real possibility), login attempts from Eastern Europe to his account would not have been flagged in any way. On a private server, however, you could always say “Is Secretary Clinton in Eastern Europe today? No. Then that login attempt is a problem.” Of course, if you are not watching the logs on your private server, then this advantage is negated.

As it turns out, Clinton was also publicly serving up Windows Remote Desktop on her server, which makes it unlikely that she was taking the steps needed to get those security benefits. Even with that information, however, I cannot see the merit of the assumption that using AOL vs hosting your own Exchange server is fundamentally more or less secure for a public official like this.

Ultimately, when you trust an organization like AOL, you are effectively trusting thousands of people all at once. Clinton probably trusted somewhere between 10 and 100 people with the contents of her email server. Colin Powell probably trusted somewhere between 1,000 and 10,000. If I were making suggestions for the security of my grandmother’s email… I would go with AOL. If I were making suggestions for the Secretary of State? It is much less clear, and would depend a lot on how the two different email systems were configured and regularly used.

As is almost always the case, the Wikipedia article on the subject is a tremendous source of the kinds of detail that a security researcher like me might need to evaluate whether there were security advantages or disadvantages to hosting your own server. But still, it is obvious that Colin Powell trusted state secrets to a massive Internet provider and Hillary Clinton trusted state secrets to a small team of generalist (i.e. not security) consultants. Neither of those decisions was well informed by proper security thinking for securing emails that might eventually become state secrets.

So from my perspective as a security researcher with a focus on email security, it is a pretty fair statement for Hillary to say “My recent bad decision about email is equivalent to previous bad decisions made by members of the other party”.

Which means I think Politifact got it wrong. What is more interesting is why. They got it wrong because they made some flawed assumptions about email security. This is deeply ironic, because that is precisely the same mistake that both Clinton and Powell made about exactly the same issue.

But I also think this is a problem with the way that technical options are presented. Politifact quotes Clifford Neuman as saying “you would need to stay current on patches”. I can promise you that this is not all Clifford had to say on the matter, but this is the only thing that Politifact chose to surface. The reality of the technical issues is a huge debate about whether Software as a Service is more secure than locally deployed and supported software. In reality, locally deployed software clearly can be made more secure, because one can choose to enforce parameters that improve security at the expense of convenience (like two-factor authentication, for instance). However, Software as a Service is usually more secure in practice, because you have teams of people ensuring that bare minimums are always met.

I really could not care less about Clinton’s or Powell’s choices when it comes to email servers. It is a little silly to be accusing people who get to decide what is classified and what is not of mishandling classified information. Personally, I think the fact that Clinton was exposing an RDP connection to the public Internet is the only thing I have heard that is truly scandalous here, and that is clearly not the focus of the media circus. I do not care at all about the political side of this.

I am very concerned, however, about how novices think about complex security and privacy issues. How is it that Politifact, which is charged with getting to the bottom of this issue, discussed precisely none of these complex technical issues? The conclusion they reached is pretty shallow. Which I do not think is their failing… I think this is a symptom of dogmatic thinking in InfoSec messaging.

Still not done thinking about this.

-FT

 

 

Hacking on the Wikipedia APIs for Health Tech

Recently I wrote about my work hacking on the PubMed API, which I hope is helpful to people. Now I will cover some of the revelations I have had working with DocGraph on the Wikipedia APIs.

This article will presume some knowledge of the basic structure of open medical data sets, but we have recently released a pretty good tool for browsing the relationships between the various data sets: DocGraph Linea (that project was specifically backed by Merck, both financially and with coding resources, and they deserve a ton of credit for it working as smoothly as it does).

OK, here are some basics to remember when hacking on the Wikipedia APIs if you are doing so from a clinical angle. Some of this will apply to Wikipedia hacking in general, but much of it is specifically geared towards understanding the considerable clinical content that Wikipedia and its sister projects possess.

First, there is a whole group of editors that might be interested in collaborating with you at WikiProject Medicine. (There is also a WikiProject Anatomy, which ends up being strongly linked to clinical topics for obvious reasons.) In general you should think of a WikiProject as a group of editors with a shared interest in a topic who collectively adopt a group of related articles. Lots of behind-the-scenes things on Wikipedia take place on talk pages, and the connection between WikiProjects and specific articles is one of them. You can see the connection between WikiProject Medicine and the Diabetes article, for instance, on the Diabetes Talk page.

WikiProject Medicine maintains an internal work list that is the best place to understand the fundamental quality levels of all of the articles that they oversee. You can see the summary of this report embedded in the project page and also here. There is a quasi-API for this data using the quality search page: for example, you can get articles that are listed as “C quality” but are also “High Priority”.

Once a clinical article on Wikipedia has reached a state where the Wikipedian community (Wikipedian is the nickname for Wikipedia contributors and editors) regards it as either a “good” article or a “featured” article, it can generally be considered to be highly reliable. To prove this, several prominent healthcare Wikipedians converted the “dengue fever” Wikipedia article into a proper medical review article, and then got that article published in a peer-reviewed journal.

All of which is to say: the relative importance and quality of Wikipedia articles is something that is mostly known and can be accessed programmatically if needed. For now, “programmatically” means parsing the HTML results of the quality search engine above. I have a request in for a “get JSON” flag… which I am sure will be added “real soon now”.
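Until that flag shows up, the scraping approach looks roughly like this. The URL, query parameters, and table layout below are placeholders; you would need to adjust them to match the real quality search tool.

```python
import requests                 # third-party: pip install requests
from bs4 import BeautifulSoup   # third-party: pip install beautifulsoup4

QUALITY_SEARCH_URL = "https://example.org/quality-search"    # placeholder URL
PARAMS = {"quality": "C-Class", "importance": "High-Class"}  # placeholder params

resp = requests.get(QUALITY_SEARCH_URL, params=PARAMS)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
titles = []
for row in soup.select("table tr")[1:]:                      # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        titles.append(cells[0])                              # assume title is column one

print(f"Found {len(titles)} C-quality, high-priority articles")
```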

The next thing I wish I had understood about Wikipedia articles is the degree to which they have been pre-datamined. Most of the data linking for Wikipedia articles started life as “infoboxes” which are typically found at the top right of clinically relevant articles. They look like this:

[Screenshots: the ethanol and diabetes infoboxes]

The Diabetes infobox contains links to ICD-9 and ICD-10 as well as MeSH. Others will have links to SNOMED or CPT as appropriate. The ethanol infobox has tons of stuff in it, but for now we can focus just on the ATC code entry. Not only does it have the codes, but it also links correctly to the relevant page on the WHO website.

An infobox is a template on Wikipedia, which means it is a special kind of markup that can be found inside the wikitext for a given article. Later we will show how to download the wikitext. But for now, I want to assure you that the right way to access this data is through Wikidata; parsing wikitext is not something you need to do in order to get at this data. (This sentence would have saved me about a month of development time, if I had been able to read it.)

For instance, here is how we can get the ATC codes for ethanol via the Wikidata API:
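(What follows is a reconstruction of that example, not the original snippet: it assumes ethanol’s Wikidata item is Q153 and that P267 is the ATC code property, both of which are worth double-checking.)

```python
import requests  # third-party: pip install requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

params = {
    "action": "wbgetclaims",
    "entity": "Q153",    # assumed to be the Wikidata item for ethanol
    "property": "P267",  # assumed to be the "ATC code" property
    "format": "json",
}
data = requests.get(WIKIDATA_API, params=params).json()

atc_codes = [
    claim["mainsnak"]["datavalue"]["value"]
    for claim in data.get("claims", {}).get("P267", [])
    if claim["mainsnak"].get("datavalue")
]
print("ATC codes for ethanol:", atc_codes)
```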

Most of this data mining is found in the Wikidata project. Let’s have a brief 10,000-foot tour of the resources that it offers. First, there are several clinically related data points that it tracks. This includes ATC codes, which are the WHO-maintained codes for medications. (It should be noted that recent versions of RxNorm can link ATC codes to NDC codes, which are maintained by the US FDA and are being newly exposed by the Open FDA API project.)

I pulled all of the tweets I made from Wikimania about this into a Storify.

Other things you want to do in no particular order:

Once you have wikitext, it’s pretty easy to mine it for PMIDs so that you can use the PubMed API. I used regular expressions to do this, which does occasionally miss some PMIDs. I think there is an API way to do this perfectly, but I cannot remember what it is…
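For what it is worth, here is a minimal sketch of that workflow: pull the wikitext down through the standard MediaWiki API and then regex-mine it for PMIDs. Like I said above, the regex approach will occasionally miss some.

```python
import re
import requests  # third-party: pip install requests

API = "https://en.wikipedia.org/w/api.php"

def get_wikitext(title):
    """Download the current wikitext of an article via the MediaWiki API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    data = requests.get(API, params=params).json()
    return data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

# PMIDs usually appear as "pmid = 12345678" inside {{cite journal}} templates.
wikitext = get_wikitext("Dengue fever")
pmids = sorted(set(re.findall(r"pmid\s*=\s*(\d{4,8})", wikitext, flags=re.IGNORECASE)))
print(f"Found {len(pmids)} PMIDs, for example: {pmids[:5]}")
```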

That’s a pretty good start. Let me know if you have any questions. Will likely expand on this article when I am not sleepy…

-FT

Susannah Fox is the new CTO of HHS

I am not actually sure that anybody reads this blog. I suppose they must; that is really the magic of RSS… letting you know when your friends blog… right?

Still, if you wanted to actually follow what I am doing, you should probably be reading the DocGraph Blog, or the CareSet Blog, or the OpenSourceHealth News Page. I just don’t tend this blog the way I should… But when I have news that I think is deeply interesting and I cannot find a category for it anywhere, this is the perfect place.

Discovering that my dear friend Susannah Fox is now the CTO of HHS is just that kind of category-defying, important news.

I cannot think of a better person for a role like this. I mean that literally. I tried. I failed.

Susannah is a geek, enough of one that she cannot easily be snowed by other technologists (I should be explicit: I am talking about government contractors), even where she does not have direct technical expertise. Not every geek I know can do that. On the other hand, she is not so much of a geek that people find her arrogant or incomprehensible. (I have problems with both.) Most of that job will not be directly geeky stuff. That sounds contradictory, but HHS is just too large to have any one technical strategy. There is no way that a reasonable technical vision for the FDA would apply at CMS or at NLM. Being the CTO at HHS is about seeing the connections, understanding how things fit together, and then having a vision that is not actually technology-centric, but patient-centric.

As technology-savvy as Susannah is, it is her capacity to hold a huge vision, and to keep patients at the center of that vision, that makes her so deeply qualified for this job. Nobody asked me who the next CTO of the government was going to be, and frankly I was a little worried about who would be next. Bryan Sivak and Todd Park (her predecessors in this role) leave pretty damn big shoes to fill. Someone in the White House/HHS is casting a net wide enough to know who the really transformational thinkers in our industry are.

I have to admit, I am still reeling from this news. I am usually pretty good at figuring out what the implications of something are… at calculating where the hockey puck is going… But I really have no idea what the implications of this are going to be… other than to say:

This is going to matter, in precisely the way that most things in healthcare reform don’t.

-FT

Does Epic resist or support interoperability? Hell if I know.

I just realized that my somewhat infamous question at the ONC annual meeting is recorded on video!

The background on my question, which made me very popular at the meeting afterwards, was that I had heard that Epic hired a lobbyist to convince Congress that it is an interoperable company.

That lobbyist and others at Epic have been heard saying things like “Interoperability is Epic’s strength” and “Epic is the most open system I know”, etc., etc. This makes me think “what planet am I on?”

I have actually heard of hospitals being told “no at any price” by Epic, and I have never heard that regarding another vendor… although there are lots of rumors like that about Epic, I would prefer to be fair. How would I know if Judy et al. had really turned a corner on interoperability? Epic has been a faithful participant in the Direct Project, which is the only direct (see what I did there?) experience I have had with them.

But I want data… and here is what happened when I asked for it at the annual ONC meeting. Click through to see the video… it auto-plays, so I did not want it on my main site.
