The Federal Government Recommends JSON

It is the policy of many governments to support transparency with the release of Open Data. But few understand how important it is that this Open Data be released in machine-readable, openly available formats. I have already written a lengthy blog post about how, most of the time, the CSV standard is the right data standard to use for releasing large open data sets. But JSON, XML, HTML, RTF, TXT, TSV and PDF files, which are all open standard file formats, each have their place as appropriate data standards for governments to use as they release Open Data.

But it can be difficult to explain to someone inside a government or non-profit, who is already releasing Open Data, that CSV is a good standard but XLSX (Microsoft Excel) is not. For many people, a CSV really is an Excel file, so there is no difference in their direct experience. But for those of us who want to parse, ETL or integrate that data automatically, there is a world of difference in the level of effort required between a clean CSV and a messy XLSX file (not to mention the cybersecurity implications).

A few months ago (sorry, I get distracted), Project Open Data, the policy website maintained and governed jointly by the Office of Management and Budget and the Office of Science and Technology Policy of the US Federal Government, accepted a pull request I made that adds the W3C and IETF as sources of Open Data Format Standards. As I had expected, not including the IETF and W3C in the list of sources of Open Standards was an omission and not a conspiracy (sometimes I panic).

This is a very important resource for those of us who advocate for Open Data. It means that we can use a single URL, specifically this one:

https://project-open-data.cio.gov/open-standards/

to indicate that it is the policy of the United States Federal Government not only to release Open Data, but to do so using specific standards that are also open. Now that the W3C and IETF are added, the following data standards are by proxy included in the new policy regarding open data standards:

Obviously these four standards make up almost all of the machine-readable Open Data that is already easy to work with, and with a few exceptions represent the data formats that 95% (my guesstimate) of all Government data should be released in. In short, while there are certainly other good standards, and even cases where we must tolerate proprietary standards for data, most of the data that we need to release should be released in one of these four data formats.

For those of us who advocate for reasonableness in Open Data releases, this is a pretty big deal. We can now simply include a few links to publicly available policy documents rather than arguing independently for the underlying principles.

And because the entire Project Open Data website is so clear, concise and well-written, and because it comes with the implicit endorsement of the US Federal Government (OMB and OSTP), this is a wonderful new resource for advocating with national, state, city, local and international governments for the release of Open Data using reasonable data formats. Hell, we might even be able to get some of the NGOs to consider releasing data correctly because of this. My hope is that this will make complaining about proprietary-format data releases easier, and therefore more frequent, and help us to educate data releasers on how to make their data more useful. That, in turn, will make it easier for data scientists, data journalists, academics and other data wonks to create impact using the data.

My applause to the maintainers and contributors to Project Open Data.

-FT

NDC search improves again

As I mentioned recently, the NDC search page improved to include more data and CSV downloads.

Now it is using tabular results instead of the accordion method. I love the improvement.

The old results looked like this:

[Image: ndc_improvements]

The new results look like this:

[Image: ndc_improvements_v2]

Great to see the FDA continuing to improve its data browsing capacity!

-FT

Better NDC downloads from the FDA

Recently, the FDA Division of Drug Information, Center for Drug Evaluation and Research dramatically improved how their NDC search tool data downloads work in response to some complaints they received from… someone. Most notably they:

  • Added the NDC Package Code (the NDC-10 code with the dashes) to each row as a distinct field. This is the only field that is unique per row!
  • Added the ability to download the results in plain CSV. (Previously you could only get Microsoft Excel files, which are in a proprietary data format.)

[Image: NDC search and data download improvements]

This makes the download functionality much more useful, and IMHO, that improvement makes the searching generally much more worthwhile.

Data hounds like me just download the entire NDC database, which is already available as open data. But those files use non-standard data formats and require special ETL processing to work with conveniently. Now you can make useful subsets of the NDC data and then download those subsets in an open standard. Those CSV files make working with the data in spreadsheets (other than Excel), and importing it automatically into databases, much easier.

Especially given my recent rant about using simple download formats, I think it is really important to recognize the folks at the FDA who work every day to make medication information a little more useful to the public.

Thank you!

-FT

Open Data Frustrations

First, let me say that I applaud and salute anyone who releases open data about something as relevant as healthcare. It is a tough and thankless slog to properly build, format and document open data files. Really, if you work on this, please know that I appreciate you. I value your time and your purpose in life.

But please get your shit together.

[Image: Get your shit together]

Please do not make up your own data format standards. Please use a standard that does not require me to buy expensive proprietary software to read it. The best open standards have RFCs. Choose one of those.

And most of all: if a comma-delimited file will work for your data, just use a CSV. If you were thinking, “but what if I have commas in my data?”… well, you are just wrong. CSV is an actual standard. It has ways to escape commas, and most importantly, you do not need to think about that. All you need to do is use the CSV export functionality of whatever you are working with, and it will automatically do the right thing for you.
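
To make the point concrete, here is a minimal sketch in Python, with made-up example rows: the standard csv module quotes and escapes any field that contains a comma or a quote, and every decent CSV library in other languages behaves the same way.

    import csv
    import sys

    # Made-up example rows; note the embedded comma and the embedded quotes
    rows = [
        ["code", "name"],
        ["0001-0001-01", "A drug, with a comma in its name"],
        ["0001-0002-01", 'A drug with a "quoted" name'],
    ]

    # The writer handles quoting and escaping automatically
    writer = csv.writer(sys.stdout)
    writer.writerows(rows)

    # Output (roughly):
    # code,name
    # 0001-0001-01,"A drug, with a comma in its name"
    # 0001-0002-01,"A drug with a ""quoted"" name"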

You are not doing yourself any favors by creating a fixed-length file structure. Soon, you will find that you did not really account for how long last names are. Or you will find an address that is longer than 40 characters. Or the people at the FDA will add another digit to sort out NDC codes… or whatever. CSV files mean that you do not have to think about how many characters your data fields use. More importantly, it means that I do not need to think about it either.

You might be thinking “We should use JSON for this!” or “XML is an open standard”. Yes, thank you for choosing other good open formats… but for very large data sets, you probably just want to use a CSV file. The people at CMS thought JSON would be a good standard to use for the Qualified Health Plan data, and they did in fact design the standard so you could keep the JSON files to a reasonable size. But the health insurance companies have no incentive to keep their JSON files a reasonable size, and so they publish multi-gigabyte JSON files, which are hugely painful to download and a pain to parse.

Just use CSV.

I was recently working with the MAX Provider Characteristics files from Medicaid. Here are the issues I had.

  • They have one zip file from 2009 that empties into a directory with the same name as the zip file. That means the zip file will not extract, because it is trying to write to a directory with the same name as the original file. I have to admit, I am amazed that this mistake is even possible.
  • In 2009, the zip files created subdirectories. In 2010 and 2011 they dumped everything into the current directory, tar-bomb style. (Either way is fine; pick one.)
  • Sometimes the file names of the ‘txt’ files are ALL CAPS and sometimes not, even within the same year’s data.
  • Sometimes the state codes are upper case like ‘WI’ and ‘WV’, sometimes they are camel case ‘Wy’ and ‘Wa’, sometimes they are lowercase ‘ak’ and ‘al’. Of course, we also have ‘aZ’.
  • Usually the structure is StateCode.year.maxpc.txt, like GA.2010.maxpc.txt, except for that one time when they wrote it FL.Y2010.MAXPC.TXT.
  • The actual data in the files is in fixed-length format. Each year, you have to confirm that all of the field lengths are the same in order to ensure that your parser will continue to work (a rough parsing sketch follows this list).
  • They included instructions for importing the data files in SAS, the single most expensive data processing tool available. Which is, of course, what they were using to export the data.
  • They did not include instructions for any of the most popular programming languages. SAS does not even make the top 20 list.
  • There are multiple zip files, each with multiple files inside. We can afford a download that is over 100 MB in size. Just make one. single. CSV file. Please.
  • Sometimes the files end in .txt. Other times they just end in a ‘.’ (period).
  • The files are not just text files, they have some cruft at the beginning that ensures that they are interpreted as binary files.
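
For what it is worth, here is a rough sketch of what dealing with files like these tends to look like: normalize the inconsistent file names, then parse the fixed-width layout with pandas. The directory, column offsets and field names below are made up for illustration; the real MAXPC layout has to be confirmed against each year's documentation, which is exactly the problem.

    import glob
    import os

    import pandas as pd

    # Hypothetical (start, end) offsets and field names -- NOT the real MAXPC layout
    COLSPECS = [(0, 2), (2, 6), (6, 18)]
    NAMES = ["state_cd", "year", "provider_id"]

    frames = []
    for path in glob.glob("maxpc_files/*"):  # hypothetical directory of extracted files
        # Normalize ALL CAPS vs. lower case names and the odd trailing '.'
        name = os.path.basename(path).lower().rstrip(".")
        if "maxpc" not in name:
            continue
        frames.append(pd.read_fwf(path, colspecs=COLSPECS, names=NAMES, dtype=str))

    maxpc = pd.concat(frames, ignore_index=True)
    print(maxpc.shape)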

Now how does that make me feel as someone trying to make use of these files? Pretty much like you might expect.

I love open data in healthcare. But please, please, start using simple, easy-to-use data standards. Get your shit together. I spend too much time hacking on ETL when I need to focus on things that change the world. And guess what… you need me to focus on those things too.

So if you are reading this (and you might very well be, because I specifically referred you to this rant), please do the right thing.

Soon, this advice will likely be formally compatible with the Open Data policies of the Federal Government.

  1. Use an open standard for your data
  2. Use CSV if you can
  3. Are you ABSOLUTELY SURE that you cannot use CSV?
  4. Use JSON if you cannot use CSV
  5. Use XML if you cannot use CSV or JSON
  6. Are you looking to compress and manage massive amounts of data, moving it around at a furious rate, in an almost-binary compressed format? Perhaps try Protocol Buffers.
  7. Find that Protocol Buffers page confusing? That is because you should be using CSV. So just use CSV.
  8. Make your data and file naming consistent, so that a machine can process it.

This way, we can have all of the wonderful tools for CSV data processing available to us. Joy!

Thank you.

Updated Mar 22 2017 (added protocol buffers and links)

Hacking on the Wikipedia APIs for Health Tech

Recently I wrote about my work hacking on the PubMed API, which I hope is helpful to people. Now I will cover some of the revelations I have had while working with DocGraph on the Wikipedia APIs.

This article will presume some knowledge of the basic structure of open medical data sets, but we have recently released a pretty good tool for browsing the relationships between the various data sets: DocGraph Linea (that project was specifically backed by Merck, both financially and with coding resources, and they deserve a ton of credit for it working as smoothly as it does).

OK, here are some basics to remember when hacking on the Wikipedia APIs from a clinical angle. Some of this will apply to Wikipedia hacking in general, but much of it is specifically geared towards understanding the considerable clinical content that Wikipedia and its sister projects possess.

First, there is a whole group of editors who might be interested in collaborating with you at Wikiproject Medicine. (There is also a Wikiproject Anatomy, which ends up being strongly linked to clinical topics for obvious reasons.) In general, you should think of a Wikiproject as a group of editors with a shared interest in a topic who collectively adopt a group of articles. Lots of behind-the-scenes things on Wikipedia take place on talk pages, and the connection between Wikiprojects and specific articles is one of them. You can see the connection between Wikiproject Medicine and the Diabetes article, for instance, on the Diabetes Talk page.

Wikiproject Medicine maintains an internal work list that is the best place to understand the fundamental quality levels of all of the articles that it oversees. You can see a summary of this report embedded in the project page and also here. There is a quasi-API for this data: using the quality search page, you can get articles that are listed as “C quality” but are also “High Priority”.

Once a clinical article on Wikipedia has reached a state where the Wikipedian community (Wikipedian is the nickname for Wikipedia contributors and editors) regards it as either a “good” article or a “featured” article, it can generally be considered highly reliable. To prove this, several prominent healthcare Wikipedians converted the “dengue fever” Wikipedia article into a proper medical review article, and then got that article published in a peer-reviewed journal.

All of which is to say: the relative importance and quality of Wikipedia articles is something that is mostly known and can be accessed programmatically if needed. For now, “programmatically” means parsing the HTML results of the quality search engine above; I have a request in for a “get json” flag, which I am sure will be added “real soon now”.

The next thing I wish I had understood about Wikipedia articles is the degree to which they have already been data-mined. Most of the data linking for Wikipedia articles started life in “infoboxes”, which are typically found at the top right of clinically relevant articles. They look like this:

[Images: ethanol and diabetes infoboxes]

The Diabetes infobox contains links to ICD-9 and ICD-10 as well as MeSH. Others will have links to SNOMED or CPT as appropriate. The ethanol article has tons of stuff in it, but for now we can focus just on the ATC code entry. Not only does it have the codes, it correctly links to the relevant page on the WHO website.

An infobox is a template on Wikipedia, which means it is a special kind of markup that can be found inside the wikitext for a given article. Later we will show how we can download the wikitext, but for now I want to assure you that the right way to access this data is through Wikidata; parsing wikitext is not something you need to do in order to get at this data. (This sentence would have saved me about a month of development time, if I had been able to read it.)

For instance, we can get the ATC codes for ethanol via the Wikidata API.
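
A minimal Python sketch of that lookup might look like the following; it assumes that Q153 is the Wikidata item for ethanol and that P267 is the ATC code property, so verify both identifiers before relying on them:

    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def get_atc_codes(item_id="Q153", atc_property="P267"):
        # Ask Wikidata for the claims attached to the item
        params = {
            "action": "wbgetentities",
            "ids": item_id,
            "props": "claims",
            "format": "json",
        }
        entity = requests.get(WIKIDATA_API, params=params).json()["entities"][item_id]
        claims = entity.get("claims", {}).get(atc_property, [])
        # Each claim's mainsnak holds the ATC code as a plain string
        return [
            c["mainsnak"]["datavalue"]["value"]
            for c in claims
            if "datavalue" in c["mainsnak"]
        ]

    print(get_atc_codes())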

Most of this data mining is found in the Wikidata project. Let's have a brief 10,000-foot tour of the resources that it offers. First, there are several clinically related data points that it tracks. This includes ATC codes, which are the WHO-maintained codes for medications. (It should be noted that recent versions of RxNorm can link ATC codes to NDC codes, which are maintained by the US FDA and are being newly exposed by the OpenFDA API project.)

I pulled all of the tweets I made from Wikimania about this into a Storify.

Other things you will want to do, in no particular order:

Once you have wikitext, it is pretty easy to mine it for PMIDs so that you can use the PubMed API. I used regular expressions to do this, which occasionally misses some PMIDs. I think there is an API way to do this perfectly, but I cannot remember what it is…
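
Here is a rough Python sketch of that approach: pull an article's wikitext through the API, then regex out PMIDs from the citation templates. The pattern is deliberately simple and, as noted, it will occasionally miss some.

    import re
    import requests

    WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"

    def get_wikitext(title):
        # Fetch the raw wikitext of an article via the parse endpoint
        params = {
            "action": "parse",
            "page": title,
            "prop": "wikitext",
            "format": "json",
            "formatversion": 2,
        }
        return requests.get(WIKIPEDIA_API, params=params).json()["parse"]["wikitext"]

    def find_pmids(wikitext):
        # Catches "pmid = 1234567" in {{cite ...}} templates and bare "PMID 1234567"
        pattern = re.compile(r"pmid\s*=?\s*(\d{4,9})", re.IGNORECASE)
        return sorted(set(pattern.findall(wikitext)))

    print(find_pmids(get_wikitext("Dengue fever")))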

That's a pretty good start. Let me know if you have any questions. I will likely expand on this article when I am not so sleepy…

-FT