Google Intrusion Detection Problems

(Update: Less than four hours after tweeting out this blog post, we got our access turned back on. So the Google support team is definitely listening on social media. We very much appreciate that, because it resolves the crisis portion of this issue. We are still concerned by the “auto-off” trend and the missing button, and we will be working to make sure there is a better long-term solution. Will update this post as appropriate moving forward.)

So today our Google Cloud account was suspended. This is a pretty substantial problem, since we had committed to leveraging Google Cloud at DocGraph. We presumed that Google Cloud was as mature and battle-tested as other carrier-grade cloud providers like Microsoft, Rackspace and Amazon. But it has just been made painfully clear to us that Google Cloud is not mature at all.

Last Thursday, we were sent a message titled “Action required: Critical Problem with your Google Cloud Platform / API Project ABCDE123456789”. Here is that message:


Which leads to our first issue: Google refers to the project by its id, and not by its project name. It took us a considerable amount of time to figure out what they were talking about when they said “625834285688”; we probably lost a day trying to figure out what that meant. This was the first indication that they would be communicating with us in a manner deeply biased towards how they view their cloud service internally, totally ignoring what we were seeing from the outside. While that was the first issue, it was nowhere near the biggest.

The biggest issue is that it was not possible to complete the “required action”. That's right, Google threatened to shut our cloud account down in 3 days unless we did something… but made it impossible to complete that action.

Note that they do not actually detail the action needed in the “action required” email. Instead they refer to a FAQ, where you find these instructions:


So we did that… and guess what: we could not find the blue “Request an appeal” button anywhere. So we played a little “Where's Waldo” on the Google Cloud console.

  • We looked where they instructed us to.
  • We looked at the obvious places.
  • We looked at the not-obvious places.

As far as we can tell, there was no “Request an appeal” button anywhere in our interface. Essentially, this makes actually following the request impossible.

So we submitted a support request saying “Hey you want us to click something, but we cannot find it” and also “what exactly is the problem you have with our account in any case?”

However, early yesterday morning, despite us reaching out to their support services to figure out what was going on, Google shut our entire project down. Note that we did not say “shut down the problematic server” or even “shut down all your servers”. Google Cloud shut down the entire project. Although we use multiple Google Cloud APIs, we thought it made sense to keep everything we were doing on the same project. For those wondering, that is a significant design flaw: Google has fully automated systems that can shut down entire projects, and those shutdowns apparently cannot be manually overridden. (Or at least, they were not manually overridden for us.)

We have lost access to multiple critical data stores because Google has an automated threat detection system that is incapable of handling false positives. This is the big takeaway: It is not safe to use any part of Google Cloud Services, because their threat detection system has a fully automated allergic reaction to anything it has not seen before, and it is capable of taking down all of your cloud services, without limitation.

So how are we going to get out of this situation? Google offers support solutions where you can talk to a person if you have a problem. We view it as problematic to treat interrupting an “allergic reaction” as a “support issue”. Still, we would be willing to purchase top-tier support in order to get this resolved quickly. But there does not appear to be an option to purchase access to a human after the fact. Apparently, we should have thought about that before our project was suspended.

Of course, we are very curious as to why our account was suspended. As data journalists, we are heavy users of little-known web services. We suspect that one of our API client implementations looked to Google's threat detection algorithms like it was “hacking” in one way or another. There are other, much less likely explanations, but that is our best guess as to what is happening.

But we have no idea what the problem is, because Google has given us no specific information about where to look for it. If we were actually doing something nefarious, we would know which server was the problem and exactly how we were breaking the rules. But because we are (in one way or another) a false positive in their system, we have no idea where to even start looking for the traffic pattern that Google finds concerning.

Now when we are logged in, we simply see an “appeal” page that asserts, boldly, “Project is in violation of Google’s Terms of Service”. There is no capacity for conversation, and filling out the form appears to simply loop back to the form itself.

It hardly matters: Google's support system is so completely broken that this issue represents a denial-of-service attack vector. The simplest way to take down any infrastructure that relies on Google would be to hack a single server, and then send really obvious hack attempts outbound from that server. Because Google ignores inbound support requests and has a broken “action required” mechanism, the automated systems will take down an entire company's cloud infrastructure, no matter what.

Obviously, we will give Google a few hours to see if they can fix the problem, and we will update this article if they respond in a reasonable timeframe. But we will likely have to move our infrastructure to a cloud provider that has a mature user interface and support ticketing system. While Google Cloud offers some powerful features, it is not safe to use until Google abandons its “guilty until proven innocent, without an option to prove” threat response model.







Mourning Jess Jacobs

Yesterday, Jess Jacobs died.

Like so many others on Twitter, I knew Jess just well enough to be profoundly frustrated as I watched helplessly while the healthcare system failed her again and again. Today, the small world of Internet patient advocates mourns for her across blogs and Twitter. The world of people who are trying to fix healthcare from underneath is small, and relationships that form on social media around a cause can be intense. There is nothing like an impossible, uphill battle to make lasting friendships. Now this community is responding to the loss of not only one of our own, but one of our favorites.

Is the NSA sitting on medical device vulnerabilities?

Today is not a fun day to read Slashdot if you care about healthcare cybersecurity. First, it highlights how the DEA is strong-arming states into divulging the contents of their prescription databases.

Second, and even more troubling, was the claim that the NSA was looking to exploit medical devices. The story was broken by Intercept reporter Jenna McLaughlin. Since then, the article has been picked up by The Verge. Their title is even more extreme: “The NSA wants to monitor pacemakers and other medical devices”. Jenna did not specifically mention where she heard the comments, but her Twitter feed gave me a hint.

The comments were from NSA deputy director Richard Ledgett, who is the same guy that countered Snowden's TED talk with one of his own. He was speaking at the Defense One Tech Summit. It is incredibly hard to find, but his comments are available as a video; he goes on almost exactly at the 3-hour mark. I tried to embed the talk below, YMMV.

In one sense this has been blown out of proportion. Patrick Tucker is the moderator/interviewer here, and he is the one that is pressing Ledgett on the issue of biomedical devices. Start at 3:15 for the discussion on medical devices.

Ledgett insists that targeting medical devices is not a priority for the NSA. But the troubling thing is his answer to the first question:

Question: “What is your estimation of their security?”

Answer: “Varies a lot depending on the device and the manufacturer.”

The problem with this is that I know of no examples of the NSA releasing data on insecure medical devices. In fact, the FDA has recently released information about specific medical devices that were insecure, without giving credit to the NSA.

This means that the NSA is investigating the security of medical devices, but then not releasing that information to the public. Ironically, it is a quantified-self device that is the most illustrative example here. Ledgett specifically highlights Fitbit, which I know had some pretty strange wireless behavior in its early versions (behavior that many regarded as insecure). So we know they have looked at one specific device, but there has been no release of information from the NSA on that device. At least, I cannot find any.

If indeed the NSA is investigating medical devices, and is not releasing all of that information to the FDA, device manufacturers and the public, then that is a huge problem.

I am still thinking about this, but it does not look good.

I suppose I should also mention that I ran across the interesting fact that Osama Bin Laden was using a CPAP machine.

Update: I have submitted a FOIA request for access to vulnerabilities about “healing devices” and it has been denied.

Clinton's Server and Politifact

Most of the time that I spend as a security-wonk is focused on email security. This is due almost entirely to my involvement as one of the architects of the Direct Project, which is a specification for using secure encrypted email in healthcare settings.

Which is why I was surprised by a recent analysis from Politifact evaluating something that Hillary Clinton said about her email servers. I should mention that I am apolitical. I care, but both US parties fail to resonate with me. So I have no reason to pick one side of this debate over the other. I am interested in the implications and perceptions of Hillary’s email system, however, because it is very revealing of basic attitudes about email systems.

For those that do not know, Politifact is an organization that evaluates the veracity of specific statements that politicians make. Given my attitude about politics, you can understand why I am a fan of such a service. The statement that Hillary Clinton made, and that Politifact was evaluating, was that “my predecessors did the same thing” regarding her email practices.

Politifact said:

And there’s a big difference between a private account, which is generally free and simple to start, and a private server, which requires a more elaborate setup…. The unorthodox approach has opened up questions about her system’s level of security.

later concluding:

This is a misleading claim chiefly because only one prior secretary of state regularly used email, Colin Powell. Powell did use a personal email address for government business, however he did not use a private server kept at his home, as Clinton did.

We rate this claim Mostly False.

The central assumption that Politifact is making is that Clinton’s email server was fundamentally less secure than using a service. Specifically, Colin Powell used AOL. In fact, for the average person, you are probably better off using a service like AOL. But Hillary Clinton and Colin Powell are hardly average people. There are considerable advantages to having your own email server and your own domain, if you are particularly concerned with security.

First, all of the email services are constantly the targets of hackers. If someone broke into AOL, they could find Colin Powell’s account as a side effect of the overall hack. It would be a bonus for hacking a system that is already regarded as a high-value target. Second, it is still relatively easy to spoof email. That means that it is fairly simple for someone to send emails pretending to be a particular person on a public email service. So if I had wanted to pretend to be Colin Powell, it would have been a little easier to get away with it, given that he was using an email service. It would be much easier to set up specific defenses (there is not that much you can do without encryption of some kind) to combat spoofing on your own server.

Unless Colin Powell had some special relationship with AOL (which is actually a real possibility), login attempts from Eastern Europe to his account would not have been flagged in any way. On a private server, however, you could always say “Is Secretary Clinton in Eastern Europe today? No. Then that login attempt is a problem”. Of course, if you are not watching the logs on your private server, then this advantage is negated.

As it turns out, Clinton was also publicly serving up Windows Remote Desktop on her server, which makes it unlikely that she was taking the steps needed to get the security benefits. Even with that information, however, I cannot see the merit to the assumption that using AOL vs hosting your own Exchange server is fundamentally less or more secure for a public official like this.

Ultimately, when you trust an organization like AOL, you are effectively trusting thousands of people all at once. Clinton probably trusted somewhere between 10 and 100 people with the contents of her email server. Colin Powell probably trusted somewhere between 1000 and 10000. If I were making suggestions for the security of my grandmother's email… I would go with AOL. If I were making suggestions for the Secretary of State? It is much less clear, and would depend a lot on how the two different email systems were configured and regularly used.

As almost always, the Wikipedia article on the subject is a tremendous source of the kind of detail that a security researcher like me might need to evaluate whether there were security advantages or disadvantages to hosting your own server. Still, it is obvious that Colin Powell trusted state secrets to a massive Internet provider, and Hillary Clinton trusted state secrets to a small team of generalist (i.e. not security) consultants. Neither of those decisions was well informed by proper security thinking for emails that might eventually become state secrets.

So from my perspective as a security researcher with a focus on email security, it is a pretty fair statement for Hillary to say “My recent bad decision about email is equivalent to previous bad decisions made by members of the other party”.

Which means I think Politifact got it wrong. What is more interesting is why. They got it wrong because they made some flawed assumptions. This is deeply ironic, because that is precisely the same mistake that both Clinton and Powell made about exactly the same issue.

But I also think this is a problem with the way that technical options are presented. Politifact quotes Clifford Neuman as saying “you would need to stay current on patches”. I can promise you that this is not all Clifford had to say on the matter, but this is the only thing that Politifact chose to surface. The reality of the technical issues is a huge debate about whether Software as a Service is more secure than locally deployed and supported software. Locally deployed software clearly can be made more secure, because one can choose to enforce parameters that improve security at the expense of convenience (two-factor authentication, for instance). However, Software as a Service is usually more secure in practice, because you have teams of people ensuring that bare minimums are always met.

I really could not care less about Clinton’s or Powell’s choices when it comes to email servers. It is a little silly to accuse people who get to decide what is classified and what is not of mishandling classified information. Personally, I think the fact that Clinton was exposing an RDP connection to the public Internet is the only truly scandalous thing I have heard here, and it is clearly not the focus of the media circus. I do not care at all about the political side of this.

I am very concerned, however, about how novices think about complex security and privacy issues. How did Politifact, which is charged with getting to the bottom of this issue, end up discussing precisely none of these complex technical issues? The conclusion they reached is pretty shallow. Which I do not think is their failing… I think it is a symptom of dogmatic thinking in InfoSec messaging.

Still not done thinking about this.




Hacking on the Wikipedia APIs for Health Tech

Recently I wrote about my work hacking on the PubMed API. Which I hope is helpful to people. Now I will cover some of the revelations I have had working with DocGraph on the Wikipedia APIs.

This article will presume some knowledge of the basic structure of open medical data sets, but we have recently released a pretty good tool for browsing the relationships between the various data sets: DocGraph Linea (that project was specifically backed by Merck, both financially and with coding resources, and they deserve a ton of credit for it working as smoothly as it does).

OK, here are some basics to remember when hacking on the Wikipedia APIs if you are doing so from a clinical angle. Some of this will apply to Wikipedia hacking in general, but much of it is specifically geared towards understanding the considerable clinical content that Wikipedia and its sister projects possess.

First, there is a whole group of editors that might be interested in collaborating with you at Wikiproject Medicine. (There is also a Wikiproject Anatomy, which ends up being strongly linked to clinical topics for obvious reasons.) In general you should think of a Wikiproject as a group of editors with a shared interest in a topic who collectively adopt a group of articles. Lots of behind-the-scenes things on Wikipedia take place on talk pages, and the connection between Wikiprojects and specific wiki articles is one of them. You can see the connection between Wikiproject Medicine and the Diabetes article, for instance, on the Diabetes talk page.

Wikiproject Medicine maintains an internal work list that is the best place to understand the fundamental quality levels of all of the articles that they oversee. You can see the summary of this report embedded in the project page and also here. There is a quasi-API for this data: using the quality search page, you can get, for example, all articles that are listed as “C quality” but are also “High Priority”.

Once a clinical article on Wikipedia has reached a state where the Wikipedian community (Wikipedian is the nickname for Wikipedia contributors and editors) regards it as either a “good” article or a “featured” article, it can generally be considered highly reliable. To prove this, several prominent healthcare Wikipedians converted the “Dengue fever” Wikipedia article into a proper medical review article, and then got that article published in a peer-reviewed journal.

All of which is to say: the relative importance and quality of Wikipedia articles is something that is mostly known, and it can be accessed programmatically if needed. For now, “programmatically” means parsing the HTML results of the quality search engine above; I have a request in for a “get json” flag… which I am sure will be added “real soon now”.
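In the meantime, one JSON-friendly route worth knowing about is the MediaWiki PageAssessments query prop, which exposes the same quality class and importance ratings where that extension is enabled. A rough sketch in Python (the parameter and key names are assumptions on my part, so check them against a live response):

    import requests

    # Ask the English Wikipedia API for per-Wikiproject quality class and
    # importance ratings via prop=pageassessments (assumes the PageAssessments
    # extension is enabled on this wiki).
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "pageassessments",
            "titles": "Diabetes mellitus",
            "format": "json",
            "formatversion": 2,
        },
    )
    for page in resp.json()["query"]["pages"]:
        for project, rating in page.get("pageassessments", {}).items():
            print(page["title"], project, rating.get("class"), rating.get("importance"))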

The next thing I wish I had understood about Wikipedia articles is the degree to which they have been pre-datamined. Most of the data linking for Wikipedia articles started life as “infoboxes” which are typically found at the top right of clinically relevant articles. They look like this:

[Screenshots: the ethanol infobox and the diabetes infobox]

The Diabetes infobox contains links to ICD-9 and ICD-10 as well as MeSH. Others will have links to SNOMED or CPT as appropriate. The ethanol article has tons of stuff in it, but for now we can focus just on the ATC code entry. Not only does it have the codes, it correctly links to the relevant page on the WHO website.

An infobox is a template on Wikipedia, which means it is a special kind of markup that can be found inside the wikitext for a given article. Later we will show how to download the wikitext. But for now, I want to assure you that the right way to access this data is through Wikidata; parsing wikitext is not something you need to do in order to get at it. (This sentence would have saved me about a month of development time, if I had been able to read it.)

For instance, here is how we get the ATC codes for ethanol via the Wikidata API:
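(What follows is a minimal sketch of the idea in Python; it assumes that Wikidata property P267 is the “ATC code” property, so verify the property id before relying on it.)

    import requests

    # Look up the Wikidata entity behind the English Wikipedia "Ethanol"
    # article and print its ATC code claims (property P267, assumed here).
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "sites": "enwiki",
            "titles": "Ethanol",
            "props": "claims",
            "format": "json",
        },
    )
    for entity_id, entity in resp.json()["entities"].items():
        for claim in entity.get("claims", {}).get("P267", []):
            print(entity_id, claim["mainsnak"]["datavalue"]["value"])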

Most of this data mining is found in the Wikidata project. Let's have a brief 10,000-foot tour of the resources that it offers. First, there are several clinically related data points that it tracks. This includes ATC codes, which are the WHO-maintained codes for medications. (It should be noted that recent versions of RxNorm can link ATC codes to NDC codes, which are maintained by the US FDA and are being newly exposed by the OpenFDA API project.)

I pulled all of the tweets I made from Wikimania about this into a Storify.

Other things you want to do in no particular order:

Once you have the wikitext, it's pretty easy to mine it for PMIDs so that you can use the PubMed API. I used regular expressions to do this, which does occasionally miss some PMIDs. I think there is an API way to do this perfectly, but I cannot remember what it is…
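For what it's worth, here is a rough Python sketch of that approach (the parse API's prop=wikitext is the documented way to pull raw wikitext; the article title and the exact regex are just illustrations):

    import re
    import requests

    # Grab the raw wikitext for an article and regex out PMIDs from its
    # citation templates. As noted above, a regex will occasionally miss some.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "parse",
            "page": "Diabetes mellitus",
            "prop": "wikitext",
            "format": "json",
            "formatversion": 2,
        },
    )
    wikitext = resp.json()["parse"]["wikitext"]

    pmids = sorted(set(re.findall(r"pmid\s*=\s*(\d+)", wikitext, flags=re.IGNORECASE)))
    print(pmids)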

That's a pretty good start. Let me know if you have any questions. Will likely expand on this article when I am not sleepy…


Susannah Fox is the new CTO of HHS

I am not actually sure that anybody reads this blog. I suppose they must; that is really the magic of RSS… letting you know when your friends blog… right??

Still. If you wanted to actually follow what I am doing, you should probably be reading the DocGraph Blog, or the CareSet Blog, or the OpenSourceHealth News Page. I just don't tend this blog the way I should… But when I have news that I think is deeply interesting and I cannot find a category for it anywhere, this is the perfect place.

Discovering that my dear friend Susannah Fox is now the CTO of HHS is just that kind of category-defying important news.

I cannot think of a better person for a role like this. I mean that literally. I tried. I failed.

Susannah is a geek, enough of one that she cannot easily be snowed by other technologists (I should be explicit: I am talking about government contractors), even where she does not have direct technical expertise. Not every geek I know can do that. On the other hand, she is not so much of a geek that people find her arrogant or incomprehensible. (I have problems with both.) Most of that job will not be directly geeky stuff. That sounds contradictory, but HHS is just too large to have any one technical strategy. There is no way that a reasonable technical vision for the FDA would apply at CMS or at NLM. Being the CTO at HHS is about seeing the connections, understanding how things fit together, and then having a vision that is not actually technology-centric, but patient-centric.

As technology-savvy as Susannah is, it is her capacity to hold a huge vision, and to keep patients at the center of that vision, that makes her so deeply qualified for this job. Nobody asked me who the next CTO was going to be, and frankly I was a little worried about who would be next. Bryan Sivak and Todd Park (her predecessors in this role) leave pretty damn big shoes to fill. Someone in the White House/HHS is casting a net wide enough to know who the really transformational thinkers in our industry are.

I have to admit, I am still reeling from this news. I am usually pretty good at figuring out what the implications of something are… at calculating where the hockey puck is going… But I really have no idea what the implications of this are going to be… other than to say:

This is going to matter, in precisely the way that most things in healthcare reform don’t.


Does Epic resist or support interoperability? Hell if I know.

I just realized that my somewhat infamous question at the ONC annual meeting is recorded on video!

The background on my question, which made me very popular at the meeting afterwards, was that I had heard that Epic hired a lobbyist to convince Congress that it is an interoperable company.

That lobbyist and others at Epic have been heard saying stuff like “Interoperability is Epic's strength”… and “Epic is the most open system I know”, etc. This makes me think, “What planet am I on?”

I have actually heard of hospitals being told “no at any price” by Epic, and I have never heard that regarding another vendor… although there are lots of rumors like that about Epic, I would prefer to be fair. How would I know if Judy et al. had really turned a corner on interoperability? Epic has been a faithful participant in the Direct Project, which is the only direct (see what I did there?) experience I have had with them.

But I want data… and here is what happened when I asked for it at the annual ONC meeting. Click through to see the video… it autoplays, so I did not want it on my main site.


Libel and Discourse in the Digital Age

Libel, like copyright, is one of the central legal frameworks governing online activities. It sets the bounds for what can and cannot be said about people in the new media era. Like copyright law, libel law is a legal framework designed in a pre-digital era, and it is somewhat strained in this new digital media age.

I write this with some trepidation. This blog post touches on gender issues on Twitter, and that is a heated and, at least on Twitter, mostly broken discussion.

Any discussion of sensitive issues online, especially on Twitter, can devolve into a core of reasonable people trying to have reasonable discussions, surrounded by a much larger group of people (or at least a large number of Twitter accounts) who say completely ridiculous and incendiary things. Jimmy Wales's response to a GamerGate email regarding the policies for Wikipedia's GamerGate article is required reading here.

The wonderful thing about Twitter is that it facilitates open-to-the-public conversations about anything at all. These conversations usually involve only people who are genuinely interested in a particular topic, which means that the Twitter conversation is usually representative of the topic as it exists in the real world. But a given hashtag is useful and productive only to the degree that people all agree on what the topic under discussion is, and also fundamentally agree on the appropriate means to have that conversation.

Many times, both of those constraints fail, and this is when you get a single hashtag, like #GamerGate, being used in multiple conflicting ways. One way is to have a discussion about “Ethics in Game Journalism”, the second is to launch a coordinated attack on female game journalists and other feminists, and the third is the feminist community using the hashtag to refer to those attacks. In the sense that all three things are happening at once using the same hashtag on Twitter, all of them are equally valid and equally invalid uses of the hashtag. But all three discussions regularly lament that the other two discussions are trying to “redefine” what “GamerGate” “is”. The letter from Jimmy Wales helped me realize that there is an inherent difference between a movement and a hashtag. Before reading that, I was deeply confused about how to think about “GamerGate”, a word whose definition changes dramatically depending on who you listen to.

Generally I think the power of Twitter lies in its capacity to have public conversations that serve only as “signals”, with larger discussions left to forums that are better suited for comprehensive discussion, like blogs. Twitter is ill-designed to handle contentious issues, in part because tweets are necessarily atomic in nature. It is too easy to take a single tweet and then lambast that single tweet as the entirety of someone's position. This is not strictly a straw-man tactic, because it actually takes a little work to get Twitter to contextualize any discussion. Twitter presents tweets as atoms, not as threads on a topic.

On Twitter, there is a lot of “What I said was X, but what I meant was Y”. As an informaticist, I would call Twitter something like a “Communication Platform with Low Semantic Fidelity”. Which is not an insult to the platform… this is both a “feature” and a “bug”, depending.

So it is with great irony that I found myself having a discussion about libel, on the very platform that makes the issues around libel so complex.

For those who have been living under a rock: on Twitter lately there has been a drama unfolding regarding the role Vivek Wadhwa plays in women's gender issues in technology. The play continues to unfold, but here is an outline of the opening scenes:

  • Wadhwa makes a statement onstage referring to “floozies”. (I have not been able to find video of this.)
  • Mary Trigani writes a post entitled Captains and Floozies criticizing Wadhwa's comment.
  • Wadhwa comments on the blog post.
  • Trigani reposts Wadhwa's comment with the title Vivek Wadhwa explains.
  • Amelia Green Hall writes QUIET, LADIES. @WADHWA IS SPEAKING NOW, which sternly criticizes the role that Wadhwa plays and how he plays it.
  • This blog post caused enough of a stir that Amelia was subsequently interviewed by Meredith Haggerty on NPR's TLDR series. This podcast (which is still available here) is essentially a retelling of Amelia's blog post in audio form, with no dissenting voice from Wadhwa or elsewhere.
  • Wadhwa reacts on Twitter, saying that the podcast is “libel and slander”.
  • NPR removes the podcast from their page, although as per normal it will be remembered forever on the Internet somewhere…
  • Twitter presumes that the post was removed because of Wadhwa's “threats”.
  • Wadhwa insists that he wants the post itself restored, and merely wants the opportunity to blog in the same space.
  • Apparently, his interactions with NPR make him believe that he will be able to publish a retort on the NPR site.
  • For whatever reason, Wadhwa's defense is not published on NPR, so he arranges for it to be published on Venture Beat instead.

Which brings us to the present. (I will try to update the timeline if things change.)

Obviously it's interesting stuff in its own right, but I am mostly interested in the issues around libel. Specifically, I am interested to understand whether it was in fact libel, and whether Wadhwa labeling it as libel was itself a “threat”.

Let's deal with the first issue. Was it libel? Well, it turns out that this is not a clear legal question, especially for Wadhwa. You see, in the US the legal test for libel typically has three components (IANAL and I am quoting Wikipedia, so you would be foolish to take this as legal advice):

  • statement was false,
  • caused harm,
  • and was made without adequate research into the truthfulness of the statement.

(from wikipedia)

Unless you are a public figure, in which case libel also includes “proving malice”. Again quoting Wikipedia:

For a celebrity or a public official, the person must prove the first three steps and that the statement was made with the intent to do harm or with reckless disregard for the truth, which is usually specifically referred to as “proving malice”

Listening to the podcast, there are several statements that stand out as specifically false:

  • …“Has he really been this spokesman for women in tech for years while he is believing that women can't be nerds, because that's like super misogynist”…
  • (on the website for Wadhwa's book) “I can get to a photo grid of women it doesn't list their names…” (Wadhwa points out that such a list lives here)
  • “Wadhwa was barely acknowledging the women he was working with”
  • Wadhwa was “Gaslighting minimizing marginalizing people who disagree with (him)”
  • The story implies that Wadhwa titled his response to Trigani's post “Vivek Wadhwa explains” when in fact Trigani had made that title.
  • The DMs that Wadhwa sent were “non-consensual”.

If you listen to the podcast, and you read Wadhwa's rebuttal, it is pretty easy to understand how Wadhwa, at least, would view these statements as false, harmful, and inadequately researched. Wadhwa is painted as a pretender, a person who is taking on the role of “real” expertise. The implication here is that there is something essential to the experience of being a woman in technology that is required to acquire legitimate expertise about women in tech. At the same time, there is the implication that the experiences of women in tech are so vastly distinct that no one person could possibly make useful statements about them as a class.

This is an interesting issue with civil rights in general. There was a time when the racial civil rights movement chose to exclude white supporters from leadership positions. This makes sense when you are dealing with a pervasive attitude that presumes that a particular class is fundamentally incapable of self-representation and/or leadership.

But there is a difference between requesting that someone bow out of a leadership role, in order to further the aims of a social justice movement, and attacking the qualifications and intentions of that same person in the most public way possible (i.e. on the radio and Internet at the same time).

On the other hand, if there is a person claiming leadership in a social movement while saying or doing things that hamper that movement, it is a natural reaction to eventually (after back-channel discussions have failed) out that person in public.

So which is it? Is this a necessary exposé in defense of an important social movement, or is it petty dramatics within a movement that should be above such theatrics?

What the hell do I know? Although I am at least a little interested in anything that qualifies as social justice, I am hardly an expert in this area. I don't know any of the parties involved and I have no familiarity with the book and body of research in question.

What I am interested in is how libel works in the Internet age. What is specifically fascinating to me is the degree to which Wadhwa is being criticized for calling the podcast “libel”. It is fairly clear to me that IF the contents of the podcast are misrepresentations, then Wadhwa is just being publicly attacked. The whole podcast was about him, not about “men speaking for women generally”, but just about him and what he was specifically doing wrong. The podcast implied that he was a lecherous, misogynistic, manipulative plagiarist. IF those things are not true about him… then does he have the right to say “this thing that is happening is slander and libel” without being accused of inappropriately using that language to squelch criticism?

According to Wadhwa, he has made no legal threat; he did not ask for the article to be taken down and, in fact, he has asked for it to be restored. That does not generally sound like the actions of someone who is seeking to muzzle critics.

What I find fascinating is the apparent consensus that merely labeling the podcast as libel IS itself a legal threat.

Here are some reactions from two lawyers who work for the EFF (an organization I admire and donate to)

And then here..

Lastly this is one specific quote from someone who has been on the other side of this.

However, I did find this gem from @DanielleMorrill, who was obviously researching this earlier than I was. She found places where Wikipedia policies cover these issues…

For my part, I cannot help but empathize with Wadhwa. My family has had some pretty nasty run-ins with people willing to publish false things about us. If someone in traditional media decides to smear you, it's nearly impossible to undo the damage. At least Wadhwa had the opportunity to tell his side of the story, an opportunity my family never got.

Apparently, the consensus on the Internet, and what I would advise people to do on this, is to just say “Hey, that stuff you wrote/said about me is not true, and it's pretty hurtful, and you really should have researched that better” instead of actually coming out and saying “That's libel”. It's pretty clear that Wadhwa tried to take a position of “You have libeled me, but I am not planning on suing you, I just want to achieve balance”, and from what I can tell, that has blown up in his face, and possibly made things worse for him.

I have certainly learned several things from this incident that will make me slightly less likely to put my foot in my mouth: Specifically…

  • I should be careful not to speak over other people on panels. I am frequently the most vocal and opinionated person on a panel. Audiences frequently ask questions specifically to/for me, and moderators will frequently favor me because I can be entertaining. But apparently when Wadhwa does the same thing he is perceived as “taking the air out of the room”, etc. I would never want my fellow panelists to feel they don't have a voice b/c of me. I will have to work on that.
  • Apparently there is a whole contingent of women who have been so completely harassed by DMs that saying something like “a non-consensual DM” actually makes sense to them. I had no idea that Twitter harassment had reached that level for women. I mean, you have to be brave or crazy to let someone know you are a female user on Reddit (which is sad), but I thought Twitter was a “safe place”. I was wrong.
  • When someone labels themselves as rude or mean, or otherwise thinks that it is a good idea to explicitly admit in their Twitter profile that they are difficult to deal with… believe them. They are not kidding. It's one of those things. Look up the Far Side cartoon captioned “How Nature says ‘Do not touch’”. It's just like that.
  • I need to be careful to explicitly not speak “for” the people I personally advocate for (which in my case is usually patients) b/c this can be disempowering. I need to find ways to advocate without being presumptuous, which is harder than it sounds.

Thanks for reading, I may well update this post based on reactions from Twitter and elsewhere.








Hacking on the PubMed API

The PubMed API is pretty convoluted. Every time I try to use it, I have to relearn it from scratch.

Generally, I want to get JSON data about an article using its PubMed ID, and I want to do searches programmatically… These are pretty basic and pretty common goals…

The PubMed API is an old-school RESTish API that has hundreds of different purposes and options. Technically the PubMed API is the Entrez Programming Utilities, and instructions for using it begin and end with the Entrez Programming Utilities Help document. Here are the things you probably really wanted to know…

How to search for articles using the PubMed API

To search PubMed you need to use the eSearch API.

Here is the example they give in the documentation…

The first thing we want to do is not have this thing return XML, but JSON instead. We do that by adding a GET variable called retmode=json. Ahh… that's better… Now let's also get more ids in each batch of the results.
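Putting db=pubmed, retmode=json, and retmax together, the eSearch call ends up looking roughly like this (the eutils.ncbi.nlm.nih.gov host and the esearch.fcgi path are the documented E-utilities ones, so double-check this reconstruction against the eSearch docs):

    https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=1000&term=YOUR+SEARCH+TERMS+HERE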

Breaking this down…

The base URL, everything up through /entrez/eutils/, is the entry point for the whole system.

esearch.fcgi is the actual function that you will be using.

db=pubmed tells the API that you want to search PubMed.

retmode=json sets the “return mode” so that JSON is returned.

And then you want to add retmax=1000 to get at least 1000 results at a time… The documentation says that you can get 100,000, but I get a 404 if I go over 1000.

The term argument holds the search terms themselves.

db and term are separated by the classic GET variable layout (the query string starts with a ? and each variable is then separated by a &). If that sounds strange to you, I suggest you learn a little more about how GET variables work in practice.

Now, about the “YOUR SEARCH TERMS HERE” part: that is a url-encoded string of arguments making up the search string for PubMed. URL encoding is (to trivialize the explanation somewhat) how you make sure that there are no spaces or other strangeness in a URL. Here is a handy way to get data into and out of url encoding if you do not know what that is…

Thankfully the search terms are well defined, just not anywhere near the documentation for the API. The simplest way to understand the very advanced search functionality on PubMed is to use the PubMed advanced query builder, or you can do a simple search and then pay close attention to the box labeled “search details” on the right sidebar. For instance, I did a simple search for “Breast Cancer” and then enabled filters for an Article Type of Review Articles and a Journal Category of “Core Clinical Journals”… which results in a search text that looks like this:

("breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]) AND (Review[ptyp] AND jsubsetaim[text])

Let's break that apart into a readable syntax display…

("breast neoplasms"[MeSH Terms] 
  OR ("breast"[All Fields] 
        AND "neoplasms"[All Fields]) 
  OR "breast neoplasms"[All Fields] 
  OR ("breast"[All Fields] 
        AND "cancer"[All Fields]) 
  OR "breast cancer"[All Fields]) 
AND (Review[ptyp] 
  AND jsubsetaim[text])

How did I get this from such a simple search? PubMed is using MeSH terms to map my search to what I “really wanted”. MeSH, which stands for “Medical Subject Headings”, is an ontology built specifically to make this task easier.

After that, it just tacked on the filter constraints that I manually set.

Now all I have to do is use my handy URL encoder… to get the url-encoded version of my search parameters.

Let's put the retmode=json ahead of the term= so that we can easily paste this onto the back of the URL… we get the following result.

I wish that my CSS could handle these really long links better… but oh well. I know it looks silly, let's move on.

To save you (well, mostly me at some future date) the trouble of cutting and pasting, here is the trunk of the URL that is just missing the url-encoded search term.
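A tiny Python sketch of that step, building the full URL from the search string above (urllib.parse.quote does the url encoding; the base here is the same eSearch endpoint used earlier):

    from urllib.parse import quote

    # Build the full eSearch URL by url-encoding the "search details" query
    # from above and appending it to the trunk of the URL.
    base = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
            "?db=pubmed&retmode=json&retmax=1000&term=")

    search = ('("breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] '
              'AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] '
              'OR ("breast"[All Fields] AND "cancer"[All Fields]) '
              'OR "breast cancer"[All Fields]) '
              'AND (Review[ptyp] AND jsubsetaim[text])')

    print(base + quote(search))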

At the time of writing, the PubMed GUI returns 2622 results for this search, and so does the API call… which is consistent, and a good check to indicate that I am on the right track. Very satisfying.

The JSON that I get back has a section that looks like this:

    "esearchresult": {
        "count": "2622",
        "retmax": "20",
        "retstart": "0",
        "idlist": [

With this result it is easy to see why you want to set retmax… getting 20 at a time is pretty slow… But how do you page through the results to get the next 1000? Add the retstart variable.
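A rough sketch of that paging loop (the requests library and the loop itself are my own illustration, not something from the PubMed docs; note that PubMed will not let you page arbitrarily deep into a result set):

    import requests

    # Page through eSearch results 1000 at a time by bumping retstart.
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {"db": "pubmed", "retmode": "json", "retmax": 1000,
              "term": "dengue fever AND Review[ptyp]"}

    all_ids = []
    for retstart in range(0, 3000, 1000):
        params["retstart"] = retstart
        result = requests.get(url, params=params).json()["esearchresult"]
        ids = result.get("idlist", [])
        all_ids.extend(ids)
        if len(ids) < 1000:
            break  # no more pages
    print(len(all_ids))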

If you need more help, here is the link to the full documentation for eSearch API again…


How to download data about specific articles using the PubMed API

There are two stages to downloading the specific articles. First, to get article metadata you want to use the eSummary API… using the ids from the idlist JSON element above… you can call it like this:
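(A sketch of that call; esummary.fcgi and its parameters are from the E-utilities documentation, and 24792655 is the example PMID mentioned just below.)

    import requests

    # Fetch the eSummary record for a single PMID and print a couple of
    # fields from it. Inspect the raw JSON if these field names ever change.
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
        params={"db": "pubmed", "id": "24792655", "retmode": "json"},
    )
    summary = resp.json()["result"]["24792655"]
    print(summary.get("title"), summary.get("pubdate"))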

This will return a lovely json summary of this abstract. Technically, you can get more than one id at a time, by separating them with commas like so…,24792655

This summary is great, but it will not get the abstracts, if and when they are available. (It will tell you if there is an abstract available, however…) In order to get the abstracts you need to use the eFetch API.

Unlike the other APIs, there is no json retmode; the default is XML, but you can get plaintext using retmode=text. So if you want structured data here, you must use XML. Why? Because. That's why. This API will take a comma-separated id list too, but I cannot see how to separate the plaintext results easily, so if you are using the plaintext (which is fine for my current purposes) it is better to call it one id at a time.
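For completeness, a sketch of an eFetch call for a single plaintext abstract (efetch.fcgi with rettype=abstract and retmode=text; the PMID is just the example id from above):

    import requests

    # Pull the plaintext citation + abstract for one article with eFetch.
    # Drop retmode=text (or use retmode=xml) to get structured XML instead.
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
        params={"db": "pubmed", "id": "24792655",
                "rettype": "abstract", "retmode": "text"},
    )
    print(resp.text)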