Data sharing is always good, right? Well, not quite…

Rare is the occasion when I disagree significantly with my collaborator Steve Novella, but this is one of those times. It’s a measure of how much we agree on most things that, even in this case, I don’t completely disagree with him. But, hey, it happens. I’m referring to Steve’s post yesterday in which he gushed over the new policy at PLoS (Public Library of Science) regarding articles published in their journals. (Steve rarely gushes.) Here’s the policy:

In an effort to increase access to this data, we are now revising our data-sharing policy for all PLOS journals: authors must make all data publicly available, without restriction, immediately upon publication of the article. Beginning March 3rd, 2014, all authors who submit to a PLOS journal will be asked to provide a Data Availability Statement, describing where and how others can access each dataset that underlies the findings. This Data Availability Statement will be published on the first page of each article.

You’ll note that PLoS has significantly revised its announcement since Steve’s post. In any case, Steve argued that this is a “fabulous idea,” pointing out how such a policy could help with challenges such as “publication bias, the literature being flooded with preliminary or low quality research, researchers exploiting degrees of freedom (also referred to as ‘p-hacking’) without their questionable behavior being apparent in the final published paper, conflicts of interest, the relative lack of replications and lack of desire on the part of editors to publish replications, frequent statistical errors and the occasional deliberate fraud.” He had a point, but a lively discussion broke out in the comments that, I think, surprised Steve. Not everyone, myself included, was quite as enthusiastic about this new policy as Steve was. This was likely due to a combination of factors, including the vagueness of the PLoS policy, concerns about protecting research subject confidentiality for human subjects research, and the impracticality of following the policy for some types of experiments. These problems were obvious to commenters who actually run labs and do research (like me), less so to those who do not. Indeed, I noted a rough negative correlation between the level of enthusiasm for this policy and the amount of experience doing actual research.

I think I can suggest why this is by “cherry picking” a couple of problematic parts of the policy. For example, the policy defines the data that must be shared as follows:

PLOS defines the “minimal dataset” to consist of the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. Core descriptive data, methods, and study results should be included within the main paper, regardless of data deposition. PLOS does not accept references to “data not shown”. Authors who have datasets too large for sharing via repositories or uploaded files should contact the relevant journal for advice.

First off, let me note that I agree that, in these days of supplemental data sections posted online, “data not shown” is no longer acceptable in a research paper. There is pretty much no reason I can think of that any “data not shown” notation couldn’t be changed to “see supplemental data section,” with the data that formerly wasn’t shown being deposited right there. The whole “data not shown” thing is a holdover from the days before all scientific papers were made available online as PDFs and full text documents, when space limitations required that the number of figures and the amount of text be limited. Data not considered essential to the findings reported in the scientific paper could be described as “data not shown.” It’s not as bad as it sounds. Usually, the data described as “not shown” was seen by the reviewers, as it was usually included with the manuscript. It just wasn’t published.

But what is the “minimal dataset”? This is not a trivial question. I’ve published papers in which experiments have been done multiple times over several years, starting out with preliminary experiments repeated over and over again to work out the bugs and get the methods to work reproducibly, followed by the “real” experiments, the ones that ultimately end up in the final manuscript. Do I include all those messy, preliminary experiments? What about basic molecular biology studies? Is PLoS going to require, for instance, that the original, uncropped autoradiographs be included in the supplements? (Yes, we still do autoradiographs and use film in our labs to detect bands on gels through chemiluminescence.) Original lab notebook analyses, either copies or transcribed to print? Or the step-by-step analysis of data, which in some cases can be many, many pages long? Data transparency is great in concept, but when you start considering the nuts and bolts of what, exactly, data transparency means, it gets very, very messy very quickly. As was pointed out in the comments, the policy as currently written is so vague as to be almost completely unenforceable, which is why it’ll be really interesting to see what gets dumped in those supplemental data sections.

Unfortunately, PLoS’s “clarification” does anything but clarify:

This does not mean that authors must submit all data collected as part of the research, but that they must provide the data that are relevant to the specific analysis presented in the paper. The ‘minimal dataset’ does not mean, for example, all data collected in the course of research, or all raw image files, or early iterations of a simulation or model before the final model was developed. We continue to request that the authors provide the “data underlying the findings described in their manuscript”. Precisely what form those data take will depend on the norms of the field and the requests of reviewers and editors, but the type and format of data being requested will continue to be the type and format PLOS has always required.

Ah, perhaps I should breathe more easily. We don’t have to make our “early iterations” of a model available. On second thought: Define “early iteration.” When does an iteration cease to be an “early” iteration? If the only data necessary are the direct data used in the paper, then why bother with this policy to begin with?

One issue that was brought up, although it probably isn’t a huge consideration, is that some datasets are too large to share easily. Genomics data, for instance, can easily run to many terabytes. Public databases already exist into which such data can be deposited, which would clearly satisfy the PLoS policy. However, other types of data lack such public databases. One reader described how the raw data from a single color channel of a single super-resolution microscopy image takes up 8 GB, meaning that each standard image takes up 20-24 GB. If each experiment involves taking photos of large numbers of cells, say 30 or so, then one experiment containing only the negative control and the test condition can easily reach 1 TB of data. One imaging data set described in the comments came to 10 TB, which cost over $1,000 to store on a RAID. In that case, will a statement that the researcher will share the original data suffice?
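For anyone who wants to check the arithmetic in the microscopy example above, here is a rough back-of-the-envelope sketch in Python. The 20-24 GB per image, the 30 or so cells, and the two conditions come from the reader's description; treating each cell as one image per condition and 1 TB as 1,000 GB are simplifying assumptions, and the function name is purely illustrative.

```python
# Back-of-the-envelope storage estimate for one imaging experiment.
# Assumptions (not from the original comment): one image per cell per
# condition, and decimal units (1 TB = 1,000 GB).
GB_PER_TB = 1000


def experiment_size_gb(gb_per_image, images_per_condition, conditions):
    """Total raw data for one experiment, in GB."""
    return gb_per_image * images_per_condition * conditions


# 30 cells imaged in each of two conditions (negative control and test).
low = experiment_size_gb(20, 30, 2)   # lower bound of 20 GB per image
high = experiment_size_gb(24, 30, 2)  # upper bound of 24 GB per image

print(f"{low / GB_PER_TB:.1f} to {high / GB_PER_TB:.1f} TB per experiment")
```

Under those assumptions, a single two-condition experiment lands somewhat above the 1 TB mark, which is consistent with the reader's "can easily reach 1 TB" estimate.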

My guess is that fewer investigators are going to want to submit their work to PLoS journals. Indeed, I’ve just been through the process of submitting two manuscripts to a PLoS journal, PLoS One. I just had one paper published by PLoS One, with another one in the can to be published next month. It was a big enough pain in the rear to submit to PLoS to begin with, not even counting the $1,300 or so per manuscript in page charges due to the journal being open access. If I wasn’t sure I would be doing it again before this announcement, now I really don’t know if I will do it again, given the extra time it will take to make sure the data are available to the satisfaction of PLoS. I already am fine about providing raw data to an investigator who requests it.

Another issue I noticed was this:

For studies involving human participants, data must be handled so as to not compromise study participants’ privacy. PLOS recommends that researchers follow established guidance and applicable local laws in ensuring they do not compromise participant privacy. Resources which researchers may consult for guidance include:

US National Institutes of Health: Protecting the Rights and Privacy of Human Subjects

Canadian Institutes of Health Research Best Practices for Protecting Privacy in Health Research

UK Data Archive: Anonymisation Overview

Australian National Data Service: Ethics, Consent and Data Sharing

Steps necessary to protect privacy may include de-identification, blocking portions of the database, or license agreements directed specifically at privacy concerns. Authors should indicate, as part of the ethics statement, the ways in which the study participants’ privacy was preserved. If license agreements apply, authors should note the process necessary for other researchers to obtain a license.

This policy is a bit naive. De-identifying the data is not guaranteed to adequately protect the identities of clinical trial subjects, at least from hospital staff and others who might deal with them, or from friends, family, or acquaintances who might put together measurements and dates to figure out which subject is who. While that might seem harmless, it would nonetheless be a violation of HIPAA privacy regulations, which do not allow exceptions for curious family members or hospital staff. Yes, the chances of this happening are low, but when data are available to anyone (i.e., are public) the chances of this happening can’t be ignored. And, yes, there are examples of successful anonymization of data for sharing data sets, but, as is noted here, it is “time consuming and therefore costly.” It would require that clinical trials be designed from their very inception with data sharing in mind, with the informed consent that patients sign mentioning that the data will be shared. This has the potential to be a good thing in principle, but again the devil is in the details. The problem is also financial. Funding sources already barely provide enough funding to do this research, and often they do not, at least not completely. In the absence of increased funding to do this, it’s a burden on researchers.

All of which is probably why PLoS backtracked:

Like some other types of data, it is often not ethical or legal to share patient data universally, so we provide guidance on the routes available to authors of such data, and we encourage anyone with concerns of this type to contact the journal they would like to submit to, or the data team at [email protected]

But the original policy formulated doesn’t give me a great deal of confidence that PLoS knows what it’s doing with respect to clinical trials confidentiality.

I don’t know if I completely agree with the ever-irascible DrugMonkey (one of the few researchers I’ve encountered whose tendency towards Insolence approaches my own), when he referred to the new PLoS policy as “letting the inmates run the asylum” and “whackaloonery,” but he does make some good points. He prefaces his complaints by discussing how he thinks PLoS, through its policies on animals, basically tries to sidestep the local IACUC (the ethics committee that approves animal research), and that complaint rings somewhat true. I remember that the statements about animal research that PLoS made me sign did make me wonder whether approval of my animal experiments by my university’s IACUC was going to be adequate. He also makes a legitimate point about “self-plagiarism” with respect to methods sections. Personally, like many scientists, I, too, recycle huge swaths of my methods sections, because the methods for each technique are the same. Only the reagents, DNA constructs, and specific drugs and doses vary. The overall assays and techniques tend to be the same or very similar. It just doesn’t make sense to have to rewrite them every time.

Those complaints, however, have nothing to do with the current question about data “openness.” DrugMonkey’s first complaint is this:

The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost of personnel time, to address the legitimate sins of the few. The scope of the problem hasn’t even been proven to be significant and we are ALL supposed to devote a lot more of our precious personnel time to data curation. Need I mention that research funds are tight and that personnel time is the most significant cost?

I know tight funding. There are only two people in my lab now, me and my lab manager.

The one that resonates with me is this one, which I came up with independently, albeit with a different emphasis, as you might imagine given the usual blogging topics I take on:

Fourth problem- I grasp that actual fraud and misleading presentation of data happens. But I also recognize, as the waccaloons do not, that there is a LOT of legitimate difference of opinion on data handling, even within a very old and well established methodological tradition. I also see a lot of will on the part of science denialists to pretend that science is something it cannot be in their nitpicking of the data. There will be efforts to say that the way lab X deals with their, e.g., fear conditioning trials, is not acceptable and they MUST do it the way lab Y does it. Keep in mind that this is never going to be single labs but rather clusters of lab methods traditions. So we’ll have PLoS inserting itself in the role of how experiments are to be conducted and interpreted! That’s fine for post-publication review but to use that as a gatekeeper before publication? Really PLoS ONE? Do you see how this is exactly like preventing publication because two of your three reviewers argue that it is not impactful enough?

DM is referring to the mission of a PLoS journal, PLoS One, which is to be different from other journals in that it will publish any well-conducted science without any assessment of whether the results are “important” or not; in other words, to be a repository for all science. I thought of DM’s concern from a different angle, one that occurred to me because of my usual blogging topics. (Actually, I’m not sure PLoS’s editors thought through the consequences of this all that well.) The cranks, quacks, and antivaccinationists will have a field day with this. They already do their damnedest to get the original datasets for various studies they don’t like, the better to “analyze” them the way they want or to find flaws in them. You could argue that knowing that anyone, including cranks, can see their original data will motivate scientists to hold their publications to a higher standard. Maybe so in some cases. In fact, probably so in some cases. However, what’s more likely to happen in most cases is that scientists in controversial fields frequently attacked by cranks just won’t publish in PLoS journals anymore because, however rigorous their analyses are, they’ll have to put up with the hassle of cranks “re-analyzing” their data to discredit them. Most scientists care far more about what other scientists think about them than what cranks do, but on the other hand it’s understandable not to want the hassle of dealing with, say, antivaccinationists. I know I wouldn’t if I did vaccine research.

A willingness to share data is, without a doubt, one of the highest ideals of science. However, it isn’t as simple as a journal (or journals) mandating it. One can argue that one of the bigger flaws in science as it is practiced now is that it lacks the infrastructure and agreed-upon methodologies to make sharing easy and expected. To the extent that PLoS has started the conversation on how to work towards this goal, I’m with its editors. However, right now the effort strikes me as half-baked. I want to get behind it, but I feel that PLoS is using a blunt instrument that has not been that well thought out.

Sorry, Steve. We can’t always agree, at least not completely.