Metta, Mudita and Musings : To share or not to share?

Over at Academic Jungle, GMP has a controversial post on sharing code. A scientist requested some code that was used in a published manuscript. She writes in her post:

"I actually replied politely, thanking him/her for the interest and stating that regrettably I cannot share the code. My group develops detailed microscopic simulations of certain physical phenomena. These codes can have wonderful predictive power and take years to develop. Sharing the codes is absolutely not the norm in my field, and there is no way in hell I would share any of my research codes with anyone other than close collaborators and colleagues."

Most of the commenters disagree completely with GMP, and I want to add that I believe it is utterly wrong to deny anyone's request for published data if the science is publicly funded.

But before jumping on GMP and calling this "despicable behavior," I think it's important to delve a little more into this issue.

GMP would not be the first researcher to refuse to share. According to Heather Piwowar at Research Remix, almost 40% of scientists are not willing to release data. Why? It could be an outright refusal, as in GMP's case, or it could be because researchers are "not able to retrieve the data."

How many of you who are now well past your graduate work could provide data from a paper in your postdoc or PhD?

I experienced this problem when I wrote to request data from someone who completed their PhD in the early 90s. The person responded that they no longer had a copy of the data, but that they could send summary tables from their dissertation. I wanted the original data because I felt that the analysis used at the time was incorrect, and summary tables would simply recapitulate that analysis.

GMP's post indirectly raises the question: do we share data?

First, I define data to include, but not be limited to, cell lines, evolved critters, genetic data, morphological measurements, images, maps, character matrices, seeds, and maybe (with some qualifications) computer code.

I am firmly in the camp of yes, we share data. If it's published, it's not yours anymore. (Not that I really think it ever was...but that's another blogpost.)

In an editorial in the journal Evolution, Rausher et al. (2010) outline the reasons for data sharing. As they highlight, much of our progress in understanding the natural world depends on building upon previously collected data. But too often the original data (not summary stats) are lost because the researcher moved too many times, or because technology advanced and the data became irretrievable. Who remembers floppy discs or zip discs? I have stuff on a floppy that I can't access. Not that it matters, but in 10-20 years, when whatever we're using now becomes useless, how will anyone be able to access that data? An online database would preserve our scientific history.

Researchers refuse to share data for many reasons. Probably chief among them is that they want to publish more papers from the data, or want exclusive use of it. But let's say that decades later, after the faculty member retires, a young upstart researcher wants to conduct a meta-analysis on the original data. A common data repository would make this young person's life a lot easier. And as Piwowar states in her slideshow called Research into Open Research Data, not sharing data hurts mainly the young. She cites a survey of doctoral students and postdocs, saying that 28-50% report a negative impact from data withholding on progress, discovery and the quality of education.

Rausher et al. (2010) also suggest that data that is archived and available is more likely to be useful and to be cited more by other scientists. We only have to look to GenBank to see that this repository has afforded many different researchers a chance to use data collected by others to ask very different questions.

A final reason that we should share our scientific data is accountability. Data that is made accessible means that it can be checked for errors. Good data means good science and progress. According to Piwowar, more than half of all papers contain errors, but only 5-10% contain errors that change conclusions. Sigh of relief.

Recently, several journal editors within the ecology and evolution subdiscipline of biology came together to form Dryad.

"Dryad is an international online repository of data underlying peer-reviewed articles in the basic and applied biosciences. Dryad enables scientists to validate published findings, explore new analysis methodologies, repurpose data for research questions unanticipated by the original authors, and perform synthetic studies. Dryad is governed by a consortium of journals that collaboratively promote data archiving and ensure the sustainability of the repository."

Already, journals such as The American Naturalist (American Society of Naturalists), Evolution (Society for the Study of Evolution), Molecular Ecology, and Journal of Evolutionary Biology (European Society for Evolutionary Biology) have become interim partners with Dryad. These journals will be introducing a new data-archiving policy stating that, as a condition for publication, data used in the paper should be archived in an appropriate public database. The policy does, however, allow for an embargo period after publication and for longer periods of restriction at the discretion of the editor.

So what about computer code?

Unfortunately, Dryad is not set up to accept computer code. My programming friends tell me that the main reason coding geeks are reluctant to share code is the licensing problem. Freely available code means that anyone can take your code and use it to make a commercial program that they could then sell for cash. A second reason, I'm told, is that computer code is dynamic and changing, unlike data that has been collected. For computer nerds, it's important that code change and evolve because that makes it better. And really, when I think about it, this is true in my field. But some of that evolution comes because the code is made public. Feedback from users is essential to finding the bugs or problems in a given piece of code.

In a recent article in Nature, Nick Barnes, a software engineer, points out that turning raw data into published papers requires a little programming, meaning that scientists write software. He concedes that yes, this software often isn't very good because of poor commenting, weird variable names and a lack of indentation. But it works. And if it works, it should be made accessible.

But I would ask, all of it? Some of my R code is just a one-off and it isn't really necessary that I archive it. I do, however, have code that is associated with a particular dataset and could be useful to future users. And I would like to deposit my code somewhere and ensure that it was associated with that particular dataset.

Luckily for us in biology, things are improving: we have places where we can access shared code. Often the code is located on websites run by individual researchers, but this means that it could be essentially anywhere on the web. What is needed is a single resource where users can find and share code. Well, a new place called The Molecular Ecologist, associated with the journal Molecular Ecology Resources, will provide a blog that highlights important papers in the field, a list of computer programs and other code (e.g. R packages) useful for analyzing genetic data, and a site to discuss methods.

I know it's a lot of work getting your "data" ready for submission - there's the formatting, the organizing, and the worry about mistakes being found or the data being misinterpreted.

But in the end, I feel better for having submitted data to a public repository. After all that blood, sweat and tears, I don't really want my hard work to end up in a black hole.

Photo taken from Website: Astronomy Picture of the Day

Other Sources: Rausher MD, McPeek MA, Moore AJ, Rieseberg L, Whitlock MC. 2010. Data archiving. Evolution 64(3): 603-604.

October 26, 2010
