May I have a peek? Reuse of publicly available biomedical data is increasing and producing some great successes


Evolving scientific culture and policy are directing more and more biomedical data into public data repositories. A number of the recent data sharing policy changes from funders and publishers were designed to improve data management and data sharing practices with the end goal of making it easy for researchers to discover and reuse biomedical datasets. However, not all researchers greet these policy changes with open arms, often questioning the proposed value and likelihood of their research data being reused.

This issue came to a head in early 2014 when the non-profit publisher PLOS (Public Library of Science) updated its data sharing policy to require that authors provide a statement describing where the data would be deposited. The policy essentially went from “tell us how you plan to share your data” to “show us where the data is.” Many researchers objected to this new data policy. One frequently cited concern was that researchers would bear a tremendous cost (in time to describe and deposit their data, and in the risk of getting scooped by other researchers publishing on their data) while receiving little benefit. In addition, they expressed doubt about the likelihood of their data being productively reused. For an excellent summary of the controversy that arose from PLOS’s data sharing policy, along with a level-headed response, please take a look at this blog post from the California Digital Library’s blog DataPub.

Widespread and productive use of public datasets

So, what’s the deal? Are most research datasets so unique and complex that no one would bother trying to find them, let alone mine them for new insights? Who is reusing biomedical data? Well, it turns out, lots of people, both inside and outside of biomedical research institutions.

You don’t need to look any further than Atul Butte, MD, PhD, who will lead UCSF’s Institute for Computational Biology starting in April. Dr. Butte and his research lab provide us with great examples of the benefits of open data, and of the value and novel discovery that can come from reanalysis of open biomedical big data. At last year’s International Digital Curation Center Conference, Dr. Butte gave a fantastic presentation outlining some of his lab’s data reuse strategies and results. The title of his talk, “Big Data in Biomedicine: Discovering new drugs and diagnostics from a trillion points of data,” could be amended to “a trillion points of public data.” Butte and colleagues used reanalysis of publicly available data to identify novel disease biomarkers and facilitate drug repositioning. They also launched two start-up biotech companies and the public immunology data resource ImmPort.

But you don’t have to be an academic researcher to gain value out of public biomedical data. Amateur scientists and high school science fair participants are also jumping on the data reuse bandwagon, using publicly available datasets to make novel discoveries, such as developing algorithms for cancer diagnosis.

Giving and getting credit where it’s due

So we’ve established that concerns that publicly available data won’t be widely or successfully reused are overblown. What about the oft-voiced concern that the original data provider won’t get any credit for that data reuse? Well, we do still have some work to do to ensure that data citation is easy and becomes a cultural norm (and therefore provides researchers with more incentives for releasing their data to public repositories), but some things are falling into place. Many public data repositories utilize the Creative Commons CC-BY Attribution license to ensure that data depositors get proper credit when their dataset is reused. Other repositories rely on Data Use Agreements to establish a contract between a data depositor and a data consumer, with stipulations on proper citation if data reuse results in a new publication.

In fact, citation of datasets is increasing and has become the norm for some data types, such as genomic datasets from the Gene Expression Omnibus. One way to track data citation activity is through the Data Citation Index, a great resource provided by the UCSF Library, which enables you to search for datasets by subject area, institution and data type, as well as identify the datasets that have been cited the most in original research articles.

For example, searching on “asthma” in the Data Citation Index reveals more than 3,600 publicly available data records. These datasets encompass many different types of data, including gene expression, imaging, protein sequence and structure, and survey data. Sorting these datasets by the number of times they have been cited in other publications reveals that the most highly cited asthma datasets are public health datasets from national health surveys and social science datasets. The next most frequently cited data types are molecular in nature, such as gene expression and protein structure. The Data Citation Index does not yet index all public data repositories, but it is a good start for tracking the impact of public datasets.

As standards for citing datasets are adopted and more tools like the Data Citation Index emerge, many hope that data citations will be included in the metrics used to measure the impact of a researcher’s output. Giving publicly shared data the same respect and impact tracking as more traditional measures of academic output, such as publications, would provide greater incentive for researchers to make their research datasets publicly available. Currently, policy changes and mandates from funders and publishers are predominantly driving the shift toward depositing data publicly; but as data reuse cases that demonstrate the value of these datasets grow more abundant, benefits to the depositing researchers could become an even stronger driver.