Science Publishing: 'Tis the Season for Sharing Your Data

Thursday, December 4, 2014

Welcome to the second installment of the UCSF Library column on scientific publishing (that means data too!). Last month, Anneliese Taylor provided a great introduction to open-access publishing. Today, I’ll be focusing on the natural extension of open-access publishing: the open data movement, broadly defined as the effort to make research data more publicly accessible.

Attitudes and policies are shifting dramatically in favor of open data, as funders, publishers and researchers alike acknowledge that the benefits data sharing brings to science and biomedical discovery far outweigh the risks. I’ll provide a basic introduction to data sharing benefits and policies, and the resources available here at UCSF to support data sharing.

So, why share your data? Well, it’s good for science.

Data sharing supports data reuse, which can accelerate the pace of scientific discovery. From a funder’s perspective, data reuse increases the impact of their investment. Reanalysis of publicly available data helps confirm original results and helps researchers gain confidence in their novel discoveries. Publicly available datasets help train the next generation of researchers by enabling them to get their feet wet in an experimental or data analysis method that may be new to them. Public datasets are also key assets in the development of novel data analysis algorithms and software. Open data sharing also supports research reproducibility and discourages fraud.

In the field of genomics, basic and translational researchers are reaping huge benefits from the reuse of publicly available datasets, including the discovery of novel disease biomarkers and drug targets and the identification of genes mechanistically associated with a particular phenotype. For example, a recent Cell publication, co-authored by several UCSF researchers, reported that reanalysis of publicly available datasets from The Cancer Genome Atlas resulted in identification of novel genomic signatures for cancer diagnosis (Cell. 2014. 158:929–944).

Making data publicly available helps facilitate collaboration, and there is growing evidence that research articles that provide public access to the underlying data are cited at a higher rate than those that do not (PeerJ 1: e175, http://dx.doi.org/10.7717/peerj.175). Datasets and dataset citations are also being tracked by tools such as the Data Citation Index (Thomson Reuters).

Why share data? Well, sometimes you have to.

Funders and publishers alike, motivated by many of the benefits highlighted above, are aligning their data sharing policies in favor of more open access to research datasets.

NIH policy reflects this by directing that any grant with more than $500,000 in direct costs in a single year must include a data sharing plan. In the past, these data sharing plans did not come under much scrutiny, but attitudes have shifted, and the quality and execution of those data sharing plans have begun to impact the grant review and renewal process. The NIH has also recently updated its Genomic Data Sharing Policy (http://gds.nih.gov/03policy2.html) to strike a balance between encouraging broad data sharing and protecting patient confidentiality.

Changes in publishers’ data sharing policies are having an even stronger and more immediate effect, and are driving more and more data into the public arena. For example, when PLOS released its new “show us the data” data sharing policy in early 2014, there were swift and strong reactions from the research community, both in favor of and in opposition to the change. The controversy surrounding this policy change directly reflects the immediate effect publisher data sharing policy has on researchers.

In spite of some voices in opposition to data sharing, publishers remain firm in their movement toward more open data. Nature’s data sharing policy is typical of this trend. It states, “A condition of publication in a Nature journal is that authors are required to make materials, data, code and associated protocols promptly available to readers without undue qualifications. Supporting data must be made available to editors and peer-reviewers at the time of submission for the purposes of evaluating the manuscript. The preferred way to share large data sets is via public repositories.”

All signs from funders and publishers point toward more open data in the future.

So – how do I make my data publicly available?

There are a number of open data repositories available to help meet researchers’ data sharing needs. There are field-specific repositories such as the Neuroscience Information Framework that accept different types of datasets, as long as they are related to the field. There are data-type-specific repositories such as the Gene Expression Omnibus that accept datasets from many different fields, as long as they are generated from a gene expression platform. And there are catch-all repositories that accept almost any type of data—a good solution for datasets that don’t have a “natural” home.

One example of such a catch-all repository is our own UCSF DataShare (datashare.ucsf.edu), which is available to all researchers at UCSF. DataShare enables researchers to upload, describe and submit data using data sharing best practices.

How do I figure out which data repository to use?

Often a publisher or funder policy will mandate that certain types of data go to particular repositories; for example, gene expression data go to the Gene Expression Omnibus. If this is not the case, databib.org and re3data.org are great resources for finding the right place to deposit your data. Generally speaking, look for a data repository that follows best practices in data sharing and data management. Make sure the repository issues a DOI (digital object identifier) for your dataset to support data citation. The data repository should also follow metadata standards for describing data in a way that supports discovery and reuse.

I encourage you to take advantage of one of our upcoming workshops or pop-up events on data sharing, data management, and data reuse. Check the library’s class schedule (calendars.library.ucsf.edu/classes) or contact me at megan.laurance@ucsf.edu. I would be happy to help with any data sharing needs or questions you have.