Building a Path to Open Data at UCSF and Beyond


Open Data Day is coming up, and people from around the world are gathering in support of Open Science and Open Data. Researchers today face a number of challenges regarding the accessibility and reproducibility of data, and consistent engagement on these topics is paramount. Here at UCSF, a growing coalition of students, post-docs, faculty and staff has come together in support of Open Science in response to the changing scholarly communication landscape. The Open Science Group has hosted a range of events in support of transparency, reproducibility, and openness in research. These include a discussion on protocol publishing, a forum on open scholarly communication initiatives at the University of California, pre-print peer review sessions, and an OpenCon 2018 live stream. For more information about the group and future events, find the Open Data Day table in the Kalmanovitz Library on Monday, March 4, 2019, from 10 AM to 1 PM.

Biomedical research has seen unprecedented advances in the sensitivity and depth of data collection in the last decade. Some of the major challenges that come with this shift are: (1) how to analyze large multidimensional data sets, (2) how to store these complex data sets, and (3) how to integrate note taking and communication of research data sets into the workflow of a researcher. At UCSF and beyond, scientific communities are working to create new infrastructure to address these challenges.

First, the UCSF Library Data Science Initiative (DSI) has made progress on the challenge of data analysis. Computational analysis techniques and bioinformatics are becoming an essential part of training effective researchers, and the DSI provides a number of useful services to UCSF researchers. For those interested in boosting their training in bioinformatics, programming, and statistics, the DSI teaches introductory Python, SQL, and R for basic statistical analysis, as well as bioinformatics topics such as RNA-seq analysis. For ad-hoc programming help, there are meetup sessions where researchers can get expert and peer support. The DSI is also working with partners across campus to develop training for researchers in managing research data for security and reproducibility.

Second, public or open access data repositories provide some of the best avenues for storing medium and large data sets. Well known among biomedical researchers are the Gene Expression Omnibus database for next-generation sequencing data through the National Center for Biotechnology Information, the Encyclopedia of DNA Elements (ENCODE) for cellular genomics and epigenetics data, and the Protein Data Bank for three-dimensional protein structural data. Other data publishing and micropublishing services have emerged over the last several years. Researchers using model organisms can turn to WormBase (C. elegans), FlyBase (D. melanogaster), TheCellMap (S. cerevisiae), ZFIN (zebrafish) and EcoCyc (E. coli) to obtain or share public data. More general services include μP (microPublication Biology) and the University of California Dash platform, which accepts data from any field of research. Other major initiatives show promise, such as the Human Cell Atlas for cell systems research data and the Allen Institute Project for human neurological systems data. Data storage and publication services allow researchers to easily share and reuse data, reducing waste and increasing efficiency in research.

Third, online lab notebook platforms integrate public communication of protocols and data sets into researcher workflows. A variety of online lab notebook options are available to support collaboration and efficient note taking. One example is the F1000 Workspace, which is used to take notes, manage references and publish research projects, and is currently free for UCSF researchers. For computational research applications, Jupyter notebooks, R Markdown, and GitHub provide platforms for annotating and sharing code. Protocols.io provides a platform for managing and citing biomedical research protocols. These notebooks are supplemental to the journal article: they do not integrate research data with analysis or place the experiment in the context of its field of research. Ultimately, that means that more emphasis should be placed on open access publishing.

The gold standard for research is to undergo peer review in a renowned journal. However, there are several issues with the current model of scholarly publishing. Journals tend to favor publication of positive results, and negative data sets are often not communicated. Some researchers believe that a large proportion of research effort is wasted because journals fail to publish negative results, which feeds into the crisis of reproducibility. Pre-print services such as bioRxiv accelerate research by lowering the barrier to publication, and in the majority of cases, pre-prints still go on to be published in peer-reviewed journals. Furthermore, many micropublication services are peer-reviewed or curated and hold their publications to standards similar to those of journals. Hiring and tenure committees should recognize researchers for using pre-print and micropublication platforms, and enabling reference to these publications can increase the efficiency of research communication.

To assist researchers in this growing space, the UCSF Open Science Group and the Data Science Initiative are working together to create a summer workshop series that will cover the reproducibility and accessibility of scholarly content. The goal is to help researchers improve their research workflows, enable reuse of data, improve reproducibility and discuss incentives for open research communication. To get engaged with the group, check out our Wiki page! In the spirit of Open Data Day, let us make our data available!