October 13, 2016

Data Citation: The Old Rules Don’t Apply
James Frew and colleagues suggest how to bring academic citation into the database age

Solving today’s environmental problems involves vast amounts of data, which have to be gathered, stored, retrieved, analyzed, and, increasingly, cited in academic journals. That presents problems.

In the cover article of the September issue of Communications of the ACM, Bren associate professor James Frew and his co-authors, Peter Buneman and Susan Davidson from the University of Edinburgh and the University of Pennsylvania, respectively, describe the problem and how they might solve it. (Read the paper.)

“For purposes of honesty and reproducibility, academic publishers are very rapidly moving toward requiring those who publish an article to also publish the data backing it up,” says Frew, an expert on data storage and provenance. “It’s happening now and is going to affect everybody.”

Citations have the important role of directing readers to supporting information and giving credit where it is due. While different fields and journals have their own specific citation rules, they are largely variations on a simple, universally accepted standard. That system has worked fine for decades, as long as cited materials were fixed, unchanging objects like books or articles, but it doesn’t transfer to data.

Increasingly, scientific data is stored in large databases that have incredibly complex structures and are accessible via the web. And while some databases, like those containing election results, are static, others, which may contain yearly demographic data or climatological data from satellites, grow and change over time. At places like the UCSB National Center for Ecological Analysis and Synthesis (NCEAS), working groups create huge data sets by combining smaller sets from multiple researchers. When that data is cited, both the database and the person who originally gathered the data should be included in the citation.

Currently, however, even if scholars want to cite the sources of data they use, they may not be able to, because there is no standard tool for generating database citations. Frew explains the result: “We get one of two extremes in database citation. Either we get a citation to the complete database package or to a piece of information where the citation is so granular it cannot be connected back to the original data set.” In some cases, there is no citation at all.

In the article, titled “Why Data Citation is a Computational Problem,” the authors describe a system built into the database that would automatically generate a citation in a standardized format whenever data is extracted from it.

The authors suggest that by using the same computing power that makes databases possible, database citations can be made more specific while also accurately accounting for all data authors. Frew sees this as a responsibility that will fall to database managers, who would need to take three steps: 1) define the various ways their data can be queried or “viewed,” 2) create citation templates for the standard set of views, and 3) provide a computational mechanism to allow researchers to generate citations for specific queries. The authors outline a solution and show its versatility by applying it to two different scientific databases that Frew describes as being “radically different in both their structure and how they should be cited.”
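
To make the three steps concrete, here is a minimal Python sketch. It is an illustration only, not the system described in the paper: the view name, the template wording, the sample database, and the generate_citation function are all hypothetical, chosen to show how a view, a citation template, and a citation-generating mechanism could fit together.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class View:
    """A supported way of querying the database, paired with a citation template."""
    name: str
    template: Template


# Steps 1 and 2 (hypothetical example): the database manager declares a view
# and writes the citation template that goes with it.
VIEWS = {
    "station_year": View(
        name="station_year",
        template=Template(
            "$contributors. $database, version $version. "
            "Station $station, year $year. Accessed $accessed."
        ),
    ),
}


def generate_citation(view_name, params, contributors, database, version, accessed):
    """Step 3: build a citation for one specific query against a view.

    `contributors` lists the people who gathered the rows the query returned;
    duplicates are dropped so each person is credited once.
    """
    view = VIEWS[view_name]
    names = "; ".join(sorted(set(contributors)))
    return view.template.substitute(
        contributors=names,
        database=database,
        version=version,
        accessed=accessed,
        **params,
    )


# Usage with made-up values: one query, two distinct data contributors.
citation = generate_citation(
    "station_year",
    {"station": "S1", "year": 2015},
    ["A. Lee", "B. Cho", "A. Lee"],
    database="Example Climate DB",
    version="2.1",
    accessed="2016-10-13",
)
print(citation)
```

In this sketch the citation names both the database and the people who gathered the queried data, and it is specific to the query rather than to the database as a whole. A real system would derive the contributor list from the data itself rather than take it as an argument.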

Frew hopes that their suggestions will lay a foundation for expanding the kinds of citations available to the academic world, and that data citation information can be vastly improved by combining computational power with the foresight of database managers.

“My hope is that the article’s suggestions for automating citations will encourage managers to implement similar systems and make it easier for those using the data to cite it appropriately,” he says.