FAIR principles, formats, and reproducibility

FAIR principles

Figure: FAIR data principles

© Sangya Pundir

For LAMASUS we are obliged to abide by FAIR principles for research data management. Please ensure that you are familiar with the FAIR principles as listed on the linked website. Bookmark it for reference.

Managing your data such that it becomes Findable, Accessible, Interoperable, and Reusable (FAIR) can involve a considerable separate effort. By combining the workflows detailed earlier with other good practices while keeping FAIR in mind, you can create a custom data management workflow that accommodates several FAIR concerns:

  • Managing metadata and data together using Git keeps both co-located and in sync.

  • Publishing on Zenodo makes your data (and your code!) findable, provides a unique persistent identifier (F3), and registers metadata (F4).

  • Releasing on GitHub or GitLab improves accessibility, in particular by allowing for robust automated retrieval.

Moreover, using a hosted Git repository automatically complies with or supports aspects of FAIR such as:

In support of FAIR, prefer standard data formats and metadata formats that:

  • Are commonly used in the specific field of research.

  • Are an open standard.

  • Can be manipulated with open-source tools.

  • Are machine-readable.

For example, NetCDF is a good choice, whereas Excel is problematic for many reasons, one being the errors introduced when its autocorrect silently alters entered values.

Text formats

Text files are human-readable and hence accessible. Prefer a text format that is also machine-readable, such as CSV, JSON, or YAML, when storing relatively small data for which the inefficiency of text-based formats matters little [1]. Ensure that the text flows across multiple lines [2] for readability and to enable Git to assess, merge, and display line-based differences when making changes. This makes such data very easy to manage.
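As a minimal sketch of what flowing text across multiple lines looks like in practice, the following uses Python's standard json module. The record and its keys are made up for illustration only.

```python
import json

# Hypothetical example record; keys and values are invented for illustration.
record = {"region": "AT", "year": 2020, "crops": ["wheat", "maize"]}

# Without an indent, the whole document ends up on one long line,
# so any change shows up in Git as a change to that single line.
compact = json.dumps(record)

# indent=2 puts each entry on its own line; sort_keys=True keeps the key
# order stable between runs so that diffs only show real changes.
pretty = json.dumps(record, indent=2, sort_keys=True)

print(compact)
print(pretty)
```

Both strings parse back to the same data; only the layout differs, and the indented form is the one that works well with line-based diffs.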

Note

When using a text-based data format, pay attention to the character encoding. For maximum accessibility and reusability, pick a Unicode encoding, preferably UTF-8, or limit the text to the ASCII subset shared by many character encodings.
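A short sketch of how the encoding can be made explicit when reading and writing text in Python, and how to check whether text stays within the ASCII subset. The file name and sample content are placeholders.

```python
import tempfile
from pathlib import Path

# Sample content with non-ASCII characters (placeholder data).
text = "Größe,Wert\nFläche,12.5\n"

# Placeholder location; any path would do.
path = Path(tempfile.mkdtemp()) / "sample.csv"

# Name the encoding explicitly: the platform default differs between systems.
path.write_text(text, encoding="utf-8")
restored = path.read_text(encoding="utf-8")
print(restored == text)  # True

# str.isascii() reports whether the text stays within the ASCII subset.
print(text.isascii())          # False
print("size,value".isascii())  # True
```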

Caution

There are two common ways to encode newlines in text files. The LF encoding is used on macOS and Linux/UNIX. The CR LF encoding is used on Windows. Much software can handle both. In such a case, prefer the LF encoding.

When newlines do matter, be sure to keep them under control. To help with that, you can configure Git to automatically convert newline encoding on commit or checkout.
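Independently of Git's own conversion settings, newline encodings can also be inspected in a script. The sketch below is one possible approach, not part of the workflow described here; the function names are hypothetical. It works on raw bytes so that Python's own newline translation does not hide the difference.

```python
def count_newlines(data: bytes) -> dict:
    """Count LF-only and CR LF line endings in raw file content."""
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf  # every CR LF also contains an LF
    return {"LF": lf_only, "CRLF": crlf}


def to_lf(data: bytes) -> bytes:
    """Convert CR LF line endings to LF, leaving existing LF endings alone."""
    return data.replace(b"\r\n", b"\n")


# Placeholder content with mixed line endings.
sample = b"a,b\r\n1,2\r\n3,4\n"

print(count_newlines(sample))         # {'LF': 1, 'CRLF': 2}
print(count_newlines(to_lf(sample)))  # {'LF': 3, 'CRLF': 0}
```

To apply this to a file, read it with open(path, "rb") so that the bytes arrive unmodified.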

Metadata formats

Several FAIR principles give guidance on metadata. The principle:

R1.3: (Meta)data meet domain-relevant community standards

indicates that the selection of metadata formats requires attention. A metadata format suitable for a particular field of research can be selected by exploring the DCC Metadata Standards and the Metadata Standards Catalog. In addition, there are various metadata formats for use cases such as publishing on Zenodo, publishing code as a package in a package repository, assisting citation, and so on. Moreover, some data formats such as NetCDF allow metadata to be included inside the data files.

This proliferation of formats covers a spectrum of use cases that may or may not be important to the code and data that you are managing, the usage context, and the audience that you are trying to reach. In principle, the more formats that you use for adding metadata, the better. In practice, this takes effort. Use common sense to make a selection.

Licensing

The FAIR principle

R1.1: (Meta)data are released with a clear and accessible data usage license

supports the reuse of data. The same holds for code: release with a license that makes the rights for reuse clear. Think about licensing early. Even before releasing to the public, it is a good idea to already specify the rights for reuse through a license so that collaborators and project partners who have access know their rights.

Research outputs and other original work produced for LAMASUS should default to a public domain license (e.g. CC0) or a permissive license such as CC-BY for data or the MIT license for code. Derivative works of copyleft material may require a compatible copyleft license. A copyleft license may be preferred when the freedom of derivative work should be ensured. Further guidance on choosing an open source license can be found on choosealicense.com.

You can license a Git repository and thereby the files contained in it. This typically involves including a file detailing the license in the repository. For repositories hosted on GitHub, the GitHub web interface provides an easy way to do so. For standard license types, it can be sufficient to specify the license type as part of metadata and a README. Zenodo metadata includes a license field. The Citation File Format (CFF) too has optional license fields.

Tip

You can use the CFF INIT online tool to create a CFF file.
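To give a sense of what such a file contains, here is a minimal sketch that writes a CFF file by hand, including the optional license field. It assumes CFF version 1.2.0 and an SPDX license identifier; the function name, title, and author are placeholders, and the sketch does not escape quotation marks inside values.

```python
def minimal_cff(title, family_names, given_names, license_id):
    """Return the text of a minimal CITATION.cff file (assumes CFF 1.2.0)."""
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this work, please cite it using these metadata."',
        f'title: "{title}"',
        "authors:",
        f'  - family-names: "{family_names}"',
        f'    given-names: "{given_names}"',
        f"license: {license_id}",  # SPDX identifier, e.g. MIT or CC-BY-4.0
    ]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
cff_text = minimal_cff("Example dataset", "Doe", "Jane", "CC-BY-4.0")
print(cff_text)
```

The result would be saved as CITATION.cff in the root of the repository. A file produced this way should still be checked with a CFF validator.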

Different parts of your code base or dataset may be subject to different licensing conditions. In such a case, you need to carefully specify which license pertains to what, or break up the tree of files under management such that repositories hold only files subject to a single license. These separate repositories can be combined into a single tree via Git submodule references.

Reproducibility

Reproducibility is a basic tenet of science. Following the FAIR principles goes some way towards making your research reproducible. In addition, using a Git-based workflow allows you or others to go back in time and retrieve old code or data versions to reproduce prior results. When you expect that a particular version of your files will be needed to reproduce your research in the future, you can tag that commit in Git or use more involved release management to enable people to easily return to that point in your file tree’s history.
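Tagging can be done directly on the command line. If it is part of a scripted pipeline, a sketch along the following lines could wrap the git tag command. The function names and the tag name are hypothetical, and Git is assumed to be installed and on the PATH.

```python
import subprocess


def build_tag_command(tag, message, commit="HEAD"):
    """Build the argument list for creating an annotated Git tag."""
    return ["git", "tag", "-a", tag, "-m", message, commit]


def tag_commit(repo_dir, tag, message, commit="HEAD"):
    """Create the tag in the repository at repo_dir; raises if Git fails."""
    subprocess.run(build_tag_command(tag, message, commit), cwd=repo_dir, check=True)


# Show the command that would be run (placeholder tag name and message).
print(" ".join(build_tag_command("paper-v1", "Version used for the paper")))
```

Calling tag_commit(".", "paper-v1", "Version used for the paper") would tag the current commit of the repository in the working directory. The tag still has to be pushed to the hosted repository for others to see it.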

Even when all FAIR principles have been followed, just providing the data may not be enough for reproducibility. Also providing the actual code that does the data wrangling is often necessary to make it practical to reproduce results. But how to tie code to the actual data used or produced? Here too, a Git-based workflow can help out. You can for example:

  • Include code and data in the same Git repository.

  • Reference a Git data repository from a Git code repository as a submodule.

  • Reference data on a remote server by URL from a Git code repository and include download scripting.

  • Publish result data in a Git repository that references as a submodule the repository containing the code used to perform the processing.
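For the option of referencing data by URL, the download scripting could look roughly like the sketch below. It fetches the file only if it is not yet present and verifies a SHA-256 checksum so that the code is tied to one exact version of the data. The function names are hypothetical, and the checksum approach is an assumption added here rather than something prescribed above.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_data(url, dest, expected_sha256):
    """Download url to dest unless already present, then verify the checksum."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    actual = sha256_of(dest)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {dest}: got {actual}")
    return dest
```

The URL and the expected checksum would be kept in the code repository, so that each commit of the code records exactly which data it was run against.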


[1]

Compared to binary data, text-based numerical data takes up about three times as much space at full precision [3], and parsing it may require a significant amount of time.

[2]

Formats such as JSON and XML can have all content scrunched onto a single long line. A “pretty print” option that flows the text across multiple lines is available in libraries that write these formats.

[3]

Floating point numbers require special care when represented as text. The full binary precision will only be encoded in the text if you allow for sufficient digits of precision instead of truncating with few digits after the decimal point when generating the text.
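The effect can be illustrated with a short Python sketch. It relies on two known properties: repr() of a Python float produces the shortest text that converts back to the same value, and 17 significant digits are always enough for a 64-bit floating point number.

```python
x = 0.1 + 0.2  # a value that is not exactly representable in decimal

truncated = f"{x:.3f}"   # few digits after the decimal point
shortest = repr(x)       # shortest text that round-trips exactly
seventeen = f"{x:.17g}"  # 17 significant digits

print(truncated, float(truncated) == x)  # 0.300 False
print(shortest, float(shortest) == x)    # 0.30000000000000004 True
print(float(seventeen) == x)             # True
```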