open access != easy access

Over the past few years, while writing software left and right, I noticed a trend. Most of the software I write serves one purpose: making open access data – accessible! This should not be necessary!

There has been a steady push for open data, increasing the transparency and reproducibility of scientific reporting. Although many scientific data sources are indeed open, access to them is not (easy), especially for the less computer savvy.

For example, NASA provides a wealth of open data. Although there are a few tools on which one can rely, none are user friendly. Luckily, most of those can be replaced by open source tools coded by data users (ModisTools, MODIS LSP, GDAL). The European Space Agency (ESA) fares even worse: its data access is more restrictive, and getting at the data is an equal or even bigger mess. However, ESA has recently pushed for a more user-centric experience on some of its projects.

Looking at the field of ecology, some projects do well and maintain APIs, such as Daymet and Tropicos, although Tropicos requires a personal key, which makes writing toolboxes cumbersome. The latter left me with no other choice than to scrape the website. The Oak Ridge National Laboratory (ORNL) also offers MODIS land product subsets through an API, with interfaces coded by users. Still, these data are truly open access (as in relatively easy to query) and should be considered the way forward.

This contrasts with resources such as The Plant List, which offers a wealth of botanical knowledge guarded behind a search box on a web page, a problem only resolved by either downloading the whole database or using a third-party website. Similarly, the oldest snow and ice data of the National Snow and Ice Data Center are stored in an incomprehensible format (the more recent data are offered in an accessible GeoTIFF format). Surprisingly, even large projects such as Ameriflux, with a rather prominent internet presence, suffer the same fate, i.e. a wealth of data largely inaccessible for quick and easy use and review.

Pooling several data access issues in the above examples, I think I’ve illustrated that open access data does not equal easily accessible data. A good few of us write and maintain toolboxes to alleviate these problems for ourselves and the community at large. However, these efforts take up valuable time and resources and are rarely rewarded academically, as only a handful of tools would qualify as substantial enough to publish.

I would therefore plead with data producers (and projects alike) to make their open data easily accessible by:

  1. creating proper APIs to access all data or metadata (for querying)
  2. making APIs truly open, so that others can easily write plugins if you don’t do so yourself
  3. writing toolboxes that do not rely on proprietary software (e.g. Matlab)
  4. assigning a true budget to these tasks
  5. appreciating those that develop tools on the side
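
To make the first two points a bit more concrete, here is a minimal sketch of how little code a data user needs when a project offers a simple, key-free query API that returns plain CSV. Everything specific in it is an assumption for illustration: the endpoint URL, the parameter names (`lat`, `lon`, `vars`, `start`, `end`, `format`) and the column names are hypothetical and do not describe any of the services mentioned above.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical endpoint, used for illustration only; not a real service.
BASE_URL = "https://data.example.org/api/v1/timeseries"


def build_query_url(lat, lon, variables, start_year, end_year, base_url=BASE_URL):
    """Build a query URL for a (hypothetical) point-based data API."""
    params = {
        "lat": f"{lat:.4f}",
        "lon": f"{lon:.4f}",
        "vars": ",".join(variables),  # assumed: comma-separated variable names
        "start": start_year,
        "end": end_year,
        "format": "csv",
    }
    return f"{base_url}?{urlencode(params)}"


def parse_csv_response(text):
    """Turn a CSV response body into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    url = build_query_url(36.0133, -84.2625, ["tmax", "tmin"], 2000, 2001)
    print(url)

    # Stand-in for a server response, so the sketch runs without a network.
    sample = "year,yday,tmax\n2000,1,5.5\n2000,2,6.0\n"
    for row in parse_csv_response(sample):
        print(row["year"], row["yday"], row["tmax"])
```

With a real service, the sample string would be replaced by the body of an HTTP request to the generated URL. The point is that a documented URL scheme and a plain text format are all it takes for users to build their own tools in any language.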

Update: a recent editorial in Science made much the same points as I did. Although there has been a discussion online about research parasitism, I would side with the editor in that there is a growing need to re-digest some of the data that is online, if not bring it online. These reprocessed data also have value-added properties, and make the #IamAResearchParasite argument moot (with proper acknowledgements to the original authors, however!).