Why scientists should learn from Aaron Swartz.

“He wanted openness, debate, rationality and critical thinking and above all refused to cut corners.” — Lawrence Lessig

Aaron Swartz helped draft the RDF Site Summary (RSS) standard at age 13 and was in many respects a prodigy. As Lawrence Lessig wrote about Aaron: “He wanted openness, debate, rationality and critical thinking and above all refused to cut corners.” Sadly, he took his own life after facing particularly aggressive prosecution for the bulk downloading of academic journal articles. The documentary The Internet’s Own Boy: The Story of Aaron Swartz provides an homage to his life and work.

He left a legacy of writings with a clarity and brilliance I have rarely encountered, which is all the more striking given the age at which many of these blog posts and essays were written. Few people come close to the way Aaron articulated his ideas in writing.

In a series of blog posts I’ll summarize some of his ideas on technology, politics and media within the context of contemporary scientific (ecological) research. The fact that his ideas and vision remain key to what I consider solid scientific practice reflects his genius and insight.

release late, release rarely (release early, release often)

In a blog post written on July 5, 2006 (release late, release rarely) Aaron outlines how to develop software. Yet this essay could just as well apply to scientific research, from idea to publication.

Like the software (pet) projects that are the subject of his blog post, science projects often have strong emotions attached to them. While these emotions are genuine, the content or quality of the research might not pass muster.

“When you look at something you’re working on … you can’t help but see past the actual thing to the ideas that inspired it… But when others look at it, all they see is a piece of junk.”

In science, this basically means that you should do your homework and not oversell your research. In peer review, reviewers will see past inflated claims and, rightfully so, reject manuscripts because of them. So when you publish, release late and aim for quality, not quantity. This will raise the chance of getting your work published, while at the same time increasing the likelihood that you stumble on errors yourself. Raising the true quality, or making the work look good, often highlights inconsistencies you can’t move past in good conscience.

“Well, it looks great but I don’t really like it” is a lot better than “it’s a piece of junk”.

Releasing work late also means that no one knows what you are doing, and you might miss out on key feedback. So, informally, research benefits from being released early.

“Still, you can do better. Releasing means showing it to the world. There’s nothing wrong with showing it to friends or experts or even random people in a coffee shop. The friends will give you the emotional support you would have gotten from actual users, without the stress. The experts will point out most of the errors the world would have found, without the insults. And random people will not only give you most of the complaints the public would, they’ll also tell you why the public gave up even before bothering to complain.”

Releasing early means that you get valuable feedback that might otherwise not make it into a high-quality paper (released late). This feedback comes not only from experts but, as correctly observed, from everyone within a larger (research) community.

In short, scientific communication and progress require a split approach, where manuscripts are released as late as possible, with ideas mature and solidly supported by open code and data that were released as early as possible.

Note: Although the argument can be made that conferences serve the purpose of “early releases”, I have yet to see a conference where people present truly early work. Most of the time either published or nearly published work is presented.

 

snotelr – an R package for easy access to SNOTEL data

I recently created the MCD10A1 product, a combined MODIS MOD10A1 and MYD10A1 product that alleviates some of the low bias introduced by either overpass through a maximum-value approach. This approach has been used in the study by Gascoin et al. (2013), but I wanted some additional validation of the retrieved values.

As such I looked at the SNOTEL network, which “… is composed of over 800 automated data collection sites located in remote, high-elevation mountain watersheds in the western U.S. They are used to monitor snowpack, precipitation, temperature, and other climatic conditions. The data collected at SNOTEL sites are transmitted to a central database, called the Water and Climate Information System, where they are used for water supply forecasting, maps, and reports.” Here, the snowpack metrics could provide the needed validation data for my MCD10A1 product.

Although the SNOTEL website offers plenty of plotting options for casual exploration and the occasional report, the interface remains rather clumsy with respect to full automation. As such, and similar to my amerifluxr package (both in spirit and execution), I created the snotelr R package. Below you find a brief description of the package and its functions.

Installation

You can quickly install the package by installing the dependencies and then downloading the package from the GitHub repository, as shown below.
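A minimal sketch (the dependency list and repository path are assumptions; check the package DESCRIPTION file and README for the authoritative versions):

  # install the assumed dependencies: devtools for the GitHub install,
  # shiny for the graphical user interface
  install.packages(c("devtools", "shiny"))

  # install snotelr straight from its GitHub repository (path assumed)
  devtools::install_github("khufkens/snotelr")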

Use

Most people will prefer the GUI to explore data on the fly. To invoke the GUI use the following command:
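Assuming the GUI launcher follows the same dot-naming style as the package's other functions (the exact name may differ; see the package documentation):

  # load the package and launch the shiny based GUI (hypothetical function name)
  library(snotelr)
  snotel.explorer()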

This will start a shiny application with an R backend in your default browser. The first window displays all site locations and allows for subsetting of the data based upon state or a bounding box. The bounding box can be selected by clicking its top-left and bottom-right corners.

The plot data tab allows for interactive viewing of the snow water equivalent (SWE) data together with a covariate (temperature, precipitation). The SWE time series will also mark snow phenology statistics, mainly the day of:

  • first snow melt
  • a continuous snow free season (last snow melt)
  • first snow accumulation (first snow deposited)
  • continuous snow accumulation (permanent snow cover)
  • maximum SWE (and its amount)

For in-depth analysis, the above statistics can be retrieved using the snow.phenology() function:
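A minimal sketch, assuming df holds SNOTEL data previously downloaded for a single site:

  # calculate snow phenology metrics (first / last snow melt, first / continuous
  # snow accumulation, maximum SWE) from a downloaded SNOTEL time series
  phenology <- snow.phenology(df)
  print(phenology)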

To access the full list of SNOTEL sites and associated meta-data, use the snotel.info() function:
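For example, assuming the function can be called with its default arguments:

  # retrieve the full SNOTEL site list and associated meta-data
  site_info <- snotel.info()
  head(site_info)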

To query data for, e.g., site 924 as shown in the image above, use:
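The call might look something like this (the download function name and its arguments are assumptions; check the package documentation for the exact interface):

  # download SNOTEL data for site 924 (hypothetical function name)
  df <- download.snotel(site = 924)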

 

MODIS HDF data extraction in R

The Moderate Resolution Imaging Spectroradiometer (MODIS) sensors on the Aqua and Terra platforms provide a wealth of (environmental) data. However, manipulating the raw files can be challenging. Although Google Earth Engine provides an easier way to access these data, as most of the MODIS products are hosted there, sometimes direct manipulation is still necessary. Here I quickly outline how to extract data from raw MODIS files using R.

Dependencies

To download and extract data from a MODIS HDF file in R you need a few packages, mainly:
– raster
– sp
– MODIS

After installing these packages, make sure you load them to continue:
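In R this simply means loading the three packages listed above:

  # load the required packages
  library(raster)
  library(sp)
  library(MODIS)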

Finding your way on the globe

First you have to determine where your data are located within the tiled format in which MODIS data are distributed. All tiles are denoted with a horizontal (h) and a vertical (v) index. For the Land Products there are roughly 350 tiles with varying degrees of actual land coverage. All MODIS data are distributed in a custom sinusoidal projection, hence the rather odd-looking shape of the map.

(Figure: the MODIS sinusoidal tile grid)

In order to find the right tile we load a coordinate into R. In this case it’s the location of Harvard Yard, at 42.375028 latitude and -71.116493 longitude.
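A minimal sketch of how the tile indices can be derived by hand, using the spherical sinusoidal projection and the standard MODIS tile size (the constants and the manual approach are my own illustration, not necessarily the code from the original post):

  # location of Harvard Yard
  lat <- 42.375028
  lon <- -71.116493

  # radius of the sphere used by the MODIS sinusoidal grid (m) and
  # the width / height of a single MODIS tile (m)
  r_earth   <- 6371007.181
  tile_size <- 1111950.5196667

  # forward spherical sinusoidal projection
  x <- r_earth * (lon * pi / 180) * cos(lat * pi / 180)
  y <- r_earth * (lat * pi / 180)

  # tile indices, counted from the upper left corner of the 36 x 18 tile grid
  h <- floor((x + 18 * tile_size) / tile_size)
  v <- floor((9 * tile_size - y) / tile_size)

  print(c(h = h, v = v))  # should yield h = 12, v = 4 for this location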

Once you have the horizontal and vertical tiles you can download only these tiles instead of the entire archive.

Extracting data

As previously mentioned, the data are stored in HDF files. These files are readily readable by the GDAL library. However, given the structured nature of these files, you have to know what you are looking for. The layers in structured HDF files are called scientific data sets (SDS), and you need to specify one to read it individually.
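A rough sketch using the getSds() function of the MODIS package (the file name is hypothetical, and GDAL needs to be compiled with HDF4 support for this to work):

  # list the scientific data sets (SDS) contained in the HDF file
  sds <- getSds(HdfName = "MOD10A1.A2016001.h12v04.006.hdf")
  print(sds$SDSnames)

  # read one SDS (here the first one) as a raster layer
  modis_layer <- raster(sds$SDS4gdal[1])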

Now you have successfully read in the data. However, these data are in the sinusoidal projection, not lat-long. In order to extract values at given locations, we have to translate the coordinates of these extraction points to the MODIS sinusoidal projection.
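A minimal sketch using sp (and rgdal for the actual transformation), with the Harvard Yard location used earlier:

  library(sp)
  library(rgdal)

  # define the point of interest in geographic (lat / lon) coordinates
  point_ll <- SpatialPoints(cbind(-71.116493, 42.375028),
                            proj4string = CRS("+proj=longlat +datum=WGS84"))

  # the MODIS sinusoidal projection string
  modis_sin <- CRS("+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs")

  # translate the point into sinusoidal coordinates
  point_sin <- spTransform(point_ll, modis_sin)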

Finally, extract the data at the new sinusoidal coordinates.
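Continuing with the objects from the sketches above, this comes down to:

  # extract the value of the MODIS layer at the projected location
  extract(modis_layer, point_sin)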

open access != easy access

Over the past few years, writing software left and right, I noticed a trend. Most of the software I write serves one purpose: making open access data – accessible! This should not be!

There has been a steady push for open data, increasing the transparency and reproducibility of scientific reporting. Although many scientific data sources are indeed open, accessing them is not (easy), especially for the less computer-savvy.

For example, NASA provides a wealth of open data. Although there are a few tools one can rely on, none are user friendly. Luckily, most of those can be replaced by open source tools coded by data users (ModisTools, MODIS LSP, GDAL). The European Space Agency (ESA) fares even worse: its data access is more restrictive, and accessing the data is an equal or even bigger mess. However, ESA has seen a recent push for a more user-centric experience on some of its projects.

Looking at the field of ecology, some projects do well and maintain APIs, such as Daymet and Tropicos, although Tropicos requires a personal key, which makes writing toolboxes cumbersome. The latter left me with no other choice than to scrape the website. The Oak Ridge National Laboratory (ORNL) also offers MODIS land product subsets through an API, with interfaces coded by users. These data are truly open access (as in relatively easy to query) and should be considered the way forward.

This contrasts with, for example, resources such as The Plant List, which offers a wealth of botanical knowledge guarded behind a search box on a web page, only to be unlocked by either downloading the whole database or by using a third-party website. Similarly, the National Snow and Ice Data Center’s oldest snow and ice data are stored in an incomprehensible format (the more recent data are offered in an accessible GeoTIFF format). Surprisingly, even large projects such as Ameriflux, with a rather prominent internet presence, suffer the same fate, i.e. a wealth of data largely inaccessible for quick and easy use and review.

Pooling several data access issues in the above examples, I think I’ve illustrated that open access data does not equal easily accessible data. A good few of us write and maintain toolboxes to alleviate these problems for ourselves and the community at large. However, these efforts take up valuable time and resources and can’t be academically remunerated, as only a handful of tools would qualify as substantial enough to publish.

I would therefore plead with data producers (and projects alike) to make their open data easily accessible by:

  1. creating proper APIs to access all data or metadata (for querying)
  2. making APIs truly open, so that writing plugins is easy if you don’t do it yourself
  3. writing toolboxes that do not rely on proprietary software (e.g. Matlab)
  4. assigning a true budget to these tasks
  5. appreciating those that develop tools on the side

update: a recent editorial in Science actually made much the same points as I did. Although there has been a discussion online about research parasitism, I side with the editor in that there is a growing need to re-digest some of the data that is online, if not bring it online. These reprocessed data also have value-added properties, which makes the #IamAResearchParasite argument moot (with proper acknowledgements to the original authors, however!).

DAYMET ancillary data conversion

I recently posted tools to extract DAYMET data for point locations. However, the product as produced by the DAYMET team is gridded data. If you want to scale models driven by DAYMET data spatially you need access to this gridded (spatial) data.

Sadly, access to the data is rather convoluted. If you want to download this gridded data, or tiles, you first have to figure out which tiles you need. This means going to the DAYMET website and clicking several times within each tile you want to download to get its individual number. After establishing all these tile numbers you can download them from the THREDDS server.

Obviously this routine is less than ideal if, for example, you want to do regional to state level research and have to select a large number of tiles. I hope to automate this process. The first step is reprojecting the tile grid to latitude and longitude coordinates. Oddly enough, the format of the netCDF file was not standard and rather confusing, and it took me a while to figure out how to accomplish this; in the end a short bash script does the job. These instructions work on all ancillary netCDF grids as provided by DAYMET. From the tile file generated with the code below you can either determine the tile for any given point location or for a region of interest. The latter is my goal and in the works, as the former is covered by previous code.
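A rough equivalent of that workflow in R (the raster and ncdf4 packages, the file and variable names, and the projection string below are assumptions, not the original bash script):

  # read the ancillary tile grid from the netCDF file (requires ncdf4);
  # both the file and variable names are hypothetical
  library(raster)
  tiles <- raster("daymet_tiles.nc", varname = "tileid")

  # DAYMET grids use a Lambert conformal conic projection, which the
  # ancillary files do not declare in a standard way, so assign it manually
  projection(tiles) <- "+proj=lcc +datum=WGS84 +lat_1=25 +lat_2=60 +lat_0=42.5 +lon_0=-100 +x_0=0 +y_0=0 +units=m"

  # reproject to geographic (lat / lon) coordinates, preserving tile numbers
  tiles_ll <- projectRaster(tiles, crs = "+proj=longlat +datum=WGS84", method = "ngb")

  # look up the tile number for any point of interest
  extract(tiles_ll, cbind(-71.116493, 42.375028))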