snotelr – an R package for easy access to SNOTEL data

I recently created the MCD10A1 product, a combined MODIS MOD10A1 and MYD10A1 snow cover product that alleviates some of the low bias introduced by either overpass through a maximum-value approach. This approach was used by Gascoin et al. (2013), but I wanted some additional validation of the retrieved values.

As such I looked at the SNOTEL network, which “… is composed of over 800 automated data collection sites located in remote, high-elevation mountain watersheds in the western U.S. They are used to monitor snowpack, precipitation, temperature, and other climatic conditions. The data collected at SNOTEL sites are transmitted to a central database, called the Water and Climate Information System, where they are used for water supply forecasting, maps, and reports.” Here, the snowpack metrics could provide the needed validation data for my MCD10A1 product.

Although the SNOTEL website offers plenty of plotting options for casual exploration and the occasional report, the interface remains rather clumsy with respect to full automation. As such, and similar to my amerifluxr package (both in spirit and execution), I created the snotelr R package. Below you find a brief description of the package and its functions.

Installation

You can quickly install the package by installing the required dependencies and then downloading the package from its GitHub repository.
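A minimal sketch of these two steps, assuming devtools handles the GitHub install and that the repository lives at khufkens/snotelr; the dependency list below is a guess, so check the package DESCRIPTION file for the definitive one:

    # install assumed dependencies (check the DESCRIPTION file for the definitive list)
    install.packages(c("devtools", "shiny", "leaflet", "httr", "xml2"))

    # install snotelr straight from its GitHub repository
    devtools::install_github("khufkens/snotelr")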

Use

Most people will prefer the GUI to explore data on the fly. To invoke the GUI, use the following command:
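(The explorer function name below is an assumption, chosen to match the package's other dot-separated function names.)

    # load the package and launch the shiny-based GUI
    # (the explorer function name is an assumption)
    library(snotelr)
    snotel.explorer()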

This will start a shiny application with an R backend in your default browser. The first window displays all site locations and allows for subsetting of the data by state or by a bounding box, which you select by clicking its top-left and bottom-right corners.

The plot data tab allows for interactive viewing of the snow water equivalent (SWE) data together with a covariate (temperature, precipitation). The SWE time series will also mark snow phenology statistics, mainly the day of:

  • first snow melt
  • a continuous snow-free season (last snow melt)
  • first snow accumulation (first snow deposited)
  • continuous snow accumulation (permanent snow cover)
  • maximum SWE (and its amount)

For in-depth analysis the above statistics can be retrieved using the snow.phenology() function (see the query example below).

To access the full list of SNOTEL sites and associated metadata, use the snotel.info() function.
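A sketch, assuming the function can be called without arguments and returns a data frame of site metadata:

    # list all SNOTEL sites and their metadata
    # (calling it without arguments is an assumption)
    site_meta <- snotel.info()
    head(site_meta)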

To query data for e.g. site 924, as shown in the image above, use:
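(The download function and argument names below are assumptions in keeping with the package's dot-separated naming, and the snow.phenology() call assumes it accepts the downloaded data frame; check the package documentation for the actual calls.)

    # download data for SNOTEL site 924
    # (function and argument names are assumptions)
    snotel_data <- download.snotel(site_id = 924)

    # derive the snow phenology statistics listed above
    # (assumes the function accepts the downloaded data frame)
    phenology <- snow.phenology(snotel_data)
    print(phenology)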

 

MODIS HDF data extraction in R

The Moderate Resolution Imaging Spectroradiometer (MODIS) sensors on both the Aqua and Terra platforms provide a wealth of (environmental) data. However, manipulating the raw files can be challenging. Although Google Earth Engine provides an easier way to access these data, as most MODIS products are hosted there, sometimes direct manipulation is still necessary. Here I quickly outline how to extract data from raw MODIS files using R.

Dependencies

To download and extract data from a MODIS HDF file in R you need a few packages, mainly:
– raster
– sp
– MODIS

After installing these packages make sure you load them to continue.
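For example (assuming all three packages install from CRAN; the MODIS package has at times only been available outside of CRAN):

    # install the packages if you haven't already
    install.packages(c("raster", "sp", "MODIS"))

    # load the packages
    library(raster)
    library(sp)
    library(MODIS)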

Finding your way on the globe

First you have to determine where your data are located within the tiled format in which MODIS data are distributed. All tiles are denoted with a horizontal (h) and a vertical (v) index. For the Land Products there are roughly 350 tiles with varying degrees of actual land coverage. All MODIS data are distributed in a custom sinusoidal projection, hence the rather weird looking shape of the tile map.

[Figure: map of the MODIS sinusoidal tile grid]

In order to find the right tile we load a coordinate into R. In this case it is the location of Harvard Yard, at 42.375028 latitude and -71.116493 longitude.
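The lookup can be sketched in a self-contained way using the standard MODIS sinusoidal grid constants; the MODIS package also provides helpers for this, but the manual version below makes the logic explicit.

    # location of Harvard Yard
    lat <- 42.375028
    lon <- -71.116493

    # project the point onto the MODIS sinusoidal grid
    # (sphere radius used by the MODIS products)
    r <- 6371007.181
    x <- r * (lon * pi / 180) * cos(lat * pi / 180)
    y <- r * (lat * pi / 180)

    # grid constants: 36 x 18 tiles of ~1111.95 km, origin at the upper-left corner
    tile_size <- 1111950.5197
    h <- as.integer(floor((x + 20015109.354) / tile_size))
    v <- as.integer(floor((10007554.677 - y) / tile_size))

    sprintf("tile h%02dv%02d", h, v)  # h12v04 for this location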

Once you have the horizontal and vertical tile indices you can download just those tiles instead of the entire archive.
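A hedged sketch using the MODIS package's getHdf() downloader; the product and date values are placeholders, and the argument names should be double-checked against the package documentation.

    # download MOD10A1 HDF files for tile h12v04 over January 2016
    # (values are placeholders; see ?getHdf for the authoritative arguments)
    getHdf(product = "MOD10A1",
           begin   = "2016.01.01",
           end     = "2016.01.31",
           tileH   = 12,
           tileV   = 4)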

Extracting data

As previously mentioned, the data are stored in HDF files, which are readily readable by the GDAL library. However, given the structured nature of these files you have to know what you are looking for. The layers in structured HDF files are called scientific data sets (SDS), and you need to specify which one you want in order to read it.
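A sketch of that step, listing the SDS layers with getSds() from the MODIS package and reading one with the raster package (the file name is hypothetical, and reading HDF4 subdatasets requires a GDAL build with HDF4 support):

    # list the scientific data sets in a downloaded HDF file
    # (file name is hypothetical)
    sds <- getSds("MOD10A1.A2016001.h12v04.006.hdf")
    print(sds$SDSnames)

    # read the first SDS as a raster layer
    modis_layer <- raster(sds$SDS4gdal[1])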

Now you have successfully read in the data. However, these data are in the sinusoidal projection, not latitude-longitude. In order to extract data at given locations we first have to translate the coordinates of these extraction points to the MODIS sinusoidal projection.
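A sketch of the reprojection using the sp package (the rgdal package needs to be installed for spTransform() to work), reusing the Harvard Yard location from earlier:

    # the extraction point in latitude / longitude
    point <- data.frame(lon = -71.116493, lat = 42.375028)
    coordinates(point) <- ~ lon + lat
    proj4string(point) <- CRS("+proj=longlat +datum=WGS84")

    # reproject the point to the sinusoidal projection of the MODIS layer
    point_sin <- spTransform(point, CRS(projection(modis_layer)))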

Finally extract the data at the new sinusoidal coordinates.
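Continuing the sketch above:

    # extract the MODIS values at the reprojected point location
    values <- extract(modis_layer, point_sin)
    print(values)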

open access != easy access

Over the past few years, writing software left and right, I noticed a trend. Most of the software I write serves one purpose: making open access data – accessible! This should not be!

There has been a steady push for open data, increasing transparency and reproducibility of scientific reporting. Although many scientific data sources are indeed open, accessing them is not (easy), especially for the less computer savvy.

For example, NASA provides a wealth of open data. Although there are a few tools on which one can rely, none are user friendly. Luckily, most of those can be replaced by open source tools coded by data users (ModisTools, MODIS LSP, GDAL). The European Space Agency (ESA) fares even worse: its data access is more restrictive and working with the data is an equal or even bigger mess. However, ESA has seen a recent push for a more user-centric experience on some of its projects.

Looking at the field of ecology, some projects do well and maintain APIs, such as Daymet and Tropicos. Tropicos, however, requires a personal key, which makes writing toolboxes cumbersome; this left me with no choice other than to scrape the website. The Oak Ridge National Laboratory (ORNL) also offers MODIS land product subsets through an API, with interfaces coded by users. These data are truly open access (as in relatively easy to query) and should be considered the way forward.

This contrasts with resources such as The Plant List, which offers a wealth of botanical knowledge guarded behind a search box on a web page, a problem only resolved by either downloading the whole database or using a third-party website. Similarly, the National Snow and Ice Data Center's oldest snow and ice data are stored in an incomprehensible format (the more recent data are offered in an accessible GeoTIFF format). Surprisingly, even large projects such as Ameriflux, with a rather prominent internet presence, suffer the same fate, i.e. a wealth of data largely inaccessible for quick and easy use and review.

Pooling several data access issues in the above examples, I think I've illustrated that open access data does not equal easily accessible data. A good few of us write and maintain toolboxes to alleviate these problems for ourselves and the community at large. However, these efforts take up valuable time and resources and can't be academically remunerated, as only a handful of tools would qualify as substantial enough to publish.

I would therefore plead with data producers (and projects alike) to make their open data easily accessible by:

  1. creating proper APIs to access all data or metadata (for querying)
  2. making APIs truly open so that others can easily write plugins if you don’t do it yourself
  3. writing toolboxes that do not rely on proprietary software (e.g. Matlab)
  4. assigning a true budget to these tasks
  5. appreciating those that develop tools on the side

Update: a recent editorial in Science actually made much the same points as I did. Although there has been a discussion online about research parasitism, I side with the editor in that there is a growing need to re-digest some of the data that is online, if not bring it online in the first place. These reprocessed data also have value-added properties, which makes the #IamAResearchParasite argument moot (with proper acknowledgements to the original authors, of course!).

DAYMET ancillary data conversion

I recently posted tools to extract DAYMET data for point locations. However, the product as produced by the DAYMET team is gridded data. If you want to scale models driven by DAYMET data spatially you need access to this gridded (spatial) data.

Sadly, access to the data is rather convoluted. If you want to download these gridded data (tiles), you first have to figure out which tiles you need. This means going to the DAYMET website and clicking within each tile you want to download to get its individual number. Once you have established all these tile numbers you can download them from the THREDDS server.

Obviously this routine is less than ideal if, for example, you want to do regional to state-level research and have to select a large number of tiles. I hope to automate this process. The first step is reprojecting the tile grid to latitude and longitude coordinates. Oddly enough, the format of the netCDF file is not standard and rather confusing, and it took me a while to figure out how to accomplish this with a short bash script. These instructions work on all ancillary netCDF grids as provided by DAYMET. From the tile grid generated this way you can determine the tile for any given point location or, alternatively, for a region of interest. The latter is my goal and in the works, as the former is covered by previous code.
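For illustration, the same reprojection can also be sketched in R with the raster package; the file name below is a placeholder and the DAYMET Lambert Conformal Conic projection string should be verified against the DAYMET documentation.

    library(raster)  # the ncdf4 and rgdal packages are needed under the hood

    # read the ancillary tile grid (file name is a placeholder for the
    # ancillary netCDF grid downloaded from the DAYMET THREDDS server)
    tiles <- raster("daymet_tileid.nc")

    # the netCDF lacks standard projection metadata, so assign the DAYMET
    # Lambert Conformal Conic projection by hand (verify against the DAYMET docs)
    projection(tiles) <- "+proj=lcc +lat_1=25 +lat_2=60 +lat_0=42.5 +lon_0=-100 +ellps=WGS84"

    # reproject the tile grid to latitude / longitude
    tiles_ll <- projectRaster(tiles, crs = "+proj=longlat +datum=WGS84", method = "ngb")

    # look up the tile number covering a point of interest
    extract(tiles_ll, cbind(-71.116493, 42.375028))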