Want to get published, show me your code.

Over the last few years there has been a strong focus on open data and open access journals. This is in part stimulated by the reproducibility crisis in science, most notably in the biomedical sciences. However, the strong focus on data and journal access alone is misplaced.

Many fields, such as ecology and remote sensing, rely increasingly on ever more complex software (models) and ever larger amounts of data. Yet there isn’t the same demand for releasing code and/or for open coding practices. All too often one is still confronted with a statement at the end of the manuscript reading: “Code is available from the authors at reasonable request”.

What “reasonable” means is often unclear, but it clearly does not stimulate reproducibility (e.g., a critical request might not be deemed “reasonable”). It also actively interferes with the task of reviewers, who assume (in good faith) that the analysis was correctly executed. However, given the amount of data (sources) used and the number of lines of code produced, errors are far from unlikely.

With services such as GitHub and Docker containers, there should be a requirement for any study that is heavy on the modelling side and relies on open data to be fully reproducible, if only through a small worked example when the full dataset is prohibitive in size or cannot be shared for ethical reasons.

Moreover, when it comes to model comparisons there should be an active effort to formalize these comparisons in community-driven frameworks (e.g., an R package, a Python package, Docker images, or a formalized workflow). Such rigorous efforts are required to truly assess model performance and quantify model errors at all levels (from source data to model structure). Alas, such efforts are few and far between in ecology, as are open and good coding practices.

The lack of transparency is in part fueled by a gatekeeper effect. It is profitable not to share code, just as it is profitable not to share data. Not sharing code puts other scientists at a disadvantage, as similar studies or incremental advances upon the original code can’t easily be made. Given that not sharing code undermines reproducibility and actively slows down scientific progress, I’m inclined not to consider studies fit for publication without accessible source code.

note 1: The active sharing of algorithms is far more common in computer science and physics.

note 2: I got pushback on the notion that there is a gatekeeper effect in science. Yet the fact that a “reasonable request” is mentioned, rather than any request, implies a gatekeeper effect: it is up to the authors to decide how and to whom access to the code (and applications thereof) is granted. But what about licensing? Although licensing might require citation (CC-BY), release under the same license (GPL), or prohibit commercial applications (CC-NC), it still guarantees access to the code to begin with.


Google Earth Engine time series subset tool

Google Earth Engine (GEE) has provided a way to massively scale up a lot of remote sensing analyses. However, more often than not, time series analyses are carried out on a site-by-site basis and scaling to a continental or global level is not required. Furthermore, some applications are hard to implement on GEE, or prototyping does not benefit from direct spatial scaling. In short, working on a handful of reference pixels locally is often still faster than on Google’s servers. Here I sidestep the handling of large amounts of data (although sometimes helpful) and get to single-location time series subsets with a GEE hack.

I wrote a simple Python script / library called gee_subset.py which allows you to extract time series for a particular location or its neighbourhood. This tool is similar to my MODIS subset and daymetr tools, which facilitate the extraction of time series of remote sensing and climatological data, respectively.

My Python script expands this functionality to all available GEE products, which include high resolution Landsat and Sentinel data, climatological data such as Daymet, and even representative concentration pathway (RCP) CMIP5 model runs.

Compared to the ORNL DAAC MODIS subset tool, performance is blazing fast (thank you Google). An example query, calling the Python script from R, downloaded two years (~100 data points) of Landsat 8 Tier 1 data for two bands (red, NIR) in ~8 seconds flat. Querying a larger footprint (1×1 km) only creates a small overhead (13 sec. query). The resulting figure for the point location with the derived NDVI values is shown below. The demo script to recreate this figure is included in the example folder of the github repository.
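The snippet below sketches what such a query from R looks like. The command line flags, product identifier, coordinates and output column names are assumptions for illustration, not a verbatim copy of the demo script; see the example folder of the github repository for the authoritative version.

    # call gee_subset.py from R via a system call
    # (flags, product ID, coordinates and the output file name are assumptions)
    system(paste("python gee_subset.py",
                 "-p 'LANDSAT/LC08/C01/T1_SR'",     # assumed Landsat 8 Tier 1 product ID
                 "-b B4 B5",                        # red and NIR bands
                 "-s 2016-01-01 -e 2017-12-31",     # two years of data
                 "-l 44.064665 -71.287575",         # hypothetical site coordinates
                 "-d ."))                           # write output to the current directory

    # read the resulting csv and derive NDVI (file and column names are assumptions)
    df <- read.csv("site_gee_subset.csv")
    df$date <- as.Date(df$date)
    df <- df[order(df$date), ]
    df$ndvi <- (df$B5 - df$B4) / (df$B5 + df$B4)

    # plot the NDVI time series with a loess fit
    plot(df$date, df$ndvi, pch = 19, xlab = "date", ylab = "NDVI")
    fit <- loess(ndvi ~ as.numeric(date), data = df)
    lines(df$date, predict(fit))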

NDVI values from Landsat 8 Tier 1 scenes. The black line depicts a loess fit to the data, with the gray envelope representing the standard error.

Level Theta S images

My Virtual Forest project is still running strong and generates tons of spherical images (currently ~50GB). However, the post on which the camera sits is not perfectly level. The Theta S camera normally compensates for this using an internal gyroscope which detects the pitch and roll of the camera. Yet, when downloading images directly from the camera, no adjustments are made and the pitch and roll data are merely recorded in the EXIF data of the image.

As such, I wrote a small bash script which rectifies (levels the horizon of) Theta S spherical images using this internal EXIF data. This is an alternative implementation of the THETA EXIF Library by Regen, and I use his cute Lama test images for reference. All credit for the funky images goes to Regen. Below is a quick install guide for my script. I hope it helps speed up people’s Theta S workflow.

Install

Download, fork or copy paste the script from my github repository to your machine and make it executable.
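For example, assuming the script is saved as theta_rectify.sh (adjust the name to whatever it is called in the repository):

    # make the downloaded script executable
    chmod +x theta_rectify.sh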

Use
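Run the script on a spherical image, for example (again assuming the script name theta_rectify.sh; image.jpg is a placeholder for your own file):

    ./theta_rectify.sh image.jpg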

The above command will rectify the image.jpg file and output a new file called image_rectified.jpg.

A visual comparison between my results and those of Regen’s python script shows good correspondence.

Requirements

The script depends on a working copy of exiftool, imagemagick and POVRay. These tools are commonly available in most Linux distros, and can be installed on macOS using tools such as Homebrew. I lack an MS Windows system, but the script should be easily adjusted to cover similar functionality.
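On a Debian-based distro or on macOS with Homebrew, installation would look roughly like this (package names may differ between package managers and releases):

    # Debian / Ubuntu
    sudo apt-get install libimage-exiftool-perl imagemagick povray

    # macOS (Homebrew)
    brew install exiftool imagemagick povray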

R arctic polar plots

For a project I needed to create an appealing plot of the Arctic, showing the location of some field sites. I posted this map on Twitter earlier today. Below, I outline a simple routine to recreate this plot and, if need be, adjust it to your liking.

First of all, I downloaded an appealing background from the Blue Marble dataset created by NASA, which is available as geotiffs. Alternatively, you can use the less realistic, more summary-style graphics produced by Natural Earth.

After downloading the Blue Marble geotiff, trim the data to the lowest latitude you want to plot. Subsequently, reproject the data to the EPSG:3995 projection (Arctic Polar Stereographic). All of this is done using GDAL. This step could also be done using rgdal, but at times that doesn’t play nice, which is why I originally posted the command line GDAL code.

UPDATE: the command line gdal code is no longer necessary, as I now call the raster library in R, which deals with the cropping and reprojection fine after some fiddling.

The remaining R code ingests this background image and overlays a graticule and some labels. For this I heavily borrowed from the sp map gallery.
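A minimal sketch of this workflow is given below. The file name, the latitude cut-off and the graticule styling are placeholders and will differ from the original script.

    # crop the Blue Marble geotiff to high latitudes, reproject to
    # Arctic Polar Stereographic (EPSG:3995) and plot with a graticule
    # (file name and latitude cut-off are placeholders)
    library(raster)
    library(sp)

    blue_marble <- brick("blue_marble.tif")                    # placeholder file name
    arctic <- crop(blue_marble, extent(-180, 180, 55, 90))     # keep latitudes above ~55 N
    arctic_polar <- projectRaster(arctic,
                                  crs = CRS("+init=epsg:3995"),
                                  method = "ngb")              # keep RGB values intact

    plotRGB(arctic_polar)                                      # plot the background image
    llgridlines(as(arctic_polar, "SpatialGridDataFrame"),
                lty = 2)                                       # overlay a lon/lat graticule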

snotelr – a R package for easy access to SNOTEL data

I recently created the MCD10A1 product, a combined MODIS MOD10A1 and MYD10A1 product which alleviates some of the low bias introduced by either overpass through a maximum value approach. This approach has been used in the study by Gascoin et al. (2013), but I wanted some additional validation of the retrieved values.

As such I looked at the SNOTEL network, which “… is composed of over 800 automated data collection sites located in remote, high-elevation mountain watersheds in the western U.S. They are used to monitor snowpack, precipitation, temperature, and other climatic conditions. The data collected at SNOTEL sites are transmitted to a central database, called the Water and Climate Information System, where they are used for water supply forecasting, maps, and reports.” Here, the snowpack metrics could provide the needed validation data for my MCD10A1 product.

Although the SNOTEL website offers plenty of plotting options for casual exploration and the occasional report, the interface remains rather clumsy with respect to full automation. As such, and similar to my amerifluxr package (both in spirit and execution), I created the snotelr R package. Below you find a brief description of the package and its functions.

Installation

You can quickly install the package by first installing the following dependencies:
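A minimal route is to install devtools, which can install packages straight from github and pulls in the package’s own dependencies (e.g. those of the shiny GUI) automatically:

    # install devtools to allow installation from github
    install.packages("devtools")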

and then downloading the package from the github repository:
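Assuming the repository lives at khufkens/snotelr (check the repository page for the exact path):

    # install snotelr straight from github and load it
    devtools::install_github("khufkens/snotelr")
    library(snotelr)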

Use

Most people will prefer the GUI to explore data on the fly. To invoke the GUI use the following command:
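(The function name below is an assumption, chosen to match the dot-style names of the other functions in the package; check the package documentation for the exact call.)

    # start the shiny GUI in your default browser
    # (function name assumed)
    snotel.explorer()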

This will start a shiny application with an R backend in your default browser. The first window will display all site locations, and allows for subsetting of the data based upon state or a bounding box. The bounding box can be selected by clicking top-left and bottom-right.

The plot data tab allows for interactive viewing of the snow water equivalent (SWE) data together with a covariate (temperature, precipitation). The SWE time series will also mark snow phenology statistics, mainly the day of:

  • first snow melt
  • a continuous snow free season (last snow melt)
  • first snow accumulation (first snow deposited)
  • continuous snow accumulation (permanent snow cover)
  • maximum SWE (and its amount)

For in-depth analysis, the above statistics can be retrieved using the snow.phenology() function:
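A sketch, assuming the function accepts the data frame returned by the package’s download routine (see the site query example further below):

    # calculate snow phenology statistics for a downloaded SNOTEL time series
    # ('df' is assumed to be a data frame of downloaded SNOTEL data)
    phenology <- snow.phenology(df)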

To access the full list of SNOTEL sites and the associated meta-data, use the snotel.info() function:
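For example (any arguments are omitted here):

    # list all SNOTEL sites and their meta-data
    meta_data <- snotel.info()
    head(meta_data)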

To query data for a given site, e.g. site 924 as shown in the image above, use:
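(The download function name and its argument are assumptions based on the package’s naming scheme; consult the package documentation for the exact call.)

    # download data for SNOTEL site 924 into the current R session
    # (function and argument names are assumed)
    df <- snotel.download(site = 924)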