Want to get published? Show me your code.

Over the last few years there has been a strong focus on open data and open access journals. This is in part stimulated by a reproducibility crisis in science, most notably in the biomedical sciences. However, a strong focus on data and journal access alone is misplaced.

Many fields, such as ecology and remote sensing, rely increasingly on ever more complex software (models) and ever larger amounts of data. Yet there isn’t the same demand for releasing code and/or adopting open coding practices. All too often one is still confronted with a statement at the end of a manuscript reading: “Code is available from the authors at reasonable request”.

What “reasonable” means is often unclear, but such a policy clearly does not stimulate reproducibility (e.g., a critical request might not be deemed “reasonable”). It also actively interferes with the task of reviewers, who assume (in good faith) that the analysis was correctly executed. Yet, given the amount of data (sources) used and the number of lines of code produced, errors are far from unlikely.

With services such as GitHub and Docker containers available, there should be a requirement for any study heavy on the modelling side, and which relies on open data, to be fully reproducible, if not with the full dataset then at least through a small worked example when the full dataset is prohibitively large or cannot be shared for ethical reasons.

Moreover, when it comes to model comparisons there should be an active effort to formalize these comparisons in community-driven frameworks (e.g., an R package, a Python package, Docker images, or a formalized workflow). Such rigorous efforts are required to truly assess model performance and quantify model errors at all levels (from source data to model structure). Alas, such efforts are few and far between in ecology, as are open and good coding practices.

This lack of transparency is in part fuelled by a gatekeeper effect. It is profitable not to share code, just as it is profitable not to share data. Not sharing code puts other scientists at a disadvantage, as similar studies or incremental advances upon the original code can’t easily be made. Given that not sharing code constitutes a breakdown of any reproducibility and actively slows down scientific progress, I’m inclined not to consider studies without accessible source code fit for publication.

note 1: The active sharing of algorithms is far more common in computer science and physics.

note 2: I got pushback on the notion that there is a gatekeeper effect in science. Yet the fact that a “reasonable request” is mentioned, rather than merely any request, implies a gatekeeper effect: it is up to the authors to decide how and to whom access to the code (and applications thereof) is granted. But what about licensing? Although a license might require attribution (CC-BY), release under the same license (GPL), or prohibit commercial applications (CC-NC), it still guarantees access to the code to begin with.

Jungle Rhythms made it into The Guardian

A cache of decaying notebooks found in a crumbling Congo research station has provided unexpected evidence with which to help solve a crucial puzzle – predicting how vegetation will respond to climate change. . . . (by Dan Grossman)

My Jungle Rhythms project has made some waves of late. It sparked the interest of Dr. Dan Grossman, a science journalist, and his nice summary of all the Jungle Rhythms work was published in The Guardian. As a result, IFLScience picked it up as well. The response, especially in the comments section of The Guardian, was really positive. I’m happy to see some global exposure of the project, and of the larger context and importance of similar work. I also hope that this exposure might bring about more funding to safeguard historical collections and to build capacity in this context in DR Congo.

reviz.in – peer-review annotations with hypothes.is

A few months ago I was flooded with review requests, and I figured it might be time to look for a solution and code something up that would allow me to annotate peer-review PDFs easily and generate a review report with the click of a button (as proposed years ago).

Enter Hypothes.is. This initiative started a few years ago to facilitate the semantic or annotated web: a way to annotate web pages independently of the original creator. Hypothes.is did exactly what I needed to annotate any given PDF (including locally stored files). However, I could not easily extract the data in a standardized way. In addition, the standard mode of the Hypothes.is client is public, with only personal groups being private. In short, although the framework had all the pieces, the output wasn’t optimal for peer review, and could even be dangerous to reputations if reviews were accidentally leaked to the web.
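For those rolling their own report, annotations can in principle be pulled from the Hypothes.is search API. The snippet below is a minimal sketch of that idea rather than the Reviz.in code itself; the API token and group ID are placeholders.

```python
import requests

# Minimal sketch: fetch annotations from a private Hypothes.is group via
# the public search API (https://api.hypothes.is/api/search).
# API_TOKEN and GROUP_ID are placeholders, not Reviz.in defaults.
API_TOKEN = "YOUR_HYPOTHESIS_API_TOKEN"
GROUP_ID = "YOUR_GROUP_ID"

response = requests.get(
    "https://api.hypothes.is/api/search",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"group": GROUP_ID, "limit": 200},
)
response.raise_for_status()

# Print a rudimentary review report: the quoted passage plus the comment.
for row in response.json()["rows"]:
    quotes = [
        selector["exact"]
        for target in row.get("target", [])
        for selector in target.get("selector", [])
        if selector.get("type") == "TextQuoteSelector"
    ]
    print("> " + " ".join(quotes))
    print(row.get("text", ""), "\n")
```

Restricting the query to a single private group is what keeps review notes out of the public annotation stream.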

As such, I created Reviz.in, a simple hack of the original Hypothes.is client and Google Chrome extension which makes sure you can’t escape the group holding your peer-review notes, and which generates a nice review report (see image below). In addition, I added a fancy icon and renamed the original labels (though not consistently) to differentiate my copy from the original interface and avoid confusion. I hope that over time this functionality will be provided by the original Hypothes.is client; in the meantime you can read more about the installation process on the Reviz.in website:

http://reviz.in, or download the Google Chrome Extension.

I hope this simple hack will help people speed up their review process and free up some time. I also hope that publishers take note, as their lack of innovation on this front is rather shameful.

Google Earth Engine time series subset tool

Google Earth Engine (GEE) has provided a way to massively scale a lot of remote sensing analyses. However, more often than not, time series analyses are carried out on a site-by-site basis, and scaling to a continental or global level is not required. Furthermore, some applications are hard to implement on GEE, or prototyping does not benefit from direct spatial scaling. In short, working locally on a handful of reference pixels is often still faster than using Google’s servers. With a small GEE hack I sidestep the handling of large amounts of data (although sometimes helpful) and get straight to single-location time series subsets.

I wrote a simple Python script / library called gee_subset.py which allows you to extract time series for a particular location or its neighbourhood. This tool is similar to my MODIS subset and daymetr tools, which facilitate the extraction of time series of remote sensing and climatological data, respectively.

My Python script expands this functionality to all available GEE products, which include high-resolution Landsat and Sentinel data, climatological data such as Daymet, and even representative concentration pathway (RCP) CMIP5 model runs.
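To give an idea of what such a subset boils down to, the sketch below pulls a single-point Landsat 8 time series with the Earth Engine Python API. This is not gee_subset.py itself but a generic illustration of the approach; the collection ID, band names, coordinates and date range are placeholders (here the current Collection 2 surface reflectance product).

```python
import ee
import pandas as pd

# Generic sketch (not gee_subset.py): extract a Landsat 8 time series for
# a single point using the Earth Engine Python API.
ee.Initialize()

point = ee.Geometry.Point([5.0, 50.0])  # lon, lat (placeholder location)

collection = (
    ee.ImageCollection("LANDSAT/LC08/C02/T1_L2")  # placeholder collection ID
    .filterDate("2014-01-01", "2016-01-01")
    .filterBounds(point)
    .select(["SR_B4", "SR_B5"])  # red and NIR surface reflectance bands
)

# getRegion() returns a table: a header row followed by
# [id, longitude, latitude, time, band values ...] per observation.
raw = collection.getRegion(point, scale=30).getInfo()
df = pd.DataFrame(raw[1:], columns=raw[0])
df["date"] = pd.to_datetime(df["time"], unit="ms")
print(df.head())
```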

Compared to the ORNL DAAC MODIS subset tool, performance is blazing fast (thank you, Google). An example query, calling the Python script from R, downloaded two years (~100 data points) of Landsat 8 Tier 1 data for two bands (red, NIR) in ~8 seconds flat. Querying a larger footprint (1 × 1 km) only adds a small overhead (a 13-second query). The resulting figure for the point location, with the derived NDVI values, is shown below. The demo script to recreate this figure is included in the example folder of the GitHub repository.

NDVI values from Landsat 8 Tier 1 scenes. The black line depicts a loess fit to the data, with the gray envelope representing the standard error.
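For completeness, deriving NDVI from such a point extraction is straightforward. The snippet below is a minimal sketch that assumes the data frame built in the previous example (the band column names and scaling factors belong to the Collection 2 surface reflectance product, not to the gee_subset.py output format), and it uses statsmodels’ lowess as a stand-in for the loess fit shown in the figure.

```python
import statsmodels.api as sm

# Apply the documented Collection 2 surface reflectance scale and offset,
# then compute NDVI = (NIR - red) / (NIR + red).
for band in ["SR_B4", "SR_B5"]:
    df[band] = df[band] * 0.0000275 - 0.2
df["ndvi"] = (df["SR_B5"] - df["SR_B4"]) / (df["SR_B5"] + df["SR_B4"])

# A lowess smooth comparable to the figure's loess fit.
valid = df.dropna(subset=["ndvi"]).sort_values("date")
x = valid["date"].astype("int64") // 10**9  # seconds since epoch
smoothed = sm.nonparametric.lowess(valid["ndvi"], x, frac=0.3)
```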