How do I find data using R?

This is a work in progress, and may be split up into different modules/tutorials as we continue to work on it.

Here are our recommended approaches for finding data with R, from the command line or a notebook.

Using the web interface

See How do I find data using Earthdata Search? - in that tutorial, we found the dataset ECCO Sea Surface Height - Monthly Mean 0.5 Degree (Version 4 Release 4), with the shortname ECCO_L4_SSH_05DEG_MONTHLY_V4R4.

Searching programmatically

The NASA cloud data is searchable programmatically via two methods - NASA’s own search service, and the NASA Spatio-Temporal Asset Catalog (STAC) API. To find data in R, we’ll mainly rely on the rstac package. This enables us interact with the NASA STAC API to search for our data, and at the same time learn about STAC more generally. This will be useful as it is a common standard for distributing spatial data.

We will also search for data using the NASA Earthdata search API, which is the service that powers the Earthdata web search portal.

For both of these services, the earthdatalogin package will help us get set up and provide a few handy functions to smooth some rough edges.

Authentication for NASA Earthdata

An Earthdata Login account is required to access data from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. This account is free to create and only takes a moment to set up.

Once you have created your Earthdata Login account, you can use the earthdatalogin R package to help you manage your authentication within R.

There is some functionality currently only available in the development version of earthdatalogin, so we will install it from GitHub using the pak package:

install.packages("pak")
pak::pak("boettiger-lab/earthdatalogin")

The easiest and most portable method of access is using the netrc basic authentication protocol for HTTP. Call edl_netrc() to set this up given your username and password. The easiest way to store your credentials for use in edl_netrc() is to set them as environment variables in your ~/.Renviron file. The usethis package has a handy function to find and open this file for editing:

# install.packages("usethis")
usethis::edit_r_environ()

Then add the following lines to the file, save the file, and restart R.

EARTHDATA_USER="your_user_name"
EARTHDATA_PASSWORD="your_password"

Now, when you call edl_netrc(), it will consult these environment variables to create a .netrc file that will be used to authenticate with the NASA Earthdata services. If you don’t have your credentials saved as environment variables, earthdatalogin will provide its own credentials, but you may experience rate limits more readily. You can also manually type in your credentials to the username and password arguments to edl_netrc() but this is not recommended as it is too easy to accidentally share these in your code.

library(earthdatalogin)

edl_netrc()

Once edl_netrc() has been called in your R session, then most spatial packages in R can seamlessly access NASA Earthdata over HTTP links.

Finding data in NASA STACs with rstac

All of the NASA STAC catalogues can be viewed here: https://radiantearth.github.io/stac-browser/#/external/cmr.earthdata.nasa.gov/stac/.

We will use the rstac package to first browse the collections in the PO DAAC catalogue (POCLOUD):

## In R
## load R libraries
# install.packages("pak") 
# pak::pak(c("tidyverse", "rstac", "boettiger-lab/earthdatalogin"))

library(rstac)
library(earthdatalogin)

po_collections <- stac("https://cmr.earthdata.nasa.gov/stac/POCLOUD/") |>
  collections() |>
  get_request()

po_collections

This only gives us the first 10 collections in the catalogue; to get the rest we can use the collections_fetch() function:

all_po_collections <- collections_fetch(po_collections)
all_po_collections

length(all_po_collections$collections)

head(all_po_collections$collections)

# Just look at the titles:
sapply(all_po_collections$collections, `[[`, "title")

# Get shortnames from the 'id' field and search for a match:
grep("ECCO_L4_SSH", sapply(all_po_collections$collections, `[[`, "id"), value = TRUE)

Once we have searched through the collections, we can choose the one we are interested in and query it to get the items (granules) we are interested in:

start <- "2015-01-01"
end   <- "2015-12-31" 

items <- stac("https://cmr.earthdata.nasa.gov/stac/POCLOUD") |> 
  stac_search(collections = "ECCO_L4_SSH_05DEG_MONTHLY_V4R4",
              datetime = paste(start,end, sep = "/")) |>
  post_request() |>
  items_fetch()

items

There are 13 ‘items’ representing the same 13 granules we found when we searched using EarthData Search.

We can see more details of the items (granules) by printing the features list item:

items$features

# And the urls:
edl_stac_urls(items)

Finding data in NASA EarthData Search using earthdatalogin

Once we know the shortname of a collection (usually by looking in the EarthData Search portal), we can supply it to edl_search() to get the metadata and file urls of the individual granules:

granules <- edl_search(
  short_name = "ECCO_L4_SSH_05DEG_MONTHLY_V4R4",
  temporal = c(start, end), 
  parse_results = FALSE
)

granules

granules[[1]]

# See the granule titles
sapply(granules, `[`, "title")

# Note these are the same urls obtained via the rstac workflow demonstrated above
granule_urls <- edl_extract_urls(granules)
granule_urls

Accessing data using the `{terra}` package

We can read any of these urls using terra::rast(). We supply the vsi = TRUE argument, which prepends "/vsicurl/" to the url, indicating that the file should be opened as a “virtual” remote dataset. This allows random partial reading of files without prior download of the entire file. This will vastly speed up most operations as only the subset of the data required is ever actually downloaded.

library(terra)

rast <- terra::rast(granule_urls[1], vsi = TRUE)

# This does not come with an extent and CRS embedded so we can supply it manually
granules[[1]]$boxes
ext(rast) <- c(-180, 180, -90, 90)
crs(rast) <- "EPSG:4326"

plot(rast[["SSH"]])

# We notice that this plot is upside-down - it is likely that the NetCDF file
# does not conform properly to the conventions. But we can flip it:
rast <- flip(rast, direction = "vertical")
plot(rast)

If we want to crop the raster, we can define an extent and crop it to that area. Because we previously called edl_netrc(), we are not only authenticated with the server so we can access the data, but we have also set up terra and stars to us the underlying GDAL library to access the data in such a way that only subset of the data we have requested is actually downloaded.

crop_box <- rast(extent = c(-150, -120, 35, 60), crs = "EPSG:4326")

rast_cropped <- crop(rast, crop_box)

plot(rast_cropped[["SSH"]])

Accessing data using the `{stars}` package

The read_* functions in stars do not have the vsi argument, but we can do the same thing simply by prepending "/vsicurl/" to the url ourselves. Here we will use the read_mdim() function.

library(stars)
ssh_stars <- read_mdim(paste0("/vsicurl/", granule_urls[1]))

plot(ssh_stars)

We can again crop this using the same bounding box as we did with terra:

st_crop(
  ssh_stars, 
  st_bbox(c(xmin = -150, xmax = -120, ymin = 35, ymax = 60))
) |> 
  plot()

Accessing data using gdalcubes

Coming soon!