earthdata: Python-R Handoff

The dream

Create once, use often: use the earthdata python package for NASA Earthdata authorization and for identifying the s3 links (i.e. the locations where the data are stored on Amazon Web Services), then pass those python objects to R through Quarto for analysis by R folks. These notes are a work-in-progress by Julie and Luis and we'll tidy them up as we develop them further.

Note: this dream is currently not working, but we are sharing our progress.

Python: earthdata package for auth & s3 links

earthdata gets us the credentials, and it gets us the links based on our queries.

In this example, the data we want is in the Cloud. We're using data we identified from the Earthdata Cloud Cookbook's Multi-File_Direct_S3_Access_NetCDF_Example, and its short_name is 'ECCO_L4_SSH_05DEG_MONTHLY_V4R4'.

Identify the s3 links

Below is our query, pretending that this is the data and the bounding box we want.

```python
## import the DataGranules class from the earthdata library
from earthdata import DataGranules

## To find the concept_id from the short_name that we copied:
# short_name = 'ECCO_L4_SSH_05DEG_MONTHLY_V4R4'
# collection = DataCollections().short_name(short_name).get()
# [c.concept_id() for c in collection] ## this returned 'C1990404799-POCLOUD'

## Then we build a Query with spatiotemporal parameters.
GranuleQuery = DataGranules().concept_id('C1990404799-POCLOUD').bounding_box(-134.7,58.9,-133.9,59.2)

## We get the metadata records from CMR
granules = GranuleQuery.get()

## Now it's time to open our data granules list.
s3_links = [granule.data_links(access='direct') for granule in granules]
s3_links[0]
```

Note that files = Store(auth).open(granules) would work for Python users, but open won't work in the R world because it creates python file handlers from fsspec.
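One thing that may ease the handoff (a sketch, not part of the original workflow): because data_links() returns a list per granule, s3_links is a list of lists, and flattening it to plain strings gives R a simple character vector instead of nested Python objects.

```python
## Sketch: flatten s3_links (a list of lists, one list of direct-access URLs
## per granule) into a plain list of URL strings.
s3_links_flat = [link for granule_links in s3_links for link in granule_links]
s3_links_flat[0]
```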
Get the Cloud credentials

Prerequisite: you'll need a functioning .netrc here. earthdata expects interactivity, and that did not work here with Quarto in the RStudio IDE (it also did not work for Julie in a Jupyter notebook (June 7 2022)). So we followed the 2021-Cloud-Hackathon's NASA_Earthdata_Authentication, copying, pasting, and running that code in a Jupyter notebook. (Remember to rm .netrc beforehand!)

Then, with a nice .netrc file, the next step is to get Cloud credentials:

```python
## import the Auth class from the earthdata library
from earthdata import Auth

auth = Auth().login(strategy="netrc")
credentials = auth.get_s3_credentials(cloud_provider="POCLOUD")
```
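As a quick sanity check (a sketch, assuming the credentials come back as a Python dictionary, as described in the Notes below), you can confirm the expected fields are present without echoing the secret values. The R code later on uses accessKeyId, secretAccessKey, and sessionToken.

```python
## Sketch: list the credential field names without printing the secrets.
print(list(credentials.keys()))
```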
So now we have the s3 links and the credentials to download them, and we can use the tutorial in R!
Notes

- Luis will update earthdata to automatically detect the cloud provider so that you don't have to specify, for example, POCLOUD vs PODAAC.
- You don't actually want to print your credentials; we were just checking that they worked.
- The resulting JSON dictionary is what we'll export to R, and it is valid for 1 hour. When something that was working suddenly isn't, it's usually because the credentials expired after that hour.
- When we want to identify the bucket level, we'll need to remove the name of the file (see the sketch after this list). For example:
  - <s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_1992-01_ECCO_V4r4_latlon_0p50deg.nc> includes the filename
  - <s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/> is only the bucket
- Expect to run into issues with listing the files in the bucket (maybe something is restricted, or maybe you can access files but not list everything inside the bucket).
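As a rough illustration of the bucket-level note above, here is a minimal Python sketch (not part of the original workflow) that strips the filename off a full link:

```python
## Sketch: derive the bucket-level prefix from a full s3 link by dropping
## everything after the last "/". s3_links[0][0] is the first direct-access
## URL of the first granule from the query above.
full_link = s3_links[0][0]
bucket_prefix = full_link.rsplit('/', 1)[0] + '/'
print(bucket_prefix)
## s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/
```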
R: data access from s3 links!
And now I can switch to R, if R is my preferred language.
The blog post Using Amazon S3 with R by Danielle Navarro is hugely informative and describes how to use the aws.s3 R package.
First load libraries:
```r
library(dplyr)
library(readr)
library(purrr)
library(stringr)
library(tibble)
library(aws.s3)      # install.packages("aws.s3")
library(reticulate)
```
Translate credentials from python variables (created with earthdata above) to R variables using reticulate's py$ syntax and purrr's pluck() to isolate a variable from a list:
```r
## translate credentials from python to R, map to dataframe
credentials_r_list <- py$credentials # YAY!
credentials_r <- purrr::map_df(credentials_r_list, print)

## translate s3 links from python to R, create my_bucket
s3_links_r_list <- py$s3_links
my_link_list <- s3_links_r_list[1] # let's just start with one
my_link_chr <- purrr::map_chr(my_link_list, paste, collapse = "")
# my_link <- as_tibble(my_link_chr)
# my_link_split <- stringr::str_split(my_link, "/")
# my_bucket <- str_c("s3://", my_link_split[3], my_link_split[4])
my_bucket <- "s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/"
```
From the aws.s3 documentation, set up system environment variables for AWS:
```r
Sys.setenv("AWS_ACCESS_KEY_ID" = credentials_r$accessKeyId,
           "AWS_SECRET_ACCESS_KEY" = credentials_r$secretAccessKey,
           "AWS_DEFAULT_REGION" = "us-west-2",
           "AWS_SESSION_TOKEN" = credentials_r$sessionToken)
```
```r
# testing by hand: Luis
Sys.setenv("AWS_ACCESS_KEY_ID" = "ASIATNGJQBXBHRPIKFFB",
           "AWS_SECRET_ACCESS_KEY" = "zbYP2fueNxLK/joDAcz678mkjjzP6fz4HUN131ID",
           "AWS_DEFAULT_REGION" = "us-west-2")
```
First let's test Danielle's code to see if it runs. Note to Luis: the following only works when the Sys.setenv variables above are not set:
```r
library(aws.s3)

bucket_exists(
  bucket = "s3://herbariumnsw-pds/",
  region = "ap-southeast-2"
)
```
```
Client error: (403) Forbidden
[1] FALSE
attr(,"x-amz-bucket-region")
[1] "ap-southeast-2"
attr(,"x-amz-request-id")
[1] "0FQ1R57F2VHGFPDF"
attr(,"x-amz-id-2")
[1] "N6RPTKPN3/H9tDuKNHM2ZAcChhkkn2WpfcTzhpxC3fUmiZdNEIiu1xJsQAvFSecYIuWZ28pchQW3sAPAdVU57Q=="
attr(,"content-type")
[1] "application/xml"
attr(,"date")
[1] "Thu, 07 Jul 2022 23:11:30 GMT"
attr(,"server")
[1] "AmazonS3"
```
Now, see if the PODAAC bucket exists:
```r
aws.s3::bucket_exists(
  bucket = "s3://podaac-ops-cumulus-protected/",
  region = "us-west-2"
)
```
```
Client error: (403) Forbidden
[1] FALSE
attr(,"x-amz-bucket-region")
[1] "us-west-2"
attr(,"x-amz-request-id")
[1] "M4T3W1JZ93M08AZB"
attr(,"x-amz-id-2")
[1] "hvGLWqGCRB4lLf9pD8f67OsTDulSOgqd+yLWzUTRFz2tlLPVpxHr9mSREL0bQPVyo70j0hvJp+8="
attr(,"content-type")
[1] "application/xml"
attr(,"date")
[1] "Thu, 07 Jul 2022 23:11:30 GMT"
attr(,"server")
[1] "AmazonS3"
```
```r
herbarium_files <- get_bucket_df(
  bucket = "s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/",
  region = "us-west-2",
  max = 20000
) %>%
  as_tibble()
```
If this returns Forbidden, possible reasons are:

- the credentials have passed their 1-hour expiration time
- this bucket is not listable (or is protected); hopefully the error will be clear enough

If you get the following error, it's likely because your credentials have expired:
```
Client error: (403) Forbidden
[1] FALSE
attr(,"x-amz-bucket-region")
[1] "us-west-2"
attr(,"x-amz-request-id")
[1] "W2PQV030PDTGDD32"
attr(,"x-amz-id-2")
[1] "S8C0qzL1lAYLufzUupjqplyyS/3fWCKxIELk0OJLVHGzTOqlyhof+IPFYbaRUhmJwXQelfprYCU="
attr(,"content-type")
[1] "application/xml"
attr(,"date")
[1] "Wed, 08 Jun 2022 03:11:16 GMT"
attr(,"server")
[1] "AmazonS3"
```
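Because the S3 credentials expire after about an hour, the simplest recovery we know of is to re-run the credential step on the Python side (the same earthdata calls, sketched below) and then re-run the Sys.setenv() chunk above so aws.s3 picks up the fresh values.

```python
## Sketch: request fresh S3 credentials, then re-run the R Sys.setenv() chunk
## so aws.s3 picks up the new accessKeyId / secretAccessKey / sessionToken.
from earthdata import Auth

auth = Auth().login(strategy="netrc")
credentials = auth.get_s3_credentials(cloud_provider="POCLOUD")
```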
Dev notes

Chat with Andy May 26

- Maybe have a python script that takes arguments, packaged so that MATLAB can call that python script as a system command (see the sketch after this list). Then he doesn't need to know python.
- Another approach would be to re-write earthdata in MATLAB.
- Our dream, revised: the code should be language-agnostic.
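To make the language-agnostic idea concrete, here is a hypothetical sketch of such a script. The filename (get_s3_links.py) and argument names are invented for illustration; the earthdata calls are the same ones used earlier on this page. MATLAB, R, or a shell could run it as a system command and parse the JSON it prints.

```python
## get_s3_links.py (hypothetical name): print direct-access s3 links as JSON
## for a given collection concept_id and bounding box, so any language that
## can run a system command and parse JSON can use it.
import argparse
import json

from earthdata import DataGranules

parser = argparse.ArgumentParser(description="Print direct-access s3 links as JSON")
parser.add_argument("--concept-id", required=True)
parser.add_argument("--bbox", nargs=4, type=float, required=True,
                    help="lower_left_lon lower_left_lat upper_right_lon upper_right_lat")
args = parser.parse_args()

# Build the CMR query, fetch granule metadata, and collect the s3 links
query = DataGranules().concept_id(args.concept_id).bounding_box(*args.bbox)
granules = query.get()
links = [link for granule in granules for link in granule.data_links(access="direct")]

print(json.dumps(links))
```

For example (hypothetical): python get_s3_links.py --concept-id C1990404799-POCLOUD --bbox -134.7 58.9 -133.9 59.2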
Background
This was Luis’ original example code, but it downloads data. The examples above access it in the cloud. From https://nasa-openscapes.github.io/earthdata-cloud-cookbook/examples/earthdata-access-demo.html
```python
from earthdata import Auth, DataGranules, Store

# first we authenticate with NASA EDL
auth = Auth().login(strategy="netrc")

# Then we build a Query with spatiotemporal parameters
GranuleQuery = DataGranules().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2)

# We get the metadata records from CMR
granules = GranuleQuery.get()

# Now it's time to download (or open) our data granules list with get()
files = Store(auth).get(granules, local_path='./data')
```
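For contrast with the download route above, the in-cloud route mentioned earlier on this page uses open() instead of get(). A minimal sketch, building on the auth and granules objects from the block above; it returns fsspec file handlers, which Python tools can read directly but which is exactly why it does not hand off to R.

```python
# Sketch: access the granules in the cloud rather than downloading them.
# Store(auth).open() returns fsspec file-like handlers (Python-only objects).
files = Store(auth).open(granules)
```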