earthdata: Python-R Handoff

The dream

Create once, use often: use the earthdata python package for NASA Earthdata authorization and for identifying the s3 links (i.e. the locations where the data are stored on Amazon Web Services), then pass those python objects to R through Quarto for analysis by R folks. These notes are a work-in-progress by Julie and Luis and we'll tidy them up as we develop them further.

Note: this dream is currently not working, but we are sharing our progress.

Python: earthdata package for auth & s3 links

earthdata gets us the credentials, and it gets us the links based on our queries.

In this example, the data we want is in the Cloud. We're using data we identified from the Earthdata Cloud Cookbook's Multi-File_Direct_S3_Access_NetCDF_Example, and its short_name is 'ECCO_L4_SSH_05DEG_MONTHLY_V4R4'.

Identify the s3 links

Below is our query, pretending that this is the data and the bounding box we want.

```python
## import the DataGranules class from the earthdata library
from earthdata import DataGranules

## To find the concept_id from the short_name that we copied:
# short_name = 'ECCO_L4_SSH_05DEG_MONTHLY_V4R4'
# collection = DataCollections().short_name(short_name).get()
# [c.concept_id() for c in collection] ## this returned 'C1990404799-POCLOUD'

## Then we build a Query with spatiotemporal parameters.
GranuleQuery = DataGranules().concept_id('C1990404799-POCLOUD').bounding_box(-134.7,58.9,-133.9,59.2)

## We get the metadata records from CMR
granules = GranuleQuery.get()

## Now it's time to open our data granules list.
s3_links = [granule.data_links(access='direct') for granule in granules]
s3_links[0]
```

Note that files = Store(auth).open(granules) would work for Python users, but open won't work in the R world because it creates python file handlers from fsspec.
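One thing that may ease the handoff (a sketch, not part of the original workflow): because data_links() returns a list per granule, s3_links is a list of lists, and flattening it to plain strings gives R a simple character vector instead of nested Python objects.

```python
## Sketch: flatten s3_links (a list of lists, one list of direct-access URLs
## per granule) into a plain list of URL strings.
s3_links_flat = [link for granule_links in s3_links for link in granule_links]
s3_links_flat[0]
```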
Get the Cloud credentials

Prerequisite: you'll need a functioning .netrc here. earthdata expects interactivity, and that did not work here with Quarto in the RStudio IDE (it also did not work for Julie in a Jupyter notebook (June 7 2022)). So we followed the 2021-Cloud-Hackathon's NASA_Earthdata_Authentication, copying, pasting, and running that code in a Jupyter notebook. (Remember to rm .netrc beforehand!)

Then, with a nice .netrc file, the next step is to get Cloud credentials:

```python
## import the Auth class from the earthdata library
from earthdata import Auth

auth = Auth().login(strategy="netrc")
credentials = auth.get_s3_credentials(cloud_provider="POCLOUD")
```
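As a quick sanity check (a sketch, assuming the credentials come back as a Python dictionary, as described in the Notes below), you can confirm the expected fields are present without echoing the secret values. The R code later on uses accessKeyId, secretAccessKey, and sessionToken.

```python
## Sketch: list the credential field names without printing the secrets.
print(list(credentials.keys()))
```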
So now we have the s3 links and the credentials to download them, and we can use the tutorial in R!
Notes

- Luis will update earthdata to automatically detect the cloud provider so that you don't have to specify, for example, POCLOUD vs PODAAC.
- You don't actually want to print your credentials; we were just checking that they worked.
- The resulting JSON dictionary is what we'll export to R, and it is valid for 1 hour. When something that was working suddenly isn't, it's usually because the credentials expired after that hour.
- When we want to identify the bucket level, we'll need to remove the name of the file (see the sketch after this list). For example:
  - <s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_1992-01_ECCO_V4r4_latlon_0p50deg.nc> includes the filename
  - <s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/> is only the bucket
- Expect to run into issues with listing the files in the bucket (maybe something is restricted, or maybe you can access files but not list everything inside the bucket).
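As a rough illustration of the bucket-level note above, here is a minimal Python sketch (not part of the original workflow) that strips the filename off a full link:

```python
## Sketch: derive the bucket-level prefix from a full s3 link by dropping
## everything after the last "/". s3_links[0][0] is the first direct-access
## URL of the first granule from the query above.
full_link = s3_links[0][0]
bucket_prefix = full_link.rsplit('/', 1)[0] + '/'
print(bucket_prefix)
## s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/
```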
R: data access from s3 links!
And now I can switch to R, if R is my preferred language.
The blog post Using Amazon S3 with R by Danielle Navarro is hugely informative and describes how to use the aws.s3 R package.
First load libraries:
```r
library(dplyr)
library(readr)
library(purrr)
library(stringr)
library(tibble)
library(aws.s3)      # install.packages("aws.s3")
library(reticulate)
```
Translate credentials from python variables (created with earthdata above) to R variables using reticulate's py$ syntax and purrr's pluck() to isolate a variable from a list:
```r
## translate credentials from python to R, map to dataframe
credentials_r_list <- py$credentials # YAY!
credentials_r <- purrr::map_df(credentials_r_list, print)

## translate s3 links from python to R, create my_bucket
s3_links_r_list <- py$s3_links
my_link_list <- s3_links_r_list[1] # let's just start with one
my_link_chr <- purrr::map_chr(my_link_list, paste, collapse = "")
# my_link <- as_tibble(my_link_chr)
# my_link_split <- stringr::str_split(my_link, "/")
# my_bucket <- str_c("s3://", my_link_split[3], my_link_split[4])
my_bucket <- "s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/"
```
From the aws.s3 documentation, set up system environment variables for AWS:
```r
Sys.setenv("AWS_ACCESS_KEY_ID" = credentials_r$accessKeyId,
           "AWS_SECRET_ACCESS_KEY" = credentials_r$secretAccessKey,
           "AWS_DEFAULT_REGION" = "us-west-2",
           "AWS_SESSION_TOKEN" = credentials_r$sessionToken)
```
```r
# testing by hand: Luis
Sys.setenv("AWS_ACCESS_KEY_ID" = "ASIATNGJQBXBHRPIKFFB",
           "AWS_SECRET_ACCESS_KEY" = "zbYP2fueNxLK/joDAcz678mkjjzP6fz4HUN131ID",
           "AWS_DEFAULT_REGION" = "us-west-2")
```
First let's test Danielle's code to see if it runs. Note to Luis: the following only works when the Sys.setenv variables above are not set:
```r
library(aws.s3)

bucket_exists(
  bucket = "s3://herbariumnsw-pds/",
  region = "ap-southeast-2"
)
```
```
Client error: (403) Forbidden
[1] FALSE
attr(,"x-amz-bucket-region")
[1] "ap-southeast-2"
attr(,"x-amz-request-id")
[1] "0FQ1R57F2VHGFPDF"
attr(,"x-amz-id-2")
[1] "N6RPTKPN3/H9tDuKNHM2ZAcChhkkn2WpfcTzhpxC3fUmiZdNEIiu1xJsQAvFSecYIuWZ28pchQW3sAPAdVU57Q=="
attr(,"content-type")
[1] "application/xml"
attr(,"date")
[1] "Thu, 07 Jul 2022 23:11:30 GMT"
attr(,"server")
[1] "AmazonS3"
```
Now, see if the PODAAC bucket exists:
```r
aws.s3::bucket_exists(
  bucket = "s3://podaac-ops-cumulus-protected/",
  region = "us-west-2"
)
```
```
Client error: (403) Forbidden
[1] FALSE
attr(,"x-amz-bucket-region")
[1] "us-west-2"
attr(,"x-amz-request-id")
[1] "M4T3W1JZ93M08AZB"
attr(,"x-amz-id-2")
[1] "hvGLWqGCRB4lLf9pD8f67OsTDulSOgqd+yLWzUTRFz2tlLPVpxHr9mSREL0bQPVyo70j0hvJp+8="
attr(,"content-type")
[1] "application/xml"
attr(,"date")
[1] "Thu, 07 Jul 2022 23:11:30 GMT"
attr(,"server")
[1] "AmazonS3"
```
```r
herbarium_files <- get_bucket_df(
  bucket = "s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/",
  region = "us-west-2",
  max = 20000
) %>%
  as_tibble()
```
If this returns Forbidden, possible reasons are:

- the credentials have passed their 1-hour expiration time
- this bucket is not listable (or is protected); hopefully the error will be clear enough

If you get the following error, it's likely because your credentials have expired:
```
Client error: (403) Forbidden
[1] FALSE
attr(,"x-amz-bucket-region")
[1] "us-west-2"
attr(,"x-amz-request-id")
[1] "W2PQV030PDTGDD32"
attr(,"x-amz-id-2")
[1] "S8C0qzL1lAYLufzUupjqplyyS/3fWCKxIELk0OJLVHGzTOqlyhof+IPFYbaRUhmJwXQelfprYCU="
attr(,"content-type")
[1] "application/xml"
attr(,"date")
[1] "Wed, 08 Jun 2022 03:11:16 GMT"
attr(,"server")
[1] "AmazonS3"
```
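Because the S3 credentials expire after about an hour, the simplest recovery we know of is to re-run the credential step on the Python side (the same earthdata calls, sketched below) and then re-run the Sys.setenv() chunk above so aws.s3 picks up the fresh values.

```python
## Sketch: request fresh S3 credentials, then re-run the R Sys.setenv() chunk
## so aws.s3 picks up the new accessKeyId / secretAccessKey / sessionToken.
from earthdata import Auth

auth = Auth().login(strategy="netrc")
credentials = auth.get_s3_credentials(cloud_provider="POCLOUD")
```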
Dev notes

Chat with Andy May 26

- Maybe have a python script that takes arguments, packaged so that MATLAB can call that python script as a system command (see the sketch after this list). Then he doesn't need to know python.
- Another approach would be to re-write earthdata in MATLAB.
- Our dream, revised: the code should be language-agnostic.
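To make the language-agnostic idea concrete, here is a hypothetical sketch of such a script. The filename (get_s3_links.py) and argument names are invented for illustration; the earthdata calls are the same ones used earlier on this page. MATLAB, R, or a shell could run it as a system command and parse the JSON it prints.

```python
## get_s3_links.py (hypothetical name): print direct-access s3 links as JSON
## for a given collection concept_id and bounding box, so any language that
## can run a system command and parse JSON can use it.
import argparse
import json

from earthdata import DataGranules

parser = argparse.ArgumentParser(description="Print direct-access s3 links as JSON")
parser.add_argument("--concept-id", required=True)
parser.add_argument("--bbox", nargs=4, type=float, required=True,
                    help="lower_left_lon lower_left_lat upper_right_lon upper_right_lat")
args = parser.parse_args()

# Build the CMR query, fetch granule metadata, and collect the s3 links
query = DataGranules().concept_id(args.concept_id).bounding_box(*args.bbox)
granules = query.get()
links = [link for granule in granules for link in granule.data_links(access="direct")]

print(json.dumps(links))
```

For example (hypothetical): python get_s3_links.py --concept-id C1990404799-POCLOUD --bbox -134.7 58.9 -133.9 59.2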
Background
This was Luis’ original example code, but it downloads data. The examples above access it in the cloud. From https://nasa-openscapes.github.io/earthdata-cloud-cookbook/examples/earthdata-access-demo.html
```python
from earthdata import Auth, DataGranules, Store

# first we authenticate with NASA EDL
auth = Auth().login(strategy="netrc")

# Then we build a Query with spatiotemporal parameters
GranuleQuery = DataGranules().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2)

# We get the metadata records from CMR
granules = GranuleQuery.get()

# Now it's time to download (or open) our data granules list with get()
files = Store(auth).get(granules, local_path='./data')
```
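For contrast with the download route above, the in-cloud route mentioned earlier on this page uses open() instead of get(). A minimal sketch, building on the auth and granules objects from the block above; it returns fsspec file handlers, which Python tools can read directly but which is exactly why it does not hand off to R.

```python
# Sketch: access the granules in the cloud rather than downloading them.
# Store(auth).open() returns fsspec file-like handlers (Python-only objects).
files = Store(auth).open(granules)
```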