Accessing Multiple NetCDF4/HDF5 Files - S3 Direct Access

Summary

In this notebook, we will access monthly sea surface height from ECCO V4r4 (10.5067/ECG5D-SSH44). The data are provided as a time series of monthly netCDFs on a 0.5-degree latitude/longitude grid.

We will access the data from inside the AWS cloud (us-west-2 region, specifically) and load a time series made of multiple netCDF datasets into an xarray dataset. This approach leverages S3 native protocols for efficient access to the data.

Requirements

1. AWS instance running in us-west-2

NASA Earthdata Cloud data in S3 can be accessed directly using temporary credentials; this access is limited to requests made from within the US West (Oregon) AWS region (us-west-2).
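If you are unsure which region your compute environment is in, you can ask the EC2 instance metadata service. This is a minimal sketch assuming the notebook runs on an EC2 instance with IMDSv2 enabled; the metadata endpoint and headers are standard AWS conventions and are not part of this tutorial's code.

import requests

# Request a short-lived IMDSv2 token, then ask the instance metadata service
# which region this machine is running in (we expect 'us-west-2')
token = requests.put(
    'http://169.254.169.254/latest/api/token',
    headers={'X-aws-ec2-metadata-token-ttl-seconds': '21600'}
).text
region = requests.get(
    'http://169.254.169.254/latest/meta-data/placement/region',
    headers={'X-aws-ec2-metadata-token': token}
).text
print(region)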

2. Earthdata Login

An Earthdata Login account is required to access data, as well as to discover restricted data, from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. The account is free to create and only takes a moment to set up.

3. netrc File

You will need a netrc file containing your NASA Earthdata Login credentials in order to execute the notebooks. A netrc file can be created manually within a text editor and saved to your home directory. For additional information see: Authentication for NASA Earthdata.
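A typical netrc entry for Earthdata Login looks like the following (saved as .netrc in your home directory on Linux/macOS, or _netrc on Windows); replace the placeholders with your own username and password:

machine urs.earthdata.nasa.gov
    login <your_earthdata_username>
    password <your_earthdata_password>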

Learning Objectives

  • how to retrieve temporary S3 credentials for in-region direct S3 bucket access
  • how to define a dataset of interest and find netCDF files in S3 bucket
  • how to perform in-region direct access of ECCO_L4_SSH_05DEG_MONTHLY_V4R4 data in S3
  • how to plot the data

Import Packages

import os
import requests
import s3fs
import xarray as xr
import hvplot.xarray

Get Temporary AWS Credentials

Direct S3 access is achieved by passing NASA-supplied temporary credentials to AWS so we can interact with S3 objects from applicable Earthdata Cloud buckets. For now, each NASA DAAC has its own AWS credentials endpoint. Below are the credential endpoints for several DAACs:

s3_cred_endpoint = {
    'podaac': 'https://archive.podaac.earthdata.nasa.gov/s3credentials',
    'gesdisc': 'https://data.gesdisc.earthdata.nasa.gov/s3credentials',
    'lpdaac': 'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials',
    'ornldaac': 'https://data.ornldaac.earthdata.nasa.gov/s3credentials',
    'ghrcdaac': 'https://data.ghrc.earthdata.nasa.gov/s3credentials'
}

Create a function to make a request to an endpoint for temporary credentials. Remember, each DAAC has its own endpoint, and credentials from one DAAC are not usable for cloud data hosted by another.

def get_temp_creds(provider):
    # Request temporary S3 credentials from the DAAC's endpoint; requests
    # authenticates against Earthdata Login using the credentials in your netrc file
    return requests.get(s3_cred_endpoint[provider]).json()

temp_creds_req = get_temp_creds('podaac')
#temp_creds_req
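The returned credentials are temporary and typically expire after about an hour, after which you will need to request a fresh set. A quick way to see what the endpoint returned without printing the secrets themselves (the 'expiration' field name below is an assumption based on the usual response format):

# Inspect the response fields without printing the secret values
print(list(temp_creds_req.keys()))
print(temp_creds_req.get('expiration'))  # when these credentials stop working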

Set up an s3fs session for Direct Access

s3fs sessions are used for authenticated access to S3 buckets and allow for typical file-system-style operations. Below we create a session by passing in the temporary credentials we received from the temporary credentials endpoint.

fs_s3 = s3fs.S3FileSystem(anon=False, 
                          key=temp_creds_req['accessKeyId'], 
                          secret=temp_creds_req['secretAccessKey'], 
                          token=temp_creds_req['sessionToken'],
                          client_kwargs={'region_name':'us-west-2'})
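As a quick sanity check, you can list a few objects under a prefix you have access to. This is a sketch assuming the PO.DAAC protected bucket and the collection name used later in this notebook:

# List a few objects under the collection prefix to confirm the session works
fs_s3.ls('podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4')[:5]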

In this example we’re interested in the ECCO data collection from NASA’s PO.DAAC in Earthdata Cloud. The following short name uniquely identifies the collection of monthly, 0.5-degree sea surface height data (ECCO_L4_SSH_05DEG_MONTHLY_V4R4).

short_name = 'ECCO_L4_SSH_05DEG_MONTHLY_V4R4'
# Build the S3 glob pattern matching all 2015 granules in the collection
bucket = os.path.join('podaac-ops-cumulus-protected/', short_name, '*2015*.nc')
bucket

Get a list of netCDF files located at the S3 path corresponding to the ECCO V4r4 monthly sea surface height dataset on the 0.5-degree latitude/longitude grid, for the year 2015.

ssh_files = fs_s3.glob(bucket)
ssh_files
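A quick check on the number of granules found; for a monthly time series covering 2015 we would expect 12 files:

print(len(ssh_files))  # expect 12 monthly granules for 2015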

Direct In-region Access

Open the netCDF files using the s3fs package, then load them all at once into a concatenated xarray dataset.

# Open a file-like object for each granule; bytes are read from S3 only as needed
fileset = [fs_s3.open(file) for file in ssh_files]

Create an xarray dataset using the open_mfdataset() function to “read in” all of the netCDF4 files in one call.

ssh_ds = xr.open_mfdataset(fileset,
                           combine='by_coords',
                           mask_and_scale=True,
                           decode_cf=True,
                           chunks='auto')
ssh_ds

Get the SSH variable as an xarray DataArray

ssh_da = ssh_ds.SSH
ssh_da
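With the DataArray in hand, you can subset before plotting using xarray's label-based indexing; the month and the regional bounds below are illustrative, and the slices assume the latitude and longitude coordinates are in ascending order:

# Select a single month, and a regional box (values are illustrative)
ssh_jun = ssh_da.sel(time='2015-06')
ssh_box = ssh_da.sel(latitude=slice(0, 60), longitude=slice(-80, 0))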

Plot the SSH time series using hvplot

ssh_da.hvplot.image(y='latitude', x='longitude', cmap='Viridis').opts(
    clim=(ssh_da.attrs['valid_min'][0], ssh_da.attrs['valid_max'][0]))
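For a quick static look at a single time step you can also use xarray's built-in plotting, which draws with matplotlib; matplotlib is not imported above, so this assumes it is available in your environment:

# Static matplotlib plot of the first monthly SSH field
ssh_da.isel(time=0).plot(cmap='viridis')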

Resources

Direct access to ECCO data in S3 (from us-west-2)

Data_Access__Direct_S3_Access__PODAAC_ECCO_SSH: using the CMR-STAC API to retrieve S3 links