Searching and Downloading NSIDC Cloud Collections

Programmatic access and processing of NSIDC data can happen in two ways: using the older Search -> Download -> Analyze pattern, or using a more modern Search -> Process_in_the_cloud -> Analyze approach.

There is nothing wrong with downloading data to our local machine, but that can get complicated or even impossible if a dataset is too large. For this reason NSIDC, along with other NASA data centers, started to collocate or migrate their dataset holdings to the cloud.

In order to use NSIDC cloud collections we need to:

  1. Authenticate ourselves with the NASA Earthdata Login API (EDL).
  2. Search granules/collections using a CMR client that supports authentication.
  3. Parse CMR responses looking for AWS S3 URLs.
  4. Access the data granules using temporary AWS credentials given by the NSIDC cloud credentials endpoint.
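
As a rough illustration of step 3, the sketch below pulls the `s3://` links out of one granule record. It assumes the record follows the UMM-G layout, where each entry of `RelatedUrls` carries a `URL` field; the helper name `s3_links` and the sample record are made up for this example and are not part of any CMR client.

```python
from typing import Any, Dict, List


def s3_links(granule_umm: Dict[str, Any]) -> List[str]:
    """Return the s3:// URLs listed in a UMM-G style granule record."""
    related = granule_umm.get("RelatedUrls", [])
    return [r["URL"] for r in related if r.get("URL", "").startswith("s3://")]


# Hand-written record for illustration; real CMR responses carry more fields.
sample_granule = {
    "GranuleUR": "ATL03_20181014001049_02350102_004_01.h5",
    "RelatedUrls": [
        {
            "URL": "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/14/ATL03_20181014001049_02350102_004_01.h5",
            "Type": "GET DATA",
        },
        {
            "URL": "s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/14/ATL03_20181014001049_02350102_004_01.h5",
            "Type": "GET DATA VIA DIRECT ACCESS",
        },
    ],
}

print(s3_links(sample_granule))
```

Filtering on the URL scheme rather than on the `Type` label keeps the helper independent of how a provider tags its links.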

Data used:

  • ICESat-2 ATL03: This data set contains height above the WGS 84 ellipsoid (ITRF2014 reference frame), latitude, longitude, and time for all photons.

Requirements

Querying CMR for NSIDC data

Most collections at NSIDC have not been migrated to the cloud and can be found using CMR with no authentication at all. Here is a simple example for altimeter data (ATL03) coming from the ICESat-2 mission. First we’ll search the regular collection and then we’ll do the same using the cloud collection.

Note: This notebook uses a low-level CMR endpoint; this is not the only workflow for data discovery.
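
For orientation, an unauthenticated collection search can also be written as a plain CMR REST URL. The sketch below only builds that URL and makes no network request; the helper name `cmr_collections_url` and the default `page_size` are choices made for this example.

```python
from urllib.parse import urlencode

CMR_SEARCH_ROOT = "https://cmr.earthdata.nasa.gov/search"


def cmr_collections_url(params: dict, page_size: int = 10) -> str:
    """Build a CMR collection-search URL that asks for UMM-JSON results."""
    query = urlencode({**params, "page_size": page_size})
    return f"{CMR_SEARCH_ROOT}/collections.umm_json?{query}"


url = cmr_collections_url({"short_name": "ATL03", "provider": "NSIDC_ECS"})
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client to inspect the raw response that the client library wraps.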

from cmr.search import collection as cmr_collection
from cmr.search import granule 
from cmr.auth import token

import textwrap
# NSIDC_HOSTED collections are hosted at the NSIDC DAAC data center
# AWS_HOSTED collections are hosted at AWS S3 us-west-2
NSIDC_PROVIDERS = {
    'NSIDC_HOSTED': 'NSIDC_ECS', 
    'AWS_HOSTED':'NSIDC_CPRD'
}

# First let's search for some collections hosted at NSIDC using a keyword
collections = cmr_collection.search({'keyword':'ice',
                                     'provider': NSIDC_PROVIDERS['NSIDC_HOSTED']})

# Let's print some information about the first 3 collections that match our provider
for collection in collections[0:3]:
    wrapped_abstract = '\n'.join(textwrap.wrap(f"Abstract: {collection['umm']['Abstract']}", 80)) + '\n'
    print(f"concept-id: {collection['meta']['concept-id']}\n" +
          f"Title: {collection['umm']['EntryTitle']}\n" +
          wrapped_abstract)
concept-id: C1997321091-NSIDC_ECS
Title: ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004
Abstract: This data set (ATL03) contains height above the WGS 84 ellipsoid
(ITRF2014 reference frame), latitude, longitude, and time for all photons
downlinked by the Advanced Topographic Laser Altimeter System (ATLAS) instrument
on board the Ice, Cloud and land Elevation Satellite-2 (ICESat-2) observatory.
The ATL03 product was designed to be a single source for all photon data and
ancillary information needed by higher-level ATLAS/ICESat-2 products. As such,
it also includes spacecraft and instrument parameters and ancillary data not
explicitly required for ATL03.

concept-id: C1705401930-NSIDC_ECS
Title: ATLAS/ICESat-2 L2A Global Geolocated Photon Data V003
Abstract: This data set (ATL03) contains height above the WGS 84 ellipsoid
(ITRF2014 reference frame), latitude, longitude, and time for all photons
downlinked by the Advanced Topographic Laser Altimeter System (ATLAS) instrument
on board the Ice, Cloud and land Elevation Satellite-2 (ICESat-2) observatory.
The ATL03 product was designed to be a single source for all photon data and
ancillary information needed by higher-level ATLAS/ICESat-2 products. As such,
it also includes spacecraft and instrument parameters and ancillary data not
explicitly required for ATL03.

concept-id: C2003771331-NSIDC_ECS
Title: ATLAS/ICESat-2 L3A Land Ice Height V004
Abstract: This data set (ATL06) provides geolocated, land-ice surface heights
(above the WGS 84 ellipsoid, ITRF2014 reference frame), plus ancillary
parameters that can be used to interpret and assess the quality of the height
estimates. The data were acquired by the Advanced Topographic Laser Altimeter
System (ATLAS) instrument on board the Ice, Cloud and land Elevation Satellite-2
(ICESat-2) observatory.
# Now let's do the same with short names, a more specific way of finding data.

# First let's search for some collections hosted at NSIDC
collections = cmr_collection.search({'short_name':'ATL03',
                                     'provider': NSIDC_PROVIDERS['NSIDC_HOSTED']})

# Note how we get back the same collection twice; that's because we have 2 versions available.
for collection in collections[0:3]:
    wrapped_abstract = '\n'.join(textwrap.wrap(f"Abstract: {collection['umm']['Abstract']}", 80)) + '\n'
    print(f"concept-id: {collection['meta']['concept-id']}\n" +
          f"Title: {collection['umm']['EntryTitle']}\n" +
          wrapped_abstract)
concept-id: C1997321091-NSIDC_ECS
Title: ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004
Abstract: This data set (ATL03) contains height above the WGS 84 ellipsoid
(ITRF2014 reference frame), latitude, longitude, and time for all photons
downlinked by the Advanced Topographic Laser Altimeter System (ATLAS) instrument
on board the Ice, Cloud and land Elevation Satellite-2 (ICESat-2) observatory.
The ATL03 product was designed to be a single source for all photon data and
ancillary information needed by higher-level ATLAS/ICESat-2 products. As such,
it also includes spacecraft and instrument parameters and ancillary data not
explicitly required for ATL03.

concept-id: C1705401930-NSIDC_ECS
Title: ATLAS/ICESat-2 L2A Global Geolocated Photon Data V003
Abstract: This data set (ATL03) contains height above the WGS 84 ellipsoid
(ITRF2014 reference frame), latitude, longitude, and time for all photons
downlinked by the Advanced Topographic Laser Altimeter System (ATLAS) instrument
on board the Ice, Cloud and land Elevation Satellite-2 (ICESat-2) observatory.
The ATL03 product was designed to be a single source for all photon data and
ancillary information needed by higher-level ATLAS/ICESat-2 products. As such,
it also includes spacecraft and instrument parameters and ancillary data not
explicitly required for ATL03.
# now that we have the concept-ids we can look for data granules in that collection and pass spatiotemporal parameters.
from cmr_serializer import QueryResult

# a bbox over Juneau Icefield 
# bbox = min Longitude , min Latitude , max Longitude , max Latitude 
query = {'concept-id': 'C1997321091-NSIDC_ECS',
         'bounding_box': '-135.1977,58.3325,-133.3410,58.9839'}

# Querying for ATL03 v4 (V004) using its concept-id and a bounding box
results = granule.search(query, limit=1000)
# This is a wrapper with convenient methods to work with CMR query results.
granules = QueryResult(results).items()

print(f"Total granules found: {len(results)} \n")
for g in granules[0:3]:
    display(g)
Total granules found: 201 

Id: ATL03_20181014001049_02350102_004_01.h5
Collection: {'EntryTitle': 'ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004'}
Spatial coverage: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -127.0482205607256, 'StartLatitude': 27.0, 'StartDirection': 'A', 'EndLatitude': 59.5, 'EndDirection': 'A'}}}
Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2018-10-14T00:10:49.722Z', 'EndingDateTime': '2018-10-14T00:19:19.918Z'}}
Size(MB): 1764.5729866028
Data: https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL03.004/2018.10.14/ATL03_20181014001049_02350102_004_01.h5

Id: ATL03_20181015124359_02580106_004_01.h5
Collection: {'EntryTitle': 'ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004'}
Spatial coverage: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': 49.70324528818096, 'StartLatitude': 59.5, 'StartDirection': 'D', 'EndLatitude': 27.0, 'EndDirection': 'D'}}}
Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2018-10-15T12:43:57.696Z', 'EndingDateTime': '2018-10-15T12:52:28.274Z'}}
Size(MB): 276.2403841019
Data: https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL03.004/2018.10.15/ATL03_20181015124359_02580106_004_01.h5

Id: ATL03_20181018000228_02960102_004_01.h5
Collection: {'EntryTitle': 'ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004'}
Spatial coverage: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -127.82682215638665, 'StartLatitude': 27.0, 'StartDirection': 'A', 'EndLatitude': 59.5, 'EndDirection': 'A'}}}
Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2018-10-18T00:02:28.717Z', 'EndingDateTime': '2018-10-18T00:10:58.903Z'}}
Size(MB): 877.0574979782
Data: https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL03.004/2018.10.18/ATL03_20181018000228_02960102_004_01.h5
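
The `bounding_box` string used in the granule query lists min longitude, min latitude, max longitude, max latitude. A small helper can build it from named values and catch swapped latitudes early. This is a minimal sketch; `cmr_bounding_box` is a hypothetical name, not a function of the CMR client.

```python
def cmr_bounding_box(west: float, south: float, east: float, north: float) -> str:
    """Format a bounding box as 'west,south,east,north' after basic range checks."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be within [-180, 180]")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    # west > east is allowed on purpose: such a box crosses the antimeridian.
    return f"{west},{south},{east},{north}"


# The Juneau Icefield box used in the query above.
juneau_bbox = cmr_bounding_box(-135.1977, 58.3325, -133.3410, 58.9839)
print(juneau_bbox)
```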

# We can access the data links with data_links()
for g in granules[0:10]:
    print(g.data_links())

Cloud Collections

Some NSIDC cloud collections are not yet public, so we need to authenticate ourselves with CMR first.

import getpass
import textwrap

from cmr.search import collection as cmr_collection
from cmr.search import granule 
from cmr.auth import token

from cmr_auth import CMRAuth

# NSIDC_HOSTED collections are hosted at the NSIDC DAAC data center
# AWS_HOSTED collections are hosted at AWS S3 us-west-2
NSIDC_PROVIDERS = {
    'NSIDC_HOSTED': 'NSIDC_ECS', 
    'AWS_HOSTED':'NSIDC_CPRD'
}

# Use your own EDL username
USER= 'betolink'

print('Enter your NASA Earthdata login password:')
password = getpass.getpass()
CMR_auth = CMRAuth(USER, password)
# Token to search private collections on CMR
cmr_token = CMR_auth.get_token()
Enter your NASA Earthdata login password:
 ········
# Now let's start our authenticated queries on CMR
query = {'short_name':'ATL03',
         'token': cmr_token,
         'provider': NSIDC_PROVIDERS['AWS_HOSTED']}

collections = cmr_collection.search(query)

for collection in collections[0:3]:
    wrapped_abstract = '\n'.join(textwrap.wrap(f"Abstract: {collection['umm']['Abstract']}", 80)) + '\n'
    print(f"concept-id: {collection['meta']['concept-id']}\n" +
          f"Title: {collection['umm']['EntryTitle']}\n" +
          wrapped_abstract)
concept-id: C2027878642-NSIDC_CPRD
Title: ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004
Abstract: This data set (ATL03) contains height above the WGS 84 ellipsoid
(ITRF2014 reference frame), latitude, longitude, and time for all photons
downlinked by the Advanced Topographic Laser Altimeter System (ATLAS) instrument
on board the Ice, Cloud and land Elevation Satellite-2 (ICESat-2) observatory.
The ATL03 product was designed to be a single source for all photon data and
ancillary information needed by higher-level ATLAS/ICESat-2 products. As such,
it also includes spacecraft and instrument parameters and ancillary data not
explicitly required for ATL03.
from cmr_serializer import QueryResult

# Now that we have the concept-id for our ATL03 in the cloud we do the same thing we did with ATL03 hosted at
# NSIDC but using the cloud concept-id
# A bbox over Juneau Icefield
query = {'concept-id': 'C2027878642-NSIDC_CPRD',
         'token': cmr_token,
         'bounding_box': '-135.1977,58.3325,-133.3410,58.9839'}

# Querying for ATL03 v4 (V004) using its concept-id and a bounding box
results = granule.search(query, limit=1000)
granules = QueryResult(results).items()

print(f"Total granules found: {len(results)} \n")

# Print the first 3 granules
for g in granules[0:3]:
    display(g)
    # You can use: print(g) for the regular text representation.
Total granules found: 135 

Id: ATL03_20181014001049_02350102_004_01.h5
Collection: {'EntryTitle': 'ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004'}
Spatial coverage: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -127.0482205607256, 'StartLatitude': 27.0, 'StartDirection': 'A', 'EndLatitude': 59.5, 'EndDirection': 'A'}}}
Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2018-10-14T00:10:49.722Z', 'EndingDateTime': '2018-10-14T00:19:19.918Z'}}
Size(MB): 1764.5729866027832
Data: https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/14/ATL03_20181014001049_02350102_004_01.h5
s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/14/ATL03_20181014001049_02350102_004_01.h5

Id: ATL03_20181015124359_02580106_004_01.h5
Collection: {'EntryTitle': 'ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004'}
Spatial coverage: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': 49.70324528818096, 'StartLatitude': 59.5, 'StartDirection': 'D', 'EndLatitude': 27.0, 'EndDirection': 'D'}}}
Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2018-10-15T12:43:57.696Z', 'EndingDateTime': '2018-10-15T12:52:28.274Z'}}
Size(MB): 276.2403841018677
Data: https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/15/ATL03_20181015124359_02580106_004_01.h5
s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/15/ATL03_20181015124359_02580106_004_01.h5

Id: ATL03_20181018000228_02960102_004_01.h5
Collection: {'EntryTitle': 'ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004'}
Spatial coverage: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -127.82682215638665, 'StartLatitude': 27.0, 'StartDirection': 'A', 'EndLatitude': 59.5, 'EndDirection': 'A'}}}
Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2018-10-18T00:02:28.717Z', 'EndingDateTime': '2018-10-18T00:10:58.903Z'}}
Size(MB): 877.0574979782104
Data: https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/18/ATL03_20181018000228_02960102_004_01.h5
s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/18/ATL03_20181018000228_02960102_004_01.h5

NOTE: Not all the data granules for NSIDC datasets have been migrated to S3. This might result in different counts between the NSIDC-hosted data collections and the ones in AWS S3.
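
One way to see which granules account for such a difference is to compare granule ids from the two searches. The sketch below works on plain lists of ids; the function name and the placeholder ids are invented for illustration, and how the ids are extracted from the query results is left open.

```python
from typing import Iterable, List


def missing_from_cloud(nsidc_ids: Iterable[str], cloud_ids: Iterable[str]) -> List[str]:
    """Return ids present in the NSIDC-hosted results but absent from the cloud results."""
    cloud = set(cloud_ids)
    return [gid for gid in nsidc_ids if gid not in cloud]


# Placeholder ids, not real granule names.
nsidc_hosted = ["granule_a.h5", "granule_b.h5", "granule_c.h5"]
aws_hosted = ["granule_a.h5", "granule_c.h5"]

print(missing_from_cloud(nsidc_hosted, aws_hosted))
```

Keeping the result as a list preserves the order of the NSIDC-hosted search, which is convenient when the granules are sorted by time.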

# We can list only the S3 links
for g in granules[0:10]:
    print(g.data_links(only_s3=True))
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/14/ATL03_20181014001049_02350102_004_01.h5']
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/15/ATL03_20181015124359_02580106_004_01.h5']
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/18/ATL03_20181018000228_02960102_004_01.h5']
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/11/05/ATL03_20181105113651_05780106_004_01.h5']
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/11/07/ATL03_20181107225525_06160102_004_01.h5']
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/11/09/ATL03_20181109112837_06390106_004_01.h5']
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/11/11/ATL03_20181111224708_06770102_004_01.h5']
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/11/15/ATL03_20181115223845_07380102_004_01.h5']
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/12/04/ATL03_20181204101243_10200106_004_01.h5']
['s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/12/06/ATL03_20181206213114_10580102_004_01.h5']

Note that our RelatedLinks array now contains links to AWS S3; these are the direct URIs for our data granules in the AWS us-west-2 region.

Data Access using AWS S3

  • IMPORTANT: This section will only work if this notebook is running in the AWS us-west-2 region

There is more than one way of accessing data on AWS S3: either downloading it to your local machine using the official client library or using a Python library.

Performance tip: using the HTTPS URLs will decrease access performance, since these links have to be processed internally by AWS’s content delivery system (CloudFront). To get better performance we should access the s3:// URLs with boto3 or a high-level S3-enabled library (e.g. s3fs).
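
In the granule listings shown earlier, each HTTPS link and its S3 counterpart share the same path, with the bucket name as the first path segment. Assuming that pattern holds (it is an observation from those listings, not a documented guarantee), an HTTPS link can be rewritten as an `s3://` URI as sketched below. The helper name `https_to_s3` is hypothetical, and the S3 links returned by CMR remain the safer source.

```python
from urllib.parse import urlparse


def https_to_s3(url: str) -> str:
    """Rewrite https://host/<bucket>/<key> as s3://<bucket>/<key>."""
    parsed = urlparse(url)
    if parsed.scheme == "s3":
        return url  # already an S3 URI
    bucket, _, key = parsed.path.lstrip("/").partition("/")
    if not bucket or not key:
        raise ValueError(f"cannot derive a bucket and key from {url!r}")
    return f"s3://{bucket}/{key}"


https_url = (
    "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/"
    "ATLAS/ATL03/004/2018/10/15/ATL03_20181015124359_02580106_004_01.h5"
)
print(https_to_s3(https_url))
```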

Related links:

  • HDF in the Cloud challenges and solutions for scientific data
  • Cloud Storage (Amazon S3) HDF5 Connector

# Read-only temporary credentials
import s3fs
import h5py

# These credentials only last 1 hour.
s3_cred = CMR_auth.get_s3_credentials()


s3_fs = s3fs.S3FileSystem(key=s3_cred['accessKeyId'],
                          secret=s3_cred['secretAccessKey'],
                          token=s3_cred['sessionToken'])

# Now you could grab S3 links to your cloud instance (EC2, Hub etc) using:
# s3_fs.get('s3://SOME_LOCATION/ATL03_20181015124359_02580106_004_01.h5', 'test.h5')

We now have the proper credentials and file mapper to access the data within AWS us-west-2.

with s3_fs.open('s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/15/ATL03_20181015124359_02580106_004_01.h5', 'rb') as s3f:
    with h5py.File(s3f, 'r') as f:
        print([key for key in f.keys()])
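
Because the temporary credentials last about one hour, long sessions need to notice when they are about to lapse and request new ones. The sketch below assumes the credentials dictionary also carries an `expiration` entry holding an ISO 8601 timestamp; that key and its format are assumptions, so check the actual response before relying on them. The helper name and the 5-minute margin are arbitrary choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def credentials_expired(cred: dict,
                        now: Optional[datetime] = None,
                        margin: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when the credentials are expired or expire within `margin`."""
    expires = datetime.fromisoformat(cred["expiration"])  # assumed key and format
    if expires.tzinfo is None:
        # Treat naive timestamps as UTC.
        expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now >= expires - margin


# Made-up credentials record for illustration.
example_cred = {"expiration": "2021-10-20 13:00:00+00:00"}
noon = datetime(2021, 10, 20, 12, 0, tzinfo=timezone.utc)
print(credentials_expired(example_cred, now=noon))
```

When the check returns True, requesting a fresh set of credentials and rebuilding the `S3FileSystem` object is the simplest recovery.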

Using xarray to open files on S3

ATL data is complex, so xarray doesn’t know how to extract the important bits out of it.

import xarray

with s3_fs.open('s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/15/ATL03_20181015124359_02580106_004_01.h5', 'rb') as s3f:
    ds = xarray.open_dataset(s3f)
    for varname in ds:
        print(varname)
ds

“Downloading” files on S3 using the official aws-cli library

The quotes around “downloading” are there because ideally you’ll be working on an EC2 instance (a virtual machine, for short) in the us-west-2 region.
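
A minimal sketch of how the temporary credentials could be handed to aws-cli: the CLI reads `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_SESSION_TOKEN` from the environment, so we build an environment and an `aws s3 cp` command. The helper names are hypothetical, the credential values are placeholders, and nothing is executed here.

```python
import os
from typing import Dict, List, Optional


def aws_cli_env(cred: dict, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and add the temporary S3 credentials to it."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "AWS_ACCESS_KEY_ID": cred["accessKeyId"],
        "AWS_SECRET_ACCESS_KEY": cred["secretAccessKey"],
        "AWS_SESSION_TOKEN": cred["sessionToken"],
        "AWS_DEFAULT_REGION": "us-west-2",
    })
    return env


def aws_cp_command(s3_uri: str, destination: str) -> List[str]:
    """Return the argument list for `aws s3 cp <s3_uri> <destination>`."""
    return ["aws", "s3", "cp", s3_uri, destination]


# Placeholder values; use the dictionary returned by the credentials endpoint.
placeholder_cred = {
    "accessKeyId": "EXAMPLE_KEY_ID",
    "secretAccessKey": "EXAMPLE_SECRET",
    "sessionToken": "EXAMPLE_TOKEN",
}
env = aws_cli_env(placeholder_cred, base_env={})
cmd = aws_cp_command(
    "s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2018/10/15/ATL03_20181015124359_02580106_004_01.h5",
    "ATL03_20181015124359_02580106_004_01.h5",
)
print(" ".join(cmd))
# To run it for real (inside us-west-2, with aws-cli installed):
# subprocess.run(cmd, env=aws_cli_env(s3_cred), check=True)
```

Passing the credentials through the environment of a single subprocess avoids writing short-lived secrets to the shared `~/.aws/credentials` file.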