Data discovery with NASA’s CMR

Summary

In this notebook, we will walk through how to search for Earthdata data collections and granules. Along the way we will explore the available search parameters, the information returned, and specific constraints when using the CMR API. Our objective is to identify assets to download, or to access directly via S3, within an analysis workflow.

We will be querying CMR for ECOSTRESS collections and granules to identify assets to download, or to access directly via S3, within an analysis workflow.

Requirements

1. Earthdata Login

An Earthdata Login account is required to access data, as well as to discover restricted data, from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. The account is free to create and only takes a moment to set up.

2. ECOSTRESS Early Adopter

ECOSTRESS build 7 is only open to individuals identified as early adopters. As such, ECOSTRESS discovery and access are managed by an access control list. If you are not on the access control list, you will not be able to complete the exercise as written below.

Learning Objectives

  • understand what CMR and the CMR API are and what they can be used for
  • use the requests package to search data collections and granules
  • use an Earthdata Login token to search for data governed by access control lists
  • parse the results of these searches

What is CMR

CMR is the Common Metadata Repository. It catalogs all data for NASA’s Earth Observing System Data and Information System (EOSDIS). It is the backend of Earthdata Search, the GUI search interface you are probably familiar with. More information about CMR can be found here.

Unfortunately, the GUI for Earthdata Search is not accessible from a cloud instance - at least not without some work. Earthdata Search is also not immediately reproducible: if you create a search using the GUI, you would have to note the search criteria (date range, search area, collection name, etc.), take a screenshot, copy the search URL, or save the list of data granules returned by the search in order to recreate it. This information would have to be re-entered each time you or someone else wanted to repeat the search, and typos or other mistakes could creep in. A cleaner, reproducible solution is to search CMR programmatically using the CMR API.

What is the CMR API

API stands for Application Programming Interface. It allows applications (software, services, etc) to send information to each other. A helpful analogy is a waiter in a restaurant. The waiter takes your drink or food order that you select from the menu, often translated into short-hand, to the bar or kitchen, and then returns (hopefully) with what you ordered when it is ready.

The CMR API accepts search terms such as collection name, keywords, datetime range, and location, queries the CMR database and returns the results.
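Under the hood, a CMR search is nothing more than an HTTP GET request against the search endpoint, with the search terms encoded as URL query parameters. A minimal sketch (the parameter values here are purely illustrative):

```python
from urllib.parse import urlencode

# Illustrative only: a CMR collection search is an HTTP GET whose search
# terms travel as URL-encoded query parameters.
base = 'https://cmr.earthdata.nasa.gov/search/collections'
params = {'keyword': 'ECOSTRESS', 'page_size': 10}
query_url = f'{base}?{urlencode(params)}'
print(query_url)  # https://cmr.earthdata.nasa.gov/search/collections?keyword=ECOSTRESS&page_size=10
```

This is exactly the kind of URL that the requests package constructs for us from its params argument in the cells below, so we never have to build query strings by hand.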


Getting Started: How to search CMR from Python

The first step is to import python packages. We will use:
  • requests - does most of the work for us, accessing the CMR API using HTTP methods
  • json - to read in the saved Earthdata Login token
  • pprint - to pretty print the results of the search

A more in-depth tutorial on requests is here

import requests
import json
from pprint import pprint

To conduct a search using the CMR API, requests needs the url for the root CMR search endpoint. We’ll assign this url to a python variable as a string.

CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'

CMR allows search by collections, which are datasets, and granules, which are files that contain data. Many of the same search parameters can be used for collections and granules but the type of results returned differ. Search parameters can be found in the API Documentation.

Whether we search collections or granules is distinguished by adding "collections" or "granules" to the end of the CMR endpoint URL.

We are going to search collections first, so we add "collections" to the URL. We are using a python format string in the examples below.

url = f'{CMR_OPS}/{"collections"}'

In this first example, we want to retrieve a list of ECOSTRESS collections in the Earthdata Cloud. This includes ECOSTRESS collections from build 7, which, at the time of this tutorial, is hidden to all except early adopters. Because of this, an extra parameter needs to be passed with each CMR request to indicate that you are on the access list. An Earthdata Login token, generated using your Earthdata Login credentials, will be passed to the token parameter.

Two options are available to generate an Earthdata Login token:

1. Generate a token from the Earthdata Login interface by logging in at Earthdata Login and clicking Generate Token.
2. Programmatically generate a token: use the NASA_Earthdata_Login_Token notebook to generate and save a token for use in this notebook.

We can read in our token after it has been generated and saved using the NASA_Earthdata_Login_Token notebook. The json file produced can be found here: /home/jovyan/.hidden_dir/edl_token.json. We’ll read the token into a variable named token.

with open('../../.hidden_dir/edl_token.json') as js:
    token = json.load(js)['access_token']

We’ll want to get the content in json (pronounced “jason”) format, so we pass a dictionary to the headers keyword argument saying that we want results returned as json.

The .get() method is used to send this information to the CMR API. get() calls the HTTP method GET.

response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                        },
                        headers={
                            'Accept': 'application/json',
                        }
                       )

The request returns a Response object.

To check that our request was successful we can print the response variable we saved the request to.

response

A 200 response is what we want: it means that the request was successful. For more information on HTTP status codes see https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

A more explicit way to check the status code is to use the status_code attribute. Both methods return an HTTP status code.

response.status_code

The response from requests.get returns the results of the search and metadata about those results in the headers.

More information about the response object can be found by typing help(response).

headers contains useful information in a case-insensitive dictionary. We requested (above) that the information be returned in json, which means the object returned is a dictionary in our Python environment. We’ll iterate through the returned dictionary, looping through each field (k) and its associated value (v). For more on iterating through dictionary objects click here.

for k, v in response.headers.items():
    print(f'{k}: {v}')

Each item in the dictionary can be accessed in the normal way you access a python dictionary, but the keys are case-insensitive. Let’s take a look at the commonly used CMR-Hits key.

response.headers['CMR-Hits']

Note that “cmr-hits” works as well!

response.headers['cmr-hits']

In some situations the response to your query can return a very large number of results, some of which may not be relevant. We can add additional query parameters to restrict the information returned. We’re going to restrict the search with the provider parameter.

You can modify the code below to explore all Earthdata data products hosted by the various providers. When searching by provider, use Cloud Provider to search for cloud-hosted datasets and On-Premises Provider to search for datasets archived at the DAACs. A partial list of providers is given below.

| DAAC | Short Name | Cloud Provider | On-Premises Provider |
|---|---|---|---|
| NSIDC | National Snow and Ice Data Center | NSIDC_CPRD | NSIDC_ECS |
| GHRC DAAC | Global Hydrometeorology Resource Center | GHRC_DAAC | GHRC_DAAC |
| PO DAAC | Physical Oceanography Distributed Active Archive Center | POCLOUD | PODAAC |
| ASF | Alaska Satellite Facility | ASF | ASF |
| ORNL DAAC | Oak Ridge National Laboratory | ORNL_CLOUD | ORNL_DAAC |
| LP DAAC | Land Processes Distributed Active Archive Center | LPCLOUD | LPDAAC_ECS |
| GES DISC | NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) | GES_DISC | GES_DISC |
| OB DAAC | NASA’s Ocean Biology Distributed Active Archive Center | | OB_DAAC |
| SEDAC | NASA’s Socioeconomic Data and Applications Center | | SEDAC |
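When scripting against several DAACs, the provider strings from the table above can be mirrored in a small dictionary. This is a hypothetical convenience for this notebook, not part of the CMR API; the keys and strings are copied from the table:

```python
# Hypothetical lookup mirroring (part of) the provider table above.
providers = {
    'NSIDC':     {'cloud': 'NSIDC_CPRD', 'on_premises': 'NSIDC_ECS'},
    'PO DAAC':   {'cloud': 'POCLOUD',    'on_premises': 'PODAAC'},
    'ORNL DAAC': {'cloud': 'ORNL_CLOUD', 'on_premises': 'ORNL_DAAC'},
    'LP DAAC':   {'cloud': 'LPCLOUD',    'on_premises': 'LPDAAC_ECS'},
}
print(providers['LP DAAC']['cloud'])  # LPCLOUD
```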

We’ll assign the provider to a variable as a string and insert the variable into the parameters argument of the request. We’ll also assign the term ‘ECOSTRESS’ to a variable so we don’t need to repeatedly add it to the request parameters.

provider = 'LPCLOUD'
project = 'ECOSTRESS'
headers = {
    'Authorization': f'Bearer {token}',
    'Accept': 'application/json',
}
response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                            'provider': provider,
                            'project': project,
                        },
                        headers=headers
                       )
response
response.headers['cmr-hits']

Search results are contained in the content part of the Response object. However, response.content returns information in bytes.

response.content

A more convenient way to work with this information is to use json formatted data. We use pretty print (pprint) to print the data in an easy-to-read way.

Note:
  • response.json() will format our response in json
  • ['feed']['entry'] returns all entries that CMR returned in the request (not the same as CMR-Hits)
  • [0] returns the first entry. Reminder that python starts indexing at 0, not 1!

pprint(response.json()['feed']['entry'][0])

The first entry contains a lot more information than we need, so we’ll narrow in on a few fields to get a feel for what we have. We’ll print the archive center (archive_center), the name of the dataset (dataset_id), and the concept id (id), using a python format string as we did above with the url variable.

collections = response.json()['feed']['entry']
for collection in collections:
    print(f'{collection["archive_center"]} | {collection["dataset_id"]} | {collection["id"]}')

In some situations we may be expecting a certain number of results. Note here that only 10 datasets are printed, yet we know from CMR-Hits that there are more than 10. This is because CMR restricts the number of results returned by each query: the default is 10, and the maximum is 2000. We’ll set the page_size parameter to 25 so we return all results in a single query.

response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                            'provider': provider,
                            'project': project,
                            'page_size': 25
                        },
                        headers=headers
                       )
response
response.headers['cmr-hits']

Now, when we re-run our for loop over the collections, all of the available collections are listed.

collections = response.json()['feed']['entry']
for collection in collections:
    print(f'{collection["archive_center"]} | {collection["dataset_id"]} | {collection["id"]}')
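If a search ever returns more hits than even the 2000-entry maximum page size can hold, the CMR API provides a page_num parameter for requesting successive pages. As a sketch, the number of GET requests needed follows directly from CMR-Hits and the page size (pages_needed is a hypothetical helper, not a CMR parameter):

```python
import math

# Sketch: how many pages (requests with an increasing 'page_num') are
# needed to retrieve every hit, given CMR-Hits and a page size.
def pages_needed(cmr_hits, page_size=2000):
    return math.ceil(int(cmr_hits) / page_size)

print(pages_needed('4523'))              # 3 pages at the 2000-entry maximum
print(pages_needed('25', page_size=10))  # the default page size of 10 would need 3
```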

Searching for Granules

In NASA speak, granules are files or groups of files. In this example, we will search for ECO2LSTE version 1 granules over a specified region of interest and datetime range.

We need to change the resource url to look for granules instead of collections

url = f'{CMR_OPS}/{"granules"}'

We will search by concept_id, temporal, and bounding_box. Details about these search parameters can be found in the CMR API Documentation.

The formatting of the values for each parameter is quite specific:
  • Temporal parameters are in ISO 8601 format, yyyy-MM-ddTHH:mm:ssZ.
  • Bounding box coordinates are lower left longitude, lower left latitude, upper right longitude, upper right latitude.
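These two formatting rules can be captured in small helper functions so the fiddly string building lives in one place. The helpers below are a hypothetical convenience for this notebook, not part of the CMR API:

```python
from datetime import datetime

# Hypothetical helpers: format a CMR 'temporal' value (ISO 8601, comma-
# separated) and a 'bounding_box' value (lower-left lon/lat, then
# upper-right lon/lat) from plain Python objects.
def temporal_range(start, end):
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    return f'{start.strftime(fmt)},{end.strftime(fmt)}'

def bounding_box(ll_lon, ll_lat, ur_lon, ur_lat):
    return f'{ll_lon},{ll_lat},{ur_lon},{ur_lat}'

print(temporal_range(datetime(2022, 4, 1), datetime(2022, 4, 30, 23, 59, 59)))
# 2022-04-01T00:00:00Z,2022-04-30T23:59:59Z
print(bounding_box(-120.45264628, 34.51050622, -120.40432448, 34.53239876))
# -120.45264628,34.51050622,-120.40432448,34.53239876
```

These produce exactly the literal strings used in the request below.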

collection_id = 'C2076090826-LPCLOUD'
date_range = '2022-04-01T00:00:00Z,2022-04-30T23:59:59Z'
bbox = '-120.45264628,34.51050622,-120.40432448,34.53239876'
response = requests.get(url, 
                        params={
                            'concept_id': collection_id,
                            'temporal': date_range,
                            'bounding_box': bbox,
                            'token': token,
                            'page_size': 200
                            },
                        headers=headers
                       )
print(response.status_code)
print(response.headers['CMR-Hits'])
granules = response.json()['feed']['entry']
for granule in granules:
    print(f'{granule["data_center"]} | {granule["dataset_id"]} | {granule["id"]}')
pprint(granules[0])

Get URLs to cloud data assets

https_urls = [l['href'] for l in granules[0]['links'] if 'https' in l['href'] and '.tif' in l['href']]
https_urls
s3_urls = [l['href'] for l in granules[0]['links'] if 's3' in l['href'] and '.tif' in l['href']]
s3_urls
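The two list comprehensions above can also be wrapped in a single hypothetical helper. One deliberate change from the comprehensions: testing startswith(scheme) rather than scheme in href avoids accidentally matching an https URL that merely contains the substring 's3'. The granule entry below is a toy example with made-up hrefs, shaped like the CMR json returned earlier:

```python
def asset_links(granule, scheme='https', suffix='.tif'):
    """Collect asset hrefs from one CMR granule entry that match a URL
    scheme and file suffix (hypothetical helper)."""
    return [link['href'] for link in granule.get('links', [])
            if link.get('href', '').startswith(scheme)
            and link['href'].endswith(suffix)]

# Toy granule entry with made-up hrefs.
granule = {'links': [
    {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/example/ECO2LSTE.tif'},
    {'href': 's3://lp-prod-protected/example/ECO2LSTE.tif'},
    {'href': 'https://doi.org/10.5067/ECOSTRESS'},
]}
print(asset_links(granule))        # the https .tif asset only
print(asset_links(granule, 's3'))  # the s3 .tif asset only
```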