Data discovery with NASA’s CMR
Summary
In this notebook, we will walk through how to search for Earthdata data collections and granules. Along the way we will explore the available search parameters, the information returned, and specific constraints when using the CMR API. Our objective is to identify assets that we can download, or access directly from S3, within an analysis workflow.
We will be querying CMR for ECOSTRESS collections and granules to identify assets we can download, or access directly from S3, within an analysis workflow.
Requirements
1. Earthdata Login
An Earthdata Login account is required to access data, as well as to discover restricted data, from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. The account is free to create and only takes a moment to set up.
2. ECOSTRESS Early Adopter
ECOSTRESS build 7 is only open to individuals identified as early adopters. As such, ECOSTRESS discovery and access are managed by an access control list. If you are not on the access control list, you will not be able to complete the exercise as written below.
Learning Objectives
- understand what CMR and the CMR API are and what they can be used for
- how to use the requests package to search data collections and granules
- how to use an Earthdata Login token to search for data protected by access control lists
- how to parse the results of these searches
What is CMR
CMR is the Common Metadata Repository. It catalogs all data for NASA’s Earth Observing System Data and Information System (EOSDIS). It is the backend of Earthdata Search, the GUI search interface you are probably familiar with. More information about CMR can be found here.
Unfortunately, the GUI for Earthdata Search is not accessible from a cloud instance, at least not without some work. Earthdata Search is also not immediately reproducible: if you create a search using the GUI, you would have to note the search criteria (date range, search area, collection name, etc.), take a screenshot, copy the search url, or save the list of data granules returned by the search in order to recreate it. This information would have to be re-entered each time you or someone else wanted to repeat the search, and you could make typos or other mistakes. A cleaner, reproducible solution is to search CMR programmatically using the CMR API.
What is the CMR API
API stands for Application Programming Interface. It allows applications (software, services, etc.) to send information to each other. A helpful analogy is a waiter in a restaurant. The waiter takes the drink or food order that you select from the menu, often translated into short-hand, to the bar or kitchen, and then (hopefully) returns with what you ordered when it is ready.
The CMR API accepts search terms such as collection name, keywords, datetime range, and location, queries the CMR database and returns the results.
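For example, the query below is a complete CMR API request that can be pasted into a browser or sent with any HTTP client; it is just an HTTP GET against the search endpoint with query parameters (the keyword value here is only an illustration):

https://cmr.earthdata.nasa.gov/search/collections?keyword=ECOSTRESS

We will build requests like this programmatically in the rest of this notebook.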
Getting Started: How to search CMR from Python
The first step is to import Python packages. We will use:
- requests, which does most of the work for us, accessing the CMR API using HTTP methods
- json, to read the saved Earthdata Login token
- pprint, to pretty-print the results of the search

A more in-depth tutorial on requests is here.

import requests
import json
from pprint import pprint
To conduct a search using the CMR API, requests needs the url for the root CMR search endpoint. We'll assign this url to a Python variable as a string.
CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'
CMR allows search by collections, which are datasets, and granules, which are files that contain data. Many of the same search parameters can be used for collections and granules but the type of results returned differ. Search parameters can be found in the API Documentation.
Whether we search collections or granules is distinguished by adding "collections" or "granules" to the end of the CMR endpoint URL. We are going to search collections first, so we add "collections" to the URL. We are using a Python format string in the examples below.
url = f'{CMR_OPS}/collections'
In this first example, I want to retrieve a list of ECOSTRESS collections in the Earthdata Cloud. This includes ECOSTRESS collections from build 7, which, at the time of this tutorial, are hidden from all except early adopters. Because of this, an extra parameter needs to be passed in each CMR request to indicate that you are part of the access list. An Earthdata Login token, generated using your Earthdata Login credentials, will be passed to the token parameter.
Two options are available to generate an Earthdata Login token.
1. Generate a token from the Earthdata Login interface by logging into Earthdata Login and clicking Generate Token.
2. Programmatically generate an Earthdata Login token. Use the NASA_Earthdata_Login_Token notebook to generate and save a token for use in this notebook.
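For reference, a minimal sketch of option 2 is shown below. It assumes the Earthdata Login token endpoint https://urs.earthdata.nasa.gov/api/users/token accepts HTTP Basic authentication and returns json containing an access_token field; the NASA_Earthdata_Login_Token notebook remains the supported path.

import json
import requests
from getpass import getpass

# Prompt for Earthdata Login credentials rather than hard-coding them
username = input('Earthdata Login username: ')
password = getpass('Earthdata Login password: ')

# Request a new token from the assumed Earthdata Login token endpoint
resp = requests.post('https://urs.earthdata.nasa.gov/api/users/token',
                     auth=(username, password))
resp.raise_for_status()

# Save the json response, which includes the access_token field read below
with open('../../.hidden_dir/edl_token.json', 'w') as js:
    json.dump(resp.json(), js)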
We can read in our token after it has been generated and saved using the NASA_Earthdata_Login_Token notebook. The json file produced can be found here: /home/jovyan/.hidden_dir/edl_token.json. We'll read the token into a variable named token.
with open('../../.hidden_dir/edl_token.json') as js:
    token = json.load(js)['access_token']
We'll want to get the content in json (pronounced "jason") format, so I pass a dictionary to the headers keyword argument to say that I want results returned as json.
The .get() method is used to send this information to the CMR API. get() calls the HTTP method GET.
response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                        },
                        headers={
                            'Accept': 'application/json',
                        })
The request returns a Response object.
To check that our request was successful, we can print the response variable we saved the request to.
response
A 200 response is what we want. This means that the request was successful. For more information on HTTP status codes see https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
A more explicit way to check the status code is to use the status_code attribute. Both methods return an HTTP status code.
response.status_code
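If you are scripting rather than working interactively, you can also let requests raise an exception instead of checking the code by hand:

# Raises requests.exceptions.HTTPError if the status code is 4xx or 5xx
response.raise_for_status()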
The response from requests.get returns the results of the search and metadata about those results in the headers.
More information about the response object can be found by typing help(response).
headers contains useful information in a case-insensitive dictionary. We requested (above) that the information be returned in json, which means the object returned is a dictionary in our Python environment. We'll iterate through the returned dictionary, looping through each field (k) and its associated value (v). For more on iterating through dictionary objects, click here.
for k, v in response.headers.items():
    print(f'{k}: {v}')
Each item in the dictionary can be accessed in the normal way you access a Python dictionary but, unusually, the keys are case-insensitive. Let's take a look at the commonly used CMR-Hits key.
response.headers['CMR-Hits']
Note that “cmr-hits” works as well!
response.headers['cmr-hits']
In some situations the response to your query can return a very large number of results, some of which may not be relevant. We can add additional query parameters to restrict the information returned. We're going to restrict the search by the provider parameter.
You can modify the code below to explore all Earthdata data products hosted by the various providers. When searching by provider, use Cloud Provider to search for cloud-hosted datasets and On-Premises Provider to search for datasets archived at the DAACs. A partial list of providers is given below, followed by a sketch that uses it.
DAAC | Full Name | Cloud Provider | On-Premises Provider |
---|---|---|---|
NSIDC | National Snow and Ice Data Center | NSIDC_CPRD | NSIDC_ECS |
GHRC DAAC | Global Hydrometeorology Resource Center | GHRC_DAAC | GHRC_DAAC |
PO DAAC | Physical Oceanography Distributed Active Archive Center | POCLOUD | PODAAC |
ASF | Alaska Satellite Facility | ASF | ASF |
ORNL DAAC | Oak Ridge National Laboratory | ORNL_CLOUD | ORNL_DAAC |
LP DAAC | Land Processes Distributed Active Archive Center | LPCLOUD | LPDAAC_ECS |
GES DISC | NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) | GES_DISC | GES_DISC |
OB DAAC | NASA’s Ocean Biology Distributed Active Archive Center | | OB_DAAC |
SEDAC | NASA’s Socioeconomic Data and Applications Center | | SEDAC |
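As a sketch of how the table can be used, the loop below asks CMR how many cloud-hosted collections each of a few cloud providers reports, reading the count from the CMR-Hits header; the provider list is a subset taken from the table above.

# Compare a few cloud providers by their number of cloud-hosted collections
for prov in ['NSIDC_CPRD', 'POCLOUD', 'ORNL_CLOUD', 'LPCLOUD', 'GES_DISC']:
    r = requests.get(url,
                     params={
                         'cloud_hosted': 'True',
                         'has_granules': 'True',
                         'provider': prov,
                     },
                     headers={'Accept': 'application/json'})
    print(f'{prov}: {r.headers["CMR-Hits"]} collections')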
We'll assign the provider to a variable as a string and insert the variable into the parameters of the request. We'll also assign the term 'ECOSTRESS' to a variable so we don't need to repeatedly add it to the request parameters.
provider = 'LPCLOUD'
project = 'ECOSTRESS'
headers = {
    'Authorization': f'Bearer {token}',
    'Accept': 'application/json',
}
response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                            'provider': provider,
                            'project': project,
                        },
                        headers=headers)
response
response.headers['cmr-hits']
Search results are contained in the content part of the Response object. However, response.content returns information in bytes.
response.content
A more convenient way to work with this information is to use json formatted data. I'm using pretty print (pprint) to print the data in an easy-to-read way.
Note:
- response.json() will format our response in json
- ['feed']['entry'] returns all entries that CMR returned in the request (not the same as CMR-Hits)
- [0] returns the first entry. Reminder that Python starts indexing at 0, not 1!
pprint(response.json()['feed']['entry'][0])
The first response contains a lot more information than we need. We'll narrow in on a few fields to get a feel for what we have. We'll print the name of the dataset (dataset_id) and the concept id (id). We can build this variable and print statement like we did above with the url variable.
collections = response.json()['feed']['entry']
for collection in collections:
    print(f'{collection["archive_center"]} | {collection["dataset_id"]} | {collection["id"]}')
In some situations we may be expecting a certain number of results. Note here that only 10 datasets are printed. We know from CMR-Hits that there are more than 10 datasets. This is because CMR restricts the number of results returned by each query. The default is 10, but it can be set to a maximum of 2000. We'll set the page_size parameter to 25 so we return all results in a single query.
response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                            'provider': provider,
                            'project': project,
                            'page_size': 25,
                        },
                        headers=headers)
response
response.headers['cmr-hits']
Now, when we re-run our for loop over the collections, all of the available collections are listed.
collections = response.json()['feed']['entry']
for collection in collections:
    print(f'{collection["archive_center"]} | {collection["dataset_id"]} | {collection["id"]}')
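Our 25 collections fit on a single page, but page_size is capped at 2000. When a query has more hits than one page can hold, one approach, sketched below using CMR's page_num parameter, is to keep requesting pages until an empty page comes back. (CMR also offers a search-after mechanism for very deep paging, which we do not cover here.)

# Page through results; stop when CMR returns an empty page
all_collections = []
page_num = 1
while True:
    r = requests.get(url,
                     params={
                         'cloud_hosted': 'True',
                         'has_granules': 'True',
                         'provider': provider,
                         'project': project,
                         'page_size': 25,
                         'page_num': page_num,
                     },
                     headers=headers)
    entries = r.json()['feed']['entry']
    if not entries:
        break
    all_collections.extend(entries)
    page_num += 1
print(f'Collected {len(all_collections)} collections')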
Searching for Granules
In NASA speak, granules are files or groups of files. In this example, we will search for ECO2LSTE version 1 granules for a specified region of interest and datetime range.
We need to change the resource url to look for granules instead of collections.
url = f'{CMR_OPS}/granules'
We will search by concept_id, temporal, and bounding_box. Details about these search parameters can be found in the CMR API Documentation.
The formatting of the values for each parameter is quite specific. Temporal parameters are in ISO 8601 format, yyyy-MM-ddTHH:mm:ssZ. Bounding box coordinates are lower left longitude, lower left latitude, upper right longitude, upper right latitude.
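Rather than hand-typing these strings, you can build them from Python values; the sketch below uses only the standard library and reproduces the exact strings assigned in the next cell.

from datetime import datetime

# Build the temporal string from datetime objects in yyyy-MM-ddTHH:mm:ssZ form
start = datetime(2022, 4, 1, 0, 0, 0)
end = datetime(2022, 4, 30, 23, 59, 59)
date_range = f'{start:%Y-%m-%dT%H:%M:%SZ},{end:%Y-%m-%dT%H:%M:%SZ}'

# Build the bounding box string from west, south, east, north coordinates
west, south, east, north = -120.45264628, 34.51050622, -120.40432448, 34.53239876
bbox = f'{west},{south},{east},{north}'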
collection_id = 'C2076090826-LPCLOUD'
date_range = '2022-04-01T00:00:00Z,2022-04-30T23:59:59Z'
bbox = '-120.45264628,34.51050622,-120.40432448,34.53239876'
response = requests.get(url,
                        params={
                            'concept_id': collection_id,
                            'temporal': date_range,
                            'bounding_box': bbox,
                            'token': token,
                            'page_size': 200,
                        },
                        headers=headers)
print(response.status_code)
print(response.headers['CMR-Hits'])
granules = response.json()['feed']['entry']
for granule in granules:
    print(f'{granule["data_center"]} | {granule["dataset_id"]} | {granule["id"]}')
pprint(granules[0])
Get URLs to cloud data assets
https_urls = [l['href'] for l in granules[0]['links'] if 'https' in l['href'] and '.tif' in l['href']]
https_urls
s3_urls = [l['href'] for l in granules[0]['links'] if 's3' in l['href'] and '.tif' in l['href']]
s3_urls
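As one possible next step, an HTTPS asset could be downloaded with requests, assuming the hosting DAAC accepts an Earthdata Login bearer token for the asset; S3 URLs additionally require temporary S3 credentials from the DAAC, which are not covered in this notebook.

import os

# Download the first HTTPS asset using the Earthdata Login token
asset_url = https_urls[0]
r = requests.get(asset_url, headers={'Authorization': f'Bearer {token}'})
r.raise_for_status()

# Write the bytes to a local file named after the asset
with open(os.path.basename(asset_url), 'wb') as f:
    f.write(r.content)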