Introducing NASA earthaccess 🌍

TL;DR: earthaccess is a Python package to search, preview and access NASA datasets (on-prem or in the cloud) with a few lines of code.

Why?

Programmatic, Easy, Reproducible.

There are many ways to access NASA datasets: we can use the Earthdata search portal, we can use DAAC-specific portals or tools, and we could even use data.gov! Web portals are great, but they are not designed for programmatic access and reproducible workflows. This is extremely important in the age of the cloud and reproducible open science.

The good news is that NASA also exposes APIs that allow us to search, transform, and access data in a programmatic way. There are already several client libraries for these APIs, with amazing features and some overlap between them. In this context, earthaccess aims to be a simple library that handles the important parts of the metadata so we can access or download data without having to worry about whether a given dataset is on-prem or in the cloud.

How?

Note: There are a lot of acronyms we need to get familiar with before any of this makes sense; here is a brief glossary of NASA Earthdata terms: NASA glossary

Authentication: Before we can use earthaccess we need an account with NASA EDL.

Earthdata Login provides free and immediate access to thousands of EOSDIS data products covering all Earth science disciplines and topic areas for researchers, applied science users, application developers, and the general public.

Once we have our NASA EDL login credentials we can start accessing NASA data in a programmatic way.

import earthaccess
earthaccess.__version__
'0.3.0'
from earthaccess import Auth, Store, DataCollections, DataGranules
auth = Auth()

Auth()

earthaccess’s Auth class provides 3 different strategies to authenticate ourselves with NASA EDL.

  • netrc: Do we have a .netrc file with our EDL credentials? If so, we can use it with earthaccess. If we don't have one, earthaccess can create it for us: it will ask for our credentials and persist them into a .netrc file.
  • environment: If we have our EDL credentials as environment variables (see the sketch after this list)
    • EDL_USERNAME
    • EDL_PASSWORD
  • interactive: We will be asked for our EDL credentials, with optional persistence to .netrc
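
For example, if we keep our credentials in those environment variables, a minimal sketch of the environment strategy looks like this (the values below are placeholders, not real credentials; normally the variables would already be exported by the shell):

import os
from earthaccess import Auth

# Assumption: setting the variables from Python for illustration only
os.environ["EDL_USERNAME"] = "our_edl_username"  # placeholder
os.environ["EDL_PASSWORD"] = "our_edl_password"  # placeholder

auth = Auth()
auth.login(strategy="environment")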

To persist our credentials to a .netrc file we have to do the following:

auth.login(strategy="interactive", persist=True)
auth.login(strategy="netrc")
# are we authenticated?
print(auth.authenticated)
You're now authenticated with NASA Earthdata Login
Using token with expiration date: 07/24/2022
True
a = auth.login(strategy="environment")
# are we authenticated?
print(auth.authenticated)
We are already authenticated with NASA EDL
True

Querying for datasets

The DataCollections class can query CMR for any collection (dataset) using all of CMR’s Query parameters and has built-in functions to extract useful information from the response.

# The first step is to create a DataCollections query 
Query = DataCollections()

# Use chain methods to customize our query
Query.keyword('elevation').bounding_box(-134.7,58.9,-133.9,59.2).temporal("2020-01-01","2020-02-01")

print(f'Collections found: {Query.hits()}')

# Filter which UMM fields to return; to see the full record we omit the fields filter
# (the meta fields are always included)
collections = Query.fields(['ShortName','Version']).get(5)
# Inspect some results, printing just the ShortName and Version
collections[0:3]

The results from a DataCollections or DataGranules query are enhanced Python dictionaries; this means we can access all the keys and values like we usually do with Python dictionaries.

collections[0]["umm"]["ShortName"]

The DataCollections class returns Python dictionaries with some handy methods.

collection = collections[0]
collection.concept_id() # returns the concept-id, used to search for data granules
collection.abstract() # returns the abstract
collection.landing_page() # returns the landing page if present in the UMM fields
collection.get_data() # returns the portal where data can be accessed.

The same results can be obtained using the dict syntax:

collection["meta"]["concept-id"] # concept-id
collection["umm"]["RelatedUrls"] # URLs, with GET DATA, LANDING PAGE etc
# We can now search for collections using a pythonic API client for CMR.
Query = DataCollections().daac("PODAAC")

print(f'Collections found: {Query.hits()}')
collections = Query.fields(['ShortName']).get(10)
# Printing the first collection
collections[0]
# What if we only want cloud-hosted collections?
Query = DataCollections().daac("PODAAC").cloud_hosted(True)

print(f'Collections found: {Query.hits()}')
collections = Query.fields(['ShortName']).get(10)
# Printing the first collection
collections[0]
# Printing the concept-id for the first 10 collections
[collection.concept_id() for collection in collections]

Querying for data files (granules)

The DataGranules class provides similar functionality to the collections class. The most reliable way to query for granules is by concept-id; we can also search for data granules using a short name, but that could (and more likely will) return different versions of the same data granules.

In this example we’re querying for 5 data granules from the ICESat-2 ATL06 version 005 dataset.

Note: Generally speaking we won’t need authenticated queries unless the datasets are restricted to early adopters.

# We build our query
from pprint import pprint
Query = DataGranules().short_name('ATL06').version("005").bounding_box(-134.7,58.9,-133.9,59.2)
# We get 5 metadata records
granules = Query.get(5)
granules

Pretty printing data granules

Since we are in a notebook we can take advantage of it to see a more user-friendly version of the granules with the built-in function display. This will render the browse image for the granule if available, and will eventually have a representation similar to the one from the Earthdata search portal.

# printing 2 granules using display
[display(granule) for granule in granules]

Spatiotemporal queries

Our granules and collection classes accept the same spatial and temporal arguments as CMR so we can search for granules that match spatiotemporal criteria.

Query = DataGranules().short_name("ATL06").temporal("2020-03-01", "2020-03-30").bounding_box(-134.7,58.9,-133.9,59.2).version("005")
# Always inspect the hits before retrieving the granule metadata, because it can be very verbose.
print(f"Granules found: {Query.hits()}")
# Now we can print some info about these granules using the built-in methods
granules = Query.get(5)
data_links = [{'links': g.data_links(access="on_prem"), 'size (MB):': g.size()} for g in granules]
data_links
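
Since size() gives us the granule size in MB (as used above), we can do a quick sanity check of how much data we are about to download; a small sketch:

# Sum the reported sizes (in MB) of the granules we just retrieved
total_mb = sum(granule.size() for granule in granules)
print(f"Total size for {len(granules)} granules: {round(total_mb, 2)} MB")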

Accessing the data

With earthaccess a researcher can get the files with the same API call regardless of whether they are on-prem or cloud-based. An important consideration, however, is that if we want to access data in the cloud (direct access) we must run the code in the cloud. This is because some S3 buckets are configured to only allow direct access (s3:// links) if the requester is in the same region, us-west-2.
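
A quick way to check where our code is running is to ask the EC2 instance metadata service for the region; this is not part of earthaccess, just a hedged convenience sketch that assumes the standard (IMDSv1) metadata endpoint is reachable from our environment:

import requests

def running_in_us_west_2(timeout=1):
    # The metadata endpoint only answers from inside AWS; anywhere else this times out.
    try:
        region = requests.get(
            "http://169.254.169.254/latest/meta-data/placement/region",
            timeout=timeout,
        ).text
        return region == "us-west-2"
    except requests.exceptions.RequestException:
        return False

print(running_in_us_west_2())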

On-prem access: DAAC hosted data 📡

The Store() class will allow us to download or access our data and needs to be instantiated with our auth instance.

store = Store(auth)

For this example we are going to use a PODAAC dataset, SMAP_JPL_L3_SSS_CAP_8DAY-RUNNINGMEAN_V5, which we previously queried (see querying for datasets) to get the concept-id: C1972955240-PODAAC

Query = DataGranules().concept_id("C1972955240-PODAAC").bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {Query.hits()}")
# getting more than 6,000 metadata records for demo purposes is going to slow us down a bit so let's get only a few
granules = Query.get(10)
# Does this granule belong to a cloud-based collection?
granules[0].cloud_hosted

Finally! Let’s get the data

The Store class accepts the results from a DataGranules() query, or it can accept a list of URLs for the data files. In the second case we’ll have to specify the DAAC, since the store cannot infer which credentials to use solely from the URL.
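
For example, passing a list of URLs instead of query results might look like the sketch below (the URL is just a placeholder, and we assume get() takes the same provider keyword that open() uses later in this post):

urls = [
    "https://archive.podaac.earthdata.nasa.gov/some/path/granule.nc",  # placeholder URL, not a real file
]
files = store.get(urls, provider="PODAAC", local_path="./data/from-urls/")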

%%time
files = store.get(granules[0:4], "./data/C1972955240-PODAAC/")
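
Assuming get() returns the local paths of the downloaded files, we can quickly confirm what landed on disk:

from pathlib import Path

# List the downloaded files and their sizes on disk
for f in files:
    print(Path(f).name, Path(f).stat().st_size, "bytes")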

Accessing the data in the cloud ☁️

As we mentioned above, the cloud workflow uses the same API call; the only caveat is that direct access (s3:// links) requires running the code in the same region as the data, us-west-2.

Same API, just a different place: in this case the concept-id for the same dataset is C2208422957-POCLOUD.

Note: The concept-id changed even though it is the same dataset.


Query = DataGranules().concept_id("C2208422957-POCLOUD").bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {Query.hits()}")
cloud_granules = Query.get(10)
# is this a cloud hosted data granule?
cloud_granules[0].cloud_hosted
# Let's pretty print this
cloud_granules[0]
%%time
# If direct access fails, it is most likely because
# we are running this code outside the us-west-2 region.
try:
    files = store.get(cloud_granules[0:4], local_path="./data/demo-POCLOUD")
except Exception as e:
    print(f"Error: {e}, we are probably not running this code in the Amazon cloud. Trying external links...")
    # There is hope, even if we are not in the Amazon cloud we can still get the data
    files = store.get(cloud_granules[0:4], access="external", local_path="./data/demo-POCLOUD")

☁️ Cloud Access Part II: streaming data

Being in the cloud allows us to stream data as if we were using it locally. Pairing gridded datasets on S3 with xarray is a very useful pattern when we deal with a lot of data.

Recommended read: Skip the download! Stream NASA data directly into Python objects

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
import xarray as xr
https_links = []
s3_links = []

fs = store.get_s3fs_session('POCLOUD')

for granule in cloud_granules:
    https_links.extend(granule.data_links(access="on_prem"))
    s3_links.extend(granule.data_links(access="direct"))
s3_links
%%time

import xarray as xr

try:
    files = store.open(s3_links, provider="POCLOUD")

    ds_L3 = xr.open_mfdataset(
        files,
        combine='nested',
        concat_dim='time',
        decode_cf=True,
        coords='minimal',
        chunks={'time': 1}
        )
    ds_L3
except Exception as e:
    # Most likely we are not running in us-west-2, so direct S3 access fails
    print(f"Could not open the S3 links directly: {e}")
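
If the dataset opened successfully we can use ds_L3 like any other xarray Dataset; a short sketch (the smap_sss variable name is an assumption about this particular SMAP L3 product):

# Average the (assumed) smap_sss variable over time; .compute() triggers the lazy dask reads
sss_mean = ds_L3["smap_sss"].mean(dim="time").compute()
sss_mean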

Now to the important science! 🚀

Recap


from earthaccess import Auth, DataGranules, Store

# first we authenticate with NASA EDL
auth = Auth().login(strategy="netrc")

# Then we build a Query with spatiotemporal parameters
GranuleQuery = DataGranules().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2)

# We get the metadata records from CMR
granules = GranuleQuery.get()

# Now it's time to download (or open) our data granule list with get()
files = Store(auth).get(granules, local_path='./data')

# Now to the important science!