Programmatic, Easy, Reproducible.
There are many ways to access NASA datasets: we can use the Earthdata search portal, DAAC-specific portals or tools, or even data.gov! Web portals are great, but they are not designed for programmatic access and reproducible workflows. This is extremely important in the age of the cloud and open, reproducible science.
The good news is that NASA also exposes APIs that allow us to search, transform, and access data programmatically. Many of the libraries built on these APIs have amazing features and overlap in functionality. In this context, earthaccess aims to be a simple library that handles the important parts of the metadata so we can access or download data without having to worry about whether a given dataset is on-prem or in the cloud.
Note: There are a lot of acronyms that we need to get familiar with before any of this makes sense; here is a brief glossary for NASA Earthdata terms: NASA glossary
earthaccess
We need an account with NASA EDL (Earthdata Login). Earthdata Login provides free and immediate access to thousands of EOSDIS data products covering all Earth science disciplines and topic areas for researchers, applied science users, application developers, and the general public.
Once we have our NASA EDL login credentials we can start accessing NASA data in a programmatic way.
import earthaccess
earthaccess.__version__
'0.3.0'
from earthaccess import Auth, Store, DataCollections, DataGranules

auth = Auth()
earthaccess's Auth class provides 3 different strategies to authenticate ourselves with NASA EDL.
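As a quick reference, the three strategies can be invoked like this (a minimal sketch; the environment variable names follow current earthaccess conventions and may differ by version):

auth = Auth()
auth.login(strategy="netrc")        # read credentials from a ~/.netrc file
auth.login(strategy="environment")  # read EARTHDATA_USERNAME / EARTHDATA_PASSWORD (assumed names)
auth.login(strategy="interactive")  # prompt for credentials in the session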
Do we have a .netrc file with our EDL credentials? If so, we can use it with earthaccess. If we don't have one and want to create it, earthaccess allows us to type our credentials and persist them into a .netrc file.

To persist our credentials to a .netrc file we have to do the following:
="interactive", persist=True) auth.login(strategy
="netrc")
auth.login(strategy# are we authenticated?
print(auth.authenticated)
You're now authenticated with NASA Earthdata Login
Using token with expiration date: 07/24/2022
True
a = auth.login(strategy="environment")
# are we authenticated?
print(auth.authenticated)
We are already authenticated with NASA EDL
True
The DataCollections class can query CMR for any collection (dataset) using all of CMR's query parameters, and it has built-in functions to extract useful information from the response.
# The first step is to create a DataCollections query
Query = DataCollections()

# Use chain methods to customize our query
Query.keyword('elevation').bounding_box(-134.7,58.9,-133.9,59.2).temporal("2020-01-01","2020-02-01")
print(f'Collections found: {Query.hits()}')
# filtering which UMM fields to print; to see the full record we omit the fields filter
# (meta is always included)
collections = Query.fields(['ShortName','Version']).get(5)
# Inspect some results, printing just the ShortName and Version
collections[0:3]
The results from a DataCollections or DataGranules query are enhanced Python dictionaries, which means we can access all the keys and values just as we do with regular Python dictionaries.
0]["umm"]["ShortName"] collections[
The DataCollections class returns Python dictionaries with some handy methods.
# returns the concept-id, used to search for data granules
collection.concept_id()
# returns the abstract
collection.abstract()
# returns the landing page if present in the UMM fields
collection.landing_page()
# returns the portal where data can be accessed
collection.get_data()
The same results can be obtained using the dict syntax:

collection["meta"]["concept-id"]  # concept-id
collection["umm"]["RelatedUrls"]  # URLs, with GET DATA, LANDING PAGE etc.
# We can now search for collections using a pythonic API client for CMR.
Query = DataCollections().daac("PODAAC")
print(f'Collections found: {Query.hits()}')
collections = Query.fields(['ShortName']).get(10)
# Printing the first collection
collections[0]
# What if we want cloud collections?
Query = DataCollections().daac("PODAAC").cloud_hosted(True)
print(f'Collections found: {Query.hits()}')
collections = Query.fields(['ShortName']).get(10)
# Printing the first collection
collections[0]
# Printing the concept-id for the first 10 collections
[collection.concept_id() for collection in collections]
The DataGranules class provides similar functionality to the collections class. To query for granules reliably, the concept-id is the best key to use. We can search for granules using a short name, but that could (and more likely will) return granules from different versions of the same dataset.
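For example (a minimal sketch; hit counts are illustrative and will vary):

# a short name alone may match granules across several dataset versions
Query = DataGranules().short_name('ATL06')
print(Query.hits())
# pinning the version (or, better, using the collection concept-id) removes the ambiguity
Query = DataGranules().short_name('ATL06').version('005')
print(Query.hits())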
In this example we're querying for data granules from the ICESat-2 ATL06 version 005 dataset.
Note: Generally speaking, we won't need authenticated queries unless we are working with restricted datasets (e.g., early-adopter datasets).
# We build our query
from pprint import pprint
Query = DataGranules().short_name('ATL06').version("005").bounding_box(-134.7,58.9,-133.9,59.2)
# We get 5 metadata records
granules = Query.get(5)
granules
Since we are in a notebook, we can take advantage of it to see a more user-friendly version of the granules with the built-in function display(). This will render the granule's browse image if one is available, and will eventually offer a representation similar to the one in the Earthdata search portal.
# printing the granules using display
[display(granule) for granule in granules]
Our granules and collection classes accept the same spatial and temporal arguments as CMR, so we can search for granules that match spatiotemporal criteria.
Query = DataGranules().short_name("ATL06").temporal("2020-03-01", "2020-03-30").bounding_box(-134.7,58.9,-133.9,59.2).version("005")
# Always inspect the hits before retrieving the granule metadata; it can be very verbose
print(f"Granules found: {Query.hits()}")
# Now we can print some info about these granules using the built-in methods
granules = Query.get(5)
data_links = [{'links': g.data_links(access="on_prem"), 'size (MB):': g.size()} for g in granules]
data_links
With earthaccess, a researcher can get the files with the same API call regardless of whether they are on-prem or cloud-hosted. One important consideration: if we want to access data in the cloud (direct access), we must run the code in the cloud. This is because some S3 buckets are configured to only allow direct access (s3:// links) if the requester is in the same region, us-west-2.
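For example, the same granule metadata usually exposes both kinds of links (a minimal sketch; whether both types are present depends on the collection):

granule = granules[0]
# https:// links work from anywhere
granule.data_links(access="on_prem")
# s3:// links only work from within us-west-2
granule.data_links(access="direct")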
The Store() class will allow us to download or access our data; it needs to be instantiated with our auth instance.
store = Store(auth)
For this example we are going to use the PO.DAAC dataset SMAP_JPL_L3_SSS_CAP_8DAY-RUNNINGMEAN_V5, which we previously queried (see querying for datasets) and whose concept id is C1972955240-PODAAC.
Query = DataGranules().concept_id("C1972955240-PODAAC").bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {Query.hits()}")
# getting more than 6,000 metadata records for demo purposes is going to slow us down a bit, so let's get only a few
granules = Query.get(10)
# Does this granule belong to a cloud-based collection?
granules[0].cloud_hosted
The Store class accepts the results from a DataGranules() query, or it can accept a list of URLs for the data files. In the second case we'll have to specify the DAAC, since the right credentials to use cannot be inferred from the URL alone.
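For example (a sketch: the URL below is hypothetical, and we assume the provider keyword is how this version of the library identifies the DAAC):

# with plain URLs the Store cannot infer the DAAC, so we pass it explicitly
urls = ["https://archive.podaac.earthdata.nasa.gov/some/granule/file.nc"]  # hypothetical link
files = store.get(urls, provider="POCLOUD", local_path="./data/from-urls")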
%%time
files = store.get(granules[0:4], "./data/C1972955240-PODAAC/")
As noted above, with earthaccess the same API call works whether the files are on-prem or cloud-hosted; the caveat remains that direct (s3://) access requires running the code in us-west-2.
Same API, just a different place: in this case the concept-id for the same dataset is C2208422957-POCLOUD.

Note: The concept-id changed even though it is the same dataset.
Query = DataGranules().concept_id("C2208422957-POCLOUD").bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {Query.hits()}")
cloud_granules = Query.get(10)
# is this a cloud-hosted data granule?
cloud_granules[0].cloud_hosted
# Let's pretty print this
cloud_granules[0]
%%time
# If we get an error with direct_access=True, it is most likely because
# we are running this code outside the us-west-2 region.
try:
    files = store.get(cloud_granules[0:4], local_path="./data/demo-POCLOUD")
except Exception as e:
    print(f"Error: {e}, we are probably not running this code in the Amazon cloud. Trying external links...")
    # There is hope: even if we are not in the Amazon cloud we can still get the data
    files = store.get(cloud_granules[0:4], access="external", local_path="./data/demo-POCLOUD")
Being in the cloud allows us to stream data as if we were using it locally. Pairing gridded datasets on S3 with xarray is a very useful pattern when we deal with a lot of data.
Recommended read: Skip the download! Stream NASA data directly into Python objects
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

import xarray as xr
# data_links
https_links = []
s3_links = []

fs = store.get_s3fs_session('POCLOUD')

for granule in cloud_granules:
    https_links.extend(granule.data_links(access="on_prem"))
    s3_links.extend(granule.data_links(access="direct"))
s3_links
%%time
import xarray as xr

try:
    files = store.open(s3_links, provider="POCLOUD")

    ds_L3 = xr.open_mfdataset(
        files,
        combine='nested',
        concat_dim='time',
        decode_cf=True,
        coords='minimal',
        chunks={'time': 1}
    )
    ds_L3
except Exception as e:
    pass
    # print(e)
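If we only need one file, the same pattern works without open_mfdataset (a sketch under the same assumptions, i.e. we are running in us-west-2 and s3_links is non-empty):

# open a single S3 object and load it lazily with xarray
file_objs = store.open(s3_links[0:1], provider="POCLOUD")
ds = xr.open_dataset(file_objs[0])
ds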
from earthaccess import Auth, DataGranules, Store
# first we authenticate with NASA EDL
auth = Auth().login(strategy="netrc")

# Then we build a Query with spatiotemporal parameters
GranuleQuery = DataGranules().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2)

# We get the metadata records from CMR
granules = GranuleQuery.get()

# Now it's time to download (or open) our data granules list with get()
files = Store(auth).get(granules, local_path='./data')

# Now to the important science!