How do I subset data granules?
How do I subset a data granule using Harmony?
Install the harmony-py package:
# Install harmony-py
pip install -U harmony-py
Import packages:
import datetime as dt
from harmony import BBox, Client, Collection, Request, LinkType
import s3fs
import xarray as xr
Set up Harmony client and authentication
We will authenticate the following Harmony request using a netrc file. See the appendix for more information on Earthdata Login and netrc setup. The basic line below that creates a Harmony Client assumes a .netrc file is available.
harmony_client = Client()
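For reference, a minimal .netrc entry for Earthdata Login looks like the following (replace the placeholders with your own Earthdata Login username and password):
machine urs.earthdata.nasa.gov login <username> password <password>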
Create and submit Harmony request
We are interested in the GHRSST Level 4 MUR Global Foundation Sea Surface Temperature Analysis dataset. We are subsetting over the Pacific Ocean to the west of Mexico during 1:00 - 2:00 on 10 March 2021. The dataset is organized into daily files, so while we are specifying a single hour in our request, this will return that full day’s worth of data.
dataset_short_name = 'MUR-JPL-L4-GLOB-v4.1'

request = Request(
    collection=Collection(id=dataset_short_name),
    spatial=BBox(-125.469, 15.820, -99.453, 35.859),
    temporal={
        'start': dt.datetime(2021, 3, 10, 1),
        'stop': dt.datetime(2021, 3, 10, 2)
    }
)

job_id = harmony_client.submit(request)
harmony_client.wait_for_processing(job_id)
Open and read the subsetted file in xarray
Harmony data outputs can be accessed within the cloud using the s3 URLs and AWS credentials provided in the Harmony job response. Using aws_credentials, we can retrieve the credentials needed to access the Harmony s3 staging bucket and its contents. We then use the s3fs package to create a file system that can be read by xarray.
results = harmony_client.result_urls(job_id, link_type=LinkType.s3)
urls = list(results)
url = urls[0]

creds = harmony_client.aws_credentials()

s3_fs = s3fs.S3FileSystem(
    key=creds['aws_access_key_id'],
    secret=creds['aws_secret_access_key'],
    token=creds['aws_session_token'],
    client_kwargs={'region_name': 'us-west-2'},
)

f = s3_fs.open(url, mode='rb')
ds = xr.open_dataset(f)
ds
Plot data
Use xarray's built-in plotting function to create a simple plot along the x and y dimensions of the dataset:
ds.analysed_sst.plot();
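If the subsetted granule retains its singleton time dimension and the call above complains, a minor variant is to drop that dimension explicitly before plotting:
ds.analysed_sst.squeeze().plot();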
R code coming soon!
# Coming soon!
This example assumes that you have experimented with and understood the tutorial MATLAB Access Single NASA EarthData L2 NetCDF presented in this Cookbook.
To subset a dataset, either read it over its full extent and subset it using MATLAB indexing (a posteriori, client-side), or use the start, count, and stride parameters of the function h5read to read only specific regions/samples (hyperslabs) of it (server-side). The code snippets that follow present both approaches, for accessing a dataset of sea surface temperature.
Follow the authentication procedure presented in the tutorial:
addpath("<PATH_TO_FUNCTION_LOADAWSCREDENTIALSENPOINT>") ;
daacCredentialsEndpoint = "https://archive.podaac.earthdata.nasa.gov/s3credentials";
loadAWSCredentials(daacCredentialsEndpoint);
Define relevant file and dataset:
filePath = "s3://podaac-ops-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20100619062008-JPL-L2P_GHRSST-SSTskin-MODIS_A-N-v02.0-fv01.0.nc";
datasetName = "sea_surface_temperature_4um";
datasetPath = "/" + datasetName;
Read the full dataset, longitudes, and latitudes. Replace fill values with NaN (so they are not displayed by MATLAB), and build a contour plot:
fill_value = h5readatt(filePath,datasetPath,"_FillValue");
name = h5readatt(filePath,datasetPath,"long_name");
data = h5read(filePath,datasetPath);
data(data == fill_value) = NaN;
lat = h5read(filePath,"/lat");
lon = h5read(filePath,"/lon");
contour(lon,lat,data);
title(name);
Subset/sub-sample full dataset with a step of 10 (a posteriori) using MATLAB indexing:
step = 10;
dataSize = size(data);
rowIx = 1 : step : dataSize(1);
colIx = 1 : step : dataSize(2);
contour(lon(rowIx,colIx),lat(rowIx,colIx),data(rowIx,colIx));
title(name + " (sub-sampled through MATLAB indexing)");
Use the start, count, and stride parameters of the function h5read to read only specific regions/samples (hyperslabs) of it (server-side). These parameters must be defined as vectors whose size matches the number of dimensions of the dataset. Here, we want to read data starting at index 1, with a stride/step of 10, and cover the full extent of the dataset (count = Inf). Based on our experiments above, we assume that the dataset is a 2D array:
dataStride10 = h5read(filePath,datasetPath,[1,1],[Inf,Inf],[step,step]);
Running this leads to the following error:
Error using h5readc
The START, COUNT, and STRIDE parameters must be double precision, positive, and have length equal to the rank of the dataset.
This error message is too succinct to be helpful. Try to get more information about the dataset using h5disp, which is cloud-enabled, unlike ncdisp:
h5disp(filePath,datasetPath)
OUTPUT:
HDF5 20100619062008-JPL-L2P_GHRSST-SSTskin-MODIS_A-N-v02.0-fv01.0.nc
Dataset 'sea_surface_temperature_4um'
Size: 1354x2030x1 <--- Shows a 3rd (singleton) dimension
MaxSize: 1354x2030x1
Datatype: H5T_STD_I16LE (int16)
ChunkSize: 677x1015x1
Filters: deflate(5)
FillValue: -32767
Attributes:
'long_name': 'sea surface temperature'
'units': 'kelvin'
'_FillValue': -32767
'valid_min': -1000
'valid_max': 10000
'comment': 'sea surface temperature from mid-IR (4 um) channels; non L2P core field'
'scale_factor': 0.005000
'add_offset': 273.149994
'coordinates': 'lon lat'
'coverage_content_type': 'physicalMeasurement'
'DIMENSION_LIST': H5T_VLEN
This output shows a 3rd (singleton) dimension, but it still does not explain what it is.
One workaround is to display more of the content of the file through h5disp and/or h5info and try to get relevant information about the dimensions. The following shows a simpler approach, however: create a local copy of the file and analyze it using ncdisp. This is possible because MATLAB I/O functions, in particular copyfile, are cloud-enabled:
copyfile(filePath,"test.nc");
ncdisp("test.nc",datasetName);
OUTPUT:
Source:
.../test.nc
Format:
netcdf4
Dimensions:
ni = 1354
nj = 2030
time = 1
Variables:
sea_surface_temperature_4um
Size: 1354x2030x1
Dimensions: ni,nj,time
Datatype: int16
Attributes:
...
This makes it clear that the 3rd dimension is time, with a size of 1.
Subset the dataset using h5read and vectors of 3 elements that define start, count, and stride for the 3 dimensions:
dataStride10 = h5read(filePath,datasetPath,[1,1,1],[Inf,Inf,Inf],[step,step,1]);
dataStride10(dataStride10 == fill_value) = NaN;
latStride10 = h5read(filePath,"/lat",[1,1],[Inf,Inf],[step,step]);
lonStride10 = h5read(filePath,"/lon",[1,1],[Inf,Inf],[step,step]);
contour(lonStride10,latStride10,dataStride10);
title(name + " (sub-sampled through H5READ stride parameter)");
We can target a specific region using the same two approaches. The following shows the client-side approach: create an array of logicals that "flags" the region of interest, identify the relevant rows and columns of the data, latitude, and longitude arrays, and use logical indexing to plot the data (a server-side sketch follows the snippet below):
latBounds = [31,33.5];
lonBounds = [-64,-57.5];
isRegion = lat>=latBounds(1) & lat<=latBounds(2) & lon>=lonBounds(1) & lon<=lonBounds(2);
isRow = any(isRegion,2);
isCol = any(isRegion,1);
contour(lon(isRow,isCol),lat(isRow,isCol),data(isRow,isCol));
xlim(lonBounds);
ylim(latBounds);
title(name + " (selected region)");
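The same region can also be read server-side with h5read. The following is a minimal sketch of that variant, assuming the rows and columns flagged by isRow and isCol form one contiguous block; the start and count vectors are derived from those masks, and the 3rd (time) dimension has a size of 1:
% Sketch of the server-side variant: derive start/count from the logical masks.
% Assumes the region of interest maps to a contiguous block of rows/columns.
rowStart = find(isRow,1,"first");
rowCount = find(isRow,1,"last") - rowStart + 1;
colStart = find(isCol,1,"first");
colCount = find(isCol,1,"last") - colStart + 1;
dataRegion = h5read(filePath,datasetPath,[rowStart,colStart,1],[rowCount,colCount,1]);
dataRegion(dataRegion == fill_value) = NaN;
latRegion = h5read(filePath,"/lat",[rowStart,colStart],[rowCount,colCount]);
lonRegion = h5read(filePath,"/lon",[rowStart,colStart],[rowCount,colCount]);
contour(lonRegion,latRegion,dataRegion);
title(name + " (selected region, read server-side)");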
With wget and curl:
# Coming soon!