Using S3 Bucket Storage in NASA-Openscapes Hub

Overview

When you are working in the NASA Openscapes Hub, there are strategies you can use to manage your storage for both cost and performance. The default storage location is the HOME directory (/home/jovyan/) mounted to the compute instance (the cloud computer doing the computations). The Hub uses an EC2 compute instance, with the HOME directory mounted to AWS Elastic File System (EFS) storage. This drive is handy because it persists across server restarts, which makes it a great place to store your code. However, the HOME directory is not a good place to store data: it is expensive, and it can be slow to read from and write to.

To that end, the hub provides every user access to two AWS S3 buckets - a “scratch” bucket for short-term storage, and a “persistent” bucket for longer-term storage. S3 buckets offer fast reads and writes, and storing data in them is relatively inexpensive compared to your HOME directory. A useful way to think of S3 buckets in relation to your compute instance is like attaching a cheap but fast external hard drive to your expensive laptop.

One other thing to note about these buckets is that all hub users can access each other’s user directories. The buckets are accessible only when you are working inside the hub, via the following environment variables (see the short example after this list):

  • $SCRATCH_BUCKET pointing to s3://openscapeshub-scratch/[your-username]
    • Scratch buckets are designed for storage of temporary files, e.g. intermediate results. Objects stored in the scratch bucket are automatically removed 7 days after they are created.
  • $PERSISTENT_BUCKET pointing to s3://openscapeshub-persistent/[your-username]
    • Persistent buckets are designed for storing data that is consistently used throughout the lifetime of a project. There is no automatic purging of objects in persistent buckets, so it is the responsibility of the hub admin and/or hub users to delete objects when they are no longer needed to minimize cloud billing costs.
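For example, from a Python session inside the hub you can read these paths from the environment; this is a minimal sketch using the same variables that appear later in this tutorial:

import os

# These variables are set only inside the hub; elsewhere they will be unset (None)
scratch = os.environ.get("SCRATCH_BUCKET")        # s3://openscapeshub-scratch/[your-username]
persistent = os.environ.get("PERSISTENT_BUCKET")  # s3://openscapeshub-persistent/[your-username]
print(scratch, persistent, sep="\n")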

We can interact with these buckets in Python using the boto3 and/or s3fs packages, or in a terminal with the awsv2 CLI tool. This tutorial will focus on using the s3fs package. See this page for more information on using S3 buckets in a 2i2c hub, and for tips on using the aws CLI tool.
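For comparison, here is a minimal sketch of the same kind of listing done with boto3 rather than s3fs; the bucket/prefix parsing below is purely illustrative, and everything that follows in this tutorial uses s3fs:

import os
import boto3

# Split "s3://bucket/prefix" into its bucket and prefix parts
bucket, _, prefix = os.environ["SCRATCH_BUCKET"].removeprefix("s3://").partition("/")

# List any objects under your user prefix using the hub's AWS credentials
s3_client = boto3.client("s3")
response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    print(obj["Key"])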

Reading and writing to the $SCRATCH_BUCKET

We will start by accessing the same data we did in the Earthdata Cloud Clinic - reading it into memory as an xarray object and subsetting it.

import earthaccess
import xarray as xr
import hvplot.xarray  # registers the .hvplot plotting accessor on xarray objects
import os
import tempfile
import s3fs  # filesystem-style access to AWS S3

auth = earthaccess.login()
data_name = "SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205"

results = earthaccess.search_data(
    short_name=data_name,
    cloud_hosted=True,
    temporal=("2021-07-01", "2021-09-30"),
)
Granules found: 18
ds = xr.open_mfdataset(earthaccess.open(results))
ds
Opening 18 granules, approx size: 0.16 GB
using endpoint: https://archive.podaac.earthdata.nasa.gov/s3credentials
<xarray.Dataset> Size: 299MB
Dimensions:      (Time: 18, Longitude: 2160, nv: 2, Latitude: 960)
Coordinates:
  * Longitude    (Longitude) float32 9kB 0.08333 0.25 0.4167 ... 359.8 359.9
  * Latitude     (Latitude) float32 4kB -79.92 -79.75 -79.58 ... 79.75 79.92
  * Time         (Time) datetime64[ns] 144B 2021-07-05T12:00:00 ... 2021-09-2...
Dimensions without coordinates: nv
Data variables:
    Lon_bounds   (Time, Longitude, nv) float32 311kB dask.array<chunksize=(1, 2160, 2), meta=np.ndarray>
    Lat_bounds   (Time, Latitude, nv) float32 138kB dask.array<chunksize=(1, 960, 2), meta=np.ndarray>
    Time_bounds  (Time, nv) datetime64[ns] 288B dask.array<chunksize=(1, 2), meta=np.ndarray>
    SLA          (Time, Latitude, Longitude) float32 149MB dask.array<chunksize=(1, 960, 2160), meta=np.ndarray>
    SLA_ERR      (Time, Latitude, Longitude) float32 149MB dask.array<chunksize=(1, 960, 2160), meta=np.ndarray>
Attributes: (12/21)
    Conventions:            CF-1.6
    ncei_template_version:  NCEI_NetCDF_Grid_Template_v2.0
    Institution:            Jet Propulsion Laboratory
    geospatial_lat_min:     -79.916664
    geospatial_lat_max:     79.916664
    geospatial_lon_min:     0.083333336
    ...                     ...
    version_number:         2205
    Data_Pnts_Each_Sat:     {"16": 743215, "1007": 674076}
    source_version:         commit 58c7da13c0c0069ae940c33a82bf1544b7d991bf
    SLA_Global_MEAN:        0.06428374482174487
    SLA_Global_STD:         0.0905195660534004
    latency:                final
ds_subset = ds['SLA'].sel(Latitude=slice(15.8, 35.9), Longitude=slice(234.5,260.5)) 
ds_subset
<xarray.DataArray 'SLA' (Time: 18, Latitude: 120, Longitude: 156)> Size: 1MB
dask.array<getitem, shape=(18, 120, 156), dtype=float32, chunksize=(1, 120, 156), chunktype=numpy.ndarray>
Coordinates:
  * Longitude  (Longitude) float32 624B 234.6 234.8 234.9 ... 260.1 260.2 260.4
  * Latitude   (Latitude) float32 480B 15.92 16.08 16.25 ... 35.42 35.58 35.75
  * Time       (Time) datetime64[ns] 144B 2021-07-05T12:00:00 ... 2021-09-28T...
Attributes:
    units:          m
    long_name:      Sea Level Anomaly Estimate
    standard_name:  sea_surface_height_above_sea_level
    alias:          sea_surface_height_above_sea_level

Home directory

Imagine this ds_subset object is now an important intermediate dataset, or the result of a complex analysis, and we want to save it. Our default action might be to save it to our HOME directory. This is simple, but we want to avoid it: it incurs significant storage costs, and reading the data back later will be slow.

ds_subset.to_netcdf("test.nc") # avoid writing to home directory like this

Use the s3fs package to interact with our S3 buckets

s3fs is a Python library that allows us to interact with S3 objects in a file-system-like manner.

# Create a S3FileSystem class
s3 = s3fs.S3FileSystem()

# Get scratch and persistent buckets
scratch = os.environ["SCRATCH_BUCKET"]
persistent = os.environ["PERSISTENT_BUCKET"]

print(scratch)
print(persistent)
s3://openscapeshub-scratch/ateucher
s3://openscapeshub-persistent/ateucher

Our user-specific directories in the two buckets aren’t actually created until we put something in them, so if we try to check for their existence or list their contents before they are created, we will get an error. We will use the S3FileSystem.touch() method to place a simple empty file called .placeholder in each one to bring them into existence.

s3.touch(f"{scratch}/.placeholder")

s3.ls(scratch)
['openscapeshub-scratch/ateucher/.placeholder']

and in our persistent bucket:

s3.touch(f"{persistent}/.placeholder")

s3.ls(persistent)
['openscapeshub-persistent/ateucher/.placeholder']

(Note that adding these placeholders isn’t strictly necessary, as the first time you write anything to these buckets they will be created.)
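If you prefer to guard against that error rather than create a placeholder file, a minimal sketch is to check whether your directory exists before listing it (this reuses the s3 and scratch objects defined above):

# A brand-new, empty user prefix won't exist yet, so check before listing
if s3.exists(scratch):
    print(s3.ls(scratch))
else:
    print("Nothing in the scratch directory yet - write a file to create it")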

Save dataset as netcdf file in SCRATCH bucket

Next we can save ds_subset as a netcdf file in our scratch bucket. This involves writing to a temporary local file first, and then uploading that file to the SCRATCH bucket:

# Where we want to store it:
scratch_nc_file_path = f"{scratch}/test123.nc"

# Write to a temporary local file, then upload it to the scratch bucket
with tempfile.NamedTemporaryFile(suffix=".nc") as tmp:
    ds_subset.to_netcdf(tmp.name)  # save to a temporary local file
    s3.put(tmp.name, scratch_nc_file_path)  # upload that file to the scratch bucket

# Ensure the file is there
s3.ls(scratch)
['openscapeshub-scratch/ateucher/.placeholder',
 'openscapeshub-scratch/ateucher/test123.nc']

And we can open it to ensure it worked:

ds_from_scratch = xr.open_dataarray(s3.open(scratch_nc_file_path))

ds_from_scratch
<xarray.DataArray 'SLA' (Time: 18, Latitude: 120, Longitude: 156)> Size: 1MB
[336960 values with dtype=float32]
Coordinates:
  * Longitude  (Longitude) float32 624B 234.6 234.8 234.9 ... 260.1 260.2 260.4
  * Latitude   (Latitude) float32 480B 15.92 16.08 16.25 ... 35.42 35.58 35.75
  * Time       (Time) datetime64[ns] 144B 2021-07-05T12:00:00 ... 2021-09-28T...
Attributes:
    units:          m
    long_name:      Sea Level Anomaly Estimate
    standard_name:  sea_surface_height_above_sea_level
    alias:          sea_surface_height_above_sea_level
ds_from_scratch.hvplot.image(x='Longitude', y='Latitude', cmap='RdBu', clim=(-0.5, 0.5), title="Sea Level Anomaly Estimate (m)")

Move data to the persistent bucket

If we decide this is a file we want to keep around for a longer time period, we can move it to our persistent bucket. We can even make a subdirectory in our persistent bucket to keep us organized:

persistent_dest_dir = f"{persistent}/my-analysis-data/"

# Make directory in persistent bucket
s3.mkdir(persistent_dest_dir)

# Move the file
s3.mv(scratch_nc_file_path, persistent_dest_dir)

# Check the scratch and persistent bucket listings:
s3.ls(scratch)
['openscapeshub-scratch/ateucher/.placeholder']
s3.ls(persistent)
['openscapeshub-persistent/ateucher/.placeholder',
 'openscapeshub-persistent/ateucher/my-analysis-data']
s3.ls(persistent_dest_dir)
['openscapeshub-persistent/ateucher/my-analysis-data/test123.nc']
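Because nothing in the persistent bucket is purged automatically, remember to delete objects you no longer need. A minimal sketch using s3fs, with the paths created above (recursive removal deletes a directory and everything in it, so double-check the path first):

# Remove the example file created above
s3.rm(f"{persistent}/my-analysis-data/test123.nc")

# To remove a whole directory and its contents instead, pass recursive=True, e.g.:
# s3.rm(persistent_dest_dir, recursive=True)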

Move existing data from HOME to PERSISTENT_BUCKET

You may already have some data in your HOME directory that you would like to move out to your persistent bucket. You can do this using the awsv2 s3 command line tool, which is already installed on the hub. Open a terminal from the Hub Launcher (it will open in your HOME directory) and use the awsv2 s3 mv command to move files to your bucket.

Move a single file from HOME to PERSISTENT_BUCKET:

$ awsv2 s3 mv my-big-file.nc $PERSISTENT_BUCKET/ # The trailing slash is important here
move: ./my-big-file.nc to s3://openscapeshub-persistent/ateucher/my-big-file.nc

Move a directory of data from HOME to PERSISTENT_BUCKET

List the contents of the local results-data directory:

$ ls results-data/
my-big-file1.nc  my-big-file2.nc

Use awsv2 s3 mv with the --recursive flag to move all files in a directory to a new directory in PERSISTENT_BUCKET:

$ awsv2 s3 mv --recursive results-data $PERSISTENT_BUCKET/results-data/
move: results-data/my-big-file1.nc to s3://openscapeshub-persistent/ateucher/results-data/my-big-file1.nc
move: results-data/my-big-file2.nc to s3://openscapeshub-persistent/ateucher/results-data/my-big-file2.nc
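
Back in Python, the same s3fs listing pattern used earlier can confirm that the files arrived (assuming the s3 and persistent objects from this tutorial are still defined):

# List the directory we just populated from the terminal
s3.ls(f"{persistent}/results-data/")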