Skip to content

Remote data access tutorial for CMIP6 Zarr data #132

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Aug 23, 2022
331 changes: 331 additions & 0 deletions intermediate/remote_data_access.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,331 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0ebd8a4d-6937-4ad6-9c93-fa944fb389c1",
"metadata": {},
"source": [
"# Accessing remote data stored on the cloud\n",
"\n",
"In this tutorial, we'll cover the following:\n",
"- Finding a cloud hosted Zarr archive of CMIP6 dataset(s)\n",
"- Remote data access to a single CMIP6 dataset (sea surface height)\n",
"- Calculate future predicted sea level change in 2100 compared to 2015"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7533f0e-5dd1-423a-9a04-8ed755d180a2",
"metadata": {},
"outputs": [],
"source": [
"import gcsfs\n",
"import pandas as pd\n",
"import xarray as xr"
]
},
{
"cell_type": "markdown",
"id": "95002377-b0a6-479f-928d-53b044b390df",
"metadata": {},
"source": [
"## Finding cloud native data\n",
"\n",
"Cloud-native data means data that is structured for efficient querying across the network.\n",
"Typically, this means having metadata that describes the entire file in the header of the\n",
"file, or having a a separate pointer file (so that there is no need to download everything first).\n",
"\n",
"Quite commonly, you'll see cloud-native datasets stored on these\n",
"three object storage providers, though there are many other ones too.\n",
"\n",
"- [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3)\n",
"- [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs)\n",
"- [Google Cloud Storage](https://cloud.google.com/storage)"
]
},
{
"cell_type": "markdown",
"id": "bc520e32-204f-4f92-bdec-4f678160d6de",
"metadata": {},
"source": [
"### Getting cloud hosted CMIP6 data\n",
"\n",
"The [Coupled Model Intercomparison Project Phase 6 (CMIP6)](https://en.wikipedia.org/wiki/CMIP6#CMIP_Phase_6)\n",
"dataset is a rich archive of modelling experiments carried out to predict the climate change impacts.\n",
"The datasets are stored using the [Zarr](https://zarr.dev) format, and we'll go over how to access it.\n",
"\n",
"Sources:\n",
"- https://esgf-node.llnl.gov/search/cmip6/\n",
"- CMIP6 data hosted on Google Cloud - https://console.cloud.google.com/marketplace/details/noaa-public/cmip6"
]
},
{
"cell_type": "markdown",
"id": "8d12400d-ab5e-420e-b9f5-b61e083dc9ce",
"metadata": {},
"source": [
"First, let's open a CSV containing the list of CMIP6 datasets available"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1d9f94c-dbe3-4151-8ee7-fa182724810b",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv\")\n",
"print(f\"Number of rows: {len(df)}\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "eb263332-dc60-4bd1-9ef3-cf9612cf09a1",
"metadata": {},
"source": [
"Over 5 million rows! Let's filter it down to the variable and experiment\n",
"we're interested in, e.g. sea surface height.\n",
"\n",
"For the `variable_id`, you can look it up given some keyword at\n",
"https://docs.google.com/spreadsheets/d/1UUtoz6Ofyjlpx5LdqhKcwHFz2SGoTQV2_yekHyMfL9Y/edit#gid=1221485271\n",
"\n",
"For the `experiment_id`, download the spreadsheet from\n",
"https://github.com/ES-DOC/esdoc-docs/blob/master/cmip6/experiments/spreadsheet/experiments.xlsx,\n",
"go to the 'experiment' tab, and find the one you're interested in.\n",
"\n",
"Another good place to find the right model runs is https://esgf-node.llnl.gov/search/cmip6\n",
"(once you get your head around the acronyms and short names)."
]
},
{
"cell_type": "markdown",
"id": "9b435c14-fd56-481c-b5f4-781794a1cc1a",
"metadata": {},
"source": [
"Below, we'll filter to CMIP6 experiments matching:\n",
"- Sea Surface Height Above Geoid [m] (variable_id: `zos`)\n",
"- Shared Socioeconomic Pathway 5 (experiment_id: `ssp585`)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2fe50e53-b02f-4a84-bc4a-e1934fe32661",
"metadata": {},
"outputs": [],
"source": [
"df_zos = df.query(\"variable_id == 'zos' & experiment_id == 'ssp585'\")\n",
"df_zos"
]
},
{
"cell_type": "markdown",
"id": "9ddfad3e-d4de-4c0a-be6f-53f1f7928f51",
"metadata": {},
"source": [
"There's 274 modelled scenarios for SSP5.\n",
"Let's just get the URL to the first one in the list for now."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5515186d-8571-439a-b5a8-b8b56aab77f6",
"metadata": {},
"outputs": [],
"source": [
"print(df_zos.zstore.iloc[0])"
]
},
{
"cell_type": "markdown",
"id": "b68bcfbb-24c9-420d-b297-44c678b7f8ce",
"metadata": {},
"source": [
"## Reading from the remote Zarr storage"
]
},
{
"cell_type": "markdown",
"id": "b3f5660d-bd46-44f6-8f6d-a62947b6f2c4",
"metadata": {},
"source": [
"In many cases, you'll need to first connect to the cloud provider.\n",
"The CMIP6 dataset allows anonymous access, but for some cases,\n",
"you may need to authentication."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4e6d5e3-35a0-4c31-a1b8-96258cf50974",
"metadata": {},
"outputs": [],
"source": [
"fs = gcsfs.GCSFileSystem(token=\"anon\")"
]
},
{
"cell_type": "markdown",
"id": "b959f829-e434-4a84-82d2-2f2b24dc84d2",
"metadata": {},
"source": [
"Next, we'll need a mapping to the Google Storage object.\n",
"This can be done using `fs.get_mapper`.\n",
"\n",
"A more generic way (for other cloud providers) is to use\n",
"[`fsspec.get_mapper`](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.get_mapper) instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1527d1f-503e-4b0b-8433-794067ed46cc",
"metadata": {},
"outputs": [],
"source": [
"store = fs.get_mapper(\n",
" \"gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b694baac-9259-4de8-8eae-ac3cb653d894",
"metadata": {},
"source": [
"With that, we can open the Zarr store like so."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74b6d289-a852-4216-a3b6-4483d5ff2854",
"metadata": {},
"outputs": [],
"source": [
"ds = xr.open_zarr(store=store, consolidated=True)\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "d81a5958-517b-4215-8c02-b1083b4b4fe2",
"metadata": {},
"source": [
"### Selecting time slices\n",
"\n",
"Let's say we want to calculate sea level change between\n",
"2015 and 2100. We can access just the specific time points\n",
"needed using [`xr.Dataset.sel`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.sel.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1101b455-ba65-4cab-a3b6-bf2601958400",
"metadata": {},
"outputs": [],
"source": [
"zos_2015jan = ds.zos.sel(time=\"2015-01-16\").isel(time=0)\n",
"zos_2100dec = ds.zos.sel(time=\"2100-12-16\").isel(time=0)"
]
},
{
"cell_type": "markdown",
"id": "fb8d90a2-9883-41da-b26c-7b5547a15270",
"metadata": {},
"source": [
"Sea level change would just be 2100 minus 2015."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f5fa1ee-260c-4ec4-898a-230826f9f2c8",
"metadata": {},
"outputs": [],
"source": [
"sealevelchange = zos_2100dec - zos_2015jan"
]
},
{
"cell_type": "markdown",
"id": "0e087f3b-0315-40db-ae03-a3393b49c30e",
"metadata": {},
"source": [
"Note that up to this point, we have not actually downloaded any\n",
"(big) data yet from the cloud. This is all working based on\n",
"metadata only.\n",
"\n",
"To bring the data from the cloud to your local computer, call `.compute`.\n",
"This will take a while depending on your connection speed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "38c2152e-67e7-449e-8f1a-2d64f63dedda",
"metadata": {},
"outputs": [],
"source": [
"sealevelchange = sealevelchange.compute()"
]
},
{
"cell_type": "markdown",
"id": "5226729f-07db-4fe6-a980-9a1f630c8277",
"metadata": {},
"source": [
"We can do a quick plot to show how Sea Level is predicted to change\n",
"between 2015-2100 (from one modelled experiment)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c42ed9f-fc61-4762-9765-3dd553d5c2ad",
"metadata": {},
"outputs": [],
"source": [
"sealevelchange.plot.imshow()"
]
},
{
"cell_type": "markdown",
"id": "b4361786-c889-4ae7-a704-dcbda50513da",
"metadata": {},
"source": [
"Notice the blue parts between -40 and -60 South where sea level has dropped?\n",
"That's to do with the Antarctic ice sheet losing mass and resulting in a lower\n",
"gravitational pull, resulting in a relative decrease in sea level. Over most\n",
"of the Northern Hemisphere though, sea level rise has increased between 2015 and 2100."
]
},
{
"cell_type": "markdown",
"id": "a87aa0a3-c82e-4da0-a5d0-31e42039feae",
"metadata": {},
"source": [
"That's all! Hopefully this will get you started on accessing more cloud-native datasets!"
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}