From 4069c0e13038b4afef3b8eec469cd405265c0697 Mon Sep 17 00:00:00 2001 From: Wei Ji <23487320+weiji14@users.noreply.github.com> Date: Sat, 16 Jul 2022 18:06:15 -0400 Subject: [PATCH 1/9] Remote data access tutorial for CMIP6 sea surface height Zarr data Draft tutorial on accessing cloud-native CMIP6 datasets on Google Cloud. Covers opening Zarr files using xarray.open_zarr. Also did a basic sea level change (between 2015 and 2100) calculation and plot to show the result. --- intermediate/remote_data_access.ipynb | 335 ++++++++++++++++++++++++++ 1 file changed, 335 insertions(+) create mode 100644 intermediate/remote_data_access.ipynb diff --git a/intermediate/remote_data_access.ipynb b/intermediate/remote_data_access.ipynb new file mode 100644 index 00000000..ea1a2444 --- /dev/null +++ b/intermediate/remote_data_access.ipynb @@ -0,0 +1,335 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0ebd8a4d-6937-4ad6-9c93-fa944fb389c1", + "metadata": {}, + "source": [ + "# Accessing remote data stored on the cloud\n", + "\n", + "In this tutorial, we'll cover the following:\n", + "- Finding a cloud hosted Zarr archive of CMIP6 dataset(s)\n", + "- Remote data access to a single CMIP6 dataset (sea surface height)\n", + "- Calculate future predicted sea level change in 2100 compared to 2015" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7533f0e-5dd1-423a-9a04-8ed755d180a2", + "metadata": {}, + "outputs": [], + "source": [ + "import gcsfs\n", + "import pandas as pd\n", + "import xarray as xr" + ] + }, + { + "cell_type": "markdown", + "id": "95002377-b0a6-479f-928d-53b044b390df", + "metadata": {}, + "source": [ + "## Finding cloud native data\n", + "\n", + "Cloud-native data means data that is structured for efficient querying across the network.\n", + "Typically, this means having metadata that describes the entire file in the header of the\n", + "file, or having a a separate pointer file (so that there is no need to download everything first).\n", + "\n", + "Quite commonly, you'll see cloud-native datasets stored on these\n", + "three object storage providers, though there are many other ones too.\n", + "\n", + "- [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3)\n", + "- [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs)\n", + "- [Google Cloud Storage](https://cloud.google.com/storage)" + ] + }, + { + "cell_type": "markdown", + "id": "bc520e32-204f-4f92-bdec-4f678160d6de", + "metadata": {}, + "source": [ + "### Getting cloud hosted CMIP6 data\n", + "\n", + "The [Coupled Model Intercomparison Project Phase 6 (CMIP6)](https://en.wikipedia.org/wiki/CMIP6#CMIP_Phase_6)\n", + "dataset is a rich archive of modelling experiments carried out to predict the climate change impacts.\n", + "The datasets are stored using the [Zarr](https://zarr.dev) format, and we'll go over how to access it.\n", + "\n", + "Sources:\n", + "- https://esgf-node.llnl.gov/search/cmip6/\n", + "- CMIP6 data hosted on Google Cloud - https://console.cloud.google.com/marketplace/details/noaa-public/cmip6" + ] + }, + { + "cell_type": "markdown", + "id": "8d12400d-ab5e-420e-b9f5-b61e083dc9ce", + "metadata": {}, + "source": [ + "First, let's open a CSV containing the list of CMIP6 datasets available" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d1d9f94c-dbe3-4151-8ee7-fa182724810b", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv(\"https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv\")\n", + "print(f\"Number of rows: {len(df)}\")\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "eb263332-dc60-4bd1-9ef3-cf9612cf09a1", + "metadata": {}, + "source": [ + "Over 5 million rows! Let's filter it down to the variable and experiment\n", + "we're interested in, e.g. sea surface height.\n", + "\n", + "For the `variable_id`, you can look it up given some keyword at\n", + "https://docs.google.com/spreadsheets/d/1UUtoz6Ofyjlpx5LdqhKcwHFz2SGoTQV2_yekHyMfL9Y/edit#gid=1221485271\n", + "\n", + "For the `experiment_id`, download the spreadsheet from\n", + "https://github.com/ES-DOC/esdoc-docs/blob/master/cmip6/experiments/spreadsheet/experiments.xlsx,\n", + "go to the 'experiment' tab, and find the one you're interested in.\n", + "\n", + "Another good place to find the right model runs is https://esgf-node.llnl.gov/search/cmip6\n", + "(once you get your head around the acronyms and short names)." + ] + }, + { + "cell_type": "markdown", + "id": "9b435c14-fd56-481c-b5f4-781794a1cc1a", + "metadata": {}, + "source": [ + "Below, we'll filter to CMIP6 experiments matching:\n", + "- Sea Surface Height Above Geoid [m] (variable_id: `zos`)\n", + "- Shared Socioeconomic Pathway 5 (experiment_id: `ssp585`)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2fe50e53-b02f-4a84-bc4a-e1934fe32661", + "metadata": {}, + "outputs": [], + "source": [ + "df_zos = df.query(\"variable_id == 'zos' & experiment_id == 'ssp585'\")\n", + "df_zos" + ] + }, + { + "cell_type": "markdown", + "id": "9ddfad3e-d4de-4c0a-be6f-53f1f7928f51", + "metadata": {}, + "source": [ + "There's 274 modelled scenarios for SSP5.\n", + "Let's just get the URL to the first one in the list for now." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5515186d-8571-439a-b5a8-b8b56aab77f6", + "metadata": {}, + "outputs": [], + "source": [ + "print(df_zos.zstore.iloc[0])" + ] + }, + { + "cell_type": "markdown", + "id": "b68bcfbb-24c9-420d-b297-44c678b7f8ce", + "metadata": {}, + "source": [ + "## Reading from the remote Zarr storage" + ] + }, + { + "cell_type": "markdown", + "id": "b3f5660d-bd46-44f6-8f6d-a62947b6f2c4", + "metadata": {}, + "source": [ + "In many cases, you'll need to first connect to the cloud provider.\n", + "The CMIP6 dataset allows anonymous access, but for some cases,\n", + "you may need to authentication." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a4e6d5e3-35a0-4c31-a1b8-96258cf50974", + "metadata": {}, + "outputs": [], + "source": [ + "fs = gcsfs.GCSFileSystem(token=\"anon\")" + ] + }, + { + "cell_type": "markdown", + "id": "b959f829-e434-4a84-82d2-2f2b24dc84d2", + "metadata": {}, + "source": [ + "Next, we'll need a mapping to the Google Storage object.\n", + "This can be done using `fs.get_mapper`.\n", + "\n", + "A more generic way (for other cloud providers) is to use\n", + "[`fsspec.get_mapper`](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.get_mapper) instead." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1527d1f-503e-4b0b-8433-794067ed46cc", + "metadata": {}, + "outputs": [], + "source": [ + "store = fs.get_mapper(\"gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/\")" + ] + }, + { + "cell_type": "markdown", + "id": "b694baac-9259-4de8-8eae-ac3cb653d894", + "metadata": {}, + "source": [ + "With that, we can open the Zarr store like so." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74b6d289-a852-4216-a3b6-4483d5ff2854", + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.open_zarr(store=store, consolidated=True)\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "d81a5958-517b-4215-8c02-b1083b4b4fe2", + "metadata": {}, + "source": [ + "### Selecting time slices\n", + "\n", + "Let's say we want to calculate sea level change between\n", + "2015 and 2100. We can access just the specific time points\n", + "needed using [`xr.Dataset.sel`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.sel.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1101b455-ba65-4cab-a3b6-bf2601958400", + "metadata": {}, + "outputs": [], + "source": [ + "zos_2015jan = ds.zos.sel(time=\"2015-01-16\").isel(time=0)\n", + "zos_2100dec = ds.zos.sel(time=\"2100-12-16\").isel(time=0)" + ] + }, + { + "cell_type": "markdown", + "id": "fb8d90a2-9883-41da-b26c-7b5547a15270", + "metadata": {}, + "source": [ + "Sea level change would just be 2100 minus 2015." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1f5fa1ee-260c-4ec4-898a-230826f9f2c8", + "metadata": {}, + "outputs": [], + "source": [ + "sealevelchange = (zos_2100dec - zos_2015jan)" + ] + }, + { + "cell_type": "markdown", + "id": "0e087f3b-0315-40db-ae03-a3393b49c30e", + "metadata": {}, + "source": [ + "Note that up to this point, we have not actually downloaded any\n", + "(big) data yet from the cloud. This is all working based on\n", + "metadata only.\n", + "\n", + "To bring the data from the cloud to your local computer, call `.compute`.\n", + "This will take a while depending on your connection speed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38c2152e-67e7-449e-8f1a-2d64f63dedda", + "metadata": {}, + "outputs": [], + "source": [ + "sealevelchange = sealevelchange.compute()" + ] + }, + { + "cell_type": "markdown", + "id": "5226729f-07db-4fe6-a980-9a1f630c8277", + "metadata": {}, + "source": [ + "We can do a quick plot to show how Sea Level is predicted to change\n", + "between 2015-2100 (from one modelled experiment)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c42ed9f-fc61-4762-9765-3dd553d5c2ad", + "metadata": {}, + "outputs": [], + "source": [ + "sealevelchange.plot.imshow()" + ] + }, + { + "cell_type": "markdown", + "id": "b4361786-c889-4ae7-a704-dcbda50513da", + "metadata": {}, + "source": [ + "Notice the blue parts between -40 and -60 South where sea level has dropped?\n", + "That's to do with the Antarctic ice sheet losing mass and resulting in a lower\n", + "gravitational pull, resulting in a relative decrease in sea level. Over most\n", + "of the Northern Hemisphere though, sea level rise has increased between 2015 and 2100." + ] + }, + { + "cell_type": "markdown", + "id": "a87aa0a3-c82e-4da0-a5d0-31e42039feae", + "metadata": {}, + "source": [ + "That's all! Hopefully this will get you started on accessing more cloud-native datasets!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "xarray-tutorial", + "language": "python", + "name": "xarray-tutorial" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 665ec9a505789c166ef96869555cf7e5da029345 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Sat, 16 Jul 2022 22:07:37 +0000 Subject: [PATCH 2/9] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- intermediate/remote_data_access.ipynb | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/intermediate/remote_data_access.ipynb b/intermediate/remote_data_access.ipynb index ea1a2444..f24f5885 100644 --- a/intermediate/remote_data_access.ipynb +++ b/intermediate/remote_data_access.ipynb @@ -186,7 +186,9 @@ "metadata": {}, "outputs": [], "source": [ - "store = fs.get_mapper(\"gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/\")" + "store = fs.get_mapper(\n", + " \"gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/\"\n", + ")" ] }, { @@ -246,7 +248,7 @@ "metadata": {}, "outputs": [], "source": [ - "sealevelchange = (zos_2100dec - zos_2015jan)" + "sealevelchange = zos_2100dec - zos_2015jan" ] }, { @@ -312,11 +314,6 @@ } ], "metadata": { - "kernelspec": { - "display_name": "xarray-tutorial", - "language": "python", - "name": "xarray-tutorial" - }, "language_info": { "codemirror_mode": { "name": "ipython", @@ -326,8 +323,7 @@ "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.5" + "pygments_lexer": "ipython3" } }, "nbformat": 4, From e96181b89f369e1358ee421de0529cfee5429a48 Mon Sep 17 00:00:00 2001 From: Wei Ji <23487320+weiji14@users.noreply.github.com> Date: Fri, 29 Jul 2022 16:23:42 -0400 Subject: [PATCH 3/9] Edit CMIP6 catalog csv link and use squeeze Co-Authored-By: Julius Busecke Co-Authored-By: dcherian --- intermediate/remote_data_access.ipynb | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/intermediate/remote_data_access.ipynb b/intermediate/remote_data_access.ipynb index f24f5885..175df28d 100644 --- a/intermediate/remote_data_access.ipynb +++ b/intermediate/remote_data_access.ipynb @@ -75,7 +75,7 @@ "metadata": {}, "outputs": [], "source": [ - "df = pd.read_csv(\"https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv\")\n", + "df = pd.read_csv(\"https://cmip6.storage.googleapis.com/pangeo-cmip6.csv\")\n", "print(f\"Number of rows: {len(df)}\")\n", "df.head()" ] @@ -125,7 +125,7 @@ "id": "9ddfad3e-d4de-4c0a-be6f-53f1f7928f51", "metadata": {}, "source": [ - "There's 274 modelled scenarios for SSP5.\n", + "There's 272 modelled scenarios for SSP5.\n", "Let's just get the URL to the first one in the list for now." ] }, @@ -229,8 +229,8 @@ "metadata": {}, "outputs": [], "source": [ - "zos_2015jan = ds.zos.sel(time=\"2015-01-16\").isel(time=0)\n", - "zos_2100dec = ds.zos.sel(time=\"2100-12-16\").isel(time=0)" + "zos_2015jan = ds.zos.sel(time=\"2015-01-16\").squeeze()\n", + "zos_2100dec = ds.zos.sel(time=\"2100-12-16\").squeeze()" ] }, { From 936c43c90c5d66e20f5ba380f89c36541915eeeb Mon Sep 17 00:00:00 2001 From: Wei Ji <23487320+weiji14@users.noreply.github.com> Date: Fri, 29 Jul 2022 16:24:50 -0400 Subject: [PATCH 4/9] Add remote data access tutorial to table of contents --- _toc.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/_toc.yml b/_toc.yml index e15e32b2..e4ccf6c1 100644 --- a/_toc.yml +++ b/_toc.yml @@ -37,6 +37,7 @@ parts: - file: intermediate/xarray_and_dask - file: intermediate/xarray_ecosystem - file: intermediate/hvplot + - file: intermediate/remote_data_access - file: data_cleaning/ice_velocity.ipynb - caption: Advanced From 5711d9e6563661e54032f69b9771962947b172de Mon Sep 17 00:00:00 2001 From: Wei Ji <23487320+weiji14@users.noreply.github.com> Date: Fri, 29 Jul 2022 16:29:57 -0400 Subject: [PATCH 5/9] Add link to Pangeo / ESGF Cloud Data Working Group documentation https://pangeo-data.github.io/pangeo-cmip6-cloud/accessing_data.html --- intermediate/remote_data_access.ipynb | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/intermediate/remote_data_access.ipynb b/intermediate/remote_data_access.ipynb index 175df28d..aca85661 100644 --- a/intermediate/remote_data_access.ipynb +++ b/intermediate/remote_data_access.ipynb @@ -57,7 +57,8 @@ "\n", "Sources:\n", "- https://esgf-node.llnl.gov/search/cmip6/\n", - "- CMIP6 data hosted on Google Cloud - https://console.cloud.google.com/marketplace/details/noaa-public/cmip6" + "- CMIP6 data hosted on Google Cloud - https://console.cloud.google.com/marketplace/details/noaa-public/cmip6\n", + "- Pangeo/ESGF Cloud Data Access tutorial - https://pangeo-data.github.io/pangeo-cmip6-cloud/accessing_data.html" ] }, { From cfe83b2b82c58537c7a223c635d5b28f79ea8bd1 Mon Sep 17 00:00:00 2001 From: Wei Ji <23487320+weiji14@users.noreply.github.com> Date: Wed, 10 Aug 2022 18:28:26 -0400 Subject: [PATCH 6/9] Update intermediate/remote_data_access.ipynb --- intermediate/remote_data_access.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/intermediate/remote_data_access.ipynb b/intermediate/remote_data_access.ipynb index aca85661..ea891a73 100644 --- a/intermediate/remote_data_access.ipynb +++ b/intermediate/remote_data_access.ipynb @@ -90,7 +90,7 @@ "we're interested in, e.g. sea surface height.\n", "\n", "For the `variable_id`, you can look it up given some keyword at\n", - "https://docs.google.com/spreadsheets/d/1UUtoz6Ofyjlpx5LdqhKcwHFz2SGoTQV2_yekHyMfL9Y/edit#gid=1221485271\n", + "https://docs.google.com/spreadsheets/d/1UUtoz6Ofyjlpx5LdqhKcwHFz2SGoTQV2_yekHyMfL9Y\n", "\n", "For the `experiment_id`, download the spreadsheet from\n", "https://github.com/ES-DOC/esdoc-docs/blob/master/cmip6/experiments/spreadsheet/experiments.xlsx,\n", From 04a272e8b054a4af723e7a62441466e4df7043c6 Mon Sep 17 00:00:00 2001 From: Deepak Cherian Date: Tue, 23 Aug 2022 11:34:17 -0600 Subject: [PATCH 7/9] Rename remote_data_access.ipynb to cmip6-cloud.ipynb --- intermediate/{remote_data_access.ipynb => cmip6-cloud.ipynb} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename intermediate/{remote_data_access.ipynb => cmip6-cloud.ipynb} (100%) diff --git a/intermediate/remote_data_access.ipynb b/intermediate/cmip6-cloud.ipynb similarity index 100% rename from intermediate/remote_data_access.ipynb rename to intermediate/cmip6-cloud.ipynb From 34daaf8c28370cb7cb678be42d65b379e6be86cd Mon Sep 17 00:00:00 2001 From: Deepak Cherian Date: Tue, 23 Aug 2022 11:34:33 -0600 Subject: [PATCH 8/9] Update _toc.yml --- _toc.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_toc.yml b/_toc.yml index e4ccf6c1..713fd663 100644 --- a/_toc.yml +++ b/_toc.yml @@ -37,7 +37,7 @@ parts: - file: intermediate/xarray_and_dask - file: intermediate/xarray_ecosystem - file: intermediate/hvplot - - file: intermediate/remote_data_access + - file: intermediate/cmip6-cloud - file: data_cleaning/ice_velocity.ipynb - caption: Advanced From 2093ddde77941cc52ba4aa982d4515cbed52bfda Mon Sep 17 00:00:00 2001 From: Deepak Cherian Date: Tue, 23 Aug 2022 11:35:33 -0600 Subject: [PATCH 9/9] Update intermediate/cmip6-cloud.ipynb --- intermediate/cmip6-cloud.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/intermediate/cmip6-cloud.ipynb b/intermediate/cmip6-cloud.ipynb index ea891a73..0dda2a0f 100644 --- a/intermediate/cmip6-cloud.ipynb +++ b/intermediate/cmip6-cloud.ipynb @@ -155,7 +155,7 @@ "source": [ "In many cases, you'll need to first connect to the cloud provider.\n", "The CMIP6 dataset allows anonymous access, but for some cases,\n", - "you may need to authentication." + "you may need to authenticate." ] }, {