You can run this notebook in a live session or view it on Github.
Toy weather data#
Here is an example of how to easily manipulate a toy weather dataset using xarray and other recommended Python libraries:
[1]:
import numpy as np
import pandas as pd
import seaborn as sns
import xarray as xr
np.random.seed(123)
xr.set_options(display_style="html")
times = pd.date_range("2000-01-01", "2001-12-31", name="time")
annual_cycle = np.sin(2 * np.pi * (times.dayofyear.values / 365.25 - 0.28))
base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)
ds = xr.Dataset(
{
"tmin": (("time", "location"), tmin_values),
"tmax": (("time", "location"), tmax_values),
},
{"time": times, "location": ["IA", "IN", "IL"]},
)
ds
/tmp/ipykernel_3819/3043960761.py:2: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
import pandas as pd
[1]:
<xarray.Dataset> Size: 41kB Dimensions: (time: 731, location: 3) Coordinates: * time (time) datetime64[ns] 6kB 2000-01-01 2000-01-02 ... 2001-12-31 * location (location) <U2 24B 'IA' 'IN' 'IL' Data variables: tmin (time, location) float64 18kB -8.037 -1.788 ... -1.346 -4.544 tmax (time, location) float64 18kB 12.98 3.31 6.779 ... 3.343 3.805
Examine a dataset with pandas and seaborn#
Convert to a pandas DataFrame#
[2]:
df = ds.to_dataframe()
df.head()
[2]:
tmin | tmax | ||
---|---|---|---|
time | location | ||
2000-01-01 | IA | -8.037369 | 12.980549 |
IN | -1.788441 | 3.310409 | |
IL | -3.931542 | 6.778554 | |
2000-01-02 | IA | -9.341157 | 0.447856 |
IN | -6.558073 | 6.372712 |
[3]:
df.describe()
[3]:
tmin | tmax | |
---|---|---|
count | 2193.000000 | 2193.000000 |
mean | 9.975426 | 20.108232 |
std | 10.963228 | 11.010569 |
min | -13.395763 | -3.506234 |
25% | -0.040347 | 9.853905 |
50% | 10.060403 | 19.967409 |
75% | 20.083590 | 30.045588 |
max | 33.456060 | 43.271148 |
Visualize using pandas#
[4]:
ds.mean(dim="location").to_dataframe().plot()
[4]:
<Axes: xlabel='time'>
Visualize using seaborn#
[5]:
sns.pairplot(df.reset_index(), vars=ds.data_vars)
[5]:
<seaborn.axisgrid.PairGrid at 0x7fa8096b45e0>
Probability of freeze by calendar month#
[6]:
freeze = (ds["tmin"] <= 0).groupby("time.month").mean("time")
freeze
[6]:
<xarray.DataArray 'tmin' (month: 12, location: 3)> Size: 288B array([[0.9516129 , 0.88709677, 0.93548387], [0.84210526, 0.71929825, 0.77192982], [0.24193548, 0.12903226, 0.16129032], [0. , 0. , 0. ], [0. , 0. , 0. ], [0. , 0. , 0. ], [0. , 0. , 0. ], [0. , 0. , 0. ], [0. , 0. , 0. ], [0. , 0.01612903, 0. ], [0.33333333, 0.35 , 0.23333333], [0.93548387, 0.85483871, 0.82258065]]) Coordinates: * location (location) <U2 24B 'IA' 'IN' 'IL' * month (month) int64 96B 1 2 3 4 5 6 7 8 9 10 11 12
[7]:
freeze.to_pandas().plot()
[7]:
<Axes: xlabel='month'>
Monthly averaging#
[8]:
monthly_avg = ds.resample(time="1MS").mean()
monthly_avg.sel(location="IA").to_dataframe().plot(style="s-")
[8]:
<Axes: xlabel='time'>
Note that MS
here refers to Month-Start; M
labels Month-End (the last day of the month).
Calculate monthly anomalies#
In climatology, “anomalies” refer to the difference between observations and typical weather for a particular season. Unlike observations, anomalies should not show any seasonal cycle.
[9]:
climatology = ds.groupby("time.month").mean("time")
anomalies = ds.groupby("time.month") - climatology
anomalies.mean("location").to_dataframe()[["tmin", "tmax"]].plot()
[9]:
<Axes: xlabel='time'>
Calculate standardized monthly anomalies#
You can create standardized anomalies where the difference between the observations and the climatological monthly mean is divided by the climatological standard deviation.
[10]:
climatology_mean = ds.groupby("time.month").mean("time")
climatology_std = ds.groupby("time.month").std("time")
stand_anomalies = xr.apply_ufunc(
lambda x, m, s: (x - m) / s,
ds.groupby("time.month"),
climatology_mean,
climatology_std,
)
stand_anomalies.mean("location").to_dataframe()[["tmin", "tmax"]].plot()
[10]:
<Axes: xlabel='time'>
Fill missing values with climatology#
The fillna
method on grouped objects lets you easily fill missing values by group:
[11]:
# throw away the first half of every month
some_missing = ds.tmin.sel(time=ds["time.day"] > 15).reindex_like(ds)
filled = some_missing.groupby("time.month").fillna(climatology.tmin)
both = xr.Dataset({"some_missing": some_missing, "filled": filled})
both
[11]:
<xarray.Dataset> Size: 47kB Dimensions: (time: 731, location: 3) Coordinates: * time (time) datetime64[ns] 6kB 2000-01-01 2000-01-02 ... 2001-12-31 * location (location) <U2 24B 'IA' 'IN' 'IL' month (time) int64 6kB 1 1 1 1 1 1 1 1 1 ... 12 12 12 12 12 12 12 12 Data variables: some_missing (time, location) float64 18kB nan nan nan ... -1.346 -4.544 filled (time, location) float64 18kB -5.163 -4.216 ... -1.346 -4.544
[12]:
df = both.sel(time="2000").mean("location").reset_coords(drop=True).to_dataframe()
df.head()
[12]:
some_missing | filled | |
---|---|---|
time | ||
2000-01-01 | NaN | -4.686763 |
2000-01-02 | NaN | -4.686763 |
2000-01-03 | NaN | -4.686763 |
2000-01-04 | NaN | -4.686763 |
2000-01-05 | NaN | -4.686763 |
[13]:
df[["filled", "some_missing"]].plot()
[13]:
<Axes: xlabel='time'>