GroupBy: Group and Bin Data#

Often we want to bin or group data, produce statistics (mean, variance) on the groups, and then return a reduced data set. To do this, Xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:

  • Split your data into multiple independent groups.

  • Apply some function to each group.

  • Combine your groups back into a single data object.

Group by operations work on both Dataset and DataArray objects. Most of the examples focus on grouping by a single one-dimensional variable, although support for grouping over a multi-dimensional variable has recently been implemented. Note that for one-dimensional data, it is usually faster to rely on pandas’ implementation of the same pipeline.
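
As a rough sketch (with a small made-up one-dimensional dataset, separate from the examples below), handing the same split-apply-combine pipeline off to pandas via to_dataframe() might look like this:

import numpy as np
import xarray as xr

# A small one-dimensional dataset with a label coordinate to group on
ds1d = xr.Dataset(
    {"foo": ("x", np.random.rand(4))},
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)

# The pipeline expressed in pandas ...
ds1d.to_dataframe().groupby("letters")["foo"].mean()

# ... versus the Xarray version
ds1d.groupby("letters").mean()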

Tip

To substantially improve the performance of GroupBy operations, particularly with dask, install the flox package. flox extends Xarray's built-in GroupBy capabilities by allowing grouping by multiple variables and lazy grouping over dask arrays. If installed, Xarray will use flox by default.
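
If you want to compare results or temporarily fall back to Xarray's native implementation, here is a minimal sketch, assuming the use_flox option of xarray.set_options() is available in your installed version:

import numpy as np
import xarray as xr

tmp = xr.Dataset(
    {"foo": ("x", np.random.rand(4))},
    coords={"letters": ("x", list("abba"))},
)

# Disable flox for this computation and use Xarray's native groupby machinery
# (assumes the use_flox option exists in your xarray version)
with xr.set_options(use_flox=False):
    native_result = tmp.groupby("letters").mean()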

Split#

Let’s create a simple example dataset:

In [1]: ds = xr.Dataset(
   ...:     {"foo": (("x", "y"), np.random.rand(4, 3))},
   ...:     coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
   ...: )
   ...: 

In [2]: arr = ds["foo"]

In [3]: ds
Out[3]: 
<xarray.Dataset> Size: 144B
Dimensions:  (x: 4, y: 3)
Coordinates:
  * x        (x) int64 32B 10 20 30 40
    letters  (x) <U1 16B 'a' 'b' 'b' 'a'
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 96B 0.127 0.9667 0.2605 0.8972 ... 0.543 0.373 0.448

If we groupby the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back a GroupBy object:

In [4]: ds.groupby("letters")
Out[4]: 
DatasetGroupBy, grouped over 'letters'
2 groups with labels 'a', 'b'.

This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:

In [5]: ds.groupby("letters").groups
Out[5]: {'a': [0, 3], 'b': [1, 2]}

You can also iterate over groups in (label, group) pairs:

In [6]: list(ds.groupby("letters"))
Out[6]: 
[('a',
  <xarray.Dataset> Size: 72B
  Dimensions:  (x: 2, y: 3)
  Coordinates:
    * x        (x) int64 16B 10 40
      letters  (x) <U1 8B 'a' 'a'
  Dimensions without coordinates: y
  Data variables:
      foo      (x, y) float64 48B 0.127 0.9667 0.2605 0.543 0.373 0.448),
 ('b',
  <xarray.Dataset> Size: 72B
  Dimensions:  (x: 2, y: 3)
  Coordinates:
    * x        (x) int64 16B 20 30
      letters  (x) <U1 8B 'b' 'b'
  Dimensions without coordinates: y
  Data variables:
      foo      (x, y) float64 48B 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]

You can index out a particular group:

In [7]: ds.groupby("letters")["b"]
Out[7]: 
<xarray.Dataset> Size: 72B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 16B 20 30
    letters  (x) <U1 8B 'b' 'b'
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231

Just like in pandas, creating a GroupBy object is cheap: it does not actually split the data until you access particular values.

Binning#

Sometimes you don't want to use all the unique values to determine the groups but instead want to "bin" the data into coarser groups. You could always create a custom coordinate, but Xarray facilitates this via the groupby_bins() method.

In [8]: x_bins = [0, 25, 50]

In [9]: ds.groupby_bins("x", x_bins).groups
Out[9]: 
{Interval(0, 25, closed='right'): [0, 1],
 Interval(25, 50, closed='right'): [2, 3]}

The binning is implemented via pandas.cut(), whose documentation details how the bins are assigned. As seen in the example above, by default, the bins are labeled with strings using set notation to precisely identify the bin limits. To override this behavior, you can specify the bin labels explicitly. Here we choose float labels which identify the bin centers:

In [10]: x_bin_labels = [12.5, 37.5]

In [11]: ds.groupby_bins("x", x_bins, labels=x_bin_labels).groups
Out[11]: {12.5: [0, 1], 37.5: [2, 3]}
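
Like pandas.cut(), groupby_bins() should also accept an integer bin count instead of explicit edges, splitting the range of the variable into equal-width bins; a minimal sketch (the bin count here is arbitrary):

# Two equal-width bins over the range of "x", rather than explicit edges
ds.groupby_bins("x", bins=2).groups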

Apply#

To apply a function to each group, you can use the flexible map() method. The resulting objects are automatically concatenated back together along the group axis:

In [12]: def standardize(x):
   ....:     return (x - x.mean()) / x.std()
   ....: 

In [13]: arr.groupby("letters").map(standardize)
Out[13]: 
<xarray.DataArray 'foo' (x: 4, y: 3)> Size: 96B
array([[-1.23 ,  1.937, -0.726],
       [ 1.42 , -0.46 , -0.607],
       [-0.191,  1.214, -1.376],
       [ 0.339, -0.302, -0.019]])
Coordinates:
  * x        (x) int64 32B 10 20 30 40
    letters  (x) <U1 16B 'a' 'b' 'b' 'a'
Dimensions without coordinates: y

GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:

In [14]: arr.groupby("letters").mean(dim="x")
Out[14]: 
<xarray.DataArray 'foo' (letters: 2, y: 3)> Size: 48B
array([[0.335, 0.67 , 0.354],
       [0.674, 0.609, 0.23 ]])
Coordinates:
  * letters  (letters) object 16B 'a' 'b'
Dimensions without coordinates: y
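
For example, reduce() accepts an arbitrary aggregation function; a minimal sketch using np.median:

# Apply a NumPy aggregation to each group, reducing over the "x" dimension
arr.groupby("letters").reduce(np.median, dim="x")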

Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:

In [15]: ds.groupby("x").std(...)
Out[15]: 
<xarray.Dataset> Size: 80B
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 32B 10 20 30 40
    letters  (x) <U1 16B 'a' 'b' 'b' 'a'
Data variables:
    foo      (x) float64 32B 0.3684 0.2554 0.2931 0.06957

Note

We use an ellipsis (...) here to indicate that we want to reduce over all other dimensions.

First and last#

There are two special aggregation operations that are currently only available on GroupBy objects: first and last. These return the first or last value for each group along the grouped dimension:

In [16]: ds.groupby("letters").first(...)
Out[16]: 
<xarray.Dataset> Size: 64B
Dimensions:  (letters: 2, y: 3)
Coordinates:
  * letters  (letters) object 16B 'a' 'b'
Dimensions without coordinates: y
Data variables:
    foo      (letters, y) float64 48B 0.127 0.9667 0.2605 0.8972 0.3767 0.3362

By default, they skip missing values (control this with skipna).
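
For instance, a minimal sketch of keeping missing values instead of skipping them (only relevant if your data actually contain NaNs):

# Take the first value per group without skipping NaNs
ds.groupby("letters").first(skipna=False)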

Grouped arithmetic#

GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example:

In [17]: alt = arr.groupby("letters").mean(...)

In [18]: alt
Out[18]: 
<xarray.DataArray 'foo' (letters: 2)> Size: 16B
array([0.453, 0.504])
Coordinates:
  * letters  (letters) object 16B 'a' 'b'

In [19]: ds.groupby("letters") - alt
Out[19]: 
<xarray.Dataset> Size: 144B
Dimensions:  (x: 4, y: 3)
Coordinates:
  * x        (x) int64 32B 10 20 30 40
    letters  (x) <U1 16B 'a' 'b' 'b' 'a'
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 96B -0.3261 0.5137 -0.1926 ... -0.08002 -0.005036

This last line is roughly equivalent to the following:

results = []
for label, group in ds.groupby('letters'):
    results.append(group - alt.sel(letters=label))
xr.concat(results, dim='x')
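
The same pattern is commonly used to remove the group-wise mean in a single expression; a minimal sketch:

# Subtract each group's mean from its members ("anomalies" relative to the group)
anomalies = ds.groupby("letters") - ds.groupby("letters").mean(...)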

Iterating and Squeezing#

Previously, Xarray defaulted to squeezing out dimensions of size one when iterating over a GroupBy object. This behaviour is being removed. You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.

In [20]: next(iter(arr.groupby("x", squeeze=False)))
Out[20]: 
(10,
 <xarray.DataArray 'foo' (x: 1, y: 3)> Size: 24B
 array([[0.127, 0.967, 0.26 ]])
 Coordinates:
   * x        (x) int64 8B 10
     letters  (x) <U1 4B 'a'
 Dimensions without coordinates: y)
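
For example, a minimal sketch of dropping the length-one grouped dimension yourself after iterating:

# Explicitly squeeze out the size-one "x" dimension from a single group
label, group = next(iter(arr.groupby("x", squeeze=False)))
group.squeeze("x")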

Multidimensional Grouping#

Many datasets have a multidimensional coordinate variable (e.g. longitude) which is different from the logical grid dimensions (e.g. nx, ny). Such variables are valid under the CF conventions. Xarray supports groupby operations over multidimensional coordinate variables:

In [21]: da = xr.DataArray(
   ....:     [[0, 1], [2, 3]],
   ....:     coords={
   ....:         "lon": (["ny", "nx"], [[30, 40], [40, 50]]),
   ....:         "lat": (["ny", "nx"], [[10, 10], [20, 20]]),
   ....:     },
   ....:     dims=["ny", "nx"],
   ....: )
   ....: 

In [22]: da
Out[22]: 
<xarray.DataArray (ny: 2, nx: 2)> Size: 32B
array([[0, 1],
       [2, 3]])
Coordinates:
    lon      (ny, nx) int64 32B 30 40 40 50
    lat      (ny, nx) int64 32B 10 10 20 20
Dimensions without coordinates: ny, nx

In [23]: da.groupby("lon").sum(...)
Out[23]: 
<xarray.DataArray (lon: 3)> Size: 24B
array([0, 3, 3])
Coordinates:
  * lon      (lon) int64 24B 30 40 50

In [24]: da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False)
Out[24]: 
<xarray.DataArray (ny: 2, nx: 2)> Size: 32B
array([[ 0. , -0.5],
       [ 0.5,  0. ]])
Coordinates:
    lon      (ny, nx) int64 32B 30 40 40 50
    lat      (ny, nx) int64 32B 10 10 20 20
Dimensions without coordinates: ny, nx

Because multidimensional groups have the ability to generate a very large number of bins, coarse-binning via groupby_bins() may be desirable:

In [25]: da.groupby_bins("lon", [0, 45, 50]).sum()
Out[25]: 
<xarray.DataArray (lon_bins: 2)> Size: 16B
array([3, 3])
Coordinates:
  * lon_bins  (lon_bins) object 16B (0, 45] (45, 50]

These methods group by lon values. It is also possible to groupby each cell in a grid, regardless of value, by stacking multiple dimensions, applying your function, and then unstacking the result:

In [26]: stacked = da.stack(gridcell=["ny", "nx"])

In [27]: stacked.groupby("gridcell").sum(...).unstack("gridcell")
Out[27]: 
<xarray.DataArray (ny: 2, nx: 2)> Size: 32B
array([[0, 1],
       [2, 3]])
Coordinates:
  * ny       (ny) int64 16B 0 1
  * nx       (nx) int64 16B 0 1
    lon      (ny, nx) int64 32B 30 40 40 50
    lat      (ny, nx) int64 32B 10 10 20 20