
xarray.apply_ufunc

xarray.apply_ufunc(func, *args, input_core_dims=None, output_core_dims=((),), exclude_dims=frozenset({}), vectorize=False, join='exact', dataset_join='exact', dataset_fill_value=<no-fill-value>, keep_attrs=None, kwargs=None, dask='forbidden', output_dtypes=None, output_sizes=None, meta=None, dask_gufunc_kwargs=None, on_missing_core_dim='raise')

Apply a vectorized function for unlabeled arrays on xarray objects.

The function will be mapped over the data variable(s) of the input arguments using xarray’s standard rules for labeled computation, including alignment, broadcasting, looping over GroupBy/Dataset variables, and merging of coordinates.

Parameters:
  • func (callable) – Function to call like func(*args, **kwargs) on unlabeled arrays (.data) that returns an array or tuple of arrays. If multiple arguments with non-matching dimensions are supplied, this function is expected to vectorize (broadcast) over axes of positional arguments in the style of NumPy universal functions [1]; if it does not, set vectorize=True. If this function returns multiple outputs, you must set output_core_dims as well.

  • *args (Dataset, DataArray, DataArrayGroupBy, DatasetGroupBy, Variable, numpy.ndarray, dask.array.Array or scalar) – Mix of labeled and/or unlabeled arrays to which to apply the function.

  • input_core_dims (sequence of sequence, optional) – List of the same length as args giving the list of core dimensions on each input argument that should not be broadcast. By default, we assume there are no core dimensions on any input arguments.

    For example, input_core_dims=[[], ['time']] indicates that all dimensions on the first argument and all dimensions other than ‘time’ on the second argument should be broadcast.

    Core dimensions are automatically moved to the last axes of input variables before applying func, which facilitates using NumPy style generalized ufuncs [2].

  • output_core_dims (list of tuple, optional) – List of the same length as the number of output arguments from func, giving the list of core dimensions on each output that were not broadcast on the inputs. By default, we assume that func outputs exactly one array, with axes corresponding to each broadcast dimension.

    Core dimensions are assumed to appear as the last dimensions of each output in the provided order.

  • exclude_dims (set, optional) – Core dimensions on the inputs to exclude from alignment and broadcasting entirely. Any input coordinates along these dimensions will be dropped. Each excluded dimension must also appear in input_core_dims for at least one argument. Only dimensions listed here are allowed to change size between input and output objects.

  • vectorize (bool, optional) – If True, then assume func only takes arrays defined over core dimensions as input and vectorize it automatically with numpy.vectorize(). This option exists for convenience, but is almost always slower than supplying a pre-vectorized function.

  • join ({"outer", "inner", "left", "right", "exact"}, default: "exact") – Method for joining the indexes of the passed objects along each dimension, and the variables of Dataset objects with mismatched data variables:

    • ‘outer’: use the union of object indexes

    • ‘inner’: use the intersection of object indexes

    • ‘left’: use indexes from the first object with each dimension

    • ‘right’: use indexes from the last object with each dimension

    • ‘exact’: raise ValueError instead of aligning when indexes to be aligned are not equal

  • dataset_join ({"outer", "inner", "left", "right", "exact"}, default: "exact") – Method for joining variables of Dataset objects with mismatched data variables.

    • ‘outer’: take variables from both Dataset objects

    • ‘inner’: take only overlapped variables

    • ‘left’: take only variables from the first object

    • ‘right’: take only variables from the last object

    • ‘exact’: data variables on all Dataset objects must match exactly

  • dataset_fill_value (optional) – Value used in place of missing variables on Dataset inputs when the datasets do not share the exact same data_vars. Required if dataset_join not in {'inner', 'exact'}, otherwise ignored.

  • keep_attrs ({"drop", "identical", "no_conflicts", "drop_conflicts", "override"} or bool, optional) –

    • ‘drop’ or False: empty attrs on returned xarray object.

    • ‘identical’: all attrs must be the same on every object.

    • ‘no_conflicts’: attrs from all objects are combined, any that have the same name must also have the same value.

    • ‘drop_conflicts’: attrs from all objects are combined, any that have the same name but different values are dropped.

    • ‘override’ or True: skip comparing and copy attrs from the first object to the result.

  • kwargs (dict, optional) – Optional keyword arguments passed directly on to call func.

  • dask ({"forbidden", "allowed", "parallelized"}, default: "forbidden") – How to handle applying to objects containing lazy data in the form of dask arrays:

    • ‘forbidden’ (default): raise an error if a dask array is encountered.

    • ‘allowed’: pass dask arrays directly on to func. Prefer this option if func natively supports dask arrays.

    • ‘parallelized’: automatically parallelize func if any of the inputs are a dask array by using dask.array.apply_gufunc(). Multiple output arguments are supported. Only use this option if func does not natively support dask arrays (e.g. converts them to numpy arrays).

  • dask_gufunc_kwargs (dict, optional) – Optional keyword arguments passed to dask.array.apply_gufunc() if dask='parallelized'. Possible keywords are output_sizes, allow_rechunk and meta.

  • output_dtypes (list of dtype, optional) – Optional list of output dtypes. Only used if dask='parallelized' or vectorize=True.

  • output_sizes (dict, optional) – Optional mapping from dimension names to sizes for outputs. Only used if dask='parallelized' and new dimensions (not found on inputs) appear on outputs. output_sizes should be given in the dask_gufunc_kwargs parameter. It will be removed as a direct parameter in a future version.

  • meta (optional) – Size-0 object representing the type of array wrapped by dask array. Passed on to dask.array.apply_gufunc(). meta should be given in the dask_gufunc_kwargs parameter. It will be removed as a direct parameter in a future version.

  • on_missing_core_dim ({"raise", "copy", "drop"}, default: "raise") – How to handle missing core dimensions on input variables.
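
The join and keep_attrs options described above can be sketched on two small arrays. The arrays a and b and their values are made up for illustration, and both options are passed explicitly rather than relying on defaults:

```python
import numpy as np
import xarray as xr

# Two illustrative arrays whose "x" indexes only partly overlap.
a = xr.DataArray([1.0, 2.0, 3.0], coords=[("x", [0, 1, 2])], attrs={"units": "m"})
b = xr.DataArray([10.0, 20.0, 30.0], coords=[("x", [1, 2, 3])])

# join="exact" refuses to align objects whose indexes differ.
try:
    xr.apply_ufunc(np.add, a, b, join="exact")
    exact_failed = False
except ValueError:
    exact_failed = True

# join="inner" keeps only the labels present on both inputs (x=1 and x=2);
# keep_attrs=False returns an object with empty attrs.
inner = xr.apply_ufunc(np.add, a, b, join="inner", keep_attrs=False)

# keep_attrs=True copies attrs from the first object to the result.
kept = xr.apply_ufunc(np.add, a, b, join="inner", keep_attrs=True)
```

Here inner holds the sums for the two shared labels only, while kept carries the units attribute of a.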

Returns:

Notes

This function is designed for the more common case where func can work on numpy arrays. If func needs to operate on a whole xarray object for each block (the subset of the object corresponding to one chunk), use xarray.map_blocks() instead.

Note that, because of its overhead, xarray.map_blocks() is considerably slower than apply_ufunc.

Examples

Calculate the vector magnitude of two arguments:

>>> def magnitude(a, b):
...     func = lambda x, y: np.sqrt(x**2 + y**2)
...     return xr.apply_ufunc(func, a, b)
...

You can now apply magnitude() to DataArray and Dataset objects, with automatically preserved dimensions and coordinates, e.g.,

>>> array = xr.DataArray([1, 2, 3], coords=[("x", [0.1, 0.2, 0.3])])
>>> magnitude(array, -array)
<xarray.DataArray (x: 3)> Size: 24B
array([1.41421356, 2.82842712, 4.24264069])
Coordinates:
  * x        (x) float64 24B 0.1 0.2 0.3

Plain scalars, numpy arrays and a mix of these with xarray objects are also supported:

>>> magnitude(3, 4)
5.0
>>> magnitude(3, np.array([0, 4]))
array([3., 5.])
>>> magnitude(array, 0)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 2., 3.])
Coordinates:
  * x        (x) float64 24B 0.1 0.2 0.3
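
The same wrapper also accepts a Dataset, in which case the function is applied to each data variable in turn. A minimal sketch, repeating the definition of magnitude() so that it runs on its own; the variable names u and v and their values are illustrative:

```python
import numpy as np
import xarray as xr


def magnitude(a, b):
    func = lambda x, y: np.sqrt(x**2 + y**2)
    return xr.apply_ufunc(func, a, b)


# A Dataset with two data variables sharing the "x" dimension.
ds = xr.Dataset(
    {"u": ("x", [3.0, 0.0]), "v": ("x", [0.0, 3.0])},
    coords={"x": [0.1, 0.2]},
)

# The scalar 4.0 is combined with every data variable of the Dataset.
result = magnitude(ds, 4.0)
```

The result is again a Dataset with variables u and v and the original x coordinate.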

The following examples show how apply_ufunc can be used to write functions that (very nearly) replicate existing xarray functionality:

Compute the mean (.mean) over one dimension:

>>> def mean(obj, dim):
...     # note: apply_ufunc always moves core dimensions to the end
...     return xr.apply_ufunc(
...         np.mean, obj, input_core_dims=[[dim]], kwargs={"axis": -1}
...     )
...
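
A short usage sketch for this wrapper, assuming a small two-dimensional array with made-up values. It repeats the definition so that it runs on its own and compares the result with the built-in .mean():

```python
import numpy as np
import xarray as xr


def mean(obj, dim):
    # The core dimension `dim` is moved to the last axis, hence axis=-1.
    return xr.apply_ufunc(
        np.mean, obj, input_core_dims=[[dim]], kwargs={"axis": -1}
    )


da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("x", "time"),
    coords={"x": [10, 20]},
)

over_time = mean(da, "time")  # reduces the last axis of the original data
over_x = mean(da, "x")  # "x" is first moved to the end, then reduced
```

In both calls the core dimension disappears from the result because it is not listed in output_core_dims.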

Inner product over a specific dimension (like dot()):

>>> def _inner(x, y):
...     result = np.matmul(x[..., np.newaxis, :], y[..., :, np.newaxis])
...     return result[..., 0, 0]
...
>>> def inner_product(a, b, dim):
...     return xr.apply_ufunc(_inner, a, b, input_core_dims=[[dim], [dim]])
...
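
As a usage sketch, the inner product can be checked against an explicit multiply-and-sum. The dimension name k and the sample values are assumptions made for illustration; the definitions are repeated so the snippet is self-contained:

```python
import numpy as np
import xarray as xr


def _inner(x, y):
    # Contract the last axis of x with the last axis of y.
    result = np.matmul(x[..., np.newaxis, :], y[..., :, np.newaxis])
    return result[..., 0, 0]


def inner_product(a, b, dim):
    return xr.apply_ufunc(_inner, a, b, input_core_dims=[[dim], [dim]])


a = xr.DataArray([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dims=("x", "k"))
b = xr.DataArray([1.0, 0.0, -1.0], dims="k")

# b has no "x" dimension, so it is broadcast against a along "x".
result = inner_product(a, b, "k")
expected = (a * b).sum("k")
```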

Stack objects along a new dimension (like concat()):

>>> def stack(objects, dim, new_coord):
...     # note: this version does not stack coordinates
...     func = lambda *x: np.stack(x, axis=-1)
...     result = xr.apply_ufunc(
...         func,
...         *objects,
...         output_core_dims=[[dim]],
...         join="outer",
...         dataset_fill_value=np.nan
...     )
...     result[dim] = new_coord
...     return result
...
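
A usage sketch for this helper, assuming two one-dimensional arrays on the same x index. The new dimension name member and its labels are hypothetical:

```python
import numpy as np
import xarray as xr


def stack(objects, dim, new_coord):
    # note: this version does not stack coordinates
    func = lambda *x: np.stack(x, axis=-1)
    result = xr.apply_ufunc(
        func,
        *objects,
        output_core_dims=[[dim]],
        join="outer",
        dataset_fill_value=np.nan,
    )
    result[dim] = new_coord
    return result


a = xr.DataArray([1.0, 2.0], coords=[("x", [0, 1])])
b = xr.DataArray([3.0, 4.0], coords=[("x", [0, 1])])

# The new core dimension is appended as the last axis of the output.
result = stack([a, b], "member", ["a", "b"])
```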

If your function is not vectorized but can be applied only to core dimensions, you can use vectorize=True to turn it into a vectorized function. This wraps numpy.vectorize(), so the operation isn’t terribly fast. Here we’ll use it to calculate the distance between empirical samples from two probability distributions, using a scipy function that needs to be applied to vectors:

>>> import scipy.stats
>>> def earth_mover_distance(first_samples, second_samples, dim="ensemble"):
...     return xr.apply_ufunc(
...         scipy.stats.wasserstein_distance,
...         first_samples,
...         second_samples,
...         input_core_dims=[[dim], [dim]],
...         vectorize=True,
...     )
...
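
The same pattern can be sketched without scipy. The helper pearson_1d below is hypothetical and stands in for any function that only makes sense on a pair of 1-D vectors; the sample data is made up:

```python
import numpy as np
import xarray as xr


def pearson_1d(u, v):
    # Expects two 1-D vectors and returns a single correlation coefficient.
    return np.corrcoef(u, v)[0, 1]


def correlation(first, second, dim):
    # vectorize=True loops pearson_1d over all non-core dimensions.
    return xr.apply_ufunc(
        pearson_1d,
        first,
        second,
        input_core_dims=[[dim], [dim]],
        vectorize=True,
    )


first = xr.DataArray(
    [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]], dims=("x", "time")
)
second = xr.DataArray(
    [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]], dims=("x", "time")
)

# One coefficient per position along "x"; "time" is consumed by the function.
result = correlation(first, second, "time")
```

The first row is perfectly correlated and the second perfectly anti-correlated, so the result is close to 1 and -1 along x.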

Most of NumPy’s builtin functions already broadcast their inputs appropriately for use in apply_ufunc. You may find helper functions such as numpy.broadcast_arrays() helpful in writing your function. apply_ufunc also works well with numba.vectorize() and numba.guvectorize().
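
Two parameters described earlier are not exercised by the examples above: exclude_dims and an output_core_dims list with more than one entry. A minimal sketch follows; the helper names every_other and min_max and the sample data are hypothetical:

```python
import numpy as np
import xarray as xr


def every_other(obj, dim):
    # The output is shorter than the input along `dim`, so `dim` must be
    # listed in exclude_dims; its input coordinate is dropped.
    return xr.apply_ufunc(
        lambda x: x[..., ::2],
        obj,
        input_core_dims=[[dim]],
        output_core_dims=[[dim]],
        exclude_dims={dim},
    )


def min_max(obj, dim):
    # Two outputs, neither of which keeps a core dimension.
    return xr.apply_ufunc(
        lambda x: (x.min(axis=-1), x.max(axis=-1)),
        obj,
        input_core_dims=[[dim]],
        output_core_dims=[[], []],
    )


da = xr.DataArray(
    np.arange(8.0).reshape(2, 4),
    dims=("y", "x"),
    coords={"y": [0, 1], "x": [0, 1, 2, 3]},
)

thinned = every_other(da, "x")
lowest, highest = min_max(da, "x")
```

Because x changes size in every_other, the result has no x coordinate; reassign one afterwards if it is needed.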

See also

numpy.broadcast_arrays, numba.vectorize, numba.guvectorize, dask.array.apply_gufunc, xarray.map_blocks

Automatic parallelization with apply_ufunc and map_blocks

User guide describing apply_ufunc() and map_blocks().

apply_ufunc

Advanced tutorial on applying NumPy functions using apply_ufunc()

References