Computation

The labels associated with DataArray and Dataset objects enable some powerful shortcuts for computation, notably including aggregation and broadcasting by dimension names.

Basic array math

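Arithmetic operations with a single DataArray automatically vectorize (like numpy) over all array values. The examples on this page assume that numpy has been imported as np and that xray has been imported under its own name: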

In [1]: arr = xray.DataArray(np.random.randn(2, 3),
   ...:                      [('x', ['a', 'b']), ('y', [10, 20, 30])])
   ...: 

In [2]: arr - 3
Out[2]: 
<xray.DataArray (x: 2, y: 3)>
array([[-2.5308877 , -3.28286334, -4.5090585 ],
       [-4.13563237, -1.78788797, -3.17321465]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'

In [3]: abs(arr)
Out[3]: 
<xray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 ,  0.28286334,  1.5090585 ],
       [ 1.13563237,  1.21211203,  0.17321465]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'

You can also use any of numpy’s or scipy’s many universal functions (ufuncs) directly on a DataArray:

In [4]: np.sin(arr)
Out[4]: 
<xray.DataArray (x: 2, y: 3)>
array([[ 0.45209466, -0.27910634, -0.99809483],
       [-0.90680094,  0.9363595 , -0.17234978]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'

Data arrays also implement many numpy.ndarray methods:

In [5]: arr.round(2)
Out[5]: 
<xray.DataArray (x: 2, y: 3)>
array([[ 0.47, -0.28, -1.51],
       [-1.14,  1.21, -0.17]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'

In [6]: arr.T
Out[6]: 
<xray.DataArray (y: 3, x: 2)>
array([[ 0.4691123 , -1.13563237],
       [-0.28286334,  1.21211203],
       [-1.5090585 , -0.17321465]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'

Data arrays also have the isnull and notnull methods from pandas:

In [7]: xray.DataArray([0, 1, np.nan, np.nan, 2]).isnull()
Out[7]: 
<xray.DataArray (dim_0: 5)>
array([False, False,  True,  True, False], dtype=bool)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4

Aggregation

Aggregation methods from ndarray have been updated to take a dim argument instead of axis, so you can name the dimension(s) to aggregate over instead of tracking axis positions:

In [8]: arr.sum(dim='x')
Out[8]: 
<xray.DataArray (y: 3)>
array([-0.66652007,  0.92924868, -1.68227315])
Coordinates:
  * y        (y) int64 10 20 30

In [9]: arr.std(['x', 'y'])
Out[9]: 
<xray.DataArray ()>
array(0.9156385956757354)

In [10]: arr.min()
Out[10]: 
<xray.DataArray ()>
array(-1.5090585031735124)

If you need to figure out the axis number for a dimension yourself (say, for wrapping code designed to work with numpy arrays), you can use the get_axis_num() method:

In [11]: arr.get_axis_num('y')
Out[11]: 1
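
As a rough illustration of that wrapping pattern, the sketch below uses plain numpy rather than xray. The names dims, values and axis_num are hypothetical stand-ins for a DataArray's dimension names, its underlying array and the result of get_axis_num(); this is not how xray implements the lookup.

```python
import numpy as np

# Hypothetical stand-ins for a DataArray's dimension names and raw values.
dims = ('x', 'y')
values = np.arange(6).reshape(2, 3)

# Look up the positional axis for a named dimension. This is the kind of
# value that get_axis_num('y') returns for a (x, y) array.
axis_num = dims.index('y')

# Hand the axis number to code that only understands numpy arrays.
cumulative = np.cumsum(values, axis=axis_num)
```

Here axis_num is 1, so the cumulative sum runs along each row of values.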

To perform NA-skipping aggregations, pass NA-aware numpy functions directly to the reduce method:

In [12]: arr.reduce(np.nanmean, dim='y')
Out[12]: 
<xray.DataArray (x: 2)>
array([-0.44093652, -0.032245  ])
Coordinates:
  * x        (x) |S1 'a' 'b'

Warning

Currently, xray uses the standard ndarray methods which do not automatically skip missing values, but we expect to switch the default to NA skipping versions (like pandas) in a future version (GH130).
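
The difference between the two kinds of reduction can be seen with plain numpy. This is a minimal sketch on a made-up array containing one missing value; it does not involve xray at all.

```python
import numpy as np

# Made-up data with a single missing value in the first row.
values = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, 6.0]])

# The standard reduction propagates NaN into any result that touches it.
plain_mean = np.mean(values, axis=1)

# The NaN-aware variant ignores missing values instead.
skipna_mean = np.nanmean(values, axis=1)
```

The first entry of plain_mean is NaN, while skipna_mean averages only the two valid values in that row.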

Broadcasting by dimension name

DataArray objects automatically align themselves (“broadcasting” in numpy parlance) by dimension name instead of axis order. With xray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as is commonly done in numpy with np.reshape() or np.newaxis.

This is best illustrated by a few examples. Consider two one-dimensional arrays with different sizes aligned along different dimensions:

In [13]: a = xray.DataArray([1, 2], [('x', ['a', 'b'])])

In [14]: a
Out[14]: 
<xray.DataArray (x: 2)>
array([1, 2])
Coordinates:
  * x        (x) |S1 'a' 'b'

In [15]: b = xray.DataArray([-1, -2, -3], [('y', [10, 20, 30])])

In [16]: b
Out[16]: 
<xray.DataArray (y: 3)>
array([-1, -2, -3])
Coordinates:
  * y        (y) int64 10 20 30

With xray, we can apply binary mathematical operations to these arrays, and their dimensions are expanded automatically:

In [17]: a * b
Out[17]: 
<xray.DataArray (x: 2, y: 3)>
array([[-1, -2, -3],
       [-2, -4, -6]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'
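
For comparison, here is a sketch of the same product in plain numpy, where the arrays carry no dimension names. The variable names a_values and b_values are illustrative; the point is that the length-1 axes have to be inserted by hand.

```python
import numpy as np

a_values = np.array([1, 2])        # varies along 'x'
b_values = np.array([-1, -2, -3])  # varies along 'y'

# Without names, numpy needs explicit length-1 axes to line the arrays up
# as (x, 1) and (1, y) before it can broadcast them to (x, y).
product = a_values[:, np.newaxis] * b_values[np.newaxis, :]
```

The result has shape (2, 3) and holds the same values as the xray output above.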

Moreover, the dimensions of the result are always ordered by where they first appear across the operands:

In [18]: c = xray.DataArray(np.arange(6).reshape(3, 2), [b['y'], a['x']])

In [19]: c
Out[19]: 
<xray.DataArray (y: 3, x: 2)>
array([[0, 1],
       [2, 3],
       [4, 5]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'

In [20]: a + c
Out[20]: 
<xray.DataArray (x: 2, y: 3)>
array([[1, 3, 5],
       [3, 5, 7]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'

This means, for example, that you can always subtract an array from its transpose:

In [21]: c - c.T
Out[21]: 
<xray.DataArray (y: 3, x: 2)>
array([[0, 0],
       [0, 0],
       [0, 0]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'
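
To see why dimension names make this possible, consider a plain-numpy sketch of the same subtraction. The helper reorder_like is hypothetical and only illustrates the idea of matching axes by name; it is not taken from xray.

```python
import numpy as np

def reorder_like(values, dims, target_dims):
    """Transpose values so that its axes follow target_dims (hypothetical helper)."""
    order = [dims.index(d) for d in target_dims]
    return np.transpose(values, order)

c_values = np.arange(6).reshape(3, 2)  # dims ('y', 'x')
c_t_values = c_values.T                # dims ('x', 'y')

# Plain numpy cannot subtract shapes (3, 2) and (2, 3) directly, so the
# transposed operand is first put back into ('y', 'x') order by name.
difference = c_values - reorder_like(c_t_values, ('x', 'y'), ('y', 'x'))
```

Once both operands share the same axis order, the difference is an array of zeros with shape (3, 2).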

Alignment and coordinates

For now, performing most binary operations on xray objects requires that all index coordinates (that is, coordinates with the same name as a dimension) have the same values:

In [22]: arr + arr[:1]
ValueError: coordinate 'x' is not aligned

However, xray does have shortcuts (copied from pandas) that make aligning DataArray and Dataset objects easy and fast:

In [23]: a, b = xray.align(arr, arr[:1])

In [24]: a + b
Out[24]: 
<xray.DataArray (x: 1, y: 3)>
array([[ 0.9382246 , -0.56572669, -3.01811701]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) object 'a'

See Align and reindex for more details.

Warning

pandas does index-based alignment automatically when doing math, using join='outer'. xray doesn’t have automatic alignment yet, but we do intend to enable it in a future version (GH186). Unlike pandas, we expect to default to join='inner'.
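
The two join modes differ in which index labels survive alignment. The sketch below shows this with plain numpy set operations on the 'x' labels of arr and arr[:1]; it is an illustration of the idea only, not how xray or pandas implement joins.

```python
import numpy as np

left_labels = np.array(['a', 'b'])  # 'x' labels of arr
right_labels = np.array(['a'])      # 'x' labels of arr[:1]

# An inner join keeps only the labels present in both operands.
inner = np.intersect1d(left_labels, right_labels)

# An outer join keeps every label found in either operand.
outer = np.union1d(left_labels, right_labels)
```

The inner join leaves the single label 'a', which matches the x: 1 result of the aligned sum above, whereas the outer join keeps both 'a' and 'b'.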

Although index coordinates are required to match exactly, other coordinates are not, and if their values conflict, they will be dropped. This is necessary, for example, because indexing turns 1D coordinates into scalars:

In [25]: arr[0]
Out[25]: 
<xray.DataArray (y: 3)>
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
  * y        (y) int64 10 20 30
    x        |S1 'a'

In [26]: arr[1]
Out[26]: 
<xray.DataArray (y: 3)>
array([-1.13563237,  1.21211203, -0.17321465])
Coordinates:
  * y        (y) int64 10 20 30
    x        |S1 'b'

# notice that the scalar coordinate 'x' is silently dropped
In [27]: arr[1] - arr[0]
Out[27]: 
<xray.DataArray (y: 3)>
array([-1.60474467,  1.49497537,  1.33584385])
Coordinates:
  * y        (y) int64 10 20 30

Still, xray will persist other coordinates in arithmetic, as long as there are no conflicting values:

# only one argument has the 'x' coordinate
In [28]: arr[0] + 1
Out[28]: 
<xray.DataArray (y: 3)>
array([ 1.4691123 ,  0.71713666, -0.5090585 ])
Coordinates:
  * y        (y) int64 10 20 30
    x        |S1 'a'

# both arguments have the same 'x' coordinate
In [29]: arr[0] - arr[0]
Out[29]: 
<xray.DataArray (y: 3)>
array([ 0.,  0.,  0.])
Coordinates:
  * y        (y) int64 10 20 30
    x        |S1 'a'
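
The rule for non-index coordinates described in this section can be summarized in a few lines of plain Python. The function merge_scalar_coords is a hypothetical illustration of that rule, with coordinates modeled as dictionaries; xray's actual implementation may differ.

```python
def merge_scalar_coords(left, right):
    """Combine non-index coordinates from two operands (hypothetical sketch)."""
    merged = {}
    for name in set(left) | set(right):
        if name in left and name in right:
            # Present on both sides: keep only if the values agree,
            # otherwise the conflicting coordinate is dropped.
            if left[name] == right[name]:
                merged[name] = left[name]
        elif name in left:
            merged[name] = left[name]
        else:
            merged[name] = right[name]
    return merged

conflicting = merge_scalar_coords({'x': 'b'}, {'x': 'a'})  # like arr[1] - arr[0]
one_sided = merge_scalar_coords({'x': 'a'}, {})            # like arr[0] + 1
matching = merge_scalar_coords({'x': 'a'}, {'x': 'a'})     # like arr[0] - arr[0]
```

Only the conflicting case loses the 'x' coordinate; the other two keep it.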

Math with Datasets

Datasets support arithmetic operations by automatically looping over all variables as well as dimensions:

In [30]: ds = xray.Dataset({'x_and_y': (('x', 'y'), np.random.randn(2, 3)),
   ....:                    'x_only': ('x', np.random.randn(2))},
   ....:                    coords=arr.coords)
   ....: 

In [31]: ds > 0
Out[31]: 
<xray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'
Variables:
    x_only   (x) bool True False
    x_and_y  (x, y) bool True False False False False True

In [32]: ds.mean(dim='x')
Out[32]: 
<xray.Dataset>
Dimensions:  (y: 3)
Coordinates:
  * y        (y) int64 10 20 30
Variables:
    x_only   float64 0.007392
    x_and_y  (y) float64 -0.9927 -0.7696 0.105

Datasets have most of the same ndarray methods found on data arrays. Again, these operations loop over all dataset variables:

In [33]: abs(ds)
Out[33]: 
<xray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'
Variables:
    x_only   (x) float64 0.7216 0.7068
    x_and_y  (x, y) float64 0.1192 1.044 0.8618 2.105 0.4949 1.072

transpose() can also be used to reorder dimensions on all variables:

In [34]: ds.transpose('y', 'x')
Out[34]: 
<xray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30
Variables:
    x_only   (x) float64 0.7216 -0.7068
    x_and_y  (y, x) float64 0.1192 -2.105 -1.044 -0.4949 -0.8618 1.072

Unfortunately, a limitation of the current version of numpy means that we cannot override ufuncs for datasets, because datasets cannot be written as a single array [1]. apply() works around this limitation by applying the given function to each variable in the dataset:

In [35]: ds.apply(np.sin)
Out[35]: 
<xray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30
Variables:
    x_only   (x) float64 0.6606 -0.6494
    x_and_y  (x, y) float64 0.1189 -0.8645 -0.759 -0.8609 -0.475 0.8781
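
Conceptually, applying a function to each variable amounts to a loop over the dataset's variables. The sketch below models a dataset as a plain dictionary of numpy arrays; both the variables mapping and the apply_to_each helper are hypothetical and stand in for behaviour that xray provides through apply().

```python
import numpy as np

# Hypothetical stand-in for a dataset: variable names mapped to values.
variables = {
    'x_and_y': np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]),
    'x_only': np.array([0.0, 1.0]),
}

def apply_to_each(variables, func):
    """Call func on every variable and collect the results by name."""
    return {name: func(values) for name, values in variables.items()}

result = apply_to_each(variables, np.sin)
```

Each variable keeps its own shape, since the function is applied to one variable at a time.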

Datasets also loop over variables when broadcasting in binary arithmetic. You can do arithmetic between any DataArray and a dataset as long as they have aligned indexes:

In [36]: ds + arr
Out[36]: 
<xray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'
Variables:
    x_only   (x, y) float64 1.191 0.4387 -0.7875 -1.842 0.5053 -0.88
    x_and_y  (x, y) float64 0.5883 -1.327 -2.371 -3.24 0.7172 0.8986

Arithmetic between two datasets requires that the datasets also have the same variables:

In [37]: ds2 = xray.Dataset({'x_and_y': 0, 'x_only': 100})

In [38]: ds - ds2
Out[38]: 
<xray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'
Variables:
    x_only   (x) float64 -99.28 -100.7
    x_and_y  (x, y) float64 0.1192 -1.044 -0.8618 -2.105 -0.4949 1.072

There is no shortcut similar to align for aligning variable names, but you may find rename() and drop_vars() useful.

Note

When we enable automatic alignment over indexes, we will probably enable automatic alignment between dataset variables as well.

[1] When numpy 1.10 is released, we should be able to override ufuncs for datasets by making use of __numpy_ufunc__.