Combining data
- For combining datasets or data arrays along a dimension, see concatenate.
- For combining datasets with different variables, see merge.
Concatenate
To combine arrays along an existing or new dimension into a larger array, you
can use concat(). concat takes an iterable of DataArray or Dataset objects, as
well as a dimension name, and concatenates along that dimension:
In [1]: arr = xr.DataArray(np.random.randn(2, 3),
...: [('x', ['a', 'b']), ('y', [10, 20, 30])])
...:
In [2]: arr[:, :1]
Out[2]:
<xarray.DataArray (x: 2, y: 1)>
array([[ 0.4691123 ],
[-1.13563237]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10
# this resembles how you would use np.concatenate
In [3]: xr.concat([arr[:, :1], arr[:, 1:]], dim='y')
Out[3]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
In addition to combining along an existing dimension, concat can create a
new dimension by stacking lower dimensional arrays together:
In [4]: arr[0]
Out[4]:
<xarray.DataArray (y: 3)>
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
x |S1 'a'
* y (y) int64 10 20 30
# to combine these 1d arrays into a 2d array in numpy, you would use np.array
In [5]: xr.concat([arr[0], arr[1]], 'x')
Out[5]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
If the second argument to concat is a new dimension name, the arrays will
be concatenated along that new dimension, which is always inserted as the first
dimension:
In [6]: xr.concat([arr[0], arr[1]], 'new_dim')
Out[6]:
<xarray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
x (new_dim) |S1 'a' 'b'
* new_dim (new_dim) int64 0 1
The second argument to concat can also be an Index or DataArray object instead
of a string, in which case it is used to label the values along the new
dimension:
In [7]: xr.concat([arr[0], arr[1]], pd.Index([-90, -100], name='new_dim'))
Out[7]:
<xarray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
x (new_dim) |S1 'a' 'b'
* new_dim (new_dim) int64 -90 -100
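The DataArray form of the second argument can be sketched in the same way. The names row0, row1 and labels below are illustrative, and the sketch assumes that concat takes the new dimension name from the dimension of the one-dimensional labels array:

```python
import xarray as xr

# Two 1-D arrays that share the same "y" index (illustrative data).
row0 = xr.DataArray([0.0, 1.0, 2.0], coords=[("y", [10, 20, 30])])
row1 = xr.DataArray([3.0, 4.0, 5.0], coords=[("y", [10, 20, 30])])

# A 1-D DataArray whose single dimension names the new dimension and
# whose values label the positions along it.
labels = xr.DataArray([-90, -100], dims="new_dim", name="new_dim")

stacked = xr.concat([row0, row1], dim=labels)
print(stacked)
```

As with a pandas Index, the values of labels become the coordinate of the new dimension.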
Of course, concat also works on Dataset objects:
In [8]: ds = arr.to_dataset(name='foo')
In [9]: xr.concat([ds.sel(x='a'), ds.sel(x='b')], 'x')
Out[9]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
concat() has a number of options which provide deeper control over which
variables are concatenated and how it handles conflicting variables between
datasets. With the default parameters, xarray will load some coordinate
variables into memory to compare them between datasets. This may be
prohibitively expensive if you are manipulating your dataset lazily using
out-of-core computation with dask.
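As a minimal sketch of these options, the example below concatenates two made-up datasets along time while leaving a static variable untouched. The keyword arguments data_vars, coords and compat are not named in the text above; they are the option names used by recent xarray releases, so check the signature of concat in your installed version.

```python
import numpy as np
import xarray as xr

# Two illustrative datasets: "temp" varies in time, "mask" is static.
mask = ("x", np.array([True, False, True]))
ds1 = xr.Dataset(
    {"temp": (("time", "x"), np.zeros((2, 3))), "mask": mask},
    coords={"time": [0, 1], "x": [10, 20, 30]},
)
ds2 = xr.Dataset(
    {"temp": (("time", "x"), np.ones((2, 3))), "mask": mask},
    coords={"time": [2, 3], "x": [10, 20, 30]},
)

combined = xr.concat(
    [ds1, ds2],
    dim="time",
    data_vars="minimal",  # only concatenate variables that already have "time"
    coords="minimal",     # same rule for coordinates
    compat="override",    # skip value comparison, take the first dataset's copy
)
print(combined)
```

With these settings mask keeps its single x dimension, and xarray does not need to compare its values between the inputs.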
Merge
To combine variables and coordinates between multiple DataArray and/or
Dataset objects, use merge(). It can merge a list of Dataset objects,
DataArray objects, or dictionaries of objects convertible to DataArray
objects:
In [10]: xr.merge([ds, ds.rename({'foo': 'bar'})])
Out[10]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
bar (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
In [11]: xr.merge([xr.DataArray(n, name='var%d' % n) for n in range(5)])
Out[11]:
<xarray.Dataset>
Dimensions: ()
Coordinates:
*empty*
Data variables:
var0 int64 0
var1 int64 1
var2 int64 2
var3 int64 3
var4 int64 4
If you merge another dataset (or a dictionary including data array objects), by default the resulting dataset will be aligned on the union of all index coordinates:
In [12]: other = xr.Dataset({'bar': ('x', [1, 2, 3, 4]), 'x': list('abcd')})
In [13]: xr.merge([ds, other])
Out[13]:
<xarray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* x (x) object 'a' 'b' 'c' 'd'
* y (y) int64 10 20 30
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732 nan ...
bar (x) int64 1 2 3 4
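The union alignment can be compared with an intersection by passing the join argument explicitly. This is a small sketch with made-up datasets; join is an option of merge in recent xarray releases and is not described in the text above.

```python
import xarray as xr

# Illustrative datasets whose "x" indexes only partly overlap.
foo = xr.Dataset({"foo": ("x", [1.0, 2.0])}, coords={"x": ["a", "b"]})
bar = xr.Dataset({"bar": ("x", [10, 20, 30])}, coords={"x": ["b", "c", "d"]})

# Union of the indexes: labels missing from an input are filled with NaN.
outer = xr.merge([foo, bar], join="outer")

# Intersection of the indexes: only labels present in every input survive.
inner = xr.merge([foo, bar], join="inner")

print(outer)
print(inner)
```

The outer join keeps every label from both inputs, whereas the inner join discards labels that are not shared.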
Aligning on the union of index coordinates ensures that merge is
non-destructive. xarray.MergeError is raised if you attempt to merge two
variables with the same name but different values:
In [14]: xr.merge([ds, ds + 1])
MergeError: conflicting values for variable 'foo' on objects to be combined:
first value: <xarray.Variable (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
second value: <xarray.Variable (x: 2, y: 3)>
array([[ 1.4691123 , 0.71713666, -0.5090585 ],
[-0.13563237, 2.21211203, 0.82678535]])
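In a script, the conflict can be caught rather than left to propagate. The sketch below uses a small made-up dataset. It passes compat explicitly, and it also shows compat="override", which recent xarray releases provide to skip the comparison and keep the value from the first object. Neither setting is described in the text above.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"foo": ("x", np.array([1.0, 2.0]))}, coords={"x": ["a", "b"]})
shifted = ds + 1  # same variable name, different values

try:
    xr.merge([ds, shifted], compat="no_conflicts")
    conflict_detected = False
except xr.MergeError:
    conflict_detected = True

# Opt out of the check: the first object's "foo" is kept unchanged.
kept_first = xr.merge([ds, shifted], compat="override")

print(conflict_detected)
print(kept_first["foo"].values)
```

Overriding discards the second object's values without warning, so it is only appropriate when the inputs are known to agree or the first is known to be authoritative.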
The same non-destructive merging between DataArray index coordinates is
used in the Dataset constructor:
In [15]: xr.Dataset({'a': arr[:-1], 'b': arr[1:]})
Out[15]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 10 20 30
Data variables:
a (x, y) float64 0.4691 -0.2829 -1.509 nan nan nan
b (x, y) float64 nan nan nan -1.136 1.212 -0.1732
Update
In contrast to merge, update modifies a dataset in-place without
checking for conflicts, and will overwrite any existing variables with new
values:
In [16]: ds.update({'space': ('space', [10.2, 9.4, 3.9])})
Out[16]:
<xarray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
* space (space) float64 10.2 9.4 3.9
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
However, dimensions are still required to be consistent between different Dataset variables, so you cannot change the size of a dimension unless you replace all dataset variables that use it.
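The dimension-size constraint can be illustrated with a small made-up dataset whose two variables share a dimension t. The sketch assumes that an inconsistent size surfaces as a ValueError, which is the behaviour I would expect from recent xarray releases.

```python
import xarray as xr

# Both variables use dimension "t" with length 3 (illustrative data).
ds = xr.Dataset({"a": ("t", [1, 2, 3]), "b": ("t", [4, 5, 6])})

# Replacing only "a" with a shorter array conflicts with "b".
try:
    ds.update({"a": ("t", [1, 2])})
    resize_failed = False
except ValueError:
    resize_failed = True

# Replacing every variable that uses "t" changes its size consistently.
ds.update({"a": ("t", [1, 2]), "b": ("t", [4, 5])})

print(resize_failed)
print(ds.sizes["t"])
```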
update also performs automatic alignment if necessary. Unlike merge, it
maintains the alignment of the original array instead of merging indexes:
In [17]: ds.update(other)
Out[17]:
<xarray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 10 20 30
* space (space) float64 10.2 9.4 3.9
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
bar (x) int64 1 2
The exact same alignment logic applies when setting a variable with
__setitem__ syntax:
In [18]: ds['baz'] = xr.DataArray([9, 9, 9, 9, 9], coords=[('x', list('abcde'))])
In [19]: ds.baz
Out[19]:
<xarray.DataArray 'baz' (x: 2)>
array([9, 9])
Coordinates:
* x (x) object 'a' 'b'
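The difference in alignment between merge and update can be seen side by side. This sketch uses made-up datasets and inspects the dataset after calling update rather than relying on its return value, which may differ between xarray versions.

```python
import xarray as xr

ds = xr.Dataset({"foo": ("x", [1.0, 2.0])}, coords={"x": ["a", "b"]})
other = xr.Dataset({"bar": ("x", [1, 2, 3, 4])}, coords={"x": list("abcd")})

# merge returns a new object aligned on the union of the "x" indexes.
merged = xr.merge([ds, other], join="outer")

# update works in place and keeps the original "x" index of ds,
# so the values of "bar" at "c" and "d" are dropped.
ds.update(other)

print(merged.sizes["x"])
print(ds["bar"].values)
```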
Equals and identical
xarray objects can be compared by using the equals(), identical() and
broadcast_equals() methods. These methods are used by the optional compat
argument on concat and merge.
equals checks dimension names, indexes and array values:
In [20]: arr.equals(arr.copy())
Out[20]: True
identical also checks attributes, and the name of each object:
In [21]: arr.identical(arr.rename('bar'))
Out[21]: False
broadcast_equals does a more relaxed form of equality check that allows
variables to have different dimensions, as long as values are constant along
those new dimensions:
In [22]: left = xr.Dataset(coords={'x': 0})
In [23]: right = xr.Dataset({'x': [0, 0, 0]})
In [24]: left.broadcast_equals(right)
Out[24]: True
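The three checks can be contrasted on small made-up arrays. In this sketch a and b differ only in their attributes, while scalar and tiled differ only in their dimensions.

```python
import xarray as xr

# Same values, dims and name, but different attributes.
a = xr.DataArray([1.0, 2.0], dims="x", name="a", attrs={"units": "m"})
b = xr.DataArray([1.0, 2.0], dims="x", name="a", attrs={"units": "km"})

same_values = a.equals(b)          # attributes are ignored
same_everything = a.identical(b)   # attributes are compared

# A scalar and a constant 1-D array with the same value.
scalar = xr.DataArray(5.0)
tiled = xr.DataArray([5.0, 5.0, 5.0], dims="x")

strict = scalar.equals(tiled)             # dimensions differ
relaxed = scalar.broadcast_equals(tiled)  # equal after broadcasting

print(same_values, same_everything, strict, relaxed)
```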
Like pandas objects, two xarray objects are still equal or identical if they
have missing values marked by NaN in the same locations.
In contrast, the == operation performs element-wise comparison (like numpy):
In [25]: arr == arr.copy()
Out[25]:
<xarray.DataArray (x: 2, y: 3)>
array([[ True, True, True],
[ True, True, True]], dtype=bool)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Note that NaN does not compare equal to NaN in element-wise comparison;
you may need to deal with missing values explicitly.
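One way to handle missing values explicitly is to treat positions where both arrays are null as matching. The sketch below uses made-up data, and the combined mask is only one possible approach.

```python
import numpy as np
import xarray as xr

a = xr.DataArray([1.0, np.nan, 3.0], dims="x")
b = a.copy()

# Whole-object comparison treats NaN in the same location as equal.
whole_object = a.equals(b)

# Element-wise comparison does not: the NaN position compares as False.
elementwise = bool((a == b).all())

# Count a position as matching if the values are equal or both are missing.
nan_aware = bool(((a == b) | (a.isnull() & b.isnull())).all())

print(whole_object, elementwise, nan_aware)
```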