xarray: N-D labeled arrays and datasets in Python

xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures.

Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data for which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file.

Documentation

Getting Started

Overview: Why xarray?

Features

Adding dimension names and coordinate indexes to numpy’s ndarray makes many powerful array operations possible:

  • Apply operations over dimensions by name: x.sum('time').
  • Select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
  • Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
  • Flexible split-apply-combine operations with groupby: x.groupby('time.dayofyear').mean().
  • Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').
  • Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
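
A minimal sketch pulling several of these features together (the array, dates, and names here are hypothetical):

import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2014-01-01', periods=10)
x = xr.DataArray(np.arange(10.0), coords=[('time', times)])

x.sum('time')                       # apply operations over dimensions by name
x.sel(time='2014-01-05')            # select values by label
x.groupby('time.dayofyear').mean()  # split-apply-combine
x.attrs['units'] = 'kelvin'         # attach arbitrary metadata

# outer alignment of two arrays with partially overlapping labels
y = xr.DataArray(np.ones(5), coords=[('time', times[5:])])
x, y = xr.align(x, y, join='outer')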

pandas provides many of these features, but it does not make use of dimension names, and its core data structures are fixed-dimensional arrays.

The N-dimensional nature of xarray’s data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis labels (dim='time' instead of axis=0) makes such arrays much more manageable than the raw numpy ndarray: with xarray, you don’t need to keep track of the order of your arrays’ dimensions or insert dummy dimensions (e.g., np.newaxis) to align arrays.
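
As a sketch (with hypothetical names), compare computing an anomaly in plain numpy and in xarray:

import numpy as np
import xarray as xr

arr = np.random.randn(3, 4)
da = xr.DataArray(arr, dims=('time', 'space'))

# numpy: the mean over axis 1 needs a dummy axis before subtracting
arr_anom = arr - arr.mean(axis=1)[:, np.newaxis]

# xarray: broadcasting aligns on dimension names, in whatever order they appear
da_anom = da - da.mean(dim='space')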

Core data structures

xarray has two core data structures. Both are fundamentally N-dimensional:

  • DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.
  • Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.

The value of attaching labels to numpy’s ndarray may be fairly obvious, but the dataset may need more motivation.

The power of the dataset over a plain dictionary is that, in addition to pulling out arrays by name, it is possible to select or combine data along a dimension across all arrays simultaneously. Like a DataFrame, datasets facilitate array operations with heterogeneous data – the difference is that the arrays in a dataset can not only have different data types, but can also have different numbers of dimensions.
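
As a sketch (the variable names are hypothetical), a single selection slices every array in the dataset that shares the dimension, regardless of how many dimensions each array has:

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset({'temperature': (('time', 'station'), np.random.randn(4, 3)),
                 'elevation': ('station', [10.0, 250.0, 995.0])},
                coords={'time': pd.date_range('2014-01-01', periods=4),
                        'station': ['A', 'B', 'C']})

ds.sel(station='B')  # selects from the 2D and the 1D variable at once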

This data model is borrowed from the netCDF file format, which also provides xarray with a natural and portable serialization format. NetCDF is very popular in the geosciences, and there are existing libraries for reading and writing netCDF in many programming languages, including Python.

xarray distinguishes itself from many tools for working with netCDF data insofar as it provides data structures for in-memory analytics that both utilize and preserve labels. You only need to do the tedious work of adding metadata once, not every time you save a file.

Goals and aspirations

pandas excels at working with tabular data. That suffices for many statistical analyses, but physical scientists rely on N-dimensional arrays – which is where xarray comes in.

xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data. When possible, we copy the pandas API and rely on pandas’s highly optimized internals (in particular, for fast indexing).

Importantly, xarray has robust support for converting its objects to and from a numpy ndarray or a pandas DataFrame or Series, providing compatibility with the full PyData ecosystem.
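
A short sketch of these round trips (the name 'value' is arbitrary):

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.randn(2, 3), dims=('x', 'y'))

arr = da.values                     # the underlying numpy.ndarray
df = da.to_dataframe(name='value')  # a pandas DataFrame with a MultiIndex

# and back again
xr.DataArray.from_series(df['value'])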

Our target audience is anyone who needs N-dimensional labeled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF.

Frequently Asked Questions

Why is pandas not enough?

pandas is a fantastic library for analysis of low-dimensional labeled data - if it can be sensibly described as “rows and columns”, pandas is probably the right choice. However, sometimes we want to use higher-dimensional arrays (ndim > 2), or arrays for which the order of dimensions (e.g., columns vs rows) shouldn’t really matter. For example, climate and weather data is often natively expressed in 4 or more dimensions: time, x, y and z.

Pandas has historically supported N-dimensional panels, but deprecated them in version 0.20 in favor of Xarray data structures. There are now built-in methods on both sides to convert between pandas and Xarray, allowing for more focused development effort. Xarray objects have a much richer model of dimensionality - if you were using Panels:

  • You need to create a new factory type for each dimensionality.
  • You can’t do math between NDPanels with different dimensionality.
  • Each dimension in an NDPanel has a name (e.g., ‘labels’, ‘items’, ‘major_axis’, etc.), but the names refer to position, not meaning: you can’t specify an operation to be applied along the “time” axis.
  • You often have to manually convert collections of pandas arrays (Series, DataFrames, etc) to have the same number of dimensions. In contrast, this sort of data structure fits very naturally in an xarray Dataset.

You can read about switching from Panels to Xarray here. Pandas gets a lot of things right, but scientific users need fully multi-dimensional data structures.

How do xarray data structures differ from those found in pandas?

The main distinguishing feature of xarray’s DataArray over labeled arrays in pandas is that dimensions can have names (e.g., “time”, “latitude”, “longitude”). Names are much easier to keep track of than axis numbers, and xarray uses dimension names for indexing, aggregation and broadcasting. Not only can you write x.sel(time='2000-01-01') and x.mean(dim='time'), but operations like x - x.mean(dim='time') always work, no matter the order of the “time” dimension. You never need to reshape arrays (e.g., with np.newaxis) to align them for arithmetic operations in xarray.
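
A small sketch of this order independence:

import numpy as np
import xarray as xr

x = xr.DataArray(np.random.randn(3, 5), dims=('time', 'space'))

anom = x - x.mean(dim='time')      # 'time' is the first dimension here
anom_T = x.T - x.mean(dim='time')  # ... and the last one here; both work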

Should I use xarray instead of pandas?

It’s not an either/or choice! xarray provides robust support for converting back and forth between the tabular data-structures of pandas and its own multi-dimensional data-structures.

That said, you should only bother with xarray if some aspect of your data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, stick with pandas.

Why don’t aggregations return Python scalars?

xarray tries hard to be self-consistent: operations on a DataArray (resp. Dataset) return another DataArray (resp. Dataset) object. In particular, operations returning scalar values (e.g. indexing or aggregations like mean or sum applied to all axes) will also return xarray objects.

Unfortunately, this means we sometimes have to explicitly cast our results from xarray when using them in other libraries. As an illustration, the following code fragment

In [1]: arr = xr.DataArray([1, 2, 3])

In [2]: pd.Series({'x': arr[0], 'mean': arr.mean(), 'std': arr.std()})
Out[2]: 
mean                  <xarray.DataArray ()>\narray(2.0)
std     <xarray.DataArray ()>\narray(0.816496580927726)
x                       <xarray.DataArray ()>\narray(1)
dtype: object

does not yield the pandas DataFrame we expected. We need to specify the type conversion ourselves:

In [3]: pd.Series({'x': arr[0], 'mean': arr.mean(), 'std': arr.std()}, dtype=float)
Out[3]: 
mean    2.000000
std     0.816497
x       1.000000
dtype: float64

Alternatively, we could use the item method or the float constructor to convert values one at a time:

In [4]: pd.Series({'x': arr[0].item(), 'mean': float(arr.mean())})
Out[4]: 
mean    2.0
x       1.0
dtype: float64

What is your approach to metadata?

We are firm believers in the power of labeled data! In addition to dimensions and coordinates, xarray supports arbitrary metadata in the form of global (Dataset) and variable specific (DataArray) attributes (attrs).

Automatic interpretation of labels is powerful but also reduces flexibility. With xarray, we draw a firm line between labels that the library understands (dims and coords) and labels for users and user code (attrs). For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to and from netCDF files.)

An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option). Similarly, xarray does not check for conflicts between attrs when combining arrays and datasets, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
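
For example, attributes survive a reduction only when explicitly requested (a minimal sketch):

import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0], dims='x', attrs={'units': 'meters'})

da.mean().attrs                  # {} -- attrs are dropped by default
da.mean(keep_attrs=True).attrs   # {'units': 'meters'}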

How should I cite xarray?

If you are using xarray and would like to cite it in an academic publication, we would certainly appreciate it. We recommend two citations.

  1. At a minimum, we recommend citing the xarray overview journal article, published in the Journal of Open Research Software.

    • Hoyer, S. & Hamman, J., (2017). xarray: N-D labeled Arrays and Datasets in Python. Journal of Open Research Software. 5(1), p.10. DOI: http://doi.org/10.5334/jors.148

      Here’s an example of a BibTeX entry:

      @article{hoyer2017xarray,
        title     = {xarray: {N-D} labeled arrays and datasets in {Python}},
        author    = {Hoyer, S. and J. Hamman},
        journal   = {Journal of Open Research Software},
        volume    = {5},
        number    = {1},
        year      = {2017},
        publisher = {Ubiquity Press},
        doi       = {10.5334/jors.148},
        url       = {http://doi.org/10.5334/jors.148}
      }
      
  2. You may also want to cite a specific version of the xarray package. We provide a Zenodo citation and DOI for this purpose:

    DOI: 10.5281/zenodo.598201

    An example BibTeX entry:

    @misc{xarray_v0_8_0,
          author = {Stephan Hoyer and Clark Fitzgerald and Joe Hamman and others},
          title  = {xarray: v0.8.0},
          month  = aug,
          year   = 2016,
          doi    = {10.5281/zenodo.59499},
          url    = {http://dx.doi.org/10.5281/zenodo.59499}
         }
    

Examples

Quick overview

Here are some quick examples of what you can do with xarray.DataArray objects. Everything is explained in much more detail in the rest of the documentation.

To begin, import numpy, pandas and xarray using their customary abbreviations:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import xarray as xr

Create a DataArray

You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:

In [4]: xr.DataArray(np.random.randn(2, 3))
Out[4]: 
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 1.643563, -1.469388,  0.357021],
       [-0.6746  , -1.776904, -0.968914]])
Dimensions without coordinates: dim_0, dim_1

In [5]: data = xr.DataArray(np.random.randn(2, 3), coords={'x': ['a', 'b']}, dims=('x', 'y'))

In [6]: data
Out[6]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-1.294524,  0.413738,  0.276662],
       [-0.472035, -0.01396 , -0.362543]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

If you supply a pandas Series or DataFrame, metadata is copied directly:

In [7]: xr.DataArray(pd.Series(range(3), index=list('abc'), name='foo'))
Out[7]: 
<xarray.DataArray 'foo' (dim_0: 3)>
array([0, 1, 2])
Coordinates:
  * dim_0    (dim_0) object 'a' 'b' 'c'

Here are the key properties for a DataArray:

# like in pandas, values is a numpy array that you can modify in-place
In [8]: data.values
Out[8]: 
array([[-1.295,  0.414,  0.277],
       [-0.472, -0.014, -0.363]])

In [9]: data.dims
Out[9]: ('x', 'y')

In [10]: data.coords
Out[10]: 
Coordinates:
  * x        (x) <U1 'a' 'b'

# you can use this dictionary to store arbitrary metadata
In [11]: data.attrs
Out[11]: OrderedDict()

Indexing

xarray supports four kinds of indexing. These operations are just as fast as in pandas, because we borrow pandas’ indexing machinery.

# positional and by integer label, like numpy
In [12]: data[[0, 1]]
Out[12]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-1.294524,  0.413738,  0.276662],
       [-0.472035, -0.01396 , -0.362543]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

# positional and by coordinate label, like pandas
In [13]: data.loc['a':'b']
Out[13]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-1.294524,  0.413738,  0.276662],
       [-0.472035, -0.01396 , -0.362543]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

# by dimension name and integer label
In [14]: data.isel(x=slice(2))
Out[14]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-1.294524,  0.413738,  0.276662],
       [-0.472035, -0.01396 , -0.362543]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

# by dimension name and coordinate label
In [15]: data.sel(x=['a', 'b'])
Out[15]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-1.294524,  0.413738,  0.276662],
       [-0.472035, -0.01396 , -0.362543]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

Computation

Data arrays work very similarly to numpy ndarrays:

In [16]: data + 10
Out[16]: 
<xarray.DataArray (x: 2, y: 3)>
array([[  8.705476,  10.413738,  10.276662],
       [  9.527965,   9.98604 ,   9.637457]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

In [17]: np.sin(data)
Out[17]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-0.962079,  0.402035,  0.273146],
       [-0.454699, -0.013959, -0.354653]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

In [18]: data.T
Out[18]: 
<xarray.DataArray (y: 3, x: 2)>
array([[-1.294524, -0.472035],
       [ 0.413738, -0.01396 ],
       [ 0.276662, -0.362543]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

In [19]: data.sum()
Out[19]: 
<xarray.DataArray ()>
array(-1.4526610277231344)

However, aggregation operations can use dimension names instead of axis numbers:

In [20]: data.mean(dim='x')
Out[20]: 
<xarray.DataArray (y: 3)>
array([-0.883279,  0.199889, -0.042941])
Dimensions without coordinates: y

Arithmetic operations broadcast based on dimension name. This means you don’t need to insert dummy dimensions for alignment:

In [21]: a = xr.DataArray(np.random.randn(3), [data.coords['y']])

In [22]: b = xr.DataArray(np.random.randn(4), dims='z')

In [23]: a
Out[23]: 
<xarray.DataArray (y: 3)>
array([-0.006154, -0.923061,  0.895717])
Coordinates:
  * y        (y) int64 0 1 2

In [24]: b
Out[24]: 
<xarray.DataArray (z: 4)>
array([ 0.805244, -1.206412,  2.565646,  1.431256])
Dimensions without coordinates: z

In [25]: a + b
Out[25]: 
<xarray.DataArray (y: 3, z: 4)>
array([[ 0.79909 , -1.212565,  2.559492,  1.425102],
       [-0.117817, -2.129472,  1.642585,  0.508195],
       [ 1.700961, -0.310694,  3.461363,  2.326973]])
Coordinates:
  * y        (y) int64 0 1 2
Dimensions without coordinates: z

It also means that in most cases you do not need to worry about the order of dimensions:

In [26]: data - data.T
Out[26]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

Operations also align based on index labels:

In [27]: data[:-1] - data[:1]
Out[27]: 
<xarray.DataArray (x: 1, y: 3)>
array([[ 0.,  0.,  0.]])
Coordinates:
  * x        (x) <U1 'a'
Dimensions without coordinates: y

GroupBy

xarray supports grouped operations using a very similar API to pandas:

In [28]: labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')

In [29]: labels
Out[29]: 
<xarray.DataArray 'labels' (y: 3)>
array(['E', 'F', 'E'],
      dtype='<U1')
Coordinates:
  * y        (y) int64 0 1 2

In [30]: data.groupby(labels).mean('y')
Out[30]: 
<xarray.DataArray (x: 2, labels: 2)>
array([[-0.508931,  0.413738],
       [-0.417289, -0.01396 ]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * labels   (labels) object 'E' 'F'

In [31]: data.groupby(labels).apply(lambda x: x - x.min())
Out[31]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.      ,  0.427698,  1.571185],
       [ 0.822489,  0.      ,  0.931981]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 0 1 2
    labels   (y) <U1 'E' 'F' 'E'

pandas

Xarray objects can be easily converted to and from pandas objects:

In [32]: series = data.to_series()

In [33]: series
Out[33]: 
x  y
a  0   -1.294524
   1    0.413738
   2    0.276662
b  0   -0.472035
   1   -0.013960
   2   -0.362543
dtype: float64

# convert back
In [34]: series.to_xarray()
Out[34]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-1.294524,  0.413738,  0.276662],
       [-0.472035, -0.01396 , -0.362543]])
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 0 1 2

Datasets

xarray.Dataset is a dict-like container of aligned DataArray objects. You can think of it as a multi-dimensional generalization of the pandas.DataFrame:

In [35]: ds = xr.Dataset({'foo': data, 'bar': ('x', [1, 2]), 'baz': np.pi})

In [36]: ds
Out[36]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 -1.295 0.4137 0.2767 -0.472 -0.01396 -0.3625
    bar      (x) int64 1 2
    baz      float64 3.142

Use dictionary indexing to pull out Dataset variables as DataArray objects:

In [37]: ds['foo']
Out[37]: 
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[-1.294524,  0.413738,  0.276662],
       [-0.472035, -0.01396 , -0.362543]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

Variables in datasets can have different dtype and even different dimensions, but all dimensions are assumed to refer to points in the same shared coordinate system.

You can do almost everything you can do with DataArray objects with Dataset objects (including indexing and arithmetic) if you prefer to work with multiple variables at once.
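
For instance, continuing with the ds defined above (a sketch; outputs omitted):

ds.isel(y=slice(2))  # trims 'foo' along y; 'bar' and 'baz' pass through
ds + 10              # adds 10 to every data variable
ds.mean(dim='y')     # reduces only the variables that have a 'y' dimension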

NetCDF

NetCDF is the recommended binary serialization format for xarray objects. Users from the geosciences will recognize that the Dataset data model looks very similar to a netCDF file (which, in fact, inspired it).

You can directly read and write xarray objects to disk using to_netcdf(), open_dataset() and open_dataarray():

In [38]: ds.to_netcdf('example.nc')

In [39]: xr.open_dataset('example.nc')
Out[39]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) object 'a' 'b'
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 ...
    bar      (x) int64 ...
    baz      float64 ...

Toy weather data

Here is an example of how to easily manipulate a toy weather dataset using xarray and other recommended Python libraries:

Shared setup:

import numpy as np
import pandas as pd
import seaborn as sns  # pandas aware plotting library

import xarray as xr

np.random.seed(123)

times = pd.date_range('2000-01-01', '2001-12-31', name='time')
annual_cycle = np.sin(2 * np.pi * (times.dayofyear.values / 365.25 - 0.28))

base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)

ds = xr.Dataset({'tmin': (('time', 'location'), tmin_values),
                 'tmax': (('time', 'location'), tmax_values)},
                {'time': times, 'location': ['IA', 'IN', 'IL']})

Examine a dataset with pandas and seaborn

In [1]: ds
Out[1]: 
<xarray.Dataset>
Dimensions:   (location: 3, time: 731)
Coordinates:
  * time      (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
  * location  (location) <U2 'IA' 'IN' 'IL'
Data variables:
    tmin      (time, location) float64 -8.037 -1.788 -3.932 -9.341 -6.558 ...
    tmax      (time, location) float64 12.98 3.31 6.779 0.4479 6.373 4.843 ...

In [2]: df = ds.to_dataframe()

In [3]: df.head()
Out[3]: 
                          tmin       tmax
location time                            
IA       2000-01-01  -8.037369  12.980549
         2000-01-02  -9.341157   0.447856
         2000-01-03 -12.139719   5.322699
         2000-01-04  -7.492914   1.889425
         2000-01-05  -0.447129   0.791176

In [4]: df.describe()
Out[4]: 
              tmin         tmax
count  2193.000000  2193.000000
mean      9.975426    20.108232
std      10.963228    11.010569
min     -13.395763    -3.506234
25%      -0.040347     9.853905
50%      10.060403    19.967409
75%      20.083590    30.045588
max      33.456060    43.271148

In [5]: ds.mean(dim='location').to_dataframe().plot()
Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x7f10d813ecf8>

In [6]: sns.pairplot(df.reset_index(), vars=ds.data_vars)
Out[6]: <seaborn.axisgrid.PairGrid at 0x7f10ecd3e2e8>

Probability of freeze by calendar month

In [7]: freeze = (ds['tmin'] <= 0).groupby('time.month').mean('time')

In [8]: freeze
Out[8]: 
<xarray.DataArray 'tmin' (month: 12, location: 3)>
array([[ 0.951613,  0.887097,  0.935484],
       [ 0.842105,  0.719298,  0.77193 ],
       [ 0.241935,  0.129032,  0.16129 ],
       [ 0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.016129,  0.      ],
       [ 0.333333,  0.35    ,  0.233333],
       [ 0.935484,  0.854839,  0.822581]])
Coordinates:
  * location  (location) <U2 'IA' 'IN' 'IL'
  * month     (month) int64 1 2 3 4 5 6 7 8 9 10 11 12

In [9]: freeze.to_pandas().plot()
Out[9]: <matplotlib.axes._subplots.AxesSubplot at 0x7f10cf6ff828>

Monthly averaging

In [10]: monthly_avg = ds.resample(time='1MS').mean()

In [11]: monthly_avg.sel(location='IA').to_dataframe().plot(style='s-')
Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x7f10cf664550>

Note that MS here refers to Month-Start; M labels Month-End (the last day of the month).

Calculate monthly anomalies

In climatology, “anomalies” refer to the difference between observations and typical weather for a particular season. Unlike observations, anomalies should not show any seasonal cycle.

In [12]: climatology = ds.groupby('time.month').mean('time')

In [13]: anomalies = ds.groupby('time.month') - climatology

In [14]: anomalies.mean('location').to_dataframe()[['tmin', 'tmax']].plot()
Out[14]: <matplotlib.axes._subplots.AxesSubplot at 0x7f10cf69d1d0>

Calculate standardized monthly anomalies

You can create standardized anomalies where the difference between the observations and the climatological monthly mean is divided by the climatological standard deviation.

In [15]: climatology_mean = ds.groupby('time.month').mean('time')

In [16]: climatology_std = ds.groupby('time.month').std('time')

In [17]: stand_anomalies = xr.apply_ufunc(
   ....:                                  lambda x, m, s: (x - m) / s,
   ....:                                  ds.groupby('time.month'),
   ....:                                  climatology_mean, climatology_std)
   ....: 

In [18]: stand_anomalies.mean('location').to_dataframe()[['tmin', 'tmax']].plot()
Out[18]: <matplotlib.axes._subplots.AxesSubplot at 0x7f10cf5f69b0>

Fill missing values with climatology

The fillna() method on grouped objects lets you easily fill missing values by group:

# throw away the first half of every month
In [19]: some_missing = ds.tmin.sel(time=ds['time.day'] > 15).reindex_like(ds)

In [20]: filled = some_missing.groupby('time.month').fillna(climatology.tmin)

In [21]: both = xr.Dataset({'some_missing': some_missing, 'filled': filled})

In [22]: both
Out[22]: 
<xarray.Dataset>
Dimensions:       (location: 3, time: 731)
Coordinates:
  * time          (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
  * location      (location) object 'IA' 'IN' 'IL'
    month         (time) int64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Data variables:
    some_missing  (time, location) float64 nan nan nan nan nan nan nan nan ...
    filled        (time, location) float64 -5.163 -4.216 -4.681 -5.163 ...

In [23]: df = both.sel(time='2000').mean('location').reset_coords(drop=True).to_dataframe()

In [24]: df[['filled', 'some_missing']].plot()
Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0x7f10cf521fd0>

Calculating Seasonal Averages from Timeseries of Monthly Means

Author: Joe Hamman

The data used for this example can be found in the xarray-data repository.

Suppose we have a netCDF or xarray.Dataset of monthly mean data and we want to calculate the seasonal average. To do this properly, we need to calculate the weighted average considering that each month has a different number of days.
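
Concretely, if month m in a season has length d_m days and monthly mean x_m, the weighted seasonal average is sum_m(d_m * x_m) / sum_m(d_m). The code below builds the normalized weights d_m / sum_m(d_m) by grouping the month lengths with groupby('time.season').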

%matplotlib inline
import numpy as np
import pandas as pd
import xarray as xr
from netCDF4 import num2date
import matplotlib.pyplot as plt

print("numpy version  : ", np.__version__)
print("pandas version : ", pd.__version__)
print("xarray version : ", xr.__version__)
numpy version  :  1.11.1
pandas version :  0.18.1
xarray version :  0.8.2

Some calendar information so we can support any netCDF calendar.

dpm = {'noleap': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       '365_day': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'standard': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'gregorian': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'proleptic_gregorian': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'all_leap': [0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       '366_day': [0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       '360_day': [0, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30]}

A few calendar functions to determine the number of days in each month

If you were just using the standard calendar, it would be easy to use the calendar.monthrange function.
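
For example, on the standard calendar:

import calendar

# monthrange returns (weekday of the 1st, number of days in the month)
calendar.monthrange(2000, 2)[1]  # -> 29, since 2000 is a leap year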

def leap_year(year, calendar='standard'):
    """Determine if year is a leap year"""
    leap = False
    if ((calendar in ['standard', 'gregorian',
        'proleptic_gregorian', 'julian']) and
        (year % 4 == 0)):
        leap = True
        if ((calendar == 'proleptic_gregorian') and
            (year % 100 == 0) and
            (year % 400 != 0)):
            leap = False
        elif ((calendar in ['standard', 'gregorian']) and
                 (year % 100 == 0) and (year % 400 != 0) and
                 (year < 1583)):
            leap = False
    return leap

def get_dpm(time, calendar='standard'):
    """
    Return an array of days per month corresponding to the months in `time`.
    """
    month_length = np.zeros(len(time), dtype=int)

    cal_days = dpm[calendar]

    for i, (month, year) in enumerate(zip(time.month, time.year)):
        month_length[i] = cal_days[month]
        if leap_year(year, calendar=calendar):
            month_length[i] += 1
    return month_length

Open the Dataset

ds = xr.tutorial.load_dataset('rasm')
print(ds)
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
    yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
    xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
Attributes:
    title: /workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc
    institution: U.W.
    source: RACM R1002RBRxaaa01a
    output_frequency: daily
    output_mode: averaged
    convention: CF-1.4
    references: Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.
    comment: Output from the Variable Infiltration Capacity (VIC) model.
    nco_openmp_thread_number: 1
    NCO: 4.3.7
    history: history deleted for brevity

Now for the heavy lifting:

We first have to come up with the weights:

  • calculate the month lengths for each monthly data record
  • calculate weights using groupby('time.season')

Finally, we just need to multiply our weights by the Dataset and sum along the time dimension.

# Make a DataArray with the number of days in each month, size = len(time)
month_length = xr.DataArray(get_dpm(ds.time.to_index(), calendar='noleap'),
                            coords=[ds.time], name='month_length')

# Calculate the weights by grouping by 'time.season'.
# Conversion to float type ('astype(float)') only necessary for Python 2.x
weights = month_length.groupby('time.season') / month_length.astype(float).groupby('time.season').sum()

# Test that the sum of the weights for each season is 1.0
np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))

# Calculate the weighted average
ds_weighted = (ds * weights).groupby('time.season').sum(dim='time')
print(ds_weighted)
<xarray.Dataset>
Dimensions:  (season: 4, x: 275, y: 205)
Coordinates:
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * season   (season) object 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
    Tair     (season, y, x) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    xc       (season, y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 ...
    yc       (season, y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 ...

# only used for comparisons
ds_unweighted = ds.groupby('time.season').mean('time')
ds_diff = ds_weighted - ds_unweighted
# Quick plot to show the results
notnull = pd.notnull(ds_unweighted['Tair'][0])

fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(14,12))
for i, season in enumerate(('DJF', 'MAM', 'JJA', 'SON')):
    ds_weighted['Tair'].sel(season=season).where(notnull).plot.pcolormesh(
        ax=axes[i, 0], vmin=-30, vmax=30, cmap='Spectral_r',
        add_colorbar=True, extend='both')

    ds_unweighted['Tair'].sel(season=season).where(notnull).plot.pcolormesh(
        ax=axes[i, 1], vmin=-30, vmax=30, cmap='Spectral_r',
        add_colorbar=True, extend='both')

    ds_diff['Tair'].sel(season=season).where(notnull).plot.pcolormesh(
        ax=axes[i, 2], vmin=-0.1, vmax=.1, cmap='RdBu_r',
        add_colorbar=True, extend='both')

    axes[i, 0].set_ylabel(season)
    axes[i, 1].set_ylabel('')
    axes[i, 2].set_ylabel('')

for ax in axes.flat:
    ax.axes.get_xaxis().set_ticklabels([])
    ax.axes.get_yaxis().set_ticklabels([])
    ax.axes.axis('tight')
    ax.set_xlabel('')

axes[0, 0].set_title('Weighted by DPM')
axes[0, 1].set_title('Equal Weighting')
axes[0, 2].set_title('Difference')

plt.tight_layout()

fig.suptitle('Seasonal Surface Air Temperature', fontsize=16, y=1.02)
<matplotlib.text.Text at 0x117c18048>

# Wrap it into a simple function
def season_mean(ds, calendar='standard'):
    # Make a DataArray of season/year groups
    year_season = xr.DataArray(ds.time.to_index().to_period(freq='Q-NOV').to_timestamp(how='E'),
                               coords=[ds.time], name='year_season')

    # Make a DataArray with the number of days in each month, size = len(time)
    month_length = xr.DataArray(get_dpm(ds.time.to_index(), calendar=calendar),
                                coords=[ds.time], name='month_length')
    # Calculate the weights by grouping by 'time.season'
    weights = month_length.groupby('time.season') / month_length.groupby('time.season').sum()

    # Test that the sum of the weights for each season is 1.0
    np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))

    # Calculate the weighted average
    return (ds * weights).groupby('time.season').sum(dim='time')

Working with Multidimensional Coordinates

Author: Ryan Abernathey

Many datasets have physical coordinates which differ from their logical coordinates. Xarray provides several ways to plot and analyze such datasets.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import xarray as xr

In [4]: import netCDF4

In [5]: import cartopy.crs as ccrs

In [6]: import matplotlib.pyplot as plt

As an example, consider this dataset from the xarray-data repository.

In [7]: ds = xr.tutorial.load_dataset('rasm')

In [8]: ds
Out[8]: 
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
    xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
    yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:                       "4.6.0"
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...

In this example, the logical coordinates are x and y, while the physical coordinates are xc and yc, which represent the longitude and latitude of the data.

In [9]: ds.xc.attrs
Out[9]: 
OrderedDict([('long_name', 'longitude of grid cell center'),
             ('units', 'degrees_east'),
             ('bounds', 'xv')])

In [10]: ds.yc.attrs
Out[10]: 
OrderedDict([('long_name', 'latitude of grid cell center'),
             ('units', 'degrees_north'),
             ('bounds', 'yv')])

Plotting

Let’s examine these coordinate variables by plotting them.

In [11]: fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(9,3))

In [12]: ds.xc.plot(ax=ax1);

In [13]: ds.yc.plot(ax=ax2);

Note that the variables xc (longitude) and yc (latitude) are two-dimensional scalar fields.

If we try to plot the data variable Tair, by default we get the logical coordinates.

In [14]: ds.Tair[0].plot();

In order to visualize the data on a conventional latitude-longitude grid, we can take advantage of xarray’s ability to apply cartopy map projections.

In [15]: plt.figure(figsize=(7,2));

In [16]: ax = plt.axes(projection=ccrs.PlateCarree());

In [17]: ds.Tair[0].plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(),
   ....:                            x='xc', y='yc', add_colorbar=False);
   ....: 

In [18]: ax.coastlines();

In [19]: plt.tight_layout();

Multidimensional Groupby

The above example allowed us to visualize the data on a regular latitude-longitude grid. But what if we want to do a calculation that involves grouping over one of these physical coordinates (rather than the logical coordinates), for example, calculating the mean temperature at each latitude? This can be achieved using xarray’s groupby function, which accepts multidimensional variables. By default, groupby will use every unique value in the variable, which is probably not what we want. Instead, we can use the groupby_bins function to specify the output coordinates of the group.

# define two-degree wide latitude bins
In [20]: lat_bins = np.arange(0, 91, 2)

# define a label for each bin corresponding to the central latitude
In [21]: lat_center = np.arange(1, 90, 2)

# group according to those bins and take the mean
In [22]: Tair_lat_mean = ds.Tair.groupby_bins('yc', lat_bins, labels=lat_center).mean()

# plot the result
In [23]: Tair_lat_mean.plot();

Note that the resulting coordinate for the groupby_bins operation got the _bins suffix appended: yc_bins. This helps us distinguish it from the original multidimensional variable yc.

Recipes

Multiple lines from a 2d DataArray

Use xarray.plot.line() on a 2d DataArray to plot selections as multiple lines.

See Multiple lines showing variation along a dimension for more details.

import matplotlib.pyplot as plt

import xarray as xr

# Load the data
ds = xr.tutorial.load_dataset('air_temperature')
air = ds.air - 273.15  # to celsius

# Prepare the figure
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)

# Selected latitude indices
isel_lats = [10, 15, 20]

# Temperature vs longitude plot - illustrates the "hue" kwarg
air.isel(time=0, lat=isel_lats).plot.line(ax=ax1, hue='lat')
ax1.set_ylabel('°C')

# Temperature vs time plot - illustrates the "x" and "add_legend" kwargs
air.isel(lon=30, lat=isel_lats).plot.line(ax=ax2, x='time', add_legend=False)
ax2.set_ylabel('')

# Show
plt.tight_layout()
plt.show()


Control the plot’s colorbar

Use the cbar_kwargs keyword to specify the number of ticks. The spacing kwarg can be used to draw proportional ticks.

import matplotlib.pyplot as plt

import xarray as xr

# Load the data
air_temp = xr.tutorial.load_dataset('air_temperature')
air2d = air_temp.air.isel(time=500)

# Prepare the figure
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(14, 4))

# Irregular levels to illustrate the use of a proportional colorbar
levels = [245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 310, 340]

# Plot data
air2d.plot(ax=ax1, levels=levels)
air2d.plot(ax=ax2, levels=levels, cbar_kwargs={'ticks': levels})
air2d.plot(ax=ax3, levels=levels, cbar_kwargs={'ticks': levels,
                                               'spacing': 'proportional'})

# Show plots
plt.tight_layout()
plt.show()


imshow() and map projections

Using rasterio’s projection information for more accurate plots.

This example extends Parsing rasterio’s geocoordinates and plots the image in the original map projection instead of relying on pcolormesh and a map transformation.

import os
import urllib.request

import cartopy.crs as ccrs
import matplotlib.pyplot as plt

import xarray as xr

# Download the file from rasterio's repository
url = 'https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif'
urllib.request.urlretrieve(url, 'RGB.byte.tif')

# Read the data
da = xr.open_rasterio('RGB.byte.tif')

# The data is in UTM projection. We have to set it manually until
# https://github.com/SciTools/cartopy/issues/813 is implemented
crs = ccrs.UTM('18N')

# Plot on a map
ax = plt.subplot(projection=crs)
da.plot.imshow(ax=ax, rgb='band', transform=crs)
ax.coastlines('10m', color='r')
plt.show()

# Delete the file
os.remove('RGB.byte.tif')


Centered colormaps

xarray’s automatic colormap choice

import matplotlib.pyplot as plt

import xarray as xr

# Load the data
ds = xr.tutorial.load_dataset('air_temperature')
air = ds.air.isel(time=0)

f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(8, 6))

# The first plot (in kelvins) chooses "viridis" and uses the data's min/max
air.plot(ax=ax1, cbar_kwargs={'label': 'K'})
ax1.set_title('Kelvins: default')
ax1.set_xlabel('')

# The second plot (in celsius) now chooses "RdBu_r" and centers min/max around 0
airc = air - 273.15
airc.plot(ax=ax2, cbar_kwargs={'label': '°C'})
ax2.set_title('Celsius: default')
ax2.set_xlabel('')
ax2.set_ylabel('')

# The center doesn't have to be 0
air.plot(ax=ax3, center=273.15, cbar_kwargs={'label': 'K'})
ax3.set_title('Kelvins: center=273.15')

# Or it can be ignored
airc.plot(ax=ax4, center=False, cbar_kwargs={'label': '°C'})
ax4.set_title('Celsius: center=False')
ax4.set_ylabel('')

# Make it nice
plt.tight_layout()
plt.show()


Multiple plots and map projections

Control the map projection parameters on multiple axes

This example illustrates how to plot multiple maps and control their extent and aspect ratio.

For more details see this discussion on github.

from __future__ import division

import cartopy.crs as ccrs
import matplotlib.pyplot as plt

import xarray as xr

# Load the data
ds = xr.tutorial.load_dataset('air_temperature')
air = ds.air.isel(time=[0, 724]) - 273.15

# This is the map projection we want to plot *onto*
map_proj = ccrs.LambertConformal(central_longitude=-95, central_latitude=45)

p = air.plot(transform=ccrs.PlateCarree(),  # the data's projection
             col='time', col_wrap=1,  # multiplot settings
             aspect=ds.dims['lon'] / ds.dims['lat'],  # for a sensible figsize
             subplot_kws={'projection': map_proj})  # the plot's projection

# We have to set the map's options on all axes
for ax in p.axes.flat:
    ax.coastlines()
    ax.set_extent([-160, -30, 5, 75])
    # Without this aspect setting, the maps will look chaotic and the
    # "extent" call above will be ignored
    ax.set_aspect('equal', 'box-forced')

plt.show()


Parsing rasterio’s geocoordinates

Converting a projection’s cartesian coordinates into 2D longitudes and latitudes.

These new coordinates might be handy for plotting and indexing, but it should be kept in mind that a grid which is regular in projection coordinates will likely be irregular in lon/lat. It is often recommended to work in the data’s original map projection (see imshow() and map projections).

import os
import urllib.request

import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
from rasterio.warp import transform

import xarray as xr

# Download the file from rasterio's repository
url = 'https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif'
urllib.request.urlretrieve(url, 'RGB.byte.tif')

# Read the data
da = xr.open_rasterio('RGB.byte.tif')

# Compute the lon/lat coordinates with rasterio.warp.transform
ny, nx = len(da['y']), len(da['x'])
x, y = np.meshgrid(da['x'], da['y'])

# Rasterio works with 1D arrays
lon, lat = transform(da.crs, {'init': 'EPSG:4326'},
                     x.flatten(), y.flatten())
lon = np.asarray(lon).reshape((ny, nx))
lat = np.asarray(lat).reshape((ny, nx))
da.coords['lon'] = (('y', 'x'), lon)
da.coords['lat'] = (('y', 'x'), lat)

# Compute a greyscale out of the rgb image
greyscale = da.mean(dim='band')

# Plot on a map
ax = plt.subplot(projection=ccrs.PlateCarree())
greyscale.plot(ax=ax, x='lon', y='lat', transform=ccrs.PlateCarree(),
               cmap='Greys_r', add_colorbar=False)
ax.coastlines('10m', color='r')
plt.show()

# Delete the file
os.remove('RGB.byte.tif')


Installation

Required dependencies

  • Python 2.7 [1], 3.4, 3.5, or 3.6
  • numpy (1.11 or later)
  • pandas (0.18.0 or later)

Optional dependencies

For netCDF and IO
  • netCDF4: recommended if you want to use xarray for reading or writing netCDF files
  • scipy: used as a fallback for reading/writing netCDF3
  • pydap: used as a fallback for accessing OPeNDAP
  • h5netcdf: an alternative library for reading and writing netCDF4 files that does not use the netCDF-C libraries
  • pynio: for reading GRIB and other geoscience specific file formats
  • zarr: for chunked, compressed, N-dimensional arrays.
  • netcdftime: recommended if you want to encode/decode datetimes for non-standard calendars or dates before year 1678 or after year 2262.

For accelerating xarray
  • bottleneck: speeds up NaN-skipping and rolling window aggregations by a large factor (1.1 or later)
  • cyordereddict: speeds up most internal operations with xarray data structures (for python versions < 3.5)

For parallel computing
  • dask: required for out-of-core computation and parallel computing with dask

For plotting
  • matplotlib: required for plotting
  • cartopy: recommended for plotting maps
  • seaborn: for better color palettes

Instructions

xarray itself is a pure Python package, but its dependencies are not. The easiest way to get everything installed is to use conda. To install xarray with its recommended dependencies using the conda command line tool:

$ conda install xarray dask netCDF4 bottleneck

We recommend using the community maintained conda-forge channel if you need difficult-to-build dependencies such as cartopy or pynio:

$ conda install -c conda-forge xarray cartopy pynio

New releases may also appear in conda-forge before being updated in the default channel.

If you don’t use conda, be sure you have the required dependencies (numpy and pandas) installed first. Then, install xarray with pip:

$ pip install xarray

Testing

To run the test suite after installing xarray, first install (via PyPI or conda)

  • py.test: Simple unit testing library
  • mock: additional testing library required for python version 2

and run py.test --pyargs xarray.

Performance Monitoring

Fixed-point performance monitoring of (a part of) our code can be seen on this page.

To run these benchmark tests on a local machine, first install:

  • airspeed-velocity: a tool for benchmarking Python packages over their lifetime.

and run asv run (this will install some conda environments in ./.asv/envs).

[1]

Xarray plans to drop support for python 2.7 at the end of 2018. This means that new releases of xarray published after this date will only be installable on python 3+ environments, but older versions of xarray will always be available to python 2.7 users.

User Guide

Data Structures

DataArray

xarray.DataArray is xarray’s implementation of a labeled, multi-dimensional array. It has several key properties:

  • values: a numpy.ndarray holding the array’s values
  • dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
  • coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
  • attrs: an OrderedDict to hold arbitrary metadata (attributes)

xarray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xarray uses instead of the axis argument found in many numpy functions. Coordinates enable fast label-based indexing and alignment, building on the functionality of the index found on a pandas DataFrame or Series.

DataArray objects also can have a name and can hold arbitrary metadata in the form of their attrs property (an ordered dictionary). Names and attributes are strictly for users and user-written code: xarray makes no attempt to interpret them, and propagates them only in unambiguous cases (see FAQ, What is your approach to metadata?).
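
A small sketch of this behavior with the default options:

import xarray as xr

da = xr.DataArray([1, 2, 3], dims='x', name='da', attrs={'units': 'm'})

(da + 1).attrs         # {} -- arithmetic does not propagate attributes
da.rename('db').attrs  # {'units': 'm'} -- an unambiguous operation keeps them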

Creating a DataArray

The DataArray constructor takes:

  • data: a multi-dimensional array of values (e.g., a numpy ndarray, Series, DataFrame or Panel)
  • coords: a list or dictionary of coordinates. If a list, it should be a list of tuples where the first element is the dimension name and the second element is the corresponding coordinate array_like object.
  • dims: a list of dimension names. If omitted and coords is a list of tuples, dimension names are taken from coords.
  • attrs: a dictionary of attributes to add to the instance
  • name: a string that names the instance
In [1]: data = np.random.rand(4, 3)

In [2]: locs = ['IA', 'IL', 'IN']

In [3]: times = pd.date_range('2000-01-01', periods=4)

In [4]: foo = xr.DataArray(data, coords=[times, locs], dims=['time', 'space'])

In [5]: foo
Out[5]: 
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.12697 ,  0.966718,  0.260476],
       [ 0.897237,  0.37675 ,  0.336222],
       [ 0.451376,  0.840255,  0.123102],
       [ 0.543026,  0.373012,  0.447997]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

Only data is required; all of the other arguments will be filled in with default values:

In [6]: xr.DataArray(data)
Out[6]: 
<xarray.DataArray (dim_0: 4, dim_1: 3)>
array([[ 0.12697 ,  0.966718,  0.260476],
       [ 0.897237,  0.37675 ,  0.336222],
       [ 0.451376,  0.840255,  0.123102],
       [ 0.543026,  0.373012,  0.447997]])
Dimensions without coordinates: dim_0, dim_1

As you can see, dimension names are always present in the xarray data model: if you do not provide them, defaults of the form dim_N will be created. However, coordinates are always optional, and dimensions do not have automatic coordinate labels.

Note

This is different from pandas, where axes always have tick labels, which default to the integers [0, ..., n-1].

Prior to xarray v0.9, xarray copied this behavior: default coordinates for each dimension would be created if coordinates were not supplied explicitly. This is no longer the case.

Coordinates can be specified in the following ways:

  • A list of values with length equal to the number of dimensions, providing coordinate labels for each dimension. Each value must be of one of the following forms:
    • A DataArray or Variable
    • A tuple of the form (dims, data[, attrs]), which is converted into arguments for Variable
    • A pandas object or scalar value, which is converted into a DataArray
    • A 1D array or list, which is interpreted as values for a one-dimensional coordinate variable along the same dimension as its name
  • A dictionary of {coord_name: coord} where values are of the same form as the list. Supplying coordinates as a dictionary allows coordinates other than those corresponding to dimensions (more on these later). If you supply coords as a dictionary, you must explicitly provide dims.

As a list of tuples:

In [7]: xr.DataArray(data, coords=[('time', times), ('space', locs)])
Out[7]: 
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.12697 ,  0.966718,  0.260476],
       [ 0.897237,  0.37675 ,  0.336222],
       [ 0.451376,  0.840255,  0.123102],
       [ 0.543026,  0.373012,  0.447997]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

As a dictionary:

In [8]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
   ...:                            'ranking': ('space', [1, 2, 3])},
   ...:              dims=['time', 'space'])
   ...: 
Out[8]: 
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.12697 ,  0.966718,  0.260476],
       [ 0.897237,  0.37675 ,  0.336222],
       [ 0.451376,  0.840255,  0.123102],
       [ 0.543026,  0.373012,  0.447997]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
    const    int64 42
    ranking  (space) int64 1 2 3

As a dictionary with coords across multiple dimensions:

In [9]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
   ...:                            'ranking': (('time', 'space'), np.arange(12).reshape(4,3))},
   ...:              dims=['time', 'space'])
   ...: 
Out[9]: 
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.12697 ,  0.966718,  0.260476],
       [ 0.897237,  0.37675 ,  0.336222],
       [ 0.451376,  0.840255,  0.123102],
       [ 0.543026,  0.373012,  0.447997]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
    const    int64 42
    ranking  (time, space) int64 0 1 2 3 4 5 6 7 8 9 10 11

If you create a DataArray by supplying a pandas Series, DataFrame or Panel, any non-specified arguments in the DataArray constructor will be filled in from the pandas object:

In [10]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]}, index=['a', 'b'])

In [11]: df.index.name = 'abc'

In [12]: df.columns.name = 'xyz'

In [13]: df
Out[13]: 
xyz  x  y
abc      
a    0  2
b    1  3

In [14]: xr.DataArray(df)
Out[14]: 
<xarray.DataArray (abc: 2, xyz: 2)>
array([[0, 2],
       [1, 3]])
Coordinates:
  * abc      (abc) object 'a' 'b'
  * xyz      (xyz) object 'x' 'y'

DataArray properties

Let’s take a look at the important properties on our array:

In [15]: foo.values
Out[15]: 
array([[ 0.127,  0.967,  0.26 ],
       [ 0.897,  0.377,  0.336],
       [ 0.451,  0.84 ,  0.123],
       [ 0.543,  0.373,  0.448]])

In [16]: foo.dims
Out[16]: ('time', 'space')

In [17]: foo.coords
Out[17]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

In [18]: foo.attrs
Out[18]: OrderedDict()

In [19]: print(foo.name)
None

You can modify values in place:

In [20]: foo.values = 1.0 * foo.values

Note

The array values in a DataArray have a single (homogeneous) data type. To work with heterogeneous or structured data types in xarray, use coordinates, or put separate DataArray objects in a single Dataset (see below).

Now fill in some of that missing metadata:

In [21]: foo.name = 'foo'

In [22]: foo.attrs['units'] = 'meters'

In [23]: foo
Out[23]: 
<xarray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.12697 ,  0.966718,  0.260476],
       [ 0.897237,  0.37675 ,  0.336222],
       [ 0.451376,  0.840255,  0.123102],
       [ 0.543026,  0.373012,  0.447997]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Attributes:
    units:    meters

The rename() method is another option, returning a new data array:

In [24]: foo.rename('bar')
Out[24]: 
<xarray.DataArray 'bar' (time: 4, space: 3)>
array([[ 0.12697 ,  0.966718,  0.260476],
       [ 0.897237,  0.37675 ,  0.336222],
       [ 0.451376,  0.840255,  0.123102],
       [ 0.543026,  0.373012,  0.447997]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Attributes:
    units:    meters

DataArray Coordinates

The coords property is dict-like. Individual coordinates can be accessed from the coordinates by name, or even by indexing the data array itself:

In [25]: foo.coords['time']
Out[25]: 
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

In [26]: foo['time']
Out[26]: 
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

These are also DataArray objects, which contain tick-labels for each dimension.

Coordinates can also be set or removed by using dictionary-like syntax:

In [27]: foo['ranking'] = ('space', [1, 2, 3])

In [28]: foo.coords
Out[28]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
    ranking  (space) int64 1 2 3

In [29]: del foo['ranking']

In [30]: foo.coords
Out[30]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

For more details, see Coordinates below.

Dataset

xarray.Dataset is xarray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.

In addition to the dict-like interface of the dataset itself, which can be used to access any variable in a dataset, datasets have four key properties:

  • dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
  • data_vars: a dict-like container of DataArrays corresponding to variables
  • coords: another dict-like container of DataArrays intended to label points used in data_vars (e.g., arrays of numbers, datetime objects or strings)
  • attrs: an OrderedDict to hold arbitrary metadata

The distinction between whether a variable falls in data or coordinates (borrowed from CF conventions) is mostly semantic, and you can probably get away with ignoring it if you like: dictionary-like access on a dataset will supply variables found in either category. However, xarray does make use of the distinction for indexing and computations. Coordinates indicate constant/fixed/independent quantities, unlike the varying/measured/dependent quantities that belong in data.

Here is an example of how we might structure a dataset for a weather forecast:

[Figure: dataset-diagram — temperature and precipitation data variables with latitude, longitude and time coordinates]

In this example, it would be natural to call temperature and precipitation “data variables” and all the other arrays “coordinate variables”, because they label the points along the dimensions (see [1] for more background on this example).

Creating a Dataset

To make a Dataset from scratch, supply dictionaries for any variables (data_vars), coordinates (coords) and attributes (attrs).

  • data_vars should be a dictionary with each key as the name of the variable and each value as one of:
    • A DataArray or Variable
    • A tuple of the form (dims, data[, attrs]), which is converted into arguments for Variable
    • A pandas object, which is converted into a DataArray
    • A 1D array or list, which is interpreted as values for a one-dimensional coordinate variable along the same dimension as its name
  • coords should be a dictionary of the same form as data_vars.
  • attrs should be a dictionary.

Let’s create some fake data for the example we show above:

In [31]: temp = 15 + 8 * np.random.randn(2, 2, 3)

In [32]: precip = 10 * np.random.rand(2, 2, 3)

In [33]: lon = [[-99.83, -99.32], [-99.79, -99.23]]

In [34]: lat = [[42.25, 42.21], [42.63, 42.59]]

# for real use cases, it's good practice to supply array attributes such as
# units, but we won't bother here for the sake of brevity
In [35]: ds = xr.Dataset({'temperature': (['x', 'y', 'time'],  temp),
   ....:                  'precipitation': (['x', 'y', 'time'], precip)},
   ....:                 coords={'lon': (['x', 'y'], lon),
   ....:                         'lat': (['x', 'y'], lat),
   ....:                         'time': pd.date_range('2014-09-06', periods=3),
   ....:                         'reference_time': pd.Timestamp('2014-09-05')})
   ....: 

In [36]: ds
Out[36]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precipitation   (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...

Here we pass xarray.DataArray objects or a pandas object as values in the dictionary:

In [37]: xr.Dataset({'bar': foo})
Out[37]: 
<xarray.Dataset>
Dimensions:  (space: 3, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Data variables:
    bar      (time, space) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 ...

In [38]: xr.Dataset({'bar': foo.to_pandas()})
Out[38]: 
<xarray.Dataset>
Dimensions:  (space: 3, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) object 'IA' 'IL' 'IN'
Data variables:
    bar      (time, space) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 ...

Where a pandas object is supplied as a value, the names of its indexes are used as dimension names, and its data is aligned to any existing dimensions.

You can also create a dataset from a pandas.DataFrame with Dataset.from_dataframe, or by reading a netCDF file on disk with open_dataset().

Dataset contents

Dataset implements the Python mapping interface, with values given by xarray.DataArray objects:

In [39]: 'temperature' in ds
Out[39]: True

In [40]: ds['temperature']
Out[40]: 
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.040566,  23.57443 ,  20.772441],
        [  9.345831,   6.6834  ,  17.174879]],

       [[ 11.600221,  19.536163,  17.209856],
        [  6.300794,   9.610482,  15.909187]]])
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y

Valid keys include each listed coordinate and data variable.

Data and coordinate variables are also contained separately in the data_vars and coords dictionary-like attributes:

In [41]: ds.data_vars
Out[41]: 
Data variables:
    temperature    (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precipitation  (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 ...

In [42]: ds.coords
Out[42]: 
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05

Finally, like data arrays, datasets also store arbitrary metadata in the form of attributes:

In [43]: ds.attrs
Out[43]: OrderedDict()

In [44]: ds.attrs['title'] = 'example attribute'

In [45]: ds
Out[45]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precipitation   (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
Attributes:
    title:    example attribute

xarray does not enforce any restrictions on attributes, but serialization to some file formats may fail if you use objects that are not strings, numbers or numpy.ndarray objects.
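
For example, simple attribute values survive a round-trip to netCDF, while arbitrary Python objects may not (a sketch; the attribute names are made up):

ds.attrs['institution'] = 'Example Institute'  # a string: serializes cleanly
ds.attrs['version'] = 2                        # a number: also fine
ds.attrs['history'] = {'step': 1}              # a dict: may fail in ds.to_netcdf()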

As a useful shortcut, you can use attribute style access for reading (but not setting) variables and attributes:

In [46]: ds.temperature
Out[46]: 
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.040566,  23.57443 ,  20.772441],
        [  9.345831,   6.6834  ,  17.174879]],

       [[ 11.600221,  19.536163,  17.209856],
        [  6.300794,   9.610482,  15.909187]]])
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y

This is particularly useful in an exploratory context, because you can tab-complete these variable names with tools like IPython.

Warning

We are changing the behavior of iterating over a Dataset in the next major release of xarray: iteration will only include data variables, instead of both data variables and coordinates. In the meantime, prefer iterating over ds.data_vars or ds.coords.

Dictionary like methods

We can update a dataset in-place using Python’s standard dictionary syntax. For example, to create this example dataset from scratch, we could have written:

In [47]: ds = xr.Dataset()

In [48]: ds['temperature'] = (('x', 'y', 'time'), temp)

In [49]: ds['precipitation'] = (('x', 'y', 'time'), precip)

In [50]: ds.coords['lat'] = (('x', 'y'), lat)

In [51]: ds.coords['lon'] = (('x', 'y'), lon)

In [52]: ds.coords['time'] = pd.date_range('2014-09-06', periods=3)

In [53]: ds.coords['reference_time'] = pd.Timestamp('2014-09-05')

To change the variables in a Dataset, you can use all the standard dictionary methods, including values, items, __delitem__, get and update(). Note that assigning a DataArray or pandas object to a Dataset variable using __setitem__ or update will automatically align the array(s) to the original dataset’s indexes.
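
For example, here is a small sketch of that automatic alignment: the assigned array is matched to the dataset by its time labels, not by position (the names tmp and recount are ours):

tmp = ds.copy()
other = xr.DataArray([10., 20., 30.],
                     [('time', ds['time'].values[::-1])])
tmp['recount'] = other  # values are reordered to follow the existing time index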

You can copy a Dataset by calling the copy() method. By default, the copy is shallow, so only the container will be copied: the arrays in the Dataset will still be stored in the same underlying numpy.ndarray objects. You can copy all data by calling ds.copy(deep=True).
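
A quick sketch of the difference (the names are ours):

ds_shallow = ds.copy()        # new container, same underlying arrays
ds_deep = ds.copy(deep=True)  # the data is copied as well
ds_shallow['temperature'].values is ds['temperature'].values  # True: shared buffer
ds_deep['temperature'].values is ds['temperature'].values     # False: independent copy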

Transforming datasets

In addition to dictionary-like methods (described above), xarray has additional methods (like pandas) for transforming datasets into new objects.

For removing variables, you can select and drop an explicit list of variables by indexing with a list of names or by using the drop() method to return a new Dataset. These operations keep around coordinates:

In [54]: list(ds[['temperature']])
Out[54]: ['temperature', 'lat', 'reference_time', 'time', 'lon']

In [55]: list(ds[['x']])
Out[55]: ['x', 'reference_time']

In [56]: list(ds.drop('temperature'))
Out[56]: ['precipitation', 'lat', 'lon', 'time', 'reference_time']

If a dimension name is given as an argument to drop, it also drops all variables that use that dimension:

In [57]: list(ds.drop('time'))
Out[57]: ['temperature', 'precipitation', 'lat', 'lon', 'reference_time']

As an alternative to dictionary-like modifications, you can use assign() and assign_coords(). These methods return a new dataset with additional (or replaced) values:

In [58]: ds.assign(temperature2 = 2 * ds.temperature)
Out[58]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precipitation   (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
    temperature2    (x, y, time) float64 22.08 47.15 41.54 18.69 13.37 34.35 ...

There is also the pipe() method that allows you to use a method call with an external function (e.g., ds.pipe(func)) instead of simply calling it (e.g., func(ds)). This allows you to write pipelines for transforming your data (using “method chaining”) instead of writing hard-to-follow nested function calls:

# these lines are equivalent, but with pipe we can make the logic flow
# entirely from left to right
In [59]: plt.plot((2 * ds.temperature.sel(x=0)).mean('y'))
Out[59]: [<matplotlib.lines.Line2D at 0x7f10f75b4390>]

In [60]: (ds.temperature
   ....:  .sel(x=0)
   ....:  .pipe(lambda x: 2 * x)
   ....:  .mean('y')
   ....:  .pipe(plt.plot))
   ....: 
Out[60]: [<matplotlib.lines.Line2D at 0x7f10f75b4f98>]

Both pipe and assign replicate the pandas methods of the same names (DataFrame.pipe and DataFrame.assign).

With xarray, there is no performance penalty for creating new datasets, even if variables are lazily loaded from a file on disk. Creating new objects instead of mutating existing objects often results in easier to understand code, so we encourage using this approach.

Renaming variables

Another useful option is the rename() method to rename dataset variables:

In [61]: ds.rename({'temperature': 'temp', 'precipitation': 'precip'})
Out[61]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temp            (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precip          (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...

The related swap_dims() method allows you to swap dimension and non-dimension variables:

In [62]: ds.coords['day'] = ('time', [6, 7, 8])

In [63]: ds.swap_dims({'time': 'day'})
Out[63]: 
<xarray.Dataset>
Dimensions:         (day: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    time            (day) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
  * day             (day) int64 6 7 8
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, day) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precipitation   (x, y, day) float64 5.904 2.453 3.404 9.847 9.195 0.3777 ...

Coordinates

Coordinates are ancillary variables stored for DataArray and Dataset objects in the coords attribute:

In [64]: ds.coords
Out[64]: 
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8

Unlike attributes, xarray does interpret and persist coordinates in operations that transform xarray objects. There are two types of coordinates in xarray:

  • dimension coordinates are one dimensional coordinates with a name equal to their sole dimension (marked by * when printing a dataset or data array). They are used for label based indexing and alignment, like the index found on a pandas DataFrame or Series. Indeed, these “dimension” coordinates use a pandas.Index internally to store their values.
  • non-dimension coordinates are variables that contain coordinate data, but are not a dimension coordinate. They can be multidimensional (see Working with Multidimensional Coordinates), and there is no relationship between the name of a non-dimension coordinate and the name(s) of its dimension(s). Non-dimension coordinates can be useful for indexing or plotting; otherwise, xarray does not make any direct use of the values associated with them. They are not used for alignment or automatic indexing, nor are they required to match when doing arithmetic (see Coordinates).
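
With the example dataset above, the distinction is easy to see (a small sketch):

ds.indexes['time']   # 'time' is a dimension coordinate, backed by a pandas.Index
'lat' in ds.indexes  # False: 'lat' is a non-dimension coordinate, so it has no index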

Note

xarray’s terminology differs from the CF terminology, where the “dimension coordinates” are called “coordinate variables”, and the “non-dimension coordinates” are called “auxiliary coordinate variables” (see GH1295 for more details).

Modifying coordinates

To entirely add or remove coordinate arrays, you can use dictionary like syntax, as shown above.

To convert back and forth between data and coordinates, you can use the set_coords() and reset_coords() methods:

In [65]: ds.reset_coords()
Out[65]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precipitation   (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8

In [66]: ds.set_coords(['temperature', 'precipitation'])
Out[66]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    temperature     (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precipitation   (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8
Dimensions without coordinates: x, y
Data variables:
    *empty*

In [67]: ds['temperature'].reset_coords(drop=True)
Out[67]: 
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.040566,  23.57443 ,  20.772441],
        [  9.345831,   6.6834  ,  17.174879]],

       [[ 11.600221,  19.536163,  17.209856],
        [  6.300794,   9.610482,  15.909187]]])
Coordinates:
  * time     (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Dimensions without coordinates: x, y

Notice that these operations skip coordinates with names given by dimensions, as used for indexing. This is mostly because we are not entirely sure how to design the interface around the fact that xarray cannot store a coordinate and a variable with the same name but different values in the same dictionary. But we do recognize that supporting something like this would be useful.

Coordinates methods

Coordinates objects also have a few useful methods, mostly for converting them into dataset objects:

In [68]: ds.coords.to_dataset()
Out[68]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    reference_time  datetime64[ns] 2014-09-05
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    day             (time) int64 6 7 8
Dimensions without coordinates: x, y
Data variables:
    *empty*

The merge method is particularly interesting, because it implements the same logic used for merging coordinates in arithmetic operations (see Computation):

In [69]: alt = xr.Dataset(coords={'z': [10], 'lat': 0, 'lon': 0})

In [70]: ds.coords.merge(alt.coords)
Out[70]: 
<xarray.Dataset>
Dimensions:         (time: 3, z: 1)
Coordinates:
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8
  * z               (z) int64 10
Data variables:
    *empty*

The coords.merge method may be useful if you want to implement your own binary operations that act on xarray objects. In the future, we hope to write more helper functions so that you can easily make your functions act like xarray’s built-in arithmetic.
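
As a rough illustration, a helper that reuses this merge logic might look like the following sketch (the function name is ours, not part of xarray's API):

def merged_coord_names(a, b):
    """Coordinate names of two xarray objects after merging their coords."""
    merged = a.coords.merge(b.coords)  # a Dataset holding the merged coordinates
    return sorted(merged.coords)

merged_coord_names(ds, alt)  # ['day', 'reference_time', 'time', 'z']; conflicting lat/lon are dropped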

Indexes

To convert a coordinate (or any DataArray) into an actual pandas.Index, use the to_index() method:

In [71]: ds['time'].to_index()
Out[71]: DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name='time', freq='D')

A useful shortcut is the indexes property (on both DataArray and Dataset), which lazily constructs a dictionary whose keys are given by each dimension and whose values are Index objects:

In [72]: ds.indexes
Out[72]: time: DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name='time', freq='D')

MultiIndex coordinates

Xarray supports labeling coordinate values with a pandas.MultiIndex:

In [73]: midx = pd.MultiIndex.from_arrays([['R', 'R', 'V', 'V'], [.1, .2, .7, .9]],
   ....:                                  names=('band', 'wn'))
   ....: 

In [74]: mda = xr.DataArray(np.random.rand(4), coords={'spec': midx}, dims='spec')

In [75]: mda
Out[75]: 
<xarray.DataArray (spec: 4)>
array([ 0.641666,  0.274592,  0.462354,  0.871372])
Coordinates:
  * spec     (spec) MultiIndex
  - band     (spec) object 'R' 'R' 'V' 'V'
  - wn       (spec) float64 0.1 0.2 0.7 0.9

For convenience, multi-index levels are directly accessible as “virtual” or “derived” coordinates (marked by - when printing a dataset or data array):

In [76]: mda['band']
Out[76]: 
<xarray.DataArray 'band' (spec: 4)>
array(['R', 'R', 'V', 'V'], dtype=object)
Coordinates:
  * spec     (spec) MultiIndex
  - band     (spec) object 'R' 'R' 'V' 'V'
  - wn       (spec) float64 0.1 0.2 0.7 0.9

In [77]: mda.wn
Out[77]: 
<xarray.DataArray 'wn' (spec: 4)>
array([ 0.1,  0.2,  0.7,  0.9])
Coordinates:
  * spec     (spec) MultiIndex
  - band     (spec) object 'R' 'R' 'V' 'V'
  - wn       (spec) float64 0.1 0.2 0.7 0.9

Indexing with multi-index levels is also possible using the sel method (see Multi-level indexing).

Unlike other coordinates, “virtual” level coordinates are not stored in the coords attribute of DataArray and Dataset objects (although they are shown when printing the coords attribute). Consequently, most coordinate-related methods do not apply to them, and they cannot be used to replace one particular level.

Because in a DataArray or Dataset object each multi-index level is accessible as a “virtual” coordinate, its name must not conflict with the names of the other levels, coordinates and data variables of the same object. Even though xarray sets default names for multi-indexes with unnamed levels, it is recommended that you explicitly set the names of the levels.

[1] Latitude and longitude are 2D arrays because the dataset uses projected coordinates. reference_time refers to the reference time at which the forecast was made, rather than time, which is the valid time for which the forecast applies.

Indexing and selecting data

xarray offers extremely flexible indexing routines that combine the best features of NumPy and pandas for data selection.

The most basic way to access elements of a DataArray object is to use Python’s [] syntax, such as array[i, j], where i and j are both integers. As xarray objects can store coordinates corresponding to each dimension of an array, label-based indexing similar to pandas.DataFrame.loc is also possible. In label-based indexing, the element position i is automatically looked-up from the coordinate values.

Dimensions of xarray objects have names, so you can also lookup the dimensions by name, instead of remembering their positional order.

Thus in total, xarray supports four different kinds of indexing, as described below and summarized in this table:

Dimension lookup   Index lookup   DataArray syntax              Dataset syntax
Positional         By integer     arr[:, 0]                     not available
Positional         By label       arr.loc[:, 'IA']              not available
By name            By integer     arr.isel(space=0) or          ds.isel(space=0) or
                                  arr[dict(space=0)]            ds[dict(space=0)]
By name            By label       arr.sel(space='IA') or        ds.sel(space='IA') or
                                  arr.loc[dict(space='IA')]     ds.loc[dict(space='IA')]

More advanced indexing is also possible for all the methods by supplying DataArray objects as indexers. See Vectorized Indexing for the details.

Positional indexing

Indexing a DataArray directly works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray:

In [1]: arr = xr.DataArray(np.random.rand(4, 3),
   ...:                    [('time', pd.date_range('2000-01-01', periods=4)),
   ...:                     ('space', ['IA', 'IL', 'IN'])])
   ...: 

In [2]: arr[:2]
Out[2]: 
<xarray.DataArray (time: 2, space: 3)>
array([[ 0.12697 ,  0.966718,  0.260476],
       [ 0.897237,  0.37675 ,  0.336222]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) <U2 'IA' 'IL' 'IN'

In [3]: arr[0, 0]
Out[3]: 
<xarray.DataArray ()>
array(0.12696983303810094)
Coordinates:
    time     datetime64[ns] 2000-01-01
    space    <U2 'IA'

In [4]: arr[:, [2, 1]]
Out[4]: 
<xarray.DataArray (time: 4, space: 2)>
array([[ 0.260476,  0.966718],
       [ 0.336222,  0.37675 ],
       [ 0.123102,  0.840255],
       [ 0.447997,  0.373012]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IN' 'IL'

Attributes are persisted in all indexing operations.

Warning

Positional indexing deviates from NumPy when indexing with multiple arrays like arr[[0, 1], [0, 1]], as described in Vectorized Indexing.

xarray also supports label-based indexing, just like pandas. Because we use a pandas.Index under the hood, label based indexing is very fast. To do label based indexing, use the loc attribute:

In [5]: arr.loc['2000-01-01':'2000-01-02', 'IA']
Out[5]: 
<xarray.DataArray (time: 2)>
array([ 0.12697 ,  0.897237])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
    space    <U2 'IA'

In this example, the selection is the subpart of the array in the range ‘2000-01-01’:‘2000-01-02’ along the first coordinate time, with the ‘IA’ value from the second coordinate space.

You can perform any of the label indexing operations supported by pandas, including indexing with individual labels, slices and arrays of labels, as well as indexing with boolean arrays. Like pandas, label based indexing in xarray is inclusive of both the start and stop bounds.

Setting values with label based indexing is also supported:

In [6]: arr.loc['2000-01-01', ['IL', 'IN']] = -10

In [7]: arr
Out[7]: 
<xarray.DataArray (time: 4, space: 3)>
array([[  0.12697 , -10.      , -10.      ],
       [  0.897237,   0.37675 ,   0.336222],
       [  0.451376,   0.840255,   0.123102],
       [  0.543026,   0.373012,   0.447997]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

Indexing with dimension names

With the dimension names, we do not have to rely on dimension order and can use them explicitly to slice data. There are two ways to do this:

  1. Use a dictionary as the argument for positional or label based array indexing:

    # index by integer array indices
    In [8]: arr[dict(space=0, time=slice(None, 2))]
    Out[8]: 
    <xarray.DataArray (time: 2)>
    array([ 0.12697 ,  0.897237])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
        space    <U2 'IA'
    
    # index by dimension coordinate labels
    In [9]: arr.loc[dict(time=slice('2000-01-01', '2000-01-02'))]
    Out[9]: 
    <xarray.DataArray (time: 2, space: 3)>
    array([[  0.12697 , -10.      , -10.      ],
           [  0.897237,   0.37675 ,   0.336222]])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
      * space    (space) <U2 'IA' 'IL' 'IN'
    
  2. Use the sel() and isel() convenience methods:

    # index by integer array indices
    In [10]: arr.isel(space=0, time=slice(None, 2))
    Out[10]: 
    <xarray.DataArray (time: 2)>
    array([ 0.12697 ,  0.897237])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
        space    <U2 'IA'
    
    # index by dimension coordinate labels
    In [11]: arr.sel(time=slice('2000-01-01', '2000-01-02'))
    Out[11]: 
    <xarray.DataArray (time: 2, space: 3)>
    array([[  0.12697 , -10.      , -10.      ],
           [  0.897237,   0.37675 ,   0.336222]])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
      * space    (space) <U2 'IA' 'IL' 'IN'
    

The arguments to these methods can be any objects that could index the array along the dimension given by the keyword, e.g., labels for an individual value, Python slice() objects or 1-dimensional arrays.
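
For instance, each of these is a valid indexer for the array above (a sketch):

arr.sel(time='2000-01-02')                       # an individual label
arr.sel(time=slice('2000-01-01', '2000-01-02'))  # a slice of labels
arr.sel(space=['IA', 'IN'])                      # a 1-dimensional array of labels
arr.isel(time=0, space=slice(None, 2))           # integers and slices, by position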

Note

We would love to be able to do indexing with labeled dimension names inside brackets, but unfortunately, Python does not yet support indexing with keyword arguments like arr[space=0].

Nearest neighbor lookups

The label based selection methods sel(), reindex() and reindex_like() all support method and tolerance keyword arguments. The method parameter enables nearest neighbor (inexact) lookups by use of the methods 'pad', 'backfill' or 'nearest':

In [12]: data = xr.DataArray([1, 2, 3], [('x', [0, 1, 2])])

In [13]: data.sel(x=[1.1, 1.9], method='nearest')
Out[13]: 
<xarray.DataArray (x: 2)>
array([2, 3])
Coordinates:
  * x        (x) int64 1 2

In [14]: data.sel(x=0.1, method='backfill')
Out[14]: 
<xarray.DataArray ()>
array(2)
Coordinates:
    x        int64 1

In [15]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
Out[15]: 
<xarray.DataArray (x: 5)>
array([1, 2, 2, 3, 3])
Coordinates:
  * x        (x) float64 0.5 1.0 1.5 2.0 2.5

Tolerance limits the maximum distance for valid matches with an inexact lookup:

In [16]: data.reindex(x=[1.1, 1.5], method='nearest', tolerance=0.2)
Out[16]: 
<xarray.DataArray (x: 2)>
array([  2.,  nan])
Coordinates:
  * x        (x) float64 1.1 1.5

The method parameter is not yet supported if any of the arguments to .sel() is a slice object:

In [17]: data.sel(x=slice(1, 3), method='nearest')
NotImplementedError

However, you don’t need to use method to do inexact slicing. Slicing already returns all values inside the range (inclusive), as long as the index labels are monotonic increasing:

In [18]: data.sel(x=slice(0.9, 3.1))
Out[18]: 
<xarray.DataArray (x: 2)>
array([2, 3])
Coordinates:
  * x        (x) int64 1 2

Indexing axes with monotonic decreasing labels also works, as long as the slice or .loc arguments are also decreasing:

In [19]: reversed_data = data[::-1]

In [20]: reversed_data.loc[3.1:0.9]
Out[20]: 
<xarray.DataArray (x: 2)>
array([3, 2])
Coordinates:
  * x        (x) int64 2 1

Dataset indexing

We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:

In [21]: ds = arr.to_dataset(name='foo')

In [22]: ds.isel(space=[0], time=[0])
Out[22]: 
<xarray.Dataset>
Dimensions:  (space: 1, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01
  * space    (space) <U2 'IA'
Data variables:
    foo      (time, space) float64 0.127

In [23]: ds.sel(time='2000-01-01')
Out[23]: 
<xarray.Dataset>
Dimensions:  (space: 3)
Coordinates:
    time     datetime64[ns] 2000-01-01
  * space    (space) <U2 'IA' 'IL' 'IN'
Data variables:
    foo      (space) float64 0.127 -10.0 -10.0

Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with dimension names:

In [24]: ds[dict(space=[0], time=[0])]
Out[24]: 
<xarray.Dataset>
Dimensions:  (space: 1, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01
  * space    (space) <U2 'IA'
Data variables:
    foo      (time, space) float64 0.127

In [25]: ds.loc[dict(time='2000-01-01')]
Out[25]: 
<xarray.Dataset>
Dimensions:  (space: 3)
Coordinates:
    time     datetime64[ns] 2000-01-01
  * space    (space) <U2 'IA' 'IL' 'IN'
Data variables:
    foo      (space) float64 0.127 -10.0 -10.0

Using indexing to assign values to a subset of dataset (e.g., ds[dict(space=0)] = 1) is not yet supported.

Dropping labels

The drop() method returns a new object with the listed index labels along a dimension dropped:

In [26]: ds.drop(['IN', 'IL'], dim='space')
Out[26]: 
<xarray.Dataset>
Dimensions:  (space: 1, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA'
Data variables:
    foo      (time, space) float64 0.127 0.8972 0.4514 0.543

drop is both a Dataset and DataArray method.

Masking with where

Indexing methods on xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked. To do this type of selection in xarray, use where():

In [27]: arr2 = xr.DataArray(np.arange(16).reshape(4, 4), dims=['x', 'y'])

In [28]: arr2.where(arr2.x + arr2.y < 4)
Out[28]: 
<xarray.DataArray (x: 4, y: 4)>
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,  nan],
       [  8.,   9.,  nan,  nan],
       [ 12.,  nan,  nan,  nan]])
Dimensions without coordinates: x, y

This is particularly useful for ragged indexing of multi-dimensional data, e.g., to apply a 2D mask to an image. Note that where follows all the usual xarray broadcasting and alignment rules for binary operations (e.g., +) between the object being indexed and the condition, as described in Computation:

In [29]: arr2.where(arr2.y < 2)
Out[29]: 
<xarray.DataArray (x: 4, y: 4)>
array([[  0.,   1.,  nan,  nan],
       [  4.,   5.,  nan,  nan],
       [  8.,   9.,  nan,  nan],
       [ 12.,  13.,  nan,  nan]])
Dimensions without coordinates: x, y

By default where maintains the original size of the data. For cases where the selected data size is much smaller than the original data, use of the option drop=True clips coordinate elements that are fully masked:

In [30]: arr2.where(arr2.y < 2, drop=True)
Out[30]: 
<xarray.DataArray (x: 4, y: 2)>
array([[  0.,   1.],
       [  4.,   5.],
       [  8.,   9.],
       [ 12.,  13.]])
Dimensions without coordinates: x, y

Selecting values with isin

To check whether elements of an xarray object equal a single value, you can compare with the equality operator == (e.g., arr == 3). To check against multiple values, use isin():

# use a fresh name here, so the 2D ``arr`` from the examples above stays intact
In [31]: vals = xr.DataArray([1, 2, 3, 4, 5], dims=['x'])

In [32]: vals.isin([2, 4])
Out[32]: 
<xarray.DataArray (x: 5)>
array([False,  True, False,  True, False], dtype=bool)
Dimensions without coordinates: x

isin() works particularly well with where() to support indexing by arrays that are not already labels of an array:

In [33]: lookup = xr.DataArray([-1, -2, -3, -4, -5], dims=['x'])

In [34]: vals.where(lookup.isin([-2, -4]), drop=True)
Out[34]: 
<xarray.DataArray (x: 2)>
array([ 2.,  4.])
Dimensions without coordinates: x

However, some caution is in order: when done repeatedly, this type of indexing is significantly slower than using sel().

Vectorized Indexing

Like numpy and pandas, xarray supports indexing many array elements at once in a vectorized manner.

If you only provide integers, slices, or unlabeled arrays (arrays without dimension names, such as np.ndarray or list, but not DataArray() or Variable()), indexing can be understood as orthogonal. Each indexer component selects independently along the corresponding dimension, similar to how vector indexing works in Fortran or MATLAB, or after using the numpy.ix_() helper:

In [35]: da = xr.DataArray(np.arange(12).reshape((3, 4)), dims=['x', 'y'],
   ....:                   coords={'x': [0, 1, 2], 'y': ['a', 'b', 'c', 'd']})
   ....: 

In [36]: da
Out[36]: 
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b' 'c' 'd'

In [37]: da[[0, 1], [1, 1]]
Out[37]: 
<xarray.DataArray (x: 2, y: 2)>
array([[1, 1],
       [5, 5]])
Coordinates:
  * x        (x) int64 0 1
  * y        (y) <U1 'b' 'b'
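
For comparison, this sketch reproduces the same orthogonal selection on the underlying NumPy array:

da.values[np.ix_([0, 1], [1, 1])]  # array([[1, 1], [5, 5]])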

For more flexibility, you can supply DataArray() objects as indexers. Dimensions on resultant arrays are given by the ordered union of the indexers’ dimensions:

In [38]: ind_x = xr.DataArray([0, 1], dims=['x'])

In [39]: ind_y = xr.DataArray([0, 1], dims=['y'])

In [40]: da[ind_x, ind_y]  # orthogonal indexing
Out[40]: 
<xarray.DataArray (x: 2, y: 2)>
array([[0, 1],
       [4, 5]])
Coordinates:
  * x        (x) int64 0 1
  * y        (y) <U1 'a' 'b'

In [41]: da[ind_x, ind_x]  # vectorized indexing
Out[41]: 
<xarray.DataArray (x: 2)>
array([0, 5])
Coordinates:
  * x        (x) int64 0 1
    y        (x) <U1 'a' 'b'

Slices or sequences/arrays without named dimensions are treated as if they have the same dimension as the one being indexed along:

# Because [0, 1] is used to index along dimension 'x',
# it is assumed to have dimension 'x'
In [42]: da[[0, 1], ind_x]
Out[42]: 
<xarray.DataArray (x: 2)>
array([0, 5])
Coordinates:
  * x        (x) int64 0 1
    y        (x) <U1 'a' 'b'

Furthermore, you can use a multi-dimensional DataArray() as an indexer, where the resultant array's dimensions are also determined by the indexer's dimensions:

In [43]: ind = xr.DataArray([[0, 1], [0, 1]], dims=['a', 'b'])

In [44]: da[ind]
Out[44]: 
<xarray.DataArray (a: 2, b: 2, y: 4)>
array([[[0, 1, 2, 3],
        [4, 5, 6, 7]],

       [[0, 1, 2, 3],
        [4, 5, 6, 7]]])
Coordinates:
    x        (a, b) int64 0 1 0 1
  * y        (y) <U1 'a' 'b' 'c' 'd'
Dimensions without coordinates: a, b

Similar to how NumPy’s advanced indexing works, vectorized indexing for xarray is based on our broadcasting rules. See Indexing rules for the complete specification.

Vectorized indexing also works with isel, loc, and sel:

In [45]: ind = xr.DataArray([[0, 1], [0, 1]], dims=['a', 'b'])

In [46]: da.isel(y=ind)  # same as da[:, ind]
Out[46]: 
<xarray.DataArray (x: 3, a: 2, b: 2)>
array([[[0, 1],
        [0, 1]],

       [[4, 5],
        [4, 5]],

       [[8, 9],
        [8, 9]]])
Coordinates:
  * x        (x) int64 0 1 2
    y        (a, b) object 'a' 'b' 'a' 'b'
Dimensions without coordinates: a, b

In [47]: ind = xr.DataArray([['a', 'b'], ['b', 'a']], dims=['a', 'b'])

In [48]: da.loc[:, ind]  # same as da.sel(y=ind)
Out[48]: 
<xarray.DataArray (x: 3, a: 2, b: 2)>
array([[[0, 1],
        [1, 0]],

       [[4, 5],
        [5, 4]],

       [[8, 9],
        [9, 8]]])
Coordinates:
  * x        (x) int64 0 1 2
    y        (a, b) object 'a' 'b' 'b' 'a'
Dimensions without coordinates: a, b

These methods can also be applied to Dataset objects:

In [49]: ds2 = da.to_dataset(name='bar')

In [50]: ds2.isel(x=xr.DataArray([0, 1, 2], dims=['points']))
Out[50]: 
<xarray.Dataset>
Dimensions:  (points: 3, y: 4)
Coordinates:
    x        (points) int64 0 1 2
  * y        (y) <U1 'a' 'b' 'c' 'd'
Dimensions without coordinates: points
Data variables:
    bar      (points, y) int64 0 1 2 3 4 5 6 7 8 9 10 11

Tip

If you are lazily loading your data from disk, not every form of vectorized indexing is supported (or, if supported, may not be efficient). You may find increased performance by loading your data into memory first, e.g., with load().

Note

Vectorized indexing is a new feature in v0.10. In older versions of xarray, the dimensions of indexers were ignored. The dedicated methods for some advanced indexing use cases, isel_points and sel_points, are now deprecated. See More advanced indexing for their alternative.

Note

If an indexer is a DataArray(), its coordinates should not conflict with the selected subpart of the target array (except for the explicitly indexed dimensions with .loc/.sel). Otherwise, IndexError will be raised.

Assigning values with indexing

Vectorized indexing can also be used to assign values to an xarray object:

In [51]: da = xr.DataArray(np.arange(12).reshape((3, 4)), dims=['x', 'y'],
   ....:                   coords={'x': [0, 1, 2], 'y': ['a', 'b', 'c', 'd']})
   ....: 

In [52]: da
Out[52]: 
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b' 'c' 'd'

In [53]: da[0] = -1  # assignment with broadcasting

In [54]: da
Out[54]: 
<xarray.DataArray (x: 3, y: 4)>
array([[-1, -1, -1, -1],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b' 'c' 'd'

In [55]: ind_x = xr.DataArray([0, 1], dims=['x'])

In [56]: ind_y = xr.DataArray([0, 1], dims=['y'])

In [57]: da[ind_x, ind_y] = -2  # assign -2 to (ix, iy) = (0, 0) and (1, 1)

In [58]: da
Out[58]: 
<xarray.DataArray (x: 3, y: 4)>
array([[-2, -2, -1, -1],
       [-2, -2,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b' 'c' 'd'

In [59]: da[ind_x, ind_y] += 100  # increment is also possible

In [60]: da
Out[60]: 
<xarray.DataArray (x: 3, y: 4)>
array([[98, 98, -1, -1],
       [98, 98,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b' 'c' 'd'

Like numpy.ndarray, value assignment sometimes works differently from what one may expect.

In [61]: da = xr.DataArray([0, 1, 2, 3], dims=['x'])

In [62]: ind = xr.DataArray([0, 0, 0], dims=['x'])

In [63]: da[ind] -= 1

In [64]: da
Out[64]: 
<xarray.DataArray (x: 4)>
array([-1,  1,  2,  3])
Dimensions without coordinates: x

Here, the 0th element is decremented only once. This is because v[0] = v[0] - 1 is evaluated three times with the original value of v[0], rather than as v[0] = v[0] - 1 - 1 - 1. See Assigning values to indexed arrays for the details.
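
NumPy behaves the same way with duplicate indices in a fancy-indexing assignment, as this small sketch shows:

v = np.array([0, 1, 2, 3])
v[[0, 0, 0]] -= 1  # the buffered gather/scatter decrements v[0] only once
v                  # array([-1,  1,  2,  3])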

Note

Dask array does not support value assignment (see Parallel computing with dask for the details).

Note

Coordinates in both the left- and right-hand-side arrays should not conflict with each other. Otherwise, IndexError will be raised.

Warning

Do not try to assign values when using any of the indexing methods isel or sel:

# DO NOT do this
arr.isel(space=0) = 0

Assigning values with chained indexing using .sel or .isel fails silently.

In [65]: da = xr.DataArray([0, 1, 2, 3], dims=['x'])

# DO NOT do this
In [66]: da.isel(x=[0, 1, 2])[1] = -1

In [67]: da
Out[67]: 
<xarray.DataArray (x: 4)>
array([0, 1, 2, 3])
Dimensions without coordinates: x

More advanced indexing

The use of DataArray() objects as indexers enables very flexible indexing. The following is an example of pointwise indexing:

In [68]: da = xr.DataArray(np.arange(56).reshape((7, 8)), dims=['x', 'y'])

In [69]: da
Out[69]: 
<xarray.DataArray (x: 7, y: 8)>
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47],
       [48, 49, 50, 51, 52, 53, 54, 55]])
Dimensions without coordinates: x, y

In [70]: da.isel(x=xr.DataArray([0, 1, 6], dims='z'),
   ....:         y=xr.DataArray([0, 1, 0], dims='z'))
   ....: 
Out[70]: 
<xarray.DataArray (z: 3)>
array([ 0,  9, 48])
Dimensions without coordinates: z

where three elements at (ix, iy) = ((0, 0), (1, 1), (6, 0)) are selected and mapped along a new dimension z.

If you want to add a coordinate to the new dimension z, you can supply a DataArray() with a coordinate,

In [71]: da.isel(x=xr.DataArray([0, 1, 6], dims='z',
   ....:                        coords={'z': ['a', 'b', 'c']}),
   ....:         y=xr.DataArray([0, 1, 0], dims='z'))
   ....: 
Out[71]: 
<xarray.DataArray (z: 3)>
array([ 0,  9, 48])
Coordinates:
  * z        (z) <U1 'a' 'b' 'c'

Analogously, label-based pointwise indexing is also possible with the .sel method:

In [72]: times = xr.DataArray(pd.to_datetime(['2000-01-03', '2000-01-02', '2000-01-01']),
   ....:                      dims='new_time')
   ....: 

In [73]: arr.sel(space=xr.DataArray(['IA', 'IL', 'IN'], dims=['new_time']),
   ....:         time=times)
   ....: 
Out[73]: 
<xarray.DataArray (new_time: 3)>
array([  0.451376,   0.37675 , -10.      ])
Coordinates:
    space    (new_time) <U2 'IA' 'IL' 'IN'
    time     (new_time) datetime64[ns] 2000-01-03 2000-01-02 2000-01-01
Dimensions without coordinates: new_time

Align and reindex

xarray’s reindex, reindex_like and align impose a DataArray or Dataset onto a new set of coordinates corresponding to dimensions. The original values are subset to the index labels still found in the new labels, and values corresponding to new labels not found in the original object are in-filled with NaN.

xarray operations that combine multiple objects generally automatically align their arguments to share the same indexes. However, manual alignment can be useful for greater control and for increased performance.

To reindex a particular dimension, use reindex():

In [74]: arr.reindex(space=['IA', 'CA'])
Out[74]: 
<xarray.DataArray (time: 4, space: 2)>
array([[ 0.12697 ,       nan],
       [ 0.897237,       nan],
       [ 0.451376,       nan],
       [ 0.543026,       nan]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) object 'IA' 'CA'

The reindex_like() method is a useful shortcut. To demonstrate, we will make a subset DataArray with new values:

In [75]: foo = arr.rename('foo')

In [76]: baz = (10 * arr[:2, :2]).rename('baz')

In [77]: baz
Out[77]: 
<xarray.DataArray 'baz' (time: 2, space: 2)>
array([[   1.2697 , -100.     ],
       [   8.97237,    3.7675 ]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) <U2 'IA' 'IL'

Reindexing foo with baz selects out the first two values along each dimension:

In [78]: foo.reindex_like(baz)
Out[78]: 
<xarray.DataArray 'foo' (time: 2, space: 2)>
array([[  0.12697 , -10.      ],
       [  0.897237,   0.37675 ]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) <U2 'IA' 'IL'

The opposite operation asks us to reindex to a larger shape, so we fill in the missing values with NaN:

In [79]: baz.reindex_like(foo)
Out[79]: 
<xarray.DataArray 'baz' (time: 4, space: 3)>
array([[   1.2697 , -100.     ,       nan],
       [   8.97237,    3.7675 ,       nan],
       [       nan,        nan,       nan],
       [       nan,        nan,       nan]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

The align() function lets us perform more flexible database-like 'inner', 'outer', 'left' and 'right' joins:

In [80]: xr.align(foo, baz, join='inner')
Out[80]: 
(<xarray.DataArray 'foo' (time: 2, space: 2)>
 array([[  0.12697 , -10.      ],
        [  0.897237,   0.37675 ]])
 Coordinates:
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02
   * space    (space) <U2 'IA' 'IL',
 <xarray.DataArray 'baz' (time: 2, space: 2)>
 array([[   1.2697 , -100.     ],
        [   8.97237,    3.7675 ]])
 Coordinates:
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02
   * space    (space) <U2 'IA' 'IL')

In [81]: xr.align(foo, baz, join='outer')
Out[81]: 
(<xarray.DataArray 'foo' (time: 4, space: 3)>
 array([[  0.12697 , -10.      , -10.      ],
        [  0.897237,   0.37675 ,   0.336222],
        [  0.451376,   0.840255,   0.123102],
        [  0.543026,   0.373012,   0.447997]])
 Coordinates:
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
   * space    (space) <U2 'IA' 'IL' 'IN',
 <xarray.DataArray 'baz' (time: 4, space: 3)>
 array([[   1.2697 , -100.     ,       nan],
        [   8.97237,    3.7675 ,       nan],
        [       nan,        nan,       nan],
        [       nan,        nan,       nan]])
 Coordinates:
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
   * space    (space) <U2 'IA' 'IL' 'IN')

Both reindex_like and align work interchangeably between DataArray and Dataset objects, and with any number of matching dimension names:

In [82]: ds
Out[82]: 
<xarray.Dataset>
Dimensions:  (space: 3, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Data variables:
    foo      (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 ...

In [83]: ds.reindex_like(baz)
Out[83]: 
<xarray.Dataset>
Dimensions:  (space: 2, time: 2)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) <U2 'IA' 'IL'
Data variables:
    foo      (time, space) float64 0.127 -10.0 0.8972 0.3767

In [84]: other = xr.DataArray(['a', 'b', 'c'], dims='other')

# this is a no-op, because there are no shared dimension names
In [85]: ds.reindex_like(other)
Out[85]: 
<xarray.Dataset>
Dimensions:  (space: 3, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Data variables:
    foo      (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 ...

Missing coordinate labels

Coordinate labels for each dimension are optional (as of xarray v0.9). Label based indexing with .sel and .loc uses standard positional, integer-based indexing as a fallback for dimensions without a coordinate label:

In [86]: array = xr.DataArray([1, 2, 3], dims='x')

In [87]: array.sel(x=[0, -1])
Out[87]: 
<xarray.DataArray (x: 2)>
array([1, 3])
Dimensions without coordinates: x

Alignment between xarray objects where one or both do not have coordinate labels succeeds only if all dimensions of the same name have the same length. Otherwise, it raises an informative error:

In [88]: xr.align(array, array[:2])
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {2, 3}

Underlying Indexes

xarray uses the pandas.Index internally to perform indexing operations. If you need to access the underlying indexes, they are available through the indexes attribute.

In [89]: arr
Out[89]: 
<xarray.DataArray (time: 4, space: 3)>
array([[  0.12697 , -10.      , -10.      ],
       [  0.897237,   0.37675 ,   0.336222],
       [  0.451376,   0.840255,   0.123102],
       [  0.543026,   0.373012,   0.447997]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

In [90]: arr.indexes
Out[90]: 
time: DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04'], dtype='datetime64[ns]', name='time', freq='D')
space: Index(['IA', 'IL', 'IN'], dtype='object', name='space')

In [91]: arr.indexes['time']
Out[91]: DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04'], dtype='datetime64[ns]', name='time', freq='D')

Use get_index() to get an index for a dimension, falling back to a default pandas.RangeIndex if it has no coordinate labels:

In [92]: array
Out[92]: 
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Dimensions without coordinates: x

In [93]: array.get_index('x')
Out[93]: RangeIndex(start=0, stop=3, step=1, name='x')

Copies vs. Views

Whether array indexing returns a view or a copy of the underlying data depends on the nature of the labels.

For positional (integer) indexing, xarray follows the same rules as NumPy:

  • Positional indexing with only integers and slices returns a view.
  • Positional indexing with arrays or lists returns a copy.

The rules for label based indexing are more complex:

  • Label-based indexing with only slices returns a view.
  • Label-based indexing with arrays returns a copy.
  • Label-based indexing with scalars returns a view or a copy, depending on whether the corresponding positional indexer can be represented as an integer or a slice object. The exact rules are determined by pandas.

Whether data is a copy or a view is more predictable in xarray than in pandas, so unlike pandas, xarray does not produce SettingWithCopy warnings. However, you should still avoid assignment with chained indexing.
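
A short sketch of the positional rules (the names are ours):

a = xr.DataArray(np.zeros(4), dims='x')
view = a[1:3]     # integers/slices return a view of the same buffer
view[...] = 1.0   # the change is visible through ``a``
copy = a[[0, 1]]  # an array indexer returns a copy
copy[...] = 9.0   # ``a`` is unchanged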

Multi-level indexing

Just like pandas, advanced indexing on multi-level indexes is possible with loc and sel. You can slice a multi-index by providing multiple indexers, i.e., a tuple of slices, labels, lists of labels, or any selector allowed by pandas:

In [94]: midx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
   ....:                                   names=('one', 'two'))
   ....: 

In [95]: mda = xr.DataArray(np.random.rand(6, 3),
   ....:                    [('x', midx), ('y', range(3))])
   ....: 

In [96]: mda
Out[96]: 
<xarray.DataArray (x: 6, y: 3)>
array([[ 0.129441,  0.859879,  0.820388],
       [ 0.352054,  0.228887,  0.776784],
       [ 0.594784,  0.137554,  0.8529  ],
       [ 0.235507,  0.146227,  0.589869],
       [ 0.574012,  0.06127 ,  0.590426],
       [ 0.24535 ,  0.340445,  0.984729]])
Coordinates:
  * x        (x) MultiIndex
  - one      (x) object 'a' 'a' 'b' 'b' 'c' 'c'
  - two      (x) int64 0 1 0 1 0 1
  * y        (y) int64 0 1 2

In [97]: mda.sel(x=(list('ab'), [0]))
Out[97]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.129441,  0.859879,  0.820388],
       [ 0.594784,  0.137554,  0.8529  ]])
Coordinates:
  * x        (x) MultiIndex
  - one      (x) object 'a' 'b'
  - two      (x) int64 0 0
  * y        (y) int64 0 1 2

You can also select multiple elements by providing a list of labels or tuples or a slice of tuples:

In [98]: mda.sel(x=[('a', 0), ('b', 1)])
Out[98]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.129441,  0.859879,  0.820388],
       [ 0.235507,  0.146227,  0.589869]])
Coordinates:
  * x        (x) MultiIndex
  - one      (x) object 'a' 'b'
  - two      (x) int64 0 1
  * y        (y) int64 0 1 2

Additionally, xarray supports dictionaries:

In [99]: mda.sel(x={'one': 'a', 'two': 0})
Out[99]: 
<xarray.DataArray (y: 3)>
array([ 0.129441,  0.859879,  0.820388])
Coordinates:
    x        object ('a', 0)
  * y        (y) int64 0 1 2

For convenience, sel also accepts multi-index levels directly as keyword arguments:

In [100]: mda.sel(one='a', two=0)
Out[100]: 
<xarray.DataArray (y: 3)>
array([ 0.129441,  0.859879,  0.820388])
Coordinates:
    x        object ('a', 0)
  * y        (y) int64 0 1 2

Note that using sel it is not possible to mix a dimension indexer with level indexers for that dimension (e.g., mda.sel(x={'one': 'a'}, two=0) will raise a ValueError).

Like pandas, xarray handles partial selection on a multi-index (level drop). As shown below, it also renames the dimension / coordinate when the multi-index is reduced to a single index.

In [101]: mda.loc[{'one': 'a'}, ...]
Out[101]: 
<xarray.DataArray (two: 2, y: 3)>
array([[ 0.129441,  0.859879,  0.820388],
       [ 0.352054,  0.228887,  0.776784]])
Coordinates:
  * two      (two) int64 0 1
  * y        (y) int64 0 1 2

Unlike pandas, xarray does not guess whether you provide index levels or dimensions when using loc in some ambiguous cases. For example, for mda.loc[{'one': 'a', 'two': 0}] and mda.loc['a', 0] xarray always interprets (‘one’, ‘two’) and (‘a’, 0) as the names and labels of the 1st and 2nd dimension, respectively. You must specify all dimensions or use the ellipsis in the loc specifier, e.g. in the example above, mda.loc[{'one': 'a', 'two': 0}, :] or mda.loc[('a', 0), ...].

Indexing rules

Here we describe the full rules xarray uses for vectorized indexing. Note that this is for the purposes of explanation: for the sake of efficiency and to support various backends, the actual implementation is different.

  1. (Only for label based indexing.) Look up positional indexes along each dimension from the corresponding pandas.Index.
  2. A full slice object : is inserted for each dimension without an indexer.
  3. slice objects are converted into arrays, given by np.arange(*slice.indices(...)).
  4. Assume dimension names for array indexers without dimensions, such as np.ndarray and list, from the dimensions to be indexed along. For example, v.isel(x=[0, 1]) is understood as v.isel(x=xr.DataArray([0, 1], dims=['x'])); see the runnable sketch after the note below.
  5. For each variable in a Dataset or DataArray (the array and its coordinates):
    1. Broadcast all relevant indexers based on their dimension names (see Broadcasting by dimension name for full details).
    2. Index the underlying array by the broadcast indexers, using NumPy’s advanced indexing rules.
  6. If any indexer DataArray has coordinates and no coordinate with the same name exists, attach them to the indexed object.

Note

Only 1-dimensional boolean arrays can be used as indexers.
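
As a runnable illustration of rule 4 above (a sketch; the names are illustrative):

import numpy as np
import xarray as xr

v = xr.DataArray(np.arange(12).reshape(3, 4), dims=('x', 'y'))

# a bare list indexer is assumed to lie along the indexed dimension,
# so these two selections are equivalent:
a = v.isel(x=[0, 1])
b = v.isel(x=xr.DataArray([0, 1], dims=['x']))
assert a.equals(b)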

Computation

The labels associated with DataArray and Dataset objects enable some powerful shortcuts for computation, notably including aggregation and broadcasting by dimension names.

Basic array math

Arithmetic operations with a single DataArray automatically vectorize (like numpy) over all array values:

In [1]: arr = xr.DataArray(np.random.RandomState(0).randn(2, 3),
   ...:                    [('x', ['a', 'b']), ('y', [10, 20, 30])])
   ...: 

In [2]: arr - 3
Out[2]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-1.235948, -2.599843, -2.021262],
       [-0.759107, -1.132442, -3.977278]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

In [3]: abs(arr)
Out[3]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 1.764052,  0.400157,  0.978738],
       [ 2.240893,  1.867558,  0.977278]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

You can also use any of numpy’s or scipy’s many ufunc functions directly on a DataArray:

In [4]: np.sin(arr)
Out[4]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.981384,  0.389563,  0.829794],
       [ 0.783762,  0.956288, -0.828978]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

Use where() to conditionally switch between values:

In [5]: xr.where(arr > 0, 'positive', 'negative')
Out[5]: 
<xarray.DataArray (x: 2, y: 3)>
array([['positive', 'positive', 'positive'],
       ['positive', 'positive', 'negative']],
      dtype='<U8')
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

Data arrays also implement many numpy.ndarray methods:

In [6]: arr.round(2)
Out[6]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 1.76,  0.4 ,  0.98],
       [ 2.24,  1.87, -0.98]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

In [7]: arr.T
Out[7]: 
<xarray.DataArray (y: 3, x: 2)>
array([[ 1.764052,  2.240893],
       [ 0.400157,  1.867558],
       [ 0.978738, -0.977278]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

Missing values

xarray objects borrow the isnull(), notnull(), count(), dropna(), fillna(), ffill(), and bfill() methods for working with missing data from pandas:

In [8]: x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=['x'])

In [9]: x.isnull()
Out[9]: 
<xarray.DataArray (x: 5)>
array([False, False,  True,  True, False], dtype=bool)
Dimensions without coordinates: x

In [10]: x.notnull()
Out[10]: 
<xarray.DataArray (x: 5)>
array([ True,  True, False, False,  True], dtype=bool)
Dimensions without coordinates: x

In [11]: x.count()
Out[11]: 
<xarray.DataArray ()>
array(3)

In [12]: x.dropna(dim='x')
Out[12]: 
<xarray.DataArray (x: 3)>
array([ 0.,  1.,  2.])
Dimensions without coordinates: x

In [13]: x.fillna(-1)
Out[13]: 
<xarray.DataArray (x: 5)>
array([ 0.,  1., -1., -1.,  2.])
Dimensions without coordinates: x

In [14]: x.ffill('x')
Out[14]: 
<xarray.DataArray (x: 5)>
array([ 0.,  1.,  1.,  1.,  2.])
Dimensions without coordinates: x

In [15]: x.bfill('x')
Out[15]: 
<xarray.DataArray (x: 5)>
array([ 0.,  1.,  2.,  2.,  2.])
Dimensions without coordinates: x

Like pandas, xarray uses the float value np.nan (not-a-number) to represent missing values.

xarray objects also have an interpolate_na() method for filling missing values via 1D interpolation.

In [16]: x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=['x'],
   ....:                  coords={'xx': xr.Variable('x', [0, 1, 1.1, 1.9, 3])})
   ....: 

In [17]: x.interpolate_na(dim='x', method='linear', use_coordinate='xx')
Out[17]: 
<xarray.DataArray (x: 5)>
array([ 0.  ,  1.  ,  1.05,  1.45,  2.  ])
Coordinates:
    xx       (x) float64 0.0 1.0 1.1 1.9 3.0
Dimensions without coordinates: x

Note that xarray slightly diverges from the pandas interpolate syntax by providing the use_coordinate keyword which facilitates a clear specification of which values to use as the index in the interpolation.

Aggregation

Aggregation methods have been updated to take a dim argument instead of axis. This allows for very intuitive syntax for aggregation methods that are applied along particular dimension(s):

In [18]: arr.sum(dim='x')
Out[18]: 
<xarray.DataArray (y: 3)>
array([  4.004946e+00,   2.267715e+00,   1.460104e-03])
Coordinates:
  * y        (y) int64 10 20 30

In [19]: arr.std(['x', 'y'])
Out[19]: 
<xarray.DataArray ()>
array(1.0903834448772864)

In [20]: arr.min()
Out[20]: 
<xarray.DataArray ()>
array(-0.977277879876411)

If you need to figure out the axis number for a dimension yourself (say, for wrapping code designed to work with numpy arrays), you can use the get_axis_num() method:

In [21]: arr.get_axis_num('y')
Out[21]: 1

These operations automatically skip missing values, like in pandas:

In [22]: xr.DataArray([1, 2, np.nan, 3]).mean()
Out[22]: 
<xarray.DataArray ()>
array(2.0)

If desired, you can disable this behavior by invoking the aggregation method with skipna=False.
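
For instance (a minimal sketch):

import numpy as np
import xarray as xr

# skipna=True (the default) ignores NaN; skipna=False propagates it
xr.DataArray([1, 2, np.nan, 3]).mean()              # -> 2.0
xr.DataArray([1, 2, np.nan, 3]).mean(skipna=False)  # -> nan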

Rolling window operations

DataArray objects include a rolling() method. This method supports rolling window aggregation:

In [23]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
   ....:                    dims=('x', 'y'))
   ....: 

In [24]: arr
Out[24]: 
<xarray.DataArray (x: 3, y: 5)>
array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
       [ 2.5,  3. ,  3.5,  4. ,  4.5],
       [ 5. ,  5.5,  6. ,  6.5,  7. ]])
Dimensions without coordinates: x, y

rolling() is applied along one dimension using the name of the dimension as a key (e.g. y) and the window size as the value (e.g. 3). We get back a Rolling object:

In [25]: arr.rolling(y=3)
Out[25]: DataArrayRolling [window->3,center->False,dim->y]

The label position and minimum number of periods in the rolling window are controlled by the center and min_periods arguments:

In [26]: arr.rolling(y=3, min_periods=2, center=True)
Out[26]: DataArrayRolling [window->3,min_periods->2,center->True,dim->y]

Aggregation and summary methods can be applied directly to the Rolling object:

In [27]: r = arr.rolling(y=3)

In [28]: r.mean()
Out[28]: 
<xarray.DataArray (x: 3, y: 5)>
array([[ nan,  nan,  0.5,  1. ,  1.5],
       [ nan,  nan,  3. ,  3.5,  4. ],
       [ nan,  nan,  5.5,  6. ,  6.5]])
Dimensions without coordinates: x, y

In [29]: r.reduce(np.std)
Out[29]: 
<xarray.DataArray (x: 3, y: 5)>
array([[      nan,       nan,  0.408248,  0.408248,  0.408248],
       [      nan,       nan,  0.408248,  0.408248,  0.408248],
       [      nan,       nan,  0.408248,  0.408248,  0.408248]])
Dimensions without coordinates: x, y

Note that rolling window aggregations are faster when bottleneck is installed.

We can also manually iterate through Rolling objects:

In [30]: for label, arr_window in r:
   ....:     pass  # arr_window is a view of the original array
   ....: 

Finally, the rolling object has a construct method which returns a view of the original DataArray with the windowed dimension in the last position. You can use this for more advanced rolling operations such as strided rolling, windowed rolling, convolution, and short-time FFT.

# rolling with 2-point stride
In [31]: rolling_da = r.construct('window_dim', stride=2)

In [32]: rolling_da
Out[32]: 
<xarray.DataArray (x: 3, y: 3, window_dim: 3)>
array([[[ nan,  nan,  0. ],
        [ 0. ,  0.5,  1. ],
        [ 1. ,  1.5,  2. ]],

       [[ nan,  nan,  2.5],
        [ 2.5,  3. ,  3.5],
        [ 3.5,  4. ,  4.5]],

       [[ nan,  nan,  5. ],
        [ 5. ,  5.5,  6. ],
        [ 6. ,  6.5,  7. ]]])
Dimensions without coordinates: x, y, window_dim

In [33]: rolling_da.mean('window_dim', skipna=False)
Out[33]: 
<xarray.DataArray (x: 3, y: 3)>
array([[ nan,  0.5,  1.5],
       [ nan,  3. ,  4. ],
       [ nan,  5.5,  6.5]])
Dimensions without coordinates: x, y

Because the DataArray given by r.construct('window_dim') is a view of the original array, it is memory efficient. You can also use construct to compute a weighted rolling mean:

In [34]: weight = xr.DataArray([0.25, 0.5, 0.25], dims=['window'])

In [35]: arr.rolling(y=3).construct('window').dot(weight)
Out[35]: 
<xarray.DataArray (x: 3, y: 5)>
array([[ nan,  nan,  0.5,  1. ,  1.5],
       [ nan,  nan,  3. ,  3.5,  4. ],
       [ nan,  nan,  5.5,  6. ,  6.5]])
Dimensions without coordinates: x, y

Note

numpy’s NaN-aggregation functions, such as nansum, copy the original array. xarray uses these functions internally in its aggregation methods (such as .sum()) when the skipna argument is unspecified or set to True. This means rolling_da.mean('window_dim') is memory inefficient. To avoid this, use skipna=False as in the example above.

Broadcasting by dimension name

DataArray objects automatically align themselves (“broadcasting” in the numpy parlance) by dimension name instead of axis order. With xarray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as is commonly done in numpy with np.reshape() or np.newaxis.

This is best illustrated by a few examples. Consider two one-dimensional arrays with different sizes aligned along different dimensions:

In [36]: a = xr.DataArray([1, 2], [('x', ['a', 'b'])])

In [37]: a
Out[37]: 
<xarray.DataArray (x: 2)>
array([1, 2])
Coordinates:
  * x        (x) <U1 'a' 'b'

In [38]: b = xr.DataArray([-1, -2, -3], [('y', [10, 20, 30])])

In [39]: b
Out[39]: 
<xarray.DataArray (y: 3)>
array([-1, -2, -3])
Coordinates:
  * y        (y) int64 10 20 30

With xarray, we can apply binary mathematical operations to these arrays, and their dimensions are expanded automatically:

In [40]: a * b
Out[40]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-1, -2, -3],
       [-2, -4, -6]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

Moreover, dimensions are always reordered to the order in which they first appeared:

In [41]: c = xr.DataArray(np.arange(6).reshape(3, 2), [b['y'], a['x']])

In [42]: c
Out[42]: 
<xarray.DataArray (y: 3, x: 2)>
array([[0, 1],
       [2, 3],
       [4, 5]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) <U1 'a' 'b'

In [43]: a + c
Out[43]: 
<xarray.DataArray (x: 2, y: 3)>
array([[1, 3, 5],
       [3, 5, 7]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

This means, for example, that subtracting an array from its transpose always yields zero:

In [44]: c - c.T
Out[44]: 
<xarray.DataArray (y: 3, x: 2)>
array([[0, 0],
       [0, 0],
       [0, 0]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) <U1 'a' 'b'

You can explicitly broadcast xarray data structures by using the broadcast() function:

In [45]: a2, b2 = xr.broadcast(a, b)

In [46]: a2
Out[46]: 
<xarray.DataArray (x: 2, y: 3)>
array([[1, 1, 1],
       [2, 2, 2]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

In [47]: b2
Out[47]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-1, -2, -3],
       [-1, -2, -3]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) <U1 'a' 'b'

Automatic alignment

xarray enforces alignment between index coordinates (that is, coordinates with the same name as a dimension, marked by *) on objects used in binary operations.

As in pandas, this alignment is automatic for binary operations. By default, the result of a binary operation is taken over the intersection (not the union) of coordinate labels:

In [48]: arr = xr.DataArray(np.arange(3), [('x', range(3))])

In [49]: arr + arr[:-1]
Out[49]: 
<xarray.DataArray (x: 2)>
array([0, 2])
Coordinates:
  * x        (x) int64 0 1

If coordinate values for a dimension are missing on either argument, all matching dimensions must have the same size:

In [50]: arr + xr.DataArray([1, 2], dims='x')
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension size(s) {2} than the size of the aligned dimension labels: 3

However, you can explicitly change this default alignment type (“inner”) by using set_options() as a context manager:

In [52]: with xr.set_options(arithmetic_join="outer"):
   ....:     arr + arr[:1]
   ....: 

# outside the context manager, the default "inner" join applies again
In [53]: arr + arr[:1]
Out[53]: 
<xarray.DataArray (x: 1)>
array([0])
Coordinates:
  * x        (x) int64 0

Before loops or performance critical code, it’s a good idea to align arrays explicitly (e.g., by putting them in the same Dataset or using align()) to avoid the overhead of repeated alignment with each operation. See Align and reindex for more details.
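
A minimal sketch of this pattern, assuming two arrays with overlapping x labels:

import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(3), [('x', range(3))])
b = xr.DataArray(np.arange(2), [('x', range(2))])

# align once, up front, instead of on every operation in the loop
a, b = xr.align(a, b, join='inner')
for _ in range(1000):
    result = a + b  # indexes already match exactly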

Note

There is no automatic alignment between arguments when performing in-place arithmetic operations such as +=. You will need to use manual alignment. This ensures in-place arithmetic never needs to modify data types.

Coordinates

Although index coordinates are aligned, other coordinates are not, and if their values conflict, they will be dropped. This is necessary, for example, because indexing turns 1D coordinates into scalar coordinates:

In [54]: arr[0]
Out[54]: 
<xarray.DataArray ()>
array(0)
Coordinates:
    x        int64 0

In [55]: arr[1]
Out[55]: 
<xarray.DataArray ()>
array(1)
Coordinates:
    x        int64 1

# notice that the scalar coordinate 'x' is silently dropped
In [56]: arr[1] - arr[0]
Out[56]: 
<xarray.DataArray ()>
array(1)

Still, xarray will persist other coordinates in arithmetic, as long as there are no conflicting values:

# only one argument has the 'x' coordinate
In [57]: arr[0] + 1
Out[57]: 
<xarray.DataArray ()>
array(1)
Coordinates:
    x        int64 0

# both arguments have the same 'x' coordinate
In [58]: arr[0] - arr[0]
Out[58]: 
<xarray.DataArray ()>
array(0)
Coordinates:
    x        int64 0

Math with datasets

Datasets support arithmetic operations by automatically looping over all data variables:

In [59]: ds = xr.Dataset({'x_and_y': (('x', 'y'), np.random.randn(3, 5)),
   ....:                  'x_only': ('x', np.random.randn(3))},
   ....:                  coords=arr.coords)
   ....: 

In [60]: ds > 0
Out[60]: 
<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Coordinates:
  * x        (x) int64 0 1 2
Dimensions without coordinates: y
Data variables:
    x_and_y  (x, y) bool True False False False True False True False False ...
    x_only   (x) bool True False True

Datasets support most of the same methods found on data arrays:

In [61]: ds.mean(dim='x')
Out[61]: 
<xarray.Dataset>
Dimensions:  (y: 5)
Dimensions without coordinates: y
Data variables:
    x_and_y  (y) float64 -0.06634 0.3027 -0.6106 -0.9014 -0.644
    x_only   float64 0.138

In [62]: abs(ds)
Out[62]: 
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    x_and_y  (x, y) float64 0.4691 0.2829 1.509 1.136 1.212 0.1732 0.1192 ...
    x_only   (x) float64 0.2719 0.425 0.567

Datasets also support NumPy ufuncs (requires NumPy v1.13 or newer), or alternatively you can use apply() to apply a function to each variable in a dataset:

In [63]: np.sin(ds)
Out[63]: 
<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Coordinates:
  * x        (x) int64 0 1 2
Dimensions without coordinates: y
Data variables:
    x_and_y  (x, y) float64 0.4521 -0.2791 -0.9981 -0.9068 0.9364 -0.1723 ...
    x_only   (x) float64 0.2685 -0.4123 0.5371

In [64]: ds.apply(np.sin)
Out[64]: 
<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Coordinates:
  * x        (x) int64 0 1 2
Dimensions without coordinates: y
Data variables:
    x_and_y  (x, y) float64 0.4521 -0.2791 -0.9981 -0.9068 0.9364 -0.1723 ...
    x_only   (x) float64 0.2685 -0.4123 0.5371

Datasets also use looping over variables for broadcasting in binary arithmetic. You can do arithmetic between any DataArray and a dataset:

In [65]: ds + arr
Out[65]: 
<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Coordinates:
  * x        (x) int64 0 1 2
Dimensions without coordinates: y
Data variables:
    x_and_y  (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 0.8268 1.119 ...
    x_only   (x) float64 0.2719 0.575 2.567

Arithmetic between two datasets matches data variables of the same name:

In [66]: ds2 = xr.Dataset({'x_and_y': 0, 'x_only': 100})

In [67]: ds - ds2
Out[67]: 
<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Coordinates:
  * x        (x) int64 0 1 2
Dimensions without coordinates: y
Data variables:
    x_and_y  (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732 ...
    x_only   (x) float64 -99.73 -100.4 -99.43

Similarly to index based alignment, the result has the intersection of all matching data variables.

Wrapping custom computation

It doesn’t always make sense to do computation directly with xarray objects:

  • In the inner loop of performance-limited code, using xarray can add considerable overhead compared to using NumPy or native Python types. This is particularly true when working with scalars or small arrays (less than ~1e6 elements). Keeping track of labels and ensuring their consistency adds overhead, and xarray’s core itself is not especially fast, because it’s written in Python rather than a compiled language like C. Also, xarray’s high-level label-based APIs remove low-level control over how operations are implemented.
  • Even if speed doesn’t matter, it can be important to wrap existing code, or to support alternative interfaces that don’t use xarray objects.

For these reasons, it is often well-advised to write low-level routines that work with NumPy arrays, and to wrap these routines to work with xarray objects. However, adding support for labels on both Dataset and DataArray can be a bit of a chore.

To make this easier, xarray supplies the apply_ufunc() helper function, designed for wrapping functions that support broadcasting and vectorization on unlabeled arrays in the style of a NumPy universal function (“ufunc” for short). apply_ufunc takes care of everything needed for an idiomatic xarray wrapper, including alignment, broadcasting, looping over Dataset variables (if needed), and merging of coordinates. In fact, many internal xarray functions/methods are written using apply_ufunc.

Simple functions that act independently on each value should work without any additional arguments:

In [68]: squared_error = lambda x, y: (x - y) ** 2

In [69]: arr1 = xr.DataArray([0, 1, 2, 3], dims='x')

In [70]: xr.apply_ufunc(squared_error, arr1, 1)
Out[70]: 
<xarray.DataArray (x: 4)>
array([1, 0, 1, 4])
Dimensions without coordinates: x

For more complex operations that consider some array values collectively, it’s important to understand the idea of “core dimensions” from NumPy’s generalized ufuncs. Core dimensions are defined as dimensions that should not be broadcast over. Usually, they correspond to the fundamental dimensions over which an operation is defined, e.g., the summed axis in np.sum. A good clue that core dimensions are needed is the presence of an axis argument on the corresponding NumPy function.

With apply_ufunc, core dimensions are recognized by name, and then moved to the last dimension of any input arguments before applying the given function. This means that for functions that accept an axis argument, you usually need to set axis=-1. As an example, here is how we would wrap numpy.linalg.norm() to calculate the vector norm:

def vector_norm(x, dim, ord=None):
    return xr.apply_ufunc(np.linalg.norm, x,
                          input_core_dims=[[dim]],
                          kwargs={'ord': ord, 'axis': -1})

In [71]: vector_norm(arr1, dim='x')
Out[71]: 
<xarray.DataArray ()>
array(3.7416573867739413)

Because apply_ufunc follows a standard convention for ufuncs, it plays nicely with tools for building vectorized functions, like numpy.broadcast_arrays() and numpy.vectorize(). For high performance needs, consider using Numba’s vectorize and guvectorize.
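
For example, here is a sketch of wrapping a scalar-only Python function with numpy.vectorize() before handing it to apply_ufunc (clip01 is a hypothetical helper, not part of xarray):

import numpy as np
import xarray as xr

def clip01(value):
    # a plain scalar function: handles one float at a time
    return min(max(value, 0.0), 1.0)

arr1 = xr.DataArray([-0.5, 0.5, 1.5], dims='x')
xr.apply_ufunc(np.vectorize(clip01), arr1)
# -> array([ 0. ,  0.5,  1. ]) along dimension x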

In addition to wrapping functions, apply_ufunc can automatically parallelize many functions when using dask by setting dask='parallelized'. See Automatic parallelization for details.

apply_ufunc() also supports some advanced options for controlling alignment of variables and the form of the result. See the docstring for full details and more examples.

GroupBy: split-apply-combine

xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:

  • Split your data into multiple independent groups.
  • Apply some function to each group.
  • Combine your groups back into a single data object.

Group by operations work on both Dataset and DataArray objects. Most of the examples focus on grouping by a single one-dimensional variable, although support for grouping over a multi-dimensional variable has recently been implemented. Note that for one-dimensional data, it is usually faster to rely on pandas’ implementation of the same pipeline.

Split

Let’s create a simple example dataset:

In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 3))},
   ...:                 coords={'x': [10, 20, 30, 40],
   ...:                         'letters': ('x', list('abba'))})
   ...: 

In [2]: arr = ds['foo']

In [3]: ds
Out[3]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 3)
Coordinates:
  * x        (x) int64 10 20 30 40
    letters  (x) <U1 'a' 'b' 'b' 'a'
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...

If we groupby the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back a GroupBy object:

In [4]: ds.groupby('letters')
Out[4]: <xarray.core.groupby.DatasetGroupBy at 0x7f10f7c5da58>

This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:

In [5]: ds.groupby('letters').groups
Out[5]: {'a': [0, 3], 'b': [1, 2]}

You can also iterate over groups in (label, group) pairs:

In [6]: list(ds.groupby('letters'))
Out[6]: 
[('a', <xarray.Dataset>
  Dimensions:  (x: 2, y: 3)
  Coordinates:
    * x        (x) int64 10 40
      letters  (x) <U1 'a' 'a'
  Dimensions without coordinates: y
  Data variables:
      foo      (x, y) float64 0.127 0.9667 0.2605 0.543 0.373 0.448),
 ('b', <xarray.Dataset>
  Dimensions:  (x: 2, y: 3)
  Coordinates:
    * x        (x) int64 20 30
      letters  (x) <U1 'b' 'b'
  Dimensions without coordinates: y
  Data variables:
      foo      (x, y) float64 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]

Just like in pandas, creating a GroupBy object is cheap: it does not actually split the data until you access particular values.

Binning

Sometimes you don’t want to use all the unique values to determine the groups but instead want to “bin” the data into coarser groups. You could always create a customized coordinate, but xarray facilitates this via the groupby_bins() method.

In [7]: x_bins = [0,25,50]

In [8]: ds.groupby_bins('x', x_bins).groups
Out[8]: 
{Interval(0, 25, closed='right'): [0, 1],
 Interval(25, 50, closed='right'): [2, 3]}

The binning is implemented via pandas.cut, whose documentation details how the bins are assigned. As seen in the example above, by default, the bins are labeled with strings using set notation to precisely identify the bin limits. To override this behavior, you can specify the bin labels explicitly. Here we choose float labels which identify the bin centers:

In [9]: x_bin_labels = [12.5,37.5]

In [10]: ds.groupby_bins('x', x_bins, labels=x_bin_labels).groups
Out[10]: {12.5: [0, 1], 37.5: [2, 3]}

Apply

To apply a function to each group, you can use the flexible apply() method. The resulting objects are automatically concatenated back together along the group axis:

In [11]: def standardize(x):
   ....:     return (x - x.mean()) / x.std()
   ....: 

In [12]: arr.groupby('letters').apply(standardize)
Out[12]: 
<xarray.DataArray 'foo' (x: 4, y: 3)>
array([[-1.229778,  1.93741 , -0.726247],
       [ 1.419796, -0.460192, -0.606579],
       [-0.190642,  1.21398 , -1.376362],
       [ 0.339417, -0.301806, -0.018995]])
Coordinates:
  * x        (x) int64 10 20 30 40
    letters  (x) <U1 'a' 'b' 'b' 'a'
Dimensions without coordinates: y

GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:

In [13]: arr.groupby('letters').mean(dim='x')
Out[13]: 
<xarray.DataArray 'foo' (letters: 2, y: 3)>
array([[ 0.334998,  0.669865,  0.354236],
       [ 0.674306,  0.608502,  0.229662]])
Coordinates:
  * letters  (letters) object 'a' 'b'
Dimensions without coordinates: y

Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:

In [14]: ds.groupby('x').std()
Out[14]: 
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 10 20 30 40
    letters  (x) <U1 'a' 'b' 'b' 'a'
Data variables:
    foo      (x) float64 0.3684 0.2554 0.2931 0.06957

First and last

There are two special aggregation operations that are currently only found on groupby objects: first and last. These provide the first or last values for each group along the grouped dimension:

In [15]: ds.groupby('letters').first()
Out[15]: 
<xarray.Dataset>
Dimensions:  (letters: 2, y: 3)
Coordinates:
  * letters  (letters) object 'a' 'b'
Dimensions without coordinates: y
Data variables:
    foo      (letters, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362

By default, they skip missing values (control this with skipna).
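
For instance, a minimal sketch of keeping missing values instead of skipping them:

# propagate NaN instead of falling through to the next valid value
ds.groupby('letters').first(skipna=False)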

Grouped arithmetic

GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example:

In [16]: alt = arr.groupby('letters').mean()

In [17]: alt
Out[17]: 
<xarray.DataArray 'foo' (letters: 2)>
array([ 0.453033,  0.504157])
Coordinates:
  * letters  (letters) object 'a' 'b'

In [18]: ds.groupby('letters') - alt
Out[18]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 3)
Coordinates:
  * x        (x) int64 10 20 30 40
    letters  (x) <U1 'a' 'b' 'b' 'a'
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 -0.3261 0.5137 -0.1926 0.3931 -0.1274 -0.1679 ...

This last line is roughly equivalent to the following:

results = []
for label, group in ds.groupby('letters'):
    results.append(group - alt.sel(letters=label))
xr.concat(results, dim='x')

Squeezing

When grouping over a dimension, you can control whether the dimension is squeezed out or if it should remain with length one on each group by using the squeeze parameter:

In [19]: next(iter(arr.groupby('x')))
Out[19]: 
(10, <xarray.DataArray 'foo' (y: 3)>
 array([ 0.12697 ,  0.966718,  0.260476])
 Coordinates:
     x        int64 10
     letters  <U1 'a'
 Dimensions without coordinates: y)

In [20]: next(iter(arr.groupby('x', squeeze=False)))
Out[20]: 
(10, <xarray.DataArray 'foo' (x: 1, y: 3)>
 array([[ 0.12697 ,  0.966718,  0.260476]])
 Coordinates:
   * x        (x) int64 10
     letters  (x) <U1 'a'
 Dimensions without coordinates: y)

Although xarray will attempt to automatically transpose dimensions back into their original order when you use apply, it is sometimes useful to set squeeze=False to guarantee that all original dimensions remain unchanged.

You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.

Multidimensional Grouping

Many datasets have a multidimensional coordinate variable (e.g. longitude) which is different from the logical grid dimensions (e.g. nx, ny). Such variables are valid under the CF conventions. Xarray supports groupby operations over multidimensional coordinate variables:

In [21]: da = xr.DataArray([[0,1],[2,3]],
   ....:     coords={'lon': (['ny','nx'], [[30,40],[40,50]] ),
   ....:             'lat': (['ny','nx'], [[10,10],[20,20]] ),},
   ....:     dims=['ny','nx'])
   ....: 

In [22]: da
Out[22]: 
<xarray.DataArray (ny: 2, nx: 2)>
array([[0, 1],
       [2, 3]])
Coordinates:
    lon      (ny, nx) int64 30 40 40 50
    lat      (ny, nx) int64 10 10 20 20
Dimensions without coordinates: ny, nx

In [23]: da.groupby('lon').sum()
Out[23]: 
<xarray.DataArray (lon: 3)>
array([0, 3, 3])
Coordinates:
  * lon      (lon) int64 30 40 50

In [24]: da.groupby('lon').apply(lambda x: x - x.mean(), shortcut=False)
Out[24]: 
<xarray.DataArray (ny: 2, nx: 2)>
array([[ 0. , -0.5],
       [ 0.5,  0. ]])
Coordinates:
    lon      (ny, nx) int64 30 40 40 50
    lat      (ny, nx) int64 10 10 20 20
Dimensions without coordinates: ny, nx

Because multidimensional groups have the ability to generate a very large number of bins, coarse-binning via groupby_bins() may be desirable:

In [25]: da.groupby_bins('lon', [0,45,50]).sum()
Out[25]: 
<xarray.DataArray (lon_bins: 2)>
array([3, 3])
Coordinates:
  * lon_bins  (lon_bins) object (0, 45] (45, 50]

Reshaping and reorganizing data

These methods allow you to reorganize your data: reordering or reshaping its dimensions, converting between datasets and arrays, and manipulating indexes.

Reordering dimensions

To reorder dimensions on a DataArray or across all variables on a Dataset, use transpose() or the .T property:

In [1]: ds = xr.Dataset({'foo': (('x', 'y', 'z'), [[[42]]]), 'bar': (('y', 'z'), [[24]])})

In [2]: ds.transpose('y', 'z', 'x')
Out[2]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Dimensions without coordinates: x, y, z
Data variables:
    foo      (y, z, x) int64 42
    bar      (y, z) int64 24

In [3]: ds.T
Out[3]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Dimensions without coordinates: x, y, z
Data variables:
    foo      (z, y, x) int64 42
    bar      (z, y) int64 24

Expand and squeeze dimensions

To expand a DataArray or all variables on a Dataset along a new dimension, use expand_dims():

In [4]: expanded  = ds.expand_dims('w')

In [5]: expanded
Out[5]: 
<xarray.Dataset>
Dimensions:  (w: 1, x: 1, y: 1, z: 1)
Dimensions without coordinates: w, x, y, z
Data variables:
    foo      (w, x, y, z) int64 42
    bar      (w, y, z) int64 24

This method attaches a new dimension with size 1 to all data variables.

To remove such a size-1 dimension from the DataArray or Dataset, use squeeze():

In [6]: expanded.squeeze('w')
Out[6]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Dimensions without coordinates: x, y, z
Data variables:
    foo      (x, y, z) int64 42
    bar      (y, z) int64 24

Converting between datasets and arrays

To convert from a Dataset to a DataArray, use to_array():

In [7]: arr = ds.to_array()

In [8]: arr
Out[8]: 
<xarray.DataArray (variable: 2, x: 1, y: 1, z: 1)>
array([[[[42]]],


       [[[24]]]])
Coordinates:
  * variable  (variable) <U3 'foo' 'bar'
Dimensions without coordinates: x, y, z

This method broadcasts all data variables in the dataset against each other, then concatenates them along a new dimension into a new array while preserving coordinates.

To convert back from a DataArray to a Dataset, use to_dataset():

In [9]: arr.to_dataset(dim='variable')
Out[9]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Dimensions without coordinates: x, y, z
Data variables:
    foo      (x, y, z) int64 42
    bar      (x, y, z) int64 24

The broadcasting behavior of to_array means that the resulting array includes the union of data variable dimensions:

In [10]: ds2 = xr.Dataset({'a': 0, 'b': ('x', [3, 4, 5])})

# the input dataset has 4 elements
In [11]: ds2
Out[11]: 
<xarray.Dataset>
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        int64 0
    b        (x) int64 3 4 5

# the resulting array has 6 elements
In [12]: ds2.to_array()
Out[12]: 
<xarray.DataArray (variable: 2, x: 3)>
array([[0, 0, 0],
       [3, 4, 5]])
Coordinates:
  * variable  (variable) <U1 'a' 'b'
Dimensions without coordinates: x

Otherwise, the result could not be represented as an orthogonal array.

If you use to_dataset without supplying the dim argument, the DataArray will be converted into a Dataset of one variable:

In [13]: arr.to_dataset(name='combined')
Out[13]: 
<xarray.Dataset>
Dimensions:   (variable: 2, x: 1, y: 1, z: 1)
Coordinates:
  * variable  (variable) <U3 'foo' 'bar'
Dimensions without coordinates: x, y, z
Data variables:
    combined  (variable, x, y, z) int64 42 24

Stack and unstack

As part of xarray’s nascent support for pandas.MultiIndex, we have implemented stack() and unstack() methods, for combining or splitting dimensions:

In [14]: array = xr.DataArray(np.random.randn(2, 3),
   ....:                      coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
   ....: 

In [15]: stacked = array.stack(z=('x', 'y'))

In [16]: stacked
Out[16]: 
<xarray.DataArray (z: 6)>
array([ 0.469112, -0.282863, -1.509059, -1.135632,  1.212112, -0.173215])
Coordinates:
  * z        (z) MultiIndex
  - x        (z) object 'a' 'a' 'a' 'b' 'b' 'b'
  - y        (z) int64 0 1 2 0 1 2

In [17]: stacked.unstack('z')
Out[17]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469112, -0.282863, -1.509059],
       [-1.135632,  1.212112, -0.173215]])
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 0 1 2

These methods are modeled on the pandas.DataFrame methods of the same name, although in xarray they always create new dimensions rather than adding to the existing index or columns.

Like DataFrame.unstack, xarray’s unstack always succeeds, even if the multi-index being unstacked does not contain all possible levels. Missing levels are filled in with NaN in the resulting object:

In [18]: stacked2 = stacked[::2]

In [19]: stacked2
Out[19]: 
<xarray.DataArray (z: 3)>
array([ 0.469112, -1.509059,  1.212112])
Coordinates:
  * z        (z) MultiIndex
  - x        (z) object 'a' 'a' 'b'
  - y        (z) int64 0 2 1

In [20]: stacked2.unstack('z')
Out[20]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469112,       nan, -1.509059],
       [      nan,  1.212112,       nan]])
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 0 1 2

However, xarray’s stack has an important difference from pandas: unlike pandas, it does not automatically drop missing values. Compare:

In [21]: array = xr.DataArray([[np.nan, 1], [2, 3]], dims=['x', 'y'])

In [22]: array.stack(z=('x', 'y'))
Out[22]: 
<xarray.DataArray (z: 4)>
array([ nan,   1.,   2.,   3.])
Coordinates:
  * z        (z) MultiIndex
  - x        (z) int64 0 0 1 1
  - y        (z) int64 0 1 0 1

In [23]: array.to_pandas().stack()
Out[23]: 
x  y
0  1    1.0
1  0    2.0
   1    3.0
dtype: float64

We departed from pandas’s behavior here because predictable shapes for new array dimensions are necessary for Parallel computing with dask.
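
If you do want the pandas behavior, a simple sketch is to drop the missing values explicitly after stacking:

# the explicit equivalent of pandas' automatic NaN dropping
array.stack(z=('x', 'y')).dropna('z')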

Set and reset index

Complementary to stack / unstack, xarray’s .set_index, .reset_index and .reorder_levels allow easy manipulation of DataArray or Dataset multi-indexes without modifying the data and its dimensions.

You can create a multi-index from several 1-dimensional variables and/or coordinates using set_index():

In [24]: da = xr.DataArray(np.random.rand(4),
   ....:                   coords={'band': ('x', ['a', 'a', 'b', 'b']),
   ....:                           'wavenumber': ('x', np.linspace(200, 400, 4))},
   ....:                   dims='x')
   ....: 

In [25]: da
Out[25]: 
<xarray.DataArray (x: 4)>
array([ 0.123102,  0.543026,  0.373012,  0.447997])
Coordinates:
    band        (x) <U1 'a' 'a' 'b' 'b'
    wavenumber  (x) float64 200.0 266.7 333.3 400.0
Dimensions without coordinates: x

In [26]: mda = da.set_index(x=['band', 'wavenumber'])

In [27]: mda
Out[27]: 
<xarray.DataArray (x: 4)>
array([ 0.123102,  0.543026,  0.373012,  0.447997])
Coordinates:
  * x           (x) MultiIndex
  - band        (x) object 'a' 'a' 'b' 'b'
  - wavenumber  (x) float64 200.0 266.7 333.3 400.0

These coordinates can now be used for indexing, e.g.,

In [28]: mda.sel(band='a')
Out[28]: 
<xarray.DataArray (wavenumber: 2)>
array([ 0.123102,  0.543026])
Coordinates:
  * wavenumber  (wavenumber) float64 200.0 266.7

Conversely, you can use reset_index() to extract multi-index levels as coordinates (this is mainly useful for serialization):

In [29]: mda.reset_index('x')
Out[29]: 
<xarray.DataArray (x: 4)>
array([ 0.123102,  0.543026,  0.373012,  0.447997])
Coordinates:
    band        (x) object 'a' 'a' 'b' 'b'
    wavenumber  (x) float64 200.0 266.7 333.3 400.0
Dimensions without coordinates: x

reorder_levels() allows changing the order of multi-index levels:

In [30]: mda.reorder_levels(x=['wavenumber', 'band'])
Out[30]: 
<xarray.DataArray (x: 4)>
array([ 0.123102,  0.543026,  0.373012,  0.447997])
Coordinates:
  * x           (x) MultiIndex
  - wavenumber  (x) float64 200.0 266.7 333.3 400.0
  - band        (x) object 'a' 'a' 'b' 'b'

As of xarray v0.9, coordinate labels for each dimension are optional. You can also use .set_index / .reset_index to add / remove labels for one or several dimensions:

In [31]: array = xr.DataArray([1, 2, 3], dims='x')

In [32]: array
Out[32]: 
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Dimensions without coordinates: x

In [33]: array['c'] = ('x', ['a', 'b', 'c'])

In [34]: array.set_index(x='c')
Out[34]: 
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Coordinates:
  * x        (x) object 'a' 'b' 'c'

In [35]: array.set_index(x='c', inplace=True)

In [36]: array.reset_index('x', drop=True)
Out[36]: 
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Dimensions without coordinates: x

Shift and roll

To adjust coordinate labels, you can use the shift() and roll() methods:

In [37]: array = xr.DataArray([1, 2, 3, 4], dims='x')

In [38]: array.shift(x=2)
Out[38]: 
<xarray.DataArray (x: 4)>
array([ nan,  nan,   1.,   2.])
Dimensions without coordinates: x

In [39]: array.roll(x=2)
Out[39]: 
<xarray.DataArray (x: 4)>
array([3, 4, 1, 2])
Dimensions without coordinates: x

Sort

One may sort a DataArray or Dataset via the sortby() method, which is available on both. The input can be an individual 1D DataArray or a list of 1D DataArray objects:

In [40]: ds = xr.Dataset({'A': (('x', 'y'), [[1, 2], [3, 4]]),
   ....:                  'B': (('x', 'y'), [[5, 6], [7, 8]])},
   ....:                 coords={'x': ['b', 'a'], 'y': [1, 0]})
   ....: 

In [41]: dax = xr.DataArray([100, 99], [('x', [0, 1])])

In [42]: day = xr.DataArray([90, 80], [('y', [0, 1])])

In [43]: ds.sortby([day, dax])
Out[43]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) object 'b' 'a'
  * y        (y) int64 1 0
Data variables:
    A        (x, y) int64 1 2 3 4
    B        (x, y) int64 5 6 7 8

As a shortcut, you can refer to existing coordinates by name:

In [44]: ds.sortby('x')
Out[44]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 1 0
Data variables:
    A        (x, y) int64 3 4 1 2
    B        (x, y) int64 7 8 5 6

In [45]: ds.sortby(['y', 'x'])
Out[45]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 0 1
Data variables:
    A        (x, y) int64 4 3 2 1
    B        (x, y) int64 8 7 6 5

In [46]: ds.sortby(['y', 'x'], ascending=False)
Out[46]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) <U1 'b' 'a'
  * y        (y) int64 1 0
Data variables:
    A        (x, y) int64 1 2 3 4
    B        (x, y) int64 5 6 7 8

Combining data

  • For combining datasets or data arrays along a dimension, see concatenate.
  • For combining datasets with different variables, see merge.
  • For combining datasets or data arrays with different indexes or missing values, see combine.

Concatenate

To combine arrays along an existing or new dimension into a larger array, you can use concat(). concat takes an iterable of DataArray or Dataset objects, as well as a dimension name, and concatenates along that dimension:

In [1]: arr = xr.DataArray(np.random.randn(2, 3),
   ...:                    [('x', ['a', 'b']), ('y', [10, 20, 30])])
   ...: 

In [2]: arr[:, :1]
Out[2]: 
<xarray.DataArray (x: 2, y: 1)>
array([[ 0.469112],
       [-1.135632]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10

# this resembles how you would use np.concatenate
In [3]: xr.concat([arr[:, :1], arr[:, 1:]], dim='y')
Out[3]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469112, -0.282863, -1.509059],
       [-1.135632,  1.212112, -0.173215]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

In addition to combining along an existing dimension, concat can create a new dimension by stacking lower dimensional arrays together:

In [4]: arr[0]
Out[4]: 
<xarray.DataArray (y: 3)>
array([ 0.469112, -0.282863, -1.509059])
Coordinates:
    x        <U1 'a'
  * y        (y) int64 10 20 30

# to combine these 1d arrays into a 2d array in numpy, you would use np.array
In [5]: xr.concat([arr[0], arr[1]], 'x')
Out[5]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469112, -0.282863, -1.509059],
       [-1.135632,  1.212112, -0.173215]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) <U1 'a' 'b'

If the second argument to concat is a new dimension name, the arrays will be concatenated along that new dimension, which is always inserted as the first dimension:

In [6]: xr.concat([arr[0], arr[1]], 'new_dim')
Out[6]: 
<xarray.DataArray (new_dim: 2, y: 3)>
array([[ 0.469112, -0.282863, -1.509059],
       [-1.135632,  1.212112, -0.173215]])
Coordinates:
  * y        (y) int64 10 20 30
    x        (new_dim) <U1 'a' 'b'
Dimensions without coordinates: new_dim

The second argument to concat can also be an Index or DataArray object as well as a string, in which case it is used to label the values along the new dimension:

In [7]: xr.concat([arr[0], arr[1]], pd.Index([-90, -100], name='new_dim'))
Out[7]: 
<xarray.DataArray (new_dim: 2, y: 3)>
array([[ 0.469112, -0.282863, -1.509059],
       [-1.135632,  1.212112, -0.173215]])
Coordinates:
  * y        (y) int64 10 20 30
    x        (new_dim) <U1 'a' 'b'
  * new_dim  (new_dim) int64 -90 -100

Of course, concat also works on Dataset objects:

In [8]: ds = arr.to_dataset(name='foo')

In [9]: xr.concat([ds.sel(x='a'), ds.sel(x='b')], 'x')
Out[9]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) <U1 'a' 'b'
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732

concat() has a number of options which provide deeper control over which variables are concatenated and how it handles conflicting variables between datasets. With the default parameters, xarray will load some coordinate variables into memory to compare them between datasets. This may be prohibitively expensive if you are manipulating your dataset lazily using Parallel computing with dask.
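
As a sketch, the most relevant options, shown here with their default values, are data_vars, coords and compat:

# defaults shown explicitly; the 'different' strategies load and compare
# variables across datasets, which the 'minimal' strategies avoid
xr.concat([ds.sel(x='a'), ds.sel(x='b')], dim='x',
          data_vars='all', coords='different', compat='equals')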

Merge

To combine variables and coordinates between multiple DataArray and/or Dataset objects, use merge(). It can merge a list of Dataset or DataArray objects, or dictionaries of objects convertible to DataArray objects:

In [10]: xr.merge([ds, ds.rename({'foo': 'bar'})])
Out[10]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    bar      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732

In [11]: xr.merge([xr.DataArray(n, name='var%d' % n) for n in range(5)])
Out[11]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    var0     int64 0
    var1     int64 1
    var2     int64 2
    var3     int64 3
    var4     int64 4

If you merge another dataset (or a dictionary including data array objects), by default the resulting dataset will be aligned on the union of all index coordinates:

In [12]: other = xr.Dataset({'bar': ('x', [1, 2, 3, 4]), 'x': list('abcd')})

In [13]: xr.merge([ds, other])
Out[13]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 3)
Coordinates:
  * x        (x) object 'a' 'b' 'c' 'd'
  * y        (y) int64 10 20 30
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732 nan ...
    bar      (x) int64 1 2 3 4

This ensures that merge is non-destructive. xarray.MergeError is raised if you attempt to merge two variables with the same name but different values:

In [14]: xr.merge([ds, ds + 1])
MergeError: conflicting values for variable 'foo' on objects to be combined:
first value: <xarray.Variable (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
second value: <xarray.Variable (x: 2, y: 3)>
array([[ 1.4691123 ,  0.71713666, -0.5090585 ],
       [-0.13563237,  2.21211203,  0.82678535]])

The same non-destructive merging between DataArray index coordinates is used in the Dataset constructor:

In [15]: xr.Dataset({'a': arr[:-1], 'b': arr[1:]})
Out[15]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 10 20 30
Data variables:
    a        (x, y) float64 0.4691 -0.2829 -1.509 nan nan nan
    b        (x, y) float64 nan nan nan -1.136 1.212 -0.1732

Combine

The instance method combine_first() combines two datasets/data arrays, defaulting to the non-null values in the calling object and using values from the passed object to fill holes. The resulting coordinates are the union of coordinate labels. Vacant cells resulting from the outer join are filled with NaN. For example:

In [16]: ar0 = xr.DataArray([[0, 0], [0, 0]], [('x', ['a', 'b']), ('y', [-1, 0])])

In [17]: ar1 = xr.DataArray([[1, 1], [1, 1]], [('x', ['b', 'c']), ('y', [0, 1])])

In [18]: ar0.combine_first(ar1)
Out[18]: 
<xarray.DataArray (x: 3, y: 3)>
array([[  0.,   0.,  nan],
       [  0.,   0.,   1.],
       [ nan,   1.,   1.]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -1 0 1

In [19]: ar1.combine_first(ar0)
Out[19]: 
<xarray.DataArray (x: 3, y: 3)>
array([[  0.,   0.,  nan],
       [  0.,   1.,   1.],
       [ nan,   1.,   1.]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -1 0 1

For datasets, ds0.combine_first(ds1) works similarly to xr.merge([ds0, ds1]), except that xr.merge raises MergeError when there are conflicting values in variables to be merged, whereas .combine_first defaults to the calling object’s values.

Update

In contrast to merge, update() modifies a dataset in-place without checking for conflicts, and will overwrite any existing variables with new values:

In [20]: ds.update({'space': ('space', [10.2, 9.4, 3.9])})
Out[20]: 
<xarray.Dataset>
Dimensions:  (space: 3, x: 2, y: 3)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30
  * space    (space) float64 10.2 9.4 3.9
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732

However, dimensions are still required to be consistent between different Dataset variables, so you cannot change the size of a dimension unless you replace all dataset variables that use it.

update also performs automatic alignment if necessary. Unlike merge, it maintains the alignment of the original array instead of merging indexes:

In [21]: ds.update(other)
Out[21]: 
<xarray.Dataset>
Dimensions:  (space: 3, x: 2, y: 3)
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 10 20 30
  * space    (space) float64 10.2 9.4 3.9
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    bar      (x) int64 1 2

The exact same alignment logic is used when setting a variable with __setitem__ syntax:

In [22]: ds['baz'] = xr.DataArray([9, 9, 9, 9, 9], coords=[('x', list('abcde'))])

In [23]: ds.baz
Out[23]: 
<xarray.DataArray 'baz' (x: 2)>
array([9, 9])
Coordinates:
  * x        (x) object 'a' 'b'

Equals and identical

xarray objects can be compared by using the equals(), identical() and broadcast_equals() methods. These methods are used by the optional compat argument on concat and merge.

equals checks dimension names, indexes and array values:

In [24]: arr.equals(arr.copy())
Out[24]: True

identical also checks attributes, and the name of each object:

In [25]: arr.identical(arr.rename('bar'))
Out[25]: False

broadcast_equals does a more relaxed form of equality check that allows variables to have different dimensions, as long as values are constant along those new dimensions:

In [26]: left = xr.Dataset(coords={'x': 0})

In [27]: right = xr.Dataset({'x': [0, 0, 0]})

In [28]: left.broadcast_equals(right)
Out[28]: True

Like pandas objects, two xarray objects are still equal or identical if they have missing values marked by NaN in the same locations.

In contrast, the == operation performs element-wise comparison (like numpy):

In [29]: arr == arr.copy()
Out[29]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30

Note that NaN does not compare equal to NaN in element-wise comparison; you may need to deal with missing values explicitly.
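
One common sketch for an element-wise comparison that also treats aligned NaN values as equal:

other = arr.copy()
# True where values match or where both sides are missing
(arr == other) | (arr.isnull() & other.isnull())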

Merging with ‘no_conflicts’

The compat argument 'no_conflicts' is only available when combining xarray objects with merge. In addition to the above comparison methods, it allows merging xarray objects where one or both have NaN values at a given location. This can be used to combine data with overlapping coordinates, as long as any non-missing values agree or are disjoint:

In [30]: ds1 = xr.Dataset({'a': ('x', [10, 20, 30, np.nan])}, {'x': [1, 2, 3, 4]})

In [31]: ds2 = xr.Dataset({'a': ('x', [np.nan, 30, 40, 50])}, {'x': [2, 3, 4, 5]})

In [32]: xr.merge([ds1, ds2], compat='no_conflicts')
Out[32]: 
<xarray.Dataset>
Dimensions:  (x: 5)
Coordinates:
  * x        (x) int64 1 2 3 4 5
Data variables:
    a        (x) float64 10.0 20.0 30.0 40.0 50.0

Note that due to the underlying representation of missing values as floating point numbers (NaN), variable data type is not always preserved when merging in this manner.

Time series data

A major use case for xarray is multi-dimensional time-series data. Accordingly, we’ve copied to xarray many of the features that make working with time-series data in pandas such a joy. In most cases, we rely on pandas for the core functionality.

Creating datetime64 data

xarray uses the numpy dtypes datetime64[ns] and timedelta64[ns] to represent datetime data, which offer vectorized (if sometimes buggy) operations with numpy and smooth integration with pandas.

To convert to or create regular arrays of datetime64 data, we recommend using pandas.to_datetime() and pandas.date_range():

In [1]: pd.to_datetime(['2000-01-01', '2000-02-02'])
Out[1]: DatetimeIndex(['2000-01-01', '2000-02-02'], dtype='datetime64[ns]', freq=None)

In [2]: pd.date_range('2000-01-01', periods=365)
Out[2]: 
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10',
               ...
               '2000-12-21', '2000-12-22', '2000-12-23', '2000-12-24',
               '2000-12-25', '2000-12-26', '2000-12-27', '2000-12-28',
               '2000-12-29', '2000-12-30'],
              dtype='datetime64[ns]', length=365, freq='D')

Alternatively, you can supply arrays of Python datetime objects. These get converted automatically when used as arguments in xarray objects:

In [3]: import datetime

In [4]: xr.Dataset({'time': datetime.datetime(2000, 1, 1)})
Out[4]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    time     datetime64[ns] 2000-01-01

When reading or writing netCDF files, xarray automatically decodes datetime and timedelta arrays using CF conventions (that is, by using a units attribute like 'days since 2000-01-01').

Note

When decoding/encoding datetimes for non-standard calendars or for dates before year 1678 or after year 2262, xarray uses the netcdftime library. netcdftime was previously packaged with the netcdf4-python package but is now distributed separately. netcdftime is an optional dependency of xarray.

You can manually decode arrays in this form by passing a dataset to decode_cf():

In [5]: attrs = {'units': 'hours since 2000-01-01'}

In [6]: ds = xr.Dataset({'time': ('time', [0, 1, 2, 3], attrs)})

In [7]: xr.decode_cf(ds)
Out[7]: 
<xarray.Dataset>
Dimensions:  (time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
    *empty*

One unfortunate limitation of using datetime64[ns] is that it limits the native representation of dates to those that fall between the years 1678 and 2262. When a netCDF file contains dates outside of these bounds, dates will be returned as arrays of netcdftime.datetime objects.

Datetime indexing

xarray borrows powerful indexing machinery from pandas (see Indexing and selecting data).

This allows for several useful and succinct forms of indexing, particularly for datetime64 data. For example, we support indexing with strings for single items and with the slice object:

In [8]: time = pd.date_range('2000-01-01', freq='H', periods=365 * 24)

In [9]: ds = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time})

In [10]: ds.sel(time='2000-01')
Out[10]: 
<xarray.Dataset>
Dimensions:  (time: 744)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
    foo      (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...

In [11]: ds.sel(time=slice('2000-06-01', '2000-06-10'))
Out[11]: 
<xarray.Dataset>
Dimensions:  (time: 240)
Coordinates:
  * time     (time) datetime64[ns] 2000-06-01 2000-06-01T01:00:00 ...
Data variables:
    foo      (time) int64 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 ...

You can also select a particular time by indexing with a datetime.time object:

In [12]: ds.sel(time=datetime.time(12))
Out[12]: 
<xarray.Dataset>
Dimensions:  (time: 365)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01T12:00:00 2000-01-02T12:00:00 ...
Data variables:
    foo      (time) int64 12 36 60 84 108 132 156 180 204 228 252 276 300 ...

For more details, read the pandas documentation.

Datetime components

Similar to pandas, the components of datetime objects contained in a given DataArray can be quickly computed using a special .dt accessor.

In [13]: time = pd.date_range('2000-01-01', freq='6H', periods=365 * 4)

In [14]: ds = xr.Dataset({'foo': ('time', np.arange(365 * 4)), 'time': time})

In [15]: ds.time.dt.hour
Out[15]: 
<xarray.DataArray 'hour' (time: 1460)>
array([ 0,  6, 12, ...,  6, 12, 18])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...

In [16]: ds.time.dt.dayofweek
Out[16]: 
<xarray.DataArray 'dayofweek' (time: 1460)>
array([5, 5, 5, ..., 5, 5, 5])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...

The .dt accessor works on both coordinate dimensions as well as multi-dimensional data.

xarray also supports a notion of “virtual” or “derived” coordinates for datetime components implemented by pandas, including “year”, “month”, “day”, “hour”, “minute”, “second”, “dayofyear”, “week”, “dayofweek”, “weekday” and “quarter”:

In [17]: ds['time.month']
Out[17]: 
<xarray.DataArray 'month' (time: 1460)>
array([ 1,  1,  1, ..., 12, 12, 12])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...

In [18]: ds['time.dayofyear']
Out[18]: 
<xarray.DataArray 'dayofyear' (time: 1460)>
array([  1,   1,   1, ..., 365, 365, 365])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...

For use as a derived coordinate, xarray adds 'season' to the list of datetime components supported by pandas:

In [19]: ds['time.season']
Out[19]: 
<xarray.DataArray 'season' (time: 1460)>
array(['DJF', 'DJF', 'DJF', ..., 'DJF', 'DJF', 'DJF'],
      dtype='<U3')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...

In [20]: ds['time'].dt.season
Out[20]: 
<xarray.DataArray 'season' (time: 1460)>
array(['DJF', 'DJF', 'DJF', ..., 'DJF', 'DJF', 'DJF'],
      dtype='<U3')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...

The set of valid seasons consists of ‘DJF’, ‘MAM’, ‘JJA’ and ‘SON’, labeled by the first letters of the corresponding months.

You can use these shortcuts with both Datasets and DataArray coordinates.

In addition, xarray supports rounding operations floor, ceil, and round. These operations require that you supply a rounding frequency as a string argument.

In [21]: ds['time'].dt.floor('D')
Out[21]: 
<xarray.DataArray 'floor' (time: 1460)>
array(['2000-01-01T00:00:00.000000000', '2000-01-01T00:00:00.000000000',
       '2000-01-01T00:00:00.000000000', ..., '2000-12-30T00:00:00.000000000',
       '2000-12-30T00:00:00.000000000', '2000-12-30T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...

Resampling and grouped operations

Datetime components couple particularly well with grouped operations (see GroupBy: split-apply-combine) for analyzing features that repeat over time. Here’s how to calculate the mean by time of day:

In [22]: ds.groupby('time.hour').mean()
Out[22]: 
<xarray.Dataset>
Dimensions:  (hour: 4)
Coordinates:
  * hour     (hour) int64 0 6 12 18
Data variables:
    foo      (hour) float64 728.0 729.0 730.0 731.0

For upsampling or downsampling temporal resolutions, xarray offers a resample() method building on the core functionality offered by the pandas method of the same name. Resample uses essentially the same API as resample in pandas.

For example, we can downsample our dataset from hourly to 6-hourly:

In [23]: ds.resample(time='6H')
Out[23]: <xarray.core.resample.DatasetResample at 0x7f10cf5049b0>

This will create a specialized Resample object which saves information necessary for resampling. All of the reduction methods which work with Dataset or DataArray objects can also be used for resampling:

In [24]: ds.resample(time='6H').mean()
Out[24]: 
<xarray.Dataset>
Dimensions:  (time: 1460)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...
Data variables:
    foo      (time) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...

You can also supply an arbitrary reduction function to aggregate over each resampling group:

In [25]: ds.resample(time='6H').reduce(np.mean)
Out[25]: 
<xarray.Dataset>
Dimensions:  (time: 1460)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...
Data variables:
    foo      (time) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...

For upsampling, xarray provides four methods: asfreq, ffill, bfill, and interpolate. interpolate extends scipy.interpolate.interp1d and supports all of its schemes. All of these resampling operations work on both Dataset and DataArray objects with an arbitrary number of dimensions.
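
For instance, here is a minimal sketch of upsampling the 6-hourly dataset defined above to 3-hourly resolution; both expressions return new objects:

# forward fill values onto the finer 3-hourly grid
ds.resample(time='3H').ffill()

# or interpolate linearly between the original 6-hourly values
ds.resample(time='3H').interpolate('linear')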

Note

The resample API was updated in version 0.10.0 to reflect similar updates in the pandas resample API and to be more groupby-like. Older-style calls to resample will still be supported for a short period:

In [26]: ds.resample('6H', dim='time', how='mean')
Out[26]: 
<xarray.Dataset>
Dimensions:  (time: 1460)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...
Data variables:
    foo      (time) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...

For more examples of using grouped operations on a time dimension, see Toy weather data.

Working with pandas

One of the most important features of xarray is the ability to convert to and from pandas objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built in to pandas itself or provided by the pandas aware libraries such as Seaborn.

Hierarchical and tidy data

Tabular data is easiest to work with when it meets the criteria for tidy data:

  • Each column holds a different variable.
  • Each row holds a different observation.

In this “tidy data” format, we can represent any Dataset and DataArray in terms of pandas.DataFrame and pandas.Series, respectively (and vice-versa). The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.

Dataset and DataFrame

To convert any dataset to a DataFrame in tidy form, use the Dataset.to_dataframe() method:

In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.randn(2, 3))},
   ...:                  coords={'x': [10, 20], 'y': ['a', 'b', 'c'],
   ...:                          'along_x': ('x', np.random.randn(2)),
   ...:                          'scalar': 123})
   ...: 

In [2]: ds
Out[2]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 10 20
  * y        (y) <U1 'a' 'b' 'c'
    along_x  (x) float64 0.1192 -1.044
    scalar   int64 123
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732

In [3]: df = ds.to_dataframe()

In [4]: df
Out[4]: 
           foo   along_x  scalar
x  y                            
10 a  0.469112  0.119209     123
   b -0.282863  0.119209     123
   c -1.509059  0.119209     123
20 a -1.135632 -1.044236     123
   b  1.212112 -1.044236     123
   c -0.173215 -1.044236     123

We see that each variable and coordinate in the Dataset is now a column in the DataFrame, with the exception of indexes which are in the index. To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().

For datasets containing dask arrays where the data should be lazily loaded, see the Dataset.to_dask_dataframe() method.

To create a Dataset from a DataFrame, use the from_dataframe() class method or the equivalent pandas.DataFrame.to_xarray method (pandas v0.18 or later):

In [5]: xr.Dataset.from_dataframe(df)
Out[5]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 10 20
  * y        (y) object 'a' 'b' 'c'
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    along_x  (x, y) float64 0.1192 0.1192 0.1192 -1.044 -1.044 -1.044
    scalar   (x, y) int64 123 123 123 123 123 123

Notice that the dimensions of variables in the Dataset have expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must have the same indices, so we need to broadcast the data of each array to the full size of the new MultiIndex.

Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates.

DataArray and Series

DataArray objects have a complementary representation in terms of a pandas.Series. Using a Series preserves the Dataset to DataArray relationship, because DataFrames are dict-like containers of Series. The methods are very similar to those for working with DataFrames:

In [6]: s = ds['foo'].to_series()

In [7]: s
Out[7]: 
x   y
10  a    0.469112
    b   -0.282863
    c   -1.509059
20  a   -1.135632
    b    1.212112
    c   -0.173215
Name: foo, dtype: float64

# or equivalently, with Series.to_xarray()
In [8]: xr.DataArray.from_series(s)
Out[8]: 
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.469112, -0.282863, -1.509059],
       [-1.135632,  1.212112, -0.173215]])
Coordinates:
  * x        (x) int64 10 20
  * y        (y) object 'a' 'b' 'c'

Both the from_series and from_dataframe methods use reindexing, so they work even if the hierarchical index is not a full tensor product:

In [9]: s[::2]
Out[9]: 
x   y
10  a    0.469112
    c   -1.509059
20  b    1.212112
Name: foo, dtype: float64

In [10]: s[::2].to_xarray()
Out[10]: 
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.469112,       nan, -1.509059],
       [      nan,  1.212112,       nan]])
Coordinates:
  * x        (x) int64 10 20
  * y        (y) object 'a' 'b' 'c'

Multi-dimensional data

Tidy data is great, but sometimes you want to preserve dimensions instead of automatically stacking them into a MultiIndex.

DataArray.to_pandas() is a shortcut that lets you convert a DataArray directly into a pandas object with the same dimensionality (i.e., a 1D array is converted to a Series, 2D to DataFrame and 3D to Panel):

In [11]: arr = xr.DataArray(np.random.randn(2, 3),
   ....:                    coords=[('x', [10, 20]), ('y', ['a', 'b', 'c'])])
   ....: 

In [12]: df = arr.to_pandas()

In [13]: df
Out[13]: 
y          a         b         c
x                               
10 -0.861849 -2.104569 -0.494929
20  1.071804  0.721555 -0.706771

To perform the inverse operation of converting any pandas objects into a data array with the same shape, simply use the DataArray constructor:

In [14]: xr.DataArray(df)
Out[14]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-0.861849, -2.104569, -0.494929],
       [ 1.071804,  0.721555, -0.706771]])
Coordinates:
  * x        (x) int64 10 20
  * y        (y) object 'a' 'b' 'c'

Both the DataArray and Dataset constructors directly convert pandas objects into xarray objects with the same shape. This means that they preserve all use of multi-indexes:

In [15]: index = pd.MultiIndex.from_arrays([['a', 'a', 'b'], [0, 1, 2]],
   ....:                                   names=['one', 'two'])
   ....: 

In [16]: df = pd.DataFrame({'x': 1, 'y': 2}, index=index)

In [17]: ds = xr.Dataset(df)

In [18]: ds
Out[18]: 
<xarray.Dataset>
Dimensions:  (dim_0: 3)
Coordinates:
  * dim_0    (dim_0) MultiIndex
  - one      (dim_0) object 'a' 'a' 'b'
  - two      (dim_0) int64 0 1 2
Data variables:
    x        (dim_0) int64 1 1 1
    y        (dim_0) int64 2 2 2

However, you will need to set dimension names explicitly, either with the dims argument in the DataArray constructor or by calling rename on the new object.
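
For example, a minimal sketch giving the default dimension a meaningful name (the name 'obs' is purely illustrative):

ds = xr.Dataset(df).rename({'dim_0': 'obs'})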

Transitioning from pandas.Panel to xarray

Panel, pandas’s data structure for 3D arrays, has always been a second-class data structure compared to the Series and DataFrame. To allow pandas developers to focus more on its core functionality built around the DataFrame, pandas plans to eventually deprecate Panel.

xarray has most of Panel’s features, a more explicit API (particularly around indexing), and the ability to scale to >3 dimensions with the same interface.

As discussed elsewhere in the docs, there are two primary data structures in xarray: DataArray and Dataset. You can imagine a DataArray as an n-dimensional pandas Series (i.e. a single typed array), and a Dataset as the DataFrame equivalent (i.e. a dict of aligned DataArray objects).

So you can represent a Panel in two ways:

  • As a 3-dimensional DataArray,
  • Or as a Dataset containing a number of 2-dimensional DataArray objects.

Let’s take a look:

In [19]: panel = pd.Panel(np.random.rand(2, 3, 4), items=list('ab'), major_axis=list('mno'),
   ....:                  minor_axis=pd.date_range(start='2000', periods=4, name='date'))
   ....: 

In [20]: panel
Out[20]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: a to b
Major_axis axis: m to o
Minor_axis axis: 2000-01-01 00:00:00 to 2000-01-04 00:00:00

As a DataArray:

# or equivalently, with Panel.to_xarray()
In [21]: xr.DataArray(panel)
Out[21]: 
<xarray.DataArray (dim_0: 2, dim_1: 3, date: 4)>
array([[[ 0.594784,  0.137554,  0.8529  ,  0.235507],
        [ 0.146227,  0.589869,  0.574012,  0.06127 ],
        [ 0.590426,  0.24535 ,  0.340445,  0.984729]],

       [[ 0.91954 ,  0.037772,  0.861549,  0.753569],
        [ 0.405179,  0.343526,  0.170917,  0.394659],
        [ 0.641666,  0.274592,  0.462354,  0.871372]]])
Coordinates:
  * dim_0    (dim_0) object 'a' 'b'
  * dim_1    (dim_1) object 'm' 'n' 'o'
  * date     (date) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

As you can see, there are three dimensions (each is also a coordinate). Two of the axes of the panel were unnamed, so have been assigned dim_0 and dim_1 respectively, while the third retains its name date.

As a Dataset:

In [22]: xr.Dataset(panel)
Out[22]: 
<xarray.Dataset>
Dimensions:  (date: 4, dim_0: 3)
Coordinates:
  * dim_0    (dim_0) object 'm' 'n' 'o'
  * date     (date) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
Data variables:
    a        (dim_0, date) float64 0.5948 0.1376 0.8529 0.2355 0.1462 0.5899 ...
    b        (dim_0, date) float64 0.9195 0.03777 0.8615 0.7536 0.4052 ...

Here, there are two data variables, each representing a DataFrame on panel’s items axis, and labelled as such. Each variable is a 2D array of the respective values along the items dimension.

While the xarray docs are relatively complete, a few items stand out for Panel users:

  • A DataArray’s data is stored as a numpy array, and so can only contain a single type. As a result, a Panel that contains DataFrame objects with multiple types will be converted to dtype=object. A Dataset of multiple DataArray objects each with its own dtype will allow original types to be preserved.
  • Indexing is similar to pandas, but more explicit and leverages xarray’s naming of dimensions.
  • Because of those features, working with much higher dimensional data is very practical.
  • Variables in Dataset objects can use a subset of the dataset’s dimensions. For example, you can have one dataset with Person x Score x Time, and another with Person x Score.
  • Coordinates are used both for dimensions and for variables which label the data, so you could have a coordinate Age that labels the Person dimension of a Dataset of Person x Score x Time.

While xarray may take some getting used to, it’s worth it! If anything is unclear, please post an issue on GitHub or StackOverflow, and we’ll endeavor to respond to the specific case or improve the general docs.

Serialization and IO

xarray supports direct serialization and IO to several file formats, from simple Pickle files to the more flexible netCDF format.

Pickle

The simplest way to serialize an xarray object is to use Python’s built-in pickle module:

In [1]: import pickle

In [2]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
   ...:                 coords={'x': [10, 20, 30, 40],
   ...:                         'y': pd.date_range('2000-01-01', periods=5),
   ...:                         'z': ('x', list('abcd'))})
   ...: 

# use the highest pickle protocol (-1) because it is much faster than the
# default protocol (which is text-based in Python 2)
In [3]: pkl = pickle.dumps(ds, protocol=-1)

In [4]: pickle.loads(pkl)
Out[4]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 5)
Coordinates:
  * x        (x) int64 10 20 30 40
  * y        (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
    z        (x) <U1 'a' 'b' 'c' 'd'
Data variables:
    foo      (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...

Pickling is important because it doesn’t require any external libraries and lets you use xarray objects with Python modules like multiprocessing or Dask. However, pickling is not recommended for long-term storage.

Restoring a pickle requires that the internal structure of the types for the pickled data remain unchanged. Because the internal design of xarray is still being refined, we make no guarantees (at this point) that objects pickled with this version of xarray will work in future versions.

Note

When pickling an object opened from a NetCDF file, the pickle file will contain a reference to the file on disk. If you want to store the actual array values, load it into memory first with load() or compute().
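
A minimal sketch of this pattern, assuming a netCDF file like the one written in the netCDF section below:

with xr.open_dataset('saved_on_disk.nc') as on_disk:
    in_memory = on_disk.load()  # read all array values into memory
# the pickle now contains actual values, not a reference to a file
pkl = pickle.dumps(in_memory, protocol=-1)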

Dictionary

We can convert a Dataset (or a DataArray) to a dict using to_dict():

In [5]: d = ds.to_dict()

In [6]: d
Out[6]: 
{'attrs': {},
 'coords': {'x': {'attrs': {}, 'data': [10, 20, 30, 40], 'dims': ('x',)},
  'y': {'attrs': {},
   'data': [datetime.datetime(2000, 1, 1, 0, 0),
    datetime.datetime(2000, 1, 2, 0, 0),
    datetime.datetime(2000, 1, 3, 0, 0),
    datetime.datetime(2000, 1, 4, 0, 0),
    datetime.datetime(2000, 1, 5, 0, 0)],
   'dims': ('y',)},
  'z': {'attrs': {}, 'data': ['a', 'b', 'c', 'd'], 'dims': ('x',)}},
 'data_vars': {'foo': {'attrs': {},
   'data': [[0.12696983303810094,
     0.966717838482003,
     0.26047600586578334,
     0.8972365243645735,
     0.37674971618967135],
    [0.33622174433445307,
     0.45137647047539964,
     0.8402550832613813,
     0.12310214428849964,
     0.5430262020470384],
    [0.37301222522143085,
     0.4479968246859435,
     0.12944067971751294,
     0.8598787065799693,
     0.8203883631195572],
    [0.35205353914802473,
     0.2288873043216132,
     0.7767837505077176,
     0.5947835894851238,
     0.1375535565632705]],
   'dims': ('x', 'y')}},
 'dims': {'x': 4, 'y': 5}}

We can create a new xarray object from a dict using from_dict():

In [7]: ds_dict = xr.Dataset.from_dict(d)

In [8]: ds_dict
Out[8]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 5)
Coordinates:
  * x        (x) int64 10 20 30 40
  * y        (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
    z        (x) <U1 'a' 'b' 'c' 'd'
Data variables:
    foo      (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...

Dictionary support allows for flexible use of xarray objects. It doesn’t require external libraries and dicts can easily be pickled, or converted to json, or geojson. All the values are converted to lists, so dicts might be quite large.

netCDF

The recommended way to store xarray data structures is netCDF, which is a binary file format for self-describing datasets that originated in the geosciences. xarray is based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects.

NetCDF is supported on almost all platforms, and parsers exist for the vast majority of scientific programming languages. Recent versions of netCDF are based on the even more widely used HDF5 file-format.

Tip

If you aren’t familiar with this data format, the netCDF FAQ is a good place to start.

Reading and writing netCDF files with xarray requires scipy or the netCDF4-Python library to be installed (the latter is required to read/write netCDF V4 files and use the compression options described below).

We can save a Dataset to disk using the Dataset.to_netcdf method:

In [9]: ds.to_netcdf('saved_on_disk.nc')

By default, the file is saved as netCDF4 (assuming netCDF4-Python is installed). You can control the format and engine used to write the file with the format and engine arguments.
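
For example, here is a hedged sketch of writing the same dataset as a netCDF3 file using the scipy engine (the file name is illustrative):

ds.to_netcdf('saved_on_disk3.nc', format='NETCDF3_64BIT', engine='scipy')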

We can load netCDF files to create a new Dataset using open_dataset():

In [10]: ds_disk = xr.open_dataset('saved_on_disk.nc')

In [11]: ds_disk
Out[11]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 5)
Coordinates:
  * x        (x) int64 10 20 30 40
  * y        (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
    z        (x) object ...
Data variables:
    foo      (x, y) float64 ...

Similarly, a DataArray can be saved to disk using the DataArray.to_netcdf method, and loaded from disk using the open_dataarray() function. As netCDF files correspond to Dataset objects, these functions internally convert the DataArray to a Dataset before saving, and then convert back when loading, ensuring that the DataArray that is loaded is always exactly the same as the one that was saved.

A dataset can also be loaded or written to a specific group within a netCDF file. To load from a group, pass a group keyword argument to the open_dataset function. The group can be specified as a path-like string, e.g., to access subgroup ‘bar’ within group ‘foo’ pass ‘/foo/bar’ as the group argument. When writing multiple groups in one file, pass mode='a' to to_netcdf to ensure that each call does not delete the file.
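
As a sketch (the file and group names here are illustrative):

# write two groups into the same file; mode='a' keeps the first group
ds.to_netcdf('groups.nc', group='/foo/bar')
ds.to_netcdf('groups.nc', group='/foo/baz', mode='a')

# read one group back
ds_bar = xr.open_dataset('groups.nc', group='/foo/bar')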

Data is always loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until you try to perform some sort of actual computation. For an example of how these lazy arrays work, see the OPeNDAP section below.

It is important to note that when you modify values of a Dataset, even one linked to files on disk, only the in-memory copy you are manipulating in xarray is modified: the original file on disk is never touched.

Tip

xarray’s lazy loading of remote or on-disk datasets is often but not always desirable. Before performing computationally intense operations, it is often a good idea to load a Dataset (or DataArray) entirely into memory by invoking the load() method.

Datasets have a close() method to close the associated netCDF file. However, it’s often cleaner to use a with statement:

# this automatically closes the dataset after use
In [12]: with xr.open_dataset('saved_on_disk.nc') as ds:
   ....:     print(ds.keys())
   ....: 
KeysView(<xarray.Dataset>
Dimensions:  (x: 4, y: 5)
Coordinates:
  * x        (x) int64 10 20 30 40
  * y        (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
    z        (x) object ...
Data variables:
    foo      (x, y) float64 ...)

Although xarray provides reasonable support for incremental reads of files on disk, it does not support incremental writes, which can be a useful strategy for dealing with datasets too big to fit into memory. Instead, xarray integrates with dask.array (see Parallel computing with dask), which provides a fully featured engine for streaming computation.

It is possible to append or overwrite netCDF variables using the mode='a' argument. When using this option, all variables in the dataset will be written to the original netCDF file, regardless of whether they already exist in the file on disk.
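
For example, a sketch of appending a new variable to the file written above (the variable name 'bar' is illustrative):

ds['bar'] = ds['foo'] * 2
ds[['bar']].to_netcdf('saved_on_disk.nc', mode='a')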

Reading encoded data

NetCDF files follow some conventions for encoding datetime arrays (as numbers with a “units” attribute) and for packing and unpacking data (as described by the “scale_factor” and “add_offset” attributes). If the argument decode_cf=True (default) is given to open_dataset, xarray will attempt to automatically decode the values in the netCDF objects according to CF conventions. Sometimes this will fail, for example, if a variable has an invalid “units” or “calendar” attribute. For these cases, you can turn this decoding off manually.
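
For example:

# turn off all use of CF conventions, or disable only the datetime decoder
ds_raw = xr.open_dataset('saved_on_disk.nc', decode_cf=False)
ds_no_times = xr.open_dataset('saved_on_disk.nc', decode_times=False)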

You can view this encoding information (among others) in the DataArray.encoding attribute:

In [13]: ds_disk['y'].encoding
Out[13]: 
{'calendar': u'proleptic_gregorian',
 'chunksizes': None,
 'complevel': 0,
 'contiguous': True,
 'dtype': dtype('float64'),
 'fletcher32': False,
 'least_significant_digit': None,
 'shuffle': False,
 'source': 'saved_on_disk.nc',
 'units': u'days since 2000-01-01 00:00:00',
 'zlib': False}

Note that all operations that manipulate variables other than indexing will remove encoding information.

Writing encoded data

Conversely, you can customize how xarray writes netCDF files on disk by providing explicit encodings for each dataset variable. The encoding argument takes a dictionary with variable names as keys and variable specific encodings as values. These encodings are saved as attributes on the netCDF variables on disk, which allows xarray to faithfully read encoded data back into memory.

It is important to note that using encodings is entirely optional: if you do not supply any of these encoding options, xarray will write data to disk using a default encoding, or the options in the encoding attribute, if set. This works perfectly fine in most cases, but encoding can be useful for additional control, especially for enabling compression.

In the file on disk, these encodings are saved as attributes on each variable, which allows xarray and other CF-compliant tools for working with netCDF files to correctly read the data.

Scaling and type conversions

These encoding options work on any version of the netCDF file format:

  • dtype: Any valid NumPy dtype or string convertible to a dtype, e.g., 'int16' or 'float32'. This controls the type of the data written on disk.
  • _FillValue: Values of NaN in xarray variables are remapped to this value when saved on disk. This is important when converting floating point with missing values to integers on disk, because NaN is not a valid value for integer dtypes. As a default, variables with float types are attributed a _FillValue of NaN in the output file, unless explicitly disabled with an encoding {'_FillValue': None}.
  • scale_factor and add_offset: Used to convert from encoded data on disk to the decoded data in memory, according to the formula decoded = scale_factor * encoded + add_offset.

These parameters can be fruitfully combined to compress discretized data on disk. For example, to save the variable foo with a precision of 0.1 in 16-bit integers while converting NaN to -9999, we would use encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue': -9999}}. Compression and decompression with such discretization is extremely fast.
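
Put together as a sketch (the file name is illustrative):

ds.to_netcdf('discretized.nc', encoding={
    'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue': -9999}})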

String encoding

xarray can write unicode strings to netCDF files in two ways:

  • As variable length strings. This is only supported on netCDF4 (HDF5) files.
  • By encoding strings into bytes, and writing encoded bytes as a character array. The default encoding is UTF-8.

By default, we use variable length strings for compatible files and fall back to using encoded character arrays. Character arrays can be selected even for netCDF4 files by setting the dtype field in encoding to S1 (corresponding to NumPy’s single-character bytes dtype).
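
For example, a sketch forcing a character-array representation for the string coordinate z defined earlier (the file name is illustrative):

ds.to_netcdf('char_strings.nc', encoding={'z': {'dtype': 'S1'}})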

If character arrays are used, the string encoding that was used is stored on disk in the _Encoding attribute, which matches an ad-hoc convention adopted by the netCDF4-Python library. At the time of this writing (October 2017), a standard convention for indicating string encoding for character arrays in netCDF files was still under discussion. Technically, you can use any string encoding recognized by Python if you feel the need to deviate from UTF-8, by setting the _Encoding field in encoding. But we don’t recommend it.

Warning

Missing values in bytes or unicode string arrays (represented by NaN in xarray) are currently written to disk as empty strings ''. This means missing values will not be restored when data is loaded from disk. This behavior is likely to change in the future (GH1647). Unfortunately, explicitly setting a _FillValue for string arrays to handle missing values doesn’t work yet either, though we also hope to fix this in the future.

Chunk based compression

zlib, complevel, fletcher32, contiguous and chunksizes can be used for enabling netCDF4/HDF5’s chunk based compression, as described in the documentation for createVariable for netCDF4-Python. This only works for netCDF4 files and thus requires using format='netCDF4' and either engine='netcdf4' or engine='h5netcdf'.
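
For example, a minimal sketch enabling zlib compression for a single variable (the file name is illustrative):

ds.to_netcdf('compressed.nc', format='NETCDF4', engine='netcdf4',
             encoding={'foo': {'zlib': True, 'complevel': 4}})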

Chunk based gzip compression can yield impressive space savings, especially for sparse data, but it comes with significant performance overhead. HDF5 libraries can only read complete chunks back into memory, and maximum decompression speed is in the range of 50-100 MB/s. Worse, HDF5’s compression and decompression currently cannot be parallelized with dask. For these reasons, we recommend trying discretization based compression (described above) first.

Time units

The units and calendar attributes control how xarray serializes datetime64 and timedelta64 arrays to datasets on disk as numeric values. The units encoding should be a string like 'days since 1900-01-01' for datetime64 data or a string like 'days' for timedelta64 data. calendar should be one of the calendar types supported by netCDF4-python: ‘standard’, ‘gregorian’, ‘proleptic_gregorian’, ‘noleap’, ‘365_day’, ‘360_day’, ‘julian’, ‘all_leap’, ‘366_day’.

By default, xarray uses the ‘proleptic_gregorian’ calendar and units of the smallest time difference between values, with a reference time of the first time value.
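
For example, a sketch overriding those defaults for the datetime coordinate y from the dataset above (the file name is illustrative):

ds.to_netcdf('times.nc',
             encoding={'y': {'units': 'days since 2000-01-01',
                             'calendar': 'proleptic_gregorian'}})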

Iris

The Iris tool allows easy reading of common meteorological and climate model formats (including GRIB and UK MetOffice PP files) into Cube objects, which are in many ways very similar to DataArray objects, while enforcing a CF-compliant data model. If iris is installed, xarray can convert a DataArray into a Cube using to_iris():

In [14]: da = xr.DataArray(np.random.rand(4, 5), dims=['x', 'y'],
   ....:                   coords=dict(x=[10, 20, 30, 40],
   ....:                               y=pd.date_range('2000-01-01', periods=5)))
   ....: 

In [15]: cube = da.to_iris()

In [16]: cube
Out[16]: <iris 'Cube' of unknown / (unknown) (x: 4; y: 5)>

Conversely, we can create a new DataArray object from a Cube using from_iris():

In [17]: da_cube = xr.DataArray.from_iris(cube)

In [18]: da_cube
Out[18]: 
<xarray.DataArray (x: 4, y: 5)>
array([[ 0.8529  ,  0.235507,  0.146227,  0.589869,  0.574012],
       [ 0.06127 ,  0.590426,  0.24535 ,  0.340445,  0.984729],
       [ 0.91954 ,  0.037772,  0.861549,  0.753569,  0.405179],
       [ 0.343526,  0.170917,  0.394659,  0.641666,  0.274592]])
Coordinates:
  * x        (x) int64 10 20 30 40
  * y        (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
Attributes:
    units:    unknown

OPeNDAP

xarray includes support for OPeNDAP (via the netCDF4 library or Pydap), which lets us access large datasets over HTTP.

For example, we can open a connection to GBs of weather data produced by the PRISM project, and hosted by IRI at Columbia:

In [19]: remote_data = xr.open_dataset(
   ....:     'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods',
   ....:     decode_times=False)
   ....: 

In [20]: remote_data
Out[20]: 
<xarray.Dataset>
Dimensions:  (T: 1422, X: 1405, Y: 621)
Coordinates:
  * X        (X) float32 -125.0 -124.958 -124.917 -124.875 -124.833 -124.792 -124.75 ...
  * T        (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ...
  * Y        (Y) float32 49.9167 49.875 49.8333 49.7917 49.75 49.7083 49.6667 49.625 ...
Data variables:
    ppt      (T, Y, X) float64 ...
    tdmean   (T, Y, X) float64 ...
    tmax     (T, Y, X) float64 ...
    tmin     (T, Y, X) float64 ...
Attributes:
    Conventions: IRIDL
    expires: 1375315200

Note

Like many real-world datasets, this dataset does not entirely follow CF conventions. Unexpected formats will usually cause xarray’s automatic decoding to fail. The way to work around this is to either set decode_cf=False in open_dataset to turn off all use of CF conventions, or by only disabling the troublesome parser. In this case, we set decode_times=False because the time axis here provides the calendar attribute in a format that xarray does not expect (the integer 360 instead of a string like '360_day').

We can select and slice this data any number of times, and nothing is loaded over the network until we look at particular values:

In [21]: tmax = remote_data['tmax'][:500, ::3, ::3]

In [22]: tmax
Out[22]: 
<xarray.DataArray 'tmax' (T: 500, Y: 207, X: 469)>
[48541500 values with dtype=float64]
Coordinates:
  * Y        (Y) float32 49.9167 49.7917 49.6667 49.5417 49.4167 49.2917 ...
  * X        (X) float32 -125.0 -124.875 -124.75 -124.625 -124.5 -124.375 ...
  * T        (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 ...
Attributes:
    pointwidth: 120
    standard_name: air_temperature
    units: Celsius_scale
    expires: 1443657600

# the data is downloaded automatically when we make the plot
In [23]: tmax[0].plot()
_images/opendap-prism-tmax.png

Some servers require authentication before we can access the data. For this purpose we can explicitly create a PydapDataStore and pass in a Requests session object. For example, for HTTP Basic authentication:

import xarray as xr
import requests

session = requests.Session()
session.auth = ('username', 'password')

store = xr.backends.PydapDataStore.open('http://example.com/data',
                                        session=session)
ds = xr.open_dataset(store)

Pydap’s cas module has functions that generate custom sessions for servers that use CAS single sign-on. For example, to connect to servers that require NASA’s URS authentication:

import xarray as xr
from pydap.cas.urs import setup_session

ds_url = 'https://gpm1.gesdisc.eosdis.nasa.gov/opendap/hyrax/example.nc'

session = setup_session('username', 'password', check_url=ds_url)
store = xr.backends.PydapDataStore.open(ds_url, session=session)

ds = xr.open_dataset(store)

Rasterio

GeoTIFFs and other gridded raster datasets can be opened using rasterio, if rasterio is installed. Here is an example of how to use open_rasterio() to read one of rasterio’s test files:

In [24]: rio = xr.open_rasterio('RGB.byte.tif')

In [25]: rio
Out[25]: 
<xarray.DataArray (band: 3, y: 718, x: 791)>
[1703814 values with dtype=uint8]
Coordinates:
  * band     (band) int64 1 2 3
  * y        (y) float64 2.827e+06 2.826e+06 2.826e+06 2.826e+06 2.826e+06 ...
  * x        (x) float64 1.021e+05 1.024e+05 1.027e+05 1.03e+05 1.033e+05 ...
Attributes:
    res:        (300.0379266750948, 300.041782729805)
    transform:  (300.0379266750948, 0.0, 101985.0, 0.0, -300.041782729805, 28...
    is_tiled:   0
    crs:        +init=epsg:32618

The x and y coordinates are generated out of the file’s metadata (bounds, width, height), and they can be understood as cartesian coordinates defined in the file’s projection provided by the crs attribute. crs is a PROJ4 string which can be parsed by e.g. pyproj or rasterio. See Parsing rasterio’s geocoordinates for an example of how to convert these to longitudes and latitudes.

Warning

This feature has been added in xarray v0.9.6 and should still be considered as being experimental. Please report any bug you may find on xarray’s github repository.

Zarr

Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays. Zarr has the ability to store arrays in a range of ways, including in memory, in files, and in cloud-based object storage such as Amazon S3 and Google Cloud Storage. Xarray’s Zarr backend allows xarray to leverage these capabilities.

Warning

Zarr support is still an experimental feature. Please report any bugs or unexpected behavior via github issues.

Xarray can’t open just any zarr dataset, because xarray requires special metadata (attributes) describing the dataset dimensions and coordinates. At this time, xarray can only open zarr datasets that have been written by xarray. To write a dataset with zarr, we use the Dataset.to_zarr method. To write to a local directory, we pass a path to a directory

In [26]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
   ....:                 coords={'x': [10, 20, 30, 40],
   ....:                         'y': pd.date_range('2000-01-01', periods=5),
   ....:                         'z': ('x', list('abcd'))})
   ....: 

In [27]: ds.to_zarr('path/to/directory.zarr')
Out[27]: <xarray.backends.zarr.ZarrStore at 0x7f10cf0b2a20>

(The suffix .zarr is optional; it is just a reminder that a zarr store lives there.) If the directory does not exist, it will be created. If a zarr store is already present at that path, an error will be raised, preventing it from being overwritten. To override this behavior and overwrite an existing store, add mode='w' when invoking to_zarr.

To read back a zarr dataset that has been created this way, we use the open_zarr() method:

In [28]: ds_zarr = xr.open_zarr('path/to/directory.zarr')

In [29]: ds_zarr
Out[29]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 5)
Coordinates:
  * x        (x) int64 10 20 30 40
  * y        (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
    z        (x) <U1 dask.array<shape=(4,), chunksize=(4,)>
Data variables:
    foo      (x, y) float64 dask.array<shape=(4, 5), chunksize=(4, 5)>

Cloud Storage Buckets

It is possible to read and write xarray datasets directly from / to cloud storage buckets using zarr. This example uses the gcsfs package to provide a MutableMapping interface to Google Cloud Storage, which we can then pass to xarray:

import gcsfs
fs = gcsfs.GCSFileSystem(project='<project-name>', token=None)
gcsmap = gcsfs.mapping.GCSMap('<bucket-name>', gcs=fs, check=True, create=False)
# write to the bucket
ds.to_zarr(store=gcsmap)
# read it back
ds_gcs = xr.open_zarr(gcsmap)

Zarr Compressors and Filters

There are many different options for compression and filtering possible with zarr. These are described in the zarr documentation. These options can be passed to the to_zarr method as variable encoding. For example:

In [30]: import zarr

In [31]: compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)

In [32]: ds.to_zarr('foo.zarr', encoding={'foo': {'compressor': compressor}})
Out[32]: <xarray.backends.zarr.ZarrStore at 0x7f10ce1736a0>

Note

Not all native zarr compression and filtering options have been tested with xarray.

Formats supported by PyNIO

xarray can also read GRIB, HDF4 and other file formats supported by PyNIO, if PyNIO is installed. To use PyNIO to read such files, supply engine='pynio' to open_dataset().

We recommend installing PyNIO via conda:

conda install -c conda-forge pynio

Formats supported by Pandas

For more options (tabular formats and CSV files in particular), consider exporting your objects to pandas and using its broad range of IO tools.

Combining multiple files

NetCDF files are often encountered in collections, e.g., with different files corresponding to different model runs. xarray can straightforwardly combine such files into a single Dataset by making use of concat().

Note

Xarray includes support for manipulating datasets that don’t fit into memory with dask. If you have dask installed, you can open multiple files simultaneously using open_mfdataset():

xr.open_mfdataset('my/files/*.nc')

This function automatically concatenates and merges multiple files into a single xarray dataset. It is the recommended way to open multiple files with xarray. For more details, see Reading and writing data and a blog post by Stephan Hoyer.

For example, here’s how we could approximate MFDataset from the netCDF4 library:

from glob import glob
import xarray as xr

def read_netcdfs(files, dim):
    # glob expands paths with * to a list of files, like the unix shell
    paths = sorted(glob(files))
    datasets = [xr.open_dataset(p) for p in paths]
    combined = xr.concat(datasets, dim)
    return combined

combined = read_netcdfs('/all/my/files/*.nc', dim='time')

This function will work in many cases, but it’s not very robust. First, it never closes files, which means it will fail once you need to load more than a few thousand files. Second, it assumes that you want all the data from each file and that it can all fit into memory. In many situations, you only need a small subset or an aggregated summary of the data from each file.

Here’s a slightly more sophisticated example of how to remedy these deficiencies:

def read_netcdfs(files, dim, transform_func=None):
    def process_one_path(path):
        # use a context manager, to ensure the file gets closed after use
        with xr.open_dataset(path) as ds:
            # transform_func should do some sort of selection or
            # aggregation
            if transform_func is not None:
                ds = transform_func(ds)
            # load all data from the transformed dataset, to ensure we can
            # use it after closing each original file
            ds.load()
            return ds

    paths = sorted(glob(files))
    datasets = [process_one_path(p) for p in paths]
    combined = xr.concat(datasets, dim)
    return combined

# here we suppose we only care about the combined mean of each file;
# you might also use indexing operations like .sel to subset datasets
combined = read_netcdfs('/all/my/files/*.nc', dim='time',
                        transform_func=lambda ds: ds.mean())

This pattern works well and is very robust. We’ve used similar code to process tens of thousands of files constituting 100s of GB of data.

Parallel computing with dask

xarray integrates with dask to support parallel computations and streaming computation on datasets that don’t fit into memory.

Currently, dask is an entirely optional feature for xarray. However, the benefits of using dask are sufficiently strong that dask may become a required dependency in a future version of xarray.

For a full example of how to use xarray’s dask integration, read the blog post introducing xarray and dask.

What is a dask array?

Dask divides arrays into many small pieces, called chunks, each of which is presumed to be small enough to fit into memory.

Unlike NumPy, which has eager evaluation, operations on dask arrays are lazy. Operations queue up a series of tasks mapped over blocks, and no computation is performed until you actually ask values to be computed (e.g., to print results to your screen or write to disk). At that point, data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.

The actual computation is controlled by a multi-processing or thread pool, which allows dask to take full advantage of multiple processors available on most modern computers.

For more details on dask, read its documentation.

Reading and writing data

The usual way to create a dataset filled with dask arrays is to load the data from a netCDF file or files. You can do this by supplying a chunks argument to open_dataset() or using the open_mfdataset() function.

In [1]: ds = xr.open_dataset('example-data.nc', chunks={'time': 10})

In [2]: ds
Out[2]: 
<xarray.Dataset>
Dimensions:      (latitude: 180, longitude: 360, time: 365)
Coordinates:
  * time         (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
  * longitude    (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * latitude     (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
Data variables:
    temperature  (time, latitude, longitude) float64 dask.array<shape=(365, 180, 360), chunksize=(10, 180, 360)>

In this example latitude and longitude do not appear in the chunks dict, so only one chunk will be used along those dimensions. It is also entirely equivalent to open a dataset using open_dataset and then chunk the data using the chunk method, e.g., xr.open_dataset('example-data.nc').chunk({'time': 10}).

To open multiple files simultaneously, use open_mfdataset():

xr.open_mfdataset('my/files/*.nc')

This function will automatically concatenate and merge datasets into one in the simple cases that it understands (see auto_combine() for the full disclaimer). By default, open_mfdataset will chunk each netCDF file into a single dask array; again, supply the chunks argument to control the size of the resulting dask arrays. In more complex cases, you can open each file individually using open_dataset and merge the result, as described in Combining data.

You’ll notice that printing a dataset still shows a preview of array values, even if they are actually dask arrays. We can do this quickly with dask because we only need to compute the first few values (typically from the first block). To reveal the true nature of an array, print a DataArray:

In [3]: ds.temperature
Out[3]: 
<xarray.DataArray 'temperature' (time: 365, latitude: 180, longitude: 360)>
dask.array<shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>
Coordinates:
  * time       (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
  * longitude  (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
  * latitude   (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...

Once you’ve manipulated a dask array, you can still write a dataset too big to fit into memory back to disk by using to_netcdf() in the usual way.

Note

When using dask’s distributed scheduler to write NETCDF4 files, it may be necessary to set the environment variable HDF5_USE_FILE_LOCKING=FALSE to avoid competing locks within the HDF5 SWMR file locking scheme. Note that writing netCDF files with dask’s distributed scheduler is only supported for the netcdf4 backend.

A dataset can also be converted to a dask DataFrame using to_dask_dataframe().

In [4]: df = ds.to_dask_dataframe()

In [5]: df
Out[5]: 
Dask DataFrame Structure:
               latitude longitude            time temperature
npartitions=44                                               
0               float64     int64  datetime64[ns]     float64
525600              ...       ...             ...         ...
...                 ...       ...             ...         ...
22600800            ...       ...             ...         ...
23651999            ...       ...             ...         ...
Dask Name: concat-indexed, 1625 tasks

Dask DataFrames do not support multi-indexes so the coordinate variables from the dataset are included as columns in the dask DataFrame.

Using dask with xarray

Nearly all existing xarray methods (including those for indexing, computation, concatenating and grouped operations) have been extended to work automatically with dask arrays. When you load data as a dask array in an xarray data structure, almost all xarray operations will keep it as a dask array; when this is not possible, they will raise an exception rather than unexpectedly loading data into memory. Converting a dask array into memory generally requires an explicit conversion step. One notable exception is indexing operations: to enable label based indexing, xarray will automatically load coordinate labels into memory.

The easiest way to convert an xarray data structure from lazy dask arrays into eager, in-memory numpy arrays is to use the load() method:

In [6]: ds.load()
Out[6]: 
<xarray.Dataset>
Dimensions:      (latitude: 180, longitude: 360, time: 365)
Coordinates:
  * time         (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
  * longitude    (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * latitude     (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
Data variables:
    temperature  (time, latitude, longitude) float64 0.4691 -0.2829 -1.509 ...

You can also access values, which will always be a numpy array:

In [7]: ds.temperature.values
Out[7]: 
array([[[  4.691e-01,  -2.829e-01, ...,  -5.577e-01,   3.814e-01],
        [  1.337e+00,  -1.531e+00, ...,   8.726e-01,  -1.538e+00],
        ...
# truncated for brevity

Explicit conversion by wrapping a DataArray with np.asarray also works:

In [8]: np.asarray(ds.temperature)
Out[8]: 
array([[[  4.691e-01,  -2.829e-01, ...,  -5.577e-01,   3.814e-01],
        [  1.337e+00,  -1.531e+00, ...,   8.726e-01,  -1.538e+00],
        ...

Alternatively, you can load the data into memory but keep the arrays as dask arrays using the persist() method:
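
# persist keeps the arrays as (chunked) dask arrays, but computes and
# caches their values in memory
ds = ds.persist()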

This is particularly useful when using a distributed cluster because the data will be loaded into distributed memory across your machines and be much faster to use than reading repeatedly from disk. Be warned that on a single machine this operation will try to load all of your data into memory, so you should make sure that your dataset is not larger than available memory.

For performance you may wish to consider chunk sizes. The correct choice of chunk size depends both on your data and on the operations you want to perform. With xarray, both converting data to dask arrays and changing the chunk sizes of dask arrays are done with the chunk() method:

In [9]: rechunked = ds.chunk({'latitude': 100, 'longitude': 100})

You can view the size of existing chunks on an array by viewing the chunks attribute:

In [10]: rechunked.chunks
Out[10]: Frozen(SortedKeysDict({'time': (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 5), 'latitude': (100, 80), 'longitude': (100, 100, 100, 60)}))

If chunk sizes are not consistent between all the arrays in a dataset along a particular dimension, an exception is raised when you try to access .chunks.

Note

In the future, we would like to enable automatic alignment of dask chunksizes (but not the other way around). We might also require that all arrays in a dataset share the same chunking alignment. Neither of these are currently done.

NumPy ufuncs like np.sin currently only work on eagerly evaluated arrays (this will change with the next major NumPy release). We have provided replacements that also work on all xarray objects, including those that store lazy dask arrays, in the xarray.ufuncs module:

In [11]: import xarray.ufuncs as xu

In [12]: xu.sin(rechunked)
Out[12]: 
<xarray.Dataset>
Dimensions:      (latitude: 180, longitude: 360, time: 365)
Coordinates:
  * time         (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
  * longitude    (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * latitude     (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
Data variables:
    temperature  (time, latitude, longitude) float64 dask.array<shape=(365, 180, 360), chunksize=(10, 100, 100)>

To access dask arrays directly, use the new DataArray.data attribute. This attribute exposes array data either as a dask array or as a numpy array, depending on whether it has been loaded into dask or not:

In [13]: ds.temperature.data
Out[13]: dask.array<xarray-temperature, shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>

Note

In the future, we may extend .data to support other “computable” array backends beyond dask and numpy (e.g., to support sparse arrays).

Automatic parallelization

Almost all of xarray’s built-in operations work on dask arrays. If you want to use a function that isn’t wrapped by xarray, one option is to extract dask arrays from xarray objects (.data) and use dask directly.
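
A trivial sketch of that first option:

data = ds.temperature.data    # the underlying dask array
time_mean = data.mean(axis=0)  # operate on it with dask directly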

Another option is to use xarray’s apply_ufunc(), which can automate embarrassingly parallel “map” type operations where a function written for processing NumPy arrays is repeatedly applied to xarray objects containing dask arrays. It works similarly to dask.array.map_blocks() and dask.array.atop(), but without requiring an intermediate layer of abstraction.

For the best performance when using dask’s multi-threaded scheduler, wrap a function that already releases the global interpreter lock; fortunately, most NumPy and SciPy functions already do. Here we show an example using NumPy operations and a fast function from bottleneck, which we use to calculate Spearman’s rank-correlation coefficient:

import numpy as np
import xarray as xr
import bottleneck

def covariance_gufunc(x, y):
    return ((x - x.mean(axis=-1, keepdims=True))
            * (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)

def pearson_correlation_gufunc(x, y):
    return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))

def spearman_correlation_gufunc(x, y):
    x_ranks = bottleneck.rankdata(x, axis=-1)
    y_ranks = bottleneck.rankdata(y, axis=-1)
    return pearson_correlation_gufunc(x_ranks, y_ranks)

def spearman_correlation(x, y, dim):
    return xr.apply_ufunc(
        spearman_correlation_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask='parallelized',
        output_dtypes=[float])

The only aspect of this example that is different from standard usage of apply_ufunc() is that we needed to supply the output_dtypes argument. (Read up on Wrapping custom computation for an explanation of the “core dimensions” listed in input_core_dims.)

Our new spearman_correlation() function achieves near linear speedup when run on large arrays across the four cores on my laptop. It would also work as a streaming operation, when run on arrays loaded from disk:

In [14]: rs = np.random.RandomState(0)

In [15]: array1 = xr.DataArray(rs.randn(1000, 100000), dims=['place', 'time'])  # 800MB

In [16]: array2 = array1 + 0.5 * rs.randn(1000, 100000)

# using one core, on numpy arrays
In [17]: %time _ = spearman_correlation(array1, array2, 'time')
CPU times: user 21.6 s, sys: 2.84 s, total: 24.5 s
Wall time: 24.9 s

In [18]: chunked1 = array1.chunk({'place': 10})

In [19]: chunked2 = array2.chunk({'place': 10})

In [20]: r = spearman_correlation(chunked1, chunked2, 'time')

# using all my laptop's cores, with dask
In [21]: %time _ = r.compute()
CPU times: user 30.9 s, sys: 1.74 s, total: 32.6 s
Wall time: 4.59 s

One limitation of apply_ufunc() is that it cannot be applied to arrays with multiple chunks along a core dimension:

In [22]: spearman_correlation(chunked1, chunked2, 'place')
ValueError: dimension 'place' on 0th function argument to apply_ufunc with
dask='parallelized' consists of multiple chunks, but is also a core
dimension. To fix, rechunk into a single dask array chunk along this
dimension, i.e., ``.rechunk({'place': -1})``, but beware that this may
significantly increase memory usage.

This reflects the nature of core dimensions, in contrast to broadcast (non-core) dimensions, which allow operations to be split into arbitrary chunks for application.

Tip

For the majority of NumPy functions that are already wrapped by dask, it’s usually a better idea to use the pre-existing dask.array function, either via pre-existing xarray methods or via apply_ufunc() with dask='allowed'. Dask can often have a more efficient implementation that makes use of the specialized structure of a problem, unlike the generic speedups offered by dask='parallelized'.

Chunking and performance

The chunks parameter has critical performance implications when using dask arrays. If your chunks are too small, queueing up operations will be extremely slow, because dask translates each operation into a huge number of operations mapped across chunks. Computation on dask arrays with small chunks can also be slow, because each operation on a chunk has some fixed overhead from the Python interpreter and the dask task executor.

Conversely, if your chunks are too big, some of your computation may be wasted, because dask only computes results one chunk at a time.

A good rule of thumb is to create arrays with a minimum chunk size of at least one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB), the cost of queueing up dask operations can be noticeable, and you may need even larger chunk sizes.

Optimization Tips

With analysis pipelines involving both spatial subsetting and temporal resampling, dask performance can become very slow in certain cases. Here are some optimization tips we have found through experience:

  1. Do your spatial and temporal indexing (e.g. .sel() or .isel()) early in the pipeline, especially before calling resample() or groupby(). Grouping and resampling trigger some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn’t been implemented in dask yet. (See dask issue #746).
  2. Save intermediate results to disk as netCDF files (using to_netcdf()) and then load them again with open_dataset() for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See dask issue #874)
  3. Specify smaller chunks across space when using open_mfdataset() (e.g., chunks={'latitude': 10, 'longitude': 10}). This makes spatial subsetting easier, because there’s no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).

Plotting

Introduction

Labeled data enables expressive computations. These same labels can also be used to easily create informative plots.

xarray’s plotting capabilities are centered around xarray.DataArray objects. To plot xarray.Dataset objects simply access the relevant DataArrays, i.e., dset['var1']. Here we focus mostly on arrays 2D or larger. If your data fits nicely into a pandas DataFrame then you’re better off using one of the more developed tools there.

xarray plotting functionality is a thin wrapper around the popular matplotlib library. Matplotlib syntax and function names were copied as much as possible, which makes for an easy transition between the two. Matplotlib must be installed before xarray can plot.

For more extensive plotting applications consider the following projects:

  • Seaborn: “provides a high-level interface for drawing attractive statistical graphics.” Integrates well with pandas.
  • HoloViews and GeoViews: “Composable, declarative data structures for building even complex visualizations easily.” Includes native support for xarray objects.
  • Cartopy: Provides cartographic tools.

Imports

The following imports are necessary for all of the examples.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import matplotlib.pyplot as plt

In [4]: import xarray as xr

For these examples we’ll use the North American air temperature dataset.

In [5]: airtemps = xr.tutorial.load_dataset('air_temperature')

In [6]: airtemps
Out[6]: 
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
  * lon      (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
  * time     (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ...
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 244.0 244.1 243.89 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

# Convert to Celsius
In [7]: air = airtemps.air - 273.15

One Dimension

Simple Example

xarray uses the coordinate name to label the x axis.

In [8]: air1d = air.isel(lat=10, lon=10)

In [9]: air1d.plot()
Out[9]: [<matplotlib.lines.Line2D at 0x7f10cf4541d0>]
_images/plotting_1d_simple.png
Additional Arguments

Additional arguments are passed directly to the matplotlib function which does the work. For example, xarray.plot.line() calls matplotlib.pyplot.plot passing in the index and the array values as x and y, respectively. So to make a line plot with blue triangles a matplotlib format string can be used:

In [10]: air1d[:200].plot.line('b-^')
Out[10]: [<matplotlib.lines.Line2D at 0x7f10cf4d66a0>]
_images/plotting_1d_additional_args.png

Note

Not all xarray plotting methods support passing positional arguments to the wrapped matplotlib functions, but they do all support keyword arguments.

Keyword arguments work the same way, and are more explicit.

In [11]: air1d[:200].plot.line(color='purple', marker='o')
Out[11]: [<matplotlib.lines.Line2D at 0x7f10ce0e7cf8>]
_images/plotting_example_sin3.png
Adding to Existing Axis

To add the plot to an existing axis pass in the axis as a keyword argument ax. This works for all xarray plotting methods. In this example axes is an array consisting of the left and right axes created by plt.subplots.

In [12]: fig, axes = plt.subplots(ncols=2)

In [13]: axes
Out[13]: 
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f10ce0e1780>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f10ce13aa20>], dtype=object)

In [14]: air1d.plot(ax=axes[0])
Out[14]: [<matplotlib.lines.Line2D at 0x7f10ce12c5f8>]

In [15]: air1d.plot.hist(ax=axes[1])
Out[15]: 
(array([   9.,   38.,  255.,  584.,  542.,  489.,  368.,  258.,  327.,   50.]),
 array([  0.95 ,   2.719,   4.488, ...,  15.102,  16.871,  18.64 ]),
 <a list of 10 Patch objects>)

In [16]: plt.tight_layout()

In [17]: plt.show()
_images/plotting_example_existing_axes.png

On the right is a histogram created by xarray.plot.hist().

Controlling the figure size

You can pass a figsize argument to all xarray’s plotting methods to control the figure size. For convenience, xarray’s plotting methods also support the aspect and size arguments which control the size of the resulting image via the formula figsize = (aspect * size, size):

In [18]: air1d.plot(aspect=2, size=3)
Out[18]: [<matplotlib.lines.Line2D at 0x7f10ce12cdd8>]

In [19]: plt.tight_layout()
_images/plotting_example_size_and_aspect.png

This feature also works with Faceting. For facet plots, size and aspect refer to a single panel (so that aspect * size gives the width of each facet in inches), while figsize refers to the entire figure (as for matplotlib’s figsize argument).

Note

If figsize or size are used, a new figure is created, so this is mutually exclusive with the ax argument.

Note

The convention used by xarray (figsize = (aspect * size, size)) is borrowed from seaborn: it is therefore not equivalent to matplotlib’s.

Multiple lines showing variation along a dimension

It is possible to make line plots of two-dimensional data by calling xarray.plot.line() with appropriate arguments. Consider the 3D variable air defined above. We can use line plots to check the variation of air temperature at three different latitudes along a longitude line:

In [20]: air.isel(lon=10, lat=[19,21,22]).plot.line(x='time')
Out[20]: 
[<matplotlib.lines.Line2D at 0x7f10cf43cac8>,
 <matplotlib.lines.Line2D at 0x7f10cf4af588>,
 <matplotlib.lines.Line2D at 0x7f10cf4af160>]
_images/plotting_example_multiple_lines_x_kwarg.png

It is required to explicitly specify either

  1. x: the dimension to be used for the x-axis, or
  2. hue: the dimension you want to represent by multiple lines.

Thus, we could have made the previous plot by specifying hue='lat' instead of x='time'. If required, the automatic legend can be turned off using add_legend=False.
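
For example, the previous plot could equivalently have been written as (a sketch, not executed here):

    air.isel(lon=10, lat=[19, 21, 22]).plot.line(hue='lat')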

Dimension along y-axis

It is also possible to make line plots such that the data are on the x-axis and a dimension is on the y-axis. This can be done by specifying the appropriate y keyword argument.

In [21]: air.isel(time=10, lon=[10, 11]).plot.line(y='lat', hue='lon')
Out[21]: 
[<matplotlib.lines.Line2D at 0x7f10cf504048>,
 <matplotlib.lines.Line2D at 0x7f10cf504fd0>]
_images/plotting_example_xy_kwarg.png

Two Dimensions

Simple Example

The default method xarray.DataArray.plot() calls xarray.plot.pcolormesh() when the data is two-dimensional.

In [22]: air2d = air.isel(time=500)

In [23]: air2d.plot()
Out[23]: <matplotlib.collections.QuadMesh at 0x7f10cf3c1fd0>
_images/2d_simple.png

All 2d plots in xarray allow the use of the keyword arguments yincrease and xincrease.

In [24]: air2d.plot(yincrease=False)
Out[24]: <matplotlib.collections.QuadMesh at 0x7f10cf6ae198>
_images/2d_simple_yincrease.png

Note

We use xarray.plot.pcolormesh() as the default two-dimensional plot method because it is more flexible than xarray.plot.imshow(). However, for large arrays, imshow can be much faster than pcolormesh. If speed is important to you and you are plotting a regular mesh, consider using imshow.
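
For instance, the imshow variant of the plot above would simply be (a sketch, not executed here):

    air2d.plot.imshow()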

Missing Values

xarray plots data with missing values.

In [25]: bad_air2d = air2d.copy()

In [26]: bad_air2d[dict(lat=slice(0, 10), lon=slice(0, 25))] = np.nan

In [27]: bad_air2d.plot()
Out[27]: <matplotlib.collections.QuadMesh at 0x7f10cf76a320>
_images/plotting_missing_values.png
Nonuniform Coordinates

It’s not necessary for the coordinates to be evenly spaced. Both xarray.plot.pcolormesh() (default) and xarray.plot.contourf() can produce plots with nonuniform coordinates.

In [28]: b = air2d.copy()

# Apply a nonlinear transformation to one of the coords
In [29]: b.coords['lat'] = np.log(b.coords['lat'])

In [30]: b.plot()
Out[30]: <matplotlib.collections.QuadMesh at 0x7f10cf784438>
_images/plotting_nonuniform_coords.png
Calling Matplotlib

Since this is a thin wrapper around matplotlib, all the functionality of matplotlib is available.

In [31]: air2d.plot(cmap=plt.cm.Blues)
Out[31]: <matplotlib.collections.QuadMesh at 0x7f10cf75cb38>

In [32]: plt.title('These colors prove North America\nhas fallen in the ocean')
Out[32]: Text(0.5,1,'These colors prove North America\nhas fallen in the ocean')

In [33]: plt.ylabel('latitude')
Out[33]: Text(0,0.5,'latitude')

In [34]: plt.xlabel('longitude')
Out[34]: Text(0.5,0,'longitude')

In [35]: plt.tight_layout()

In [36]: plt.show()
_images/plotting_2d_call_matplotlib.png

Note

xarray methods update label information and generally play around with the axes. So any kind of updates to the plot should be done after the call to xarray’s plot. In the example below, plt.xlabel effectively does nothing, since air2d.plot() updates the xlabel.

In [37]: plt.xlabel('Never gonna see this.')
Out[37]: Text(0.5,0,'Never gonna see this.')

In [38]: air2d.plot()
Out[38]: <matplotlib.collections.QuadMesh at 0x7f10ce0ed4a8>

In [39]: plt.show()
_images/plotting_2d_call_matplotlib2.png
Colormaps

xarray borrows logic from Seaborn to infer what kind of color map to use. For example, consider the original data in Kelvins rather than Celsius:

In [40]: airtemps.air.isel(time=0).plot()
Out[40]: <matplotlib.collections.QuadMesh at 0x7f10ccee70f0>
_images/plotting_kelvin.png

The Celsius data contain 0, so a diverging color map was used. The Kelvins do not have 0, so the default color map was used.
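
If the inferred choice is not what you want, you can steer it explicitly. For example, the center keyword forces a diverging colormap centered on a given value; here, 0 degrees Celsius expressed in Kelvin (a sketch, not executed here):

    airtemps.air.isel(time=0).plot(center=273.15)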

Robust

Outliers often have an extreme effect on the output of the plot. Here we add two bad data points. This affects the color scale, washing out the plot.

In [41]: air_outliers = airtemps.air.isel(time=0).copy()

In [42]: air_outliers[0, 0] = 100

In [43]: air_outliers[-1, -1] = 400

In [44]: air_outliers.plot()
Out[44]: <matplotlib.collections.QuadMesh at 0x7f10cce721d0>
_images/plotting_robust1.png

This plot shows that we have outliers. The easy way to visualize the data without the outliers is to pass the parameter robust=True. This will use the 2nd and 98th percentiles of the data to compute the color limits.

In [45]: air_outliers.plot(robust=True)
Out[45]: <matplotlib.collections.QuadMesh at 0x7f10cce0a940>
_images/plotting_robust2.png

Observe that the ranges of the color bar have changed. The arrows on the color bar indicate that the colors include data points outside the bounds.
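
If you would rather set the color limits explicitly than rely on percentiles, vmin and vmax can be passed directly (the values below are illustrative):

    air_outliers.plot(vmin=230, vmax=315)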

Discrete Colormaps

It is often useful, when visualizing 2d data, to use a discrete colormap, rather than the default continuous colormaps that matplotlib uses. The levels keyword argument can be used to generate plots with discrete colormaps. For example, to make a plot with 8 discrete color intervals:

In [46]: air2d.plot(levels=8)
Out[46]: <matplotlib.collections.QuadMesh at 0x7f10ccd94400>
_images/plotting_discrete_levels.png

It is also possible to use a list of levels to specify the boundaries of the discrete colormap:

In [47]: air2d.plot(levels=[0, 12, 18, 30])
Out[47]: <matplotlib.collections.QuadMesh at 0x7f10ccd0f588>
_images/plotting_listed_levels.png

You can also specify a list of discrete colors through the colors argument:

In [48]: flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]

In [49]: air2d.plot(levels=[0, 12, 18, 30], colors=flatui)
Out[49]: <matplotlib.collections.QuadMesh at 0x7f10cf6bb320>
_images/plotting_custom_colors_levels.png

Finally, if you have Seaborn installed, you can also specify a seaborn color palette to the cmap argument. Note that levels must be specified with seaborn color palettes if using imshow or pcolormesh (but not with contour or contourf, since levels are chosen automatically).

In [50]: air2d.plot(levels=10, cmap='husl')
Out[50]: <matplotlib.collections.QuadMesh at 0x7f10ccd4aac8>
_images/plotting_seaborn_palette.png

Faceting

Faceting here refers to splitting an array along one or two dimensions and plotting each group. xarray’s basic plotting is useful for plotting two dimensional arrays. What about three or four dimensional arrays? That’s where facets become helpful.

Consider the temperature data set. There are 4 observations per day for two years, which makes for 2920 values along the time dimension. One way to visualize this data is to make a separate plot for each time period.

The faceted dimension should not have too many values; faceting on the time dimension will produce 2920 plots. That’s too many to be helpful. To handle this situation, try performing an operation that reduces the size of the data in some way. For example, we could compute the average air temperature for each month and reduce the size of this dimension from 2920 to 12 (see the sketch below). A simpler way is to just take a slice on that dimension. So let’s use a slice to pick 6 times throughout the first year.
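
As a sketch, the monthly-mean reduction mentioned above could be written with groupby (not used in the example that follows):

    monthly = air.groupby('time.month').mean('time')  # time: 2920 -> month: 12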

In [51]: t = air.isel(time=slice(0, 365 * 4, 250))

In [52]: t.coords
Out[52]: 
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
  * lon      (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
  * time     (time) datetime64[ns] 2013-01-01 2013-03-04T12:00:00 2013-05-06 ...
Simple Example

The easiest way to create faceted plots is to pass in row or col arguments to the xarray plotting methods/functions. This returns an xarray.plot.FacetGrid object.

In [53]: g_simple = t.plot(x='lon', y='lat', col='time', col_wrap=3)
_images/plot_facet_dataarray.png
4 dimensional

For 4 dimensional arrays we can use the rows and columns of the grids. Here we create a 4 dimensional array by taking the original data and adding a fixed amount. Now we can see how the temperature maps would compare if one were much hotter.

In [54]: t2 = t.isel(time=slice(0, 2))

In [55]: t4d = xr.concat([t2, t2 + 40], pd.Index(['normal', 'hot'], name='fourth_dim'))

# This is a 4d array
In [56]: t4d.coords
Out[56]: 
Coordinates:
  * lat         (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 ...
  * lon         (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 ...
  * time        (time) datetime64[ns] 2013-01-01 2013-03-04T12:00:00
  * fourth_dim  (fourth_dim) object 'normal' 'hot'

In [57]: t4d.plot(x='lon', y='lat', col='time', row='fourth_dim')
Out[57]: <xarray.plot.facetgrid.FacetGrid at 0x7f10ccb29a58>
_images/plot_facet_4d.png
Other features

Faceted plotting supports other arguments common to xarray 2d plots.

In [58]: hasoutliers = t.isel(time=slice(0, 5)).copy()

In [59]: hasoutliers[0, 0, 0] = -100

In [60]: hasoutliers[-1, -1, -1] = 400

In [61]: g = hasoutliers.plot.pcolormesh('lon', 'lat', col='time', col_wrap=3,
   ....:                                 robust=True, cmap='viridis')
   ....: 
_images/plot_facet_robust.png
FacetGrid Objects

xarray.plot.FacetGrid is used to control the behavior of the multiple plots. It borrows an API and code from Seaborn’s FacetGrid. The structure is contained within the axes and name_dicts attributes, both 2d NumPy object arrays.

In [62]: g.axes
Out[62]: 
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f10cc999160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f10cc908a20>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f10cc866278>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f10cc890fd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f10cc8457f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f10cc7fe080>]], dtype=object)

In [63]: g.name_dicts
Out[63]: 
array([[{'time': numpy.datetime64('2013-01-01T00:00:00.000000000')},
        {'time': numpy.datetime64('2013-03-04T12:00:00.000000000')},
        {'time': numpy.datetime64('2013-05-06T00:00:00.000000000')}],
       [{'time': numpy.datetime64('2013-07-07T12:00:00.000000000')},
        {'time': numpy.datetime64('2013-09-08T00:00:00.000000000')}, None]], dtype=object)

It’s possible to select the xarray.DataArray or xarray.Dataset corresponding to the FacetGrid through the name_dicts.

In [64]: g.data.loc[g.name_dicts[0, 0]]
Out[64]: 
<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[-100.      ,  -30.649994,  -29.649994, ...,  -40.350006,  -37.649994,
         -34.550003],
       [ -29.350006,  -28.649994,  -28.449997, ...,  -40.350006,  -37.850006,
         -33.850006],
       [ -23.149994,  -23.350006,  -24.259995, ...,  -39.949997,  -36.759995,
         -31.449997],
       ..., 
       [  23.450012,   23.049988,   23.25    , ...,   22.25    ,   21.950012,
          21.549988],
       [  22.75    ,   23.049988,   23.640015, ...,   22.75    ,   22.75    ,
          22.049988],
       [  23.140015,   23.640015,   23.950012, ...,   23.75    ,   23.640015,
          23.450012]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
  * lon      (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
    time     datetime64[ns] 2013-01-01

Here is an example of using the lower level API and then modifying the axes after they have been plotted.

In [65]: g = t.plot.imshow('lon', 'lat', col='time', col_wrap=3, robust=True)

In [66]: for i, ax in enumerate(g.axes.flat):
   ....:     ax.set_title('Air Temperature %d' % i)
   ....: 

In [67]: bottomright = g.axes[-1, -1]

In [68]: bottomright.annotate('bottom right', (240, 40))
Out[68]: Text(240,40,'bottom right')

In [69]: plt.show()
_images/plot_facet_iterator.png

TODO: add an example of using the map method to plot dataset variables (e.g., with plt.quiver).

Maps

To follow this section you’ll need to have Cartopy installed and working.

This script will plot the air temperature on a map.

In [70]: import cartopy.crs as ccrs

In [71]: air = xr.tutorial.load_dataset('air_temperature').air

In [72]: ax = plt.axes(projection=ccrs.Orthographic(-80, 35))

In [73]: air.isel(time=0).plot.contourf(ax=ax, transform=ccrs.PlateCarree());

In [74]: ax.set_global(); ax.coastlines();
_images/plotting_maps_cartopy.png

When faceting on maps, the projection can be transferred to the plot function using the subplot_kws keyword. The axes for the subplots created by faceting are accessible in the object returned by plot:

In [75]: p = air.isel(time=[0, 4]).plot(transform=ccrs.PlateCarree(), col='time',
   ....:                                subplot_kws={'projection': ccrs.Orthographic(-80, 35)})
   ....: 

In [76]: for ax in p.axes.flat:
   ....:     ax.coastlines()
   ....:     ax.gridlines()
   ....: 

In [77]: plt.show();
_images/plotting_maps_cartopy_facetting.png

Details

Ways to Use

There are three ways to use the xarray plotting functionality:

  1. Use plot as a convenience method for a DataArray.
  2. Access a specific plotting method from the plot attribute of a DataArray.
  3. Directly from the xarray plot submodule.

These are provided for user convenience; they all call the same code.

In [78]: import xarray.plot as xplt

In [79]: da = xr.DataArray(range(5))

In [80]: fig, axes = plt.subplots(ncols=2, nrows=2)

In [81]: da.plot(ax=axes[0, 0])
Out[81]: [<matplotlib.lines.Line2D at 0x7f10cc4210b8>]

In [82]: da.plot.line(ax=axes[0, 1])
Out[82]: [<matplotlib.lines.Line2D at 0x7f10cc421160>]

In [83]: xplt.plot(da, ax=axes[1, 0])
Out[83]: [<matplotlib.lines.Line2D at 0x7f10cc421198>]

In [84]: xplt.line(da, ax=axes[1, 1])
Out[84]: [<matplotlib.lines.Line2D at 0x7f10cc421208>]

In [85]: plt.tight_layout()

In [86]: plt.show()
_images/plotting_ways_to_use.png

Here the output is the same. Since the data is 1-dimensional, the line plot was used.

The convenience method xarray.DataArray.plot() dispatches to an appropriate plotting function based on the dimensions of the DataArray and whether the coordinates are sorted and uniformly spaced. This table describes what gets plotted:

Dimensions      Plotting function
1               xarray.plot.line()
2               xarray.plot.pcolormesh()
Anything else   xarray.plot.hist()
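
For example, the 3-dimensional array air defined earlier falls through to the histogram case:

    air.plot()  # air has 3 dimensions, so xarray.plot.hist() is dispatched
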
Coordinates

If you’d like to find out what’s really going on in the coordinate system, read on.

In [87]: a0 = xr.DataArray(np.zeros((4, 3, 2)), dims=('y', 'x', 'z'),
   ....:                   name='temperature')
   ....: 

In [88]: a0[0, 0, 0] = 1

In [89]: a = a0.isel(z=0)

In [90]: a
Out[90]: 
<xarray.DataArray 'temperature' (y: 4, x: 3)>
array([[ 1.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
Dimensions without coordinates: y, x

The plot will produce an image corresponding to the values of the array. Hence the top left pixel will be a different color than the others. Before reading on, you may want to look at the coordinates and think carefully about what the limits, labels, and orientation for each of the axes should be.

In [91]: a.plot()
Out[91]: <matplotlib.collections.QuadMesh at 0x7f10cc364550>
_images/plotting_example_2d_simple.png

It may seem strange that the values on the y axis are decreasing with -0.5 on the top. This is because the pixels are centered over their coordinates, and the axis labels and ranges correspond to the values of the coordinates.

Multidimensional coordinates

See also: Working with Multidimensional Coordinates.

You can plot irregular grids defined by multidimensional coordinates with xarray, but you’ll have to tell the plot function to use these coordinates instead of the default ones:

In [92]: lon, lat = np.meshgrid(np.linspace(-20, 20, 5), np.linspace(0, 30, 4))

In [93]: lon += lat/10

In [94]: lat += lon/10

In [95]: da = xr.DataArray(np.arange(20).reshape(4, 5), dims=['y', 'x'],
   ....:                   coords = {'lat': (('y', 'x'), lat),
   ....:                             'lon': (('y', 'x'), lon)})
   ....: 

In [96]: da.plot.pcolormesh('lon', 'lat');
_images/plotting_example_2d_irreg.png

Note that in this case, xarray still follows the pixel centered convention. This might be undesirable in some cases, for example when your data is defined on a polar projection (GH781). This is why the default is to not follow this convention when plotting on a map:

In [97]: import cartopy.crs as ccrs

In [98]: ax = plt.subplot(projection=ccrs.PlateCarree());

In [99]: da.plot.pcolormesh('lon', 'lat', ax=ax);

In [100]: ax.scatter(lon, lat, transform=ccrs.PlateCarree());

In [101]: ax.coastlines(); ax.gridlines(draw_labels=True);
_images/plotting_example_2d_irreg_map.png

You can, however, decide to infer the cell boundaries using the infer_intervals keyword:

In [102]: ax = plt.subplot(projection=ccrs.PlateCarree());

In [103]: da.plot.pcolormesh('lon', 'lat', ax=ax, infer_intervals=True);

In [104]: ax.scatter(lon, lat, transform=ccrs.PlateCarree());

In [105]: ax.coastlines(); ax.gridlines(draw_labels=True);
_images/plotting_example_2d_irreg_map_infer.png

Note

The data model of xarray does not support datasets with cell boundaries yet. If you want to use these coordinates, you’ll have to make the plots outside the xarray framework.

Help & reference

What’s New

Warning

Xarray plans to drop support for python 2.7 at the end of 2018. This means that new releases of xarray published after this date will only be installable on python 3+ environments, but older versions of xarray will always be available to python 2.7 users.

v0.10.3 (April 13, 2018)

The minor release includes a number of bug-fixes and backwards compatible enhancements.

Enhancements
  • Dataset.isin() and DataArray.isin() methods, which test each value in the array for whether it is contained in the supplied list, returning a bool array (similar to the np.isin function). See Selecting values with isin for full details. By Maximilian Roos.
  • Speed improvements when constructing DataArrayRolling objects (GH1993). By Keisuke Fujii.
  • Handle variables with different values for missing_value and _FillValue by masking values for both attributes; previously this resulted in a ValueError. (GH2016) By Ryan May.
Bug fixes
  • Fixed decode_cf function to operate lazily on dask arrays (GH1372). By Ryan Abernathey.
  • Fixed labeled indexing with slice bounds given by xarray objects with datetime64 or timedelta64 dtypes (GH1240). By Stephan Hoyer.
  • Attempting to convert an xarray.Dataset into a numpy array now raises an informative error message. By Stephan Hoyer.
  • Fixed a bug in decode_cf_datetime where int32 arrays weren’t parsed correctly (GH2002). By Fabien Maussion.
  • When calling xr.auto_combine() or xr.open_mfdataset() with a concat_dim, the resulting dataset will have that one-element dimension (it was silently dropped, previously) (GH1988). By Ben Root.

v0.10.2 (13 March 2018)

The minor release includes a number of bug-fixes and enhancements, along with one possibly backwards incompatible change.

Backwards incompatible changes
  • The addition of __array_ufunc__ for xarray objects (see below) means that NumPy ufunc methods (e.g., np.add.reduce) that previously worked on xarray.DataArray objects by converting them into NumPy arrays will now raise NotImplementedError instead. In all cases, the work-around is simple: convert your objects explicitly into NumPy arrays before calling the ufunc (e.g., with .values).
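
    A minimal sketch of the work-around (da is a hypothetical DataArray):

    np.add.reduce(da.values, axis=0)  # operate on the underlying NumPy array
    # instead of np.add.reduce(da, axis=0), which now raises NotImplementedError
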
Enhancements
  • Added dot(), equivalent to np.einsum(). Also, dot() now supports dims option, which specifies the dimensions to sum over. (GH1951) By Keisuke Fujii.

  • Support for writing xarray datasets to netCDF files (netcdf4 backend only) when using the dask.distributed scheduler (GH1464). By Joe Hamman.

  • Support lazy vectorized-indexing. After this change, flexible indexing, such as orthogonal/vectorized indexing, becomes possible for all the backend arrays. Lazy transpose is now also supported. (GH1897) By Keisuke Fujii.

  • Implemented NumPy’s __array_ufunc__ protocol for all xarray objects (GH1617). This enables using NumPy ufuncs directly on xarray.Dataset objects with recent versions of NumPy (v1.13 and newer):

    In [1]: ds = xr.Dataset({'a': 1})
    
    In [2]: np.sin(ds)
    Out[2]: 
    <xarray.Dataset>
    Dimensions:  ()
    Data variables:
        a        float64 0.8415
    

    This obviates the need for the xarray.ufuncs module, which will be deprecated in the future when xarray drops support for older versions of NumPy. By Stephan Hoyer.

  • Improve rolling() logic. DataArrayRolling() object now supports construct() method that returns a view of the DataArray / Dataset object with the rolling-window dimension added to the last axis. This enables more flexible operation, such as strided rolling, windowed rolling, ND-rolling, short-time FFT and convolution. (GH1831, GH1142, GH819) By Keisuke Fujii.
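
    A hedged sketch of construct() (the array and dimension names are hypothetical):

    rolled = da.rolling(time=3).construct('window')
    # 'window' is a new dimension of length 3 appended as the last axis,
    # so e.g. rolled.mean('window') roughly reproduces a rolling mean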

  • line() learned to make plots with data on the x-axis if so specified. (GH575) By Deepak Cherian.

v0.10.1 (25 February 2018)

The minor release includes a number of bug-fixes and backwards compatible enhancements.

Enhancements

Other enhancements:

  • Reduce methods such as DataArray.sum() now handle object-type arrays.

    In [3]: da = xr.DataArray(np.array([True, False, np.nan], dtype=object), dims='x')
    
    In [4]: da.sum()
    Out[4]: 
    <xarray.DataArray ()>
    array(1)
    

    (GH1866) By Keisuke Fujii.

  • Reduce methods such as DataArray.sum() now accept a dtype argument. (GH1838) By Keisuke Fujii.

  • Added nodatavals attribute to DataArray when using open_rasterio(). (GH1736). By Alan Snow.

  • Use pandas.Grouper class in xarray resample methods rather than the deprecated pandas.TimeGrouper class (GH1766). By Joe Hamman.

  • Experimental support for parsing ENVI metadata to coordinates and attributes in xarray.open_rasterio(). By Matti Eskelinen.

  • Reduce memory usage when decoding a variable with a scale_factor, by converting 8-bit and 16-bit integers to float32 instead of float64 (PR1840), and keeping float16 and float32 as float32 (GH1842). Correspondingly, encoded variables may also be saved with a smaller dtype. By Zac Hatfield-Dodds.

  • Speed of reindexing/alignment with dask array is orders of magnitude faster when inserting missing values (GH1847). By Stephan Hoyer.

  • Fix axis keyword ignored when applying np.squeeze to DataArray (GH1487). By Florian Pinault.

  • netcdf4-python has moved its time handling in the netcdftime module to a standalone package (netcdftime). As such, xarray now considers netcdftime an optional dependency. One benefit of this change is that it allows for encoding/decoding of datetimes with non-standard calendars without the netcdf4-python dependency (GH1084). By Joe Hamman.

Bug fixes
  • Rolling aggregation with the center=True option now gives the same result as pandas, including the last element (GH1046). By Keisuke Fujii.
  • Support indexing with a 0d-np.ndarray (GH1921). By Keisuke Fujii.
  • Added warning in api.py of a netCDF4 bug that occurs when the filepath has 88 characters (GH1745). By Liam Brannigan.
  • Fixed encoding of multi-dimensional coordinates in to_netcdf() (GH1763). By Mike Neish.
  • Fixed chunking with non-file-based rasterio datasets (GH1816) and refactored rasterio test suite. By Ryan Abernathey
  • Bug fix in open_dataset(engine='pydap') (GH1775). By Keisuke Fujii.
  • Bug fix in vectorized assignment (GH1743, GH1744). Item assignment through DataArray.__setitem__() now checks the coordinates of target, destination and keys; if there is any conflict among these coordinates, an IndexError is raised. By Keisuke Fujii.
  • Properly point DataArray.__dask_scheduler__() to dask.threaded.get. By Matthew Rocklin.
  • Bug fixes in DataArray.plot.imshow(): all-NaN arrays and arrays with size one in some dimension can now be plotted, which is good for exploring satellite imagery (GH1780). By Zac Hatfield-Dodds.
  • Fixed UnboundLocalError when opening netCDF file (GH1781). By Stephan Hoyer.
  • The variables, attrs, and dimensions properties have been deprecated as part of a bug fix addressing an issue where backends were unintentionally loading the datastore’s data and attributes repeatedly during writes (GH1798). By Joe Hamman.
  • Compatibility fixes to plotting module for Numpy 1.14 and Pandas 0.22 (GH1813). By Joe Hamman.
  • Bug fix in encoding coordinates with {'_FillValue': None} in netCDF metadata (GH1865). By Chris Roth.
  • Fix indexing with lists for arrays loaded from netCDF files with engine='h5netcdf' (GH1864). By Stephan Hoyer.
  • Corrected a bug with incorrect coordinates for non-georeferenced geotiff files (GH1686). Internally, we now use the rasterio coordinate transform tool instead of doing the computations ourselves. A parse_coordinates kwarg has been added to open_rasterio() (set to True by default). By Fabien Maussion.
  • The colors of discrete colormaps are now the same regardless if seaborn is installed or not (GH1896). By Fabien Maussion.
  • Fixed dtype promotion rules in where() and concat() to match pandas (GH1847). A combination of strings/numbers or unicode/bytes now promote to object dtype, instead of strings or unicode. By Stephan Hoyer.
  • Fixed bug where isnull() was loading data stored as dask arrays (GH1937). By Joe Hamman.

v0.10.0 (20 November 2017)

This is a major release that includes bug fixes, new features and a few backwards incompatible changes. Highlights include:

  • Indexing now supports broadcasting over dimensions, similar to NumPy’s vectorized indexing (but better!).
  • resample() has a new groupby-like API like pandas.
  • apply_ufunc() facilitates wrapping and parallelizing functions written for NumPy arrays.
  • Performance improvements, particularly for dask and open_mfdataset().
Breaking changes
  • xarray now supports a form of vectorized indexing with broadcasting, where the result of indexing depends on dimensions of indexers, e.g., array.sel(x=ind) with ind.dims == ('y',). Alignment between coordinates on indexed and indexing objects is also now enforced. Due to these changes, existing uses of xarray objects to index other xarray objects will break in some cases.

    The new indexing API is much more powerful, supporting outer, diagonal and vectorized indexing in a single interface. The isel_points and sel_points methods are deprecated, since they are now redundant with the isel / sel methods. See Vectorized Indexing for the details (GH1444, GH1436). By Keisuke Fujii and Stephan Hoyer.
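
    A minimal sketch of the new behavior (names hypothetical):

    arr = xr.DataArray([10, 20, 30], dims='x', coords={'x': [0, 1, 2]})
    ind = xr.DataArray([0, 2], dims='y')
    arr.sel(x=ind)  # result has dimension 'y', with values [10, 30]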

  • A new resampling interface to match pandas’ groupby-like API was added to Dataset.resample() and DataArray.resample() (GH1272). Timeseries resampling is fully supported for data with arbitrary dimensions as is both downsampling and upsampling (including linear, quadratic, cubic, and spline interpolation).

    Old syntax:

    In [5]: ds.resample('24H', dim='time', how='max')
    Out[5]: 
    <xarray.Dataset>
    [...]
    

    New syntax:

    In [6]: ds.resample(time='24H').max()
    Out[6]: 
    <xarray.Dataset>
    [...]
    

    Note that both versions are currently supported, but using the old syntax will produce a warning encouraging users to adopt the new syntax. By Daniel Rothenberg.

  • Calling repr() or printing xarray objects at the command line or in a Jupyter Notebook will no longer automatically compute dask variables or load data on arrays lazily loaded from disk (GH1522). By Guido Imperiale.

  • Supplying coords as a dictionary to the DataArray constructor without also supplying an explicit dims argument is no longer supported. This behavior was deprecated in version 0.9 but will now raise an error (GH727).

  • Several existing features have been deprecated and will change to new behavior in xarray v0.11. If you use any of them with xarray v0.10, you should see a FutureWarning that describes how to update your code:

    • Dataset.T has been deprecated as an alias for Dataset.transpose() (GH1232). In the next major version of xarray, it will provide shortcut lookup for variables or attributes with name 'T'.
    • DataArray.__contains__ (e.g., key in data_array) currently checks for membership in DataArray.coords. In the next major version of xarray, it will check membership in the array data found in DataArray.values instead (GH1267).
    • Direct iteration over and counting a Dataset (e.g., [k for k in ds], ds.keys(), ds.values(), len(ds) and if ds) currently includes all variables, both data and coordinates. For improved usability and consistency with pandas, in the next major version of xarray these will change to only include data variables (GH884). Use ds.variables, ds.data_vars or ds.coords as alternatives.
  • Changes to minimum versions of dependencies:

    • Old numpy < 1.11 and pandas < 0.18 are no longer supported (GH1512). By Keisuke Fujii.
    • The minimum supported version of bottleneck has increased to 1.1 (GH1279). By Joe Hamman.
Enhancements

New functions/methods

  • New helper function apply_ufunc() for wrapping functions written to work on NumPy arrays to support labels on xarray objects (GH770). apply_ufunc also supports automatic parallelization for many functions with dask. See Wrapping custom computation and Automatic parallelization for details. By Stephan Hoyer.

  • Added new method Dataset.to_dask_dataframe(), which converts a dataset into a dask dataframe. This allows lazy loading of data from a dataset containing dask arrays (GH1462). By James Munroe.

  • New function where() for conditionally switching between values in xarray objects, like numpy.where():

    In [7]: import xarray as xr
    
    In [8]: arr = xr.DataArray([[1, 2, 3], [4, 5, 6]], dims=('x', 'y'))
    
    In [9]: xr.where(arr % 2, 'even', 'odd')
    Out[9]: 
    <xarray.DataArray (x: 2, y: 3)>
    array([['even', 'odd', 'even'],
           ['odd', 'even', 'odd']],
          dtype='<U4')
    Dimensions without coordinates: x, y
    

    Equivalently, the where() method also now supports the other argument, for filling with a value other than NaN (GH576). By Stephan Hoyer.

  • Added show_versions() function to aid in debugging (GH1485). By Joe Hamman.

Performance improvements

  • concat() was computing variables that aren’t in memory (e.g. dask-based) multiple times; open_mfdataset() was loading them multiple times from disk. Now, both functions will instead load them at most once and, if they do, store them in memory in the concatenated array/dataset (GH1521). By Guido Imperiale.
  • Speed-up (x 100) of decode_cf_datetime(). By Christian Chwala.

IO related improvements

  • Unicode strings (str on Python 3) are now round-tripped successfully even when written as character arrays (e.g., as netCDF3 files or when using engine='scipy') (GH1638). This is controlled by the _Encoding attribute convention, which is also understood directly by the netCDF4-Python interface. See String encoding for full details. By Stephan Hoyer.

  • Support for data_vars and coords keywords from concat() added to open_mfdataset() (GH438). Using these keyword arguments can significantly reduce memory usage and increase speed. By Oleksandr Huziy.

  • Support for pathlib.Path objects added to open_dataset(), open_mfdataset(), to_netcdf(), and save_mfdataset() (GH799):

    In [10]: from pathlib import Path  # In Python 2, use pathlib2!
    
    In [11]: data_dir = Path("data/")
    
    In [12]: one_file = data_dir / "dta_for_month_01.nc"
    
    In [13]: xr.open_dataset(one_file)
    Out[13]: 
    <xarray.Dataset>
    [...]
    

    By Willi Rath.

  • You can now explicitly disable any default _FillValue (NaN for floating point values) by passing the encoding {'_FillValue': None} (GH1598). By Stephan Hoyer.
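
    For example (with a hypothetical data variable named 'air'):

    ds.to_netcdf('out.nc', encoding={'air': {'_FillValue': None}})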

  • More attributes available in attrs dictionary when raster files are opened with open_rasterio(). By Greg Brener.

  • Support for NetCDF files using an _Unsigned attribute to indicate that a signed integer data type should be interpreted as unsigned bytes (GH1444). By Eric Bruning.

  • Support using an existing, opened netCDF4 Dataset with NetCDF4DataStore. This permits creating a Dataset from a netCDF4 Dataset that has been opened using other means (GH1459). By Ryan May.

  • Changed PydapDataStore to take a Pydap dataset. This permits opening Opendap datasets that require authentication, by instantiating a Pydap dataset with a session object. Also added xarray.backends.PydapDataStore.open() which takes a url and session object (GH1068). By Philip Graae.

  • Support reading and writing unlimited dimensions with h5netcdf (GH1636). By Joe Hamman.

Other improvements

  • Added _ipython_key_completions_ to xarray objects, to enable autocompletion for dictionary-like access in IPython, e.g., ds['tem + tab -> ds['temperature'] (GH1628). By Keisuke Fujii.
  • Support passing keyword arguments to load, compute, and persist methods. Any keyword arguments supplied to these methods are passed on to the corresponding dask function (GH1523). By Joe Hamman.
  • Encoding attributes are now preserved when xarray objects are concatenated. The encoding is copied from the first object (GH1297). By Joe Hamman and Gerrit Holl.
  • Support applying rolling window operations using bottleneck’s moving window functions on data stored as dask arrays (GH1279). By Joe Hamman.
  • Experimental support for the Dask collection interface (GH1674). By Matthew Rocklin.
Bug fixes
Bug fixes after rc1
  • Suppress warning in IPython autocompletion, related to the deprecation of .T attributes (GH1675). By Keisuke Fujii.
  • Fix a bug in lazily-indexing netCDF array. (GH1688) By Keisuke Fujii.
  • (Internal bug) MemoryCachedArray now supports the orthogonal indexing. Also made some internal cleanups around array wrappers (GH1429). By Keisuke Fujii.
  • (Internal bug) MemoryCachedArray now always wraps np.ndarray by NumpyIndexingAdapter. (GH1694) By Keisuke Fujii.
  • Fix importing xarray when running Python with -OO (GH1706). By Stephan Hoyer.
  • Saving a netCDF file with a coordinate that has spaces in its name now raises an appropriate warning (GH1689). By Stephan Hoyer.
  • Fix two bugs that were preventing dask arrays from being specified as coordinates in the DataArray constructor (GH1684). By Joe Hamman.
  • Fixed apply_ufunc with dask='parallelized' for scalar arguments (GH1697). By Stephan Hoyer.
  • Fix “Chunksize cannot exceed dimension size” error when writing netCDF4 files loaded from disk (GH1225). By Stephan Hoyer.
  • Validate the shape of coordinates with names matching dimensions in the DataArray constructor (GH1709). By Stephan Hoyer.
  • Raise NotImplementedError when attempting to save a MultiIndex to a netCDF file (GH1547). By Stephan Hoyer.
  • Remove netCDF dependency from rasterio backend tests. By Matti Eskelinen
Bug fixes after rc2
  • Fixed unexpected behavior in Dataset.set_index() and DataArray.set_index() introduced by Pandas 0.21.0. Setting a new index with a single variable resulted in 1-level pandas.MultiIndex instead of a simple pandas.Index (GH1722). By Benoit Bovy.
  • Fixed unexpected memory loading of backend arrays after print. (GH1720). By Keisuke Fujii.

v0.9.6 (8 June 2017)

This release includes a number of backwards compatible enhancements and bug fixes.

Bug fixes
  • Fix error from repeated indexing of datasets loaded from disk (GH1374). By Stephan Hoyer.
  • Fix a bug where .isel_points wrongly assigns unselected coordinate to data_vars. By Keisuke Fujii.
  • Tutorial datasets are now checked against a reference MD5 sum to confirm successful download (GH1392). By Matthew Gidden.
  • DataArray.chunk() now accepts dask specific kwargs like Dataset.chunk() does. By Fabien Maussion.
  • Support for engine='pydap' with recent releases of Pydap (3.2.2+), including on Python 3 (GH1174).
Testing
  • Fix test suite failure caused by changes to pandas.cut function (GH1386). By Ryan Abernathey.
  • Enhanced tests suite by use of @network decorator, which is controlled via --run-network-tests command line argument to py.test (GH1393). By Matthew Gidden.

v0.9.5 (17 April, 2017)

Remove an inadvertently introduced print statement.

v0.9.3 (16 April, 2017)

This minor release includes bug-fixes and backwards compatible enhancements.

Bug fixes
  • Fix .where() with drop=True when arguments do not have indexes (GH1350). This bug, introduced in v0.9, resulted in xarray producing incorrect results in some cases. By Stephan Hoyer.
  • Fixed writing to file-like objects with to_netcdf() (GH1320). By Stephan Hoyer.
  • Fixed explicitly setting engine='scipy' with to_netcdf when not providing a path (GH1321). By Stephan Hoyer.
  • Fixed open_dataarray not passing its parameters properly to open_dataset (GH1359). By Stephan Hoyer.
  • Ensure test suite works when runs from an installed version of xarray (GH1336). Use @pytest.mark.slow instead of a custom flag to mark slow tests. By Stephan Hoyer

v0.9.2 (2 April 2017)

The minor release includes bug-fixes and backwards compatible enhancements.

Enhancements
  • .rolling() on Dataset is now supported (GH859). By Keisuke Fujii.
  • When bottleneck version 1.1 or later is installed, use bottleneck for rolling var, argmin, argmax, and rank computations. Also, rolling median now accepts a min_periods argument (GH1276). By Joe Hamman.
  • When .plot() is called on a 2D DataArray and only one dimension is specified with x= or y=, the other dimension is now guessed (GH1291). By Vincent Noel.
  • Added new method assign_attrs() to DataArray and Dataset, a chained-method compatible implementation of the dict.update method on attrs (GH1281). By Henry S. Harrison.
  • Added new autoclose=True argument to open_mfdataset() to explicitly close opened files when not in use to prevent occurrence of an OS Error related to too many open files (GH1198). Note, the default is autoclose=False, which is consistent with previous xarray behavior. By Phillip J. Wolfram.
  • The repr() of Dataset and DataArray attributes uses a similar format to coordinates and variables, with vertically aligned entries truncated to fit on a single line (GH1319). Hopefully this will stop people writing data.attrs = {} and discarding metadata in notebooks for the sake of cleaner output. The full metadata is still available as data.attrs. By Zac Hatfield-Dodds.
  • Enhanced tests suite by use of @slow and @flaky decorators, which are controlled via --run-flaky and --skip-slow command line arguments to py.test (GH1336). By Stephan Hoyer and Phillip J. Wolfram.
  • New aggregation on rolling objects DataArray.rolling(...).count(), which provides a rolling count of valid values (GH1138).

v0.9.1 (30 January 2017)

Renamed the “Unindexed dimensions” section in the Dataset and DataArray repr (added in v0.9.0) to “Dimensions without coordinates” (GH1199).

v0.9.0 (25 January 2017)

This major release includes five months’ worth of enhancements and bug fixes from 24 contributors, including some significant changes that are not fully backwards compatible.

Breaking changes
  • Index coordinates for each dimension are now optional, and no longer created by default (GH1017). You can identify such dimensions without coordinates by their appearance in the list of “Dimensions without coordinates” in the Dataset or DataArray repr:

    In [14]: xr.Dataset({'foo': (('x', 'y'), [[1, 2]])})
    Out[14]: 
    <xarray.Dataset>
    Dimensions:  (x: 1, y: 2)
    Dimensions without coordinates: x, y
    Data variables:
        foo      (x, y) int64 1 2
    

    This has a number of implications:

    • align() and reindex() can now raise an error if dimension labels are missing and dimensions have different sizes.
    • Because pandas does not support missing indexes, methods such as to_dataframe/from_dataframe and stack/unstack no longer roundtrip faithfully on all inputs. Use reset_index() to remove undesired indexes.
    • Dataset.__delitem__ and drop() no longer delete/drop variables that have dimensions matching a deleted/dropped variable.
    • DataArray.coords.__delitem__ is now allowed on variables matching dimension names.
    • .sel and .loc now handle indexing along a dimension without coordinate labels by doing integer based indexing. See Missing coordinate labels for an example.
    • indexes is no longer guaranteed to include all dimension names as keys. The new method get_index() has been added to always get an index for a dimension, falling back to a default RangeIndex if necessary.
  • The default behavior of merge is now compat='no_conflicts', so some merges will now succeed in cases that previously raised xarray.MergeError. Set compat='broadcast_equals' to restore the previous default. See Merging with ‘no_conflicts’ for more details.

  • Reading values no longer always caches values in a NumPy array (GH1128). Caching of .values on variables read from netCDF files on disk is still the default when open_dataset() is called with cache=True. By Guido Imperiale and Stephan Hoyer.

  • Pickling a Dataset or DataArray linked to a file on disk no longer caches its values into memory before pickling (GH1128). Instead, pickle stores file paths and restores objects by reopening file references. This enables preliminary, experimental use of xarray for opening files with dask.distributed. By Stephan Hoyer.

  • Coordinates used to index a dimension are now loaded eagerly into pandas.Index objects, instead of loading the values lazily. By Guido Imperiale.

  • Automatic levels for 2d plots are now guaranteed to land on vmin and vmax when these kwargs are explicitly provided (GH1191). The automated level selection logic also slightly changed. By Fabien Maussion.

  • DataArray.rename() behavior changed to strictly change the DataArray.name if called with string argument, or strictly change coordinate names if called with dict-like argument. By Markus Gonser.

  • By default, to_netcdf() adds a _FillValue = NaN attribute to float types. By Frederic Laliberte.

  • repr on DataArray objects uses a shortened display for NumPy array data that is less likely to overflow onto multiple pages (GH1207). By Stephan Hoyer.

  • xarray no longer supports python 3.3, versions of dask prior to v0.9.0, or versions of bottleneck prior to v1.0.

Deprecations
  • Renamed the Coordinate class from xarray’s low level API to IndexVariable. Variable.to_variable and Variable.to_coord have been renamed to to_base_variable() and to_index_variable().
  • Deprecated supplying coords as a dictionary to the DataArray constructor without also supplying an explicit dims argument. The old behavior encouraged relying on the iteration order of dictionaries, which is a bad practice (GH727).
  • Removed a number of methods deprecated since v0.7.0 or earlier: load_data, vars, drop_vars, dump, dumps and the variables keyword argument to Dataset.
  • Removed the dummy module that enabled import xray.
Bug fixes
  • groupby_bins now restores empty bins by default (GH1019). By Ryan Abernathey.
  • Fix issues for dates outside the valid range of pandas timestamps (GH975). By Mathias Hauser.
  • Unstacking produced flipped array after stacking decreasing coordinate values (GH980). By Stephan Hoyer.
  • Setting dtype via the encoding parameter of to_netcdf failed if the encoded dtype was the same as the dtype of the original array (GH873). By Stephan Hoyer.
  • Fix issues with variables where both attributes _FillValue and missing_value are set to NaN (GH997). By Marco Zühlke.
  • .where() and .fillna() now preserve attributes (GH1009). By Fabien Maussion.
  • Applying broadcast() to an xarray object based on the dask backend won’t accidentally convert the array from dask to numpy anymore (GH978). By Guido Imperiale.
  • Dataset.concat() now preserves variable order (GH1027). By Fabien Maussion.
  • Fixed an issue with pcolormesh (GH781). A new infer_intervals keyword gives control on whether the cell intervals should be computed or not. By Fabien Maussion.
  • Grouping over a dimension with non-unique values with groupby gives correct groups. By Stephan Hoyer.
  • Fixed accessing coordinate variables with non-string names from .coords. By Stephan Hoyer.
  • rename() now simultaneously renames the array and any coordinate with the same name, when supplied via a dict (GH1116). By Yves Delley.
  • Fixed sub-optimal performance in certain operations with object arrays (GH1121). By Yves Delley.
  • Fix .groupby(group) when group has datetime dtype (GH1132). By Jonas Sølvsteen.
  • Fixed a bug with facetgrid (the norm keyword was ignored, GH1159). By Fabien Maussion.
  • Resolved a concurrency bug that could cause Python to crash when simultaneously reading and writing netCDF4 files with dask (GH1172). By Stephan Hoyer.
  • Fix to make .copy() actually copy dask arrays, which will be relevant for future releases of dask in which dask arrays will be mutable (GH1180). By Stephan Hoyer.
  • Fix opening NetCDF files with multi-dimensional time variables (GH1229). By Stephan Hoyer.
Performance improvements
  • isel_points() and sel_points() now use vectorised indexing in numpy and dask (GH1161), which can result in several orders of magnitude speedup. By Jonathan Chambers.

v0.8.2 (18 August 2016)

This release includes a number of bug fixes and minor enhancements.

Bug fixes
  • Ensure xarray works with h5netcdf v0.3.0 for arrays with dtype=str (GH953). By Stephan Hoyer.
  • Dataset.__dir__() (i.e. the method python calls to get autocomplete options) failed if one of the dataset’s keys was not a string (GH852). By Maximilian Roos.
  • Dataset constructor can now take arbitrary objects as values (GH647). By Maximilian Roos.
  • Clarified copy argument for reindex() and align(), which now consistently always return new xarray objects (GH927).
  • Fix open_mfdataset with engine='pynio' (GH936). By Stephan Hoyer.
  • groupby_bins sorted bin labels as strings (GH952). By Stephan Hoyer.
  • Fix bug introduced by v0.8.0 that broke assignment to datasets when both the left and right side have the same non-unique index values (GH956).

v0.8.1 (5 August 2016)

Bug fixes
  • Fix bug in v0.8.0 that broke assignment to Datasets with non-unique indexes (GH943). By Stephan Hoyer.

v0.8.0 (2 August 2016)

This release includes four months of new features and bug fixes, including several breaking changes.

Breaking changes
  • Dropped support for Python 2.6 (GH855).
  • Indexing on a multi-index now drops levels, which is consistent with pandas. It also changes the name of the dimension / coordinate when the multi-index is reduced to a single index (GH802).
  • Contour plots no longer add a colorbar by default (GH866). Filled contour plots are unchanged.
  • DataArray.values and .data now always return a NumPy array-like object, even for 0-dimensional arrays with object dtype (GH867). Previously, .values returned native Python objects in such cases. To convert the values of scalar arrays to Python objects, use the .item() method.
Enhancements
  • Groupby operations now support grouping over multidimensional variables. A new method called groupby_bins() has also been added to allow users to specify bins for grouping. The new features are described in Multidimensional Grouping and Working with Multidimensional Coordinates. By Ryan Abernathey.
  • DataArray and Dataset method where() now supports a drop=True option that clips coordinate elements that are fully masked. By Phillip J. Wolfram.
  • New top level merge() function allows for combining variables from any number of Dataset and/or DataArray variables. See Merge for more details. By Stephan Hoyer.
  • DataArray and Dataset method resample() now supports the keep_attrs=False option that determines whether variable and dataset attributes are retained in the resampled object. By Jeremy McGibbon.
  • Better multi-index support in DataArray and Dataset sel() and loc() methods, which now behave more closely to pandas and which also accept dictionaries for indexing based on given level names and labels (see Multi-level indexing). By Benoit Bovy.
  • New (experimental) decorators register_dataset_accessor() and register_dataarray_accessor() for registering custom xarray extensions without subclassing. They are described in the new documentation page on xarray Internals. By Stephan Hoyer.
  • Round trip boolean datatypes. Previously, writing boolean datatypes to netCDF formats would raise an error since netCDF does not have a bool datatype. This feature reads/writes a dtype attribute to boolean variables in netCDF files. By Joe Hamman.
  • 2D plotting methods now have two new keywords (cbar_ax and cbar_kwargs), allowing more control on the colorbar (GH872). By Fabien Maussion.
  • New Dataset method filter_by_attrs(), akin to netCDF4.Dataset.get_variables_by_attributes, to easily filter data variables using their attributes. By Filipe Fernandes.
Bug fixes
  • Attributes were being retained by default for some resampling operations when they should not. With the keep_attrs=False option, they will no longer be retained by default. This may be backwards-incompatible with some scripts, but the attributes may be kept by adding the keep_attrs=True option. By Jeremy McGibbon.
  • Concatenating xarray objects along an axis with a MultiIndex or PeriodIndex preserves the nature of the index (GH875). By Stephan Hoyer.
  • Fixed bug in arithmetic operations on DataArray objects whose dimensions are numpy structured arrays or recarrays GH861, GH837. By Maciek Swat.
  • decode_cf_timedelta now accepts arrays with ndim > 1 (GH842). This fixes issue GH665. By Filipe Fernandes.
  • Fix a bug where xarray.ufuncs that take two arguments would incorrectly use numpy functions instead of dask.array functions (GH876). By Stephan Hoyer.
  • Support for pickling functions from xarray.ufuncs (GH901). By Stephan Hoyer.
  • Variable.copy(deep=True) no longer converts MultiIndex into a base Index (GH769). By Benoit Bovy.
  • Fixes for groupby on dimensions with a multi-index (GH867). By Stephan Hoyer.
  • Fix printing datasets with unicode attributes on Python 2 (GH892). By Stephan Hoyer.
  • Fixed incorrect test for dask version (GH891). By Stephan Hoyer.
  • Fixed dim argument for isel_points/sel_points when a pandas.Index is passed. By Stephan Hoyer.
  • contour() now plots the correct number of contours (GH866). By Fabien Maussion.

v0.7.2 (13 March 2016)

This release includes two new, entirely backwards compatible features and several bug fixes.

Enhancements
  • New DataArray method DataArray.dot() for calculating the dot product of two DataArrays along shared dimensions. By Dean Pospisil.

  • Rolling window operations on DataArray objects are now supported via a new DataArray.rolling() method. For example:

    In [15]: import xarray as xr; import numpy as np
    
    In [16]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
                               dims=('x', 'y'))
    
    In [17]: arr
    Out[17]: 
    <xarray.DataArray (x: 3, y: 5)>
    array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
           [ 2.5,  3. ,  3.5,  4. ,  4.5],
           [ 5. ,  5.5,  6. ,  6.5,  7. ]])
    Coordinates:
      * x        (x) int64 0 1 2
      * y        (y) int64 0 1 2 3 4
    
    In [18]: arr.rolling(y=3, min_periods=2).mean()
    Out[18]: 
    <xarray.DataArray (x: 3, y: 5)>
    array([[  nan,  0.25,  0.5 ,  1.  ,  1.5 ],
           [  nan,  2.75,  3.  ,  3.5 ,  4.  ],
           [  nan,  5.25,  5.5 ,  6.  ,  6.5 ]])
    Coordinates:
      * x        (x) int64 0 1 2
      * y        (y) int64 0 1 2 3 4
    

    See Rolling window operations for more details. By Joe Hamman.

Bug fixes
  • Fixed an issue where plots using pcolormesh and Cartopy axes were being distorted by the inference of the axis interval breaks. This change chooses not to modify the coordinate variables when the axes have the attribute projection, allowing Cartopy to handle the extent of pcolormesh plots (GH781). By Joe Hamman.
  • 2D plots now better handle additional coordinates which are not DataArray dimensions (GH788). By Fabien Maussion.

v0.7.1 (16 February 2016)

This is a bug fix release that includes two small, backwards compatible enhancements. We recommend that all users upgrade.

Enhancements
  • Numerical operations now return empty objects when there are no overlapping labels, rather than raising ValueError (GH739).
  • Series is now supported as valid input to the Dataset constructor (GH740).
Bug fixes
  • Restore checks for shape consistency between data and coordinates in the DataArray constructor (GH758).
  • Single dimension variables no longer transpose as part of a broader .transpose. This behavior was causing pandas.PeriodIndex dimensions to lose their type (GH749).
  • Dataset labels remain as their native type on .to_dataset. Previously they were coerced to strings (GH745).
  • Fixed a bug where replacing a DataArray index coordinate would improperly align the coordinate (GH725).
  • DataArray.reindex_like now maintains the dtype of complex numbers when reindexing leads to NaN values (GH738).
  • Dataset.rename and DataArray.rename support the old and new names being the same (GH724).
  • Fix from_dataframe() for DataFrames with a Categorical column and a MultiIndex index (GH737).
  • Fixes to ensure xarray works properly after the upcoming pandas v0.18 and NumPy v1.11 releases.
Acknowledgments

The following individuals contributed to this release:

  • Edward Richards
  • Maximilian Roos
  • Rafael Guedes
  • Spencer Hill
  • Stephan Hoyer

v0.7.0 (21 January 2016)

This major release includes redesign of DataArray internals, as well as new methods for reshaping, rolling and shifting data. It includes preliminary support for pandas.MultiIndex, as well as a number of other features and bug fixes, several of which offer improved compatibility with pandas.

New name

The project formerly known as “xray” is now “xarray”, pronounced “x-array”! This avoids a namespace conflict with the entire field of x-ray science. Renaming our project seemed like the right thing to do, especially because some scientists who work with actual x-rays are interested in using this project in their work. Thanks for your understanding and patience in this transition. You can now find our documentation and code repository at new URLs.

To ease the transition, we have simultaneously released v0.7.0 of both xray and xarray on the Python Package Index. These packages are identical. For now, import xray still works, except it issues a deprecation warning. This will be the last xray release. Going forward, we recommend switching your import statements to import xarray as xr.

Breaking changes
  • The internal data model used by DataArray has been rewritten to fix several outstanding issues (GH367, GH634, this stackoverflow report). Internally, DataArray is now implemented in terms of ._variable and ._coords attributes instead of holding variables in a Dataset object.

    This refactor ensures that if a DataArray has the same name as one of its coordinates, the array and the coordinate no longer share the same data.

    In practice, this means that creating a DataArray with the same name as one of its dimensions no longer automatically uses that array to label the corresponding coordinate. You will now need to provide coordinate labels explicitly. Here’s the old behavior:

    In [19]: xray.DataArray([4, 5, 6], dims='x', name='x')
    Out[19]: 
    <xray.DataArray 'x' (x: 3)>
    array([4, 5, 6])
    Coordinates:
      * x        (x) int64 4 5 6
    

    and the new behavior (compare the values of the x coordinate):

    In [20]: xray.DataArray([4, 5, 6], dims='x', name='x')
    Out[20]: 
    <xray.DataArray 'x' (x: 3)>
    array([4, 5, 6])
    Coordinates:
      * x        (x) int64 0 1 2
    
  • It is no longer possible to convert a DataArray to a Dataset with xray.DataArray.to_dataset() if it is unnamed: this now raises ValueError. For unnamed arrays, supply the name argument, as sketched below.
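
    A minimal sketch (the variable name 'foo' is arbitrary):

    arr = xray.DataArray([1, 2, 3], dims='x')
    # arr.to_dataset() now raises ValueError because the array is unnamed
    ds = arr.to_dataset(name='foo')  # 'foo' becomes the variable name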

Enhancements
  • Basic support for MultiIndex coordinates on xray objects, including indexing, stack() and unstack():

    In [21]: df = pd.DataFrame({'foo': range(3),
       ....:                    'x': ['a', 'b', 'b'],
       ....:                    'y': [0, 0, 1]})
       ....: 
    
    In [22]: s = df.set_index(['x', 'y'])['foo']
    
    In [23]: arr = xray.DataArray(s, dims='z')
    
    In [24]: arr
    Out[24]: 
    <xray.DataArray 'foo' (z: 3)>
    array([0, 1, 2])
    Coordinates:
      * z        (z) object ('a', 0) ('b', 0) ('b', 1)
    
    In [25]: arr.indexes['z']
    Out[25]: 
    MultiIndex(levels=[[u'a', u'b'], [0, 1]],
               labels=[[0, 1, 1], [0, 0, 1]],
               names=[u'x', u'y'])
    
    In [26]: arr.unstack('z')
    Out[26]: 
    <xray.DataArray 'foo' (x: 2, y: 2)>
    array([[  0.,  nan],
           [  1.,   2.]])
    Coordinates:
      * x        (x) object 'a' 'b'
      * y        (y) int64 0 1
    
    In [27]: arr.unstack('z').stack(z=('x', 'y'))
    Out[27]: 
    <xray.DataArray 'foo' (z: 4)>
    array([  0.,  nan,   1.,   2.])
    Coordinates:
      * z        (z) object ('a', 0) ('a', 1) ('b', 0) ('b', 1)
    

    See Stack and unstack for more details.

    Warning

    xray’s MultiIndex support is still experimental, and we have a long to-do list of desired additions (GH719), including better display of multi-index levels when printing a Dataset, and support for saving datasets with a MultiIndex to a netCDF file. User contributions in this area would be greatly appreciated.

  • Support for reading GRIB, HDF4 and other file formats via PyNIO. See Formats supported by PyNIO for more details.
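
    For instance, once PyNIO is installed, something along these lines should work (the file name is hypothetical):

    ds = xray.open_dataset('example.grib', engine='pynio')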

  • Better error message when a variable is supplied with the same name as one of its dimensions.

  • Plotting: more control on colormap parameters (GH642). vmin and vmax will not be silently ignored anymore. Setting center=False prevents automatic selection of a divergent colormap.
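
    A hedged sketch, assuming a 2D DataArray da with numeric values:

    da.plot(vmin=0, vmax=10)  # explicit color limits are honored
    da.plot(center=False)     # suppress the automatic divergent colormap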

  • New shift() and roll() methods for shifting/rotating datasets or arrays along a dimension:

    In [28]: array = xray.DataArray([5, 6, 7, 8], dims='x')
    
    In [29]: array.shift(x=2)
    Out[29]: 
    <xarray.DataArray (x: 4)>
    array([ nan,  nan,   5.,   6.])
    Dimensions without coordinates: x
    
    In [30]: array.roll(x=2)
    Out[30]: 
    <xarray.DataArray (x: 4)>
    array([7, 8, 5, 6])
    Dimensions without coordinates: x
    

    Notice that shift moves data independently of coordinates, but roll moves both data and coordinates.

  • Assigning a pandas object directly as a Dataset variable is now permitted. Its index names correspond to the dims of the Dataset, and its data is aligned.
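
    A minimal sketch (the names are arbitrary):

    import pandas as pd

    s = pd.Series([1.0, 2.0, 3.0], index=pd.Index([10, 20, 30], name='x'))
    ds = xray.Dataset()
    ds['foo'] = s  # the index name 'x' supplies the dimension; data is aligned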

  • Passing a pandas.DataFrame or pandas.Panel to a Dataset constructor is now permitted.

  • New function broadcast() for explicitly broadcasting DataArray and Dataset objects against each other. For example:

    In [31]: a = xray.DataArray([1, 2, 3], dims='x')
    
    In [32]: b = xray.DataArray([5, 6], dims='y')
    
    In [33]: a
    Out[33]: 
    <xarray.DataArray (x: 3)>
    array([1, 2, 3])
    Dimensions without coordinates: x
    
    In [34]: b
    Out[34]: 
    <xarray.DataArray (y: 2)>
    array([5, 6])
    Dimensions without coordinates: y
    
    In [35]: a2, b2 = xray.broadcast(a, b)
    
    In [36]: a2
    Out[36]: 
    <xarray.DataArray (x: 3, y: 2)>
    array([[1, 1],
           [2, 2],
           [3, 3]])
    Dimensions without coordinates: x, y
    
    In [37]: b2
    Out[37]: 
    <xarray.DataArray (x: 3, y: 2)>
    array([[5, 6],
           [5, 6],
           [5, 6]])
    Dimensions without coordinates: x, y
    
Bug fixes
  • Fixes for several issues found on DataArray objects with the same name as one of their coordinates (see Breaking changes for more details).
  • DataArray.to_masked_array always returns a masked array with the mask being an array (not a scalar value) (GH684)
  • Allows for (imperfect) repr of Coords when underlying index is PeriodIndex (GH645).
  • Attempting to assign a Dataset or DataArray variable/attribute using attribute-style syntax (e.g., ds.foo = 42) now raises an error rather than silently failing (GH656, GH714).
  • You can now pass pandas objects with non-numpy dtypes (e.g., categorical or datetime64 with a timezone) into xray without an error (GH716).
Acknowledgments

The following individuals contributed to this release:

  • Antony Lee
  • Fabien Maussion
  • Joe Hamman
  • Maximilian Roos
  • Stephan Hoyer
  • Takeshi Kanmae
  • femtotrader

v0.6.1 (21 October 2015)

This release contains a number of bug and compatibility fixes, as well as enhancements to plotting, indexing and writing files to disk.

Note that the minimum required version of dask for use with xray is now version 0.6.

API Changes
  • The handling of colormaps and discrete color lists for 2D plots in plot() was changed to provide more compatibility with matplotlib’s contour and contourf functions (GH538). Now discrete lists of colors should be specified using the colors keyword, rather than cmap.
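
    A hedged sketch, assuming a 2D DataArray da (the level boundaries and color names are illustrative):

    da.plot(levels=[0, 1, 2, 3], colors=['white', 'lightblue', 'blue'])
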
Enhancements
  • Faceted plotting through FacetGrid and the plot() method. See Faceting for more details and examples.

  • sel() and reindex() now support the tolerance argument for controlling nearest-neighbor selection (GH629):

    In [38]: array = xray.DataArray([1, 2, 3], dims='x')
    
    In [39]: array.reindex(x=[0.9, 1.5], method='nearest', tolerance=0.2)
    Out[39]: 
    <xray.DataArray (x: 2)>
    array([  2.,  nan])
    Coordinates:
      * x        (x) float64 0.9 1.5
    

    This feature requires pandas v0.17 or newer.

  • New encoding argument in to_netcdf() for writing netCDF files with compression, as described in the new documentation section on Writing encoded data.
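
    For example, a minimal sketch (the variable name 'foo' and the compression settings are illustrative):

    ds.to_netcdf('out.nc', encoding={'foo': {'zlib': True, 'complevel': 4}})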

  • Add real and imag attributes to Dataset and DataArray (GH553).

  • More informative error message with from_dataframe() if the frame has duplicate columns.

  • xray now uses deterministic names for dask arrays it creates or opens from disk. This allows xray users to take advantage of dask’s nascent support for caching intermediate computation results. See GH555 for an example.

Bug fixes
  • Forwards compatibility with the latest pandas release (v0.17.0). We were using some internal pandas routines for datetime conversion, which unfortunately have now changed upstream (GH569).
  • Aggregation functions now correctly skip NaN for data with complex128 dtype (GH554).
  • Fixed indexing 0d arrays with unicode dtype (GH568).
  • name and Dataset keys must be a string or None to be written to netCDF (GH533).
  • where() now uses dask instead of numpy if either the array or other is a dask array. Previously, if other was a numpy array the method was evaluated eagerly.
  • Global attributes are now handled more consistently when loading remote datasets using engine='pydap' (GH574).
  • It is now possible to assign to the .data attribute of DataArray objects.
  • coordinates attribute is now kept in the encoding dictionary after decoding (GH610).
  • Compatibility with numpy 1.10 (GH617).
Acknowledgments

The following individuals contributed to this release:

  • Ryan Abernathey
  • Pete Cable
  • Clark Fitzgerald
  • Joe Hamman
  • Stephan Hoyer
  • Scott Sinclair

v0.6.0 (21 August 2015)

This release includes numerous bug fixes and enhancements. Highlights include the introduction of a plotting module and the new Dataset and DataArray methods isel_points(), sel_points(), where() and diff(). There are no breaking changes from v0.5.2.

Enhancements
  • Plotting methods have been implemented on DataArray objects via plot(), through integration with matplotlib (GH185). For an introduction, see Plotting.

  • Variables in netCDF files with multiple missing values are now decoded as NaN after issuing a warning if open_dataset is called with mask_and_scale=True.

  • We clarified our rules for when the result from an xray operation is a copy vs. a view (see copies vs views for more details).

  • Dataset variables are now written to netCDF files in order of appearance when using the netcdf4 backend (GH479).

  • Added isel_points() and sel_points() to support pointwise indexing of Datasets and DataArrays (GH475).

    In [40]: da = xray.DataArray(np.arange(56).reshape((7, 8)),
       ....:                     coords={'x': list('abcdefg'),
       ....:                             'y': 10 * np.arange(8)},
       ....:                     dims=['x', 'y'])
       ....: 
    
    In [41]: da
    Out[41]: 
    <xray.DataArray (x: 7, y: 8)>
    array([[ 0,  1,  2,  3,  4,  5,  6,  7],
           [ 8,  9, 10, 11, 12, 13, 14, 15],
           [16, 17, 18, 19, 20, 21, 22, 23],
           [24, 25, 26, 27, 28, 29, 30, 31],
           [32, 33, 34, 35, 36, 37, 38, 39],
           [40, 41, 42, 43, 44, 45, 46, 47],
           [48, 49, 50, 51, 52, 53, 54, 55]])
    Coordinates:
      * y        (y) int64 0 10 20 30 40 50 60 70
      * x        (x) |S1 'a' 'b' 'c' 'd' 'e' 'f' 'g'
    
    # we can index by position along each dimension
    In [42]: da.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim='points')
    Out[42]: 
    <xray.DataArray (points: 3)>
    array([ 0,  9, 48])
    Coordinates:
        y        (points) int64 0 10 0
        x        (points) |S1 'a' 'b' 'g'
      * points   (points) int64 0 1 2
    
    # or equivalently by label
    In [43]: da.sel_points(x=['a', 'b', 'g'], y=[0, 10, 0], dim='points')
    Out[43]: 
    <xray.DataArray (points: 3)>
    array([ 0,  9, 48])
    Coordinates:
        y        (points) int64 0 10 0
        x        (points) |S1 'a' 'b' 'g'
      * points   (points) int64 0 1 2
    
  • New where() method for masking xray objects according to some criteria. This works particularly well with multi-dimensional data:

    In [44]: ds = xray.Dataset(coords={'x': range(100), 'y': range(100)})
    
    In [45]: ds['distance'] = np.sqrt(ds.x ** 2 + ds.y ** 2)
    
    In [46]: ds.distance.where(ds.distance < 100).plot()
    Out[46]: <matplotlib.collections.QuadMesh at 0x7f10ecd3d550>
    
    (figure: where_example.png, the masked distance field plotted above)
  • Added new methods DataArray.diff and Dataset.diff for finite difference calculations along a given axis.

  • New to_masked_array() convenience method for returning a numpy.ma.MaskedArray.

    In [47]: da = xray.DataArray(np.random.random_sample(size=(5, 4)))
    
    In [48]: da.where(da < 0.5)
    Out[48]: 
    <xarray.DataArray (dim_0: 5, dim_1: 4)>
    array([[ 0.12697 ,       nan,  0.260476,       nan],
           [ 0.37675 ,  0.336222,  0.451376,       nan],
           [ 0.123102,       nan,  0.373012,  0.447997],
           [ 0.129441,       nan,       nan,  0.352054],
           [ 0.228887,       nan,       nan,  0.137554]])
    Dimensions without coordinates: dim_0, dim_1
    
    In [49]: da.where(da < 0.5).to_masked_array(copy=True)
    Out[49]: 
    masked_array(data =
     [[0.12696983303810094 -- 0.26047600586578334 --]
     [0.37674971618967135 0.33622174433445307 0.45137647047539964 --]
     [0.12310214428849964 -- 0.37301222522143085 0.4479968246859435]
     [0.12944067971751294 -- -- 0.35205353914802473]
     [0.2288873043216132 -- -- 0.1375535565632705]],
                 mask =
     [[False  True False  True]
     [False False False  True]
     [False  True False False]
     [False  True  True False]
     [False  True  True False]],
           fill_value = 1e+20)
    
  • Added a new drop_variables argument to open_dataset() for excluding variables from being parsed. This may be useful to drop variables with problems or inconsistent values.
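
    A minimal sketch (the file and variable names are hypothetical):

    ds = xray.open_dataset('example.nc', drop_variables=['bad_var'])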

Bug fixes
  • Fixed aggregation functions (e.g., sum and mean) on big-endian arrays when bottleneck is installed (GH489).
  • Dataset aggregation functions dropped variables with unsigned integer dtype (GH505).
  • .any() and .all() were not lazy when used on xray objects containing dask arrays.
  • Fixed an error when attempting to save datetime64 variables to netCDF files when the first element is NaT (GH528).
  • Fix pickle on DataArray objects (GH515).
  • Fixed unnecessary coercion of float64 to float32 when using netcdf3 and netcdf4_classic formats (GH526).

v0.5.2 (16 July 2015)

This release contains bug fixes, several additional options for opening and saving netCDF files, and a backwards incompatible rewrite of the advanced options for xray.concat.

Backwards incompatible changes
  • The optional arguments concat_over and mode in concat() have been removed and replaced by data_vars and coords. The new arguments are both more easily understood and more robustly implemented, and allowed us to fix a bug where concat accidentally loaded data into memory. If you set values for these optional arguments manually, you will need to update your code. The default behavior should be unchanged.
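
    A hedged sketch of the new arguments (datasets stands for a list of Dataset objects):

    # only concatenate those variables that actually differ between datasets
    combined = xray.concat(datasets, dim='time',
                           data_vars='minimal', coords='minimal')
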
Enhancements
  • open_mfdataset() now supports a preprocess argument for preprocessing datasets prior to concatenation. This is useful if datasets cannot be otherwise merged automatically, e.g., if the original datasets have conflicting index coordinates (GH443).
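
    A hedged sketch (the glob pattern and coordinate name are hypothetical):

    def drop_conflicting(ds):
        # fix-up applied to each file's dataset before concatenation
        return ds.drop('bad_coord')

    ds = xray.open_mfdataset('files/*.nc', preprocess=drop_conflicting)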

  • open_dataset() and open_mfdataset() now use a global thread lock by default for reading from netCDF files with dask. This avoids possible segmentation faults for reading from netCDF4 files when HDF5 is not configured properly for concurrent access (GH444).

  • Added support for serializing arrays of complex numbers with engine='h5netcdf'.

  • The new save_mfdataset() function allows for saving multiple datasets to disk simultaneously. This is useful when processing large datasets with dask.array. For example, to save a dataset too big to fit into memory to one file per year, we could write:

    In [50]: years, datasets = zip(*ds.groupby('time.year'))
    
    In [51]: paths = ['%s.nc' % y for y in years]
    
    In [52]: xray.save_mfdataset(datasets, paths)
    
Bug fixes
  • Fixed min, max, argmin and argmax for arrays with string or unicode types (GH453).
  • open_dataset() and open_mfdataset() support supplying chunks as a single integer.
  • Fixed a bug in serializing scalar datetime variable to netCDF.
  • Fixed a bug that could occur in serialization of 0-dimensional integer arrays.
  • Fixed a bug where concatenating DataArrays was not always lazy (GH464).
  • When reading datasets with h5netcdf, bytes attributes are decoded to strings. This allows conventions decoding to work properly on Python 3 (GH451).

v0.5.1 (15 June 2015)

This minor release fixes a few bugs and an inconsistency with pandas. It also adds the pipe method, copied from pandas.

Enhancements
  • Added pipe(), replicating the new pandas method in version 0.16.2. See Transforming datasets for more details.
  • assign() and assign_coords() now assign new variables in sorted (alphabetical) order, mirroring the behavior in pandas. Previously, the order was arbitrary.
Bug fixes
  • Fixed xray.concat failing in an edge case involving identical coordinate variables (GH425).
  • We now decode variables loaded from netCDF3 files with the scipy engine using native endianness (GH416). This resolves an issue when aggregating these arrays with bottleneck installed.

v0.5 (1 June 2015)

Highlights

The headline feature in this release is experimental support for out-of-core computing (data that doesn’t fit into memory) with dask. This includes a new top-level function open_mfdataset() that makes it easy to open a collection of netCDF files (using dask) as a single xray.Dataset object. For more on dask, read the blog post introducing xray + dask and the new documentation section Parallel computing with dask.

Dask makes it possible to harness parallelism and manipulate gigantic datasets with xray. It is currently an optional dependency, but it may become required in the future.

Backwards incompatible changes
  • The logic used for choosing which variables are concatenated with concat() has changed. Previously, by default any variables which were equal across a dimension were not concatenated. This led to some surprising behavior, where the outcome of groupby and concat operations could depend on runtime values (GH268). For example:

    In [53]: ds = xray.Dataset({'x': 0})
    
    In [54]: xray.concat([ds, ds], dim='y')
    Out[54]: 
    <xray.Dataset>
    Dimensions:  ()
    Coordinates:
        *empty*
    Data variables:
        x        int64 0
    

    Now, the default always concatenates data variables:

    In [55]: xray.concat([ds, ds], dim='y')
    Out[55]: 
    <xarray.Dataset>
    Dimensions:  (y: 2)
    Dimensions without coordinates: y
    Data variables:
        x        (y) int64 0 0
    

    To obtain the old behavior, supply the argument concat_over=[].

Enhancements
  • New to_array() and enhanced to_dataset() methods make it easy to switch back and forth between arrays and datasets:

    In [56]: ds = xray.Dataset({'a': 1, 'b': ('x', [1, 2, 3])},
       ....:                   coords={'c': 42}, attrs={'Conventions': 'None'})
       ....: 
    
    In [57]: ds.to_array()
    Out[57]: 
    <xarray.DataArray (variable: 2, x: 3)>
    array([[1, 1, 1],
           [1, 2, 3]])
    Coordinates:
        c         int64 42
      * variable  (variable) <U1 'a' 'b'
    Dimensions without coordinates: x
    Attributes:
        Conventions:  None
    
    In [58]: ds.to_array().to_dataset(dim='variable')
    Out[58]: 
    <xarray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
        c        int64 42
    Dimensions without coordinates: x
    Data variables:
        a        (x) int64 1 1 1
        b        (x) int64 1 2 3
    Attributes:
        Conventions:  None
    
  • New fillna() method to fill missing values, modeled off the pandas method of the same name:

    In [59]: array = xray.DataArray([np.nan, 1, np.nan, 3], dims='x')
    
    In [60]: array.fillna(0)
    Out[60]: 
    <xarray.DataArray (x: 4)>
    array([ 0.,  1.,  0.,  3.])
    Dimensions without coordinates: x
    

    fillna works on both Dataset and DataArray objects, and uses index based alignment and broadcasting like standard binary operations. It also can be applied by group, as illustrated in Fill missing values with climatology.

  • New assign() and assign_coords() methods patterned off the new DataFrame.assign method in pandas:

    In [61]: ds = xray.Dataset({'y': ('x', [1, 2, 3])})
    
    In [62]: ds.assign(z = lambda ds: ds.y ** 2)
    Out[62]: 
    <xarray.Dataset>
    Dimensions:  (x: 3)
    Dimensions without coordinates: x
    Data variables:
        y        (x) int64 1 2 3
        z        (x) int64 1 4 9
    
    In [63]: ds.assign_coords(z = ('x', ['a', 'b', 'c']))
    Out[63]: 
    <xarray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
        z        (x) <U1 'a' 'b' 'c'
    Dimensions without coordinates: x
    Data variables:
        y        (x) int64 1 2 3
    

    These methods return a new Dataset (or DataArray) with updated data or coordinate variables.

  • sel() now supports the method parameter, which works like the parameter of the same name on reindex(). It provides a simple interface for doing nearest-neighbor interpolation:

    In [64]: ds.sel(x=1.1, method='nearest')
    Out[64]: 
    <xray.Dataset>
    Dimensions:  ()
    Coordinates:
        x        int64 1
    Data variables:
        y        int64 2
    
    In [65]: ds.sel(x=[1.1, 2.1], method='pad')
    Out[65]: 
    <xray.Dataset>
    Dimensions:  (x: 2)
    Coordinates:
      * x        (x) int64 1 2
    Data variables:
        y        (x) int64 2 3
    

    See Nearest neighbor lookups for more details.

  • You can now control the underlying backend used for accessing remote datasets (via OPeNDAP) by specifying engine='netcdf4' or engine='pydap'.

  • xray now provides experimental support for reading and writing netCDF4 files directly via h5py with the h5netcdf package, avoiding the netCDF4-Python package. You will need to install h5netcdf and specify engine='h5netcdf' to try this feature.
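
    A hedged sketch of both options (the URL and file name are hypothetical):

    remote = xray.open_dataset('http://example.com/dods/data', engine='pydap')
    local = xray.open_dataset('data.nc', engine='h5netcdf')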

  • Accessing data from remote datasets now has retrying logic (with exponential backoff) that should make it robust to occasional bad responses from DAP servers.

  • You can control the width of the Dataset repr with xray.set_options. It can be used either as a context manager, in which case the default is restored outside the context:

    In [66]: ds = xray.Dataset({'x': np.arange(1000)})
    
    In [67]: with xray.set_options(display_width=40):
       ....:     print(ds)
       ....: 
    <xarray.Dataset>
    Dimensions:  (x: 1000)
    Coordinates:
      * x        (x) int64 0 1 2 3 4 5 6 ...
    Data variables:
        *empty*
    

    Or to set a global option:

    In [68]: xray.set_options(display_width=80)
    

    The default value for the display_width option is 80.

Deprecations
  • The method load_data() has been renamed to the more succinct load().

v0.4.1 (18 March 2015)

This release contains bug fixes and several new features. All changes should be fully backwards compatible.

Enhancements
  • New documentation sections on Time series data and Formats supported by Pandas.

  • resample() lets you resample a dataset or data array to a new temporal resolution. The syntax is the same as pandas, except you need to supply the time dimension explicitly:

    In [69]: time = pd.date_range('2000-01-01', freq='6H', periods=10)
    
    In [70]: array = xray.DataArray(np.arange(10), [('time', time)])
    
    In [71]: array.resample('1D', dim='time')
    Out[71]: 
    <xarray.DataArray (time: 3)>
    array([ 1.5,  5.5,  8.5])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
    

    You can specify how to do the resampling with the how argument and other options such as closed and label let you control labeling:

    In [72]: array.resample('1D', dim='time', how='sum', label='right')
    Out[72]: 
    <xarray.DataArray (time: 3)>
    array([ 6, 22, 17])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
    

    If the desired temporal resolution is higher than the original data (upsampling), xray will insert missing values:

    In [73]: array.resample('3H', 'time')
    Out[73]: 
    <xarray.DataArray (time: 19)>
    array([  0.,  nan,   1.,  nan,   2.,  nan,   3.,  nan,   4.,  nan,   5.,  nan,
             6.,  nan,   7.,  nan,   8.,  nan,   9.])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-01T03:00:00 ...
    
  • first and last methods on groupby objects let you take the first or last examples from each group along the grouped axis:

    In [74]: array.groupby('time.day').first()
    Out[74]: 
    <xarray.DataArray (day: 3)>
    array([0, 4, 8])
    Coordinates:
      * day      (day) int64 1 2 3
    

    These methods combine well with resample:

    In [75]: array.resample('1D', dim='time', how='first')
    Out[75]: 
    <xarray.DataArray (time: 3)>
    array([0, 4, 8])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
    
  • swap_dims() allows for easily swapping one dimension out for another:

    In [76]: ds = xray.Dataset({'x': range(3), 'y': ('x', list('abc'))})
    
    In [77]: ds
    Out[77]: 
    <xarray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
      * x        (x) int64 0 1 2
    Data variables:
        y        (x) <U1 'a' 'b' 'c'
    
    In [78]: ds.swap_dims({'x': 'y'})
    Out[78]: 
    <xarray.Dataset>
    Dimensions:  (y: 3)
    Coordinates:
        x        (y) int64 0 1 2
      * y        (y) <U1 'a' 'b' 'c'
    Data variables:
        *empty*
    

    This was possible in earlier versions of xray, but required some contortions.

  • open_dataset() and to_netcdf() now accept an engine argument to explicitly select which underlying library (netcdf4 or scipy) is used for reading/writing a netCDF file.
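
    A minimal sketch (the file names are hypothetical):

    ds = xray.open_dataset('data.nc', engine='scipy')  # force the scipy backend
    ds.to_netcdf('copy.nc', engine='netcdf4')          # write with netCDF4-Python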

Bug fixes
  • Fixed a bug where data netCDF variables read from disk with engine='scipy' could still be associated with the file on disk, even after closing the file (GH341). This manifested itself in warnings about mmapped arrays and segmentation faults (if the data was accessed).
  • Silenced spurious warnings about all-NaN slices when using nan-aware aggregation methods (GH344).
  • Dataset aggregations with keep_attrs=True now preserve attributes on data variables, not just the dataset itself.
  • Tests for xray now pass when run on Windows (GH360).
  • Fixed a regression in v0.4 where saving to netCDF could fail with the error ValueError: could not automatically determine time units.

v0.4 (2 March, 2015)

This is one of the biggest releases yet for xray: it includes some major changes that may break existing code, along with the usual collection of minor enhancements and bug fixes. On the plus side, this release includes all hitherto planned breaking changes, so the upgrade path for xray should be smoother going forward.

Breaking changes
  • We now automatically align index labels in arithmetic, dataset construction, merging and updating. This means the need for manually invoking methods like align() and reindex_like() should be vastly reduced.

    For arithmetic, we align based on the intersection of labels:

    In [79]: lhs = xray.DataArray([1, 2, 3], [('x', [0, 1, 2])])
    
    In [80]: rhs = xray.DataArray([2, 3, 4], [('x', [1, 2, 3])])
    
    In [81]: lhs + rhs
    Out[81]: 
    <xarray.DataArray (x: 2)>
    array([4, 6])
    Coordinates:
      * x        (x) int64 1 2
    

    For dataset construction and merging, we align based on the union of labels:

    In [82]: xray.Dataset({'foo': lhs, 'bar': rhs})
    Out[82]: 
    <xarray.Dataset>
    Dimensions:  (x: 4)
    Coordinates:
      * x        (x) int64 0 1 2 3
    Data variables:
        foo      (x) float64 1.0 2.0 3.0 nan
        bar      (x) float64 nan 2.0 3.0 4.0
    

    For update and __setitem__, we align based on the original object:

    In [83]: lhs.coords['rhs'] = rhs
    
    In [84]: lhs
    Out[84]: 
    <xarray.DataArray (x: 3)>
    array([1, 2, 3])
    Coordinates:
      * x        (x) int64 0 1 2
        rhs      (x) float64 nan 2.0 3.0
    
  • Aggregations like mean or median now skip missing values by default:

    In [85]: xray.DataArray([1, 2, np.nan, 3]).mean()
    Out[85]: 
    <xarray.DataArray ()>
    array(2.0)
    

    You can turn this behavior off by supplying the keyword argument skipna=False.

    These operations are lightning fast thanks to integration with bottleneck, which is a new optional dependency for xray (numpy is used if bottleneck is not installed).

  • Scalar coordinates no longer conflict with constant arrays with the same value (e.g., in arithmetic, merging datasets and concat), even if they have different shape (GH243). For example, the coordinate c here persists through arithmetic, even though it has different shapes on each DataArray:

    In [86]: a = xray.DataArray([1, 2], coords={'c': 0}, dims='x')
    
    In [87]: b = xray.DataArray([1, 2], coords={'c': ('x', [0, 0])}, dims='x')
    
    In [88]: (a + b).coords
    Out[88]: 
    Coordinates:
        c        (x) int64 0 0
    

    This functionality can be controlled through the compat option, which has also been added to the Dataset constructor.

  • Datetime shortcuts such as 'time.month' now return a DataArray with the name 'month', not 'time.month' (GH345). This makes it easier to index the resulting arrays when they are used with groupby:

    In [89]: time = xray.DataArray(pd.date_range('2000-01-01', periods=365),
       ....:                       dims='time', name='time')
       ....: 
    
    In [90]: counts = time.groupby('time.month').count()
    
    In [91]: counts.sel(month=2)
    Out[91]: 
    <xarray.DataArray 'time' ()>
    array(29)
    Coordinates:
        month    int64 2
    

    Previously, you would need to use something like counts.sel(**{'time.month': 2}), which is much more awkward.

  • The season datetime shortcut now returns an array of string labels such as ‘DJF’:

    In [92]: ds = xray.Dataset({'t': pd.date_range('2000-01-01', periods=12, freq='M')})
    
    In [93]: ds['t.season']
    Out[93]: 
    <xarray.DataArray 'season' (t: 12)>
    array(['DJF', 'DJF', 'MAM', 'MAM', 'MAM', 'JJA', 'JJA', 'JJA', 'SON', 'SON',
           'SON', 'DJF'],
          dtype='<U3')
    Coordinates:
      * t        (t) datetime64[ns] 2000-01-31 2000-02-29 2000-03-31 2000-04-30 ...
    

    Previously, it returned numbered seasons 1 through 4.

  • We have updated our use of the terms “coordinates” and “variables”. What were known in previous versions of xray as “coordinates” and “variables” are now referred to throughout the documentation as “coordinate variables” and “data variables”. This brings xray in closer alignment to CF Conventions. The only visible change besides the documentation is that Dataset.vars has been renamed Dataset.data_vars.

  • You will need to update your code if you have been ignoring deprecation warnings: methods and attributes that were deprecated in xray v0.3 or earlier (e.g., dimensions, attributes) have gone away.

Enhancements
  • Support for reindex() with a fill method. This provides a useful shortcut for upsampling:

    In [94]: data = xray.DataArray([1, 2, 3], [('x', range(3))])
    
    In [95]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
    Out[95]: 
    <xarray.DataArray (x: 5)>
    array([1, 2, 2, 3, 3])
    Coordinates:
      * x        (x) float64 0.5 1.0 1.5 2.0 2.5
    

    This will be especially useful once pandas 0.16 is released, at which point xray will immediately support reindexing with method='nearest'.

  • Use functions that return generic ndarrays with DataArray.groupby.apply and Dataset.apply (GH327 and GH329). Thanks Jeff Gerard!

  • Consolidated the functionality of dumps (writing a dataset to a netCDF3 bytestring) into to_netcdf() (GH333).

  • to_netcdf() now supports writing to groups in netCDF4 files (GH333). It also finally has a full docstring – you should read it!

  • open_dataset() and to_netcdf() now work on netCDF3 files when netcdf4-python is not installed as long as scipy is available (GH333).

  • The new Dataset.drop and DataArray.drop methods make it easy to drop explicitly listed variables or index labels:

    # drop variables
    In [96]: ds = xray.Dataset({'x': 0, 'y': 1})
    
    In [97]: ds.drop('x')
    Out[97]: 
    <xarray.Dataset>
    Dimensions:  ()
    Data variables:
        y        int64 1
    
    # drop index labels
    In [98]: arr = xray.DataArray([1, 2, 3], coords=[('x', list('abc'))])
    
    In [99]: arr.drop(['a', 'c'], dim='x')
    Out[99]: 
    <xarray.DataArray (x: 1)>
    array([2])
    Coordinates:
      * x        (x) <U1 'b'
    
  • broadcast_equals() has been added to correspond to the new compat option.

  • Long attributes are now truncated at 500 characters when printing a dataset (GH338). This should make things more convenient for working with datasets interactively.

  • Added a new documentation example, Calculating Seasonal Averages from Timeseries of Monthly Means. Thanks Joe Hamman!

Bug fixes
  • Several bug fixes related to decoding time units from netCDF files (GH316, GH330). Thanks Stefan Pfenninger!
  • xray no longer requires decode_coords=False when reading datasets with unparseable coordinate attributes (GH308).
  • Fixed DataArray.loc indexing with ... (GH318).
  • Fixed an edge case that resulted in an error when reindexing multi-dimensional variables (GH315).
  • Slicing with negative step sizes (GH312).
  • Invalid conversion of string arrays to numeric dtype (GH305).
  • Fixed repr() on dataset objects with non-standard dates (GH347).
Deprecations
  • dump and dumps have been deprecated in favor of to_netcdf().
  • drop_vars has been deprecated in favor of drop().
Future plans

The biggest feature I’m excited about working toward in the immediate future is supporting out-of-core operations in xray using Dask, a part of the Blaze project. For a preview of using Dask with weather data, read this blog post by Matthew Rocklin. See GH328 for more details.

v0.3.2 (23 December, 2014)

This release focused on bug-fixes, speedups and resolving some niggling inconsistencies.

There are a few cases where the behavior of xray differs from the previous version. However, I expect that in almost all cases your code will continue to run unmodified.

Warning

xray now requires pandas v0.15.0 or later. This was necessary for supporting TimedeltaIndex without too many painful hacks.

Backwards incompatible changes
  • Arrays of datetime.datetime objects are now automatically cast to datetime64[ns] arrays when stored in an xray object, using machinery borrowed from pandas:

    In [100]: from datetime import datetime
    
    In [101]: xray.Dataset({'t': [datetime(2000, 1, 1)]})
    Out[101]: 
    <xarray.Dataset>
    Dimensions:  (t: 1)
    Coordinates:
      * t        (t) datetime64[ns] 2000-01-01
    Data variables:
        *empty*
    
  • xray now has support (including serialization to netCDF) for TimedeltaIndex. datetime.timedelta objects are thus accordingly cast to timedelta64[ns] objects when appropriate.

  • Masked arrays are now properly coerced to use NaN as a sentinel value (GH259).

Enhancements
  • Due to popular demand, we have added experimental attribute style access as a shortcut for dataset variables, coordinates and attributes:

    In [102]: ds = xray.Dataset({'tmin': ([], 25, {'units': 'celcius'})})
    
    In [103]: ds.tmin.units
    Out[103]: 'celcius'
    

    Tab-completion for these variables should work in editors such as IPython. However, setting variables or attributes in this fashion is not yet supported because there are some unresolved ambiguities (GH300).

  • You can now use a dictionary for indexing with labeled dimensions. This provides a safe way to do assignment with labeled dimensions:

    In [104]: array = xray.DataArray(np.zeros(5), dims=['x'])
    
    In [105]: array[dict(x=slice(3))] = 1
    
    In [106]: array
    Out[106]: 
    <xarray.DataArray (x: 5)>
    array([ 1.,  1.,  1.,  0.,  0.])
    Dimensions without coordinates: x
    
  • Non-index coordinates can now be faithfully written to and restored from netCDF files. This is done according to CF conventions when possible by using the coordinates attribute on a data variable. When not possible, xray defines a global coordinates attribute.

  • Preliminary support for converting xray.DataArray objects to and from CDAT cdms2 variables.

  • We sped up any operation that involves creating a new Dataset or DataArray (e.g., indexing, aggregation, arithmetic) by 30 to 50%. The full speed up requires cyordereddict to be installed.

Bug fixes
  • Fix for to_dataframe() with 0d string/object coordinates (GH287)
  • Fix for to_netcdf with 0d string variable (GH284)
  • Fix writing datetime64 arrays to netcdf if NaT is present (GH270)
  • Fix align silently upcasts data arrays when NaNs are inserted (GH264)
Future plans
  • I am contemplating switching to the terms “coordinate variables” and “data variables” instead of the (currently used) “coordinates” and “variables”, following their use in CF Conventions (GH293). This would mostly have implications for the documentation, but I would also change the Dataset attribute vars to data.
  • I am no longer certain that automatic label alignment for arithmetic would be a good idea for xray – it is a feature from pandas that I have not missed (GH186).
  • The main API breakage that I do anticipate in the next release is finally making all aggregation operations skip missing values by default (GH130). I’m pretty sick of writing ds.reduce(np.nanmean, 'time').
  • The next version of xray (0.4) will remove deprecated features and aliases whose use currently raises a warning.

If you have opinions about any of these anticipated changes, I would love to hear them – please add a note to any of the referenced GitHub issues.

v0.3.1 (22 October, 2014)

This is mostly a bug-fix release to make xray compatible with the latest release of pandas (v0.15).

We added several features to better support working with missing values and exporting xray objects to pandas. We also reorganized the internal API for serializing and deserializing datasets, but this change should be almost entirely transparent to users.

Other than breaking the experimental DataStore API, there should be no backwards incompatible changes.

New features
  • Added count() and dropna() methods, copied from pandas, for working with missing values (GH247, GH58); a minimal sketch follows this list.
  • Added DataArray.to_pandas for converting a data array into the pandas object with the same dimensionality (1D to Series, 2D to DataFrame, etc.) (GH255).
  • Support for reading gzipped netCDF3 files (GH239).
  • Reduced memory usage when writing netCDF files (GH251).
  • ‘missing_value’ is now supported as an alias for the ‘_FillValue’ attribute on netCDF variables (GH245).
  • Trivial indexes, equivalent to range(n) where n is the length of the dimension, are no longer written to disk (GH245).
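
    A minimal sketch of the new missing value methods:

    import numpy as np

    arr = xray.DataArray([1.0, np.nan, 3.0], dims='x')
    arr.count()      # number of non-missing values: 2
    arr.dropna('x')  # drops the entry holding NaN
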
Bug fixes
  • Compatibility fixes for pandas v0.15 (GH262).
  • Fixes for display and indexing of NaT (not-a-time) (GH238, GH240)
  • Fix slicing by label when an argument is a data array (GH250).
  • Test data is now shipped with the source distribution (GH253).
  • Ensure order does not matter when doing arithmetic with scalar data arrays (GH254).
  • Order of dimensions preserved with DataArray.to_dataframe (GH260).

v0.3 (21 September 2014)

New features
  • Revamped coordinates: “coordinates” now refer to all arrays that are not used to index a dimension. Coordinates are intended to allow for keeping track of arrays of metadata that describe the grid on which the points in “variable” arrays lie. They are preserved (when unambiguous) even through mathematical operations.
  • Dataset math Dataset objects now support all arithmetic operations directly. Dataset-array operations map across all dataset variables; dataset-dataset operations act on each pair of variables with the same name.
  • GroupBy math: grouped objects now support arithmetic with ungrouped objects, which provides a convenient shortcut for normalizing by the average value of a group (see the sketch after this list).
  • The dataset __repr__ method has been entirely overhauled; dataset objects now show their values when printed.
  • You can now index a dataset with a list of variables to return a new dataset: ds[['foo', 'bar']].
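
    A hedged sketch of GroupBy math, assuming a DataArray da with a time dimension:

    # subtract each day-of-year mean to obtain anomalies
    climatology = da.groupby('time.dayofyear').mean('time')
    anomalies = da.groupby('time.dayofyear') - climatology
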
Backwards incompatible changes
  • Dataset.__eq__ and Dataset.__ne__ are now element-wise operations instead of comparing all values to obtain a single boolean. Use the method equals() instead.
Deprecations
  • Dataset.noncoords is deprecated: use Dataset.vars instead.
  • Dataset.select_vars deprecated: index a Dataset with a list of variable names instead.
  • DataArray.select_vars and DataArray.drop_vars deprecated: use reset_coords() instead.

v0.2 (14 August 2014)

This is a major release that includes some new features and quite a few bug fixes. Here are the highlights:

  • There is now a direct constructor for DataArray objects, which makes it possible to create a DataArray without using a Dataset. This is highlighted in the refreshed tutorial.
  • You can perform aggregation operations like mean directly on Dataset objects, thanks to Joe Hamman. These aggregation methods also work on grouped datasets.
  • xray now works on Python 2.6, thanks to Anna Kuznetsova.
  • A number of methods and attributes were given more sensible (usually shorter) names: labeled -> sel, indexed -> isel, select -> select_vars, unselect -> drop_vars, dimensions -> dims, coordinates -> coords, attributes -> attrs.
  • New load_data() and close() methods for datasets facilitate a lower level of control over data loaded from disk.

v0.1.1 (20 May 2014)

xray 0.1.1 is a bug-fix release that includes changes that should be almost entirely backwards compatible with v0.1:

  • Python 3 support (GH53)
  • Required numpy version relaxed to 1.7 (GH129)
  • Return numpy.datetime64 arrays for non-standard calendars (GH126)
  • Support for opening datasets associated with NetCDF4 groups (GH127)
  • Bug-fixes for concatenating datetime arrays (GH134)

Special thanks to new contributors Thomas Kluyver, Joe Hamman and Alistair Miles.

v0.1 (2 May 2014)

Initial release.

API reference

This page provides an auto-generated summary of xarray’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

Top-level functions

apply_ufunc(func, *args, input_core_dims, …) Apply a vectorized function for unlabeled arrays on xarray objects.
align(*objects[, join, copy, indexes, exclude]) Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes and dimension sizes.
broadcast(*args, **kwargs) Explicitly broadcast any number of DataArray or Dataset objects against one another.
concat(objs[, dim, data_vars, coords, …]) Concatenate xarray objects along a new or existing dimension.
merge(objects[, compat, join]) Merge any number of xarray objects into a single Dataset as variables.
where(cond, x, y) Return elements from x or y depending on cond.
set_options(**kwargs) Set options for xarray in a controlled context.
full_like(other, fill_value[, dtype]) Return a new object with the same shape and type as a given object.
zeros_like(other[, dtype]) Shorthand for full_like(other, 0, dtype)
ones_like(other[, dtype]) Shorthand for full_like(other, 1, dtype)
dot(*arrays[, dims]) Generalized dot product for xarray objects.

Dataset

Creating a dataset
Dataset([data_vars, coords, attrs, compat]) A multi-dimensional, in memory, array database.
decode_cf(obj[, concat_characters, …]) Decode the given Dataset or Datastore according to CF conventions into a new Dataset.
Attributes
Dataset.dims Mapping from dimension names to lengths.
Dataset.sizes Mapping from dimension names to lengths.
Dataset.data_vars Dictionary of xarray.DataArray objects corresponding to data variables
Dataset.coords Dictionary of xarray.DataArray objects corresponding to coordinate variables
Dataset.attrs Dictionary of global attributes on this dataset
Dataset.encoding Dictionary of global encoding attributes on this dataset
Dataset.indexes OrderedDict of pandas.Index objects used for label based indexing
Dataset.get_index(key) Get an index for a dimension, with fall-back to a default RangeIndex
Dataset.chunks Block dimensions for this dataset’s data or None if it’s not a dask array.
Dataset.nbytes
Dictionary interface

Datasets implement the mapping interface with keys given by variable names and values given by DataArray objects.

Dataset.__getitem__(key) Access variables or coordinates of this dataset as a DataArray.
Dataset.__setitem__(key, value) Add an array to this dataset.
Dataset.__delitem__(key) Remove a variable from this dataset.
Dataset.update(other[, inplace]) Update this dataset’s variables with those from another dataset.
Dataset.items()
Dataset.values()
Dataset contents
Dataset.copy([deep]) Returns a copy of this dataset.
Dataset.assign(**kwargs) Assign new data variables to a Dataset, returning a new object with all the original variables in addition to the new ones.
Dataset.assign_coords(**kwargs) Assign new coordinates to this object.
Dataset.assign_attrs(*args, **kwargs) Assign new attrs to this object.
Dataset.pipe(func, *args, **kwargs) Apply func(self, *args, **kwargs)
Dataset.merge(other[, inplace, …]) Merge the arrays of two datasets into a single dataset.
Dataset.rename(name_dict[, inplace]) Returns a new object with renamed variables and dimensions.
Dataset.swap_dims(dims_dict[, inplace]) Returns a new object with swapped dimensions.
Dataset.expand_dims(dim[, axis]) Return a new object with an additional axis (or axes) inserted at the corresponding position in the array shape.
Dataset.drop(labels[, dim]) Drop variables or index labels from this dataset.
Dataset.set_coords(names[, inplace]) Given names of one or more variables, set them as coordinates
Dataset.reset_coords([names, drop, inplace]) Given names of coordinates, reset them to become variables
Comparisons
Dataset.equals(other) Two Datasets are equal if they have matching variables and coordinates, all of which are equal.
Dataset.identical(other) Like equals, but also checks all dataset attributes and the attributes on all variables and coordinates.
Dataset.broadcast_equals(other) Two Datasets are broadcast equal if they are equal after broadcasting all variables against each other.
Indexing
Dataset.loc Attribute for location based indexing.
Dataset.isel([drop]) Returns a new dataset with each array indexed along the specified dimension(s).
Dataset.sel([method, tolerance, drop]) Returns a new dataset with each array indexed by tick labels along the specified dimension(s).
Dataset.squeeze([dim, drop, axis]) Return a new object with squeezed data.
Dataset.reindex([indexers, method, …]) Conform this object onto a new set of indexes, filling in missing values with NaN.
Dataset.reindex_like(other[, method, …]) Conform this object onto the indexes of another object, filling in missing values with NaN.
Dataset.set_index([append, inplace]) Set Dataset (multi-)indexes using one or more existing coordinates or variables.
Dataset.reset_index(dims_or_levels[, drop, …]) Reset the specified index(es) or multi-index level(s).
Dataset.reorder_levels([inplace]) Rearrange index levels using input order.
Missing value handling
Dataset.isnull(*args, **kwargs)
Dataset.notnull(*args, **kwargs)
Dataset.combine_first(other) Combine two Datasets, default to data_vars of self.
Dataset.count([dim, keep_attrs]) Reduce this Dataset’s data by applying count along some dimension(s).
Dataset.dropna(dim[, how, thresh, subset]) Returns a new dataset with dropped labels for missing values along the provided dimension.
Dataset.fillna(value) Fill missing values in this object.
Dataset.ffill(dim[, limit]) Fill NaN values by propagating values forward
Dataset.bfill(dim[, limit]) Fill NaN values by propagating values backward
Dataset.interpolate_na([dim, method, limit, …]) Interpolate values according to different methods.
Dataset.where(cond[, other, drop]) Filter elements from this object according to a condition.
Dataset.isin(test_elements) Tests each value in the array for whether it is in the supplied list.
Computation
Dataset.apply(func[, keep_attrs, args]) Apply a function over the data variables in this dataset.
Dataset.reduce(func[, dim, keep_attrs, …]) Reduce this dataset by applying func along some dimension(s).
Dataset.groupby(group[, squeeze]) Returns a GroupBy object for performing grouped operations.
Dataset.groupby_bins(group, bins[, right, …]) Returns a GroupBy object for performing grouped operations.
Dataset.rolling([min_periods, center]) Rolling window object.
Dataset.resample([freq, dim, how, skipna, …]) Returns a Resample object for performing resampling operations.
Dataset.diff(dim[, n, label]) Calculate the n-th order discrete difference along given axis.
Dataset.quantile(q[, dim, interpolation, …]) Compute the qth quantile of the data along the specified dimension.

Aggregation: all any argmax argmin max mean median min prod sum std var

ndarray methods: astype argsort clip conj conjugate imag round real cumsum cumprod rank

Grouped operations: assign assign_coords first last fillna where

Reshaping and reorganizing
Dataset.transpose(*dims) Return a new Dataset object with all array dimensions transposed.
Dataset.stack(**dimensions) Stack any number of existing dimensions into a single new dimension.
Dataset.unstack(dim) Unstack an existing dimension corresponding to a MultiIndex into multiple new dimensions.
Dataset.shift(**shifts) Shift this dataset by an offset along one or more dimensions.
Dataset.roll(**shifts) Roll this dataset by an offset along one or more dimensions.
Dataset.sortby(variables[, ascending]) Sort object by labels or values (along an axis).

DataArray

DataArray(data[, coords, dims, name, attrs, …]) N-dimensional array with labeled coordinates and dimensions.
Attributes
DataArray.values The array’s data as a numpy.ndarray
DataArray.data The array’s data as a dask or numpy array
DataArray.coords Dictionary-like container of coordinate arrays.
DataArray.dims Tuple of dimension names associated with this array.
DataArray.sizes Ordered mapping from dimension names to lengths.
DataArray.name The name of this array.
DataArray.attrs Dictionary storing arbitrary metadata with this array.
DataArray.encoding Dictionary of format-specific settings for how this array should be serialized.
DataArray.indexes OrderedDict of pandas.Index objects used for label based indexing
DataArray.get_index(key) Get an index for a dimension, with fall-back to a default RangeIndex

ndarray attributes: ndim shape size dtype nbytes chunks

DataArray contents
DataArray.assign_coords(**kwargs) Assign new coordinates to this object.
DataArray.assign_attrs(*args, **kwargs) Assign new attrs to this object.
DataArray.pipe(func, *args, **kwargs) Apply func(self, *args, **kwargs)
DataArray.rename(new_name_or_name_dict) Returns a new DataArray with renamed coordinates or a new name.
DataArray.swap_dims(dims_dict) Returns a new DataArray with swapped dimensions.
DataArray.expand_dims(dim[, axis]) Return a new object with an additional axis (or axes) inserted at the corresponding position in the array shape.
DataArray.drop(labels[, dim]) Drop coordinates or index labels from this DataArray.
DataArray.reset_coords([names, drop, inplace]) Given names of coordinates, reset them to become variables.
DataArray.copy([deep]) Returns a copy of this array.

ndarray methods: astype item

Indexing
DataArray.__getitem__(key)
DataArray.__setitem__(key, value)
DataArray.loc Attribute for location based indexing like pandas.
DataArray.isel([drop]) Return a new DataArray whose dataset is given by integer indexing along the specified dimension(s).
DataArray.sel([method, tolerance, drop]) Return a new DataArray whose dataset is given by selecting index labels along the specified dimension(s).
DataArray.squeeze([dim, drop, axis]) Return a new object with squeezed data.
DataArray.reindex([method, tolerance, copy]) Conform this object onto a new set of indexes, filling in missing values with NaN.
DataArray.reindex_like(other[, method, …]) Conform this object onto the indexes of another object, filling in missing values with NaN.
DataArray.set_index([append, inplace]) Set DataArray (multi-)indexes using one or more existing coordinates.
DataArray.reset_index(dims_or_levels[, …]) Reset the specified index(es) or multi-index level(s).
DataArray.reorder_levels([inplace]) Rearrange index levels using input order.
Missing value handling
DataArray.isnull(*args, **kwargs)
DataArray.notnull(*args, **kwargs)
DataArray.combine_first(other) Combine two DataArray objects, with union of coordinates.
DataArray.count([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying count along some dimension(s).
DataArray.dropna(dim[, how, thresh]) Returns a new array with dropped labels for missing values along the provided dimension.
DataArray.fillna(value) Fill missing values in this object.
DataArray.ffill(dim[, limit]) Fill NaN values by propagating values forward
DataArray.bfill(dim[, limit]) Fill NaN values by propagating values backward
DataArray.interpolate_na([dim, method, …]) Interpolate values according to different methods.
DataArray.where(cond[, other, drop]) Filter elements from this object according to a condition.
DataArray.isin(test_elements) Tests each value in the array for whether it is in the supplied list.
Comparisons
DataArray.equals(other) True if two DataArrays have the same dimensions, coordinates and values; otherwise False.
DataArray.identical(other) Like equals, but also checks the array name and attributes, and attributes on all coordinates.
DataArray.broadcast_equals(other) Two DataArrays are broadcast equal if they are equal after broadcasting them against each other such that they have the same dimensions.
Computation
DataArray.reduce(func[, dim, axis, keep_attrs]) Reduce this array by applying func along some dimension(s).
DataArray.groupby(group[, squeeze]) Returns a GroupBy object for performing grouped operations.
DataArray.groupby_bins(group, bins[, right, …]) Returns a GroupBy object for performing grouped operations.
DataArray.rolling([min_periods, center]) Rolling window object.
DataArray.dt Access datetime fields for DataArrays with datetime-like dtypes.
DataArray.resample([freq, dim, how, skipna, …]) Returns a Resample object for performing resampling operations.
DataArray.get_axis_num(dim) Return axis number(s) corresponding to dimension(s) in this array.
DataArray.diff(dim[, n, label]) Calculate the n-th order discrete difference along given axis.
DataArray.dot(other[, dims]) Perform dot product of two DataArrays along their shared dims.
DataArray.quantile(q[, dim, interpolation, …]) Compute the qth quantile of the data along the specified dimension.

Aggregation: all any argmax argmin max mean median min prod sum std var

ndarray methods: argsort clip conj conjugate imag searchsorted round real T cumsum cumprod rank

Grouped operations: assign_coords first last fillna where
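
As a minimal sketch of reductions and grouped operations by dimension name (the data here is made up):

import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(np.arange(6.0), dims='time',
                  coords={'time': pd.date_range('2014-01-01', periods=6)})

da.mean('time')                      # reduce by dimension name, not axis number
da.groupby('time.dayofweek').mean()  # split-apply-combine on a datetime field
da.diff('time')                      # first-order discrete difference
da.quantile(0.5, dim='time')         # median via quantile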

Reshaping and reorganizing
DataArray.transpose(*dims) Return a new DataArray object with transposed dimensions.
DataArray.stack(**dimensions) Stack any number of existing dimensions into a single new dimension.
DataArray.unstack(dim) Unstack an existing dimension corresponding to a MultiIndex into multiple new dimensions.
DataArray.shift(**shifts) Shift this array by an offset along one or more dimensions.
DataArray.roll(**shifts) Roll this array by an offset along one or more dimensions.
DataArray.sortby(variables[, ascending]) Sort object by labels or values (along an axis).
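
A short sketch of stacking and dimension reordering (array contents are arbitrary):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6).reshape(2, 3), dims=('x', 'y'))

stacked = da.stack(z=('x', 'y'))  # combine 'x' and 'y' into a MultiIndex 'z'
stacked.unstack('z')              # recover the original dimensions
da.transpose('y', 'x')            # reorder dimensions by name
da.roll(x=1)                      # shift cyclically along 'x'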

Universal functions

Warning

With recent versions of numpy, dask and xarray, NumPy ufuncs are now supported directly on all xarray and dask objects. This obviates the need for the xarray.ufuncs module, which should not be used for new code unless compatibility with versions of NumPy prior to v1.13 is required.

These functions are copied from NumPy but extended to work on NumPy arrays, dask arrays, and all xarray objects. You can find them in the xarray.ufuncs module:

angle arccos arccosh arcsin arcsinh arctan arctan2 arctanh ceil conj copysign cos cosh deg2rad degrees exp expm1 fabs fix floor fmax fmin fmod frexp hypot imag iscomplex isfinite isinf isnan isreal ldexp log log10 log1p log2 logaddexp logaddexp2 logical_and logical_not logical_or logical_xor maximum minimum nextafter rad2deg radians real rint sign signbit sin sinh sqrt square tan tanh trunc
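
As the warning above notes, plain NumPy ufuncs now apply directly to xarray objects; the xarray.ufuncs variants remain only for compatibility with older NumPy. A minimal sketch:

import numpy as np
import xarray as xr

da = xr.DataArray([0.0, np.pi / 2, np.pi], dims='x')

np.sin(da)  # with NumPy >= 1.13, returns a DataArray with labels intact

# only needed for compatibility with NumPy older than v1.13:
import xarray.ufuncs as xu
xu.sin(da)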

IO / Conversion

Dataset methods
open_dataset(filename_or_obj[, group, …]) Load and decode a dataset from a file or file-like object.
open_mfdataset(paths[, chunks, concat_dim, …]) Open multiple files as a single dataset.
open_rasterio(filename[, parse_coordinates, …]) Open a file with rasterio (experimental).
open_zarr(store[, group, synchronizer, …]) Load and decode a dataset from a Zarr store.
Dataset.to_netcdf([path, mode, format, …]) Write dataset contents to a netCDF file.
Dataset.to_zarr([store, mode, synchronizer, …]) Write dataset contents to a zarr group.
save_mfdataset(datasets, paths[, mode, …]) Write multiple datasets to disk as netCDF files simultaneously.
Dataset.to_array([dim, name]) Convert this dataset into an xarray.DataArray
Dataset.to_dataframe() Convert this dataset into a pandas.DataFrame.
Dataset.to_dask_dataframe([dim_order, set_index]) Convert this dataset into a dask.dataframe.DataFrame.
Dataset.to_dict() Convert this dataset to a dictionary following xarray naming conventions.
Dataset.from_dataframe(dataframe) Convert a pandas.DataFrame into an xarray.Dataset
Dataset.from_dict(d) Convert a dictionary into an xarray.Dataset.
Dataset.close() Close any files linked to this object
Dataset.compute(**kwargs) Manually trigger loading of this dataset’s data from disk or a remote source into memory and return a new dataset.
Dataset.persist(**kwargs) Trigger computation, keeping data as dask arrays
Dataset.load(**kwargs) Manually trigger loading of this dataset’s data from disk or a remote source into memory and return this dataset.
Dataset.chunk([chunks, name_prefix, token, lock]) Coerce all arrays in this dataset into dask arrays with the given chunks.
Dataset.filter_by_attrs(**kwargs) Returns a Dataset with variables that match specific conditions.
Dataset.info([buf]) Concise summary of a Dataset variables and attributes.
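
A sketch of a typical round trip (writing netCDF assumes a backend such as netCDF4 or scipy is installed; the filename is hypothetical):

import numpy as np
import xarray as xr

ds = xr.Dataset({'temperature': (('x',), np.arange(4.0))})

ds.to_netcdf('example.nc')           # serialize to a netCDF file
reloaded = xr.open_dataset('example.nc')
reloaded.close()                     # release the underlying file handle

df = ds.to_dataframe()               # convert to a pandas.DataFrame
ds2 = xr.Dataset.from_dataframe(df)  # and back again
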
DataArray methods
open_dataarray(filename_or_obj[, group, …]) Open a DataArray from a netCDF file containing a single data variable.
DataArray.to_dataset([dim, name]) Convert a DataArray to a Dataset.
DataArray.to_netcdf(*args, **kwargs) Write DataArray contents to a netCDF file.
DataArray.to_pandas() Convert this array into a pandas object with the same shape.
DataArray.to_series() Convert this array into a pandas.Series.
DataArray.to_dataframe([name]) Convert this array and its coordinates into a tidy pandas.DataFrame.
DataArray.to_index() Convert this variable to a pandas.Index.
DataArray.to_masked_array([copy]) Convert this array into a numpy.ma.MaskedArray
DataArray.to_cdms2() Convert this array into a cdms2.Variable
DataArray.to_iris() Convert this array into an iris.cube.Cube
DataArray.from_iris(cube) Convert an iris.cube.Cube into an xarray.DataArray
DataArray.to_dict() Convert this xarray.DataArray into a dictionary following xarray naming conventions.
DataArray.from_series(series) Convert a pandas.Series into an xarray.DataArray.
DataArray.from_cdms2(variable) Convert a cdms2.Variable into an xarray.DataArray
DataArray.from_dict(d) Convert a dictionary into an xarray.DataArray
DataArray.close() Close any files linked to this object
DataArray.compute(**kwargs) Manually trigger loading of this array’s data from disk or a remote source into memory and return a new array.
DataArray.persist(**kwargs) Trigger computation in constituent dask arrays
DataArray.load(**kwargs) Manually trigger loading of this array’s data from disk or a remote source into memory and return this array.
DataArray.chunk([chunks, name_prefix, …]) Coerce this array’s data into a dask array with the given chunks.
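
A brief sketch of conversions to and from pandas (values are arbitrary):

import pandas as pd
import xarray as xr

series = pd.Series([1, 2, 3], index=pd.Index(['a', 'b', 'c'], name='x'))

da = xr.DataArray.from_series(series)  # the index becomes a dimension
da.to_series()                         # back to a pandas.Series
da.to_dataset(name='var')              # wrap in a single-variable Dataset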

Rolling objects

core.rolling.DataArrayRolling(obj[, …])
core.rolling.DataArrayRolling.construct(…) Convert this rolling object to xr.DataArray, where the window dimension is stacked as a new dimension
core.rolling.DataArrayRolling.reduce(func, …) Reduce the items in this group by applying func along some dimension(s).
core.rolling.DatasetRolling(obj[, …])
core.rolling.DatasetRolling.construct(window_dim) Convert this rolling object to xr.Dataset, where the window dimension is stacked as a new dimension
core.rolling.DatasetRolling.reduce(func, …) Reduce the items in this group by applying func along some dimension(s).
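
A minimal sketch of how rolling objects are used (the data is made up):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(10.0), dims='time')
roller = da.rolling(time=3, center=True)  # a DataArrayRolling object

roller.mean()               # built-in rolling aggregation
roller.reduce(np.nansum)    # arbitrary reduction functions
roller.construct('window')  # expose the moving window as a new dimension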

GroupByObjects

core.groupby.DataArrayGroupBy(obj, group[, …]) GroupBy object specialized to grouping DataArray objects
core.groupby.DataArrayGroupBy.apply(func[, …]) Apply a function over each array in the group and concatenate them together into a new array.
core.groupby.DataArrayGroupBy.reduce(func[, …]) Reduce the items in this group by applying func along some dimension(s).
core.groupby.DatasetGroupBy(obj, group[, …])
core.groupby.DatasetGroupBy.apply(func, **kwargs) Apply a function over each Dataset in the group and concatenate them together into a new Dataset.
core.groupby.DatasetGroupBy.reduce(func[, …]) Reduce the items in this group by applying func along some dimension(s).

Plotting

DataArray.plot Access plotting functions
plot.plot(darray[, row, col, col_wrap, ax, …]) Default plot of DataArray using matplotlib.pyplot.
plot.contourf(x, y, z, ax, **kwargs) Filled contour plot of 2d DataArray
plot.contour(x, y, z, ax, **kwargs) Contour plot of 2d DataArray
plot.hist(darray[, figsize, size, aspect, ax]) Histogram of DataArray
plot.imshow(x, y, z, ax, **kwargs) Image plot of 2d DataArray using matplotlib.pyplot
plot.line(darray, *args, **kwargs) Line plot of DataArray index against values
plot.pcolormesh(x, y, z, ax[, infer_intervals]) Pseudocolor plot of 2d DataArray
plot.FacetGrid(data[, col, row, col_wrap, …]) Initialize the matplotlib figure and FacetGrid object.
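
A minimal sketch of the dispatch behaviour (requires matplotlib; the data is made up):

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(4, 5), dims=('y', 'x'))

da.plot()                 # 2-D data dispatches to pcolormesh by default
da.plot.contourf()        # or choose a plot type explicitly
da.isel(y=0).plot.line()  # 1-D data dispatches to a line plot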

Testing

testing.assert_equal(a, b) Like numpy.testing.assert_array_equal(), but for xarray objects.
testing.assert_identical(a, b) Like xarray.testing.assert_equal(), but also matches the objects’ names and attributes.
testing.assert_allclose(a, b[, rtol, atol, …]) Like numpy.testing.assert_allclose(), but for xarray objects.

Exceptions

MergeError Error class for merge failures due to incompatible arguments.
SerializationWarning Warnings about encoding/decoding issues in serialization.

Advanced API

Dataset.variables Low level interface to Dataset contents as dict of Variable objects.
DataArray.variable Low level interface to the Variable object for this DataArray.
Variable(dims, data[, attrs, encoding, fastpath]) A netcdf-like variable consisting of dimensions, data and attributes which describe a single Array.
IndexVariable(dims, data[, attrs, encoding, …]) Wrapper for accommodating a pandas.Index in an xarray.Variable.
as_variable(obj[, name]) Convert an object into a Variable.
register_dataset_accessor(name) Register a custom property on xarray.Dataset objects.
register_dataarray_accessor(name) Register a custom accessor on xarray.DataArray objects.

These backends provide a low-level interface for lazily loading data from external file-formats or protocols, and can be manually invoked to create arguments for the from_store and dump_to_store Dataset methods:

backends.NetCDF4DataStore(netcdf4_dataset[, …]) Store for reading and writing data via the Python-NetCDF4 library.
backends.H5NetCDFStore(filename[, mode, …]) Store for reading and writing data via h5netcdf
backends.PydapDataStore(ds) Store for accessing OpenDAP datasets with pydap.
backends.ScipyDataStore(filename_or_obj[, …]) Store for reading and writing data via scipy.io.netcdf.

xarray Internals

xarray builds upon two of the foundational libraries of the scientific Python stack, NumPy and pandas. It is written in pure Python (no C or Cython extensions), which makes it easy to develop and extend. Instead, we push compiled code to optional dependencies.

Variable objects

The core internal data structure in xarray is the Variable, which is used as the basic building block behind xarray’s Dataset and DataArray types. A Variable consists of:

  • dims: A tuple of dimension names.
  • data: The N-dimensional array (typically, a NumPy or Dask array) storing the Variable’s data. It must have the same number of dimensions as the length of dims.
  • attrs: An ordered dictionary of metadata associated with this array. By convention, xarray’s built-in operations never use this metadata.
  • encoding: Another ordered dictionary used to store information about how this variable’s data is represented on disk. See Reading encoded data for more details.

Variable has an interface similar to NumPy arrays, but extended to make use of named dimensions. For example, it uses dim in preference to an axis argument for methods like mean, and supports Broadcasting by dimension name.

However, unlike Dataset and DataArray, the basic Variable does not include coordinate labels along each axis.

Variable is public API, but because of its incomplete support for labeled data, it is mostly intended for advanced uses, such as in xarray itself or for writing new backends. You can access the variable objects that correspond to xarray objects via the (readonly) Dataset.variables and DataArray.variable attributes.
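
A minimal sketch of constructing a Variable directly (the dimensions and metadata here are made up):

import numpy as np
import xarray as xr

# a Variable pairs named dimensions with data and metadata,
# but carries no coordinate labels
var = xr.Variable(dims=('x', 'y'), data=np.zeros((2, 3)),
                  attrs={'units': 'm'})

var.mean(dim='y')  # reductions by dimension name, as with DataArray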

Extending xarray

xarray is designed as a general purpose library, and hence tries to avoid including overly domain specific functionality. But inevitably, the need for more domain specific logic arises.

One standard solution to this problem is to subclass Dataset and/or DataArray to add domain specific functionality. However, inheritance is not very robust. It’s easy to inadvertently use internal APIs when subclassing, which means that your code may break when xarray upgrades. Furthermore, many builtin methods will only return native xarray objects.

The standard advice is to use composition over inheritance, but reimplementing an API as large as xarray’s on your own objects can be an onerous task, even if most methods are only forwarding to xarray implementations.

If you simply want the ability to call a function with the syntax of a method call, then the builtin pipe() method (copied from pandas) may suffice.
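
For instance, a minimal sketch of pipe() with a hypothetical helper function:

import xarray as xr

def demean(ds, dim):  # a hypothetical helper, not part of xarray
    return ds - ds.mean(dim)

ds = xr.Dataset({'a': ('x', [1.0, 2.0, 3.0])})
ds.pipe(demean, 'x')  # equivalent to demean(ds, 'x'), but chainable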

To resolve this issue for more complex cases, xarray has the register_dataset_accessor() and register_dataarray_accessor() decorators for adding custom “accessors” on xarray objects. Here’s how you might use these decorators to write a custom “geo” accessor implementing a geography specific extension to xarray:

import xarray as xr


@xr.register_dataset_accessor('geo')
class GeoAccessor(object):
    def __init__(self, xarray_obj):
        self._obj = xarray_obj
        self._center = None

    @property
    def center(self):
        """Return the geographic center point of this dataset."""
        if self._center is None:
            # we can use a cache on our accessor objects, because accessors
            # themselves are cached on instances that access them.
            lon = self._obj.longitude
            lat = self._obj.latitude
            self._center = (float(lon.mean()), float(lat.mean()))
        return self._center

    def plot(self):
        """Plot data on a map."""
        return 'plotting!'

This achieves the same result as if the Dataset class had a cached property defined that returns an instance of your class:

class Dataset:
    ...
    @property
    def geo(self):
        return GeoAccessor(self)

However, using the register accessor decorators is preferable to simply adding your own ad-hoc property (i.e., Dataset.geo = property(...)), for several reasons:

  1. It ensures that the name of your property does not accidentally conflict with any other attributes or methods (including other accessors).
  2. Instances of the accessor object will be cached on the xarray object that creates them. This means you can save state on them (e.g., to cache computed properties).
  3. Using an accessor provides an implicit namespace for your custom functionality that clearly identifies it as separate from built-in xarray methods.

Back in an interactive IPython session, we can use these properties:

In [1]: import numpy as np

In [2]: ds = xr.Dataset({'longitude': np.linspace(0, 10),
   ...:                  'latitude': np.linspace(0, 20)})
   ...: 

In [3]: ds.geo.center
Out[3]: (5.0, 10.0)

In [4]: ds.geo.plot()
Out[4]: 'plotting!'

The intent here is that libraries that extend xarray could add such an accessor to implement subclass specific functionality rather than using actual subclasses or patching in a large number of domain specific methods. For further reading on ways to write new accessors and the philosophy behind the approach, see GH1080.

To help users keep things straight, please let us know if you plan to write a new accessor for an open source library. In the future, we will maintain a list of accessors and the libraries that implement them on this page.

Here are several existing libraries that build functionality upon xarray. They may be useful points of reference for your work:

  • xgcm: General Circulation Model Postprocessing. Uses subclassing and custom xarray backends.
  • PyGDX: Python 3 package for accessing data stored in GAMS Data eXchange (GDX) files. Also uses a custom subclass.
  • windspharm: Spherical harmonic wind analysis in Python.
  • eofs: EOF analysis in Python.
  • salem: Adds geolocalised subsetting, masking, and plotting operations to xarray’s data structures via accessors.

Contributing to xarray

Note

Large parts of this document came from the Pandas Contributing Guide.

Where to start?

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

If you are brand new to xarray or open-source development, we recommend going through the GitHub “issues” tab to find issues that interest you. There are a number of issues listed under Documentation and good first issue where you could start out. Once you’ve found an interesting issue, you can return here to get your development environment set up.

Feel free to ask questions on the mailing list.

Bug reports and enhancement requests

Bug reports are an important part of making xarray more stable. Having a complete bug report will allow others to reproduce the bug and provide insight into fixing it. See this stackoverflow article for tips on writing a good bug report.

Trying the bug-producing code out on the master branch is often a worthwhile exercise to confirm the bug still exists. It is also worth searching existing bug reports and pull requests to see if the issue has already been reported and/or fixed.

Bug reports must:

  1. Include a short, self-contained Python snippet reproducing the problem. You can format the code nicely by using GitHub Flavored Markdown:

    ```python
    >>> from xarray import Dataset
    >>> df = Dataset(...)
    ...
    ```
    
  2. Include the full version string of xarray and its dependencies. You can use the built-in function:

    >>> import xarray as xr
    >>> xr.show_versions()
    
  3. Explain why the current behavior is wrong/not desired and what you expect instead.

The issue will then be visible to the xarray community and open to comments/ideas from others.

Working with the code

Now that you have an issue you want to fix, enhancement to add, or documentation to improve, you need to learn how to work with GitHub and the xarray code base.

Version control, Git, and GitHub

To the new user, working with Git is one of the more daunting aspects of contributing to xarray. It can very quickly become overwhelming, but sticking to the guidelines below will help keep the process straightforward and mostly trouble free. As always, if you are having difficulties please feel free to ask for help.

The code is hosted on GitHub. To contribute you will need to sign up for a free GitHub account. We use Git for version control to allow many people to work together on the project.

There are many great resources for learning Git available online.

Getting started with Git

GitHub has instructions for installing git, setting up your SSH key, and configuring git. All these steps need to be completed before you can work seamlessly between your local repository and GitHub.

Forking

You will need your own fork to work on the code. Go to the xarray project page and hit the Fork button. You will want to clone your fork to your machine:

git clone https://github.com/your-user-name/xarray.git
cd xarray
git remote add upstream https://github.com/pydata/xarray.git

This creates the directory xarray and connects your repository to the upstream (main project) xarray repository.

Creating a development environment

To test out code changes, you’ll need to build xarray from source, which requires a Python environment. If you’re making documentation changes, you can skip to Contributing to the documentation but you won’t be able to build the documentation locally before pushing your changes.

Creating a Python Environment

Before starting any development, you’ll need to create an isolated xarray development environment:

We’ll now kick off a two-step process:

  1. Install the build dependencies
  2. Build and install xarray

# Create and activate the build environment
conda env create -f ci/requirements-py36.yml
conda activate test_env

# or with older versions of Anaconda:
source activate test_env

# Build and install xarray
pip install -e .

At this point you should be able to import xarray from your locally built version:

$ python  # start an interpreter
>>> import xarray
>>> xarray.__version__
'0.10.0+dev46.g015daca'

Creating the environment this way will not touch any of your existing environments, nor any existing Python installation.

To view your environments:

conda info -e

To return to your root environment:

conda deactivate

See the full conda docs here.

Creating a branch

You want your master branch to reflect only production-ready code, so create a feature branch for making your changes. For example:

git branch shiny-new-feature
git checkout shiny-new-feature

The above can be simplified to:

git checkout -b shiny-new-feature

This changes your working directory to the shiny-new-feature branch. Keep any changes in this branch specific to one bug or feature so it is clear what the branch brings to xarray. You can have many “shiny-new-features” and switch in between them using the git checkout command.

To update this branch, you need to retrieve the changes from the master branch:

git fetch upstream
git rebase upstream/master

This will replay your commits on top of the latest xarray git master. If this leads to merge conflicts, you must resolve these before submitting your pull request. If you have uncommitted changes, you will need to stash them prior to updating. This will effectively store your changes and they can be reapplied after updating.

Contributing to the documentation

If you’re not the developer type, contributing to the documentation is still of huge value. You don’t even have to be an expert on xarray to do so! In fact, there are sections of the docs that are worse off after being written by experts. If something in the docs doesn’t make sense to you, updating the relevant section after you figure it out is a great way to ensure it will help the next person.

About the xarray documentation

The documentation is written in reStructuredText, which is almost like writing in plain English, and built using Sphinx. The Sphinx Documentation has an excellent introduction to reST. Review the Sphinx docs to perform more complex changes to the documentation as well.

Some other important things to know about the docs:

  • The xarray documentation consists of two parts: the docstrings in the code itself and the docs in this folder xarray/doc/.

    The docstrings are meant to provide a clear explanation of the usage of the individual functions, while the documentation in this folder consists of tutorial-like overviews per topic together with some other information (what’s new, installation, etc).

  • The docstrings follow the Numpy Docstring Standard, which is used widely in the Scientific Python community. This standard specifies the format of the different sections of the docstring. See this document for a detailed explanation, or look at some of the existing functions to extend it in a similar manner.
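
    For example, a docstring in this format might look like the following
    hypothetical sketch (not taken from the xarray code base):

        def clip_min(x, threshold):
            """Clip an array from below.

            Parameters
            ----------
            x : DataArray
                Input array.
            threshold : float
                Minimum value to retain.

            Returns
            -------
            clipped : DataArray
                Copy of ``x`` with values below ``threshold`` set to
                ``threshold``.
            """
            return x.where(x > threshold, threshold)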

  • The tutorials make heavy use of the ipython directive sphinx extension. This directive lets you put code in the documentation which will be run during the doc build. For example:

    .. ipython:: python
    
        x = 2
        x**3
    

    will be rendered as:

    In [1]: x = 2
    
    In [2]: x**3
    Out[2]: 8
    

    Almost all code examples in the docs are run (and the output saved) during the doc build. This approach means that code examples will always be up to date, but it does make the doc building a bit more complex.

  • Our API documentation in doc/api.rst houses the auto-generated documentation from the docstrings. For classes, there are a few subtleties around controlling which methods and attributes have pages auto-generated.

    Every method should be included in a toctree in api.rst, else Sphinx will emit a warning.

How to build the xarray documentation
Requirements

First, you need to have a development environment to be able to build xarray (see the docs on creating a development environment above).

Building the documentation

In your development environment, install sphinx, sphinx_rtd_theme, sphinx-gallery and numpydoc:

conda install -c conda-forge sphinx sphinx_rtd_theme sphinx-gallery numpydoc

Navigate to your local xarray/doc/ directory in the console and run:

make html

Then you can find the HTML output in the folder xarray/doc/build/html/.

The first time you build the docs, it will take quite a while because it has to run all the code examples and build all the generated docstring pages. In subsequent invocations, Sphinx will try to build only the pages that have been modified.

If you want to do a full clean build, do:

make clean
make html

Contributing to the code base

Code standards

Writing good code is not just about what you write. It is also about how you write it. During Continuous Integration testing, several tools will be run to check your code for stylistic errors. Generating any warnings will cause the test to fail. Thus, good style is a requirement for submitting code to xarray.

In addition, because a lot of people use our library, it is important that we do not make sudden changes to the code that could break a lot of user code; that is, we need it to be as backwards compatible as possible to avoid mass breakages.

Python (PEP8)

xarray uses the PEP8 standard. There are several tools to ensure you abide by this standard. Here are some of the more common PEP8 issues:

  • we restrict line-length to 79 characters to promote readability
  • passing arguments should have spaces after commas, e.g. foo(arg1, arg2, kw1='bar')

Continuous Integration will run the flake8 tool and report any stylistic errors in your code. Therefore, it is helpful before submitting code to run the check yourself:

flake8

If you install isort and flake8-isort, this will also show any errors from incorrectly sorted imports. These aren’t currently enforced in CI. To automatically sort imports, you can run:

isort -y
Backwards Compatibility

Please try to maintain backward compatibility. xarray has a growing number of users with lots of existing code, so don’t break it if at all possible. If you think breakage is required, clearly state why as part of the pull request. Be careful when changing method signatures, add deprecation warnings where needed, and add the deprecated Sphinx directive to deprecated functions or methods.

Testing With Continuous Integration

The xarray test suite will run automatically on the Travis-CI and Appveyor continuous integration services once your pull request is submitted. However, if you wish to run the test suite on a branch prior to submitting the pull request, then the continuous integration services need to be hooked to your GitHub repository. Instructions are here for Travis-CI and Appveyor.

A pull-request will be considered for merging when you have an all ‘green’ build. If any tests are failing, then you will get a red ‘X’, where you can click through to see the individual failed tests. This is an example of a green build.


Note

Each time you push to your fork, a new run of the tests will be triggered on the CI. Appveyor will auto-cancel any non-currently-running tests for that same pull-request. You can also enable the auto-cancel feature for Travis-CI here.

Test-driven development/code writing

xarray is serious about testing and strongly encourages contributors to embrace test-driven development (TDD). This development process “relies on the repetition of a very short development cycle: first the developer writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test.” So, before actually writing any code, you should write your tests. Often the test can be taken from the original GitHub issue. However, it is always worth considering additional use cases and writing corresponding tests.

Adding tests is one of the most common requests after code is pushed to xarray. Therefore, it is worth getting in the habit of writing tests ahead of time so this is never an issue.

Like many packages, xarray uses pytest and the convenient extensions in numpy.testing.

Writing tests

All tests should go into the xarray/tests subdirectory. This folder contains many current examples of tests, and we suggest looking to these for inspiration. If your test requires working with files or network connectivity, there is more information on the testing page of the wiki.

The xarray.testing module has many special assert functions that make it easier to make statements about whether DataArray or Dataset objects are equivalent. The easiest way to verify that your code is correct is to explicitly construct the result you expect, then compare the actual result to the expected correct result:

def test_constructor_from_0d(self):
    expected = Dataset({None: ([], 0)})[None]
    actual = DataArray(0)
    assert_identical(expected, actual)
Transitioning to pytest

xarray’s existing test structure is mostly class-based, meaning that you will typically find tests wrapped in a class.

class TestReallyCoolFeature(object):
    ....

Going forward, we are moving to a more functional style using the pytest framework, which offers a richer set of testing utilities and makes tests easier to write and maintain. Thus, instead of writing test classes, we will write test functions like this:

def test_really_cool_feature():
    ....
Using pytest

Here is an example of a self-contained set of tests that illustrate multiple features that we like to use.

  • functional style: tests are plain functions named test_* that only take arguments which are either fixtures or parameters
  • pytest.mark can be used to set metadata on test functions, e.g. skip or xfail.
  • using parametrize: allow testing of multiple cases
  • to set a mark on a parameter, pytest.param(..., marks=...) syntax should be used
  • fixture, code for object construction, on a per-test basis
  • using bare assert for scalars and truth-testing
  • xarray.testing.assert_equal (and its counterparts assert_identical and assert_allclose) for xarray object comparisons.
  • the typical pattern of constructing an expected and comparing versus the result

We would name this file test_cool_feature.py and put in an appropriate place in the xarray/tests/ structure.

import pytest
import numpy as np
import xarray as xr
from xarray.testing import assert_equal


@pytest.mark.parametrize('dtype', ['int8', 'int16', 'int32', 'int64'])
def test_dtypes(dtype):
    assert str(np.dtype(dtype)) == dtype


@pytest.mark.parametrize('dtype', ['float32',
                         pytest.param('int16', marks=pytest.mark.skip),
                         pytest.param('int32', marks=pytest.mark.xfail(
                            reason='to show how it works'))])
def test_mark(dtype):
    assert str(np.dtype(dtype)) == 'float32'


@pytest.fixture
def dataarray():
    return xr.DataArray([1, 2, 3])


@pytest.fixture(params=['int8', 'int16', 'int32', 'int64'])
def dtype(request):
    return request.param


def test_series(dataarray, dtype):
    result = dataarray.astype(dtype)
    assert result.dtype == dtype

    expected = xr.DataArray(np.array([1, 2, 3], dtype=dtype))
    assert_equal(result, expected)

A test run of this yields

(xarray) $ pytest test_cool_feature.py -v
 =============================== test session starts ================================
 platform darwin -- Python 3.6.4, pytest-3.2.1, py-1.4.34, pluggy-0.4.0 --
 cachedir: ../../.cache
 plugins: cov-2.5.1, hypothesis-3.23.0
 collected 11 items

 test_cool_feature.py::test_dtypes[int8] PASSED
 test_cool_feature.py::test_dtypes[int16] PASSED
 test_cool_feature.py::test_dtypes[int32] PASSED
 test_cool_feature.py::test_dtypes[int64] PASSED
 test_cool_feature.py::test_mark[float32] PASSED
 test_cool_feature.py::test_mark[int16] SKIPPED
 test_cool_feature.py::test_mark[int32] xfail
 test_cool_feature.py::test_series[int8] PASSED
 test_cool_feature.py::test_series[int16] PASSED
 test_cool_feature.py::test_series[int32] PASSED
 test_cool_feature.py::test_series[int64] PASSED

 ================== 9 passed, 1 skipped, 1 xfailed in 1.83 seconds ==================

Tests that we have parametrized are now accessible via the test name, for example we could run these with -k int8 to sub-select only those tests which match int8.

(xarray) bash-3.2$ pytest test_cool_feature.py -v -k int8
=========================== test session starts ===========================
platform darwin -- Python 3.6.2, pytest-3.2.1, py-1.4.31, pluggy-0.4.0
collected 11 items

test_cool_feature.py::test_dtypes[int8] PASSED
test_cool_feature.py::test_series[int8] PASSED
Running the test suite

The tests can then be run directly inside your Git clone (without having to install xarray) by typing:

pytest xarray

The test suite is exhaustive and takes a few minutes to run. Often it is worth running only a subset of tests around your changes first, before running the entire suite.

The easiest way to do this is with:

pytest xarray/path/to/test.py -k regex_matching_test_name

Or with one of the following constructs:

pytest xarray/tests/[test-module].py
pytest xarray/tests/[test-module].py::[TestClass]
pytest xarray/tests/[test-module].py::[TestClass]::[test_method]

Using pytest-xdist, one can speed up local testing on multicore machines. To use this feature, you will need to install pytest-xdist via:

pip install pytest-xdist

Then, run pytest with the optional -n argument:

pytest xarray -n 4

This can significantly reduce the time it takes to locally run tests before submitting a pull request.

For more, see the pytest documentation.

Running the performance test suite

Performance matters and it is worth considering whether your code has introduced performance regressions. xarray is starting to write a suite of benchmarking tests using asv to enable easy monitoring of the performance of critical xarray operations. These benchmarks are all found in the xarray/asv_bench directory. asv supports both python2 and python3.

To use all features of asv, you will need either conda or virtualenv. For more details please check the asv installation webpage.

To install asv:

pip install git+https://github.com/spacetelescope/asv

If you need to run a benchmark, change your directory to asv_bench/ and run:

asv continuous -f 1.1 upstream/master HEAD

You can replace HEAD with the name of the branch you are working on; the -f 1.1 factor means that only benchmarks that changed by more than 10% are reported. The command uses conda by default for creating the benchmark environments. If you want to use virtualenv instead, write:

asv continuous -f 1.1 -E virtualenv upstream/master HEAD

The -E virtualenv option should be added to all asv commands that run benchmarks. The default value is defined in asv.conf.json.

Running the full benchmark suite can take up to one hour and use up a few GBs of RAM. Usually it is sufficient to paste only a subset of the results into the pull request to show that the committed changes do not cause unexpected performance regressions. You can run specific benchmarks using the -b flag, which takes a regular expression. For example, this will only run tests from the xarray/asv_bench/benchmarks/groupby.py file:

asv continuous -f 1.1 upstream/master HEAD -b ^groupby

If you want to only run a specific group of tests from a file, you can do it using . as a separator. For example:

asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods

will only run the GroupByMethods benchmark defined in groupby.py.
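
For reference, asv discovers benchmarks by name: any time_-prefixed method of a class in the benchmarks directory is timed, with setup run before each measurement. A hypothetical sketch of what such a GroupByMethods class might look like (the data and details here are made up):

import numpy as np
import xarray as xr


class GroupByMethods(object):
    def setup(self):
        # construct the objects being benchmarked; runs before each timing
        self.ds = xr.Dataset({'a': ('x', np.random.rand(10000))},
                             coords={'x': np.arange(10000) % 10})

    def time_groupby_mean(self):
        self.ds.groupby('x').mean()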

You can also run the benchmark suite using the version of xarray already installed in your current Python environment. This can be useful if you do not have virtualenv or conda, or are using the setup.py develop approach discussed above; for the in-place build you need to set PYTHONPATH, e.g. PYTHONPATH="$PWD/.." asv [remaining arguments]. You can run benchmarks using an existing Python environment by:

asv run -e -E existing

or, to use a specific Python interpreter:

asv run -e -E existing:python3.5

This will display stderr from the benchmarks and use the local Python interpreter from your $PATH.

Information on how to write a benchmark and how to use asv can be found in the asv documentation.

The xarray benchmarking suite is run remotely and the results are available here.

Documenting your code

Changes should be reflected in the release notes located in doc/whats-new.rst. This file contains an ongoing change log for each release. Add an entry to this file to document your fix, enhancement or (unavoidable) breaking change. Make sure to include the GitHub issue number when adding your entry (using GH1234, where 1234 is the issue/pull request number).

If your code is an enhancement, it is most likely necessary to add usage examples to the existing documentation. This can be done following the section regarding documentation above.

Contributing your changes to xarray

Committing your code

Keep style fixes to a separate commit to make your pull request more readable.

Once you’ve made changes, you can see them by typing:

git status

If you have created a new file, it is not being tracked by git. Add it by typing:

git add path/to/file-to-be-added.py

Doing ‘git status’ again should give something like:

# On branch shiny-new-feature
#
#       modified:   /relative/path/to/file-you-added.py
#

Finally, commit your changes to your local repository with an explanatory message. xarray uses a convention for commit message prefixes and layout. Here are some common prefixes along with general guidelines for when to use them:

  • ENH: Enhancement, new functionality
  • BUG: Bug fix
  • DOC: Additions/updates to documentation
  • TST: Additions/updates to tests
  • BLD: Updates to the build process/scripts
  • PERF: Performance improvement
  • CLN: Code cleanup

The following defines how a commit message should be structured. Please reference the relevant GitHub issues in your commit message using GH1234 or #1234. Either style is fine, but the former is generally preferred:

  • A subject line with fewer than 80 characters.
  • One blank line.
  • Optionally, a commit message body.

Now you can commit your changes in your local repository:

git commit -m "your commit message"
Pushing your changes

When you want your changes to appear publicly on your GitHub page, push your forked feature branch’s commits:

git push origin shiny-new-feature

Here origin is the default name given to your remote repository on GitHub. You can see the remote repositories:

git remote -v

If you added the upstream repository as described above you will see something like:

origin  git@github.com:yourname/xarray.git (fetch)
origin  git@github.com:yourname/xarray.git (push)
upstream        git://github.com/pydata/xarray.git (fetch)
upstream        git://github.com/pydata/xarray.git (push)

Now your code is on GitHub, but it is not yet a part of the xarray project. For that to happen, a pull request needs to be submitted on GitHub.

Review your code

When you’re ready to ask for a code review, file a pull request. Before you do, once again make sure that you have followed all the guidelines outlined in this document regarding code style, tests, performance tests, and documentation. You should also double check your branch changes against the branch it was based on:

  1. Navigate to your repository on GitHub – https://github.com/your-user-name/xarray
  2. Click on Branches
  3. Click on the Compare button for your feature branch
  4. Select the base and compare branches, if necessary. This will be master and shiny-new-feature, respectively.
Finally, make the pull request

If everything looks good, you are ready to make a pull request. A pull request is how code from a local repository becomes available to the GitHub community and can be looked at and eventually merged into the master version. This pull request and its associated changes will eventually be committed to the master branch and available in the next release. To submit a pull request:

  1. Navigate to your repository on GitHub
  2. Click on the Pull Request button
  3. You can then click on Commits and Files Changed to make sure everything looks okay one last time
  4. Write a description of your changes in the Preview Discussion tab
  5. Click Send Pull Request.

This request then goes to the repository maintainers, and they will review the code. If you need to make more changes, you can make them in your branch, add them to a new commit, push them to GitHub, and the pull request will be automatically updated. Pushing them to GitHub again is done by:

git push origin shiny-new-feature

This will automatically update your pull request with the latest code and restart the Continuous Integration tests.

Delete your merged branch (optional)

Once your feature branch is accepted into upstream, you’ll probably want to get rid of the branch. First, merge upstream master into your branch so git knows it is safe to delete your branch:

git fetch upstream
git checkout master
git merge upstream/master

Then you can do:

git branch -d shiny-new-feature

Make sure you use a lower-case -d, or else git won’t warn you if your feature branch has not actually been merged.

The branch will still exist on GitHub, so to delete it there do:

git push origin --delete shiny-new-feature

See also

Get in touch

  • Ask usage questions (“How do I?”) on StackOverflow.
  • Report bugs, suggest features or view the source code on GitHub.
  • For less well defined questions or ideas, or to announce other projects of interest to xarray users, use the mailing list.

License

xarray is available under the open source Apache License.

History

xarray is an evolution of an internal tool developed at The Climate Corporation. It was originally written by Climate Corp researchers Stephan Hoyer, Alex Kleeman and Eugene Brevdo and was released as open source in May 2014. The project was renamed from “xray” in January 2016.