Tutorial

To get started, we will import numpy, pandas and xray:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import xray

DataArray

xray.DataArray is xray’s implementation of a labeled, multi-dimensional array. It has three key properties:

  • values: a numpy.ndarray holding the array’s values
  • dims: dimension names for each axis, e.g., ('x', 'y', 'z')
  • coords: tick labels along each dimension, e.g., 1-dimensional arrays of numbers, datetime objects or strings.

xray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xray uses instead of the axis argument found in many numpy functions. Coordinates enable fast label-based indexing and alignment, like the index found on a pandas DataFrame or Series.

DataArray objects also can have a name and can hold arbitrary metadata in the form of their attrs property (an ordered dictionary). Names and attributes are strictly for users and user-written code: xray makes no attempt to interpret them, and propagates them only in unambiguous cases.

Creating a DataArray

The DataArray constructor takes a multi-dimensional array of values (e.g., a numpy ndarray), a list of coordinate labels and a list of dimension names:

In [4]: data = np.random.rand(4, 3)

In [5]: locs = ['IA', 'IL', 'IN']

In [6]: times = pd.date_range('2000-01-01', periods=4)

In [7]: foo = xray.DataArray(data, coords=[times, locs], dims=['time', 'space'])

In [8]: foo
Out[8]: 
<xray.DataArray (time: 4, space: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

All of these arguments (except for data) are optional, and will be filled in with default values:

In [9]: xray.DataArray(data)
Out[9]: 
<xray.DataArray (dim_0: 4, dim_1: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])
Coordinates:
    dim_0: Int64Index([0, 1, 2, 3], dtype='int64')
    dim_1: Int64Index([0, 1, 2], dtype='int64')
Attributes:
    Empty

You can also create a DataArray by supplying a pandas Series, DataFrame or Panel, in which case any non-specified arguments in the DataArray constructor will be filled in from the pandas object:

In [10]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]}, index=['a', 'b'])

In [11]: df.index.name = 'abc'

In [12]: df.columns.name = 'xyz'

In [13]: df
Out[13]: 
xyz  x  y
abc      
a    0  2
b    1  3

In [14]: xray.DataArray(df)
Out[14]: 
<xray.DataArray (abc: 2, xyz: 2)>
array([[0, 2],
       [1, 3]])
Coordinates:
    abc: Index([u'a', u'b'], dtype='object')
    xyz: Index([u'x', u'y'], dtype='object')
Attributes:
    Empty

xray does not (yet!) support labeling coordinate values with a pandas.MultiIndex (see GH164). However, the alternate from_series constructor will automatically unpack any hierarchical indexes it encounters by expanding the series into a multi-dimensional array, as described in Working with pandas.
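
The unpacking that from_series performs matches the behavior of pandas' own unstack method. As a minimal sketch in plain pandas (no xray required), each level of the hierarchical index becomes one dimension of the resulting array:

```python
import pandas as pd

# A Series with a two-level MultiIndex, as from_series might receive
index = pd.MultiIndex.from_product([[2000, 2001], ['a', 'b']],
                                   names=['year', 'letter'])
series = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Unstacking one level expands the Series into a 2-dimensional table:
unstacked = series.unstack('letter')
print(unstacked.values.tolist())  # [[1.0, 2.0], [3.0, 4.0]]
```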

DataArray properties

Let’s take a look at the important properties on our array:

In [15]: foo.values
Out[15]: 
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])

In [16]: foo.dims
Out[16]: ('time', 'space')

In [17]: foo.coords
Out[17]: 
time: <class 'pandas.tseries.index.DatetimeIndex'>
      [2000-01-01, ..., 2000-01-04]
      Length: 4, Freq: D, Timezone: None
space: Index([u'IA', u'IL', u'IN'], dtype='object')

In [18]: foo.attrs
Out[18]: OrderedDict()

In [19]: print(foo.name)
None

Now fill in some of that missing metadata:

In [20]: foo.name = 'foo'

In [21]: foo.attrs['units'] = 'meters'

In [22]: foo
Out[22]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

The coords property is dict-like. Individual coordinates can be accessed by name:

In [23]: foo.coords['time']
Out[23]: 
<xray.Coordinate 'time' (time: 4)>
array(['1999-12-31T18:00:00.000000000-0600',
       '2000-01-01T18:00:00.000000000-0600',
       '2000-01-02T18:00:00.000000000-0600',
       '2000-01-03T18:00:00.000000000-0600'], dtype='datetime64[ns]')
Attributes:
    Empty

These are xray.Coordinate objects, which contain tick-labels for each dimension.

You can also access coordinates by indexing a DataArray directly by name, in which case it returns another DataArray:

In [24]: foo['time']
Out[24]: 
<xray.DataArray 'time' (time: 4)>
array(['1999-12-31T18:00:00.000000000-0600',
       '2000-01-01T18:00:00.000000000-0600',
       '2000-01-02T18:00:00.000000000-0600',
       '2000-01-03T18:00:00.000000000-0600'], dtype='datetime64[ns]')
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
Linked dataset variables:
    space, foo
Attributes:
    Empty

Dataset

xray.Dataset is xray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.

Creating a Dataset

To make an xray.Dataset from scratch, pass in a dictionary with values in the form (dimensions, data[, attrs]):

In [25]: times
Out[25]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01, ..., 2000-01-04]
Length: 4, Freq: D, Timezone: None

In [26]: locs
Out[26]: ['IA', 'IL', 'IN']

In [27]: data
Out[27]: 
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])

In [28]: ds = xray.Dataset({'time': ('time', times),
   ....:                    'space': ('space', locs),
   ....:                    'foo': (['time', 'space'], data)})
   ....: 

In [29]: ds
Out[29]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
Attributes:
    Empty

In the (dimensions, data[, attrs]) form:

  • dimensions should be a sequence of strings.
  • data should be a numpy.ndarray (or array-like object) whose number of dimensions equals the length of the dimensions list.

We can also use xray.Variable or xray.DataArray objects instead of tuples:

In [30]: xray.Dataset({'bar': foo})
Out[30]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    bar              1        0   
Attributes:
    Empty

You can also create a Dataset from a pandas.DataFrame with Dataset.from_dataframe or from a netCDF file on disk with open_dataset(). See Working with pandas and Serialization and IO.

Dataset contents

Dataset implements the Python dictionary interface, with values given by xray.DataArray objects:

In [31]: 'foo' in ds
Out[31]: True

In [32]: ds.keys()
Out[32]: ['space', 'foo', 'time']

In [33]: ds['foo']
Out[33]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

The valid keys include each listed “coordinate” and “noncoordinate” variable. Coordinates are arrays that label values along a particular dimension, implemented as a thin wrapper around a pandas.Index object. They are created automatically from dataset arrays whose name matches the sole item in their list of dimensions.

Noncoordinate variables include all arrays in a Dataset other than its coordinates. These arrays can exist along multiple dimensions. The numbers in the columns in the Dataset representation indicate the order in which dimensions appear for each array (on a Dataset, the dimensions are always listed in alphabetical order).

Because we supplied an array named “space” along the “space” dimension, it was automatically promoted to a coordinate:

In [34]: ds['space']
Out[34]: 
<xray.DataArray 'space' (space: 3)>
array(['IA', 'IL', 'IN'], 
      dtype='|S2')
Coordinates:
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Linked dataset variables:
    foo, time
Attributes:
    Empty

Noncoordinate and coordinate variables are listed explicitly by the noncoords and coords attributes.

There are also a few derived variables based on datetime coordinates that you can access from a dataset (e.g., “year”, “month” and “day”), even if you didn’t explicitly add them. These are known as “virtual_variables”:
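
These virtual variables expose the datetime accessors on the underlying pandas DatetimeIndex. A quick sketch with plain pandas shows where the values for 'time.dayofyear' come from:

```python
import pandas as pd

times = pd.date_range('2000-01-01', periods=4)

# The same values xray surfaces as the 'time.dayofyear' virtual variable:
dayofyear = list(times.dayofyear)
print(dayofyear)  # [1, 2, 3, 4]
```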

In [35]: ds['time.dayofyear']
Out[35]: 
<xray.DataArray 'time.dayofyear' (time: 4)>
array([1, 2, 3, 4])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
Linked dataset variables:
    space, foo
Attributes:
    Empty

Finally, datasets also store arbitrary metadata in the form of attributes:

In [36]: ds.attrs
Out[36]: OrderedDict()

In [37]: ds.attrs['title'] = 'example attribute'

In [38]: ds
Out[38]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
Attributes:
    title: example attribute

xray does not enforce any restrictions on attributes, but serialization to some file formats may fail if you put in objects that are not strings, numbers or numpy.ndarray objects.

Modifying datasets

We can update a dataset in-place using Python’s standard dictionary syntax:

In [39]: ds['numbers'] = ('time', [10, 10, 20, 20])

In [40]: ds['abc'] = ('space', ['A', 'B', 'C'])

In [41]: ds
Out[41]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

It should be evident now how a Dataset lets you store many arrays along a (partially) shared set of common dimensions and coordinates.

To change the variables in a Dataset, you can use all the standard dictionary methods, including values, items, __delitem__, get and update.

You also can select and drop an explicit list of variables by using the select_vars() and drop_vars() methods to return a new Dataset. select_vars automatically includes the relevant coordinates:

In [42]: ds.select_vars('abc')
Out[42]: 
<xray.Dataset>
Dimensions:     (space: 3)
Coordinates:
    space            X   
Noncoordinates:
    abc              0   
Attributes:
    title: example attribute

If a dimension name is given as an argument to drop_vars, it also drops all variables that use that dimension:

In [43]: ds.drop_vars('time', 'space')
Out[43]: 
<xray.Dataset>
Dimensions:     ()
Coordinates:
    None
Noncoordinates:
    None
Attributes:
    title: example attribute

You can copy a Dataset by using the copy() method:

In [44]: ds2 = ds.copy()

In [45]: del ds2['time']

In [46]: ds2
Out[46]: 
<xray.Dataset>
Dimensions:     (space: 3)
Coordinates:
    space            X   
Noncoordinates:
    abc              0   
Attributes:
    title: example attribute

By default, the copy is shallow, so only the container will be copied: the contents of the Dataset will still be the same underlying xray.Variable objects. You can copy all data by supplying the argument deep=True.
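
The shallow/deep distinction here works like ordinary Python container copies. A minimal sketch with a plain dict of numpy arrays (a stand-in for a Dataset, not xray's actual implementation):

```python
import copy

import numpy as np

data = {'foo': np.arange(3)}

shallow = dict(data)        # new container, same underlying array
shallow['foo'][0] = 99      # mutation is visible through the original
print(data['foo'][0])       # 99

deep = copy.deepcopy(data)  # copies the array as well
deep['foo'][0] = -1
print(data['foo'][0])       # still 99
```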

Indexing

Indexing a DataArray works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray:

In [47]: foo[:2]
Out[47]: 
<xray.DataArray 'foo' (time: 2, space: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

In [48]: foo[0, 0]
Out[48]: 
<xray.DataArray 'foo' ()>
array(0.12696983303810094)
Linked dataset variables:
    time, space
Attributes:
    units: meters

In [49]: foo[:, [2, 1]]
Out[49]: 
<xray.DataArray 'foo' (time: 4, space: 2)>
array([[ 0.26047601,  0.96671784],
       [ 0.33622174,  0.37674972],
       [ 0.12310214,  0.84025508],
       [ 0.44799682,  0.37301223]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IN', u'IL'], dtype='object')
Attributes:
    units: meters

xray also supports label-based indexing, just like pandas. Because a Coordinate is a thin wrapper around a pandas.Index, label-based indexing is very fast. To do label-based indexing, use the loc attribute:

In [50]: foo.loc['2000-01-01':'2000-01-02', 'IA']
Out[50]: 
<xray.DataArray 'foo' (time: 2)>
array([ 0.12696983,  0.89723652])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
Linked dataset variables:
    space
Attributes:
    units: meters

You can perform any of the label indexing operations supported by pandas, including indexing with individual labels, with slices of labels and with arrays of labels, as well as indexing with boolean arrays. Like pandas, label-based indexing in xray is inclusive of both the start and stop bounds.
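
The inclusive-bounds behavior is inherited from pandas, so it can be demonstrated on a plain Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4],
              index=pd.date_range('2000-01-01', periods=4))

# Both endpoints of a label slice are included, unlike positional slicing:
sliced = s.loc['2000-01-02':'2000-01-03']
print(sliced.values.tolist())  # [2, 3]
```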

Setting values with label-based indexing is also supported:

In [51]: foo.loc['2000-01-01', ['IL', 'IN']] = -10

In [52]: foo
Out[52]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174],
       [  0.45137647,   0.84025508,   0.12310214],
       [  0.5430262 ,   0.37301223,   0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

With labeled dimensions, we do not have to rely on dimension order and can use them explicitly to slice data with the sel() and isel() methods:

# index by integer array indices
In [53]: foo.isel(space=0, time=slice(None, 2))
Out[53]: 
<xray.DataArray 'foo' (time: 2)>
array([ 0.12696983,  0.89723652])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
Linked dataset variables:
    space
Attributes:
    units: meters

# index by coordinate labels
In [54]: foo.sel(time=slice('2000-01-01', '2000-01-02'))
Out[54]: 
<xray.DataArray 'foo' (time: 2, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

The arguments to these methods can be any objects that could index the array along that dimension, e.g., labels for an individual value, Python slice objects or 1-dimensional arrays.

We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:

In [55]: ds.isel(space=[0], time=[0])
Out[55]: 
<xray.Dataset>
Dimensions:     (space: 1, time: 1)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

In [56]: ds.sel(time='2000-01-01')
Out[56]: 
<xray.Dataset>
Dimensions:     (space: 3)
Coordinates:
    space            X   
Noncoordinates:
    foo              0   
    time                 
    numbers              
    abc              0   
Attributes:
    title: example attribute

Indexing with xray objects has one important difference from indexing numpy arrays: you can only use one-dimensional arrays to index xray objects, and each indexer is applied “orthogonally” along independent axes, instead of using numpy’s array broadcasting. This means you can do indexing like this, which wouldn’t work with numpy arrays:

In [57]: foo[foo['time.day'] > 1, foo['space'] != 'IL']
Out[57]: 
<xray.DataArray 'foo' (time: 3, space: 2)>
array([[ 0.89723652,  0.33622174],
       [ 0.45137647,  0.12310214],
       [ 0.5430262 ,  0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-02, ..., 2000-01-04]
          Length: 3, Freq: D, Timezone: None
    space: Index([u'IA', u'IN'], dtype='object')
Attributes:
    units: meters

This is a much simpler model than numpy’s advanced indexing, and is basically the only model that works for labeled arrays. If you would like to do advanced indexing, you can always index .values directly instead:

In [58]: foo.values[foo.values > 0.5]
Out[58]: array([ 0.89723652,  0.84025508,  0.5430262 ])

Computation

The metadata of DataArray objects enables particularly nice features for doing mathematical operations.

Basic math

Basic math with DataArray objects works just as you would expect:

In [59]: foo - 3
Out[59]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ -2.87303017, -13.        , -13.        ],
       [ -2.10276348,  -2.62325028,  -2.66377826],
       [ -2.54862353,  -2.15974492,  -2.87689786],
       [ -2.4569738 ,  -2.62698777,  -2.55200318]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

You can also use any of numpy’s or scipy’s many ufuncs directly on a DataArray:

In [60]: np.sin(foo)
Out[60]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.12662895,  0.54402111,  0.54402111],
       [ 0.78160612,  0.36790009,  0.32992275],
       [ 0.43620456,  0.74481335,  0.12279146],
       [ 0.51672923,  0.36442217,  0.43316091]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

DataArray also has metadata-aware versions of many numpy.ndarray methods:

In [61]: foo.T
Out[61]: 
<xray.DataArray 'foo' (space: 3, time: 4)>
array([[  0.12696983,   0.89723652,   0.45137647,   0.5430262 ],
       [-10.        ,   0.37674972,   0.84025508,   0.37301223],
       [-10.        ,   0.33622174,   0.12310214,   0.44799682]])
Coordinates:
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
Attributes:
    units: meters

In [62]: foo.round(2)
Out[62]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.13, -10.  , -10.  ],
       [  0.9 ,   0.38,   0.34],
       [  0.45,   0.84,   0.12],
       [  0.54,   0.37,   0.45]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

It also has the isnull and notnull methods from pandas:

In [63]: xray.DataArray([0, 1, np.nan, np.nan, 2]).isnull()
Out[63]: 
<xray.DataArray (dim_0: 5)>
array([False, False,  True,  True, False], dtype=bool)
Coordinates:
    dim_0: Int64Index([0, 1, 2, 3, 4], dtype='int64')
Attributes:
    Empty

You cannot directly do math with Dataset objects (yet!), but you can map an operation over any or all non-coordinates in a dataset by using apply():

In [64]: ds.drop_vars('abc').apply(lambda x: 2 * x)
Out[64]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
Attributes:
    Empty

Aggregation

Aggregation methods from ndarray have been updated to take a dim argument instead of axis. This allows for very intuitive syntax for aggregation methods that are applied along particular dimension(s):

In [65]: foo.sum('time')
Out[65]: 
<xray.DataArray 'foo' (space: 3)>
array([ 2.01860903, -8.40998298, -9.09267929])
Coordinates:
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

In [66]: foo.std(['time', 'space'])
Out[66]: 
<xray.DataArray 'foo' ()>
array(3.901454019694515)
Attributes:
    Empty

In [67]: foo.min()
Out[67]: 
<xray.DataArray 'foo' ()>
array(-10.0)
Attributes:
    Empty

These operations also work on Dataset objects, by mapping over all non-coordinates:

In [68]: ds.mean('time')
Out[68]: 
<xray.Dataset>
Dimensions:     (space: 3)
Coordinates:
    space            X   
Noncoordinates:
    foo              0   
    numbers              
    abc              0   
Attributes:
    Empty

If you need to figure out the axis number for a dimension yourself (say, for wrapping code designed to work with numpy arrays), you can use the get_axis_num() method:

In [69]: foo.get_axis_num('space')
Out[69]: 1
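
Under the hood this is just a lookup into the dims tuple, which you can emulate yourself when handing data to axis-based numpy code (a sketch with a hypothetical dims tuple, not xray's implementation):

```python
import numpy as np

dims = ('time', 'space')          # the dims tuple of a labeled array
data = np.arange(12.0).reshape(4, 3)

# The positional axis for a named dimension, like get_axis_num('space'):
axis = dims.index('space')
print(axis)                       # 1
print(data.sum(axis=axis).shape)  # (4,)
```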

To perform NA-skipping aggregations, pass NA-aware numpy functions directly to the reduce() method:

In [70]: foo.reduce(np.nanmean, 'time')
Out[70]: 
<xray.DataArray 'foo' (space: 3)>
array([ 0.50465226, -2.10249574, -2.27316982])
Coordinates:
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

Warning

Currently, xray uses the standard ndarray methods which do not automatically skip missing values, but we expect to switch the default to NA skipping versions (like pandas) in a future version (GH130).

Broadcasting

DataArray objects automatically align themselves (“broadcasting” in numpy parlance) by dimension name instead of axis order. With xray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as is commonly done in numpy with np.reshape() or np.newaxis.

This is best illustrated by a few examples. Consider two one-dimensional arrays with different sizes aligned along different dimensions:

In [71]: a = xray.DataArray([1, 2, 3, 4], [['a', 'b', 'c', 'd']], ['x'])

In [72]: a
Out[72]: 
<xray.DataArray (x: 4)>
array([1, 2, 3, 4])
Coordinates:
    x: Index([u'a', u'b', u'c', u'd'], dtype='object')
Attributes:
    Empty

In [73]: b = xray.DataArray([-1, -2, -3], dims=['y'])

In [74]: b
Out[74]: 
<xray.DataArray (y: 3)>
array([-1, -2, -3])
Coordinates:
    y: Int64Index([0, 1, 2], dtype='int64')
Attributes:
    Empty

With xray, we can apply binary mathematical operations to these arrays, and their dimensions are expanded automatically:

In [75]: a * b
Out[75]: 
<xray.DataArray (x: 4, y: 3)>
array([[ -1,  -2,  -3],
       [ -2,  -4,  -6],
       [ -3,  -6,  -9],
       [ -4,  -8, -12]])
Coordinates:
    x: Index([u'a', u'b', u'c', u'd'], dtype='object')
    y: Int64Index([0, 1, 2], dtype='int64')
Attributes:
    Empty
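
The equivalent operation in plain numpy requires inserting the length-1 axes by hand, which is exactly the bookkeeping that dimension names make unnecessary:

```python
import numpy as np

a = np.array([1, 2, 3, 4])    # dimension 'x'
b = np.array([-1, -2, -3])    # dimension 'y'

# Without named dimensions, we must expand the arrays manually so 'x'
# and 'y' land on separate axes before broadcasting multiplies them:
result = a[:, np.newaxis] * b[np.newaxis, :]
print(result.shape)  # (4, 3)
```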

Moreover, dimensions are always reordered to the order in which they first appeared:

In [76]: c = xray.DataArray(np.arange(12).reshape(3, 4), [b['y'], a['x']])

In [77]: c
Out[77]: 
<xray.DataArray (y: 3, x: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
    y: Int64Index([0, 1, 2], dtype='int64')
    x: Index([u'a', u'b', u'c', u'd'], dtype='object')
Attributes:
    Empty

In [78]: a + c
Out[78]: 
<xray.DataArray (x: 4, y: 3)>
array([[ 1,  5,  9],
       [ 3,  7, 11],
       [ 5,  9, 13],
       [ 7, 11, 15]])
Coordinates:
    x: Index([u'a', u'b', u'c', u'd'], dtype='object')
    y: Int64Index([0, 1, 2], dtype='int64')
Attributes:
    Empty

This means, for example, that you can always subtract an array from its transpose:

In [79]: c - c.T
Out[79]: 
<xray.DataArray (y: 3, x: 4)>
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
Coordinates:
    y: Int64Index([0, 1, 2], dtype='int64')
    x: Index([u'a', u'b', u'c', u'd'], dtype='object')
Attributes:
    Empty

Alignment

Performing most binary operations on xray objects requires that all coordinate values are equal:

In [80]: a + a[:2]
ValueError: coordinate 'x' is not aligned

However, xray does have some methods (copied from pandas) that make manually aligning DataArray and Dataset objects easy and fast.

Warning

pandas does index based alignment automatically when doing math, using join='outer'. xray doesn’t have automatic alignment yet, but we do intend to enable it in a future version (GH186). Unlike pandas, we expect to default to join='inner'.

Reindexing returns modified arrays with new coordinates, filling in missing values with NaN. To reindex a particular dimension, use reindex().
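
The fill-with-NaN semantics match those of pandas' reindex; a sketch on a plain Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0], index=['IA', 'IL'])

# Labels missing from the original index get NaN values:
r = s.reindex(['IA', 'IL', 'IN'])
print(r['IA'])            # 1.0
print(np.isnan(r['IN']))  # True
```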

The reindex_like() method is a useful shortcut. To demonstrate, we will make a subset DataArray with new values:

In [81]: baz = (10 * foo[:2, :2]).rename('baz')

In [82]: baz
Out[82]: 
<xray.DataArray 'baz' (time: 2, space: 2)>
array([[   1.26969833, -100.        ],
       [   8.97236524,    3.76749716]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
    space: Index([u'IA', u'IL'], dtype='object')
Attributes:
    Empty

Reindexing foo with baz selects out the first two values along each dimension:

In [83]: foo.reindex_like(baz)
Out[83]: 
<xray.DataArray 'foo' (time: 2, space: 2)>
array([[  0.12696983, -10.        ],
       [  0.89723652,   0.37674972]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
    space: Index([u'IA', u'IL'], dtype='object')
Attributes:
    units: meters

The opposite operation asks us to reindex to a larger shape, so we fill in the missing values with NaN:

In [84]: baz.reindex_like(foo)
Out[84]: 
<xray.DataArray 'baz' (time: 4, space: 3)>
array([[   1.26969833, -100.        ,           nan],
       [   8.97236524,    3.76749716,           nan],
       [          nan,           nan,           nan],
       [          nan,           nan,           nan]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

The align() function lets us perform more flexible 'inner', 'outer', 'left' and 'right' joins:

In [85]: xray.align(foo, baz, join='inner')
Out[85]: 
(<xray.DataArray 'foo' (time: 2, space: 2)>
 array([[  0.12696983, -10.        ],
        [  0.89723652,   0.37674972]])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, 2000-01-02]
           Length: 2, Freq: None, Timezone: None
     space: Index([u'IA', u'IL'], dtype='object')
 Attributes:
     units: meters, <xray.DataArray 'baz' (time: 2, space: 2)>
 array([[   1.26969833, -100.        ],
        [   8.97236524,    3.76749716]])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, 2000-01-02]
           Length: 2, Freq: None, Timezone: None
     space: Index([u'IA', u'IL'], dtype='object')
 Attributes:
     Empty)

In [86]: xray.align(foo, baz, join='outer')
Out[86]: 
(<xray.DataArray 'foo' (time: 4, space: 3)>
 array([[  0.12696983, -10.        , -10.        ],
        [  0.89723652,   0.37674972,   0.33622174],
        [  0.45137647,   0.84025508,   0.12310214],
        [  0.5430262 ,   0.37301223,   0.44799682]])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, ..., 2000-01-04]
           Length: 4, Freq: D, Timezone: None
     space: Index([u'IA', u'IL', u'IN'], dtype='object')
 Attributes:
     units: meters, <xray.DataArray 'baz' (time: 4, space: 3)>
 array([[   1.26969833, -100.        ,           nan],
        [   8.97236524,    3.76749716,           nan],
        [          nan,           nan,           nan],
        [          nan,           nan,           nan]])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, ..., 2000-01-04]
           Length: 4, Freq: D, Timezone: None
     space: Index([u'IA', u'IL', u'IN'], dtype='object')
 Attributes:
     Empty)

Both reindex_like and align work interchangeably with DataArray and xray.Dataset objects with any number of overlapping dimensions:

In [87]: ds
Out[87]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

In [88]: ds.reindex_like(baz)
Out[88]: 
<xray.Dataset>
Dimensions:     (space: 2, time: 2)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

GroupBy: split-apply-combine

Pandas has very convenient support for “group by” operations, which implement the split-apply-combine strategy for crunching data:

  • Split your data into multiple independent groups.
  • Apply some function to each group.
  • Combine your groups back into a single data object.

xray implements this same pattern using very similar syntax to pandas. Group by operations work on both Dataset and DataArray objects. Note that currently, you can only group by a single one-dimensional variable (eventually, we hope to remove this limitation).
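
For reference, the same three steps with a plain pandas groupby (nothing beyond pandas itself is assumed):

```python
import pandas as pd

df = pd.DataFrame({'numbers': [10, 10, 20, 20],
                   'foo': [1.0, 2.0, 3.0, 4.0]})

# split on 'numbers', apply a mean to each group, combine into a Series
result = df.groupby('numbers')['foo'].mean()
print(result.to_dict())  # {10: 1.5, 20: 3.5}
```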

Split

Recall the “numbers” variable in our dataset:

In [89]: ds['numbers']
Out[89]: 
<xray.DataArray 'numbers' (time: 4)>
array([10, 10, 20, 20])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
Linked dataset variables:
    space, foo, abc
Attributes:
    Empty

If we group by the name of a variable in a dataset (we can also use a DataArray directly), we get back an xray.GroupBy object:

In [90]: ds.groupby('numbers')
Out[90]: <xray.groupby.DatasetGroupBy at 0x7fcdf38d16d0>

This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:

In [91]: ds.groupby('numbers').groups
Out[91]: {10: [0, 1], 20: [2, 3]}

You can also iterate over groups in (label, group) pairs:

In [92]: list(ds.groupby('numbers'))
Out[92]: 
[(10, <xray.Dataset>
  Dimensions:     (space: 3, time: 2)
  Coordinates:
      space            X            
      time                      X   
  Noncoordinates:
      foo              1        0   
      numbers                   0   
      abc              0            
  Attributes:
      title: example attribute), (20, <xray.Dataset>
  Dimensions:     (space: 3, time: 2)
  Coordinates:
      space            X            
      time                      X   
  Noncoordinates:
      foo              1        0   
      numbers                   0   
      abc              0            
  Attributes:
      title: example attribute)]

Just like in pandas, creating a GroupBy object doesn’t actually split the data until you want to access particular values.

Apply

To apply a function to each group, you can use the flexible xray.GroupBy.apply() method. The resulting objects are automatically concatenated back together along the group axis:

In [93]: def standardize(x):
   ....:     return (x - x.mean()) / x.std()
   ....: 

In [94]: ds['foo'].groupby('numbers').apply(standardize)
Out[94]: 
<xray.DataArray (time: 4, space: 3)>
array([[ 0.64391378, -1.41264919, -1.41264919],
       [ 0.80033786,  0.69463853,  0.6864082 ],
       [-0.05512164,  1.76892497, -1.59490205],
       [ 0.37476413, -0.42269144, -0.07097397]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: None, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Linked dataset variables:
    numbers
Attributes:
    Empty

GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:

In [95]: foo.groupby('time').mean()
Out[95]: 
<xray.DataArray 'foo' (time: 4)>
array([-6.62434339,  0.53673599,  0.4715779 ,  0.45467842])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
Attributes:
    units: meters

In [96]: ds.groupby('numbers').reduce(np.nanmean)
Out[96]: 
<xray.Dataset>
Dimensions:     (numbers: 2)
Coordinates:
    numbers           X    
Noncoordinates:
    foo               0    
Attributes:
    Empty

Squeezing

When grouping over a dimension, you can use the squeeze parameter to control whether the dimension is squeezed out or whether it remains with length one in each group:

In [97]: next(iter(foo.groupby('space')))
Out[97]: 
('IA', <xray.DataArray 'foo' (time: 4)>
 array([ 0.12696983,  0.89723652,  0.45137647,  0.5430262 ])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, ..., 2000-01-04]
           Length: 4, Freq: D, Timezone: None
 Linked dataset variables:
     space
 Attributes:
     units: meters)
In [98]: next(iter(foo.groupby('space', squeeze=False)))
Out[98]: 
('IA', <xray.DataArray 'foo' (time: 4, space: 1)>
 array([[ 0.12696983],
        [ 0.89723652],
        [ 0.45137647],
        [ 0.5430262 ]])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, ..., 2000-01-04]
           Length: 4, Freq: D, Timezone: None
     space: Index([u'IA'], dtype='object')
 Attributes:
     units: meters)

Although xray will attempt to automatically transpose dimensions back into their original order when you use apply, it is sometimes useful to set squeeze=False to guarantee that all original dimensions remain unchanged.

You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.
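The effect of squeezing a length-one dimension can be sketched with plain numpy (xray's squeeze() behaves analogously on named dimensions):

```python
import numpy as np

arr = np.random.rand(4, 1)   # e.g., dims ('time', 'space') with one location
squeezed = arr.squeeze()     # the length-one axis is dropped
print(arr.shape, squeezed.shape)  # (4, 1) (4,)
```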

Combining data

Concatenate

To combine arrays along a dimension into a larger array, you can use the DataArray.concat and Dataset.concat class methods:

In [99]: xray.DataArray.concat([foo[0], foo[1]], 'new_dim')
Out[99]: 
<xray.DataArray 'foo' (new_dim: 2, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174]])
Coordinates:
    new_dim: Int64Index([0, 1], dtype='int64')
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Linked dataset variables:
    time
Attributes:
    units: meters

In [100]: xray.Dataset.concat([ds.sel(time='2000-01-01'), ds.sel(time='2000-01-03')],
   .....:                     'new_dim')
   .....: 
Out[100]: 
<xray.Dataset>
Dimensions:     (new_dim: 2, space: 3)
Coordinates:
    new_dim           X              
    space                        X   
Noncoordinates:
    abc                          0   
    foo               0          1   
    numbers           0              
    time              0              
Attributes:
    title: example attribute

The second argument to concat can be a Coordinate or DataArray object as well as a string, in which case it is used to label the values along the new dimension:

In [101]: xray.DataArray.concat([foo[0], foo[1]], xray.Coordinate('x', [-90, -100]))
Out[101]: 
<xray.DataArray 'foo' (x: 2, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174]])
Coordinates:
    x: Int64Index([-90, -100], dtype='int64')
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Linked dataset variables:
    time
Attributes:
    units: meters

Dataset.concat has a number of options which control how it combines data, and in particular, how it handles conflicting variables between datasets.

Merge and update

To combine multiple Datasets, you can use the merge() and update() methods. merge checks for conflicting variables before merging and, by default, returns a new Dataset:

In [102]: ds.merge({'hello': ('space', np.arange(3) + 10)})
Out[102]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
    hello            0            
Attributes:
    title: example attribute

In contrast, update modifies a dataset in-place without checking for conflicts, and will overwrite any existing variables with new values:

In [103]: ds.update({'space': ('space', [10.2, 9.4, 3.9])})
Out[103]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

However, dimensions are still required to be consistent between different Dataset variables, so you cannot change the size of a dimension unless you replace all dataset variables that use it.
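The contrast between merge and update can be sketched with plain dictionaries; merge_no_conflicts below is a hypothetical helper illustrating the semantics, not part of xray's API:

```python
def merge_no_conflicts(d1, d2):
    """Return a new dict, raising on conflicting values (like merge)."""
    out = dict(d1)
    for key, value in d2.items():
        if key in out and out[key] != value:
            raise ValueError('conflicting values for %r' % key)
        out[key] = value
    return out

d = {'foo': 1}
merged = merge_no_conflicts(d, {'hello': 2})  # new object, like ds.merge
d.update({'foo': 99})                         # in-place, unchecked, like ds.update
```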

Equals and identical

xray objects can be compared by using the equals() and identical() methods.

equals checks dimension names, indexes and array values:

In [104]: foo.equals(foo.copy())
Out[104]: True

identical also checks attributes, and the name of each object:

In [105]: foo.identical(foo.rename('bar'))
Out[105]: False

In contrast, the == operator for DataArray objects performs element-wise comparison (like numpy):

In [106]: foo == foo.copy()
Out[106]: 
<xray.DataArray (time: 4, space: 3)>
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

Like pandas objects, two xray objects are still equal or identical if they have missing values marked by NaN, as long as the missing values are in the same locations in both objects. This is not true for NaN in general, which compares unequal to everything, including itself:

In [107]: np.nan == np.nan
Out[107]: False
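A NaN-tolerant comparison therefore has to check for NaN explicitly, as this numpy sketch illustrates (a sketch of the idea, not xray's internal code):

```python
import numpy as np

a = np.array([1.0, np.nan])
b = np.array([1.0, np.nan])
naive = a == b                                   # [True, False]: NaN != NaN
nan_aware = naive | (np.isnan(a) & np.isnan(b))  # [True, True]
```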

Working with pandas

One of the most important features of xray is the ability to convert to and from pandas objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built into pandas itself or provided by pandas-aware libraries such as Seaborn and ggplot.

Fortunately, there are straightforward representations of Dataset and DataArray in terms of pandas.DataFrame and pandas.Series, respectively. The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.

Note

If you want to convert a pandas data structure into a DataArray with the same number of dimensions, you can simply use the DataArray constructor directly.

pandas.DataFrame

To convert to a DataFrame, use the Dataset.to_dataframe() method:

In [108]: df = ds.to_dataframe()

In [109]: df
Out[109]: 
                        foo  numbers abc
space time                              
10.2  2000-01-01   0.126970       10   A
      2000-01-02   0.897237       10   A
      2000-01-03   0.451376       20   A
      2000-01-04   0.543026       20   A
9.4   2000-01-01 -10.000000       10   B
      2000-01-02   0.376750       10   B
      2000-01-03   0.840255       20   B
      2000-01-04   0.373012       20   B
3.9   2000-01-01 -10.000000       10   C
      2000-01-02   0.336222       10   C
      2000-01-03   0.123102       20   C
      2000-01-04   0.447997       20   C

We see that each non-coordinate variable in the Dataset is now a column in the DataFrame. The DataFrame representation is reminiscent of Hadley Wickham’s notion of tidy data. To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().
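For instance, the unstack/stack round-trip on a MultiIndexed frame looks like this in pandas (a small illustrative frame, not the dataset from this tutorial):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                 names=['space', 'time'])
df = pd.DataFrame({'foo': np.arange(4.0)}, index=idx)
wide = df['foo'].unstack('time')  # 'time' labels become columns
tall = wide.stack()               # back to a MultiIndexed Series
```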

To create a Dataset from a DataFrame, use the from_dataframe() class method:

In [110]: xray.Dataset.from_dataframe(df)
Out[110]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              0        1   
    numbers          0        1   
    abc              0        1   
Attributes:
    Empty

Notice that the dimensions of the non-coordinates in the Dataset have expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must share the same indices, so the data of each array had to be broadcast to the full size of the new MultiIndex.

pandas.Series

DataArray objects have a complementary representation in terms of a pandas.Series. Using a Series preserves the Dataset-to-DataArray relationship, because DataFrames are dict-like containers of Series. The methods are very similar to those for working with DataFrames:

In [111]: s = foo.to_series()

In [112]: s
Out[112]: 
time        space
2000-01-01  IA        0.126970
            IL      -10.000000
            IN      -10.000000
2000-01-02  IA        0.897237
            IL        0.376750
            IN        0.336222
2000-01-03  IA        0.451376
            IL        0.840255
            IN        0.123102
2000-01-04  IA        0.543026
            IL        0.373012
            IN        0.447997
Name: foo, dtype: float64

In [113]: xray.DataArray.from_series(s)
Out[113]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174],
       [  0.45137647,   0.84025508,   0.12310214],
       [  0.5430262 ,   0.37301223,   0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: None, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

Both the from_series and from_dataframe methods use reindexing, so they work even if the hierarchical index is not a full tensor product:

In [114]: s[::2]
Out[114]: 
time        space
2000-01-01  IA        0.126970
            IN      -10.000000
2000-01-02  IL        0.376750
2000-01-03  IA        0.451376
            IN        0.123102
2000-01-04  IL        0.373012
Name: foo, dtype: float64

In [115]: xray.DataArray.from_series(s[::2])
Out[115]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.12696983,          nan, -10.        ],
       [         nan,   0.37674972,          nan],
       [  0.45137647,          nan,   0.12310214],
       [         nan,   0.37301223,          nan]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: None, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty
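What that reindexing does can be sketched with pandas alone: a Series whose MultiIndex is not a full tensor product, reindexed onto the full product, gains NaN for the missing cells (toy data, not the tutorial's):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('b', 2)], names=['x', 'y'])
s = pd.Series([1.0, 2.0], index=idx)
full = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['x', 'y'])
expanded = s.reindex(full)  # 4 cells, 2 of them NaN
```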

Serialization and IO

xray supports direct serialization and IO to several file formats. For more options, consider exporting your objects to pandas (see the preceding section) and using its broad range of IO tools.

Pickle

The simplest way to serialize an xray object is to use Python’s built-in pickle module:

In [116]: import cPickle as pickle

In [117]: pkl = pickle.dumps(ds)

In [118]: pickle.loads(pkl)
Out[118]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

Pickle support is important because it doesn’t require any external libraries and lets you use xray objects with Python modules like multiprocessing. However, there are two important caveats:

  1. To simplify serialization, xray’s support for pickle currently loads all array values into memory before dumping an object. This means it is not suitable for serializing datasets too big to load into memory (e.g., from netCDF or OPeNDAP).
  2. Pickle will only work as long as the internal data structure of xray objects remains unchanged. Because the internal design of xray is still being refined, we make no guarantees (at this point) that objects pickled with this version of xray will work in future versions.

Reading and writing to disk (netCDF)

Currently, the only external serialization format that xray supports is netCDF. netCDF is a file format for fully self-described datasets that is widely used in the geosciences and supported on almost all platforms. We use netCDF because xray was based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects. Recent versions of netCDF are based on the even more widely used HDF5 file format.

Reading and writing netCDF files with xray requires the Python-netCDF4 library.

We can save a Dataset to disk using the Dataset.to_netcdf method:

In [119]: ds.to_netcdf('saved_on_disk.nc')

By default, the file is saved as netCDF4.

We can load netCDF files to create a new Dataset using the open_dataset() function:

In [120]: ds_disk = xray.open_dataset('saved_on_disk.nc')

In [121]: ds_disk
Out[121]: 
<xray.Dataset>
Dimensions:     (space: 4, time: 3)
Coordinates:
    space            X
    time                      X
Noncoordinates:
    foo              1        0
    numbers          0
    abc                       0
Attributes:
    title: example attribute

A dataset can also be loaded from a specific group within a netCDF file. To load from a group, pass a group keyword argument to the open_dataset function. The group can be specified as a path-like string, e.g., to access subgroup ‘bar’ within group ‘foo’ pass ‘/foo/bar’ as the group argument.

Data is loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until necessary. For an example of how these lazy arrays work, see the OPeNDAP section below.

Datasets have a close() method to close the associated netCDF file. The preferred way to handle this is to use a context-manager:

In [122]: with xray.open_dataset('my_file.nc') as ds:
...           print(ds.keys())
Out[122]: ['space', 'foo', 'time', 'numbers', 'abc']

Note

Although xray provides reasonable support for incremental reads of files on disk, it does not yet support incremental writes, which is important for dealing with datasets that do not fit into memory. This is a significant shortcoming that we hope to resolve (GH199) by adding the ability to create Dataset objects directly linked to a netCDF file on disk.

NetCDF files follow some conventions for encoding datetime arrays (as numbers with a “units” attribute) and for packing and unpacking data (as described by the “scale_factor” and “_FillValue” attributes). If the argument decode_cf=True (default) is given to open_dataset, xray will attempt to automatically decode the values in the netCDF objects according to CF conventions. Sometimes this will fail, for example, if a variable has an invalid “units” or “calendar” attribute. For these cases, you can turn this decoding off manually.

You can view this encoding information and control the details of how xray serializes objects, by viewing and manipulating the DataArray.encoding attribute:

In [123]: ds_disk['time'].encoding
Out[123]: 
{'calendar': u'proleptic_gregorian',
 'chunksizes': None,
 'complevel': 0,
 'contiguous': True,
 'dtype': dtype('float64'),
 'fletcher32': False,
 'least_significant_digit': None,
 'shuffle': False,
 'units': u'days since 2000-01-01 00:00:00',
 'zlib': False}

Working with remote datasets (OPeNDAP)

xray includes support for OPeNDAP (via the netCDF4 library or Pydap), which lets us access large datasets over HTTP.

For example, we can open a connection to GBs of weather data produced by the PRISM project, and hosted by the International Research Institute for Climate and Society at Columbia:

In [124]: remote_data = xray.open_dataset(
    'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods')

In [125]: remote_data
Out[125]: 
<xray.Dataset>
Dimensions:     (T: 1432, X: 1405, Y: 621)
Coordinates:
    T               X
    X                        X
    Y                                 X
Noncoordinates:
    ppt             0        2        1
    tdmean          0        2        1
    tmax            0        2        1
    tmin            0        2        1
Attributes:
    Conventions: IRIDL
    expires: 1401580800

In [126]: remote_data['tmax']
Out[126]: 
<xray.DataArray 'tmax' (T: 1432, Y: 621, X: 1405)>
[1249427160 values with dtype=float64]
Attributes:
    pointwidth: 120
    units: Celsius_scale
    missing_value: -9999
    standard_name: air_temperature
    expires: 1401580800

We can select and slice this data any number of times, and nothing is loaded over the network until we look at particular values:

In [127]: tmax = remote_data['tmax'][:500, ::3, ::3]

In [128]: tmax
Out[128]: 
<xray.DataArray 'tmax' (T: 500, Y: 207, X: 469)>
[48541500 values with dtype=float64]
Attributes:
    pointwidth: 120
    units: Celsius_scale
    missing_value: -9999
    standard_name: air_temperature
    expires: 1401580800

Now, let’s access and plot a small subset:

In [129]: tmax_ss = tmax[0]

For this dataset, we still need to manually fill in some of the values with NaN to indicate that they are missing. As soon as we access tmax_ss.values, the values are loaded over the network and cached on the DataArray so they can be manipulated:

In [130]: tmax_ss.values[tmax_ss.values < -99] = np.nan

Finally, we can plot the values with matplotlib:

In [131]: import matplotlib.pyplot as plt

In [132]: from matplotlib.cm import get_cmap

In [133]: plt.figure(figsize=(9, 5))

In [134]: plt.gca().patch.set_color('0')

In [135]: plt.contourf(tmax_ss['X'], tmax_ss['Y'], tmax_ss.values, 20,
     ...:     cmap=get_cmap('RdBu_r'))

In [136]: plt.colorbar()
[image: opendap-prism-tmax.png]

Loading into memory

xray’s lazy loading of remote or on-disk datasets is not always desirable. In such cases, you can use the load_data() method to force loading a Dataset or DataArray entirely into memory. In particular, this can lead to significant speedups if done before performing array-based indexing.

Notes on xray’s internals

Warning

These implementation details may be useful for advanced users, but they will change in future versions.

DataArray

In the current version of xray, DataArrays are simply pointers to a dataset (the dataset attribute) and the name of a variable in the dataset (the name attribute), which indicates to which variable array operations should be applied. These variables are listed in the DataArray representation as “linked dataset variables”:

In [137]: foo
Out[137]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174],
       [  0.45137647,   0.84025508,   0.12310214],
       [  0.5430262 ,   0.37301223,   0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

Usually, xray automatically manages the Dataset objects that data arrays point to in a satisfactory fashion.

However, in some cases, particularly for performance reasons, you may want to explicitly ensure that the dataset only includes the variables you are interested in. For these cases, use the xray.DataArray.select_vars() method to select the names of the variables you want to keep around; by default, only the DataArray itself is kept:

In [138]: foo2 = foo.select_vars()

In [139]: foo2
Out[139]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174],
       [  0.45137647,   0.84025508,   0.12310214],
       [  0.5430262 ,   0.37301223,   0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

foo2 is generally an equivalent labeled array to foo, but we dropped the dataset variables that are no longer relevant:

In [140]: foo.dataset.keys()
Out[140]: ['time', 'space', 'foo']

In [141]: foo2.dataset.keys()
Out[141]: ['time', 'space', 'foo']

Note

This feature may change in a future version of xray, because we intend to support non-index coordinates (GH197), which should cover all the use cases for “linked dataset variables” in a much more obvious fashion.

Variable

Variable implements xray’s basic building block for Dataset and DataArray variables. It supports the numpy ndarray interface, but is extended with basic metadata (not including index values). It consists of:

  1. dims: A tuple of dimension names.
  2. values: The N-dimensional array (for example, of type numpy.ndarray) storing the array’s data. It must have the same number of dimensions as the length of dims.
  3. attrs: An ordered dictionary of additional metadata to associate with this array.

The main functional difference between Variables and numpy arrays is that numerical operations on Variables implement array broadcasting by dimension name. For example, adding a Variable with dimensions (‘time’,) to another Variable with dimensions (‘space’,) results in a new Variable with dimensions (‘time’, ‘space’). Furthermore, numpy reduce operations like mean or sum are overridden to take a “dimension” argument instead of an “axis”.
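Broadcasting by dimension name can be sketched with plain numpy: adding an array with dims (‘time’,) to one with dims (‘space’,) amounts to inserting new axes so the result has dims (‘time’, ‘space’). This mirrors the behavior in spirit; xray handles the axis bookkeeping automatically:

```python
import numpy as np

time_var = np.array([0., 1., 2., 3.])   # dims ('time',), shape (4,)
space_var = np.array([10., 20., 30.])   # dims ('space',), shape (3,)
# Insert new axes so numpy's positional broadcasting lines the dims up:
result = time_var[:, np.newaxis] + space_var[np.newaxis, :]
print(result.shape)  # (4, 3), i.e. dims ('time', 'space')
```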

Variables are light-weight objects used as the building blocks for datasets. They are more primitive objects, so operations with them provide marginally higher performance than using DataArrays. However, manipulating data in the form of a Dataset or DataArray should almost always be preferred, because they can use more complete metadata in the context of coordinate labels.

You can find a read-only copy of the variables associated with a Dataset in its .variables attribute, or for a DataArray in its .variable attribute.