# Data structures

xray’s core data structures are the `Dataset`,
the `Variable` (including its subclass
`Coordinate`) and the `DataArray`.

This document is a technical summary of the xray data model. It is aimed mostly at advanced users who want to extend or contribute to xray internals.

## Dataset

`Dataset` is a Python object representing a fully self-described
dataset of labeled N-dimensional arrays. It consists of:

- **variables**: A dictionary of Variable objects.
- **dimensions**: A dictionary giving the lengths of shared dimensions, which are required to be consistent across all variables in a Dataset.
- **attributes**: An ordered dictionary of metadata.
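The consistency requirement on shared dimensions can be illustrated with a small sketch. It is not xray's implementation: the `infer_dimensions` helper and the `(dims, shape)` layout used for each variable are hypothetical stand-ins chosen for this example.

```python
from collections import OrderedDict


def infer_dimensions(variables):
    """Collect dimension lengths, insisting that they agree across variables.

    `variables` maps a name to a (dims, shape) pair. Both this function and
    this layout are hypothetical, used only to illustrate the data model.
    """
    dimensions = OrderedDict()
    for name, (dims, shape) in variables.items():
        if len(dims) != len(shape):
            raise ValueError(
                "%r has %d dimension names for %d axes"
                % (name, len(dims), len(shape)))
        for dim, length in zip(dims, shape):
            # setdefault records the first length seen for each dimension
            # and returns the recorded length on later encounters.
            if dimensions.setdefault(dim, length) != length:
                raise ValueError(
                    "dimension %r has length %d in %r but %d elsewhere"
                    % (dim, length, name, dimensions[dim]))
    return dimensions


variables = OrderedDict([
    ("temperature", (("time", "space"), (3, 4))),
    ("elevation", (("space",), (4,))),
])
dimensions = infer_dimensions(variables)
attributes = OrderedDict(title="toy dataset")
print(dict(dimensions))  # {'time': 3, 'space': 4}
```

A variable whose `space` axis had a different length would raise `ValueError` here, which is the kind of inconsistency the data model rules out.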

The design of the Dataset is based on the NetCDF file format for self-described scientific data. This data model has become very successful and is widely used in the geosciences.

The Dataset is an intelligent container. It allows for simultaneous integer
or label based indexing of all of its variables, supports split-apply-combine
operations with groupby, and can be converted to and from
`pandas.DataFrame` objects.
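To make "simultaneous indexing of all of its variables" concrete, here is a minimal sketch using plain NumPy. The `index_all` function and the `(dims, data)` pairs are assumptions made for illustration and are not part of xray's API.

```python
import numpy as np


def index_all(variables, dim, position):
    """Apply one integer index along `dim` to every variable that has it.

    Hypothetical helper: `variables` maps a name to a (dims, data) pair.
    """
    result = {}
    for name, (dims, data) in variables.items():
        if dim in dims:
            axis = dims.index(dim)
            # A scalar index removes the indexed dimension from the result.
            new_dims = dims[:axis] + dims[axis + 1:]
            result[name] = (new_dims, np.take(data, position, axis=axis))
        else:
            # Variables without this dimension pass through unchanged.
            result[name] = (dims, data)
    return result


variables = {
    "temperature": (("time", "space"), np.arange(12.0).reshape(3, 4)),
    "elevation": (("space",), np.array([10.0, 20.0, 30.0, 40.0])),
}
first_site = index_all(variables, "space", 0)
print(first_site["temperature"][0])  # ('time',)
```

One call selects the same position along `space` in both variables, so they stay consistent with each other.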

## Variable

`Variable` implements xray’s basic extended array object. It
supports the numpy ndarray interface, but is extended to support and use
basic metadata (not including coordinate values). It consists of:

- **dimensions**: A tuple of dimension names.
- **data**: The N-dimensional array (for example, of type `numpy.ndarray`) storing the array’s data. It must have the same number of dimensions as the length of the “dimensions” attribute.
- **attributes**: An ordered dictionary of additional metadata to associate with this array.

The main functional difference between Variables and numpy arrays is that
numerical operations on Variables implement array broadcasting by dimension
name. For example, adding a Variable with dimensions (‘time’,) to another
Variable with dimensions (‘space’,) results in a new Variable with dimensions
(‘time’, ‘space’). Furthermore, numpy reduce operations like `mean` or
`sum` are overridden to take a “dimension” argument instead of an “axis”.
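Broadcasting by dimension name and reducing over a named dimension can both be sketched in plain NumPy. The helpers `align`, `add_by_name` and `mean_over` are hypothetical names for this illustration. They mimic the behaviour described above and make no claim about how xray implements it.

```python
import numpy as np


def align(dims, data, out_dims):
    """Reorder the axes of `data` to follow `out_dims`, inserting length-1
    axes for the dimensions it lacks so that NumPy can broadcast it."""
    data = np.asarray(data)
    present = [d for d in out_dims if d in dims]
    data = data.transpose([dims.index(d) for d in present])
    index = tuple(slice(None) if d in dims else np.newaxis for d in out_dims)
    return data[index]


def add_by_name(dims_a, data_a, dims_b, data_b):
    """Add two arrays, matching their axes by dimension name."""
    out_dims = tuple(dims_a) + tuple(d for d in dims_b if d not in dims_a)
    total = align(dims_a, data_a, out_dims) + align(dims_b, data_b, out_dims)
    return out_dims, total


def mean_over(dims, data, dimension):
    """Average over a named dimension instead of a positional axis."""
    axis = dims.index(dimension)
    new_dims = tuple(d for d in dims if d != dimension)
    return new_dims, np.asarray(data).mean(axis=axis)


dims, total = add_by_name(("time",), np.arange(3),
                          ("space",), np.array([10, 20, 30, 40]))
print(dims, total.shape)  # ('time', 'space') (3, 4)

space_dims, space_mean = mean_over(dims, total, "time")
print(space_dims)  # ('space',)
```

Because axes are matched by name, the caller never has to reshape either input by hand or remember which axis number corresponds to `time`.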

Variables are light-weight objects used as the building block for datasets.
**However, manipulating data in the form of a Dataset or DataArray should
almost always be preferred** (see below), because they can use more complete
metadata in the context of coordinate labels.

## DataArray

A `DataArray` object is a multi-dimensional array with labeled
dimensions and coordinates. Coordinate labels give it additional power over the
Variable object, so it should be preferred for all high-level use.

Under the covers, DataArrays are simply pointers to a dataset (the `dataset`
attribute) and the name of a variable in the dataset (the `name` attribute),
which indicates to which variable array operations should be applied.
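The "pointer" design can be pictured with a toy class. `ToyDataArray` is a hypothetical stand-in, and a plain dictionary of lists plays the role of the dataset. Only the `dataset` and `name` attributes come from the description above.

```python
class ToyDataArray(object):
    """Hypothetical view: a reference to a dataset plus a variable name."""

    def __init__(self, dataset, name):
        self.dataset = dataset  # the shared container, not a copy
        self.name = name        # selects which variable operations act on

    @property
    def variable(self):
        # Look the variable up on demand; the view stores no data itself.
        return self.dataset[self.name]


dataset = {
    "temperature": [15.0, 16.5, 14.2],
    "pressure": [1012, 1009, 1015],
}
temperature = ToyDataArray(dataset, "temperature")
print(temperature.variable)  # [15.0, 16.5, 14.2]
```

Because the view holds only a reference and a name, it is cheap to create, and a change made to the underlying dataset is visible through it.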

DataArray objects implement the broadcasting rules of Variable objects, but
also use and maintain coordinates (aka “indices”). This means you can do
intelligent (and fast!) label based indexing on DataArrays (via the
`.loc` attribute), do flexible split-apply-combine operations with
`groupby`, and convert them to or from `pandas.Series` objects.
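The ideas behind label based indexing and split-apply-combine can be sketched in pure Python. In this illustration a dictionary stands in for the index, and `loc` and `groupby_mean` are hypothetical helpers, not the `.loc` attribute or `groupby` method themselves.

```python
from collections import OrderedDict

labels = ["2000-01-01", "2000-01-02", "2000-02-01", "2000-02-02"]
values = [1.0, 2.0, 3.0, 4.0]

# The "index": built once, so each label lookup is a hash lookup rather
# than a scan through the coordinate values.
position = {label: i for i, label in enumerate(labels)}


def loc(label):
    """Return the value stored at a coordinate label."""
    return values[position[label]]


def groupby_mean(keys, data):
    """Split `data` by key, average each group, and combine the results."""
    groups = OrderedDict()
    for key, value in zip(keys, data):
        groups.setdefault(key, []).append(value)
    return OrderedDict(
        (key, sum(group) / len(group)) for key, group in groups.items())


print(loc("2000-02-01"))  # 3.0

# Group by the "YYYY-MM" prefix of each label.
monthly = groupby_mean([label[:7] for label in labels], values)
print(dict(monthly))  # {'2000-01': 1.5, '2000-02': 3.5}
```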