Working with pandas
One of the most important features of xarray is the ability to convert to and
from pandas objects, which lets it interact with the rest of the PyData
ecosystem. For example, for plotting labeled data, we highly recommend
using the visualization built into pandas itself or provided by
pandas-aware libraries such as Seaborn.
Hierarchical and tidy data
Tabular data is easiest to work with when it meets the criteria for tidy data:
- Each column holds a different variable.
- Each row holds a different observation.
In this “tidy data” format, we can represent any Dataset and DataArray in
terms of pandas.DataFrame and pandas.Series, respectively (and vice versa).
The representation works by flattening non-coordinate variables to 1D and
turning the tensor product of coordinate indexes into a pandas.MultiIndex.
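The flattening described above can be checked directly. The following is a minimal sketch, assuming a small hypothetical array (the names `da`, `foo`, `x` and `y` are illustrative and not part of the original example): it compares the index produced by `to_series()` with the Cartesian product of the coordinate indexes.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 x 3 array; the names "x", "y" and "foo" are illustrative only.
da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords={"x": [10, 20], "y": ["a", "b", "c"]},
    dims=("x", "y"),
    name="foo",
)

# Tidy form: one row per (x, y) pair, with the values flattened to 1D.
s = da.to_series()

# Build the Cartesian (tensor) product of the two coordinate indexes by hand.
expected_index = pd.MultiIndex.from_product(
    [da.indexes["x"], da.indexes["y"]], names=["x", "y"]
)

# Compare the index and the flattened values with what to_series() produced.
index_matches = s.index.equals(expected_index)
values_match = np.array_equal(s.to_numpy(), da.values.ravel())
```

If both flags are true, the Series holds one entry per combination of coordinate labels, in the same order as the row-major flattened array.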
Dataset and DataFrame
To convert any dataset to a DataFrame in tidy form, use the
Dataset.to_dataframe() method:
# Assumes the usual imports: import numpy as np; import xarray as xr
In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.randn(2, 3))},
   ...:                 coords={'x': [10, 20], 'y': ['a', 'b', 'c'],
   ...:                         'along_x': ('x', np.random.randn(2)),
   ...:                         'scalar': 123})
   ...:
In [2]: ds
Out[2]:
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 10 20
    along_x  (x) float64 0.1192 -1.044
    scalar   int64 123
  * y        (y) <U1 'a' 'b' 'c'
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
In [3]: df = ds.to_dataframe()
In [4]: df
Out[4]:
           foo   along_x  scalar
x  y
10 a  0.469112  0.119209     123
   b -0.282863  0.119209     123
   c -1.509059  0.119209     123
20 a -1.135632 -1.044236     123
   b  1.212112 -1.044236     123
   c -0.173215 -1.044236     123
We see that each variable and coordinate in the Dataset is now a column in the
DataFrame, except for the indexes (x and y here), which form the levels of the
DataFrame's index.
To convert the DataFrame to any other convenient representation,
use DataFrame methods like reset_index(), stack() and unstack().
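As a rough illustration of these reshaping methods, the sketch below builds a small stand-in for the result of `to_dataframe()` using pandas alone. The frame and its values are hypothetical, chosen only so the example runs without the dataset defined earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the result of Dataset.to_dataframe():
# a tidy frame indexed by an (x, y) MultiIndex.
index = pd.MultiIndex.from_product([[10, 20], ["a", "b", "c"]], names=["x", "y"])
df = pd.DataFrame({"foo": np.arange(6.0)}, index=index)

# reset_index() moves the index levels into ordinary columns (long form).
long_form = df.reset_index()

# unstack() pivots the "y" level into columns (wide form, one row per x).
wide_form = df["foo"].unstack("y")

# stack() reverses the pivot, giving a Series indexed by (x, y) again.
round_trip = wide_form.stack()
```

Here `long_form` has plain `x`, `y` and `foo` columns, while `wide_form` is a 2 x 3 table with one column per `y` label.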
To create a Dataset from a DataFrame, use the from_dataframe() class method
or the equivalent pandas.DataFrame.to_xarray method (pandas v0.18 or later):
In [5]: xr.Dataset.from_dataframe(df)
Out[5]:
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 10 20
  * y        (y) object 'a' 'b' 'c'
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    along_x  (x, y) float64 0.1192 0.1192 0.1192 -1.044 -1.044 -1.044
    scalar   (x, y) int64 123 123 123 123 123 123
Notice that the dimensions of variables in the Dataset have now
expanded after the round-trip conversion to a DataFrame. This is because
every column in a DataFrame must share the same index, so we need to
broadcast the data of each array to the full size of the new MultiIndex.
Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates.
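The effects of the round trip can be inspected, and partly undone, by hand. This is a sketch under stated assumptions: the dataset mirrors the structure of the example above but uses fixed, made-up values, and the recovery step (`set_coords` followed by `isel` along the redundant dimensions) is one possible approach rather than a built-in inverse.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with the same structure as the example, fixed values.
ds = xr.Dataset(
    {"foo": (("x", "y"), np.arange(6.0).reshape(2, 3))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", [0.5, -0.5]),
        "scalar": 123,
    },
)

df = ds.to_dataframe()
roundtrip = xr.Dataset.from_dataframe(df)

# Both conversion routes are expected to give the same Dataset.
same = roundtrip.identical(df.to_xarray())

# along_x and scalar came back as data variables broadcast to (x, y);
# mark them as coordinates again.
restored = roundtrip.set_coords(["along_x", "scalar"])

# Undo the broadcasting by selecting along the redundant dimensions.
along_x = restored["along_x"].isel(y=0, drop=True)
scalar = restored["scalar"].isel(x=0, y=0, drop=True)
```

Selecting the first element is safe here only because the broadcast values are constant along the dropped dimensions; for data from an arbitrary DataFrame that would need to be verified first.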
DataArray and Series
DataArray objects have a complementary representation in terms of a
pandas.Series. Using a Series preserves the Dataset to DataArray
relationship, because DataFrames are dict-like containers of Series.
The methods are very similar to those for working with DataFrames:
In [6]: s = ds['foo'].to_series()
In [7]: s
Out[7]:
x   y
10  a    0.469112
    b   -0.282863
    c   -1.509059
20  a   -1.135632
    b    1.212112
    c   -0.173215
Name: foo, dtype: float64
# or equivalently, with Series.to_xarray()
In [8]: xr.DataArray.from_series(s)
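The two routes back from a Series can be compared in a short, self-contained sketch. The array below is hypothetical (fixed values rather than the random data used above), and the variable names are illustrative only.

```python
import numpy as np
import xarray as xr

# Hypothetical named array; a name is kept so the Series carries it along.
da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    coords={"x": [10, 20], "y": ["a", "b", "c"]},
    dims=("x", "y"),
    name="foo",
)

s = da.to_series()

# Two ways to go back from a Series to a DataArray.
via_classmethod = xr.DataArray.from_series(s)
via_pandas = s.to_xarray()

# Check that the two routes agree and that values and dims survive.
routes_agree = via_classmethod.identical(via_pandas)
values_survive = np.array_equal(via_classmethod.values, da.values)
dims_survive = via_classmethod.dims == da.dims
```

As in the Dataset round trip shown earlier, the dtype of a string coordinate may differ after passing through pandas, so this sketch compares values and dimensions rather than requiring the result to be identical to the original.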