Working with pandas

One of the most important features of xray is the ability to convert to and from pandas objects, which lets it interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built into pandas itself or provided by pandas-aware libraries such as Seaborn and ggplot.

We particularly focus on conversions to and from tabular structures in the form of Hadley Wickham’s tidy data:

Each column holds a different variable (coordinates and variables in xray’s terminology).
Each row holds a different observation.

In this “tidy data” format, we can represent any Dataset and DataArray in terms of pandas.DataFrame and pandas.Series, respectively (and vice versa). The representation works by flattening non-coordinate variables to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.
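As a rough illustration of that flattening, here is a sketch that uses only NumPy and pandas. The array values and the names `values`, `index`, and `flat` are invented for this example, and it is not necessarily how xray performs the conversion internally.

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 3 array with dimensions ('x', 'y').
values = np.arange(6.0).reshape(2, 3)
x = [10, 20]
y = ['a', 'b', 'c']

# The tensor (Cartesian) product of the two coordinate indexes.
index = pd.MultiIndex.from_product([x, y], names=['x', 'y'])

# Flatten the data in row-major order so each value lines up with
# its (x, y) label in the MultiIndex.
flat = pd.Series(values.ravel(), index=index, name='foo')
print(flat)
```

Each entry of the 2D array ends up as one row of the Series, labeled by its pair of coordinate values.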

Note

If you want to convert a pandas data structure into a DataArray with the same number of dimensions, you can simply use the DataArray constructor directly.

To and from DataFrames

To convert to a DataFrame, use the Dataset.to_dataframe() method (the session below assumes that numpy has been imported as np and that xray has been imported):

In [1]: ds = xray.Dataset({'foo': (('x', 'y'), np.random.randn(2, 3))},
   ...:                    coords={'x': [10, 20], 'y': ['a', 'b', 'c'],
   ...:                            'along_x': ('x', np.random.randn(2)),
   ...:                            'scalar': 123})
   ...: 

In [2]: ds
Out[2]: 
<xray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) |S1 'a' 'b' 'c'
  * x        (x) int64 10 20
    scalar   int64 123
    along_x  (x) float64 0.1192 -1.044
Variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732

In [3]: df = ds.to_dataframe()

In [4]: df
Out[4]: 
           foo  scalar   along_x
x  y                            
10 a  0.469112     123  0.119209
   b -0.282863     123  0.119209
   c -1.509059     123  0.119209
20 a -1.135632     123 -1.044236
   b  1.212112     123 -1.044236
   c -0.173215     123 -1.044236

We see that each variable and coordinate in the Dataset is now a column in the DataFrame, with the exception of the index coordinates, which make up the DataFrame's index. To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().
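The reshaping methods mentioned above can be sketched with pandas alone. The DataFrame here is a hypothetical stand-in with the same index layout as the output of to_dataframe(), and the names `tidy`, `wide`, and `long_again` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame produced by to_dataframe().
index = pd.MultiIndex.from_product([[10, 20], ['a', 'b', 'c']],
                                   names=['x', 'y'])
df = pd.DataFrame({'foo': np.arange(6.0), 'scalar': 123}, index=index)

# Move the index levels into ordinary columns (one row per observation).
tidy = df.reset_index()

# Pivot the 'y' level into columns, giving a wide 2 x 3 table for 'foo'.
wide = df['foo'].unstack('y')

# stack() reverses unstack(), returning to the long form.
long_again = wide.stack()
```

Here reset_index() gives the fully tidy layout, while unstack() is convenient when a 2D grid of a single variable is easier to read or plot.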

To create a Dataset from a DataFrame, use the from_dataframe() class method:

In [5]: xray.Dataset.from_dataframe(df)
Out[5]: 
<xray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 10 20
  * y        (y) object 'a' 'b' 'c'
Variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    scalar   (x, y) int64 123 123 123 123 123 123
    along_x  (x, y) float64 0.1192 0.1192 0.1192 -1.044 -1.044 -1.044

Notice that the dimensions of the variables in the Dataset have expanded after the round-trip conversion to a DataFrame. This is because every column in a DataFrame must share the same index, so we need to broadcast the data of each array to the full size of the new MultiIndex.

Likewise, all the “other coordinates” ended up as variables, because pandas does not distinguish non-index coordinates.
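To make the broadcasting step concrete, the following pandas-only sketch repeats a per-x quantity across every y label. The values and the names `along_x` and `broadcast` are hypothetical, and this is only one way to express the idea, not a description of xray's internals.

```python
import pandas as pd

# Hypothetical per-x values, standing in for the 'along_x' coordinate.
along_x = pd.Series([0.5, -1.5], index=pd.Index([10, 20], name='x'))

# The full (x, y) index that every DataFrame column must share.
index = pd.MultiIndex.from_product([[10, 20], ['a', 'b', 'c']],
                                   names=['x', 'y'])

# Look up the per-x value for each row of the full index, so each
# value is repeated once for every y label.
x_labels = index.get_level_values('x')
broadcast = pd.Series(along_x.reindex(x_labels).to_numpy(),
                      index=index, name='along_x')
print(broadcast)
```

After this step the quantity depends on both index levels on paper, which is why it comes back with dimensions (x, y) rather than (x).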

To and from Series

DataArray objects have a complementary representation in terms of a pandas.Series. Using a Series preserves the Dataset to DataArray relationship, because DataFrames are dict-like containers of Series. The methods are very similar to those for working with DataFrames:

In [6]: s = ds['foo'].to_series()

In [7]: s
Out[7]: 
x   y
10  a    0.469112
    b   -0.282863
    c   -1.509059
20  a   -1.135632
    b    1.212112
    c   -0.173215
Name: foo, dtype: float64

In [8]: xray.DataArray.from_series(s)
Out[8]: 
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
Coordinates:
  * y        (y) object 'a' 'b' 'c'
  * x        (x) int64 10 20

Both the from_series and from_dataframe methods use reindexing, so they work even if the hierarchical index is not a full tensor product:

In [9]: s[::2]
Out[9]: 
x   y
10  a    0.469112
    c   -1.509059
20  b    1.212112
Name: foo, dtype: float64

In [10]: xray.DataArray.from_series(s[::2])
Out[10]: 
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.4691123 ,         nan, -1.5090585 ],
       [        nan,  1.21211203,         nan]])
Coordinates:
  * y        (y) object 'a' 'b' 'c'
  * x        (x) int64 10 20
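The effect of reindexing onto the full tensor product can be approximated in plain pandas and NumPy. This is a sketch under the assumption that missing combinations are filled with NaN, as in the output above. The names `partial`, `full_index`, `filled`, and `grid` are invented here, and the actual from_series implementation may differ.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[10, 20], ['a', 'b', 'c']],
                                   names=['x', 'y'])
s = pd.Series(np.arange(6.0), index=index, name='foo')

# Keep every other row: (10, 'a'), (10, 'c') and (20, 'b').
partial = s[::2]

# Rebuild the full product of the index levels, then reindex onto it.
# Combinations absent from `partial` are filled with NaN.
full_index = pd.MultiIndex.from_product(partial.index.levels,
                                        names=partial.index.names)
filled = partial.reindex(full_index)

# Reshape the now-complete 1D data back into a 2 x 3 grid.
grid = filled.to_numpy().reshape(2, 3)
print(grid)
```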