Frequently Asked Questions

Why is pandas not enough?

pandas, thanks to its unrivaled speed and flexibility, has emerged as the premier Python package for working with labeled arrays. So why are we contributing to further fragmentation in the ecosystem for working with data arrays in Python?

Sometimes, we really want to work with collections of higher dimensional arrays (ndim > 2), or arrays for which the order of dimensions (e.g., columns vs rows) shouldn’t really matter. For example, climate and weather data is often natively expressed in 4 or more dimensions: time, x, y and z.

Pandas does support N-dimensional panels, but the implementation is very limited:

  • You need to create a new factory type for each dimensionality.
  • You can’t do math between NDPanels with different dimensionality.
  • Each dimension in an NDPanel has a name (e.g., ‘labels’, ‘items’, ‘major_axis’, etc.) but the dimension names refer to order, not their meaning. You can’t specify an operation to be applied along the “time” axis.
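To make the order-versus-meaning point concrete, here is a plain NumPy sketch (variable names are illustrative, not from any library) showing how purely positional axes can silently go wrong:

```python
import numpy as np

# A 4D climate-style array laid out as (time, z, y, x)
data = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)

# With positional axes, "average over time" must be spelled as axis=0:
time_mean = data.mean(axis=0)

# If the dimension order changes, the identical call now means
# something else entirely -- the axis number, not the name, is
# what the operation sees.
transposed = data.transpose(3, 2, 1, 0)  # now (x, y, z, time)
not_time_mean = transposed.mean(axis=0)  # averages over x, not time!
```

With named dimensions, the operation would instead be written against the dimension “time” and remain correct under any transposition.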

Fundamentally, the N-dimensional panel is limited by its context in pandas’s tabular model, which treats a 2D DataFrame as a collection of 1D Series, a 3D Panel as a collection of 2D DataFrames, and so on. pandas gets a lot of things right, but scientific users need fully multi-dimensional data structures.

Should I use xray instead of pandas?

It’s not an either/or choice! xray provides robust support for converting back and forth between the tabular data-structures of pandas and its own multi-dimensional data-structures.
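The pandas side of this round trip rests on machinery pandas itself already exposes: stacking a labeled 2D array into a tabular Series with a MultiIndex and unstacking it back. A pandas-only sketch of the idea (variable names are illustrative):

```python
import numpy as np
import pandas as pd

# A 2D labeled array in pandas form: a DataFrame with named axes
df = pd.DataFrame(np.arange(6.0).reshape(2, 3),
                  index=pd.Index(["a", "b"], name="y"),
                  columns=pd.Index([10, 20, 30], name="x"))

# Flatten to tabular form: one value per (y, x) pair
series = df.stack()  # Series indexed by a (y, x) MultiIndex

# ...and recover the 2D layout
roundtrip = series.unstack("x")
```

xray’s conversion methods generalize this pattern to N dimensions, so you can move between the two worlds without losing labels.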

That said, you should only bother with xray if some aspect of your data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, stick with pandas, which is a more developed toolkit for doing data analysis in Python.

What is your approach to metadata?

We are firm believers in the power of labeled data! In addition to dimensions and coordinates, xray supports arbitrary metadata in the form of global (Dataset) and variable specific (DataArray) attributes (attrs).

Automatic interpretation of labels is powerful but also reduces flexibility. With xray, we draw a firm line between labels that the library understands (dims and coords) and labels for users and user code (attrs). For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to netCDF with cf_conventions=True.)

An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option). Similarly, xray usually drops conflicting attrs when combining arrays and datasets instead of raising an exception, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
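As a toy illustration of this policy (a hypothetical helper, not xray’s actual implementation), the two combining behaviors can be sketched as:

```python
def merge_attrs(a, b, compat="no_conflicts"):
    """Toy sketch of xray's attrs policy when combining objects.

    Default: silently drop any attribute whose values conflict.
    compat='identical': require the attrs to match exactly, else raise.
    """
    if compat == "identical":
        if a != b:
            raise ValueError("conflicting attrs with compat='identical'")
        return dict(a)
    # Keep only attributes present with the same value in both
    return {k: v for k, v in a.items() if b.get(k) == v}
```

Here the default branch embodies “metadata should not get in the way”: a conflict quietly disappears rather than aborting the computation.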

Does xray support out-of-core computation?

Not yet! Distributed and out-of-memory computation is certainly something we’re excited about, but for now we have focused on making xray a full-featured tool for in-memory analytics (like pandas).

We have some ideas for what out-of-core support could look like (probably through a library like biggus or Blaze), but we’re not there yet. An intermediate step would be supporting incremental writes to a Dataset linked to a netCDF file on disk.