Wednesday, December 4th, 2024
xarray.DataTree has been released in v2024.10.0, and the prototype xarray-contrib/datatree repository archived, after collaboration between the xarray team and NASA ESDIS.
Xarray users have been asking for a way to handle multiple netCDF4 groups since at least 2016. Such netCDF4/Zarr groups are the on-disk representation of a general problem of handling hierarchies of related but non-alignable array data. Real-world datasets often fall into this category, and users wanted a way to work with such hierarchical data in-memory and a way to interact with it on disk.
Our solution is the new high-level container class xarray.DataTree.
It acts like a tree of linked xarray.Dataset objects, with alignment enforced between the variables within each node and between each node and its parents, but not between sibling nodes. It can be written to and opened from formats containing multiple groups, such as netCDF4 files and Zarr stores.
For more details please see the high-level description and the dedicated page on hierarchical data, and the section on IO with groups in the xarray documentation.
If you previously used the DataTree prototype from the xarray-contrib/datatree repository, note that it has now been archived and will no longer be supported. Instead, we encourage you to migrate to the implementation of DataTree that you can import from xarray, following the migration guide.
This was a big feature addition! For a decade there have been 3 core public xarray data structures; now there are 4: Variable, DataArray, Dataset, and now DataTree.
DataTree is arguably one of the largest new features added to xarray in 10 years - the migration of the existing prototype alone added more than 10k lines of code across 80 PRs, and the resulting datatree code now contains contributions from at least 25 people.
We also had to resolve some really gnarly design questions to make it work in a way we were happy with.
DataTree didn't get implemented overnight - it was a multi-year effort that took place in a number of steps.
In March 2021, the xarray team submitted a funding proposal to the Chan-Zuckerberg Initiative to develop "TreeDataset", citing bioscience use cases such as microscopy image pyramids. Unfortunately, whilst we've been lucky to receive CZI funding before, on this occasion we didn't win money to work on the datatree idea.
In the absence of dedicated funding for datatree, Tom then used some time whilst at the Climate Data Science Lab at Columbia University to take an initial stab at the design in August 2021 - writing the first implementation on an overnight Amtrak! This simple prototype was released as a separate package in the xarray-contrib/datatree repository, and steadily gained a small community of intrepid users. It was driven partly by the use case of climate model intercomparison datasets.
A separate repository was chosen for speed of iteration, and to avoid giving the impression that these early experiments would have the same level of long-term support promised for code in xarray's main repo. However, this meant that the prototype datatree library was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals.
The prototype then sat there for 2 years, until NASA ESDIS approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray's main repository so there would be more guarantees of long-term API stability and support.
Amazingly the NASA team were able to offer engineer time, so starting in early 2024 Owen, Matt, and Eni (NASA) worked on migrating datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray core devs).
This second stage of development allowed us to reduce the bus factor on the datatree code and sanity check the original approach, and it gave us a chance to make some significant changes to the design without worrying too much about backwards compatibility (for example enabling the new "coordinate inheritance" feature).
This development story is different from the more typical scientific grant funding model - how did that work out for us?
The scientific grant model for funding software expects you to present a full idea in a proposal, wait 6-12 months to hopefully get funding for it, then implement the whole thing during the grant period. In contrast datatree evolved over a gradual process of moving from ideas to hacky prototype to robust implementation, with big time gaps for user feedback and experimentation. The migration was completed by developer-users who actually wanted the feature, rather than grant awardees working in service of a separate and maybe-theoretical userbase.
Overall, while the migration effort took longer than anticipated, we found it worked out quite well!
This contribution model is more similar to how OSS has historically been supported by industry, but perhaps because xarray is primarily developed and used by the scientific community, we tend to default to more grant-based funding models.
Overall this type of collaboration could work again in future! So if there is an xarray or xarray-adjacent feature your organisation would like to see, please reach out to us.
Please try datatree out! The hierarchical structure is potentially useful to any xarray users who work with more than one dataset at a time.
Be aware that, as xarray.DataTree is still new, there will likely be some bugs lurking, as well as as-yet unimplemented features (as there always are)!
A number of other people also contributed to datatree in various ways - particular shoutout to Alfonso Ladino and Etienne Schalk for their dedicated attendance at many of the weekly migration meetings!