Other libraries of interest

Some of the following libraries address an increasingly common problem: what happens if the data we wish to analyze is “big data”?

We learned how to use NumPy, Pandas, and Xarray to analyze various types of environmental data. There are of course many other useful libraries, especially when it comes to big data.

Note

Big data refers to datasets so voluminous and complex that traditional data-processing software is inadequate to deal with them.

By this definition, most of the datasets we regularly encounter in environmental science (and in Earth science more generally) are big data.

Faster array manipulation

Dask

Dask provides advanced parallelism for analytics. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.
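As a minimal sketch of what this parallelism looks like in practice, the example below builds a chunked Dask array and runs a NumPy-style computation on it. The array size and chunk shape are arbitrary values chosen for illustration, not recommendations.

```python
import dask.array as da

# A 2000 x 2000 array of ones, split into 500 x 500 chunks
# (sizes are illustrative assumptions, not tuned values).
x = da.ones((2000, 2000), chunks=(500, 500))

# Operations are lazy: this only builds a task graph, chunk by chunk.
lazy_mean = (x + x.T).mean(axis=0)

# compute() executes the graph, processing chunks in parallel,
# and returns an ordinary NumPy array.
result = lazy_mean.compute()

print(x.numblocks)   # number of chunks along each dimension
print(result.shape)
```

Nothing is evaluated until `compute()` is called, which is what lets Dask work on arrays that do not fit in memory as a whole.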

Zarr

Zarr is a format for the storage of chunked, compressed, N-dimensional arrays; its Python implementation builds on NumPy.

Data loading libraries

Intake

Intake is a lightweight set of tools for loading and sharing data in data science projects.

Siphon

Siphon allows efficient access to Unidata data hosted on a THREDDS Data Server.

NCI examples

A useful overview of the National Computational Infrastructure (NCI) data collections and services can be found here. A series of Jupyter notebook examples is provided.
The NCI Dask-themed notebook tutorials demonstrate how to use Dask on data collections hosted at the NCI as well as on data extracted from external databases (for eReefs models in particular, see the Dask_13_intensive_calculation_eReef.ipynb notebook).
The NCI THREDDS notebooks demonstrate how to access data stored on NCI’s THREDDS Data Server using Jupyter notebooks.
