Wendelin.core - Out-of-core NumPy arrays

Wendelin.core allows you to work with arrays bigger than RAM and local disk. Bigarrays are persisted to storage, and can be changed in a transactional manner.

In other words, bigarrays are something like numpy.memmap for numpy.ndarray and OS files, but they support transactions and files bigger than disk. A whole bigarray cannot generally be used as a drop-in replacement for a NumPy array, but bigarray slices are real ndarrays and can be used everywhere an ndarray can be used, including in C/Cython/Fortran code. Slice size is limited by the virtual address-space size, which is at most about 127TB on Linux/amd64.
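To get a feel for the slice-size limit, the arithmetic can be sketched as below. The 127TB figure comes from the text above and is read here as 127 * 2**40 bytes, which is an assumption; the helper name max_slice_elements is hypothetical and not part of wendelin.core.

```python
# Rough estimate of how many elements fit into one bigarray slice.
# Assumption: "127TB" from the text is interpreted as 127 * 2**40 bytes.
ADDRESS_SPACE_LIMIT = 127 * 2**40   # bytes


def max_slice_elements(itemsize, limit=ADDRESS_SPACE_LIMIT):
    """Upper bound on the element count of a slice whose items take itemsize bytes."""
    return limit // itemsize


if __name__ == "__main__":
    for name, itemsize in [("uint8", 1), ("float32", 4), ("float64", 8)]:
        print(name, max_slice_elements(itemsize))
```

In practice the usable limit is lower, since the process's own code, heap and other mappings share the same address space.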

The main class to work with is ZBigArray, which is used like ndarray from NumPy:

create array:

```
from wendelin.bigarray.array_zodb import ZBigArray
import transaction

# root is connected to opened database
root['A'] = A = ZBigArray(shape=..., dtype=...)
transaction.commit()
```

view array as a real ndarray:

```
a = A[:]        # view which covers the whole array, if it fits into address-space
b = A[10:100]
```

data for views will be loaded lazily on memory access.
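Lazy loading on memory access can be illustrated with an ordinary memory-mapped file from the Python standard library. This is only an analogy, in the spirit of the numpy.memmap comparison above, and not wendelin.core code; the function name demo_lazy_view and the 16 MiB size are arbitrary choices for the sketch.

```python
import mmap
import os
import struct
import tempfile


def demo_lazy_view(size=16 * 2**20):
    """Map a sparse file, touch one element, and check the write reached the file."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)              # sparse file: no data written yet
        with mmap.mmap(fd, size) as m:
            # Creating the mapping is cheap; pages are loaded only when touched.
            with memoryview(m) as raw, raw.cast("d") as view:   # float64 view
                n = len(view)
                first = view[0]             # untouched data reads as zero
                view[10] = 3.5              # dirties a single page
                tenth = view[10]
            m.flush()                       # push dirty pages back to the file
        os.lseek(fd, 10 * 8, os.SEEK_SET)
        on_disk = struct.unpack("d", os.read(fd, 8))[0]
    finally:
        os.close(fd)
        os.unlink(path)
    return n, first, tenth, on_disk


if __name__ == "__main__":
    print(demo_lazy_view())
```

A plain file mapping offers no transactions and is bounded by disk size, which are the two points where the text says bigarrays differ.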

work with views, including using C/Cython/Fortran functions from NumPy and other libraries to read/modify data:
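```
import numpy

a[2] = 1                        # modify a single element
a[10:20] = numpy.arange(10)     # modify a range of elements
numpy.mean(a)                   # read data through a NumPy function
```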
the amount of modifications in one transaction should be less than available RAM.
the amount of data read is limited only by virtual address-space size.
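One way to stay under the per-transaction modification limit is to change data chunk by chunk and save after each chunk. The sketch below is a generic pattern rather than wendelin.core API: update_in_chunks is a hypothetical helper, and commit is any callable that saves pending changes (for example transaction.commit). Whether a given view stays usable across commits is assumed here and should be checked against the library itself.

```python
def update_in_chunks(a, func, chunk_len, commit):
    """Apply func to a chunk by chunk, calling commit() after every chunk.

    a         - object supporting len() and slice get/set (e.g. an ndarray view)
    func      - maps a chunk of values to new values of the same length
    chunk_len - number of elements modified per transaction
    commit    - callable that saves pending changes
    """
    n = len(a)
    for lo in range(0, n, chunk_len):
        hi = min(lo + chunk_len, n)
        a[lo:hi] = func(a[lo:hi])
        commit()    # unsaved modifications never exceed one chunk


if __name__ == "__main__":
    data = list(range(10))          # plain list standing in for an array view
    commits = []
    update_in_chunks(data, lambda c: [2 * x for x in c], 4, lambda: commits.append(1))
    print(data, len(commits))
```

The chunk length would be chosen so that one chunk of modified data fits comfortably in RAM.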

data can be appended to array in O(δ) time:

```
values                  # ndarray to append of shape  (δ,)
A.append(values)
```

and array itself can be resized in O(1) time:

```
A.resize(newshape)
```
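The costs quoted above become plausible if one pictures the data as stored in fixed-size blocks, with blocks that were never written reading as zeros. The toy class below rests on that assumption. It is not a description of how wendelin.core is implemented, and SparseArray with all of its methods is hypothetical.

```python
class SparseArray:
    """Toy 1-D array stored in fixed-size blocks; missing blocks read as 0."""

    def __init__(self, length=0, blksize=4):
        self.length = length
        self.blksize = blksize
        self.blocks = {}            # block number -> list of blksize values

    def _check(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)

    def resize(self, newlength):
        # O(1) when growing: only the recorded length changes, no data is touched.
        if newlength < self.length:
            raise NotImplementedError("shrinking is not modelled in this toy")
        self.length = newlength

    def __getitem__(self, i):
        self._check(i)
        blk, off = divmod(i, self.blksize)
        block = self.blocks.get(blk)
        return 0 if block is None else block[off]

    def __setitem__(self, i, value):
        self._check(i)
        blk, off = divmod(i, self.blksize)
        self.blocks.setdefault(blk, [0] * self.blksize)[off] = value

    def append(self, values):
        # O(δ): grow the length, then write only the δ new elements.
        start = self.length
        self.resize(start + len(values))
        for k, v in enumerate(values):
            self[start + k] = v


if __name__ == "__main__":
    s = SparseArray()
    s.resize(1000)                  # no blocks allocated
    s.append([1, 2, 3])             # touches only the tail
    print(s.length, len(s.blocks), s[999], s[1000])
```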

changes to array data can be either discarded or saved back to DB:

```
transaction.abort()     # discard all made changes
transaction.commit()    # atomically save all changes
```
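A common way to use the two calls is to commit when a unit of work succeeds and abort when it raises. The sketch below shows that pattern against an in-memory stand-in so it can run without a database. ToyStore and run_transactionally are hypothetical names, and the stand-in says nothing about how the real database keeps its data.

```python
import copy


class ToyStore:
    """In-memory stand-in for a transactional database (illustration only)."""

    def __init__(self, data):
        self._saved = copy.deepcopy(data)   # last committed state
        self.data = copy.deepcopy(data)     # working state seen by the program

    def commit(self):
        # Save all pending changes in one step.
        self._saved = copy.deepcopy(self.data)

    def abort(self):
        # Throw away everything changed since the last commit.
        self.data = copy.deepcopy(self._saved)


def run_transactionally(store, work):
    """Run work(store.data); commit on success, abort and re-raise on error."""
    try:
        work(store.data)
    except Exception:
        store.abort()
        raise
    store.commit()


if __name__ == "__main__":
    store = ToyStore([0, 0, 0])

    def good(data):
        data[0] = 42

    run_transactionally(store, good)
    print(store.data)
```

With the real transaction package, store.commit() and store.abort() would correspond to the transaction.commit() and transaction.abort() calls shown above.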

When using NEO or ZEO as a database, bigarrays can be simultaneously used by several nodes in a cluster.

Please see demo/demo_zbigarray.py for a complete example.

Current state and Roadmap

Wendelin.core works in real life for workloads Nexedi is using in production, including 24/7 projects. We are, however, aware of the following limitations and things that need to be improved:

wendelin.core is currently not very fast
there are large temporary array allocations, proportional in size to the input, in third-party libraries (NumPy, scikit-learn, ...), which might in practice prevent processing out-of-core arrays, depending on the functionality used.

Thus

we are currently working on an improved wendelin.core design and implementation, which uses the kernel virtual memory manager (complemented by one implemented in userspace), with the array backend presented to the kernel via FUSE as a virtual filesystem implemented in Go.

As of November 2021, this filesystem has reached its alpha state and is staged to be tried for real.

In parallel we will also:

try wendelin.core 1.0 on large data sets
identify and incrementally fix big-temporaries allocation issues in NumPy and scikit-learn

We are open to community help with the above.

Additional materials

Wendelin.core tutorial
Slides (pdf) from the presentation about wendelin.core at PyData Paris 2015
