Support for "directory full of binary files" kind of DataSet #206
Test data set: split up a 4 GB dataset into individual frames (64 kB per frame), each in a single file. This is basically the worst case. Small test script:

```python
import os

os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'

import numpy as np

from libertem import api
from libertem.executor.inline import InlineJobExecutor
from libertem.io.dataset.raw_parts import RawFilesDataSet


@profile
def main():
    ctx = api.Context(executor=InlineJobExecutor())
    ds = RawFilesDataSet(
        path="/home/clausen/Data/many_small_files/frame00016293.bin",
        scan_size=(256, 256),
        detector_size=(128, 128),
        tileshape=(1, 8, 128, 128),
        dtype="float32"
    )
    ds.initialize()
    job = ctx.create_mask_analysis(dataset=ds, factories=[lambda: np.ones(ds.shape.sig)])
    result = ctx.run(job)


if __name__ == "__main__":
    main()
```

Profiling results (raw results below):
Notes:
Line profiling results for reading many small files:
As a comparison, here is a profile for a single large file:
The reader is currently limited to one frame per file; this limit should be lifted before it is added to LiberTEM proper. For performance discussions, see #206. Also included are some simple examples for comparison and profiling (single-threaded).
With some optimization, I got the following results (single-threaded run, profiler disabled):

Single large file:

2^16 small files:

So LiberTEM should still be usable for these kinds of datasets once a proper reader is implemented, even though working on many small files takes about 1.6x as much time. For the use case of opening simulated datasets, the difference should be less pronounced, as those files are larger (~1 MB) than in my test case.
Nice! That looks good.
Updated the first comment with the steps necessary to have full support for the "raw file set" format.

For loading Dr. Probe files in a user-friendly way, some metadata may be necessary, either entered by the user when loading, or loaded from a JSON sidecar file. CC @ju-bar

Also, the output of Dr. Probe has an additional dimension, the sample thickness. To begin with, we could support loading a single thickness in the UI; later, we could add a navigation slider for the additional dimension and re-run the job for the selected thickness. From the scripting interface, we can already support the additional dimension, for example by using a sparse matrix as mask that is only populated for the thickness you are interested in. Another possibility would be to display the different thickness results one below another, but that only scales to a handful of results.
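The per-thickness masking idea can be sketched like this. The comment above suggests a sparse matrix; a dense array that is zero outside the selected slice shows the same principle with less machinery. The helper name and shapes are illustrative only, and it assumes the thickness axis is the first signal dimension:

```python
import numpy as np


def single_thickness_mask(sig_shape, thickness_idx):
    """Hypothetical mask factory: over a (thickness, h, w) signal,
    select only one thickness slice so the mask analysis ignores
    all other slices."""
    mask = np.zeros(sig_shape, dtype=np.float32)
    mask[thickness_idx] = 1.0  # only the selected thickness contributes
    return mask


# Example: a signal with 4 thickness slices of 128x128 detector pixels
mask = single_thickness_mask((4, 128, 128), thickness_idx=2)
```

In practice a `scipy.sparse` matrix over the flattened signal could avoid storing the zeros, which matters once the thickness dimension gets large.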
Discussion with @uellue and @matbryan52: this is related to having a descriptor of either the file set or the whole dataset. For example, one could allow users to create a descriptor for a glob like this:

```python
from libertem.io.base import FileSetDescriptor

fsd = FileSetDescriptor.glob("stuff/*.bin")
ds = ctx.load('raw', descriptor=fsd)
# under the hood:
RawFileDataSet(descriptor=fsd, path=None)
```

Or, alternatively, the descriptor could also include information about the data set, which is #1376.
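A minimal sketch of what such a descriptor could look like. `FileSetDescriptor` is only proposed in the discussion above, not an existing LiberTEM class, so everything here is an assumption about the design:

```python
import glob as _glob
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSetDescriptor:
    """Hypothetical descriptor: an ordered list of file paths, resolved
    eagerly so the dataset sees a stable frame order."""
    paths: List[str] = field(default_factory=list)

    @classmethod
    def glob(cls, pattern):
        # sort for a deterministic frame order across runs and platforms
        return cls(paths=sorted(_glob.glob(pattern)))

    def __len__(self):
        return len(self.paths)
```

Resolving and sorting the glob eagerly is a deliberate choice here: frame order must not depend on filesystem enumeration order, and a stored path list keeps the descriptor serializable for a possible JSON sidecar.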
Some software, for example in STEM simulation, uses a directory full of small binary files as the result. We need to test if we can support this in a high-performance way, and if we can, add support in LiberTEM.

Generally, we may want to use straight `read` calls, as `mmap` should have more overhead, and we need to copy anyway if we want to stack the images for processing. It should have similar parameters to the all-in-one-file raw format.