Support for "directory full of binary files" kind of DataSet #206

Open
2 of 6 tasks
sk1p opened this issue Dec 12, 2018 · 5 comments · May be fixed by #1224
Open
2 of 6 tasks

Support for "directory full of binary files" kind of DataSet #206

sk1p opened this issue Dec 12, 2018 · 5 comments · May be fixed by #1224
Labels

Comments

sk1p commented Dec 12, 2018

Some software, for example in STEM simulation, writes its results as a directory full of small binary files. We need to test whether we can support this in a high-performance way and, if we can, add support to LiberTEM.

Generally, we may want to use plain read calls rather than mmap: mmap should have more overhead here, and we need to copy the data anyway if we want to stack the images for processing.
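
A minimal sketch of that straight-read approach (file names and shapes are illustrative, not the actual reader API):

import numpy as np

def read_stack(filenames, frame_shape, dtype=np.float32):
    # pre-allocate one contiguous buffer, then read each per-frame
    # file straight into its slot in the stack
    stack = np.empty((len(filenames),) + frame_shape, dtype=dtype)
    for idx, fn in enumerate(filenames):
        with open(fn, "rb") as f:
            stack[idx] = np.fromfile(f, dtype=dtype).reshape(frame_shape)
    return stack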

It should have similar parameters to the all-in-one-file raw format.

  • [x] implement prototype
  • [x] do basic benchmarking with synthetic data via the Python API
  • [ ] support for more than one frame per file
  • [ ] support for arbitrary tile shapes
  • [ ] clean up implementation and add it to the UI
  • [ ] user-friendly workflow for loading Dr. Probe data
sk1p commented Dec 19, 2018

Test data set: a 4 GB dataset split up into individual frames (64 kB per frame), one frame per file. This is basically the worst case.
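
Such a test set can be generated along these lines (a sketch; the source file name and output paths are illustrative):

import numpy as np

# memory-map the original 4 GB raw file as individual float32 frames
frames = np.memmap("scan.raw", dtype=np.float32, mode="r",
                   shape=(256 * 256, 128, 128))
# write one 64 kB file per frame (128 * 128 * 4 bytes)
for i, frame in enumerate(frames):
    frame.tofile("many_small_files/frame%08d.bin" % i)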

Small test script:

import os
# pin BLAS/OpenMP to a single thread for reproducible single-threaded timings
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'

import numpy as np
from libertem import api
from libertem.executor.inline import InlineJobExecutor
from libertem.io.dataset.raw_parts import RawFilesDataSet

# the @profile decorator is injected by kernprof (line_profiler);
# run with: kernprof -l raw_many_files.py
@profile
def main():
    ctx = api.Context(executor=InlineJobExecutor())

    ds = RawFilesDataSet(
        path="/home/clausen/Data/many_small_files/frame00016293.bin",
        scan_size=(256, 256),
        detector_size=(128, 128),
        tileshape=(1, 8, 128, 128),
        dtype="float32"
    )
    ds.initialize()

    job = ctx.create_mask_analysis(dataset=ds, factories=[lambda: np.ones(ds.shape.sig)])

    result = ctx.run(job)

if __name__ == "__main__":
    main()

Profiling results (raw results below):

  • glob.glob takes quite a while, but that is initialization cost and easily amortized
  • most of the job time is indeed spent in I/O (get_tiles(...))
  • in get_tiles, ~20% is spent opening files, ~20% copying into the tile buffer, and ~45% reading via np.fromfile. The copy may be avoidable by reading directly into the tile buffer (see the sketch after this list); opening and reading cannot be avoided while keeping the tiling scheme
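
One way the copy could go away is to read each file directly into its slice of the tile buffer via readinto(), instead of np.fromfile plus an assignment. A sketch with hypothetical names, assuming one frame per file and a C-contiguous tile buffer:

def read_frame_into(fn, tile_buffer, idx):
    # view the destination frame slot as a flat byte buffer;
    # readinto() fills it in place, with no intermediate array
    dest = memoryview(tile_buffer[idx]).cast('B')
    with open(fn, "rb", buffering=0) as f:
        assert f.readinto(dest) == dest.nbytes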

Notes:

  • we get more overhead by needing to read from more than one file for stacking (copies...)
  • it may actually be more efficient in this case to just use smaller tiles to get rid of the copies; I did not try that yet
  • reading many small files means more filesystem overhead, and we no longer read in large contiguous blocks

Line profiling results for reading many small files:

/home/clausen/Data/many_small_files/frame*.bin
Wrote profile results to raw_many_files.py.lprof
Timer unit: 1e-06 s

Total time: 3.27343 s
File: /home/clausen/source/libertem/src/libertem/io/dataset/raw_parts.py
Function: get_tiles at line 115

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   115                                               @profile
   116                                               def get_tiles(self, crop_to=None):
   117        16        100.0      6.2      0.0          stackheight = self.tileshape.nav.size
   118        16        480.0     30.0      0.0          tile_buffer = np.zeros(self.tileshape.flatten_nav(), dtype=self.dataset.dtype)
   119        16       2302.0    143.9      0.1          files_by_tile = list(grouper(self.files, stackheight))
   120        16        118.0      7.4      0.0          detector_size = tuple(self.tileshape.sig)
   121                                           
   122        16         22.0      1.4      0.0          partition_origin = self.slice.origin
   123                                           
   124        16         21.0      1.3      0.0          files_done = 0
   125      8208      11878.0      1.4      0.4          for files in files_by_tile:
   126      8192      12439.0      1.5      0.4              if len(files) != stackheight:
   127                                                           # allocate different size buffer for "rest-tile":
   128                                                           tile_buffer = np.zeros((len(files),) + tuple(self.dataset.raw_shape.sig),
   129                                                                                  dtype=self.dataset.dtype)
   130      8192      44003.0      5.4      1.3              origin = tuple(sum(x) for x in zip(partition_origin, (files_done, 0, 0)))
   131      8192      10772.0      1.3      0.3              tile_slice = Slice(
   132      8192      10005.0      1.2      0.3                  origin=origin,
   133      8192     110464.0     13.5      3.4                  shape=Shape(tile_buffer.shape, sig_dims=self.tileshape.sig.dims)
   134                                                       )
   135      8192      12452.0      1.5      0.4              files_done += len(files)
   136      8192      10634.0      1.3      0.3              if crop_to is not None:
   137                                                           intersection = tile_slice.intersection_with(crop_to)
   138                                                           if intersection.is_null():
   139                                                               continue
   140     73728     123152.0      1.7      3.8              for idx, fn in enumerate(files):
   141     65536     627925.0      9.6     19.2                  with open(fn, "rb") as f:
   142     65536    1467679.0     22.4     44.8                      d = np.fromfile(f, dtype=self.dataset.dtype)
   143     65536     163261.0      2.5      5.0                      d = d.reshape(detector_size)
   144     65536     588332.0      9.0     18.0                      tile_buffer[idx, ...] = d
   145                                           
   146      8192      10749.0      1.3      0.3              yield DataTile(
   147      8192      10176.0      1.2      0.3                  data=tile_buffer,
   148      8192      56464.0      6.9      1.7                  tile_slice=tile_slice
   149                                                       )

Total time: 5.22401 s
File: /home/clausen/source/libertem/src/libertem/job/masks.py
Function: __call__ at line 174

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   174                                               @profile
   175                                               def __call__(self):
   176        16         49.0      3.1      0.0          num_masks = len(self.masks)
   177        16        214.0     13.4      0.0          part = np.zeros((num_masks,) + tuple(self.partition.shape.nav), dtype="float32")
   178      8208    3930744.0    478.9     75.2          for data_tile in self.partition.get_tiles():
   179      8192     104240.0     12.7      2.0              data = data_tile.flat_data
   180      8192      12451.0      1.5      0.2              if data.dtype.kind == 'u':
   181                                                           data = data.astype("float32")
   182      8192     353625.0     43.2      6.8              masks = self.masks[data_tile]
   183      8192      10080.0      1.2      0.2              if self.masks.use_sparse:
   184                                                           # The sparse matrix has to be the left-hand side, for that
   185                                                           # reason we transpose before and after multiplication.
   186                                                           result = masks.T.dot(data.T).T
   187      8192       8716.0      1.1      0.2              elif self.use_torch and torch is not None:
   188      8192       9053.0      1.1      0.2                  result = torch.mm(
   189      8192      40507.0      4.9      0.8                      torch.from_numpy(data),
   190      8192     214221.0     26.2      4.1                      torch.from_numpy(masks),
   191                                                           ).numpy()
   192                                                       else:
   193                                                           result = data.dot(masks)
   194      8192     125174.0     15.3      2.4              dest_slice = data_tile.tile_slice.shift(self.partition.slice)
   195      8192     255519.0     31.2      4.9              reshaped = self.reshaped_data(data=result, dest_slice=dest_slice)
   196                                                       # Ellipsis to match the "number of masks" part of the result
   197      8192     159098.0     19.4      3.0              part[(Ellipsis,) + dest_slice.get(nav_only=True)] += reshaped
   198                                                   return [
   199        16         22.0      1.4      0.0              ResultTile(
   200        16         15.0      0.9      0.0                  data=part,
   201        16        281.0     17.6      0.0                  dest_slice=self.partition.slice.get(nav_only=True),
   202                                                       )
   203                                                   ]

Total time: 6.0188 s
File: raw_many_files.py
Function: main at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def main():
    13         1          6.0      6.0      0.0      ctx = api.Context(executor=InlineJobExecutor())
    14                                           
    15         1          2.0      2.0      0.0      ds = RawFilesDataSet(
    16         1          2.0      2.0      0.0          path="/home/clausen/Data/many_small_files/frame00016293.bin",
    17         1          2.0      2.0      0.0          scan_size=(256, 256),
    18         1          1.0      1.0      0.0          detector_size=(128, 128),
    19         1          1.0      1.0      0.0          tileshape=(1, 8, 128, 128),
    20         1         15.0     15.0      0.0          dtype="float32"
    21                                               )
    22         1     688981.0 688981.0     11.4      ds.initialize()
    23                                           
    24         1         15.0     15.0      0.0      job = ctx.create_mask_analysis(dataset=ds, factories=[lambda: np.ones(ds.shape.sig)])
    25                                           
    26         1    5329778.0 5329778.0     88.6      result = ctx.run(job)


real	0m6.813s
user	0m5.820s
sys	0m0.967s

As a comparison, here is a profile for a single large file:

Wrote profile results to raw_one_file.py.lprof
Timer unit: 1e-06 s

Total time: 1.94544 s
File: /home/clausen/source/libertem/src/libertem/job/masks.py
Function: __call__ at line 174

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   174                                               @profile
   175                                               def __call__(self):
   176        16         42.0      2.6      0.0          num_masks = len(self.masks)
   177        16        318.0     19.9      0.0          part = np.zeros((num_masks,) + tuple(self.partition.shape.nav), dtype="float32")
   178      8208     475401.0     57.9     24.4          for data_tile in self.partition.get_tiles():
   179      8192     124133.0     15.2      6.4              data = data_tile.flat_data
   180      8192      12216.0      1.5      0.6              if data.dtype.kind == 'u':
   181                                                           data = data.astype("float32")
   182      8192     346504.0     42.3     17.8              masks = self.masks[data_tile]
   183      8192      10028.0      1.2      0.5              if self.masks.use_sparse:
   184                                                           # The sparse matrix has to be the left-hand side, for that
   185                                                           # reason we transpose before and after multiplication.
   186                                                           result = masks.T.dot(data.T).T
   187      8192       8870.0      1.1      0.5              elif self.use_torch and torch is not None:
   188      8192       8333.0      1.0      0.4                  result = torch.mm(
   189      8192      30601.0      3.7      1.6                      torch.from_numpy(data),
   190      8192     397433.0     48.5     20.4                      torch.from_numpy(masks),
   191                                                           ).numpy()
   192                                                       else:
   193                                                           result = data.dot(masks)
   194      8192     122955.0     15.0      6.3              dest_slice = data_tile.tile_slice.shift(self.partition.slice)
   195      8192     239451.0     29.2     12.3              reshaped = self.reshaped_data(data=result, dest_slice=dest_slice)
   196                                                       # Ellipsis to match the "number of masks" part of the result
   197      8192     168817.0     20.6      8.7              part[(Ellipsis,) + dest_slice.get(nav_only=True)] += reshaped
   198                                                   return [
   199        16         18.0      1.1      0.0              ResultTile(
   200        16         15.0      0.9      0.0                  data=part,
   201        16        305.0     19.1      0.0                  dest_slice=self.partition.slice.get(nav_only=True),
   202                                                       )
   203                                                   ]

Total time: 2.09469 s
File: raw_one_file.py
Function: main at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def main():
    13         1          5.0      5.0      0.0      ctx = api.Context(executor=InlineJobExecutor())
    14                                           
    15         1          1.0      1.0      0.0      ds = RawFileDataSet(
    16         1          2.0      2.0      0.0          path="/home/clausen/Data/EMPAD/scan_11_x256_y256.raw",
    17         1          2.0      2.0      0.0          scan_size=(256, 256),
    18         1          1.0      1.0      0.0          detector_size_raw=(130, 128),
    19         1          1.0      1.0      0.0          crop_detector_to=(128, 128),
    20         1          1.0      1.0      0.0          tileshape=(1, 8, 128, 128),
    21         1          9.0      9.0      0.0          dtype="float32"
    22                                               )
    23         1          2.0      2.0      0.0      ds.initialize()
    24                                           
    25         1          7.0      7.0      0.0      job = ctx.create_mask_analysis(dataset=ds, factories=[lambda: np.ones(ds.shape.sig)])
    26                                           
    27         1    2094656.0 2094656.0    100.0      result = ctx.run(job)


real	0m2.873s
user	0m2.643s
sys	0m0.214s

sk1p added a commit that referenced this issue Dec 20, 2018
Currently limited to one frame per file; this limit should be lifted
before the reader is added to LiberTEM proper. For performance
discussions, see #206. Also included are some simple examples for
comparison and profiling (single-threaded).
sk1p commented Dec 20, 2018

With some optimization, I got the following results (single-threaded run, profiler disabled):

Single large file:

real	0m1.787s
user	0m1.498s
sys	0m0.270s

2^16 small files:

real	0m2.844s
user	0m2.143s
sys	0m0.672s

So LiberTEM should still be usable for these kinds of datasets once a proper reader is implemented, even though working on many small files takes about 1.6x as long. For the use case of opening simulated datasets, the penalty should be less pronounced, since those files (~1 MB) are larger than in my test case.

uellue commented Dec 20, 2018

Nice! That looks good.

sk1p removed the AAA-Priority label Dec 20, 2018
sk1p commented Dec 20, 2018

Updated the first comment with the steps necessary to have full support for the "raw file set" format.

For loading Dr. Probe files in a user-friendly way, some metadata may be necessary, either entered by the user when loading or read from a JSON sidecar file. CC @ju-bar
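
A hypothetical example of what such a sidecar could contain; the field names are made up for illustration, not a defined format:

{
    "scan_size": [256, 256],
    "detector_size": [128, 128],
    "dtype": "float32",
    "num_thickness_slices": 10
}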

Also, the output of Dr. Probe has an additional dimension: the sample thickness. At the beginning, we could support loading a single thickness in the UI; later, we could add a navigation slider for the extra dimension and re-run the job for the selected thickness. From the scripting interface, we can already support the additional dimension, for example by using a sparse matrix as mask that is only populated for the thickness you are interested in (see the sketch below).
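
A sketch of that scripting-side approach, assuming a signal shape of (n_thickness, h, w) and that the mask analysis accepts pydata sparse arrays (all names are illustrative):

import numpy as np
import sparse

def thickness_mask_factory(sig_shape, thickness_idx):
    def factory():
        # zeros everywhere except the selected thickness slice; stored
        # as sparse COO, so only the populated slice costs memory
        dense = np.zeros(sig_shape, dtype=np.float32)
        dense[thickness_idx] = 1.0
        return sparse.COO.from_numpy(dense)
    return factory

# e.g.: ctx.create_mask_analysis(dataset=ds,
#           factories=[thickness_mask_factory(tuple(ds.shape.sig), 3)])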

Another possibility would be to just display the different thickness results one below another, but that only scales to a handful of results.

sk1p added the enhancement label Feb 5, 2019
uellue linked pull request #1224 Jul 28, 2022 that will close this issue
sk1p commented Mar 20, 2023

Discussion with @uellue and @matbryan52: this is related to having a descriptor of either the file set or the whole dataset. For example, one could allow users to create a descriptor from a glob like this:

from libertem.io.base import FileSetDescriptor
fsd = FileSetDescriptor.glob("stuff/*.bin")

ds = ctx.load('raw', descriptor=fsd)

# under the hood:
RawFileDataSet(descriptor=fsd, path=None)

Alternatively, the descriptor could also include information about the whole data set; that is #1376.
