Support for "directory full of binary files" kind of DataSet #206

Open
2 of 6 tasks
sk1p opened this issue Dec 12, 2018 · 5 comments · May be fixed by #1224
Open
2 of 6 tasks

Support for "directory full of binary files" kind of DataSet #206

sk1p opened this issue Dec 12, 2018 · 5 comments · May be fixed by #1224
Labels

Comments

sk1p commented Dec 12, 2018

Some software, for example in STEM simulation, writes its results as a directory full of small binary files. We need to test whether we can support this in a high-performance way and, if we can, add support to LiberTEM.

Generally, we may want to use plain read calls rather than mmap: mmap should have more overhead here, and we need to copy the data anyway if we want to stack the images for processing.
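
A minimal sketch of that straight-read approach (file names and shapes are illustrative, not the actual reader API):

import numpy as np

def read_stack(filenames, frame_shape, dtype=np.float32):
    # pre-allocate one contiguous buffer, then read each per-frame
    # file straight into its slot in the stack
    stack = np.empty((len(filenames),) + frame_shape, dtype=dtype)
    for idx, fn in enumerate(filenames):
        with open(fn, "rb") as f:
            stack[idx] = np.fromfile(f, dtype=dtype).reshape(frame_shape)
    return stack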

It should have similar parameters to the all-in-one-file raw format.

  • [x] implement prototype
  • [x] do basic benchmarking with synthetic data via the Python API
  • [ ] support for more than one frame per file
  • [ ] support for arbitrary tile shapes
  • [ ] clean up implementation and add it to the UI
  • [ ] user-friendly workflow for loading Dr. Probe data
sk1p commented Dec 19, 2018

Test data set: a 4 GB dataset split up into individual frames (64 kB per frame), one frame per file. This is basically the worst case.
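
Such a test set can be generated along these lines (a sketch; the source file name and output paths are illustrative):

import numpy as np

# memory-map the original 4 GB raw file as individual float32 frames
frames = np.memmap("scan.raw", dtype=np.float32, mode="r",
                   shape=(256 * 256, 128, 128))
# write one 64 kB file per frame (128 * 128 * 4 bytes)
for i, frame in enumerate(frames):
    frame.tofile("many_small_files/frame%08d.bin" % i)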

Small test script:

import os
# pin BLAS/OpenMP to a single thread for reproducible single-threaded timings
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'

import numpy as np
from libertem import api
from libertem.executor.inline import InlineJobExecutor
from libertem.io.dataset.raw_parts import RawFilesDataSet

# the @profile decorator is injected by kernprof (line_profiler);
# run with: kernprof -l raw_many_files.py
@profile
def main():
    ctx = api.Context(executor=InlineJobExecutor())

    ds = RawFilesDataSet(
        path="/home/clausen/Data/many_small_files/frame00016293.bin",
        scan_size=(256, 256),
        detector_size=(128, 128),
        tileshape=(1, 8, 128, 128),
        dtype="float32"
    )
    ds.initialize()

    job = ctx.create_mask_analysis(dataset=ds, factories=[lambda: np.ones(ds.shape.sig)])

    result = ctx.run(job)

if __name__ == "__main__":
    main()

Profiling results (raw results below):

  • glob.glob takes quite a while, but that is initialization cost and easily amortized
  • most of the job time is indeed spent in I/O (get_tiles(...))
  • in get_tiles, ~20% is spent opening files, ~20% copying into the tile buffer, and ~45% reading via np.fromfile. The copy may be avoidable by reading directly into the tile buffer (see the sketch after this list); opening and reading cannot be avoided while keeping the tiling scheme
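
One way the copy could go away is to read each file directly into its slice of the tile buffer via readinto(), instead of np.fromfile plus an assignment. A sketch with hypothetical names, assuming one frame per file and a C-contiguous tile buffer:

def read_frame_into(fn, tile_buffer, idx):
    # view the destination frame slot as a flat byte buffer;
    # readinto() fills it in place, with no intermediate array
    dest = memoryview(tile_buffer[idx]).cast('B')
    with open(fn, "rb", buffering=0) as f:
        assert f.readinto(dest) == dest.nbytes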

Notes:

  • we get more overhead by needing to read from more than one file for stacking (copies...)
  • it may actually be more efficient in this case to just use smaller tiles to get rid of the copies; I did not try that yet
  • reading many small files means more filesystem overhead, and we no longer read in large contiguous blocks

Line profiling results for reading many small files:

/home/clausen/Data/many_small_files/frame*.bin
Wrote profile results to raw_many_files.py.lprof
Timer unit: 1e-06 s

Total time: 3.27343 s
File: /home/clausen/source/libertem/src/libertem/io/dataset/raw_parts.py
Function: get_tiles at line 115

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   115                                               @profile
   116                                               def get_tiles(self, crop_to=None):
   117        16        100.0      6.2      0.0          stackheight = self.tileshape.nav.size
   118        16        480.0     30.0      0.0          tile_buffer = np.zeros(self.tileshape.flatten_nav(), dtype=self.dataset.dtype)
   119        16       2302.0    143.9      0.1          files_by_tile = list(grouper(self.files, stackheight))
   120        16        118.0      7.4      0.0          detector_size = tuple(self.tileshape.sig)
   121                                           
   122        16         22.0      1.4      0.0          partition_origin = self.slice.origin
   123                                           
   124        16         21.0      1.3      0.0          files_done = 0
   125      8208      11878.0      1.4      0.4          for files in files_by_tile:
   126      8192      12439.0      1.5      0.4              if len(files) != stackheight:
   127                                                           # allocate different size buffer for "rest-tile":
   128                                                           tile_buffer = np.zeros((len(files),) + tuple(self.dataset.raw_shape.sig),
   129                                                                                  dtype=self.dataset.dtype)
   130      8192      44003.0      5.4      1.3              origin = tuple(sum(x) for x in zip(partition_origin, (files_done, 0, 0)))
   131      8192      10772.0      1.3      0.3              tile_slice = Slice(
   132      8192      10005.0      1.2      0.3                  origin=origin,
   133      8192     110464.0     13.5      3.4                  shape=Shape(tile_buffer.shape, sig_dims=self.tileshape.sig.dims)
   134                                                       )
   135      8192      12452.0      1.5      0.4              files_done += len(files)
   136      8192      10634.0      1.3      0.3              if crop_to is not None:
   137                                                           intersection = tile_slice.intersection_with(crop_to)
   138                                                           if intersection.is_null():
   139                                                               continue
   140     73728     123152.0      1.7      3.8              for idx, fn in enumerate(files):
   141     65536     627925.0      9.6     19.2                  with open(fn, "rb") as f:
   142     65536    1467679.0     22.4     44.8                      d = np.fromfile(f, dtype=self.dataset.dtype)
   143     65536     163261.0      2.5      5.0                      d = d.reshape(detector_size)
   144     65536     588332.0      9.0     18.0                      tile_buffer[idx, ...] = d
   145                                           
   146      8192      10749.0      1.3      0.3              yield DataTile(
   147      8192      10176.0      1.2      0.3                  data=tile_buffer,
   148      8192      56464.0      6.9      1.7                  tile_slice=tile_slice
   149                                                       )

Total time: 5.22401 s
File: /home/clausen/source/libertem/src/libertem/job/masks.py
Function: __call__ at line 174

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   174                                               @profile
   175                                               def __call__(self):
   176        16         49.0      3.1      0.0          num_masks = len(self.masks)
   177        16        214.0     13.4      0.0          part = np.zeros((num_masks,) + tuple(self.partition.shape.nav), dtype="float32")
   178      8208    3930744.0    478.9     75.2          for data_tile in self.partition.get_tiles():
   179      8192     104240.0     12.7      2.0              data = data_tile.flat_data
   180      8192      12451.0      1.5      0.2              if data.dtype.kind == 'u':
   181                                                           data = data.astype("float32")
   182      8192     353625.0     43.2      6.8              masks = self.masks[data_tile]
   183      8192      10080.0      1.2      0.2              if self.masks.use_sparse:
   184                                                           # The sparse matrix has to be the left-hand side, for that
   185                                                           # reason we transpose before and after multiplication.
   186                                                           result = masks.T.dot(data.T).T
   187      8192       8716.0      1.1      0.2              elif self.use_torch and torch is not None:
   188      8192       9053.0      1.1      0.2                  result = torch.mm(
   189      8192      40507.0      4.9      0.8                      torch.from_numpy(data),
   190      8192     214221.0     26.2      4.1                      torch.from_numpy(masks),
   191                                                           ).numpy()
   192                                                       else:
   193                                                           result = data.dot(masks)
   194      8192     125174.0     15.3      2.4              dest_slice = data_tile.tile_slice.shift(self.partition.slice)
   195      8192     255519.0     31.2      4.9              reshaped = self.reshaped_data(data=result, dest_slice=dest_slice)
   196                                                       # Ellipsis to match the "number of masks" part of the result
   197      8192     159098.0     19.4      3.0              part[(Ellipsis,) + dest_slice.get(nav_only=True)] += reshaped
   198                                                   return [
   199        16         22.0      1.4      0.0              ResultTile(
   200        16         15.0      0.9      0.0                  data=part,
   201        16        281.0     17.6      0.0                  dest_slice=self.partition.slice.get(nav_only=True),
   202                                                       )
   203                                                   ]

Total time: 6.0188 s
File: raw_many_files.py
Function: main at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def main():
    13         1          6.0      6.0      0.0      ctx = api.Context(executor=InlineJobExecutor())
    14                                           
    15         1          2.0      2.0      0.0      ds = RawFilesDataSet(
    16         1          2.0      2.0      0.0          path="/home/clausen/Data/many_small_files/frame00016293.bin",
    17         1          2.0      2.0      0.0          scan_size=(256, 256),
    18         1          1.0      1.0      0.0          detector_size=(128, 128),
    19         1          1.0      1.0      0.0          tileshape=(1, 8, 128, 128),
    20         1         15.0     15.0      0.0          dtype="float32"
    21                                               )
    22         1     688981.0 688981.0     11.4      ds.initialize()
    23                                           
    24         1         15.0     15.0      0.0      job = ctx.create_mask_analysis(dataset=ds, factories=[lambda: np.ones(ds.shape.sig)])
    25                                           
    26         1    5329778.0 5329778.0     88.6      result = ctx.run(job)


real	0m6.813s
user	0m5.820s
sys	0m0.967s

As a comparison, here is a profile for a single large file:

Wrote profile results to raw_one_file.py.lprof
Timer unit: 1e-06 s

Total time: 1.94544 s
File: /home/clausen/source/libertem/src/libertem/job/masks.py
Function: __call__ at line 174

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   174                                               @profile
   175                                               def __call__(self):
   176        16         42.0      2.6      0.0          num_masks = len(self.masks)
   177        16        318.0     19.9      0.0          part = np.zeros((num_masks,) + tuple(self.partition.shape.nav), dtype="float32")
   178      8208     475401.0     57.9     24.4          for data_tile in self.partition.get_tiles():
   179      8192     124133.0     15.2      6.4              data = data_tile.flat_data
   180      8192      12216.0      1.5      0.6              if data.dtype.kind == 'u':
   181                                                           data = data.astype("float32")
   182      8192     346504.0     42.3     17.8              masks = self.masks[data_tile]
   183      8192      10028.0      1.2      0.5              if self.masks.use_sparse:
   184                                                           # The sparse matrix has to be the left-hand side, for that
   185                                                           # reason we transpose before and after multiplication.
   186                                                           result = masks.T.dot(data.T).T
   187      8192       8870.0      1.1      0.5              elif self.use_torch and torch is not None:
   188      8192       8333.0      1.0      0.4                  result = torch.mm(
   189      8192      30601.0      3.7      1.6                      torch.from_numpy(data),
   190      8192     397433.0     48.5     20.4                      torch.from_numpy(masks),
   191                                                           ).numpy()
   192                                                       else:
   193                                                           result = data.dot(masks)
   194      8192     122955.0     15.0      6.3              dest_slice = data_tile.tile_slice.shift(self.partition.slice)
   195      8192     239451.0     29.2     12.3              reshaped = self.reshaped_data(data=result, dest_slice=dest_slice)
   196                                                       # Ellipsis to match the "number of masks" part of the result
   197      8192     168817.0     20.6      8.7              part[(Ellipsis,) + dest_slice.get(nav_only=True)] += reshaped
   198                                                   return [
   199        16         18.0      1.1      0.0              ResultTile(
   200        16         15.0      0.9      0.0                  data=part,
   201        16        305.0     19.1      0.0                  dest_slice=self.partition.slice.get(nav_only=True),
   202                                                       )
   203                                                   ]

Total time: 2.09469 s
File: raw_one_file.py
Function: main at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def main():
    13         1          5.0      5.0      0.0      ctx = api.Context(executor=InlineJobExecutor())
    14                                           
    15         1          1.0      1.0      0.0      ds = RawFileDataSet(
    16         1          2.0      2.0      0.0          path="/home/clausen/Data/EMPAD/scan_11_x256_y256.raw",
    17         1          2.0      2.0      0.0          scan_size=(256, 256),
    18         1          1.0      1.0      0.0          detector_size_raw=(130, 128),
    19         1          1.0      1.0      0.0          crop_detector_to=(128, 128),
    20         1          1.0      1.0      0.0          tileshape=(1, 8, 128, 128),
    21         1          9.0      9.0      0.0          dtype="float32"
    22                                               )
    23         1          2.0      2.0      0.0      ds.initialize()
    24                                           
    25         1          7.0      7.0      0.0      job = ctx.create_mask_analysis(dataset=ds, factories=[lambda: np.ones(ds.shape.sig)])
    26                                           
    27         1    2094656.0 2094656.0    100.0      result = ctx.run(job)


real	0m2.873s
user	0m2.643s
sys	0m0.214s

sk1p added a commit that referenced this issue Dec 20, 2018
Currently limited to one frame per file; this limit should be lifted
before the reader is added to LiberTEM proper. For performance
discussions, see #206. Also included are some simple examples for
comparison and profiling (single-threaded).
sk1p commented Dec 20, 2018

With some optimization, I got the following results (single-threaded run, profiler disabled):

Single large file:

real	0m1.787s
user	0m1.498s
sys	0m0.270s

2^16 small files:

real	0m2.844s
user	0m2.143s
sys	0m0.672s

So LiberTEM should still be usable for these kinds of datasets once a proper reader is implemented, even though working on many small files takes about 1.6x as long. For the use case of opening simulated datasets, the penalty should be less pronounced, since those files (~1 MB) are larger than in my test case.

uellue commented Dec 20, 2018

Nice! That looks good.

sk1p removed the AAA-Priority label Dec 20, 2018
sk1p commented Dec 20, 2018

Updated the first comment with the steps necessary to have full support for the "raw file set" format.

For loading Dr. Probe files in a user-friendly way, some metadata may be necessary, either entered by the user when loading or read from a JSON sidecar file. CC @ju-bar
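
A hypothetical example of what such a sidecar could contain; the field names are made up for illustration, not a defined format:

{
    "scan_size": [256, 256],
    "detector_size": [128, 128],
    "dtype": "float32",
    "num_thickness_slices": 10
}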

Also, the output of Dr. Probe has an additional dimension: the sample thickness. At the beginning, we could support loading a single thickness in the UI; later, we could add a navigation slider for the extra dimension and re-run the job for the selected thickness. From the scripting interface, we can already support the additional dimension, for example by using a sparse matrix as mask that is only populated for the thickness you are interested in (see the sketch below).
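
A sketch of that scripting-side approach, assuming a signal shape of (n_thickness, h, w) and that the mask analysis accepts pydata sparse arrays (all names are illustrative):

import numpy as np
import sparse

def thickness_mask_factory(sig_shape, thickness_idx):
    def factory():
        # zeros everywhere except the selected thickness slice; stored
        # as sparse COO, so only the populated slice costs memory
        dense = np.zeros(sig_shape, dtype=np.float32)
        dense[thickness_idx] = 1.0
        return sparse.COO.from_numpy(dense)
    return factory

# e.g.: ctx.create_mask_analysis(dataset=ds,
#           factories=[thickness_mask_factory(tuple(ds.shape.sig), 3)])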

Another possibility would be to just display the different thickness results one below another, but that only scales to a handful of results.

sk1p added the enhancement label Feb 5, 2019
uellue linked pull request #1224 Jul 28, 2022 that will close this issue
sk1p commented Mar 20, 2023

Discussion with @uellue and @matbryan52: this is related to having a descriptor of either the file set or the whole dataset. For example, one could allow users to create a descriptor from a glob like this:

from libertem.io.base import FileSetDescriptor
fsd = FileSetDescriptor.glob("stuff/*.bin")

ds = ctx.load('raw', descriptor=fsd)

# under the hood:
RawFileDataSet(descriptor=fsd, path=None)

Alternatively, the descriptor could also include information about the whole data set; that is #1376.
