Add DataSet implementation for groups of raw files #1224

matbryan52 · 2022-03-10T14:27:58Z

Adds an extension of RawDataSet which can handle groups of files and files with frame headers/footers. Responds to #1204.

The implementation is currently not merged into RawDataSet, nor exposed through the usual LiberTEM ctx.load or web GUI endpoints. A future release could provide a single class which handles both single-file and multi-file reading (the implementation in this PR already handles single files), but that would require a change to the RawDataSet API/interface (specifically the path argument, which might become plural). For now this class should be considered undocumented and testing-only until the interface is stabilized.

One limitation is that MMapBackend cannot handle a frame header or footer whose size is not a multiple of dtype.itemsize. In that case the dataset raises a DataSetException and tells the user to change the backend.
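To make the constraint concrete, here is a minimal sketch of the kind of check involved. It is not the code from this PR: the function name `check_mmap_compatible` is hypothetical, and `ValueError` stands in for LiberTEM's DataSetException. The reasoning is that a memory-mapped file viewed as a typed array can only skip whole elements, so header and footer sizes in bytes have to be divisible by the item size.

```python
def check_mmap_compatible(frame_header: int, frame_footer: int, itemsize: int) -> None:
    """Check that per-frame header/footer sizes (in bytes) can be skipped
    when the file is viewed as a typed memory map.

    Hypothetical helper for illustration; `itemsize` would typically be
    numpy.dtype(dtype).itemsize, e.g. 4 for float32.
    ValueError is a stand-in for DataSetException.
    """
    for name, nbytes in (("header", frame_header), ("footer", frame_footer)):
        if nbytes % itemsize != 0:
            raise ValueError(
                f"frame {name} of {nbytes} bytes is not a multiple of "
                f"dtype.itemsize ({itemsize}); choose a different I/O backend"
            )
```

For example, a 16-byte header with float32 data (item size 4) passes, while a 3-byte header with uint16 data (item size 2) is rejected.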

This implementation also contains one or two optimizations specific to handling groups of files which might be useful elsewhere. Specifically:

- The method check_valid has been modified to check only 256 files at a time, which prevents an OSError due to too many open files.
- The mechanism to infer the size of each file takes advantage of parallel processing if the executor provides it (using executor.map on chunks of files). This avoids a big slowdown encountered when the class calls os.stat on thousands of files, particularly over a network filesystem.
- The number of partitions is adjusted to ensure fewer than 256 files per partition, which also helps avoid the OSError.
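The three points above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the implementation in this PR: the helper names (`chunked`, `stat_sizes`, `check_readable`, `num_partitions`) and the constant `MAX_OPEN` are hypothetical, a `concurrent.futures` executor stands in for a LiberTEM executor, and the partition calculation assumes files are spread evenly across partitions.

```python
import os
from itertools import chain, islice
from typing import Iterable, Iterator, List, Sequence

MAX_OPEN = 256  # assumed cap on files handled together


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def _stat_chunk(paths: List[str]) -> List[int]:
    # One os.stat call per file, executed inside a worker
    return [os.stat(p).st_size for p in paths]


def stat_sizes(paths: Sequence[str], executor, chunk_size: int = MAX_OPEN) -> List[int]:
    """Return file sizes in input order, computed chunk-wise via executor.map.

    `executor` is any object with a map(fn, iterable) method that preserves
    order, such as concurrent.futures.ThreadPoolExecutor.
    """
    results = executor.map(_stat_chunk, chunked(paths, chunk_size))
    return list(chain.from_iterable(results))


def check_readable(paths: Sequence[str], chunk_size: int = MAX_OPEN) -> None:
    """Open files chunk by chunk so at most `chunk_size` are open at once."""
    for chunk in chunked(paths, chunk_size):
        handles = []
        try:
            for p in chunk:
                handles.append(open(p, "rb"))
        finally:
            # Close this chunk before opening the next one
            for fh in handles:
                fh.close()


def num_partitions(num_files: int, requested: int, max_files: int = MAX_OPEN) -> int:
    """Increase the partition count until each partition spans fewer than
    `max_files` files, assuming an even split of files over partitions."""
    minimum = -(-num_files // (max_files - 1))  # ceiling division
    return max(requested, minimum)
```

With these assumptions, `stat_sizes(paths, ThreadPoolExecutor())` issues the `os.stat` calls in parallel batches, and `num_partitions(1000, 2)` raises the requested two partitions to four so that each one covers at most 250 files.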

Closes #206 (Dieter)

Contributor Checklist:

I have added or updated my entry in the creators.json file
I have added a changelog entry for my contribution
I have added/updated documentation for all user-facing changes
I have added/updated test cases
I have included the rebuilt production build of the client (only if changes were made to the GUI)

Reviewer Checklist:

/azp run libertem.libertem-data passed

matbryan52 · 2022-03-10T14:39:41Z

/azp run libertem.libertem-data

azure-pipelines · 2022-03-10T14:39:52Z

Azure Pipelines successfully started running 1 pipeline(s).

codecov · 2022-03-10T15:03:32Z

Codecov Report

Merging #1224 (28042a2) into master (d827e4a) will increase coverage by 0.12%.
The diff coverage is 93.33%.

@@            Coverage Diff             @@
##           master    #1224      +/-   ##
==========================================
+ Coverage   72.56%   72.69%   +0.12%     
==========================================
  Files         285      286       +1     
  Lines       15270    15360      +90     
  Branches     2521     2537      +16     
==========================================
+ Hits        11081    11166      +85     
- Misses       3783     3786       +3     
- Partials      406      408       +2

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/libertem/io/dataset/base/backend_buffered.py | `75.96% <ø> (ø)` | |
| src/libertem/io/dataset/raw_group.py | `93.33% <93.33%> (ø)` | |
| src/libertem/io/dataset/base/fileset.py | `95.65% <0.00%> (+2.17%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

matbryan52 · 2022-03-10T15:09:03Z

Data pipeline tests passed on Linux (36=>39) before the push that added test skipping on macOS.

matbryan52 · 2022-03-11T09:43:16Z

/azp run libertem.libertem-data

azure-pipelines · 2022-03-11T09:43:27Z

Azure Pipelines successfully started running 1 pipeline(s).

uellue · 2022-07-28T08:39:43Z

@matbryan52 should we try to get this into LiberTEM 0.11?

matbryan52 · 2022-07-28T08:49:34Z

> @matbryan52 should we try to get this into LiberTEM 0.11?

Definitely. The functionality for reading the files is already there in the normal RawDataSet; we just have to agree on the API design for taking multiple files as an argument.

matbryan52 added 12 commits March 10, 2022 11:49

Add raw_group ds with tests

73b31ea

Open files in check_valid in chunks

6b4ae20

Add test for RawFileGroupSet

a0b86ec

Add test for check_valid

8349569

Add doc on dataset

00c0490

Use a single os.stat per file when interpreting ds

9cec299

Typing and use executor.map

76f0ae2

Remove unused method

c51dd9b

Coerce path to string in DirectBufferedFile

4416f66

Use lt math prod module

51b9470

Add chunked file os.stat

d9503a6

Bugfix chain from iterable

99d21ce

Add skip test if on Mac OS and DIrectBackend

f766d76

matbryan52 added 5 commits March 11, 2022 08:19

Adjust number of partitions to avoid OSError

38103d4

Don't use dicts for mapping filesizes

cdd8ba7

Add test for string paths

e26f605

Log warning if paths contains duplicates

2a06554

Add tests for max open files

28042a2

sk1p added this to the 0.10 milestone May 2, 2022

matbryan52 removed this from the 0.10 milestone Jun 22, 2022

sk1p added this to the 0.11 milestone Jul 28, 2022

sk1p removed this from the 0.11 milestone Mar 20, 2023

sk1p modified the milestones: backlog, 0.12 Mar 20, 2023

matbryan52 modified the milestones: 0.12, 0.13 Jul 5, 2023

sk1p mentioned this pull request Aug 15, 2023

Focus and planning for the v0.13 release #1501

Closed

1 task

sk1p modified the milestones: 0.13, 0.14 Oct 25, 2023

sk1p modified the milestones: 0.14, 0.15 Apr 11, 2024
