Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DataSet implementation for groups of raw files #1224

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

matbryan52
Copy link
Member

@matbryan52 matbryan52 commented Mar 10, 2022

Adds an extension of RawDataSet which can handle groups of files and files with frame headers/footers. Responds to #1204 .

Currently the implementation is not merged into RawDataSet or the usual LiberTEM ctx.load or web GUI endpoints. In a future release it would be possible to design a single class which handles both single-file and multi-file reading (in fact the implementation in this PR already handles single files), but it would entail a change to the RawDataSet API/interface (specifically the path argument which might change to plural). For now this class should be considered undocumented / testing-only until the interface is stabilized.

One limitation of the system is that when using MMapBackend it is impossible to have a frame header or footer which is not a multiple of dtype.itemsize. In this case the dataset raises a DataSetException and tells the user to change the backend.

This implementation also contains one or two optimizations specific to handling groups of files which might be useful elsewhere. Specifically:

  • The method check_valid has been modified to only check 256 files at a time, this prevents an OSError due to too many open files
  • The mechanism to infer the size of each file has been written to take advantage of parallel processing if this is provided by the executor (using executor.map on chunks of files). This prevents a big slowdown encountered when the class calls os.stat on thousands of files, particularly over a network filesystem.
  • The number of partitions is adjusted to ensure fewer than 256 files-per-partition, to help avoid the OSError

Closes #206 (Dieter)

Contributor Checklist:

Reviewer Checklist:

  • /azp run libertem.libertem-data passed

@matbryan52
Copy link
Member Author

/azp run libertem.libertem-data

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@codecov
Copy link

codecov bot commented Mar 10, 2022

Codecov Report

Merging #1224 (28042a2) into master (d827e4a) will increase coverage by 0.12%.
The diff coverage is 93.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1224      +/-   ##
==========================================
+ Coverage   72.56%   72.69%   +0.12%     
==========================================
  Files         285      286       +1     
  Lines       15270    15360      +90     
  Branches     2521     2537      +16     
==========================================
+ Hits        11081    11166      +85     
- Misses       3783     3786       +3     
- Partials      406      408       +2     
Impacted Files Coverage Δ
src/libertem/io/dataset/base/backend_buffered.py 75.96% <ø> (ø)
src/libertem/io/dataset/raw_group.py 93.33% <93.33%> (ø)
src/libertem/io/dataset/base/fileset.py 95.65% <0.00%> (+2.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d827e4a...28042a2. Read the comment docs.

@matbryan52
Copy link
Member Author

Data pipeline tests passed on linux 36=>39 before the push to add test skipping on Mac OSX.

@matbryan52
Copy link
Member Author

/azp run libertem.libertem-data

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@sk1p sk1p added this to the 0.10 milestone May 2, 2022
@matbryan52 matbryan52 removed this from the 0.10 milestone Jun 22, 2022
@uellue
Copy link
Member

uellue commented Jul 28, 2022

@matbryan52 should we try to get this into LiberTEM 0.11?

@matbryan52
Copy link
Member Author

matbryan52 commented Jul 28, 2022

@matbryan52 should we try to get this into LiberTEM 0.11?

Definitely, the functionality for reading the files is already there in the normal RawDataSet, we just have to agree on the API design for taking multiple files as argument.

@sk1p sk1p added this to the 0.11 milestone Jul 28, 2022
@sk1p sk1p removed this from the 0.11 milestone Mar 20, 2023
@sk1p sk1p modified the milestones: backlog, 0.12 Mar 20, 2023
@matbryan52 matbryan52 modified the milestones: 0.12, 0.13 Jul 5, 2023
@sk1p sk1p modified the milestones: 0.13, 0.14 Oct 25, 2023
@sk1p sk1p modified the milestones: 0.14, 0.15 Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for "directory full of binary files" kind of DataSet
3 participants