Parameter sidecar files #768

Open
sk1p opened this issue May 4, 2020 · 6 comments · May be fixed by #1376

Comments

@sk1p
Member

sk1p commented May 4, 2020

From #153:

SEQ files don't contain any information about the scan, so there was a question if we could read this from a "sidecar" file, like we do with other formats with their proprietary implementations. One idea would be to have a simple file, maybe in .ini format, or .yml, which contains that information. This idea could even be extended to other formats that need additional information, like raw. Something like this:

my_dataset.yml:

format: SEQ
path: something.seq
scan_size: [256, 256]
dark_frame: dark.tif

my_other_dataset.yml:

format: RAW
path: other.bin
dtype: u2
scan_size: [256, 256]
detector_size: [128, 128]
dark_frame: dark.tif

These yml files could then provide one-click opening in LiberTEM, even if the underlying raw dataset doesn't provide enough information.
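As a rough illustration of what "enough information" could mean in practice, here is a minimal sketch that validates an already-parsed sidecar mapping (as it might come out of any YAML, JSON or TOML parser). The names `SidecarParams`, `parse_sidecar` and `REQUIRED_KEYS`, and the per-format key sets, are assumptions made up for this example; they are not part of LiberTEM.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-format required keys, derived from the two examples above.
REQUIRED_KEYS = {
    "SEQ": {"path", "scan_size"},
    "RAW": {"path", "scan_size", "dtype", "detector_size"},
}


@dataclass
class SidecarParams:
    format: str
    path: str
    scan_size: tuple
    dtype: Optional[str] = None
    detector_size: Optional[tuple] = None
    dark_frame: Optional[str] = None


def parse_sidecar(raw: dict) -> SidecarParams:
    """Validate an already-parsed sidecar mapping and return typed parameters."""
    fmt = str(raw.get("format", "")).upper()
    if fmt not in REQUIRED_KEYS:
        raise ValueError(f"unsupported format: {fmt!r}")
    missing = REQUIRED_KEYS[fmt] - set(raw)
    if missing:
        raise ValueError(f"{fmt} sidecar is missing keys: {sorted(missing)}")
    detector_size = raw.get("detector_size")
    return SidecarParams(
        format=fmt,
        path=raw["path"],
        scan_size=tuple(raw["scan_size"]),
        dtype=raw.get("dtype"),
        detector_size=tuple(detector_size) if detector_size is not None else None,
        dark_frame=raw.get("dark_frame"),
    )


# The contents of my_other_dataset.yml, after parsing:
params = parse_sidecar({
    "format": "RAW", "path": "other.bin", "dtype": "u2",
    "scan_size": [256, 256], "detector_size": [128, 128],
    "dark_frame": "dark.tif",
})
```

Failing early with a message that names the missing keys is what would make one-click opening practical: the user learns what to add to the sidecar file instead of getting an error from deep inside a reader.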

@sk1p
Member Author

sk1p commented May 29, 2020

Notes about the dark_frame parameter (and other corrections): in addition to the path to the file, more parameters may be needed. For example, when loading raw binary data, the dtype needs to be specified; when loading MATLAB files, the array name is needed.

@uellue
Member

uellue commented May 29, 2020

Ah, that resonates with the discussion about specifying bad pixels, gain map, dark frame and also other parameters for cases where the original format doesn't contain that.

@jhgee

jhgee commented Jan 20, 2022

I want to revive the discussion here and gather some ideas about what should be achieved and what open questions are still around.

  1. Is there already something similar to parameter sidecar files for the cluster class - https://github.com/LiberTEM/LiberTEM/blob/master/src/libertem/io/dataset/cluster.py#L63-L75 ?
  2. Which format do we prefer? YAML sounds suitable. If we don't need a specific feature from one specific configuration file format I think it's best to aim for consistency across the LiberTEM project and choose a format accordingly

@sk1p
Member Author

sk1p commented Jan 20, 2022

  1. Is there already something similar like parameter sidecar files for the cluster class - https://github.com/LiberTEM/LiberTEM/blob/master/src/libertem/io/dataset/cluster.py#L63-L75 ?

Yeah, those are very specific for that use case. I'm also not sure if the cluster dataset will survive - for caching, we are now using fscache instead, and the cluster ds was mostly meant to be used as a backend for caching...

  2. Which format do we prefer? YAML sounds suitable. If we don't need a specific feature from one specific configuration file format I think it's best to aim for consistency across the LiberTEM project and choose a format accordingly

I think the main candidates are: YAML, JSON, TOML.

JSON

Pro:

  • very simple format, easy to read/write from Python and other languages
  • included in the Python stdlib

Con:

YAML

Pro:

  • user-friendly format, not as strict as JSON

Con:

  • Some surprising behavior (classic example: in "language: no", the literal "no" is deserialized to the bool value False, whereas in "language: de", "de" is deserialized as a string)
  • Need to take care to use safe loading to not execute arbitrary code...
  • https://www.arp242.net/yaml-config.html for some examples

TOML

https://github.com/toml-lang/toml

Not sure about this one, as I haven't used it yet (other than writing some configuration files), but it may be a good candidate:

Pro:

  • user-friendly simple syntax (?), meant for writing configuration files

Con:

  • Not included in stdlib
  • Not meant for serializing arbitrary objects - but that should not be an issue for this use-case
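For anyone weighing the stdlib point today: Python 3.11 added a read-only TOML parser, `tomllib`, to the standard library, and `tomli` is a third-party backport with the same API for older versions. The sketch below parses a small, made-up sidecar and also shows that TOML rejects the unquoted `no` that YAML silently turns into a boolean. The `[dataset]` keys are illustrative only.

```python
try:
    import tomllib  # standard library since Python 3.11 (parsing only)
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

SIDECAR = """
[dataset]
type = "raw"
path = "./other.bin"
dtype = "u2"
scan_size = [256, 256]
detector_size = [128, 128]
"""

config = tomllib.loads(SIDECAR)
dataset = config["dataset"]
print(dataset["scan_size"], type(dataset["dtype"]).__name__)

# Strings must be quoted in TOML, so a bare `no` is a parse error
# instead of being coerced to a boolean.
try:
    tomllib.loads("language = no")
    bare_no_rejected = False
except tomllib.TOMLDecodeError:
    bare_no_rejected = True
```

Writing TOML still needs a third-party package, but a sidecar reader only has to parse.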

@sk1p
Member Author

sk1p commented Jan 20, 2022

Looking more into it, I kind of like TOML:

TOML shares traits with other file formats used for application configuration and data serialization, such as YAML and JSON. TOML and JSON both are simple and use ubiquitous data types, making them easy to code for or parse with machines. TOML and YAML both emphasize human readability features, like comments that make it easier to understand the purpose of a given line. TOML differs in combining these, allowing comments (unlike JSON) but preserving simplicity (unlike YAML).

Because TOML is explicitly intended as a configuration file format, parsing it is easy, but it is not intended for serializing arbitrary data structures. TOML always has a hash table at the top level of the file, which can easily have data nested inside its keys, but it doesn't permit top-level arrays or floats, so it cannot directly serialize some data. There is also no standard identifying the start or end of a TOML file, which can complicate sending it through a stream. These details must be negotiated on the application layer.

In [1]: import toml

In [2]: toml.loads("""
   ...: [dataset]
   ...: type = "tvips"
   ...: path = "/home/alex/Data/TVIPS/rec_20200623_080237_000.tvips"
   ...: 
   ...: [corrections]
   ...: dark_frame = "/home/alex/Data/TVIPS/dark.tif"
   ...: excluded_pixels = [[0, 0], [128, 128]]
   ...: """)
Out[2]: 
{'dataset': {'type': 'tvips',
  'path': '/home/alex/Data/TVIPS/rec_20200623_080237_000.tvips'},
 'corrections': {'dark_frame': '/home/alex/Data/TVIPS/dark.tif',
  'excluded_pixels': [[0, 0], [128, 128]]}}

The paths above could also be relative: if the path starts with './', it will be joined with the path of the directory that contains the sidecar file.
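The relative-path rule could be sketched like this. The function name `resolve_sidecar_path` is hypothetical, and treating every path that does not start with './' as "use unchanged" is an assumption of this example, not a settled part of the proposal.

```python
from pathlib import Path


def resolve_sidecar_path(value: str, sidecar_file: str) -> Path:
    """Resolve a path found in a sidecar file.

    Paths starting with './' are joined with the directory that contains
    the sidecar file; anything else is returned unchanged.
    """
    if value.startswith("./"):
        return Path(sidecar_file).parent / value[2:]
    return Path(value)


sidecar = "/home/alex/Data/TVIPS/my_dataset.toml"
dark = resolve_sidecar_path("./dark.tif", sidecar)
absolute = resolve_sidecar_path("/mnt/shared/gain.bin", sidecar)
```

Resolving against the sidecar's directory rather than the current working directory means a dataset folder can be moved or copied together with its sidecar file without editing any paths.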

For the corrections, we could also think about accepting some parameters, like this:

In [4]: toml.loads("""
   ...: [dataset]
   ...: type = "tvips"
   ...: path = "/home/alex/Data/TVIPS/rec_20200623_080237_000.tvips"
   ...: 
   ...: [corrections]
   ...: excluded_pixels = [[0, 0], [128, 128]]
   ...: 
   ...: [corrections.dark_frame]
   ...: path = "/home/alex/Data/TVIPS/dark.tif"
   ...: format = "TIFF"
   ...: 
   ...: [corrections.gain_map]
   ...: path = "/home/alex/Data/TVIPS/gain.bin"
   ...: format = "RAW"
   ...: dtype = "float32"
   ...: """)
Out[4]: 
{'dataset': {'type': 'tvips',
  'path': '/home/alex/Data/TVIPS/rec_20200623_080237_000.tvips'},
 'corrections': {'excluded_pixels': [[0, 0], [128, 128]],
  'dark_frame': {'path': '/home/alex/Data/TVIPS/dark.tif', 'format': 'TIFF'},
  'gain_map': {'path': '/home/alex/Data/TVIPS/gain.bin',
   'format': 'RAW',
   'dtype': 'float32'}}}
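Both forms shown above (a plain path string, or a table with extra parameters) could be accepted by normalizing each correction entry first. This is only a sketch: `normalize_correction` is a hypothetical helper, and the rule that RAW files must carry a `dtype` is an assumption based on the earlier note about raw binary data.

```python
def normalize_correction(entry):
    """Return a correction entry as a dict with at least a 'path' key.

    Accepts the short form (a path string) or the long form (a table with
    'path' and optional 'format' / 'dtype' keys).
    """
    if isinstance(entry, str):
        # Short form: only a path; the format would have to be inferred later.
        return {"path": entry}
    if isinstance(entry, dict):
        if "path" not in entry:
            raise ValueError("correction table needs a 'path' key")
        spec = dict(entry)  # copy, so the parsed config is not mutated
        if str(spec.get("format", "")).upper() == "RAW" and "dtype" not in spec:
            raise ValueError("RAW correction files need an explicit 'dtype'")
        return spec
    raise TypeError(f"unsupported correction entry: {entry!r}")


corrections = {
    "dark_frame": "/home/alex/Data/TVIPS/dark.tif",
    "gain_map": {"path": "/home/alex/Data/TVIPS/gain.bin",
                 "format": "RAW", "dtype": "float32"},
}
normalized = {name: normalize_correction(e) for name, e in corrections.items()}
```

With this shape, simple cases stay one line in the sidecar file, while files that need extra parameters can opt in to the table form.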

(loading corrections directly in Context.load is not supported yet, but I guess with the sidecar format, it makes sense to support this - i.e. store a CorrectionSet on the DataSet as the default, which then will be used in Context.run_udf if nothing else is specified)

@sk1p sk1p added this to the 0.10 milestone Jan 24, 2022
@sk1p
Member Author

sk1p commented Jan 24, 2022

I'm adding this to the 0.10 milestone - @jhgee thank you for kick-starting this discussion again! If you'd like to help out on building this feature, it would be good to have a concrete specification for the sidecar file, and maybe have a go at implementing a prototype. If you encounter any difficulties, we are here to help!

@uellue uellue modified the milestones: 0.10, 0.11 Jun 1, 2022
@matbryan52 matbryan52 linked a pull request Jan 10, 2023 that will close this issue
7 tasks
@matbryan52 matbryan52 linked a pull request Jan 11, 2023 that will close this issue
7 tasks
@sk1p sk1p modified the milestones: 0.11, 0.12 Mar 21, 2023
@sk1p sk1p removed this from the 0.12 milestone Jul 26, 2023

3 participants