Parameter sidecar files #768

sk1p · 2020-05-04T19:15:28Z

From #153:

SEQ files don't contain any information about the scan, so the question came up whether we could read it from a "sidecar" file, as we already do for other formats through their proprietary implementations. One idea would be a simple file, maybe in .ini or .yml format, that contains that information. This idea could even be extended to other formats that need additional information, like raw. Something like this:

my_dataset.yml:

format: SEQ
path: something.seq
scan_size: [256, 256]
dark_frame: dark.tif

my_other_dataset.yml:

format: RAW
path: other.bin
dtype: u2
scan_size: [256, 256]
detector_size: [128, 128]
dark_frame: dark.tif

These yml files could then provide one-click opening in LiberTEM, even if the underlying raw dataset doesn't provide enough information.
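
To make the idea a bit more concrete, here is a minimal sketch of how an already-parsed sidecar file could be split into a format name and loader arguments. The helper name `sidecar_to_load_args` and the shape of its return value are assumptions for illustration only; the keys are the ones from the example files above.

```python
from typing import Any, Dict, Tuple


def sidecar_to_load_args(params: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Split a parsed sidecar mapping into (format name, keyword arguments).

    Hypothetical helper: key names follow the example files above.
    """
    if "format" not in params:
        raise ValueError("sidecar file must specify 'format'")
    if "path" not in params:
        raise ValueError("sidecar file must specify 'path'")

    kwargs = {k: v for k, v in params.items() if k != "format"}
    # Parsers return lists for sequences; shapes are more convenient as tuples
    for key in ("scan_size", "detector_size"):
        if key in kwargs:
            kwargs[key] = tuple(kwargs[key])
    return params["format"].lower(), kwargs


# The content of my_other_dataset.yml, written out as the dict a parser would return
parsed = {
    "format": "RAW",
    "path": "other.bin",
    "dtype": "u2",
    "scan_size": [256, 256],
    "detector_size": [128, 128],
    "dark_frame": "dark.tif",
}
filetype, kwargs = sidecar_to_load_args(parsed)
```

Keys such as dark_frame would still have to be mapped onto whatever the loader actually accepts; the sketch only shows the splitting step.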


sk1p · 2020-05-29T13:54:38Z

Notes about the dark_frame parameter (and other corrections): in addition to the path to the file, more parameters may be needed. For example, the dtype needs to be specified when loading raw binary data, and the array name when loading MATLAB files.

uellue · 2020-05-29T13:54:43Z

Ah, that resonates with the discussion about specifying bad pixels, gain map, dark frame and also other parameters for cases where the original format doesn't contain that.

jhgee · 2022-01-20T12:23:16Z

I want to revive the discussion here and gather some ideas about what should be achieved and what open questions are still around.

Is there already something similar to parameter sidecar files for the cluster class (https://github.com/LiberTEM/LiberTEM/blob/master/src/libertem/io/dataset/cluster.py#L63-L75)?
Which format do we prefer? YAML sounds suitable. If we don't need a specific feature from one particular configuration file format, I think it's best to aim for consistency across the LiberTEM project and choose a format accordingly.

sk1p · 2022-01-20T13:24:44Z

> Is there already something similar to parameter sidecar files for the cluster class (https://github.com/LiberTEM/LiberTEM/blob/master/src/libertem/io/dataset/cluster.py#L63-L75)?

Yeah, those are very specific for that use case. I'm also not sure if the cluster dataset will survive - for caching, we are now using fscache instead, and the cluster ds was mostly meant to be used as a backend for caching...

> Which format do we prefer? YAML sounds suitable. If we don't need a specific feature from one particular configuration file format, I think it's best to aim for consistency across the LiberTEM project and choose a format accordingly.

I think the main candidates are: YAML, JSON, TOML.

JSON

Pro:

very simple format, easy to read/write from Python and other languages
included in the Python stdlib

Con:

strict, users need to be aware of not having trailing commas etc.
https://www.arp242.net/json-config.html
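
As a small illustration of the strictness point, the stdlib parser rejects a trailing comma that a user could easily leave in when editing by hand. The sidecar content here just reuses keys from the first example in this issue.

```python
import json

# Same content twice; the first variant has a trailing comma before the closing brace
with_trailing_comma = '{"format": "SEQ", "scan_size": [256, 256],}'
strict = '{"format": "SEQ", "scan_size": [256, 256]}'

try:
    json.loads(with_trailing_comma)
    trailing_comma_ok = True
except json.JSONDecodeError as exc:
    trailing_comma_ok = False
    print(f"rejected: {exc}")

params = json.loads(strict)
```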

YAML

Pro:

user-friendly format, not as strict as JSON

Con:

Some surprising behavior (classic example: in "language: no", the literal "no" is deserialized to the boolean False, whereas in "language: de", "de" is deserialized as a string)
Need to take care to use safe loading, so that loading a file cannot execute arbitrary code
https://www.arp242.net/yaml-config.html for some examples

TOML

https://github.com/toml-lang/toml

Not sure about this one, as I haven't used it yet (other than writing some configuration files), but it may be a good candidate:

Pro:

user-friendly simple syntax (?), meant for writing configuration files

Con:

Not included in stdlib
Not meant for serializing arbitrary objects - but that should not be an issue for this use-case
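
On the stdlib point: Python 3.11 and later ship `tomllib`, a read-only TOML parser (it has no function for writing files). A minimal sketch of reading a sidecar-like document with it follows; the fallback import assumes the third-party `tomli` backport on older Python versions, and the file content is a made-up placeholder.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

# Placeholder content; the top level of a TOML document is always a table
doc = """
[dataset]
type = "tvips"
path = "./rec.tvips"  # comments are allowed

[corrections]
excluded_pixels = [[0, 0], [128, 128]]
"""

params = tomllib.loads(doc)
```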

sk1p · 2022-01-20T18:58:57Z

Looking more into it, I kind of like TOML:

TOML shares traits with other file formats used for application configuration and data serialization, such as YAML and JSON. TOML and JSON both are simple and use ubiquitous data types, making them easy to code for or parse with machines. TOML and YAML both emphasize human readability features, like comments that make it easier to understand the purpose of a given line. TOML differs in combining these, allowing comments (unlike JSON) but preserving simplicity (unlike YAML).

Because TOML is explicitly intended as a configuration file format, parsing it is easy, but it is not intended for serializing arbitrary data structures. TOML always has a hash table at the top level of the file, which can easily have data nested inside its keys, but it doesn't permit top-level arrays or floats, so it cannot directly serialize some data. There is also no standard identifying the start or end of a TOML file, which can complicate sending it through a stream. These details must be negotiated on the application layer.

In [1]: import toml

In [2]: toml.loads("""
   ...: [dataset]
   ...: type = "tvips"
   ...: path = "/home/alex/Data/TVIPS/rec_20200623_080237_000.tvips"
   ...: 
   ...: [corrections]
   ...: dark_frame = "/home/alex/Data/TVIPS/dark.tif"
   ...: excluded_pixels = [[0, 0], [128, 128]]
   ...: """)
Out[2]: 
{'dataset': {'type': 'tvips',
  'path': '/home/alex/Data/TVIPS/rec_20200623_080237_000.tvips'},
 'corrections': {'dark_frame': '/home/alex/Data/TVIPS/dark.tif',
  'excluded_pixels': [[0, 0], [128, 128]]}}

The paths above could also be relative: if the path starts with './', it will be joined with the path of the directory that contains the sidecar file.
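
The rule for relative paths could look roughly like this. The function name `resolve_sidecar_path` and the sidecar location are hypothetical; only the './' rule itself is taken from the paragraph above.

```python
from pathlib import Path


def resolve_sidecar_path(value: str, sidecar_file: str) -> Path:
    """Resolve a path found in a sidecar file.

    Paths starting with './' are joined with the directory that contains
    the sidecar file; everything else is returned unchanged.
    """
    if value.startswith("./"):
        return Path(sidecar_file).parent / value[2:]
    return Path(value)


sidecar = "/data/session1/my_dataset.toml"  # hypothetical location
relative = resolve_sidecar_path("./dark.tif", sidecar)
absolute = resolve_sidecar_path("/home/alex/Data/TVIPS/dark.tif", sidecar)
```

With this rule, a bare "dark.tif" without the leading './' is left alone and would be interpreted relative to the current working directory; whether that is desirable is an open design question.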

For the corrections, we could also think about accepting some parameters, like this:

In [4]: toml.loads("""
   ...: [dataset]
   ...: type = "tvips"
   ...: path = "/home/alex/Data/TVIPS/rec_20200623_080237_000.tvips"
   ...: 
   ...: [corrections]
   ...: excluded_pixels = [[0, 0], [128, 128]]
   ...: 
   ...: [corrections.dark_frame]
   ...: path = "/home/alex/Data/TVIPS/dark.tif"
   ...: format = "TIFF"
   ...: 
   ...: [corrections.gain_map]
   ...: path = "/home/alex/Data/TVIPS/gain.bin"
   ...: format = "RAW"
   ...: dtype = "float32"
   ...: """)
Out[4]: 
{'dataset': {'type': 'tvips',
  'path': '/home/alex/Data/TVIPS/rec_20200623_080237_000.tvips'},
 'corrections': {'excluded_pixels': [[0, 0], [128, 128]],
  'dark_frame': {'path': '/home/alex/Data/TVIPS/dark.tif', 'format': 'TIFF'},
  'gain_map': {'path': '/home/alex/Data/TVIPS/gain.bin',
   'format': 'RAW',
   'dtype': 'float32'}}}
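
If both variants are accepted (a bare path string as in the first example, or a table with extra parameters as in the second), the loading code could first normalize each entry to the table form. This is only a sketch; `normalize_correction` is a hypothetical name.

```python
from typing import Any, Dict, Union


def normalize_correction(entry: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Bring a correction entry into the table form {'path': ..., ...}.

    Accepts a bare path string, or a table with 'path' plus optional
    loader parameters such as 'format' or 'dtype'.
    """
    if isinstance(entry, str):
        return {"path": entry}
    if isinstance(entry, dict):
        if "path" not in entry:
            raise ValueError("correction table needs a 'path' key")
        return dict(entry)  # copy, so the parsed document is left untouched
    raise TypeError(f"unsupported correction entry: {entry!r}")


short_form = normalize_correction("/home/alex/Data/TVIPS/dark.tif")
long_form = normalize_correction(
    {"path": "/home/alex/Data/TVIPS/gain.bin", "format": "RAW", "dtype": "float32"}
)
```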

(Loading corrections directly in Context.load is not supported yet, but with the sidecar format it would make sense to support this: store a CorrectionSet on the DataSet as the default, which Context.run_udf then uses if nothing else is specified.)
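
The "default unless overridden" behavior could be sketched like this. The classes below are plain stand-ins, not LiberTEM's actual DataSet or CorrectionSet, and `pick_corrections` is a hypothetical helper showing only the fallback logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchCorrections:
    """Stand-in for a correction set; only carries a label here."""
    label: str


@dataclass
class SketchDataSet:
    """Stand-in for a dataset that remembers corrections from its sidecar."""
    path: str
    default_corrections: Optional[SketchCorrections] = None


def pick_corrections(
    dataset: SketchDataSet, corrections: Optional[SketchCorrections] = None
) -> Optional[SketchCorrections]:
    """Explicitly passed corrections win; otherwise use the dataset default."""
    return corrections if corrections is not None else dataset.default_corrections


ds = SketchDataSet("rec.tvips", SketchCorrections("from sidecar"))
used_default = pick_corrections(ds)
used_override = pick_corrections(ds, SketchCorrections("explicit"))
```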

sk1p · 2022-01-24T13:08:01Z

I'm adding this to the 0.10 milestone - @jhgee thank you for kick-starting this discussion again! If you'd like to help out on building this feature, it would be good to have a concrete specification for the sidecar file, and maybe have a go at implementing a prototype. If you encounter any difficulties, we are here to help!

sk1p added file formats and I/O discussion labels May 4, 2020

jhgee mentioned this issue May 29, 2020

sync_offset, reshape datasets and get coords for UDFs #793

Merged

7 tasks

sk1p mentioned this issue Jun 8, 2020

Corrections: support for loading correction data via the GUI #807

Open

sk1p mentioned this issue Sep 8, 2020

MRC DataSet support #873

Merged

3 tasks

sk1p mentioned this issue Nov 26, 2020

Electronic labbook integration #893

Closed

5 tasks

sk1p added this to the 0.10 milestone Jan 24, 2022

uellue modified the milestones: 0.10, 0.11 Jun 1, 2022

matbryan52 linked a pull request Jan 10, 2023 that will close this issue

WIP: Dataset Sidecar / Config prototype #1376

Draft

7 tasks

matbryan52 linked a pull request Jan 11, 2023 that will close this issue

WIP: Dataset Sidecar / Config prototype #1376

Draft

7 tasks

sk1p modified the milestones: 0.11, 0.12 Mar 21, 2023

sk1p removed this from the 0.12 milestone Jul 26, 2023
