
name and the nwb.file schema #552

Open
sneakers-the-rat opened this issue Aug 29, 2023 · 5 comments
@sneakers-the-rat

Trying to figure out how the name field works in containers:

When a dataset/group, say TimeSeries for concreteness, is stored in an NWBFile, it is stored under a name in a top-level container group, like /acquisition/raw_running_wheel_rotation

TimeSeries and its ancestor classes do not have a name field, however (except for the name field in the NWB schema language which optionally sets the name of the class, but it is unset for classes like TimeSeries that define a type with neurodata_type_def).

nwb-storage docs describe a name field, but there seems to be an ambiguity between the class-level name and the implicit instance name - https://nwb-storage.readthedocs.io/en/latest/storage_hdf5.html

The description of how a dataset is to be saved is relatively abstract (which is good) in the nwb.file schema, specifying just that the file contains groups like acquisition, which seem to use sub-groups as a list of allowable types - a special case not described in the schema language spec.

The name property seems to be implemented in the hdmf.container.AbstractContainer class - https://github.com/hdmf-dev/hdmf/blob/e801d9ee76e73ebfc8bf926e64a5a1a65337aebe/src/hdmf/container.py#L285 - but the schema for Container also lacks a name field - https://github.com/hdmf-dev/hdmf-common-schema/blob/9b2580a21647e06be54708fabf2d44cef73d32cb/common/base.yaml#L7

The implementation of add_acquisition calls _add_acquisition_internal, which seems to be specified in an abstract way in __clsconf__ but I can't trace it through the logic in hdmf to see how that actually works.

So my question is basically: am I missing something in the schema, or is there some accounting somewhere of things that are "outside" the schema? I'm trying to make as general a schema mapping as possible.

Additionally, any help understanding how the __clsconf__ works and where _add_acquisition_internal might be found would be lovely :)

@oruebel

oruebel commented Aug 29, 2023

The name property seems to be implemented in the hdmf.container.AbstractContainer class - https://github.com/hdmf-dev/hdmf/blob/e801d9ee76e73ebfc8bf926e64a5a1a65337aebe/src/hdmf/container.py#L285 - but the schema for Container also lacks a name field - https://github.com/hdmf-dev/hdmf-common-schema/blob/9b2580a21647e06be54708fabf2d44cef73d32cb/common/base.yaml#L7

The language is designed for hierarchical data organizations. Any instance of a data object (dataset, group, attribute, etc.) must have a name in order to place it in the hierarchy. As such, the name of a data object is an intrinsic property that is required to store and locate it in the data hierarchy (similar to how all files and folders on your computer's file system must have a name). In a hierarchical data model, the name (or path) of an object is its primary key. Because name is required for all data objects in the hierarchical data model, we cannot describe it as a separate field (that field would then also need a name field, and so on). Long story short: any data object must have a name, and the name is an intrinsic property of the object.

If you contrast this with a data model that is more oriented around collections of objects (e.g., an object store), there you instead have some form of an id that serves as the primary key. E.g., in the simplest case, if you just have a list of objects, the index of an object in that list is an intrinsic property that identifies it, but that index is not stored or modeled as a field of the object.
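The contrast can be sketched in a few lines of plain Python (a toy illustration of the two models, not NWB or hdmf API; all names here are made up):

```python
# Toy contrast between the two data models (illustration only, not NWB API).

# Hierarchical model: the name IS the primary key. It lives in the
# containing group's mapping, not as a field on the object itself.
hierarchy = {
    "acquisition": {
        "raw_running_wheel_rotation": {"neurodata_type": "TimeSeries"},
    }
}
obj = hierarchy["acquisition"]["raw_running_wheel_rotation"]
assert "name" not in obj  # the name/path locates the object; it is not a field

# Collection model: the index (or an id) identifies the object, and is
# likewise not stored as a field of the object.
collection = [{"neurodata_type": "TimeSeries"}, {"neurodata_type": "TimeSeries"}]
second = collection[1]  # identified purely by its position in the list
```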

except for the name field in the NWB schema language which optionally sets the name of the class

This is not fully correct. The name of the class is defined by the neurodata_type_def key. The name field in the schema defines the name of the data object (i.e., the instance of the neurodata_type). As such, when a name is defined in the schema, it means that there can only be a single instance of that class at a particular location in the hierarchy (similar to how you cannot have two files with the same name in the same folder on your file system). I.e., any type in the schema with a name specified can only have a quantity of 0 or 1. Because of this, name is not specified for many neurodata_types in the schema, in particular any neurodata_type where we may need to create many instances within the same group (i.e., location in the hierarchy).

nwb-storage docs describe a name field, but there seems to be an ambiguity between the class-level name and the implicit instance name - https://nwb-storage.readthedocs.io/en/latest/storage_hdf5.html

The storage model is hierarchical. The name field (for groups, datasets, attributes etc.) here always refers to the name of the instance of the data object. The storage model (because it is hierarchical) in many ways doesn't really care about the class (neurodata_type) of the data objects. Because of this, we need to store the name of the class (i.e., the value of neurodata_type_def key in the schema) explicitly as an Attribute. E.g., if you look at the storage definition for a Group https://nwb-storage.readthedocs.io/en/latest/storage_hdf5.html#groups it specifies that the neurodata_type key in the schema is mapped to the Attribute neurodata_type in the storage. This is also the reason why Attributes in the schema cannot have a neurodata_type_def key, because we need to store the neurodata_type as an Attribute in the hierarchical data model (Attributes cannot have Attributes in HDF5).

Additionally, any help understanding how __clsconf__ works and where _add_acquisition_internal might be found would be lovely :)

With __clsconf__ you start diving into Metaclass magic as part of the MultiContainerInterface. The following tutorial explains how to use __clsconf__: https://github.com/hdmf-dev/hdmf/blob/e801d9ee76e73ebfc8bf926e64a5a1a65337aebe/src/hdmf/container.py#L883 Based on the definition in __clsconf__, the MultiContainerInterface automatically generates add, create, and get methods. To see how those functions are generated you would need to look at the __build* methods in the MultiContainerInterface class definition here: https://github.com/hdmf-dev/hdmf/blob/e801d9ee76e73ebfc8bf926e64a5a1a65337aebe/src/hdmf/container.py#L883.

So for _add_acquisition_internal, the __clsconf__ entry here in the definition of NWBFile, which says

{
   'attr': 'acquisition',
   'add': '_add_acquisition_internal',
   'type': (NWBDataInterface, DynamicTable),
   'get': 'get_acquisition'
}

tells MultiContainerInterface that for acquisition it should create a private add method called _add_acquisition_internal and a public get method called get_acquisition. I.e., the _add_acquisition_internal method is auto-generated by the Metaclass.
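The generation pattern can be mimicked in a few lines (a deliberately simplified stand-in for what the Metaclass does, not hdmf's actual implementation; apply_clsconf and the bare-bones classes are made up for illustration):

```python
# Simplified stand-in for MultiContainerInterface's method generation.
# Not hdmf's real code -- just the general pattern of turning a
# __clsconf__-style entry into add/get methods attached to the class.

def apply_clsconf(cls, conf):
    attr, allowed = conf["attr"], conf["type"]

    def add(self, obj):
        if not isinstance(obj, allowed):
            raise TypeError(f"expected an instance of {allowed}")
        getattr(self, attr)[obj.name] = obj  # the instance name is the key
        return obj

    def get(self, name):
        return getattr(self, attr)[name]

    setattr(cls, conf["add"], add)
    setattr(cls, conf["get"], get)


class TimeSeries:
    def __init__(self, name):
        self.name = name


class NWBFile:
    def __init__(self):
        self.acquisition = {}


apply_clsconf(NWBFile, {
    "attr": "acquisition",
    "add": "_add_acquisition_internal",
    "type": (TimeSeries,),
    "get": "get_acquisition",
})

nwbfile = NWBFile()
nwbfile._add_acquisition_internal(TimeSeries("raw_running_wheel_rotation"))
```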

@sneakers-the-rat

sneakers-the-rat commented Aug 30, 2023

OK! THANK YOU!

This clears up what has been the most mysterious part of reading the spec:

Both neurodata_type_def and neurodata_type_inc are optional keys. To enable the unique identification, every group (and dataset) must either have a fixed name and/or a unique data type. This means, any group (or dataset) with a variable name must have a unique data type.

I wasn't sure what that meant, because it seemed like name would be quasi-synonymous with neurodata_type_def as a means of making a class: there are cases where e.g. neurodata_type_inc is used without a name (e.g. in Images) that seemed to imply instantiation of a class. This makes the rest of the type system make much more sense to me and should allow me to really simplify my model of the schema.

So let me see if I've got this correct: every instantiated object that goes in an NWB file needs a name. Objects without names are either classes that can be instantiated on their own, or instantiated as part of a group. Things with fixed names are necessarily part of a hierarchical class object. So is it correct that neurodata_type_def and name are mutually exclusive fields?

I think this would be great to have in the docs, because it took me quite a bit of puzzling to make sense of the class and inheritance system. Maybe amend this section to further describe that the presence or absence of a name says something meaningful about the disposition of the object - i.e., that "any group (or dataset) with a variable name must have a unique data type." also means that such a group or dataset must be given a name on instantiation?

That could also clarify the quantity property - i.e., when a name is present, quantity indicates that the property itself can have multiple values, but when the name is not present, there will be multiple named instances of the property.

So, e.g. (ignoring incompleteness and other schema violations, just for the sake of a minimal illustration to test my understanding):

Named

Schema:

- neurodata_type_def: MyType
  datasets:
  - name: keywords
    quantity: "*"
    dtype: text

Instance:

MyInstance:
  is_a: MyType
  datasets:
    keywords:
    - myKeyword1
    - myKeyword2
    - ...

Unnamed

Schema:

- neurodata_type_def: MyType
  datasets:
  - neurodata_type_inc: keyword
    quantity: "*"

Instance:

MyInstance:
  is_a: MyType
  datasets:
    MyKeywordName1:
      is_a: keyword
      value: MyKeyword
    MyKeywordName2:
      is_a: keyword
      value: MyKeyword2

In fact, looking at the schema now, it seems like * is only ever used with unnamed classes, and + or ? is only used with named classes. And maybe correspondingly neurodata_type_inc without a name has to be a group?


and thank you for the info on __clsconf__ - I figured they were autogenerated methods, I just couldn't tell how they were being generated. So does that mean that all those add and get methods are generic/identical? And they differ in that they add instances to different locations in the hierarchy?

It looks like the link for the tutorial is the same as the link to the class definition, but I would love to see the tutorial.

Thanks for the answer!! I need to go revise my translation code to preserve nesting of unnamed classes and add a required name slot to the subclass when one isn't present.

@oruebel

oruebel commented Aug 30, 2023

So is it correct that neurodata_type_def and name are mutually exclusive fields?

No. Think of neurodata_type_def as the name of the class and name as the name of the instance. If both neurodata_type_def and name are specified, it just means that when you create an instance it must have the name specified in the schema; if name is not specified, then an instance can have any user-defined name. As I mentioned above:

when defining a name in the schema, it means that there can only be a single instance of that class at a particular location in the hierarchy (similar to how you cannot have two files with the same name in the same folder on your file system). Because of this, this part of your example is not valid:

datasets:
  - name: keywords
    quantity: "*"
    dtype: text

The names of instances at a particular location must be unique, i.e., you cannot have multiple instances of a dataset with the name keywords in the same place (similar to how you can't have multiple files called myfile.txt in the same folder on your filesystem). So you cannot have quantity: * when name is specified in the schema; quantity can only be 1 or ? (i.e., 0 or 1) in this case. If you want to create multiple instances of the same dataset, it must have a neurodata_type (either neurodata_type_def and/or neurodata_type_inc). I.e., this part of your example is correct:

- neurodata_type_def: MyType
  datasets:
  - neurodata_type_inc: keyword
    quantity: "*"
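As a sketch of the rule (illustrative Python, not nwb-schema's actual validator; check_name_quantity is a made-up helper):

```python
# Sketch of the constraint described above (not official validator code):
# a fixed name in the schema forces quantity to be 1 or '?' (0 or 1).

def check_name_quantity(spec):
    quantity = spec.get("quantity", 1)  # quantity defaults to 1
    if "name" in spec and quantity not in (1, "?"):
        raise ValueError(
            f"spec named {spec['name']!r} cannot have quantity {quantity!r}"
        )
    return spec

check_name_quantity({"name": "keywords", "quantity": "?"})               # valid
check_name_quantity({"neurodata_type_inc": "keyword", "quantity": "*"})  # valid
```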

That also could clarify the quantity property - ie. when a name is present, quantity indicates that the property itself can have multiple values, but when the name is not present then there will be multiple named instance of the property.

quantity indicates "how many instances of the neurodata_type (class) you can create". E.g., quantity=1 means there must be exactly one instance of this type at that particular location, and quantity=* means you can have any number (0...*) of instances.

In fact, looking at the schema now, it seems like * is only ever used with unnamed classes

Yes, because when you have a fixed instance name set in the schema then the quantity can only be 0 or 1. Only Datasets and Groups in the schema with a neurodata_type can be instantiated multiple times.

So does that mean that all those add and get methods are generic/identical?

That is correct for the autogenerated add and get methods, with the caveat that they expect objects of a particular neurodata_type (class) to add.

they differ in that they add instances to different locations in the hierarchy?

They differ in the type of objects they add. The location in the hierarchy is not necessarily different. Consider you have buttons in your file browser for Create Text File and Create HTML File. Each time you hit one of the buttons, it will create a new text or HTML file in the same folder (i.e., location) that you have open in your file browser. Each time you hit one of the Create buttons, the browser will ask you to give the new file a name, and if a file with the same name already exists, it will tell you that you can't create the file because another file with that name already exists.

@sneakers-the-rat

Alright thanks again, I think I've got it now.

name:

  • instance name is a required property for datasets and groups
  • instance name can be set either
    • at instantiation
    • by a fixed, class-level name
  • instance name is orthogonal to type declaration/inclusion
  • top-level classes (i.e. those with neurodata_type_def) typically do not have a fixed name, but are allowed to (in the core schema there are 6 such named top-level classes, all in nwb.file, describing specific tables within the file).

So in LinkML I would model unnamed classes with a required name property, meaning that on instantiation they need to be given a name, e.g.

Timeseries:
  attributes:
    name:
      required: true
      range: string
    data:
      range: Timeseries__Data

and I would model a named class like:

Timeseries__Data:
  attributes:
    name:
      required: true
      ifabsent: "data"
      equals_string: "data"
      range: string

quantity:

  • quantity is only meaningful for groups and datasets that are themselves within a group - quantity is undefined for top-level classes (i.e. datasets/groups with neurodata_type_def) because it is explicitly in reference to the allowable quantity of a class at a given point in the hierarchy
  • only unnamed classes can have a quantity '*' which indicates that any number of the same kind of instances can be present at that point in the hierarchy with unique names
  • only unnamed classes can have a quantity '+' which indicates one or more ...(ibid)
  • a class is indicated as required...
    • for named classes, with quantity 1 (default)
    • for unnamed classes, with quantity 1, or '+' for "at least one"
  • a class is indicated as optional...
    • for named classes, with quantity '?'
    • for unnamed classes, with quantity '*' (and theoretically also '?' but I don't see any classes that do this in the core schema)

which also resolves my question about using groups like

groups:
- neurodata_type_inc: NWBDataInterface
  doc: Acquired, raw data.
  quantity: '*'
- neurodata_type_inc: DynamicTable
  doc: Tabular data that is relevant to acquisition
  quantity: '*'

as a way of declaring which types are allowed in a group without being a special case.
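To check my reading, the quantity rules above condense into a small lookup (my own sketch of the semantics, not anything from the spec's tooling; interpret_quantity is a made-up name):

```python
# My reading of the quantity semantics, as (required, multivalued) pairs.
# '+' and '*' are only usable for unnamed (typed) classes, per the above.
def interpret_quantity(quantity):
    return {
        1:   (True,  False),  # exactly one (the default)
        "?": (False, False),  # optional: zero or one
        "+": (True,  True),   # required: one or more
        "*": (False, True),   # optional: zero or more
    }[quantity]
```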

Think I am squared away on that and will update my model!


They differ in the type of objects they add. The location in the hierarchy is not necessarily different.

OK, this is a little confusing to me: doesn't add_acquisition add an NWBDataInterface or DynamicTable underneath /acquisition, and similarly add_stimulus add a TimeSeries underneath /stimulus? So they differ both in type and in hierarchy location? Wouldn't I be able to create a dataset under /acquisition with the same name as one under /stimulus?

so

nwbfile.add_acquisition(TimeSeries(name='myname', ...))
nwbfile.add_stimulus(TimeSeries(name='myname', ...))

would make

/acquisition/myname
/stimulus/myname
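i.e. my mental model is that name uniqueness is scoped to the containing group, sketched as (toy code to test my understanding, not pynwb API):

```python
# Toy sketch of per-group name uniqueness (not pynwb API).
class Group:
    def __init__(self):
        self.children = {}

    def add(self, name, obj):
        # names only need to be unique within THIS group
        if name in self.children:
            raise ValueError(f"{name!r} already exists in this group")
        self.children[name] = obj

root = {"acquisition": Group(), "stimulus": Group()}
root["acquisition"].add("myname", object())  # -> /acquisition/myname
root["stimulus"].add("myname", object())     # -> /stimulus/myname: no clash
```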

Thanks again for the time spent clarifying! I would be happy to help draft additions to the docs to clarify these things; the role of name in particular could do with a paragraph or two.

@sneakers-the-rat

There, I think that's pretty nice :)

  • handling unnamed container classes correctly
  • handling using groups of multiple unnamed subgroups as a range constraint without
    making an additional subclass
  • handling mixed case with some '*' quantity unnamed groups and some other named groups by snake-casing the neurodata_type_inc (as I think I noticed pynwb does)
  • amended generator to add const if a default value is declared along with a matching equals_string

(with some fields omitted in the YAML to emphasize the nice parts)

NWBFile:
  name: NWBFile
  description: An NWB file storing cellular-based neurophysiology data from a single
    experimental session.
  is_a: NWBContainer
  attributes:
    name:
      name: name
      ifabsent: string(root)
      range: string
      required: true
      equals_string: root
    file_create_date:
      name: file_create_date
      description: '...'
      multivalued: true
      range: isodatetime
      required: true
    acquisition:
      name: acquisition
      multivalued: true
      any_of:
      - range: NWBDataInterface
      - range: DynamicTable
    intervals:
      name: intervals
      description: Experimental intervals, whether that be logically distinct sub-experiments
        having a particular scientific goal, trials (see trials subgroup) during
        an experiment, or epochs (see epochs subgroup) deriving from analysis of
        data.
      multivalued: false
      range: NWBFile__intervals
      required: false
  tree_root: true

NWBFile__intervals:
  name: NWBFile__intervals
  description: '...'
  attributes:
    name:
      name: name
      ifabsent: string(intervals)
      range: string
      required: true
      equals_string: intervals
    epochs:
      name: epochs
      description: Divisions in time marking experimental stages or sub-divisions
        of a single recording session.
      multivalued: false
      range: TimeIntervals
      required: false
    time_intervals:
      name: time_intervals
      description: Optional additional table(s) for describing other experimental
        time intervals.
      multivalued: true
      range: TimeIntervals
      required: false

generates pydantic model:

class NWBFile(NWBContainer):
    """
    An NWB file storing cellular-based neurophysiology data from a single experimental session.
    """
    nwb_version: Optional[str] = Field(None, description="""File version string. Use semantic versioning, e.g. 1.2.1. This will be the name of the format with trailing major, minor and patch numbers.""")
    name: str = Field("root", const=True)
    file_create_date: List[datetime ] = Field(default_factory=list, description="""...""")
    identifier: str = Field(..., description="""...""")
    session_description: str = Field(..., description="""...""")
    session_start_time: datetime  = Field(..., description="""...""")
    timestamps_reference_time: datetime  = Field(..., description="""...""")
    acquisition: Optional[List[Union[DynamicTable, NWBDataInterface]]] = Field(default_factory=list)
    analysis: Optional[List[Union[DynamicTable, NWBContainer]]] = Field(default_factory=list)
    scratch: Optional[List[Union[DynamicTable, NWBContainer]]] = Field(default_factory=list)
    processing: Optional[List[ProcessingModule]] = Field(default_factory=list)
    stimulus: NWBFileStimulus = Field(..., description="""...""")
    general: NWBFileGeneral = Field(..., description="""...""")
    intervals: Optional[NWBFileIntervals] = Field(None, description="""...""")
    units: Optional[Units] = Field(None, description="""...""")
    
class NWBFileIntervals(ConfiguredBaseModel):
    """
    Experimental intervals, whether that be logically distinct sub-experiments having a particular scientific goal, trials (see trials subgroup) during an experiment, or epochs (see epochs subgroup) deriving from analysis of data.
    """
    name: str = Field("intervals", const=True)
    epochs: Optional[TimeIntervals] = Field(None, description="""Divisions in time marking experimental stages or sub-divisions of a single recording session.""")
    trials: Optional[TimeIntervals] = Field(None, description="""Repeated experimental events that have a logical grouping.""")
    invalid_times: Optional[TimeIntervals] = Field(None, description="""Time intervals that should be removed from analysis.""")
    time_intervals: Optional[List[TimeIntervals]] = Field(default_factory=list, description="""Optional additional table(s) for describing other experimental time intervals.""")
    
