Feature request: dandi filenames with +acquisition data #1265

Open
colleenjg opened this issue Apr 5, 2023 · 43 comments

@colleenjg

I am adding a new version for each session in my Dandiset. This version has the full optical physiology video added as an acquisition object. The resulting files are about 40 GB bigger than the original versions, which is why I'd like them both to be available on the archive.

Running dandi organize, however, results in this message: 1 out of 2 paths are not unique. We will try adding _obj- based on crc32 of object_id.

Given that the acquisition module is typically a very major component of a file, is there any way to have the file names reflect whether it is present, e.g. +acquisition (just like there is currently +behavior, +ophys and +image)? In my case, this would make interpreting the file versions in my dandiset much more straightforward than the more arbitrary object_id-related identifier.
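For illustration, the two versions of one session might end up named like this after organize disambiguates them (subject/session labels and hashes here are made up):

sub-01_ses-01_obj-1a2b3c_behavior+ophys.nwb    (lightweight version)
sub-01_ses-01_obj-4d5e6f_behavior+ophys.nwb    (version with the full video in acquisition)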

@satra
Member

satra commented Apr 5, 2023

@colleenjg - any reason not to put them in the same nwb file?

ps. we are currently debating whether we create a special folder for derivatives in the dandiset, or allow for separate dandisets, one containing raw data, and another containing data after processing (cell extractions, spikes, etc.).

@colleenjg
Author

@satra yeah - it's really been my ongoing debate about whether to have multiple versions of the files or just one. A lot of the compute I use is offline (clusters with offline compute nodes), so streaming isn't always an option. And in general, I haven't moved to fully remote compute - I still do a lot locally, as do a lot of my colleagues.

My full dataset is 2.2 TB with the imaging stacks, which is just a massive amount of data to download and store. It's a very manageable 15 GB without the imaging stacks, which is why I'm keen to keep the "basic" file versions on Dandi.

@satra
Member

satra commented Apr 5, 2023

in that case, let's do an experiment: would you be willing to put the raw data into a separate dandiset? then we can link the derived data to it. i can show you how at the asset level; on the dandiset level, it's going to be via resource links.

@colleenjg
Author

yeah! Happy to give it a try!

@satra
Member

satra commented Apr 5, 2023

@CodyCBakerPhD, @bendichter - any suggestions? there doesn't seem to be a neural datatype distinction we can make. perhaps something to consider in the schema. this has come up a few times. it's similar to how cameras store raw, raw+jpeg, or jpeg.

@bendichter
Contributor

bendichter commented Apr 5, 2023

@satra there actually are neurodata type distinctions here. Raw ophys data has the neurodata type TwoPhotonSeries (or the recent addition, OnePhotonSeries). Processed ophys does not have these but has the data types PlaneSegmentation, ImageSegmentation, and RoiResponseSeries. dandi-cli groups all of these into "ophys", but we could just as easily separate them into e.g. "ophys_raw", "ophys_processed", and "ophys_raw_and_processed".

By the way, the same is true for extracellular electrophysiology. Raw has ElectricalSeries, and processed does not but has Units. These could be split into different categories of raw and processed but currently are not.
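A rough sketch of that type-based split (not dandi-cli code; it assumes a pynwb recent enough to ship OnePhotonSeries and the NWBFile.objects mapping of all contained objects):

from pynwb.ophys import (
    ImageSegmentation,
    OnePhotonSeries,
    PlaneSegmentation,
    RoiResponseSeries,
    TwoPhotonSeries,
)

RAW_OPHYS_TYPES = (TwoPhotonSeries, OnePhotonSeries)
PROC_OPHYS_TYPES = (ImageSegmentation, PlaneSegmentation, RoiResponseSeries)

def ophys_category(nwbfile):
    # nwbfile.objects maps object_id -> every neurodata object in the file
    objs = list(nwbfile.objects.values())
    raw = any(isinstance(o, RAW_OPHYS_TYPES) for o in objs)
    proc = any(isinstance(o, PROC_OPHYS_TYPES) for o in objs)
    if raw and proc:
        return "ophys_raw_and_processed"
    if raw:
        return "ophys_raw"
    if proc:
        return "ophys_processed"
    return None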

@satra
Member

satra commented Apr 5, 2023

thanks @bendichter. @yarikoptic - seems like we should do a remapping. this was done very early in dandi's lifecycle. i'll open a separate issue for tackling this as well as techniques and modalities together.

@colleenjg
Author

I think that would be really great for my use case, and hopefully many others.

@CodyCBakerPhD
Contributor

One thing I will mention while endeavoring upon this experiment - please be sure that there is no difference between the 'acquisition-only' files and any derivatives in terms of the metadata structure, the only difference being the absence of the series containing the bulk data in downstream derivatives.

In particular, please ensure that all of the hierarchical metadata belonging to that series (which is very lightweight) remains in any processed files; that is, imaging planes, optical channel info, and devices (for ecephys: the electrode table, electrode groupings, and devices).

That metadata remains very informative and valuable for interpreting any derived data, even in the absence of the actual raw series.
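For instance, a minimal sketch of such a processed-only file, using pynwb's mock helpers (assuming a pynwb version that ships pynwb.testing.mock):

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.ophys import mock_ImagingPlane

nwbfile = mock_NWBFile()
plane = mock_ImagingPlane()   # lightweight: imaging plane + optical channel + device
nwbfile.add_device(plane.device)
nwbfile.add_imaging_plane(plane)
# downstream PlaneSegmentation / RoiResponseSeries objects can then reference
# `plane`, even though no raw TwoPhotonSeries is kept in /acquisition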

@bendichter
Contributor

@jwodder here's the conversation we were discussing. Could you point me to the place where the mapping is made between neurodata types and filenames?

@bendichter
Contributor

@satra we also need to converge on a naming convention. I proposed "ophys_raw", "ophys_processed", and "ophys_raw_and_processed", but I'm not sure that's the best.

@jwodder
Member

jwodder commented Apr 10, 2023

@bendichter

Could you point me to the place where the mapping is made between neurodata types and filenames?

I don't understand the question. Can you give an example of the mapping you're referring to?

@bendichter
Contributor

bendichter commented Apr 10, 2023

@jwodder

dataset contains ElectricalSeries -> suffix = "ecephys"
dataset contains TwoPhotonSeries -> suffix = "ophys"
etc.
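In code form, roughly (a sketch of the mapping being asked about, not an actual dandi-cli object):

NDTYPE_TO_SUFFIX = {
    "ElectricalSeries": "ecephys",
    "TwoPhotonSeries": "ophys",
    # ... one entry per neurodata type
}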

@jwodder
Member

jwodder commented Apr 10, 2023

@bendichter That's not defined by dandi anywhere. I think those strings may come from the pynwb metadata.

@satra
Member

satra commented Apr 10, 2023

@jwodder - somewhere in organize a renaming is taking place with suffixes being added to the filename. could you please point @bendichter to that location? indeed the information about the suffixes is coming from pynwb metadata, but the name is being created somewhere in organize.

@bendichter
Contributor

I think this dictionary is used:

dandi-cli/dandi/metadata.py

Lines 748 to 935 in cfb39be

neurodata_typemap: Dict[str, Neurodatum] = {
    "ElectricalSeries": {
        "module": "ecephys",
        "neurodata_type": "ElectricalSeries",
        "technique": "multi electrode extracellular electrophysiology recording technique",
        "approach": "electrophysiological approach",
    },
    "SpikeEventSeries": {
        "module": "ecephys",
        "neurodata_type": "SpikeEventSeries",
        "technique": "spike sorting technique",
        "approach": "electrophysiological approach",
    },
    "FeatureExtraction": {
        "module": "ecephys",
        "neurodata_type": "FeatureExtraction",
        "technique": "spike sorting technique",
        "approach": "electrophysiological approach",
    },
    "LFP": {
        "module": "ecephys",
        "neurodata_type": "LFP",
        "technique": "signal filtering technique",
        "approach": "electrophysiological approach",
    },
    "EventWaveform": {
        "module": "ecephys",
        "neurodata_type": "EventWaveform",
        "technique": "spike sorting technique",
        "approach": "electrophysiological approach",
    },
    "EventDetection": {
        "module": "ecephys",
        "neurodata_type": "EventDetection",
        "technique": "spike sorting technique",
        "approach": "electrophysiological approach",
    },
    "ElectrodeGroup": {
        "module": "ecephys",
        "neurodata_type": "ElectrodeGroup",
        "technique": "surgical technique",
        "approach": "electrophysiological approach",
    },
    "PatchClampSeries": {
        "module": "icephys",
        "neurodata_type": "PatchClampSeries",
        "technique": "patch clamp technique",
        "approach": "electrophysiological approach",
    },
    "CurrentClampSeries": {
        "module": "icephys",
        "neurodata_type": "CurrentClampSeries",
        "technique": "current clamp technique",
        "approach": "electrophysiological approach",
    },
    "CurrentClampStimulusSeries": {
        "module": "icephys",
        "neurodata_type": "CurrentClampStimulusSeries",
        "technique": "current clamp technique",
        "approach": "electrophysiological approach",
    },
    "VoltageClampSeries": {
        "module": "icephys",
        "neurodata_type": "VoltageClampSeries",
        "technique": "voltage clamp technique",
        "approach": "electrophysiological approach",
    },
    "VoltageClampStimulusSeries": {
        "module": "icephys",
        "neurodata_type": "VoltageClampStimulusSeries",
        "technique": "voltage clamp technique",
        "approach": "electrophysiological approach",
    },
    "TwoPhotonSeries": {
        "module": "ophys",
        "neurodata_type": "TwoPhotonSeries",
        "technique": "two-photon microscopy technique",
        "approach": "microscopy approach; cell population imaging",
    },
    "OpticalChannel": {
        "module": "ophys",
        "neurodata_type": "OpticalChannel",
        "technique": "surgical technique",
        "approach": "microscopy approach; cell population imaging",
    },
    "ImagingPlane": {
        "module": "ophys",
        "neurodata_type": "ImagingPlane",
        "technique": None,
        "approach": "microscopy approach; cell population imaging",
    },
    "PlaneSegmentation": {
        "module": "ophys",
        "neurodata_type": "PlaneSegmentation",
        "technique": None,
        "approach": "microscopy approach; cell population imaging",
    },
    "Position": {
        "module": "behavior",
        "neurodata_type": "Position",
        "technique": "behavioral technique",
        "approach": "behavioral approach",
    },
    "SpatialSeries": {
        "module": "behavior",
        "neurodata_type": "SpatialSeries",
        "technique": "behavioral technique",
        "approach": "behavioral approach",
    },
    "BehavioralEpochs": {
        "module": "behavior",
        "neurodata_type": "BehavioralEpochs",
        "technique": "behavioral technique",
        "approach": "behavioral approach",
    },
    "BehavioralEvents": {
        "module": "behavior",
        "neurodata_type": "BehavioralEvents",
        "technique": "behavioral technique",
        "approach": "behavioral approach",
    },
    "BehavioralTimeSeries": {
        "module": "behavior",
        "neurodata_type": "BehavioralTimeSeries",
        "technique": "behavioral technique",
        "approach": "behavioral approach",
    },
    "PupilTracking": {
        "module": "behavior",
        "neurodata_type": "PupilTracking",
        "technique": "behavioral technique",
        "approach": "behavioral approach",
    },
    "EyeTracking": {
        "module": "behavior",
        "neurodata_type": "EyeTracking",
        "technique": "behavioral technique",
        "approach": "behavioral approach",
    },
    "CompassDirection": {
        "module": "behavior",
        "neurodata_type": "CompassDirection",
        "technique": "behavioral technique",
        "approach": "behavioral approach",
    },
    "ProcessingModule": {
        "module": "base",
        "neurodata_type": "ProcessingModule",
        "technique": "analytical technique",
        "approach": None,
    },
    "RGBImage": {
        "module": "image",
        "neurodata_type": "RGBImage",
        "technique": "photographic technique",
        "approach": None,
    },
    "DecompositionSeries": {
        "module": "misc",
        "neurodata_type": "DecompositionSeries",
        "technique": "fourier analysis technique",
        "approach": None,
    },
    "Units": {
        "module": "misc",
        "neurodata_type": "Units",
        "technique": "spike sorting technique",
        "approach": "electrophysiological approach",
    },
    "Spectrum": {
        "module": "ndx-spectrum",
        "neurodata_type": "Spectrum",
        "technique": "fourier analysis technique",
        "approach": None,
    },
    "OptogeneticStimulusSIte": {
        "module": "ogen",
        "neurodata_type": "OptogeneticStimulusSIte",
        "technique": None,
        "approach": "optogenetic approach",
    },
    "OptogeneticSeries": {
        "module": "ogen",
        "neurodata_type": "OptogeneticSeries",
        "technique": None,
        "approach": "optogenetic approach",
    },
}

@jwodder
Member

jwodder commented Apr 10, 2023

@satra If by "suffixes" you mean the filename substring of the form _WORD[+WORD][+WORD], that's the modalities field. The individual modalities are extracted by _populate_modalities(), and they're joined together and added to the filename as part of _assign_dandi_names().

@jwodder
Member

jwodder commented Apr 10, 2023

@bendichter That dict is only used when converting NWB metadata to Dandi Schema asset metadata. It plays no role in file organization.

@satra
Member

satra commented Apr 10, 2023

thank you @jwodder. @bendichter the populate modalities function and its underlying call to get at the modalities from pynwb is what you are looking for at the moment.

@bendichter
Contributor

bendichter commented Apr 10, 2023

oh, ok I get it. In get_neurodata_types_to_modalities_map we are getting the module of the neurodata type and using that for the name. The output of this function is the dictionary we are talking about. While it's nice to have a perfect 1-to-1 with the modules of PyNWB, there is nothing stopping us from making our names more descriptive, adding information like whether the dataset is acquisition data, processed data, or both. That would require us to manually create a map from neurodata type to filename "suffix".

@satra
Member

satra commented Apr 10, 2023

@bendichter - you can also add more info to that dictionary you pointed to in metadata.py if that helps, and use it in that function. we do want to keep that suffix space somewhat compact though, especially given nwb can store a lot of different types of info.

@satra
Member

satra commented Apr 10, 2023

even better if a dictionary like that comes from pynwb itself!! :)

@bendichter
Contributor

I think what we are looking for here is a distinction between raw and processed data.

One way to do this would be to look at neurodata types. PyNWB does not make this distinction, so we would need to do it ourselves.

Another way, which is probably more accurate, is to look at the location of the neurodata object within the file: whether the data is in /acquisition or /processing. This is more accurate because there are cases where a data type that would normally be considered acquisition is in fact processing. For instance, an ElectricalSeries is usually acquisition, but an LFP signal obtained through offline downsampling would be saved as an ElectricalSeries inside an LFP object inside a ProcessingModule. I can create a function that uses the location information, but we'll need to create a naming function that is a bit more complex than a simple ndtype -> name dictionary.
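To make that LFP case concrete, a minimal sketch using pynwb's mock helpers:

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.ecephys import LFP

nwbfile = mock_NWBFile()
# same neurodata type (ElectricalSeries), but stored under /processing, so its
# location -- not its type -- is what marks it as processed data
lfp = LFP(electrical_series=mock_ElectricalSeries())
ecephys_module = nwbfile.create_processing_module(
    name="ecephys", description="offline-downsampled LFP"
)
ecephys_module.add(lfp)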

@satra
Member

satra commented Apr 10, 2023

@bendichter - i think your knowledge of nwb will help in creating that function. i think i was saying that we can start with putting it in the CLI, but perhaps some of these nwb things could go into pynwb itself, and that way others could use it too.

@satra
Member

satra commented Apr 12, 2023

@bendichter - are you going to send a PR for this? just checking so that @colleenjg can have a version before she needs to upload the data.

@bendichter
Contributor

@satra yes, I'm working on it. To do this right we would need to base the name creation on the NWB file rather than the metadata extracted from the NWB file, so it's going to take some time. I can try to have this done by end of week.

@colleenjg
Author

@bendichter @satra End of week would be great for me - that should give me a few days to upload the large files with the raw imaging data.

@bendichter
Contributor

How about this for a naming convention:

'ecephys(acq)'
'ecephys(proc)'
'ecephys(acq+proc)'
'ecephys(acq+proc)+ophys(acq)'
'ophys(proc)+behavior(proc)'

@bendichter
Contributor

After thinking more about it, @satra, you were right. In @colleenjg's particular case we could have looked solely at neurodata types, but if we want to solve the problem more generally we will need to look not only at what neurodata types are present but also at where those types are in the file. Specifically, acquisition data can be found in the /acquisition group, and processed data can be found in the /processing group. The metadata object does not hold information about where the neurodata objects are in the NWB file, so I instead determined the modality section of the filename from the NWBFile object. I am not sure how I would incorporate this into dandi organize. It is not a drop-in replacement for the existing function, since the existing function uses metadata as input, not the opened NWBFile. Maybe we could maintain a dictionary as we do the initial metadata parsing.

Here's the code:

import importlib
import inspect
import pynwb

module_names = ("ecephys", "ophys", "icephys", "image", "retinotopy", "ogen", "behavior")

module_map = dict()
for module_name in module_names:
    module = importlib.import_module("pynwb." + module_name)
    for x in module.__dict__.values():
        if inspect.isclass(x) and \
        issubclass(x, pynwb.core.NWBMixin) and \
        x.__module__ == "pynwb." + module_name:
            module_map[x] = module_name
            
def modalities_string(nwbfile: pynwb.file.NWBFile):
    
    def get_processing_neurodata_objects(nwbfile: pynwb.file.NWBFile):
        for mod in nwbfile.processing.values():
            for data_interface in mod.data_interfaces.values():
                yield data_interface
                
    def check_for_module(list_of_ndobjects: list, module_name: str):
        for ndobj in list_of_ndobjects:
            # walk the MRO so subclasses of mapped types are detected too
            for cls in type(ndobj).mro():
                if module_map.get(cls) == module_name:
                    return True
        return False
                 
    processing_neurodata_objects = list(get_processing_neurodata_objects(nwbfile))
    
    module_strs = []
    for module_name in module_names:
        
        acq = check_for_module(nwbfile.acquisition.values(), module_name)
        proc = check_for_module(processing_neurodata_objects, module_name)
        
        # handle the Units table, which is an exception to the rule.
        if module_name == "ecephys" and nwbfile.units is not None:
            proc = True
            
        if not (acq or proc):
            continue
        elif acq and not proc:
            module_strs.append(f"{module_name}(acq)")
        elif not acq and proc:
            module_strs.append(f"{module_name}(proc)")
        else:
            module_strs.append(f"{module_name}(acq+proc)")
            
    return "+".join(module_strs)

here are some tests:

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.ophys import mock_TwoPhotonSeries, mock_OnePhotonSeries, mock_DfOverF
from pynwb.testing.mock.behavior import mock_SpatialSeries
from pynwb.ecephys import ElectricalSeries

# ecephys acquisition
nwbfile = mock_NWBFile()
nwbfile.add_acquisition(mock_ElectricalSeries())
assert modalities_string(nwbfile) == 'ecephys(acq)'

# ecephys processing
nwbfile2 = mock_NWBFile()
proc_mod = nwbfile2.create_processing_module("ecephys", "ecephys")
#proc_mod.add(mock_ElectricalSeries())
nwbfile2.add_unit(spike_times=[1.0, 2.0, 3.0])
assert modalities_string(nwbfile2) == 'ecephys(proc)'

# ecephys acquisition and processing
nwbfile3 = mock_NWBFile()
nwbfile3.add_acquisition(mock_ElectricalSeries())
proc_mod = nwbfile3.create_processing_module("ecephys", "ecephys")
proc_mod.add(mock_ElectricalSeries())
assert modalities_string(nwbfile3) == 'ecephys(acq+proc)'

# ecephys acquisition and processing and ophys acq
nwbfile4 = mock_NWBFile()
nwbfile4.add_acquisition(mock_ElectricalSeries())
nwbfile4.add_acquisition(mock_TwoPhotonSeries())
proc_mod = nwbfile4.create_processing_module("ecephys", "ecephys")
proc_mod.add(mock_ElectricalSeries())
assert modalities_string(nwbfile4) == 'ecephys(acq+proc)+ophys(acq)'

# ophys processing and behavior processing
nwbfile5 = mock_NWBFile()
proc_mod = nwbfile5.create_processing_module("ophys", "ophys")
proc_mod.add(mock_DfOverF())
proc_mod2 = nwbfile5.create_processing_module("behavior", "behavior")
proc_mod2.add(mock_SpatialSeries())
assert modalities_string(nwbfile5) == 'ophys(proc)+behavior(proc)'

@jwodder do you have ideas about what might be the best way to incorporate this into dandi organize?

@satra
Member

satra commented Apr 12, 2023

@bendichter - nice! perhaps replace the ( with - to make filenames a bit more terminal-friendly:

'ecephys-acq'
'ecephys-proc'
'ecephys-acq-proc'
'ecephys-acq-proc+ophys-acq'
'ophys-proc+behavior-proc'

also in terms of the code, you could extract those strings when extracting metadata, and forgo the mapping function. there is a specific raw nwb metadata extractor function.

@bendichter
Contributor

Here's a version that uses dashes instead of parens:

import importlib
import inspect
import pynwb

module_names = ("ecephys", "ophys", "icephys", "image", "retinotopy", "ogen", "behavior")

module_map = dict()
for module_name in module_names:
    module = importlib.import_module("pynwb." + module_name)
    for x in module.__dict__.values():
        if inspect.isclass(x) and \
        issubclass(x, pynwb.core.NWBMixin) and \
        x.__module__ == "pynwb." + module_name:
            module_map[x] = module_name
            
def modalities_string(nwbfile):
    
    def get_processing_neurodata_objects(nwbfile):
        for mod in nwbfile.processing.values():
            for data_interface in mod.data_interfaces.values():
                yield data_interface
                
    def check_for_module(list_of_ndobjects: list, module_name: str):
        for ndobj in list_of_ndobjects:
            # walk the MRO so subclasses of mapped types are detected too
            for cls in type(ndobj).mro():
                if module_map.get(cls) == module_name:
                    return True
        return False
                 
    processing_neurodata_objects = list(get_processing_neurodata_objects(nwbfile))
    
    module_strs = []
    for module_name in module_names:
        
        acq = check_for_module(nwbfile.acquisition.values(), module_name)
        proc = check_for_module(processing_neurodata_objects, module_name)
        
        # handle the Units table, which is an exception to the rule.
        if module_name == "ecephys" and nwbfile.units is not None:
            proc = True
            
        if not (acq or proc):
            continue
        elif acq and not proc:
            module_strs.append(f"{module_name}-acq")
        elif not acq and proc:
            module_strs.append(f"{module_name}-proc")
        else:
            module_strs.append(f"{module_name}-acq-proc")
            
    return "+".join(module_strs)
        

tests:

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.ophys import mock_TwoPhotonSeries, mock_OnePhotonSeries, mock_DfOverF
from pynwb.testing.mock.behavior import mock_SpatialSeries
from pynwb.ecephys import ElectricalSeries

# ecephys acquisition
nwbfile = mock_NWBFile()
nwbfile.add_acquisition(mock_ElectricalSeries())
assert modalities_string(nwbfile) == 'ecephys-acq'

# ecephys processing
nwbfile2 = mock_NWBFile()
proc_mod = nwbfile2.create_processing_module("ecephys", "ecephys")
#proc_mod.add(mock_ElectricalSeries())
nwbfile2.add_unit(spike_times=[1.0, 2.0, 3.0])
assert modalities_string(nwbfile2) == 'ecephys-proc'

# ecephys acquisition and processing
nwbfile3 = mock_NWBFile()
nwbfile3.add_acquisition(mock_ElectricalSeries())
proc_mod = nwbfile3.create_processing_module("ecephys", "ecephys")
proc_mod.add(mock_ElectricalSeries())
assert modalities_string(nwbfile3) == 'ecephys-acq-proc'

# ecephys acquisition and processing and ophys acq
nwbfile4 = mock_NWBFile()
nwbfile4.add_acquisition(mock_ElectricalSeries())
nwbfile4.add_acquisition(mock_TwoPhotonSeries())
proc_mod = nwbfile4.create_processing_module("ecephys", "ecephys")
proc_mod.add(mock_ElectricalSeries())
assert modalities_string(nwbfile4) == 'ecephys-acq-proc+ophys-acq'

# ophys processing and behavior processing
nwbfile5 = mock_NWBFile()
proc_mod = nwbfile5.create_processing_module("ophys", "ophys")
proc_mod.add(mock_DfOverF())
proc_mod2 = nwbfile5.create_processing_module("behavior", "behavior")
proc_mod2.add(mock_SpatialSeries())
assert modalities_string(nwbfile5) == 'ophys-proc+behavior-proc'

@jwodder
Member

jwodder commented Apr 13, 2023

@bendichter Is this code supposed to completely replace the current modalities calculation or just augment it somehow? Specifically, is the goal to append -acq and/or -proc to every modality value in organized NWB files?

@bendichter
Contributor

Is this code supposed to completely replace the current modalities calculation or just augment it somehow?

This is meant to replace a section of _assign_dandi_names, establishing a new method for creating the modalities section of filenames.

Specifically, is the goal to append -acq and/or -proc to every modality value in organized NWB files?

Yes, that's the goal.

@jwodder
Member

jwodder commented Apr 13, 2023

@bendichter I would start updating the code as follows:

diff --git a/dandi/metadata.py b/dandi/metadata.py
index 2597a32..2b126ef 100644
--- a/dandi/metadata.py
+++ b/dandi/metadata.py
@@ -53,6 +53,7 @@ lgr = get_logger()
 def get_metadata(
     path: str | Path | Readable,
     digest: Optional[Digest] = None,
+    modalities: bool = False,
 ) -> Optional[dict]:
     """
     Get "flatdata" from a .nwb file
@@ -127,7 +128,7 @@ def get_metadata(
         tried_imports = set()
         while True:
             try:
-                meta.update(_get_pynwb_metadata(r))
+                meta.update(_get_pynwb_metadata(r, modalities=modalities))
                 break
             except KeyError as exc:  # ATM there is
                 lgr.debug("Failed to read %s: %s", r, exc)
diff --git a/dandi/organize.py b/dandi/organize.py
index 32dcadc..8894d14 100644
--- a/dandi/organize.py
+++ b/dandi/organize.py
@@ -108,9 +108,6 @@ def create_unique_filenames_from_metadata(
     # Additional fields
     #
 
-    # Add "modalities" composed from the ones we could deduce
-    _populate_modalities(metadata)
-
     # handle cases where session_id was not provided
     # In some of those we could have session_start_time, so we could produce
     # session_id based on those
@@ -401,28 +398,6 @@ def _sanitize_value(value, field):
     return value
 
 
-def _populate_modalities(metadata):
-    ndtypes_to_modalities = get_neurodata_types_to_modalities_map()
-    ndtypes_unassigned = set()
-    for r in metadata:
-        mods = set()
-        nd_types = r.get("nd_types", [])
-        if isinstance(nd_types, str):
-            nd_types = nd_types.split(",")
-        for nd_rec in nd_types:
-            # split away the count
-            ndtype = nd_rec.split()[0]
-            mod = ndtypes_to_modalities.get(ndtype, None)
-            if mod:
-                if mod not in ("base", "device", "file", "misc"):
-                    # skip some trivial/generic ones
-                    mods.add(mod)
-            else:
-                ndtypes_unassigned.add(ndtype)
-        # tuple so we could easier figure out "unique" values below
-        r["modalities"] = tuple(sorted(mods.union(set(r.get("modalities", {})))))
-
-
 def _populate_session_ids_from_time(metadata):
     ses_times = [m.get("session_start_time", None) for m in metadata]
     if not all(ses_times):
@@ -799,7 +774,7 @@ def organize(
 
         def _get_metadata(path):
             try:
-                meta = get_metadata(path)
+                meta = get_metadata(path, modalities=True)
             except Exception as exc:
                 meta = {}
                 failed.append(path)
diff --git a/dandi/pynwb_utils.py b/dandi/pynwb_utils.py
index 5e1afff..d843db5 100644
--- a/dandi/pynwb_utils.py
+++ b/dandi/pynwb_utils.py
@@ -202,7 +202,7 @@ def _scan_neurodata_types(grp: h5py.File) -> List[Tuple[Any, Any]]:
     return out
 
 
-def _get_pynwb_metadata(path: str | Path | Readable) -> dict[str, Any]:
+def _get_pynwb_metadata(path: str | Path | Readable, modalities: bool = False) -> dict[str, Any]:
     out = {}
     with open_readable(path) as fp, h5py.File(fp) as h5, NWBHDF5IO(
         file=h5, load_namespaces=True
@@ -255,6 +255,9 @@ def _get_pynwb_metadata(path: str | Path | Readable) -> dict[str, Any]:
         # get external_file data:
         out["external_file_objects"] = _get_image_series(nwb)
 
+        if modalities:
+            out["modalities"] = modalities_string(nwb)
+
     return out
 
 
@@ -554,3 +557,7 @@ def open_readable(r: str | Path | Readable) -> IO[bytes]:
         return r.open()
     else:
         return open(r, "rb")
+
+
+def modalities_string(nwbfile: pynwb.NWBFile) -> str:
+    ### Your code here ###

@yarikoptic
Member

yarikoptic commented Apr 17, 2023

Sorry for the delay in joining this conversation -- was traveling etc. My thinking:

  • dandi organize is first of all just a helper! If it does not arrive at the final desired file naming, there is always a simple mv (or other command/interface) to rename file(s) further to the desired naming without needing to wait for some logic to be implemented in dandi organize. Moreover, since the dandi archive is smart enough to not require reupload of the same blobs, and there is even a dandi move command to rename files locally/in the archive -- I would recommend proceeding with upload of the .nwb files whose content you do not expect to change as soon as you have a viable naming convention (and possibly improving/renaming later). Relevant here:
    • As discussed above -- it is recommended to store "raw" data in separate files or even dandisets.
  • the main body of the discussion seems to be about filename annotation for having processed data in the dandiset/files. @bendichter and @satra exercised above ideas on extending our ad-hoc DANDI file naming/layout by placing more metadata into the "modalities" suffix of the filename. Although it looks nice at first sight, I do not think it is a good idea:
    • DANDI layout (the result of dandi organize) was introduced to largely mimic BIDS layout.
    • DANDI layout is not really "standardized" beyond what is implemented in dandi organize, and as an archive we should avoid leading a standardization effort; we should strive for a balance between remaining flexible to suit users' use cases and adopting (and influencing) standards which exist or are being developed.
    • The DANDI layout filename, sub-<label>_ses-<label>[_<entity>-<value>]*_<modalities>.<ext>, at large mimics BIDS, and the entities we accept largely come from BIDS; we work with BIDS and BEPs to introduce new ones.
    • _<entity>-<value> pairs are added to the filename to provide extra "hints" (not really to encode metadata values) on how two or more files differ. dandi organize logic "automates" and "accents" the above - it adds an _<entity>-<value> (unless explicitly listed in --required-field) only if such an "entity" differs across files.
    • _<modalities> was introduced also to mimic BIDS suffixes, but since NWB allows for multiple modalities to be present in the same file, we allowed for composition (thus + between different modalities).
    • BIDS has a number of "entities" which were introduced to "hint" that/how files differ based on processing done
      • during data acquisition (by hardware), e.g. _rec- (for "reconstruction") and/or _proc- (for "processing"). Also, _acq- is there to "note" differences in acquisition (e.g. parameters). Again, just a "note"; details should be contained inside files, not filenames.
      • after having been acquired: _desc- (for "description") as a generic entity to apply to derived data files.
    • I would not expect two files to have somewhat duplicate/conflicting combinations like _ecephys-acq and _ecephys-acq-proc. Indeed there could be different sets of "acq/proc" across different modalities, but I would expect it to be consistent across files within a subject/session (e.g. not to have _ecephys-acq in one and _ecephys-acq-proc in another).
    • So it feels natural to me for us to extend with support for _proc- and _desc- entities, to allow users to describe the difference between files containing derived/processed data, and for dandi organize to make use of either or both of them, instead of coming up with an additional ad-hoc way to extend _<modalities> into a composite "modalities and processing hints" (where we use _<entity>-<value> for the "hints").

If we agree that we should just add _proc- and/or _desc-, then we can discuss how organize could/should make use of them (e.g. maybe via looking at the unique sets of "modality + processing" parametrizations, etc.).
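For example (all labels here are hypothetical, just to illustrate the shape), the heavy and light variants discussed above could then be distinguished as:

sub-01_ses-01_desc-raw_behavior+ophys.nwb      (with the raw imaging series)
sub-01_ses-01_desc-preproc_behavior+ophys.nwb  (processed-only version)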

@jwodder
Member

jwodder commented Apr 17, 2023

@yarikoptic Note that the code for validating that asset paths conform to dandi organize currently requires modalities to consist of just lowercase ASCII letters, so a user trying to upload a file with a - in the modality would currently get an error. (They would still be able to use dandi move to rename assets on the server to use their desired format, as validation is not run on move.)

@satra
Member

satra commented Apr 17, 2023

@yarikoptic - i think the challenge here that is different from bids is that the same file may have multiple types of data, each in raw + proc form, so there is still a need for modality-specific information, which i don't think can easily be added in _proc or _desc. i agree with the notion that we shouldn't have acq+proc and acq or proc in the same file.

what happens with datasets that only share spike times? are you going to disallow it?

the code for validating that asset paths conform to dandi organize currently requires modalities to consist of just lowercase ASCII letters

@jwodder - if that's the case, dandi cli has changed, as we currently should have support for + in the modalities string. there are many dandisets with multiple modalities in nwb files that create that suffix.

@jwodder
Member

jwodder commented Apr 17, 2023

@satra The + is a modality separator, which the validation supports. The restriction to lowercase letters applies to the individual modalities.
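An illustration of that constraint (the pattern below is a stand-in, not the actual dandi-cli regex):

import re

# each modality: lowercase ASCII letters only; "+" separates modalities
MODALITIES_RE = re.compile(r"[a-z]+(?:\+[a-z]+)*")

assert MODALITIES_RE.fullmatch("ecephys+ophys")     # accepted
assert not MODALITIES_RE.fullmatch("ecephys-acq")   # "-" would be rejected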

@yarikoptic
Member

@yarikoptic - i think the challenge here that is different from bids is that the same file may have multiple types of data, each in raw + proc form, so there is still a need of modality specific information, which i don't think can easily be added in _proc or _desc.

why? _desc-raw+proc, if not as descriptive in some cases, is just fine/valid. If you want to become more specific, it can even be _desc-allphys-rawproc+beh-deeplabcut or the like; in other words -- as descriptive as you like if you really want to annotate per modality, or just _desc-preproc+curated1 as a description of preprocessed and curated data for all modalities present etc.

The point IMHO is that if we really want to add per-modality entities, we need to come up with some generic approach generalizing to other entities etc, not just limit it only for proc and raw "hints". But as it would render filenames likely unusable, IMHO _desc- is just fine.

i agree with the notion that we shouldn't have acq+proc and acq or proc in the same file.

what happens with datasets that only share spike times? are you going to disallow it?

nope -- are those spikes for ecephys? then _desc-spikes_ecephys or the like.

@yarikoptic
Member

@yarikoptic Note that the code for validating that asset paths conform to dandi organize currently requires modalities to consist of just lowercase ASCII letters, so a user trying to upload a file with a - in the modality would currently get an error.

Thanks! happy to hear that we made it so specific -- it is easier to relax than to make more restrictive.

@bendichter
Contributor

I agree with the notion that we shouldn't have acq+proc and acq or proc in the same file.

@satra just so I understand, are you talking here about the file naming convention, or are you saying that you think the NWB file itself should not mix acquisition and processing data between modalities?

@Yarik, let me see if I understand some of your points:

  1. Putting DANDI-specific naming conventions inside of _desc- allows us to better comply with existing BIDS naming convention rules.

I don't see a problem with that on my end.

  2. These _desc- fields do not need to follow any strict automatically applied rules and could be manually specified by the user.

I don't see a problem with that from my end either. In fact, this alone could solve @colleenjg's problem, as she would be able to add _desc- to modify the filenames of all NWB files and include the "acquired"/"processed" metadata she desires. It would require a little custom script that parses the NWB files and adds _desc- the way she wants. This approach is flexible enough to handle lots of other types of problems we may run up against and could provide a good compromise between standardization (the existing modalities string) and customization (in desc) while providing better compliance with BIDS. One downside of this approach is that it would be user-specific and would therefore not be generally applied to all DANDI users. It would also require an additional step. Perhaps a compromise would be to allow optionally supplying a user-specified function for the desc string. Something like:

dandi -desc extract_desc.py organize .
#  extract_desc.py
def extract_desc(nwbfile):
    if ...:
        return "acq"
    elif ...:
        return "proc"
    else:
        return ""

but I'm not sure if this would really be better than simply creating a custom script that iterates over and renames the files. Something like:

from glob import glob
from pynwb import NWBHDF5IO
import os

desc_dict = dict()

# assumes extract_desc() from the snippet above is defined
for nwb_filepath in glob("data/**/*.nwb", recursive=True):
    with NWBHDF5IO(nwb_filepath, mode="r", load_namespaces=True) as io:
        nwbfile = io.read()
        desc_dict[nwb_filepath] = extract_desc(nwbfile)

for src, desc in desc_dict.items():
    if desc:
        # insert the _desc- entity just before the .nwb extension
        os.rename(src, src[:-len(".nwb")] + f"_desc-{desc}.nwb")

Is this along the lines of what you were thinking?

@colleenjg
Author

colleenjg commented Apr 17, 2023

@satra and @yarikoptic Given the time crunch I'm under, I think I'll go ahead and tag the files with the raw ophys data with something like +raw. This would fit well with my download code for the dataset that looks to identify files based on whether certain strings occur or do not occur in their names.

Do you have a preference for what string I might use? I aim to do this tomorrow (Tuesday) at the latest.

@yarikoptic
Member

quick one for @colleenjg: if you just use _obj-raw for those (in place of the actual object id organize uses there) now, it would even make it legit for DANDI validation etc.
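e.g. (subject/session labels hypothetical):

sub-01_ses-01_obj-raw_behavior+ophys.nwb    (version with the raw imaging data)
sub-01_ses-01_behavior+ophys.nwb            (lightweight version)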
