Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

ot.emd with large datasets #397

Open
johnalison opened this issue Sep 7, 2022 · 6 comments
Open

ot.emd with large datasets #397

johnalison opened this issue Sep 7, 2022 · 6 comments

Comments

@johnalison
Copy link

johnalison commented Sep 7, 2022

馃殌 Feature

I was wondering if it would be possible to do the OT matrix computation (ot.emd) using sparse matrices,
(eg: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_array.html#scipy.sparse.coo_array)
where only the distances between pairs of entries below some cutoff would be given, and only the coupling for pairs entries above some cutoff would be returned.

Motivation

I'd like to compute the OT matrix with between large (O(500k)) datasets.
Currently, the biggest limitation to dataset size is the memory requirement for storing the len(a) x len(b) matrices.
In practice, nearly all of the input distances are large, and nearly all of the corresponding couplings are small.
So I thought that the sparse matrices could be a good solution.

Alternatives

We are open to any suggestions for alternatives.

We currently approximate the full OT matrix by computing many OT matrices using O(10k) subsets of the source and target datasets. The effect of the full OT between the datasets is then approximated by "stitching together" the smaller coupling matrices. This reduces the effective statistical power of our data, which we think we can avoid with the sparse matrix implementation.

Additional context

The large datasets come from LHC data:
https://arxiv.org/abs/2208.02807
Thank you for your OT implementation ! We have found it very useful in our work.

@EduardoGoulart1
Copy link

I think it shouldn't be too difficult. One would have to adapt the EMD_wrapper to accept sparse arrays. I could give it a try

@rflamary
Copy link
Collaborator

Sorry for te late answer. This would be very interesting but I don't think teh current C++ implementation will allow for this. In order to solve very large problems we will need to compute teh cost matrix in a lazy way (recompute the cost only when needed in the solver) which is a classical strategy that saves memory but will also mean slower solvers. If we do that we should be able to work with and return sparse matrices. @EduardoGoulart1 we are obviously very interested by such a conribution.

@EduardoGoulart1
Copy link

EduardoGoulart1 commented Feb 21, 2024

I would go with a more simple solution first. We just need to adapt the code to handle sparse cost/flow matrices. The network_simplex_simple.h would require nearly no changes. The main problem seems to be the EMD_wrapper.cpp where a full bipartite graph is explicitly created

@rflamary
Copy link
Collaborator

For a sparse cost, you suppose that the cost is infinite if not provided? It was never clear to me what it meant for Lemon solver not to set the values...

@EduardoGoulart1
Copy link

EduardoGoulart1 commented Feb 22, 2024

@rflamary I guess he wants the library to accept graphs that are not full-bipartite, i.e. not all (src, tgt) pairs are available. The network simplex code already accepts that. The only modification would be the wrapper. Unfortunately, I'm out for the next 10 days, so I can't prototype that (I would also only do it after doing some cleanups in the C++ code)

Edit: Essentially, that means that sparse costs are infinite. The only complication would if the input data is such that no feasible solution can be generated (but then I would argue that this is the user's problem)

@rflamary
Copy link
Collaborator

@EduardoGoulart1 that is very clear and definitely something of interest to us. Take your time and feel free to connect to the slack for direct discussions when your are back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants