
Filling in missing values #149

sampotter opened this issue Nov 1, 2021 · 6 comments
sampotter commented Nov 1, 2021

I noticed that (at least according to the FAQ in v0.5.5) there is no strategy for filling in missing values. I can imagine there are a number of use cases where such missing values are prevalent due to the existence of geometry in a computational domain.

Having used zfp a bit in this context, and having some understanding of how zfp works (at least according to the original paper), a simple strategy like filling in with a constant value would lead to an unnecessarily large compressed size for each 4^d zfp block with missing values.

Has there been any work on this? Is there any interest in seeing this issue addressed?

I have a very simple algorithm which should work fairly well here, and would like to try integrating it and seeing how well it performs in different contexts.


lindstro commented Nov 1, 2021

There has been some work on this, but we still do not have what I would call a satisfactory solution. One reason for this is that the current definition of the zfp code stream does not support marking which values are missing, so that information would (currently) have to be managed separately.

We did do some work with one customer where we simply replaced the missing values with the mean value within a block. That worked reasonably well in their application. Ideally, you'd pad with values that promote compression. This is something that is done to compress partial blocks that extend beyond the domain (e.g., for arrays whose dimensions are not multiples of four), where we pad with neighboring values. A similar strategy could be employed for more general configurations of missing values, but once again the bit mask would have to be maintained separately.
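The mean-fill idea can be sketched in a few lines of numpy. This is illustrative only; `block` and `missing` are hypothetical inputs, not part of zfp's API, and a real implementation would operate on zfp's internal block layout:

```python
import numpy as np

# One 4x4 block with two missing entries (positions chosen arbitrarily).
block = np.arange(16, dtype=float).reshape(4, 4)
missing = np.zeros((4, 4), dtype=bool)
missing[1, 2] = missing[3, 0] = True

# Replace each missing value with the mean of the block's known values.
mean = block[~missing].mean()
filled = np.where(missing, mean, block)
```

The bit mask `missing` would, as noted above, have to be stored and compressed separately for the decompressor to distinguish filled values from real ones.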

A somewhat more sophisticated approach is to perform a linear solve so that fill values result in zero-valued coefficients following the linear decorrelating transform. This gets a little tricky, as the unknowns could be in a configuration that results in a singular system (see this paper for one approach to dealing with this). This type of padding could be done outside of zfp by the user, or we could provide a utility function that does it. One challenge with this approach is that it would likely have to be done before the maximum exponent within a block is computed, as this type of "smooth" padding could produce new values that are larger in magnitude than the true values in a block and cause arithmetic overflow. That's why we choose among available values to handle partial blocks.

Ultimately, we do want to fully support missing values (and NaNs, infinities, and similar), but it will require that zfp also compresses the bit mask and embeds it in the stream. This is particularly tricky when dealing with fixed-rate mode, where at low enough rates you might not be able to represent the mask itself without loss.

There are several issues like this that really complicate how to handle missing values correctly. I would welcome any suggestions you might have.


sampotter commented Nov 1, 2021

Thanks for the quick and thorough response. I appreciate the level of detail provided regarding zfp's internals.

I am certainly in "zfp user space", and I'm not sure whether the strategy I'm proposing would be too slow to integrate into zfp as an automatic inpainting method. Either way, this is roughly what I'm thinking...

For a 4 x 4 x 4 zfp block, assume you have N missing values, with 0 < N < 64. Let Phi be the matrix whose columns are the zfp transform's basis functions. If we pull out the 64 - N rows of Phi corresponding to the 64 - N zfp block entries with known values, we get a (64 - N) x 64 matrix whose columns form an overcomplete dictionary (i.e., each column is an "atom" in that dictionary). We want to find a vector of atom weights that reconstructs the known values exactly but also leads to a minimal number of bytes after zfp's embedded coding. The first requirement is not too hard---we can just solve a least squares problem---but there's no guarantee the compressed size will be any good. An alternative is to use algorithms from compressed sensing, such as orthogonal matching pursuit (OMP) (Tropp's 2004 "Greed is good" paper gives a nice and simple overview). The basic idea is to initialize a residual to the known values, repeatedly select the atom with the largest absolute dot product against the residual, and project the residual onto the orthogonal complement of the atoms selected so far. With this particular dictionary, OMP terminates in a finite number of steps and recovers the known values exactly.
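A minimal sketch of this OMP procedure on a 1-D block of four values, using an orthonormal cosine basis as a stand-in for zfp's decorrelating transform (an assumption; zfp's actual lifted transform differs, but the structure of the algorithm is the same):

```python
import numpy as np

# Stand-in orthonormal basis (DCT-II columns); NOT zfp's actual transform.
n = 4
k = np.arange(n)
Phi = np.array([np.cos(np.pi * (2 * k + 1) * j / (2 * n)) for j in range(n)]).T
Phi /= np.linalg.norm(Phi, axis=0)          # columns are unit-norm atoms

block = np.array([1.0, 2.0, np.nan, 4.0])   # one missing value
known = ~np.isnan(block)
A = Phi[known, :]                            # rows for known samples -> dictionary

# Orthogonal matching pursuit: greedily select atoms until the known
# values are reproduced exactly (at most 64 - N iterations in general).
support, coef = [], np.zeros(0)
r = block[known].copy()
for _ in range(A.shape[0]):
    if np.linalg.norm(r) < 1e-10:
        break
    j = int(np.argmax(np.abs(A.T @ r)))      # atom most correlated with residual
    support.append(j)
    coef, *_ = np.linalg.lstsq(A[:, support], block[known], rcond=None)
    r = block[known] - A[:, support] @ coef  # residual orthogonal to chosen atoms

filled = Phi[:, support] @ coef              # reconstruct, inpainting the gap
```

Because the selected atoms span the space of known samples, the residual vanishes after at most 64 - N steps and `filled` agrees with the original block wherever values were known.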

I did a few quick tests and found that this approach reliably picked out a sparse coefficient vector with 64 - N nonzero values. This makes sense---IIRC, each subset of 64 - N of these atoms should be linearly independent. For the sorts of fields I'm compressing, I found a few instances where an even sparser coefficient vector was discovered (e.g., consider a constant block with some missing values), but for real-world signals there's no reason to expect this to happen very often.

In my tests, I also found that this OMP algorithm has a strong tendency to pick out coefficients roughly in increasing order of sequency and decreasing order of magnitude. So it seems like a good choice for inpainting in a manner which works well with the embedded coding that follows. I compared this with using biharmonic inpainting and found that they resulted in comparable compressed sizes.

Anyway, you can make this OMP approach run pretty fast---it should take about as long as running Gram-Schmidt on a (64 - N) x (64 - N) matrix (or something like that). Not sure if this will be fast enough, but the fact that OMP can run completely independently on each block seems to jibe with zfp's ethos.

Let me know if this is interesting to you.

Edit: What you mentioned about not exceeding the range of the known values is interesting. I'm not sure whether what I'm proposing is likely to screw this up or not. I'll give this a little thought.


lindstro commented Nov 1, 2021

This sounds intriguing and is certainly something that could be explored and adopted outside of zfp as a preprocess. My sense is that such heavy machinery would be overkill for applications that demand very high-speed (de)compression, e.g., those that use zfp's compressed-array classes for in-memory storage. But it may be a good idea to support user algorithms for inpainting and an interface for separately (de)compressing a bitmap that indicates which values are missing.

I'd be curious to see how well OMP, biharmonic, and our own wavelet-based and spectral interpolation methods (adapted to use zfp's basis) do relative to simpler approaches, and whether the expense in performance and implementation complexity warrant such approaches. If you pursue this further, would you be willing to share your results?

Regarding overflow, that is typically not an issue when the inpainting is done independent of and before compression. We've in fact done exactly that as an alternative to padding-by-selection for the case of partial blocks, where we simply use low-degree polynomial extrapolation. Looking at the current implementation, it appears that we do in fact perform padding before compression, so overflow cannot occur after we perform the block-floating-point transform, which was my initial concern. However, there's still the risk of overflowing the scalar type, i.e., exceeding FLT_MAX or INT_MAX, when padding before the block-floating-point transform. Such overflow should be rare when working with floating-point data, however, as the values to be compressed ought not be anywhere near {FLT,DBL}_MAX.

sampotter commented

OK, I had the same hunch (too heavyweight to do inline). Maybe exploring these options will suggest a good way to do superfast inpainting that is compatible with those users' requirements, but I won't worry about that for now.

I'd be happy to compare different approaches and see how they fare. I'll need to throw together some examples to test on first, since the fields I'm testing on unfortunately aren't publicly available. If you have any examples kicking around that you think would be worth testing on, please feel free to point me to them. I tried searching for some of the examples used in the zfp papers in the repository, but couldn't find them. Although, IIRC, most of them didn't have geometry.

If this goes anywhere, I'd be happy to share my results and potentially write them up if you think it's worth it. (I just started a postdoc where I have a lot of flexibility, so I'm on the lookout for new and interesting projects.)

Re: your comment about overflow: What you said makes sense. Having inpainting overflow the range of a float or double seems exceedingly unlikely. Is there any reason to do inpainting after normalizing the values in a block? Also, how does zfp currently deal with padding? The low-degree polynomial extrapolation you mentioned or something else?


lindstro commented Nov 2, 2021

Great. For examples, I would check out SDRBench, which has the Miranda data used in the 2014 zfp paper and some version of S3D data. You might also want to consider the Isabel data set on that site. The original version has missing values marked by 1e35; those were replaced with zeros in the SDRBench version. Such missing values are common in weather and climate data.
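For reference, recovering a missing-value mask from the original Isabel-style sentinel might look like the following. The particular values and threshold here are illustrative assumptions, not taken from the data set itself:

```python
import numpy as np

# Hypothetical field with one missing value marked by the 1e35 sentinel.
field = np.array([1.0, 2.0, 1e35, 3.0], dtype=np.float32)

# The sentinel is far above any physical value, so a threshold suffices.
missing = field >= np.float32(1e34)

# SDRBench-style preprocessing: replace sentinel values with zeros.
cleaned = np.where(missing, np.float32(0.0), field)
```

Keeping `missing` around (rather than discarding it after zeroing) is what would let an inpainting scheme know which entries it is free to overwrite.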

Overflowing floats is unlikely, but the compressor still has to handle this rare case. Once we support half precision, overflow will become more common.

The padding scheme for partial 1D blocks is as follows:

x??? -> xxxx
xy?? -> xyyx
xyz? -> xyzx

Here '?' denotes a missing value and x, y, and z are true values. In multiple dimensions, 1D padding is done along x first, then y, etc. (see the implementation in the zfp source). This type of padding could be done either before or after the block-floating-point transform; the result would be the same. It's easier to do it before so that special cases don't have to be handled in the compression pipeline.
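The quoted 1-D padding rule can be written out directly as follows. This is a sketch for clarity only; zfp's actual implementation is in C and operates on strided arrays:

```python
def pad_block_1d(vals):
    """Pad a partial 1-D block of 1-3 values to length 4 per the rule above."""
    x = vals[0]
    if len(vals) == 1:
        return [x, x, x, x]            # x??? -> xxxx
    if len(vals) == 2:
        y = vals[1]
        return [x, y, y, x]            # xy?? -> xyyx
    if len(vals) == 3:
        y, z = vals[1], vals[2]
        return [x, y, z, x]            # xyz? -> xyzx
    return list(vals)                  # already a full block
```

Note that every padded entry is copied from an existing value, which is why this scheme cannot enlarge the block's value range or its maximum exponent.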

sampotter commented

Thanks for the data sets. Those will be useful.

I'll take a stab at this as I have time and get back to you here once I've got some results. If you have any more thoughts, please don't hesitate to reach out to me here (I should get notified), or by email.
