Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hdf5 tile data after animation stops #1371

Merged
merged 9 commits into from
May 20, 2024
Merged

Conversation

pford
Copy link
Collaborator

@pford pford commented Apr 17, 2024

Description

  • What is implemented or fixed? Mention the linked issue(s), if any.
    Fixes odd small pixel tiles seen in hdf5 image after animation stops #1368

  • How does this PR solve the issue? Give a brief summary.
    By default, image quality is decreased during animation for better speed. For each channel, tile data for HDF5 images is requested from the tile cache, encoded so that NaN values may be restored, and then compressed. However, the encoding step changes the cached tile data (passed with a shared_ptr) by replacing a NaN values in a block with the average of finite values in that block. When the cached tile data is requested again after animation stops, for image data with a higher compression quality, the image data is distorted since some values are no longer NaN. To fix this, the tile data is assigned to a new vector for encoding and compression, leaving the cached data unchanged.

  • Are there any companion PRs (frontend, protobuf)?
    No

  • Is there anything else that testers should know (e.g. exactly how to reproduce the issue)?
    Zoom in an HDF5 image to an area with NaN values, then start and stop animation. There should be no defect in the image after the animation is stopped.

Checklist

  • changelog updated / no changelog update needed
  • e2e test passing / corresponding fix added / new e2e test created
  • ICD test passing / corresponding fix added / new ICD test created
  • protobuf updated to the latest dev commit / no protobuf update needed
  • protobuf version bumped / protobuf version not bumped
  • added reviewers and assignee
  • added ZenHub estimate, milestone, and release

@pford
Copy link
Collaborator Author

pford commented Apr 17, 2024

Not sure if this should be code reviewed or tested first since it involves copying data, which we try to avoid.

@confluence
Copy link
Collaborator

In this case I think copying is unavoidable -- the compression function should not be modifying data in the tile cache!

I do remember adding a tile pool to the tile cache to avoid constantly creating and destroying vector objects, since this was causing quite a significant slowdown. I wonder if a tile pool would be appropriate here as well (I think that we should test the performance impact of this change first).

If yes, then we could either create a new tile pool to be used by this function, or add a GetCopy function to the tile cache which performs the copy internally and uses the tile cache's own tile pool (we may then need to adjust the tile cache pool size when we start and stop the animation, but maybe the tile requirements aren't going to exceed the default size, which IIRC is enough to fit one row and one column of 2x2 tile chunks).

@confluence
Copy link
Collaborator

Actually, now I see that the current change is in FillRasterTileData, which means that the copy is being performed regardless of how the data was obtained in GetRasterTileData (which means that it will also affect cases where the data does not need to be copied). It's GetRasterTileData which should be modified -- there's already a new vector which is created there (for the cases which don't use the tile cache), which could be used to copy the cached data.

Perhaps adding a new tile pool in GetRasterTileData, and using it in all cases instead of creating a new tile vector, would be a good idea.

@pford
Copy link
Collaborator Author

pford commented May 7, 2024

@confluence I added TileCache::GetCopy. If I added a TilePool to Frame, wouldn't it have the same problem as the TileCache pool?

@confluence
Copy link
Collaborator

@confluence I added TileCache::GetCopy.

Currently GetCopy is creating a new vector. I was originally suggesting using the tile cache's pool to recycle a vector object to use for the copy (i.e. using _pool->Pull() to get a vector pointer).

If I added a TilePool to Frame, wouldn't it have the same problem as the TileCache pool?

I'm not sure what problem you mean here.

I think I've confused the issue by conflating multiple suggestions -- for clarity, I'm talking about two separate things:

  1. In GetRasterTileData we already create a new vector to read data into, in the cases that don't use the tile cache. I would suggest that we use this vector as well in the tile cache case, for consistency, instead of creating a vector in a different place. Perhaps the GetCopy function should be called CopyInto and take that vector as an input/output parameter -- but maybe this function isn't necessary at all, and the copy can just be inlined in GetRasterTileData. (I originally suggested the tile cache function to enable the use of the tile cache's pool, before I realised that the tile cache case could be integrated more generically with the other cases.)

  2. Additionally, as a performance improvement, I think it would be a good idea to use a tile pool for that new vector in GetRasterTileData, instead of repeatedly creating and destroying it. My updated suggestion for this is to add a separate tile pool object to the frame, to be used here in all the cases, not just the tile cache case. I haven't checked if there are other places in the frame where a new tile vector is created (which could make use of the same pool).

@confluence
Copy link
Collaborator

confluence commented May 10, 2024

@pford I was suggesting to use only a tile pool in the frame for all the tiles, not the tile cache -- we only added the full-resolution tile cache to HDF5 files because in the HDF5 schema the main dataset is chunked, which makes it efficient to read a chunk (which is 2x2 tiles) at a time. For FITS files and other formats, reading a chunk at a time is not efficient, which is why we still use the full channel cache for those formats. I don't think it makes sense to copy tiles read from the full channel cache into the tile cache (this is currently bypassing the tile cache's tile pool, and I believe that downsampled tiles are also being saved -- this cache is only written to handle full-resolution tiles).

I am suggesting only that the frame gets its own tile pool object (separate to the tile cache's tile pool), and uses that pool to recycle the tile pointers (instead of creating new pointers with newly allocated data). This wouldn't be caching any tile data; just keeping a buffer of reusable objects allocated in memory. They would be overwritten with new data every time they are used (so it would be fine to use them for all tiles in GetRasterTileData, whether they're downsampled or not).

So the frame would have a std::shared_ptr<TilePool> _pool, with some reasonable capacity (how many tiles are likely to be allocated at the same time during an animation?), and in GetRasterTileData instead of creating a new tile_data vector and later a new pointer with make_shared, we'd get a pointer from the pool with _pool->Pull(), and write to that pointer's data. The pool class handles the recycling internally (with a custom deleter).

So it would be just this small change, plus the copying in the tile cache case.

@pford
Copy link
Collaborator Author

pford commented May 13, 2024

I added a TilePool to Frame with the capacity set to the number of omp threads, since Session fills the tiles in parallel using omp. The TilePool is created the first time tiles are requested, since tiles are not used when Frame holds the downsampled image for PV preview.

Copy link
Collaborator

@confluence confluence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the place the pool is created, I think this is good now. Using the number of threads to calculate the pool capacity makes sense. Perhaps it should be the number of threads plus one or two, to avoid race conditions in which a new tile is needed before any tiles have been returned to the pool?

src/Frame/Frame.cc Outdated Show resolved Hide resolved
Copy link

Code Coverage

Package Line Rate Health
src.Cache 72%
src.DataStream 44%
src.FileList 67%
src.Frame 36%
src.HttpServer 42%
src.ImageData 28%
src.ImageFitter 83%
src.ImageGenerators 43%
src.ImageStats 75%
src.Logger 37%
src.Main 52%
src.Region 69%
src.Session 4%
src.Table 52%
src.ThreadingManager 67%
src.Timer 85%
src.Util 40%
Summary 46% (8614 / 18797)

@confluence confluence added the awaiting testing For pull requests that require testing label May 14, 2024
@confluence
Copy link
Collaborator

@kswang1029 in addition to testing the bug fix, please check the performance impact for animation (of both HDF5 files and other file types). I expect this to slightly decrease performance for HDF5 files but also slightly increase performance across the board, and I'd like to find out what the overall impact is (and to make sure that something unexpected isn't happening).

Copy link
Contributor

@kswang1029 kswang1029 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pford this looks good. @confluence As for performance test, please refer to the following table. I test the performance with a custom e2e test which has a 30s animation playback and then I count the stopped channel index and hence an average fps. I do not see obvious performance improvement/degradation.

Screenshot 2024-05-20 at 14 26 08

@kswang1029 kswang1029 added the awaiting code review For pull requests that require code review label May 20, 2024
@confluence
Copy link
Collaborator

@kswang1029 @pford in that case I'm happy to merge this.

@kswang1029
Copy link
Contributor

kswang1029 commented May 20, 2024

@kswang1029 @pford in that case I'm happy to merge this.

the effect (if any) may be more prominent using "large" cube but I do not have the resources to test with, unfortunately.

@confluence confluence merged commit ce8f6c0 into dev May 20, 2024
14 checks passed
@confluence confluence deleted the pam/1368_hdf5_animation branch May 20, 2024 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting code review For pull requests that require code review awaiting testing For pull requests that require testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

odd small pixel tiles seen in hdf5 image after animation stops
3 participants