Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to cache the whole cube image #1276

Closed
wants to merge 159 commits into from

Conversation

markccchiang
Copy link
Contributor

@markccchiang markccchiang commented Jun 28, 2023

Description

  • What is implemented or fixed? Mention the linked issue(s), if any.
    Cache all cube image data #1286
    This is the trial to cache the whole cube image if its memory space is under a specific quota. Users can set the memory quota via the command line --full_image_cache_size_available when starting the backend. For example, ./carta_backend --full_image_cache_size_available 1000 means at most 1000 MB of memory space for the backend can be used. If the memory of a cube image we open is under 1000 MB, it will cache all the cube image data. The upper limit of full image cache is defined as 90% of users' computer ram capacity. If users set the full image cache that exceeds the upper limit, the backend will reset it to the upper limit. By default, the full image cache is 0 MB. This function will not apply to HDF5 images (only available for CASA or FITS files). The cube image cache will be used in the following calculations:
  • image rendering, including animation
  • cube histogram calculations
  • contours
  • vector field calculations
  • spatial profiles
  • spectral profiles for cursor/point region
  • spectral profiles for the other region types, including their statistical calculations
  • PV image generator
  • above functions for computed stokes data
    🚧 image moments (This item will do in a separate PR after this PR is merged)
  • How does this PR solve the issue? Give a brief summary.
    Create an ImageCache struct under the Frame class and use it to deal with the channel or cube image data cache, which is obtained from the casacore.

  • Are there any companion PRs (frontend, protobuf)?
    No.

  • Is there anything else that testers should know (e.g. exactly how to reproduce the issue)?
    The performances of cached cube images are expected to be better than without the cache (i.e., the current dev branch). Especially when accessing spectral profiles or data in the z-direction.

Checklist

  • changelog updated / no changelog update needed
  • e2e test passing / added corresponding fix
  • protobuf updated to the latest dev commit / no protobuf update needed
  • added reviewers and assignee
  • added ZenHub estimate, milestone, and release

@ajm-ska
Copy link
Collaborator

ajm-ska commented Jul 1, 2023

@markccchiang Just to mention, I am aware the macOS12 unit tests are failing here. They consistently failed 4 times in a row when I made Github Actions rerun them. When I let Github Actions re-run the previous commit, it passed the first time.

The error is: Error: Timeout of 120000ms hit.

I have tried building and running it manually with the Address Sanitizer flags as that is what Github Actions has always been doing. The runtime is quite variable: 152.68s, 65.16s, 144.22s.
To make things more confusing, I tried the previous commit (b7fb274) and the Unit Tests run times appear to be more consistent but all longer than 2 minutes: 152.19s, 151.23s, 150.13s. So it seems that it was very lucky that the Github Actions somehow passed twice on the previous commit, (and maybe lucky it passed on earlier commits too)!

I figured a better test would be to compare it with the 'dev' branch. I get: 90.50s, 91.35s, 91.51s.

Therefore, something in this branch has slowed down the Units Tests runtime and they can no longer consistently finish within 2 minutes. It seems to only be an issue on macOS12, which is an original M1 Mac.

I have not gone so far as to identify which commit in your branch has caused the Unit Tests to slow down because first I guess we need to know if this is considered an actual issue?
It may show up later in the performance testing.
The one 65.16s runtime looks great, but unfortunately, it is not consistent.

If it is not considered an issue, you can get around the Github Actions failure by increasing the timeout in .github/workflows/continuous_integration.yml line 130: timeout_minutes: 2 to perhaps 3.

@markccchiang
Copy link
Contributor Author

@markccchiang Just to mention, I am aware the macOS12 unit tests are failing here. They consistently failed 4 times in a row when I made Github Actions rerun them. When I let Github Actions re-run the previous commit, it passed the first time.

The error is: Error: Timeout of 120000ms hit.

@ajm-asiaa Thank you for the reminding and investigating! You are right. This is due to the sample image files I generated are too large. Because I was doing performance tests for the functionalities of cached cube images in unit tests. I will reduce the size of sample image files. So this problem will be solved without changing the timeout setting.

@kswang1029
Copy link
Contributor

kswang1029 commented Jul 18, 2023

@markccchiang here are a few findings I observed so far:

  1. if we have two images loaded and matched, when we enable multiple-profile plot mode in the spectral profiler to see region spectra, the backend will crash.
  2. If we load three images (20, 20, and 90 MB), from the backend log we see (I set --reserved_memory 4)

[2023-07-18 04:28:28.210] [CARTA] [info] 4 GB of reserved memory are available.
[2023-07-18 04:28:28.213] [CARTA] [info] Cache the whole cube image data.
[2023-07-18 04:28:28.287] [CARTA] [info] 3 GB of reserved memory are available.
[2023-07-18 04:28:28.673] [CARTA] [info] Cache the whole cube image data.
[2023-07-18 04:28:28.681] [CARTA] [info] 2 GB of reserved memory are available.
[2023-07-18 04:28:28.935] [CARTA] [info] Cache the whole cube image data.
[2023-07-18 04:28:28.943] [CARTA] [info] 1 GB of reserved memory are available.

This seems not right. The actually RAM usage from the macOS activity monitor shows the right amount.
3. In the backend log, maybe we should show more digits to display the amount of available memory (eg 2.34 GB)
4. When we launch carta_backend with --reserved_memory flag, we should also add [info] log too.

Will have more tests soon.

@kswang1029
Copy link
Contributor

kswang1029 commented Jul 18, 2023

@markccchiang Overall, this does improve UX significantly in many I/O intensive features. @veggiesaurus please have a test as well if possible. We also need a strategy on RAM management for the controller deployment. Currently it is the backend flag --reserved_memory doing the trick. Not sure about the controller side. 🤔

@veggiesaurus
Copy link
Collaborator

We might also need the ability to control the total amount of memory that cached cubes can take up. For example, if I load 10 images each with 10 GB, would that allow it, and use up 100 GB of memory?

@kswang1029
Copy link
Contributor

kswang1029 commented Jul 18, 2023

We might also need the ability to control the total amount of memory that cached cubes can take up. For example, if I load 10 images each with 10 GB, would that allow it, and use up 100 GB of memory?

@markccchiang could you check if the current setup is for the total amount of caching memory, not for individual image? I also found that we need to improve the [info] logs when loading images that can or cannot be cached. The use case could be I reserved 4 GB as image cache. Then I load the following images in order

  1. a.fits 3 GB
  2. b.fits 2 GB
  3. c.fits 500 MB
    In this case, a.fits will be cached. b.fits will not. c.fits will be cached. In any case, we will need to have [info] log to be more informative on whether or not the image being loaded will be cached or not.

Copy link
Collaborator

@confluence confluence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a partial review, which includes some general remarks about the current structure of the code, and ways in which it could be refactored to divide responsibility between these classes more consistently.

void UpdateValidity(int stokes) override;

private:
bool FillCubeImageCache(std::unique_ptr<float[]>& stokes_data, int stokes);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be renamed to FillStokesCache.

@@ -0,0 +1,772 @@
/* This file is part of the CARTA Image Viewer: https://github.com/CARTAvis/carta-backend
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test file should be renamed (probably to TestImageCaches), and the class and test names should be updated to match the new names in the code.

Comment on lines 102 to 112
casacore::IPosition OriginalImageShape() const; // Image shape from the original file
size_t Width() const; // length of x axis
size_t Height() const; // length of y axis
size_t Depth() const; // length of z axis
size_t NumStokes() const; // if no stokes axis, number of stokes = 1
int XAxis() const;
int YAxis() const;
int ZAxis() const;
int StokesAxis() const;
int CurrentZ() const;
int CurrentStokes() const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to add all of these getters for consistency, even if though some of these values aren't currently accessed from outside the frame class, but it isn't necessary to use them inside the frame class. It's fine to use the bare properties (e.g. _width rather than Width()). This will also greatly simplify the diff of the frame class.

Comment on lines 212 to 248
casacore::IPosition start(_image_shape.size());
casacore::IPosition start(OriginalImageShape().size());
start = 0;
casacore::IPosition end(_image_shape);
casacore::IPosition end(OriginalImageShape());
end -= 1; // last position, not length

// Slice x axis
if (_x_axis >= 0) {
if (XAxis() >= 0) {
int start_x(x_range.from), end_x(x_range.to);

// Normalize x constants
if (start_x == ALL_X) {
start_x = 0;
}
if (end_x == ALL_X) {
end_x = _width - 1;
end_x = Width() - 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of where the properties should still be used (rather than the getters). All of these little changes should be reverted.

There are a few places in the original code where getters are used instead of properties, and for consistency I think that we should update them, but I would prefer to do that in a separate PR (to keep unrelated changes out of this PR).

Comment on lines 396 to 398
bool Frame::UpdateChannelCache() {
return _image_cache->UpdateChannelCache(CurrentZ(), CurrentStokes());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is now a one-liner, and only used in a few places in Frame, I would prefer to remove this wrapper function and call the function on the cache directly.

Comment on lines 47 to 51
float ChannelCache::GetValue(int x, int y, int z, int stokes) {
bool write_lock(false);
queuing_rw_mutex_scoped cache_lock(&_cache_mutex, write_lock);
return _channel_data[(_width * y) + x];
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is used in two different ways: from Frame, to access a single value (where this lock is required), and inside a loop in other cache functions (where this lock should not be used; the calling function should acquire the lock once).

But this is part of a broader issue with the way that the cache functions are currently used from frame, which I will describe in more detail below.

}

bool StokesCache::ChannelDataAvailable(int z, int stokes) const {
return (z == ALL_Z) && (stokes == _frame->CurrentStokes()) && _stokes_image_cache_valid;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely the (z == ALL_Z) check here is redundant? If the stokes matches, data is available for any z value.

Comment on lines 1055 to 1056
if (ImageCacheAvailable()) {
cursor_value_with_current_stokes = _image_cache->GetValue(x, y, CurrentZ(), CurrentStokes());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several places in Frame where the cache is used like this. It doesn't make sense to check first if the cache is valid and then acquire the cache lock and get the data. The check should be performed after the lock is acquired, in the same function as the data retrieval.

The correct thing to do is probably to refactor all instances like this so that the public function on the cache returns a boolean and sets an input/output parameter instead, checks internally whether the data is available, and returns false if it isn't. Some cache functions already have this kind of interface, but it's not consistent.

In this specific case, there should be two separate cache functions for getting a value: a public one, which acquires the lock and checks if the data is available, and a protected internal function which just fetches the data.

It would also be helpful to unify the behaviour of functions of data which fetch data only if it exists in the cache, and functions which update the cache if necessary and then fetch data. Currently the frame explicitly updates the channel cache in several places. This should be the internal responsibility of the cache -- the frame should only invalidate the cache when the channel or stokes changes. Functions for fetching data from the cache should have an update flag which determines whether the cache should be updated on demand when that function is called. The frame would then set that flag appropriately.

} else {
// Required stokes is not the current stokes or the stokes needs to be computed
if (ImageCacheAvailable(CurrentZ(), stokes)) {
get_spatial_profile_from_cache(stokes);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImageCacheAvailable is called a second time here inside get_spatial_profile_from_cache. But this is an example of logic which should be refactored.

Copy link

github-actions bot commented May 8, 2024

Code Coverage

Package Line Rate Health
src.Cache 50%
src.DataStream 44%
src.FileList 67%
src.Frame 44%
src.HttpServer 42%
src.ImageData 32%
src.ImageFitter 83%
src.ImageGenerators 43%
src.ImageStats 81%
src.Logger 37%
src.Main 53%
src.Region 71%
src.Session 4%
src.Table 52%
src.ThreadingManager 67%
src.Timer 85%
src.Util 41%
Summary 48% (9091 / 19126)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting code review For pull requests that require code review awaiting testing For pull requests that require testing enhancement New feature or request
Projects
No open projects
Status: PR test and review
Development

Successfully merging this pull request may close these issues.

None yet

6 participants