Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unload content of external tilesets to save memory usage #876

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

azrogers
Copy link
Contributor

@azrogers azrogers commented May 6, 2024

As described in #739, external tilesets are currently never unloaded by cesium-native. This results in memory accumulating as tiles are loaded, eventually causing the device to run out of memory. This change allows tiles loaded from external sources to be unloaded when they're no longer used, as long as all of their child tiles are also ready to unload.

@javagl
Copy link
Contributor

javagl commented May 11, 2024

Maybe it helps for debugging:

The tileset is here, for convenience:

externalsWithTransform 2024-05-11.zip

It looks like this:

screenshot

It consists of three external tilesets, each containing a unit cube.

  • The red one is at (3,0,0)
  • The green one is at (0,4,0)
  • The blue one is at (0,0,5)

The core of the CesiumCppMain in the linked state is

  runViewUpdate(tileset, {0.0, 0.0, 5.0}, {0.0, 0.0, -1.0});
  runViewUpdate(tileset, {3.0, 0.0, 0.0}, {0.0, 0.0, -1.0});

This means that it will...

  • Load that tileset
  • Look along -z
  • Move to (0,0,5) (which should see the blue cube)
  • Move to (3,0,0) (which should see the red cube)

The output will be

Run view update
  position : dvec3(0.000000, 0.000000, 5.000000)
  direction: dvec3(0.000000, 0.000000, -1.000000)
[2024-05-11 14:59:55.076] [info] [Utils.cpp:104] ViewUpdateResult:
          tilesToRenderThisFrame : 1
                                     ID unitCubeB.glb error 2 height 4
                    tilesVisited : 4
                     tilesCulled : 2
                 maxDepthVisited : 3

Run view update
  position : dvec3(3.000000, 0.000000, 0.000000)
  direction: dvec3(0.000000, 0.000000, -1.000000)
[2024-05-11 14:59:55.083] [info] [Utils.cpp:104] ViewUpdateResult:
          tilesToRenderThisFrame : 1
                                     ID unitCubeR.glb error 2 height 4
                    tilesVisited : 4
                     tilesCulled : 2
                 maxDepthVisited : 3

Showing that the tilesToRenderThisFrame indeed are the unitCubeB.glb and unitCubeR.glb, respectively.

Now, one can probably more easily call some clearChildren there, and set a breakpoint to figure out why something isn't working...

Copy link
Member

@kring kring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look and have a few comments below. I noticed that in this branch, tile content never seems to be getting unloaded at all. The immediate cause is that in Tileset::_unloadCachedTiles, the head of the _loadedTiles linked list is nullptr. The tail isn't nullptr, however, which is an invalid and nonsensical state. So the linked list is getting corrupted, possibly by the attempt to clear the _externalTilesPendingClear list by assignment (see below).

}

// Clear list of pending external tiles
this->_externalTilesPendingClear = Tile::LoadedLinkedList();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't do what you want. In fact, we should probably disable the assignment operator on this class. Take a look at how this LoadedLinkedList works. It's an "intrusive" linked list, where the links between nodes are actually stored in the data (Tiles) themselves. This means that a tile can only be in one list at a time. And clearing the tiles from a list requires traversing the entire list and setting the previous and next pointers in each tile to nullptr.

// Make sure we unload all children
gsl::span<Cesium3DTilesSelection::Tile> children = tile.getChildren();
for (Tile& child : children) {
this->unloadTileContent(child);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises some performance red flags, though I don't know exactly how to tell you to fix it. When we decide to unload a tile, we'll traverse all of its descendants in isReadyToUnload to see if it can unloaded. And if it can, we'll then unload its children (here), which will involve calling isReadyToUnload again on each child, which will again traverse the entire tree. A tile subtree can have thousands of tiles, and this is an n^2 operation in the number of tiles, so 😱.

// then they might get reloaded before we've had a chance to clear their
// children and cause an error. They'll get their children cleared and their
// state set to Unloaded before next clean up
tile.setState(TileLoadState::Done);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is safe. Setting it to Done will give the selection algorithm free reign to render this tile. Which will leads to holes. Perhaps Unloading?

@@ -65,6 +68,7 @@ class TilesetContentManager
void updateTileContent(Tile& tile, const TilesetOptions& tilesetOptions);

bool unloadTileContent(Tile& tile);
bool handleUpsampledTileChildren(Tile& tile);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use more descriptive names. "Handle" how?

@@ -145,6 +149,7 @@ class TilesetContentManager
int32_t _tileLoadsInProgress;
int32_t _loadedTilesCount;
int64_t _tilesDataUsed;
Tile::LoadedLinkedList& _loadedTiles;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dangerous. The TilesetContentManager can and sometimes does outlive the Tileset. So by having this reference here, we need to be careful that we'll never try to use it after the Tileset has been destroyed.

@azrogers azrogers marked this pull request as draft May 16, 2024 20:23
@azrogers
Copy link
Contributor Author

After consulting with the rest of the team, we're going to put this PR on hold for now as it's taking an unexpectedly large amount of effort to get working. In its current state the code works fine with @javagl's example above, and does load and unload Google P3DT when tested with CesiumGS/cesium-unreal#1415, but currently only loads a limited amount of tiles closest to the camera. In theory, any tile that's being visited as part of building the list of tiles to render should never be unloaded, but for some reason this only applies to the closest tiles and not to ones farther away. This likely has something to do with unloadTileContent and loadTileContent - for whatever reason, tiles outside of the immediate circle around the camera remain "Unloaded". Further investigation is needed, if anyone reading this wants to take a crack at it!

@javagl
Copy link
Contributor

javagl commented May 16, 2024

Further investigation is needed, if anyone reading this wants to take a crack at it!

I haven't beem following the details and evolution of cesium-native, and wouldn't claim to be able to even guess the reason for certain issues. But if it is possible to describe "a minimal tileset that is sufficient for testing (and debugging) this", I could try to create such a test tileset.

Comment on lines +1544 to +1552
// Don't clear the children of this tile next frame
this->_externalTilesPendingClear.remove(&tile);
if (tile.getState() == TileLoadState::Unloaded &&
!tile.getChildren().empty()) {
// We were going to clear this tile's children, but it's still in use, so we
// should restore it to Done instead.
tile.setState(TileLoadState::Done);
tile.setContentShouldContinueUpdating(false);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the cause of the missing tiles. tile.getState() == TileLoadState::Unloaded && !tile.getChildren().empty() will be true for a very large number of tiles before the first time they're loaded. Setting them to Done will ensure they never get loaded. Do you need to check that it's also in _externalTilesPendingClear maybe?

@kring
Copy link
Member

kring commented Jun 4, 2024

Let me try to describe the problem and possible solution, as I see it. Most of this probably isn't new to you at this point, but hopefully it helps to have it in one place.

Ok, so first things first, what are we actually trying to do? We're concerned with tiles which are the root of an external tileset. In other words, tiles for which tile.getContent().isExternalContent() returns true. Currently, we simply never unload the content for such tiles (TilesetContentManager::unloadTileContent exits early without doing anything when it sees external content). We'd like to be able to unload them.

But what does it mean to unload the content for such a tile? Usually, when we talk about unloading tile content, we're talking about clearing its _content field. For renderable tiles, this contains the tile's whole glTF, including geometry and textures, so it's pretty big. But for a tile with external content, it's actually pretty tiny. That TilesetMetadata instance is mostly empty in most cases, so it doesn't take up much memory at all.

However, when we load an external tileset, the parsed tileset.json is turned into a tree of Tile instances. Sometimes thousands of them. These tiles don't have content (we sometimes call them "skeletons"), but they still add up when we have a lot of them (which we very often do). So, in the same way we can load extra tiles from an external tileset.json, we'd like to later be able to unload (destroy!) these same tiles. This will save us a lot of memory, particularly with tilesets that use external tilesets heavily like Google Photorealistic 3D Tiles.

So to get started, we can simply allow TilesetContentManager::unloadTileContent to operate on tiles with external content, and we can implement the unload of external content to also clear the tile's _children vector.

(Note that we should never clear the _children vector in any other circumstances - that would just result in losing information that we couldn't easily recover. This is only safe to do for tiles where the children came from an external tileset because we can get them back just by loading that external tileset again.)

The problem with this simple approach is that clearing the _children vector invalidates pointers/references to every Tile in the vector, plus every Tile in those tiles' _children vectors, and so on. So before we can do that safely, we have to make sure that no one is holding pointers to any of those tiles.

So then the question becomes: where might we hold Tile pointers/references, and how can we detect this?

Here are the places I know about. I'm not sure if it's exhaustive, but hopefully it's close:

  1. The ViewUpdateResult, specifically tilesToRenderThisFrame and tilesFadingOut. We generally shouldn't need to worry about tilesToRenderThisFrame because any tile rendered this frame won't be unloading anyway, according to existing unload rules. In fact, if a particular tile deep down in the tree is rendered, all its ancestors will be prevented from unloading, too. So generally we don't need to worry about this one. tilesFadingOut, on the other hand, can happen over multiple frames and after the tile subtree isn't traversed at all. So we definitely need to check that one. Specifically, if any Tile in the external tileset is in this list, we can't unload the external tileset yet.
  2. The load pipeline, initiated at TilesetContentManager::loadTileContent. A Tile reference is put into TileContentLoadInfo. So if any Tile in the external tileset is loading, we can't unload the external tileset yet. We can determine this by looking at Tile states. The Unloaded, ContentLoaded, Done, and Failed states are safe. All others are not.
  3. A Tile that is "upsampling" from a parent Tile holds a pointer to the parent tile, so the parent tile (and its external tileset) can't be unloaded. We can determine if a Tile is upsampling from this Tile with the logic here.

Checking all of these things can be expensive, especially because we need to check every tile in an external tileset in order to determine if we can unload it. That can easily be thousands of tiles. So one idea is keep track of unload safety all along rather than computing it when we need it. Here's how I think I'd approach that (warning: this may not be a fully baked idea):

Add doNotUnloadCount to Tile. Every time we enter a state that forbids unloading (1-3 above), we increment this count. Then we also increment the same counter on all of the Tile's parent Tiles, all the way to the root. When we leave the state that forbids unloading, we decrement the counter (and also do the same for all ancestors). A single Tile can be in multiple unload-forbidding states simultaneously, such as if it's in tilesFadingOut and is also an upsampling source. If that's done correctly, checking if we can unload an external tileset is a simple matter of checking whether the root Tile of the external tileset has a doNotUnloadCount of zero.

Making sure these reference counts are prefectly balanced can be tricky and error prone, so a debug mode that tracks not just the count but also the source of the count (perhaps as a descriptive string) can be helpful.

I haven't thought of the flaw with this plan yet (I'll update if I do), so hopefully it'll work well, and I don't think it'll be too difficult to implement, either. There are probably other reasonable solutions as well.

@kring
Copy link
Member

kring commented Jun 4, 2024

What I've called doNotUnloadCount above is really more like doNotUnloadSubtreeCount. It may be safe to unload renderable content when this count is non-zero (subject to the rules around this that we already have), because unloading renderable content doesn't invalidate any Tile pointers. But it's not safe to unload an external tileset when this count is greater than 0.

@javagl
Copy link
Contributor

javagl commented Jun 4, 2024

I'm watching this, from a huge distance (from on an island in Indonesia, so to speak), and see quite some "reference counting" going on in various forms. And I'm wondering how much code (and how many bugs) could be avoided by just saying: Let's handle each Tile as a std::shared_ptr. Yes, this has to do this ~"expensive reference counting", and "... performance...", but ... apparently, reference counting is necessary, and doing it manually (with different places that increase/decrease the 'reference count', and different threads being involved) is error prone.

(I'm just wondering. I know, it won't happen. So 🤞 that you can sort this out in another way. But don't forget to add comments accordingly, e.g. in ViewUpdateResult, saying things like: "Thou shalt not store these Tile* pointers, because the ownership and lifetime of these is not specified")

@kring
Copy link
Member

kring commented Jun 4, 2024

Even putting aside costs, using a std::shared_ptr wouldn't work here. We can't destroy individual tiles, we only want to destroy entire subtrees loaded from external tilesets. So it is reference counting, but even though Tile stores the count, it's not really the individual Tile instances we're controlling the lifetime of. With std::shared_ptr, the thing counted and the thing whose lifetime is controlled are always one and the same.

@timoore
Copy link
Contributor

timoore commented Jun 4, 2024

Even putting aside costs, using a std::shared_ptr wouldn't work here. We can't destroy individual tiles, we only want to destroy entire subtrees loaded from external tilesets. So it is reference counting, but even though Tile stores the count, it's not really the individual Tile instances we're controlling the lifetime of. With std::shared_ptr, the thing counted and the thing whose lifetime is controlled are always one and the same.

I don't understand external tilesets well, but if the problem is that the lifetime of the external tileset structure is different than the individual tiles in it, then it seems that a shared_ptr would work. You don't really care about the precise reference count, just whether or not any references to a tile are held.

As an old Lisp guy, I get a bit jaded about performance and shared_ptr, putting complaints in the same category as complaints about garbage collection performance. If you need to track object lifetimes, then you need to pay for it somewhere. Was shared_ptr deemed to be too expensive at some point in Cesium Native's history? If so, I'd look at nearby solutions like std::make_shared and intrusive pointers before tossing the whole idea.

@javagl
Copy link
Contributor

javagl commented Jun 4, 2024

Some of this may be too unrelated to this PR (and a too large topic in general) to be discussed in this PR. I just threw it in here, because of the difficulties of the whole "lifetime and ownership" question that seems to be at the core here, and the attempt to solve it with manual reference counting is bound to involve intricate handling of the reference count, and will require deeeep knowledge about what is happening where on which thread.

(An aside: I also doubt that replacing manual reference counting with a shared_ptr would have a prohibitively large performance impact. In fact, I'm pretty sure that it would not have any measurable impact at all. But there's no point in arguing about this without benchmarking. So right now, I cannot contribute to a profound discussion. For that, I'd have to start with basic things, like ... figuring out what the difference between a ReferenceCounted and a shared pointer is to begin with...)

A few abstraction levels higher: I think that the "lifetime and ownership" question that is hidden in the point about 'individual tiles vs. subtrees' is crucial. But I assume that the state space, state transitions, lifetimes, and ownerships will raise enough questions to be documented incrementally. One overly specific example is that I assume that ViewUpdateResult will eventually contain a hint saying something like "The Tile* pointers in the tilesToRenderThisFrame may become invalid at time X under condition Y. Clients may not store these pointers"...

(The question about "tile-vs-subtree" may, at the bottom, involve seemingly trivial aspects like the std::vector<Tile> _children; not being a std::vector<std::shared_ptr<Tile>> _children;, but if the question of "Where to use which kind of pointer" is supposed to be sorted out more holistically, it should be done in a dedicated issue...)

@kring
Copy link
Member

kring commented Jun 4, 2024

I don't understand external tilesets well, but if the problem is that the lifetime of the external tileset structure is different than the individual tiles in it, then it seems that a shared_ptr would work.

There is no external tileset structure. Well, there's TileExternalContent, but there are no Tiles in it (just metadata).

It would be interesting to consider moving to a model where there's an explicit representation of an external tileset, and it owns all of the Tile instances within it. Perhaps TilesetJsonLoader is a start toward that.

But tackling that as part of this PR definitely risks turning a relatively focused (if slightly tricky and possibly error-prone) task into a major refactoring effort.

Was shared_ptr deemed to be too expensive at some point in Cesium Native's history?

No, we use shared_ptr. But not for tiles. Not so much because of the cost of shared_ptr, but because of the cost of heap allocating every Tile individually.

I'm aware of std::make_shared and we always use it. It cuts the heap allocations in half, but that's still a huge number of heap allocations in the case of Tiles.

We also have CesiumUtility::IntrusivePointer for things that are always reference counted, don't need weak_ptr, and benefit from a little bit cheaper reference counting.

@kring
Copy link
Member

kring commented Jun 5, 2024

the attempt to solve it with manual reference counting is bound to involve intricate handling of the reference count, and will require deeeep knowledge about what is happening where on which thread.

cesium-native's rule is that Tile instances can only ever be accessed from the main thread (or the thread that created them if they haven't been passed to the main thread yet, i.e., during loading). So there's no tricky threading issue here, at least.

A non-owning smart pointer class might be a nice elegant way to solve the "manual" part of the reference counting, though. On construction with a Tile*, it increments doNotUnloadSubtreeCount for the tile and all its ancestors. On destruction, it decrements the same. Then anything that holds a Tile pointer outside the normal flow and needs to make sure the TIle doesn't disappear (such as TileLoadInput) just has to use this smart pointer class.

I also doubt that replacing manual reference counting with a shared_ptr would have a prohibitively large performance impact. In fact, I'm pretty sure that it would not have any measurable impact at all. But there's no point in arguing about this without benchmarking.

Yeah and I agree with you on this point anyway. As I mentioned above, it's not the reference counting part that's costly, it's the memory allocation. I don't think I need to do a benchmark to be convinced that heap-allocating every Tile would be a mistake, though. In fact, I bet we'd see a small but measurable performance improvement if we could avoid heap-allocating (via a std::vector) each block of child Tiles, as we're currently doing. Avoiding unnecessary heap allocations is just good practice, particularly in code where performance matters. It's even in the C++ Core Guidelines. I'm not saying we have to obsess over it of course, sometimes dynamic allocation is the best solution.

I'd have to start with basic things, like ... figuring out what the difference between a ReferenceCounted and a shared pointer is to begin with...)

ReferenceCounted is an optional, convenience base class that can be inherited from in order to easily enable a particular type for use with CesiumUtility::IntrusivePointer.

I could try to explain the tradeoffs of IntrusivePointer versus shared_ptr, but I'll just refer you to Boost's documentation on their intrusive pointer to start (not exactly the same, but same idea and same tradeoffs):
https://www.boost.org/doc/libs/1_85_0/libs/smart_ptr/doc/html/smart_ptr.html#intrusive_ptr

But I assume that the state space, state transitions, lifetimes, and ownerships will raise enough questions to be documented incrementally.

Yes, no question, we need to be clear about the lifetime rules. We've gotten away with this with Tiles until now because Tiles, once created, were never destroyed, other than when the entire Tileset was destroyed.

@javagl
Copy link
Contributor

javagl commented Jun 5, 2024

cesium-native's rule is that Tile instances can only ever be accessed from the main thread (or the thread that created them if they haven't been passed to the main thread yet, i.e., during loading). So there's no tricky threading issue here, at least.

That seems to be focussed on the usage side (i.e. what is supposed to be visible for clients). One of my concerns about manual reference counting referred to the guts of the implementation. I think that it could easily happen that someone is doing something on some thread and this affects the reference counter. This does not refer to something explicit and obvious like

something.thenInWorkertThread([...] {
    someTileThatSomeoneGotFromSomewhere.decreaseReferenceCount();
}

that could easily spotted during a review. It rather referred to the "implicit" changes, in scenarios like

  • an issue is caused: Some resource or memory is not properly cleaned up
  • it has to be fixed: When a certain function is called, then something else has to happen, to "clean up/release" something
    • (This will affect some reference counter, down the call chain, which may not be obvious)
  • it is not clear or obvious which thread is calling the function, and/or which thread is allowed to touch the reference counter

With very hazy memories about some details, but roughly looking at your list above, there may be something like

Tile::upsample() {
    parent.increaseReferenceCount();
    ....
}

Which thread is calling 'upsample'? You probably know that. You could claim that "this sort of operation is always done by thread X". But it's probably not documented (and even less enforced) explicitly.

Now, there has to be a matching parent.DEcreaseReferenceCount(); call somewhere. This could be part of some finishedUpsampling() method or so. But where it's not clear or obvous which thread is calling this method either.

(All this could probably be summarized with: "See ReferenceCountedThreadSafe vs ReferenceCountedNonThreadSafe" 😉 )


The Boost documentation does not go into details about reasons for using intrusive_ptr (but mentions that one should start with shared_ptr by default). And the reasons that I found by 5 minutes of websearching always seem to refer to "performance". I know that there are general rules and best practices, and some design decisions on a high level (~"preferring stack over heap") that have to be carefully considered in view of their implications on performance and (IMHO more importantly:) lifetime of objects.

But as a baseline: I doubt any performance claim that is made without a benchmark. And when there is a benchmark, then I doubt that benchmark 😈 . Quickly looking at the first, naive, shallow search results like https://stackoverflow.com/questions/14837829/intrusive-ptr-shared-ptr-performance-tests or https://groups.google.com/g/benchmark-discuss/c/uENrnU4J3AM :

  • they are sometimes pretty old (comparing Boost intrusive_ptr to Boost shared_ptr might have made sense 10 years ago, but now there is std::shared_ptr, which may have wildly different performance characteristics...)
  • they are varying wildly in terms of the performance/timing claims that they make
  • important: they seem to be run single-threaded, meaning that they may be "distorted" when it comes to the impact of thread safety on overall performance of a multi-threaded application

(Whether or not intrusive_ptr is thread safe to begin with? I don't know. It can probably be derived from https://www.boost.org/doc/libs/1_85_0/libs/smart_ptr/doc/html/smart_ptr.html#intrusive_ref_counter , but I didn't look into the details....)

One reason why I didn't look into all the details: Imagine there is a Tile that has to be created, and this can be a shared_ptr and be allocated on the heap and require atomic access for reference counting. All this is expensive. But then this Tile is filled with life, by doing a network request and donwloading 5 MB of data that has to be processed and uploaded to the GPU. I'm convinced that even the most naive approach for managing tiles will (in a real application, i.e. outside the scope of a microbenchmark) change the "average FPS over 100 frames" from 42.123 to something like 41.987 or so.

Proving or disproving this with a benchmark could be difficult: One can not just introduce some #define USE_APPROACH_X true somewhere and run both versions. The structural changes all over the codebase (implied by the change from "manual reference counting and intrusive pointers" to "naive, blanket std::shared_ptr everywhere") would be so large that the results could hardly be compared objectively. But even in a very narrow scope: If there was a measurable performance difference between Tile::_children being a vector<Tile> and a vector<shared_ptr<Tile>>, then I would at least consider this as "evidence". But I'm not sure whether it's worth trying that out.

@kring
Copy link
Member

kring commented Jun 5, 2024

That seems to be focussed on the usage side (i.e. what is supposed to be visible for clients). One of my concerns about manual reference counting referred to the guts of the implementation. I think that it could easily happen that someone is doing something on some thread and this affects the reference counter.

I'm talking about implementation, too. What you describe can't happen as long as we follow our rule of never passing a Tile to a worker thread. That's why we have that rule.

Which thread is calling 'upsample'?

Upsampling is done in a worker, but it does not require access to a Tile.

The glTF, however, is made available to the upsample process, which is ugly and dangerous but necessary.

The Boost documentation does not go into details about reasons for using intrusive_ptr

It lists three reasons:

  • Some existing frameworks or OSes provide objects with embedded reference counts;
  • The memory footprint of intrusive_ptr is the same as the corresponding raw pointer;
  • intrusive_ptr can be constructed from an arbitrary raw pointer of type T*.

shared_ptr is quite complicated, by necessity. Primarily because of two design decisions: the count being stored separately from the object being controlled, and the availability of weak_ptr. An intrusive pointer is simpler. That makes it smaller and faster.

One reason why I didn't look into all the details: Imagine there is a Tile that has to be created, and this can be a shared_ptr and be allocated on the heap and require atomic access for reference counting. All this is expensive. But then this Tile is filled with life, by doing a network request and donwloading 5 MB of data that has to be processed and uploaded to the GPU. I'm convinced that even the most naive approach for managing tiles will (in a real application, i.e. outside the scope of a microbenchmark) change the "average FPS over 100 frames" from 42.123 to something like 41.987 or so.

You missed a critical detail. Tile instances are created in bulk, sometimes many thousands at a time, and only a small fraction of those, sometime later, may have content loaded for them.

As a completely unsurprising (at least to me) point of reference, at one point I tried to switch the Tile loading code from its current approach, which reads the tile hierarchy from a tileset.json into a RapidJSON Document and then creates Cesium3DTilesSelection::Tile instances for each tile represented in the Document, to a slightly different approach that reads a Cesium3DTiles::Tileset and then creates Cesium3DTilesSelection::Tile instances from those. It took somewhere around twice as long to load the tile hierarchy, adding as much as a full second to the initial load time of a Tile-rich tileset like Melbourne Photogrammetry. Why? The JSON reader for Cesium3DTiles::Tileset should be extremely efficient. However, Cesium3DTiles::Tileset uses vectors of Tiles (just like Cesium3DTilesSelection::Tile), which requires a lot of heap allocations. Heap allocations are expensive.

Believe it or not, I'm not some performance zealot trying to make your life hard by imposing questionable rules without evidence. 😀

@javagl
Copy link
Contributor

javagl commented Jun 5, 2024

Believe it or not, I'm not some performance zealot trying to make your life hard by imposing questionable rules without evidence. 😀

I assume that you did tests and comparisons (beyond what you already know), and the point about "many tiles being created at once (and sometimes only few of them being filled later)" is certainly a valid one: It sounds like the two factors here are "multiplicative": "Heap-vs-Stack" (with the first one being slower) and "Individual-vs-Bulk" (with the first one being slower).

I'm lacking some context, but

Cesium3DTiles::Tileset uses vectors of Tiles ...

I don't see a vector<Tile> there, and if this referred to the Tile::children, I'd have to think through why this makes a difference. A guess: Could collecting these Cesium3DTIles::Tile instances (in one huge vector) and then bulk-creating the matching Cesium3DTilesSelection::Tile instances from them have been a reasonable attempt for alleviating the problem?

And an aside: I assume that this experiment was in the context (or at least related to) the consideration to replace the whole third-party-JSON parsing with the auto-generated cesium-native JSON handling. A devil's advocate question could be: If the auto-generated structures do not allow an implementation that is as efficient as manual JSON twiddling, what would have to be changed in order to allow such an efficient implementation?


In my naive, convenient Java world, there is no real "stack", and nothing that is not a "shared pointer", so to speak. And that may be a reason for me to (always keep performance in mind, but) lean towards the "safe and simple" approach, Often, time is wasted in unexpected places. For example, for that JSON parsing layer: I've heard rumors about ~20 MB tileset.json files, where questions like "SAX-vs-DOM" and maybe even "indented-vs-non-indented" may come into play, and the overarching question would be "Couldn't you split that into '1MB + (19 x 1MB)' tilesets?" ... with the latter being external ones... leading back to the purpose of this PR (sorry for derailing that a bit...).

@kring
Copy link
Member

kring commented Jun 8, 2024

if this referred to the Tile::children

Yep, that's the vector of tiles I'm referring to.

A guess: Could collecting these Cesium3DTIles::Tile instances (in one huge vector) and then bulk-creating the matching Cesium3DTilesSelection::Tile instances from them have been a reasonable attempt for alleviating the problem?

Yes, that would help significantly. Switching instead to a std::vector<any_kind_of_pointer<Tile>> would be a significant step in the wrong direction, even if we also moved to a single vector to hold all the tiles.

In theory it's possible to use a custom allocator to get std::vector<any_kind_of_pointer<Tile>> semantics while keeping close to std::vector<Tile> performance, but it's tricky. Particularly if the reason you're doing that is so that you can allocate/free individual Tiles whenever you like.

I assume that this experiment was in the context (or at least related to) the consideration to replace the whole third-party-JSON parsing with the auto-generated cesium-native JSON handling.

Yes that's right.

If the auto-generated structures do not allow an implementation that is as efficient as manual JSON twiddling, what would have to be changed in order to allow such an efficient implementation?

It's not that "manual JSON twiddling" is faster. It's that RapidJSON Document is obsessively focused on avoiding heap allocations. Basically, it's fast in all the ways the naive implementation is not. And that makes it win even though it requires some manual JSON twiddling.

In my naive, convenient Java world, there is no real "stack", and nothing that is not a "shared pointer", so to speak.

Interestingly, one of the advantages of having a garbage collector - especially a compacting, generational one - is that heap allocations are much faster.

@javagl
Copy link
Contributor

javagl commented Jun 8, 2024

Interestingly, one of the advantages of having a garbage collector - especially a compacting, generational one - is that heap allocations are much faster.

The other "advantage" is that as a developer, you don't have to give a ... ... care about why this is the case 🙂
Even though it might be a disadvantage now - I suspect this is related to fragmentation and the bookkeeping that is necessary for non-GC'ed heap allocation. As a "Java fanboi" (looking at my name here...) I did dig far deeper into the details of the language and JVM than most other developers, but eventually, the memory management and GC has to be treated as a "black box" (and its details may change over time). Things like https://softwareengineering.stackexchange.com/a/208676 seem to be reasonable summaries, though...


It's surprisingly hard to find concise, reliable information about the performance differences of heap-vs-stack in C++.

(By "hard" I mean that there doesn't seem to be much in the first 10 search results. By "concise" I mean that there are things like stack overflow answers that talk about this, and why one is faster than the other, but not how much. And by "reliable" I mean that https://publicwork.wordpress.com/2019/06/27/stack-allocation-vs-heap-allocation-performance-benchmark/ is easy to find and concise, but too narrowly focused on a too artifical microbenchmark).

One approach for avoiding the heap costs could be some sort of "pool" - that's probably what you referred to with..

In theory it's possible to use a custom allocator to get std::vector<any_kind_of_pointer<Tile>> semantics while keeping close to std::vector<Tile> performance, but it's tricky. Particularly if the reason you're doing that is so that you can allocate/free individual Tiles whenever you like.

And the trickiness obviously is: What can you do manually that is faster than what the new/delete operators are doing?
The first thought would be: Not much, probably. Except for very narrow, special cases of allocation/deallocation patterns. But... I could imagine that for Tile, we do have such a narrow case: Tiles form a hierarchy, and I could imagine that the constraints and patterns that are implied by having a tree structure could allow for specialized/optimized implementations of such a "pool". (Not trivial by any means, and ... even more unrelated to this PR than much of the previous disucssion, but a thought...)

@kring
Copy link
Member

kring commented Jun 8, 2024

And the trickiness obviously is: What can you do manually that is faster than what the new/delete operators are doing?

Yep. An allocator for a single type of object is a bit simpler than a general one, so that helps. It's also possible to do the sort of tricks that RapidJSON does: https://rapidjson.org/md_doc_internals.html#MemoryPoolAllocator
So something like: give each tileset its own pool for allocating Tile instances, and once memory is allocated from that pool, never return it. The pool itself can be freed by unloading the entire tileset. (by tileset I also mean external tileset, so external tileset subtrees can be unloaded without unloading the entire top-level tileset)

Actually now that I'm looking at the stackoverflow link you posted above, it actually kind of says what I just wrote above better than I did.

By the way, you don't have to convince me that good garbage collectors are awesome. I completely agree.

@javagl
Copy link
Contributor

javagl commented Jun 8, 2024

From a naive, high-level point, this now sounds like something that could actually be ... "less non-trivial" ... than many other forms of pooling: Storing the tile.children as a std::vector<Pointer<Tile>>, where these are just pointing into one huge Tile array that is maintained in the tileset. But one would certainly have to think about some details here. For example, conditions for "compacting/defragmenting" this array. Unloading some external tileset might be one of them, but depending on how this "mapping" of "pointers to array entries" is established and maintained, the challenge would then be to do this without causing the usual "GC compacting pauses" that can easily ruin every interactive experience...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants