Possible memory leak on Windows when using fibers #740

Open
shadowndacorner opened this issue Feb 28, 2024 · 0 comments

I recently added Tracy to my game engine, which uses fibers as an optional part of its job system (simple jobs are executed directly on the worker threads, whereas interruptible jobs are executed on fibers). Shortly after doing so, I started noticing what looked like a very difficult-to-track-down memory leak that seemed to occur only when fiber jobs were used. After several days of experimentation (including switching from boost::context to WinFibers to sanity-check that my usage of boost::context wasn't causing the leak), I realized that if I disabled Tracy by not defining TRACY_ENABLE, no such leaks occurred.

The leak seems to happen whenever a fiber switch occurs, but I could be wrong about that: actually tracking down the allocations has proven extremely difficult, as they don't seem to show up in Visual Studio's memory profiler. I've just observed that the more fiber jobs I have in flight, the faster memory leaks. In extreme cases (hundreds or thousands of fiber jobs per frame), it leaks hundreds of megabytes per second (typically running between 60-300 Hz depending on the number of fiber jobs). Performance degrades substantially when the Tracy client is connected (which seems to be a known issue with fiber profiling), but the leak occurs regardless of whether the client is connected. Note that I'm using TRACY_ON_DEMAND as well, and no leak occurs if I never launch any fibers.

The following is my current fiber calling and switching code (still using WinFibers), in case I'm simply instrumenting it incorrectly:

static void switch_to_fiber_handle(fiber_handle_t target)
{
    move_win_fiber_set_current(target);
    // Announce the target fiber to Tracy before the actual switch;
    // we are still running on the same OS thread at this point.
    TracyFiberEnter(get_fiber_name(target));
    SwitchToFiber(target->fiber);
}

static fiber_handle_t yield_to_caller_fiber(fiber_handle_t target)
{
    switch_to_fiber_handle(target);
    return move_win_fiber_get_current();
}

static void call_fiber(fiber_handle_t target)
{
    auto old_caller = move_win_fiber_get_calling();
    move_win_fiber_set_calling(move_win_fiber_get_current());
    switch_to_fiber_handle(target);

#if defined(TRACY_ENABLE)
    // Only mark the fiber as left once its work has finished
    // (see the question about TracyFiberLeave placement below).
    if (is_fiber_complete(target))
    {
        TracyFiberLeave;
    }
#endif
    move_win_fiber_set_calling(old_caller);
}

static void CALLBACK fiber_entrypoint(void* data)
{
    auto self_fiber = static_cast<allocated_fiber_t*>(data);

    while (true)
    {
        if (self_fiber->func)
        {
            MOVE_ASSERT_MSG(
                move_win_fiber_get_calling(), "Calling fiber is null");
            MOVE_ASSERT_MSG(move_win_fiber_get_calling()->fiber,
                "Calling fiber's fiber is null");
            self_fiber->is_running = true;
            self_fiber->ready_for_switch = true;

            self_fiber->func(self_fiber->userdata);
            self_fiber->is_running = false;
            self_fiber->func = 0;
            self_fiber->userdata = 0;
        }
        self_fiber->ready_for_switch = false;
        self_fiber =
            yield_to_caller_fiber(move_win_fiber_get_calling());
    }
    self_fiber->fiber = 0; // note: unreachable as written; the loop above never exits
}
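
For reference, here is a bare-bones version of the enter/leave pairing as I understand it from the manual, stripped of my job system (the fiber name and the 64 KiB stack size are placeholders, and I'm assuming TRACY_FIBERS is defined so the fiber macros exist):

#include <windows.h>
#include <tracy/Tracy.hpp>

static void* g_worker_fiber; // the worker thread, converted to a fiber

static void CALLBACK job_fiber_proc(void*)
{
    // ... job body runs here, attributed to the "job" fiber ...
    TracyFiberLeave;               // the thread is about to stop executing this fiber
    SwitchToFiber(g_worker_fiber); // yield back to the worker
}

static void run_one_job()
{
    g_worker_fiber = ConvertThreadToFiber(nullptr);
    void* job = CreateFiber(64 * 1024, job_fiber_proc, nullptr);

    TracyFiberEnter("job"); // same OS thread, so this annotates the upcoming switch
    SwitchToFiber(job);     // returns once job_fiber_proc switches back

    DeleteFiber(job);
    ConvertFiberToThread();
}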

The following test also segfaults when Tracy is enabled, but not when it is disabled. I'm assuming the segfault is related to whatever is causing the leak; I have not been able to reproduce a segfault outside of pathological cases similar to the following test. In a somewhat more reality-adjacent stress test, where I spawn 5000 dummy fiber jobs every frame, the leak is proportional to the number of jobs dispatched (which, I'll note, is unrelated to the number of fibers actually created, because fibers are reused).

WHEN("A fiber is set up that calls into several nested fibers")
{
    REQUIRE_NOTHROW(run_on_fiber(
        [&]()
        {
            INFO("Starting nested fiber test");

            constexpr static int num_fibers = 100;
            move::vector<fiber_handle_t> fibers;
            for (int fiberid = 0; fiberid < num_fibers; ++fiberid)
            {
                REQUIRE_NOTHROW(fibers.push_back(start_on_fiber(
                    [&, fiberid]()
                    {
                        INFO("Started nested fiber " << fiberid);
                        for (int i = 0; i < 10; ++i)
                        {
                            INFO("Yielding nested fiber " << fiberid << " ("
                                                            << i << ")");
                            REQUIRE_NOTHROW(yield());
                            INFO("Returned control to nested fiber "
                                    << fiberid << " (" << i << ")");
                        }
                        INFO("Completed nested fiber " << fiberid);
                    })));
                REQUIRE(!is_fiber_ready_to_start(fibers.back()));
                REQUIRE(is_fiber_executing_work(fibers.back()));
                REQUIRE(!is_fiber_complete(fibers.back()));
            }

            INFO("Validating nested fibers");
            for (auto& it : fibers)
            {
                REQUIRE(it);
            }

            INFO("Stepping nested fibers to completion");

            bool were_any_incomplete = true;
            while (were_any_incomplete)
            {
                were_any_incomplete = false;
                for (auto& it : fibers)
                {
                    if (!is_fiber_complete(it))
                    {
                        were_any_incomplete = true;
                        REQUIRE_NOTHROW(resume_fiber(it));
                    }
                }
            }

            for (int i = 0; i < 10; ++i)
            {
                INFO("Releasing nested fiber " << i);
                REQUIRE_NOTHROW(release_fiber(fibers[i]));
            }
        }));
}

// In fiber.hpp, for context
template <typename F>
inline void run_on_fiber(F&& f)
{
    fiber_handle_t handle = allocate_fiber();
    setup_fiber_function(
        handle,
        [](void* data)
        {
            // The pointer stays valid because run_on_fiber drives
            // the fiber to completion before returning.
            auto f = reinterpret_cast<F*>(data);
            (*f)();
        },
        &f);
    start_fiber(handle);
    while (!is_fiber_complete(handle))
    {
        step_fiber(handle);
    }
    release_fiber(handle);
}

template <typename F>
inline fiber_handle_t start_on_fiber(F&& f)
{
    fiber_handle_t handle = allocate_fiber();
    setup_fiber_function(
        handle,
        [](void* data)
        {
            // Copy the functor immediately; the caller's f only
            // lives until start_on_fiber returns.
            auto f = *reinterpret_cast<F*>(data);
            f();
        },
        &f);
    start_fiber(handle);
    return handle;
}

My gut tells me that I'm misusing TracyFiberLeave in some way, but the documentation is a bit unclear as to whether it should be called only when a fiber yields back to the calling thread, or also whenever a fiber's work completes. I've tried it both ways and, assuming I didn't screw something up, both seem to exhibit the same behavior, which makes me question whether the leak is inherent to how Tracy currently tracks fibers.
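
For concreteness, the two placements I tried look roughly like this, reusing the helpers from the snippets above (sketches, not my exact code):

static void call_fiber_leave_always(fiber_handle_t target)
{
    TracyFiberEnter(get_fiber_name(target));
    SwitchToFiber(target->fiber);
    TracyFiberLeave; // reading 1: the thread has left the fiber, full stop
}

static void call_fiber_leave_on_completion(fiber_handle_t target)
{
    TracyFiberEnter(get_fiber_name(target));
    SwitchToFiber(target->fiber);
    if (is_fiber_complete(target))
    {
        TracyFiberLeave; // reading 2: only once the fiber's work is done
    }
}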

Any guidance you could provide would be appreciated, as would confirmation that this is a bug in Tracy rather than in my code!
