Finer control over GPU memory (access to cu*Destroy from Python) #370

Open
CarlosRDomin opened this issue Aug 8, 2022 · 4 comments

CarlosRDomin commented Aug 8, 2022

Is your feature request related to a problem? Please describe.
We recently identified a (GPU) memory leak in a routine that creates a new CUDA stream on a given context every time it is called. Even though the stream variable is scoped to that function, it doesn't seem to be garbage collected. Every N calls to the function, the total GPU memory allocated (measured through mem_get_info() or nvidia-smi) increases by a small amount (e.g. 2 MB with N=16). After enough calls, our GPU's memory is completely exhausted. See the Additional context section for a minimal example that reproduces this.

Describe the solution you'd like
I'd like the destructor of Stream objects to be exposed (e.g. through an explicit destroy method) so that we can explicitly free that memory, or alternatively a mechanism that ensures streams that go out of scope are automatically freed.

Describe alternatives you've considered
So far, the only approach that seems to get the job done (although it is not ideal) is to detach the context and create a new one every once in a while. However, this comes with latency and performance penalties, and it means reinitializing any other tasks that were running in that context.
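
A rough sketch of this workaround (the function name and exact cleanup order are illustrative; it assumes the context being replaced is current on the calling thread, as in the example below):

from pycuda import driver as cuda

def recreate_context(old_ctx, device_ordinal=0):
    # Tear down the old context to reclaim everything allocated in it, then
    # start over. Streams, modules and allocations tied to the old context
    # must be re-created by the caller afterwards.
    old_ctx.pop()     # deactivate the context on this thread
    old_ctx.detach()  # drop pycuda's reference so the context gets destroyed
    return cuda.Device(device_ordinal).make_context()  # creates and pushes a fresh context

# usage: ctx = recreate_context(ctx)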

Additional context
Minimal example to reproduce:

import gc
from pycuda import driver as cuda
cuda.init()
ctx = cuda.Device(0).make_context()

def create_streams(N=16):
    return [cuda.Stream() for _ in range(N)]

def print_memory():
    #ctx.push()
    free, total = cuda.mem_get_info()
    print(f"Used: {(total - free) / (1024**2)}MB")
    #ctx.pop()

print_memory()  # x MB
# Allocate some streams
streams = create_streams(N=10)
print_memory()  # x+2 MB
# Allocate some more
streams.extend(create_streams(N=20))
print_memory()  # x+4 MB
# Delete all streams
del streams
gc.collect()  # Even trying to force the garbage collector won't do anything
print_memory()  # Still x+4 MB instead of x MB :(

ctx.pop()

prints out:

Used: 5651.25MB
Used: 5653.25MB
Used: 5655.25MB
Used: 5655.25MB
inducer (Owner) commented Aug 9, 2022

Thanks for this request! I would be happy to consider a PR that adds this. This should be simple to do. Within the stream object here, simply track whether the stream has already been destroyed, and make sure to raise an appropriate exception when handle is called:

pycuda/src/cpp/cuda.hpp

Lines 994 to 1041 in db77dc7

class stream : public boost::noncopyable, public context_dependent
{
  private:
    CUstream m_stream;

  public:
    stream(unsigned int flags=0)
    { CUDAPP_CALL_GUARDED(cuStreamCreate, (&m_stream, flags)); }

    ~stream()
    {
      try
      {
        scoped_context_activation ca(get_context());
        CUDAPP_CALL_GUARDED_CLEANUP(cuStreamDestroy, (m_stream));
      }
      CUDAPP_CATCH_CLEANUP_ON_DEAD_CONTEXT(stream);
    }

    void synchronize()
    { CUDAPP_CALL_GUARDED_THREADED(cuStreamSynchronize, (m_stream)); }

    CUstream handle() const
    { return m_stream; }

    intptr_t handle_int() const
    { return (intptr_t) m_stream; }

#if CUDAPP_CUDA_VERSION >= 3020
    void wait_for_event(const event &evt);
#endif

    bool is_done() const
    {
      CUDAPP_PRINT_CALL_TRACE("cuStreamQuery");
      CUresult result = cuStreamQuery(m_stream);
      switch (result)
      {
        case CUDA_SUCCESS:
          return true;
        case CUDA_ERROR_NOT_READY:
          return false;
        default:
          CUDAPP_PRINT_ERROR_TRACE("cuStreamQuery", result);
          throw error("cuStreamQuery", result);
      }
    }
};
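
At the Python level, the resulting behavior might look something like this; the destroy method and the exact exception type are just for illustration, since neither exists in pycuda today:

from pycuda import driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()

stream = cuda.Stream()
stream.destroy()   # hypothetical: calls cuStreamDestroy right away and marks the object dead

try:
    stream.handle  # after an explicit destroy, asking for the handle would raise
except cuda.LogicError:  # exact exception type is an assumption
    print("stream was already destroyed")

ctx.pop()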

CarlosRDomin (Author) commented:

After some more investigation, I was actually able to identify the cause of our memory leak! Perhaps this is already documented somewhere, but: the relevant CUDA context was not current (pushed) on the thread where Python's garbage collector deleted the stream, and therefore cuStreamDestroy was never getting called.

The pycuda.Stream destructor was in fact being called correctly after our function returned, but this exception was being silently swallowed (so we never found out):

pycuda/src/cpp/cuda.hpp

Lines 967 to 969 in db77dc7

if (boost::this_thread::get_id() != m_context->thread_id())
  throw pycuda::cannot_activate_out_of_thread_context(
      "cannot activate out-of-thread context");

I assume there's a good reason to raise the exception instead of pushing the context from the current thread, so we implemented a method on our end to destroy each Stream we create by wrapping the del self.stream with a context push/pop. Now our GPU memory usage is as flat as a pancake :)
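
For illustration, here is a minimal sketch of such a wrapper (class and method names are illustrative, not part of pycuda). It makes the owning context current on the calling thread before dropping the reference to the stream, so that the destructor's cuStreamDestroy call can actually run:

from pycuda import driver as cuda

class ManagedStream:
    """Owns a pycuda Stream and destroys it with the right context current."""

    def __init__(self, ctx):
        self._ctx = ctx
        self._ctx.push()
        self.stream = cuda.Stream()
        self._ctx.pop()

    def destroy(self):
        # Push the owning context so that, when the reference to the stream is
        # dropped, pycuda's destructor can call cuStreamDestroy instead of
        # silently skipping cleanup on an out-of-thread context.
        self._ctx.push()
        del self.stream   # assumes this is the only remaining reference
        self._ctx.pop()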

I'm happy to contribute to any docs if you believe this could be helpful to someone else, otherwise I'm happy to close this issue.

CarlosRDomin (Author) commented:

For any reader who comes across this, here's a minimal example that reproduces our leak (and shows how to fix it):

import gc
from pycuda import driver as cuda
from threading import Thread

cuda.init()
ctx = cuda.Device(0).make_context()
ctx.pop()

def create_streams(N=16):
    ctx.push()
    streams = [cuda.Stream() for _ in range(N)]
    ctx.pop()
    return streams

def print_memory():
    ctx.push()
    free, total = cuda.mem_get_info()
    print(f"{(total - free) / (1024 ** 2)}MB")
    ctx.pop()

def main_loop(N=10, push_ctx=False):
    for i in range(N):
        print(f"-- {i=} --")
        # Create some CUDA streams
        streams = create_streams()
        # Delete them (wrap the deletion with a context push/pop if selected)
        if push_ctx:
            ctx.push()
        del streams
        if push_ctx:
            ctx.pop()
        # Print memory usage. If there's a leak, this should increase on every iter
        print_memory()
        print("-" * 10)

# 1) Create and delete streams several times in the main thread.
# This doesn't leak memory because the stream destructor will be called
# from the same thread that created the context, so pycuda handles it nicely
main_loop()

# 2) Repeat step 1 from a separate thread.
# This will leak memory because the stream destructor isn't able to call
# cuStreamDestroy (isn't able to activate the context)
t = Thread(target=main_loop)
t.start()
t.join()

# 3) Repeat step 2 but pushing the context before deleting the streams. NO LEAK
t = Thread(target=main_loop, kwargs={"push_ctx": True})
t.start()
t.join()

Bottom line: if you use multithreading, remember to push the context before deleting the stream(s), and pop it afterwards!

inducer (Owner) commented Aug 15, 2022

Thanks for following up! I took a look at why there is no warning about the silenced exception.

Turns out it's getting caught here:

CUDAPP_CATCH_CLEANUP_ON_DEAD_CONTEXT(stream);

and then simply silenced here:

pycuda/src/cpp/cuda.hpp

Lines 161 to 171 in a25ed98

#define CUDAPP_CATCH_CLEANUP_ON_DEAD_CONTEXT(TYPE) \
  catch (pycuda::cannot_activate_out_of_thread_context) \
  { } \
  catch (pycuda::cannot_activate_dead_context) \
  { \
    /* PyErr_Warn( \
      PyExc_UserWarning, #TYPE " in dead context was implicitly cleaned up");*/ \
  }
// In all likelihood, this TYPE's managing thread has exited, and
// therefore its context has already been deleted. No need to harp
// on the fact that we still thought there was cleanup to do.

That struck me as a bad idea, given that such a warning likely could have saved you a few hours of grief.

https://gitlab.tiker.net/inducer/pycuda/-/merge_requests/80 adds a warning in that case.
