Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict IP pool size to prevent CRDB memory budget exceeded error #5651

Open
askfongjojo opened this issue Apr 27, 2024 · 5 comments
Open

Comments

@askfongjojo
Copy link

I hit the error accidentally while blindly plugging in a large CIDR start/end IP range, e.g. 240.0.0.1 - 240.4.255.254.

Nexus failed the request during floating IP create request against such a large IP range in the pool I chose:

00:47:16.243Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = unexpected database error: root: memory budget exceeded: 4218880 bytes requested, 133364712 currently allocated, 134217728 bytes in budget
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/29ae98d/dropshot/src/server.rs:837
    latency_us = 624875
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 172.20.17.42:50856
    req_id = 2db527c4-aa8f-4af4-96c7-581e2e5db726
    response_code = 500
    uri = https://silo11.sys.rack2.eng.oxide.computer/v1/floating-ips?project=bob

This is very much a corner case but having some upfront validation on the pool size will be a good thing. I poked a bit into what that maximum is and it looks like /14 works fine and covers an insanely large number of addresses.

@david-crespo
Copy link
Contributor

There has to be a way to make this endpoint use constant memory regardless of the range size.

@bnaecker
Copy link
Collaborator

bnaecker commented Apr 28, 2024 via email

@askfongjojo askfongjojo transferred this issue from oxidecomputer/console Apr 28, 2024
@david-crespo
Copy link
Contributor

/// Performance notes
/// -----------------
///
/// This query currently searches _all_ available IP Pools and _all_ contained
/// ranges. It is similar to the common "next item" queries, in that its
/// worst-case peformance is proportional to the total number of IP addresses in
/// all pools. Specifically, its runtime is proportional to the lowest
/// unallocated address in any IP pool. However, it's important to note that
/// Cockroach currently completely runs all subqueries and caches their results
/// in memory, meaning this query will need to be redesigned to limit the
/// number of results it checks. As an example, suppose there were a large
/// number of IP pools each containing an IPv6 /64 subnet. That's 2**64 items
/// _per_ pool.
///
/// A good way to start limiting this is cap the address-search subquery. It's
/// not clear what that limit should be though.

@bnaecker
Copy link
Collaborator

Here's the VPC fix, which could be implemented for this case as well: #4298

@bnaecker
Copy link
Collaborator

Played around with this on Friday. The fix I used for VPCs won't work here, since this isn't a NextItem query. It looks like we'll need to add the chunked search strategy manually to the inner subquery that generates the sequence of addresses from the ranges. The trickiest part will be correctly detecting when we've exhausted the ranges in the pool, rather than just hit the end of the chunk, but I have a few ideas on how to make that work.

bnaecker added a commit that referenced this issue May 14, 2024
- Adds a server kind enum, used to distinguish which API server is
  running any particular handler.
- Wraps the existing `ServerContext` into a higher-level `ApiContext`
  type, which includes the former in addition to the kind of server
  handling the request.
- Fixes #5651
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants