🐛 [Bug]: Cache stampede with caching middleware #2182

Open
3 tasks done
demming opened this issue Oct 31, 2022 · 12 comments

Comments

demming commented Oct 31, 2022

Bug Description

I've come across several instances of the so-called cache stampede in a variety of web app frameworks. Here are the issues I opened in the respective GitHub repos:

  • ASP.NET Core (only Response Caching affected, performance issue remaining),
  • Quarkus (WIP & performance),
  • PlayFramework (awesomely fixed, only minor performance issue remaining),
  • Micronaut (status unknown).

In those cases the cache stampede brings the GC to its knees. With Fiber the effects are less pronounced, but the typical 20 MB RSS still climbs to over 300 MB within a couple of minutes for just 100 connections, which in relative terms is a lot; it even consumes up to 50% more memory than my Akka HTTP implementation running on OpenJDK 19 HotSpot (not even the low-profile Semeru OpenJ9).


It appears Fiber's caching middleware suffers from it too. It's a kind of cascading failure in distributed systems. Specifically, when I start several concurrent connections to a cached Fiber endpoint, the handler is invoked and the cache is (likely) populated multiple times, anywhere from "more than once" up to "once per connection", which is incredibly inefficient.

There are three basic mitigation strategies; the most common one introduces a mutex so that all but one (virtual) thread must wait.
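
To make that concrete, here is a minimal sketch of the mutex strategy in plain Go. The cache map, key, and fetch function are hypothetical stand-ins rather than Fiber's middleware internals; the point is only that the lookup happens under the lock, so a miss triggers exactly one population while the remaining callers wait and then reuse the cached value.

package main

import (
  "fmt"
  "sync"
  "time"
)

var (
  mu    sync.Mutex
  cache = map[string]string{}
)

// getOrPopulate returns the cached value for key, computing it at most once
// per miss even under concurrent callers: whoever acquires the lock first on
// a miss populates the entry; everyone else blocks and then reads it.
func getOrPopulate(key string, fetch func() string) string {
  mu.Lock()
  defer mu.Unlock()
  if v, ok := cache[key]; ok {
    return v
  }
  // Holding the lock across the fetch is the simplest form and also why this
  // costs throughput; a real fix would add expiration handling and per-key locks.
  v := fetch()
  cache[key] = v
  return v
}

func main() {
  var wg sync.WaitGroup
  for i := 0; i < 100; i++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      getOrPopulate("/sanitize", func() string {
        time.Sleep(100 * time.Millisecond) // stand-in for the remote fetch + sanitization
        return "sanitized page"
      })
    }()
  }
  wg.Wait()
  fmt.Println("100 concurrent requests, one fetch")
}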

For additional details, just look into the issues I've linked above. Let me know if you need additional input and how I can help out here. I would appreciate it if someone could confirm my findings by reproducing the issue.

How to Reproduce

One can set up a cached endpoint that does whatever one likes. I make it a bit less boring by having it sanitize remote HTML; this way I can also gauge the throughput and framework-induced latencies.

One can run

bombardier -c 100 -d 10s -l "http://localhost:3000/sanitize?address=http://localhost:8080"

or any other target URL. You can also run something less trivial, such as a chain of services to check data equivalence

bombardier -c 100 -d 10s -l "http://localhost:3000/sanitize?address=http://localhost:8080/website?address=<remote_url>"

or even run a recursive request.

As you are certainly aware, this creates 100 concurrent connections and uses them for a short 10-second burst load test, with implicit connection reuse, and displays a histogram of observed latencies.

Expected Behavior

I expect only one cache population on the initial cache miss, not 100. Specifically, getWebpageHandler below should not be invoked more than once per cache cycle, regardless of the number of inbound concurrent connections.

Fiber Version

v2.39.0

Code Snippet (optional)

package main

import (
  "log"
  "time"

  "github.com/gofiber/fiber/v2"
  "github.com/gofiber/fiber/v2/middleware/cache"
  _ "github.com/microcosm-cc/bluemonday" // sanitizer used by the full app; not exercised in this snippet
)

var counterGetWebpage = 0
var counterGetWebpageHandler = 0

func main() {
  log.Print("Starting the app...")

  cacheConfig := cache.Config{
    Next: func(c *fiber.Ctx) bool {
      return c.Query("refresh") == "true"
    },
    Expiration:   30 * time.Second,
    CacheControl: true,
    KeyGenerator: func(c *fiber.Ctx) string {
      return c.Path()
    },
  }


  app := fiber.New()

  // FIXME: Cache stampede!
  app.Use(cache.New(cacheConfig))

  app.Get("/sanitize/", getWebpageHandler)

  log.Fatal(app.Listen(":3000"))
}

// getWebpageHandler serves the (supposedly cached) endpoint; the counter logs
// how many times it actually runs under concurrent load.
func getWebpageHandler(ctx *fiber.Ctx) error {
  counterGetWebpageHandler += 1
  log.Print("getWebpageHandler invoked #", counterGetWebpageHandler)

  url := ctx.Query("url")
  webpage := getWebpage(url)

  return ctx.SendString(webpage)
}

// getWebpage fetches the target URL with Fiber's HTTP client and returns the body.
func getWebpage(url string) string {
  counterGetWebpage += 1
  log.Print("getWebpage invoked #", counterGetWebpage)

  client := fiber.AcquireClient()
  a := client.Get(url)
  status, body, err := a.String()

  if err != nil {
    log.Print("Could not obtain the remote resource: ", err)
  }

  if status > 200 {
    log.Print("Bad return status: ", status)
  }

  fiber.ReleaseClient(client)

  return body
}

Checklist:

  • I agree to follow Fiber's Code of Conduct.
  • I have checked for existing issues that describe my problem prior to opening this one.
  • I understand that improperly formatted bug reports may be closed without explanation.

welcome bot commented Oct 31, 2022

Thanks for opening your first issue here! 🎉 Be sure to follow the issue template! If you need help or want to chat with us, join us on Discord https://gofiber.io/discord

@ReneWerner87
Member

@demming can you support us with a pull request containing the solution that, from your point of view, best solves the problem?

@keyvan-dadash

Can I try this?

@li-jin-gou
Contributor

Can I try this?

Of course


demming commented Nov 16, 2022

@ReneWerner87 Sure, will see what I can do.

@sod-lol: I was going to suggest a PR this weekend. A trivial "fix" is to move the call to c.Next in middleware/cache.go inside the critical section (watch out for the expiration check though), but this will degrade performance a bit. Running a separate signal makes sense but adds overhead too; one can also broadcast through a condition variable. I tried to fix it today: the stampede is gone, but at the cost of a slight performance degradation. In addition, an interfering load on another route will unlock the section and let the connections on the former endpoint slip through, since the mutex is shared across routes; I believe it should depend on the key to avoid the data race. There were also reports of a potential deadlock (see below). So it's not the best solution in my opinion, but it leaves the scope of the implementation the way it is.

That spot I linked above is where the stampede occurs; Next here is app.Get in my example at the top of the page. I see @apoq decided to unlock the section to fix a bug; see also @apoq's #2057 (comment). It looks like a regression. I would appreciate you chiming in here. I believe your original issue was due to using a non-concurrent map; the lock should depend on the key.

I suggest we try to add a synchronized map of mutexes. The synchronization might curb performance a bit and their choreography might be a little difficult. I'll try to come up with a PR this weekend, unless you guys have got any other ideas?
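
Roughly, the synchronized map of mutexes could look like the sketch below. The names and the plain map standing in for the middleware's storage are illustrative, not the actual middleware/cache.go code, and expiration/eviction are omitted; the point is that each key gets its own lock via sync.Map, so a stampede on one route no longer serializes or unlocks requests on another.

package main

import (
  "fmt"
  "sync"
)

var (
  keyLocks sync.Map // map[string]*sync.Mutex, one lock per cache key
  cacheMu  sync.RWMutex
  cache    = map[string]string{}
)

// lockFor returns the mutex for key, creating it on first use.
func lockFor(key string) *sync.Mutex {
  m, _ := keyLocks.LoadOrStore(key, &sync.Mutex{})
  return m.(*sync.Mutex)
}

// getOrPopulate populates key at most once per miss; misses on different
// keys proceed independently because each key has its own lock.
func getOrPopulate(key string, fetch func() string) string {
  cacheMu.RLock()
  v, ok := cache[key]
  cacheMu.RUnlock()
  if ok {
    return v
  }

  l := lockFor(key)
  l.Lock()
  defer l.Unlock()

  // Re-check after acquiring the per-key lock: a waiter that lost the race
  // may find the entry already populated.
  cacheMu.RLock()
  v, ok = cache[key]
  cacheMu.RUnlock()
  if ok {
    return v
  }

  v = fetch()
  cacheMu.Lock()
  cache[key] = v
  cacheMu.Unlock()
  return v
}

func main() {
  fmt.Println(getOrPopulate("/sanitize", func() string { return "fetched once" }))
  fmt.Println(getOrPopulate("/sanitize", func() string { return "never runs" }))
}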

On a side note, I've been looking into a different solution with proper thread-safe LFU and ARC policies for this middleware layer. There are a couple of popular Go implementations we could use underneath. But it's a deviation from the present implementation and would mean a return in spirit to the original implementation.

I suggest we fix the stampede first, and then discuss the rest. Quite some effort has been put into the current implementation. But I'm not so sure about the merits of maintaining a separate process-memory cache implementation.
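
For comparison only: the same "at most one in-flight population per key" behavior is what golang.org/x/sync/singleflight provides out of the box. A rough sketch under that assumption follows; it only deduplicates concurrent calls rather than caching the result afterwards, and it is not what the middleware currently does.

package main

import (
  "fmt"
  "sync"
  "sync/atomic"
  "time"

  "golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchOnce collapses concurrent calls that share a key into a single
// execution of fetch; the other callers block and receive the shared result.
func fetchOnce(key string, fetch func() (string, error)) (string, error) {
  v, err, _ := group.Do(key, func() (interface{}, error) {
    s, ferr := fetch()
    return s, ferr
  })
  if err != nil {
    return "", err
  }
  return v.(string), nil
}

func main() {
  var fetches int64
  var wg sync.WaitGroup
  for i := 0; i < 100; i++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      fetchOnce("/sanitize", func() (string, error) {
        atomic.AddInt64(&fetches, 1)
        time.Sleep(100 * time.Millisecond) // keep the call in flight so the others join it
        return "sanitized page", nil
      })
    }()
  }
  wg.Wait()
  fmt.Println("fetches:", atomic.LoadInt64(&fetches)) // typically 1 instead of 100
}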


demming commented Nov 16, 2022

@li-jin-gou

Your comment hadn't appeared before I posted. I'll wait for @sod-lol's reply then.


demming commented Nov 16, 2022

Just to follow up: I've just tried replacing the shared mutex with a synced mutex map; it looks promising and performance is good.

@keyvan-dadash

Hi @demming.
Honestly, I haven't had a chance to analyze the issue thoroughly yet, so it would be best if you send your PR, since you have analyzed the source code and the source of the problem. Let me know if I can help you. Thanks.

keyvan-dadash removed their assignment Nov 19, 2022

demming commented Nov 24, 2022

Hi @demming.

Honestly, I haven't had a chance to analyze the issue thoroughly yet, so it would be best if you send your PR, since you have analyzed the source code and the source of the problem. Let me know if I can help you. Thanks.

Alright, I'll take a shot at it this weekend then. I had hoped to coordinate with the other contributors previously involved in it. I hope we still get to discuss it when I submit my PR so we can test it from different angles.

Thank you too, I'll keep it in mind 😊

@ReneWerner87
Member

@demming any progress?


demming commented Feb 6, 2024

@ReneWerner87: apologies, I've been out of the loop here for a while; thanks for pinging me. Let me reassess the issue, I'm happy to contribute back. Btw, I met a person just recently you might know.

@ReneWerner87
Member

Nice to hear that

Btw, I met a person just recently you might know.

Whom?
