Some collections stop responding while the server seems ok #4199

Open
hugh2slowmo opened this issue May 8, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@hugh2slowmo

hugh2slowmo commented May 8, 2024

Current Behavior

version: 1.9.1
We ran into a strange unresponsiveness issue yesterday: some collections stopped responding to any request, including /search, /search/batch, /scroll, and even /collections/collection_name. I tried to fetch the collection status to get some insight, but that call hung as well. I also tried curl on the server itself, so it shouldn't be any kind of network issue, and there are no timeout logs. We also checked CPU, memory, and disk I/O, and everything looked fine. To bring our app back to life quickly, we chose to reboot Qdrant.
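For reference, the checks from the server itself were roughly along these lines (a sketch; collection_name and another_collection are placeholders, and --max-time just keeps the hanging call from blocking the shell):

# Collection info for the affected collection hangs; --max-time makes that visible:
curl --max-time 5 http://localhost:6333/collections/collection_name
# For comparison, the root endpoint and the other collections still answer:
curl --max-time 5 http://localhost:6333/
curl --max-time 5 http://localhost:6333/collections/another_collection
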
Some related machine info is shown below; the peak in the middle was caused by a short query benchmark we ran to make sure everything worked fine after the reboot:
[4 screenshots: machine resource metrics around the incident]

Steps to Reproduce

Everything has looked fine since we restarted it yesterday, so I don't know how to reproduce it.
While digging through the logs I found one thing I don't understand that might be related to the issue:

2024-05-07T04:23:15.180072Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:23:28.601938Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:23:55.445584Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:24:49.132774Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:26:36.507051Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:26:57.573629Z  INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1960 "-" "python-httpx/0.25.2" 0.003233    
2024-05-07T04:26:57.911320Z  INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1773 "-" "python-httpx/0.25.2" 0.002333    
2024-05-07T04:27:01.561033Z  INFO actix_web::middleware::logger: 10.0.8.16 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000456    
2024-05-07T04:27:04.309453Z  INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 16240 "-" "python-httpx/0.25.2" 0.013256    
2024-05-07T04:30:11.255518Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:30:35.034610Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:30:48.456474Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:31:15.300084Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:32:01.525581Z  INFO actix_web::middleware::logger: 10.0.10.11 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000441    
2024-05-07T04:32:08.987269Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:33:56.361553Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:37:01.584066Z  INFO actix_web::middleware::logger: 10.0.10.198 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000424    
2024-05-07T04:37:14.223721Z  INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1722 "-" "python-httpx/0.25.2" 0.002675    
2024-05-07T04:37:31.110006Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:38:01.258506Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:38:14.680405Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:38:41.524051Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:39:35.211267Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:41:22.585551Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:42:01.696012Z  INFO actix_web::middleware::logger: 10.0.10.11 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000374    
2024-05-07T04:44:23.131041Z  INFO actix_web::middleware::logger: 3.113.30.49 "GET /dashboard/logo192.png HTTP/1.1" 200 8870 "-" "Slack-ImgProxy (+https://api.slack.com/robots)" 0.008243    
2024-05-07T04:44:57.334028Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:45:44.079306Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:45:57.501191Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:46:24.344836Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:47:01.592443Z  INFO actix_web::middleware::logger: 10.0.8.16 "HEAD /dashboard HTTP/1.1" 200 942 "https://qdrant.pea.ai/dashboard" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 0.000460    
2024-05-07T04:47:18.032024Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    
2024-05-07T04:47:58.042329Z  INFO actix_web::middleware::logger: 10.0.10.99 "POST /collections/entities/points/search/batch HTTP/1.1" 200 3321 "-" "python-httpx/0.25.2" 0.003748    
2024-05-07T04:47:58.353717Z  INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1184 "-" "python-httpx/0.25.2" 0.003491    
2024-05-07T04:47:58.354665Z  INFO actix_web::middleware::logger: 10.0.9.180 "POST /collections/entities/points/search/batch HTTP/1.1" 200 1620 "-" "python-httpx/0.25.2" 0.004120    

The warning Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates also appears in today's logs. I'm not sure what keeps producing these messages or what issues they might lead to.
Hope you can provide some insight or a solution to figure it out, thanks!

@hugh2slowmo hugh2slowmo added the bug Something isn't working label May 8, 2024
@generall
Member

generall commented May 8, 2024

hey @hugh2slowmo, thanks for reporting! Will try to reproduce it on our side

@timvisee
Member

timvisee commented May 8, 2024

Question: what requests are you sending to Qdrant when this problem comes up? Is it just batch searches as shown in your log? Or are you also sending other operations?

Also, are you making snapshots by any chance?

@hugh2slowmo
Author

hi @timvisee

what requests are you sending to Qdrant when this problem comes up? Is it just batch searches as shown in your log? Or are you also sending other operations?

The batch searches shown in the logs are requests to the 'entities' collection, which was performing normally at that time.
We noticed this issue around 2024-05-07 02:30 UTC; the screenshots use CST (UTC+8), so that's around 10:30 there. Sorry about the confusing timeline. Also, due to a redeployment and some other reasons we lost some logs, so it seems the logs I provided aren't actually related to the incident. My bad.
The collection that stopped responding is called 'texts'. Our app sends points/search and points/search/batch requests to it most of the time, and we also tried /collections/texts and points/scroll via the dashboard console and via curl, both from a remote host and from localhost on the server.
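For context, the batch searches are plain vector searches along these lines (a simplified sketch; the real payloads carry our embedding vectors, limits, and filters):

curl -X POST http://localhost:6333/collections/texts/points/search/batch \
  -H 'Content-Type: application/json' \
  -d '{"searches": [{"vector": [0.1, 0.2, 0.3, 0.4], "limit": 5}]}'
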
The weirdest thing is that no logs show up at all for those requests. I suspected a network issue at first, so I tried curl from the server itself (curl http://localhost:6333/collections/texts), and it also didn't respond, while some other collections kept serving both internal and external traffic normally. So I have no idea what was happening. For that reason, I cannot provide the collection status and info from that time.

Also, are you making snapshots by any chance?

We normally take snapshots around midnight, triggered by ops scripts, and I didn't see any snapshot request logs around the incident at least.
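For reference, the nightly snapshot boils down to a REST call along these lines (a simplified sketch; the real ops script iterates over our collections):

# Create a snapshot of a single collection:
curl -X POST http://localhost:6333/collections/texts/snapshots
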

@hugh2slowmo
Author

It may not be easy to reproduce with the information I've provided. Should we use the TRACE log level in production? I guess it might be helpful when something bizarre happens.

Is there any performance impact from setting that?

@timvisee
Member

timvisee commented May 9, 2024

Thank you for elaborating.

Yeah, this is definitely hitting a deadlock. It means things get stuck and access to a specific collection is blocked, which is why you're seeing the behavior described above. Of course, this should never happen, no matter what requests you send.

It looks like I've been able to reproduce it somewhat reliably locally, which makes it easier to debug this.

You can enable trace logging, but I'm not sure how valuable it might be. It should not have a significant impact on performance, but there will be a lot more logs to store. Maybe we can give this a try later if I cannot resolve this locally.
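For reference, enabling it would look roughly like this (a sketch, assuming the standard config layout and environment override; adjust for your deployment):

# In config/config.yaml:
#   log_level: TRACE
# or, when running the Docker image, override it via the environment:
docker run -e QDRANT__LOG_LEVEL=TRACE -p 6333:6333 qdrant/qdrant
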

@timvisee
Member

timvisee commented May 9, 2024

I've located and fixed a deadlock issue in #4206, and am quite confident it'll fix the issue you're seeing as well.

Once it's merged, we plan to publish a patch release shortly after. Once that's out, maybe you can try upgrading to see whether the issue comes back. 🤞

@hugh2slowmo
Author

Well done, Tim. We'll give it a try once the next patch is released.

@timvisee
Member

I've just published the release: https://github.com/qdrant/qdrant/releases/tag/v1.9.2

Please let us know if it resolves the problems you're experiencing.

@hugh2slowmo
Author

hugh2slowmo commented May 30, 2024

@timvisee Hey Tim, it seems we've hit the issue again. We're now running v1.9.2:

curl localhost:6333
{"title":"qdrant - vector search engine","version":"1.9.2","commit":"34f7f8ec2d8455c0b3fe0f1fb134907ba7816a08"}
  • I checked for slow queries with response times above 1 second, but found only a few across several days (log timestamps are UTC+0); a filter like the one sketched after this list pulls them out:
"2024-05-22T07:23:19.627902Z  INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/similarity_dispatcher_examples/points/search/batch HTTP/1.1\" 200 313 \"-\" \"python-httpx/0.25.2\" 1.988359    \n"
"2024-05-22T07:23:19.750302Z  INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/similarity_dispatcher_examples/points/search/batch HTTP/1.1\" 200 767 \"-\" \"python-httpx/0.25.2\" 2.453608    \n"
"2024-05-22T07:23:20.177431Z  INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/similarity_dispatcher_examples/points/search/batch HTTP/1.1\" 200 908 \"-\" \"python-httpx/0.25.2\" 2.654047    \n"
"2024-05-24T02:52:01.997137Z  INFO actix_web::middleware::logger: 10.0.10.28 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 11457 \"-\" \"python-httpx/0.25.2\" 2.598561    \n"
"2024-05-25T00:39:53.713344Z  INFO actix_web::middleware::logger: 10.0.10.48 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 12346 \"-\" \"python-httpx/0.25.2\" 2.015387    \n"
"2024-05-26T03:07:59.902028Z  INFO actix_web::middleware::logger: 10.0.10.48 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 10301 \"-\" \"python-httpx/0.25.2\" 1.368371    \n"
"2024-05-27T00:43:35.108451Z  INFO actix_web::middleware::logger: 10.0.10.48 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 13381 \"-\" \"python-httpx/0.25.2\" 1.037464    \n"
"2024-05-27T04:12:22.801538Z  INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 121083 \"-\" \"python-httpx/0.25.2\" 2.275867    \n"
"2024-05-27T04:12:45.348556Z  INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 104146 \"-\" \"python-httpx/0.25.2\" 1.028009    \n"
"2024-05-27T04:13:22.647102Z  INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 112274 \"-\" \"python-httpx/0.25.2\" 1.946678    \n"
"2024-05-27T04:24:06.112081Z  INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 88231 \"-\" \"python-httpx/0.25.2\" 2.080808    \n"
"2024-05-27T04:25:44.231897Z  INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 56028 \"-\" \"python-httpx/0.25.2\" 1.056930    \n"
"2024-05-27T04:26:01.933222Z  INFO actix_web::middleware::logger: 219.135.145.130 \"POST /collections/labels/points/search/batch HTTP/1.1\" 200 57710 \"-\" \"python-httpx/0.25.2\" 1.935638    \n"
"2024-05-30T05:39:29.992428Z  INFO actix_web::middleware::logger: 10.0.9.92 \"POST /collections/texts/points/search/batch HTTP/1.1\" 200 8447 \"-\" \"python-httpx/0.25.2\" 24.165537    \n"
  • The read-lock warning logs appear again:
"2024-05-18T01:40:19.350616Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-18T01:40:32.772502Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-18T01:40:56.198888Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-18T01:41:09.620751Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:32:04.972081Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:32:18.393947Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:32:45.237588Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:33:38.924771Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:35:26.299061Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:39:01.047518Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:43:33.908206Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:43:47.330069Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:44:14.173709Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:45:07.860895Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:46:55.235176Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:50:29.983639Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:56:10.275125Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:56:23.696990Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:56:50.540626Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:57:44.227817Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T05:59:31.602091Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
"2024-05-30T06:03:06.350550Z  WARN collection::collection_manager::holders::segment_holder: Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates.    \n"
  • I cannot find any timeout logs.
  • The picture below shows our health checks calling /healthz; the server node has been unhealthy since 13:40 (the time shown is UTC+8):
    [screenshot: /healthz health-check status]
  • I confirm there was no snapshot operation running at the time; we had already finished it several hours earlier.
  • Machine resources look good. The times below are also UTC+8; we rebooted the server node around 14:05 UTC+8, which caused the disk I/O read increase in img 4, and the lower peak around 8:30 UTC+8 was the moment we took the snapshot:
    [screenshots: machine resource metrics]
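
As mentioned in the first bullet, a filter roughly like this is enough to pull the slow requests out of the access log (a sketch; it assumes the actix access-log format shown above, where the handling time in seconds is the last field, and qdrant.log stands in for wherever the logs are collected):

# Keep only access-log lines whose last field (handling time in seconds) exceeds 1.0:
grep 'actix_web::middleware::logger' qdrant.log | awk '$NF + 0 > 1.0'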

Just let me know if any other information would help to figure it out, thanks!

@timvisee
Member

timvisee commented May 30, 2024

That's very unfortunate. Thank you for sharing all these details!

Just let me know if any other information would help to figure it out, thanks!

Could you elaborate once more on what kind of operations you're running? In the above log I only see batch searches. What kind of update operations have you run in, let's say, the past 12 hours?

We'd love to hear about any leads that may help us reproduce this. Even if it seems insignificant.

I confirm there was no snapshot operation running at the time; we had already finished it several hours earlier.

I assume that this also means there were none still running in the background either.
