Some collections stop responding while the server seems ok #4199
Comments
hey @hugh2slowmo, thanks for reporting! Will try to reproduce it on our side |
Question: what requests are you sending to Qdrant when this problem comes up? Is it just batch searches as shown in your log? Or are you also sending other operations? Also, are you making snapshots by any chance? |
hi @timvisee
The batch searches shown in the logs are requests to the 'entities' collection, which was performing normally at that time.
We normally take snapshots at midnight, triggered by ops scripts, and I didn't see any snapshot request logs at least. |
Maybe it's not an easy thing to reproduce from the information I provided. Should we use the TRACE log level in production? I guess it might be helpful when something bizarre happens. Would setting it have any performance impact? |
Thank you for elaborating. Yeah this is definitely hitting a deadlock. It means things get stuck and access to a specific collection is blocked. That's why you're seeing the above mentioned behavior. Of course, this should never happen no matter what requests you send to it. It looks like I've been able to reproduce it somewhat reliably locally, which makes it easier to debug this. You can enable trace logging, but I'm not sure how valuable it might be. It should not have a significant impact on performance, but there will be a lot more logs to store. Maybe we can give this a try later if I cannot resolve this locally. |
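For readers unfamiliar with the term, here is a minimal, generic sketch of how a deadlock blocks access. It is not the Qdrant code path and not the specific bug discussed in this thread; the function name `lock_order_demo` and the two locks are hypothetical. Two threads take the same pair of locks in opposite order, and each then waits on the lock the other holds. A timeout is used only so the demo terminates instead of hanging forever.

```python
import threading


def lock_order_demo(timeout_s=0.5):
    """Show two threads blocking each other by taking locks in opposite order.

    Hypothetical illustration only; returns whether each thread managed to
    acquire its second lock within the timeout.
    """
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    sync = threading.Barrier(2)  # reusable rendezvous point for both threads
    results = {}

    def worker(name, first, second):
        with first:
            # Wait until both threads hold their first lock.
            sync.wait()
            # Each thread now wants the lock the other one is holding.
            got = second.acquire(timeout=timeout_s)
            results[name] = got
            if got:
                second.release()
            # Keep holding `first` until both attempts have finished, so the
            # outcome does not depend on which timeout fires first.
            sync.wait()

    threads = [
        threading.Thread(target=worker, args=("t1", lock_a, lock_b)),
        threading.Thread(target=worker, args=("t2", lock_b, lock_a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the timeout, both threads would wait indefinitely, which matches the symptom of requests that never return while CPU, memory, and disk look idle.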
I've located and fixed a deadlock issue in #4206, and am quite confident it'll fix the issue you're seeing as well. Once it's merged, we plan to publish a patch release shortly after. Once that's done, maybe you can try upgrading to see whether the same issue shows up again. 🤞 |
Well done Tim, we'll give it a try when the next patch has been released. |
I've just published the release: https://github.com/qdrant/qdrant/releases/tag/v1.9.2 Please let us know if it resolves the problems you're experiencing. |
@timvisee Hey Tim, it seems we've hit the issue again; we're now using v1.9.2.
Just let me know if any other information would help figure it out, thx! |
That's very unfortunate. Thank you for sharing all these details!
Could you elaborate once more on what kind of operations you're running? In the above log I only see batch searches. What kind of update operations have you been running in, let's say, the past 12 hours? We'd love to hear about any leads that may help us reproduce this, even if they seem insignificant.
I assume that this also means there were none still running in the background either. |
Current Behavior
version: 1.9.1
We hit a weird unresponsiveness issue yesterday: some of the collections stopped responding to any request, including /search, /search/batch, /scroll, and even /collections/collection_name. I tried to get the collection status for some insight, but that request hung too. I also tried curl on the server itself, so it is unlikely to be a network issue, and there were no timeout logs. We also checked CPU, memory, and disk I/O, and everything looked fine. To get our app back up quickly, we chose to restart Qdrant.
Some related machine metrics are shown below; the peak in the middle was caused by a short query benchmark we ran to make sure everything worked after the restart:
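A check like the one described above can be scripted so that a hung collection shows up as an explicit timeout rather than a request that never returns. This is a minimal sketch, not an official tool: `probe_collection` and its `opener` parameter are hypothetical names, and the base URL (for example `http://localhost:6333`) is an assumption about the deployment. It only calls the `/collections/<name>` endpoint mentioned above.

```python
import json
import socket
import urllib.error
import urllib.request


def probe_collection(base_url, collection, timeout_s=5.0,
                     opener=urllib.request.urlopen):
    """Return 'ok', 'timeout', or 'error' for GET /collections/<collection>.

    `opener` is injectable so the function can be exercised without a server.
    """
    url = f"{base_url.rstrip('/')}/collections/{collection}"
    try:
        with opener(url, timeout=timeout_s) as resp:
            json.load(resp)  # make sure a complete JSON body arrived
            return "ok"
    except (socket.timeout, TimeoutError):
        # No response within timeout_s: the collection may be stuck.
        return "timeout"
    except urllib.error.URLError as exc:
        # urllib can wrap a timeout inside URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
    except ValueError:
        # The body was not valid JSON.
        return "error"
```

Calling `probe_collection(base_url, name)` for each collection name would then separate collections that answer from those that hang, which is useful to record before restarting.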
Steps to Reproduce
Everything has looked good since we restarted it yesterday, so I don't know how to reproduce it.
While digging through the logs I found one thing that I don't understand but that might point to the issue:
And
Trying to read-lock all collection segments is taking a long time. This could be a deadlock and may block new updates
also appears in today's logs. I'm not sure what keeps triggering these logs or what issues they may lead to. Hope you guys can provide some solution or insight to figure it out, thanks!