
Semantic Search for Events #8980

Open
17 tasks done
hunterjm opened this issue Dec 15, 2023 · 9 comments
Labels
enhancement New feature or request pinned

Comments

@hunterjm
Contributor

hunterjm commented Dec 15, 2023

Semantic Search for Events

When I started working on #8959, I was initially looking at making it near-real-time to include additional labels and descriptions that could be sent along with the MQTT messages, but the latency of LLMs prevents that. The PR then shifted to processing descriptions for events after they ended.

My interest has shifted towards adding vector similarity search capability to Frigate. The end result would be a search box on the events page that accepts free-form text. @blakeblackshear gave some good examples in an earlier comment:

You could search for "people walking dogs" or "yellow cars" or "a person wearing a hoodie" or "dogs pooping in the yard" which would be interesting.

I've identified a couple new dependencies and thought through a potential implementation that I'd like some feedback on.

Suggested Default Embeddings Model:

CLIP ViT-B/32 or ResNet50 (need to test speed/performance)
CLIP is selected because it is a multi-modal embeddings model that embeds images and text into the same embedding space. It was trained on public image/caption pairs, so it can be used directly on thumbnails of finished events and return results based on text queries.

I found an ONNX conversion for CLIP that will allow us to run this model with just the onnxruntime dependency. I plan to use the following models (a rough usage sketch follows the list):

ViT-B/32 Models
Image Model
Text Model

ResNet50 Models
Image Model
Text Model
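
As a rough sketch of how the exported models could be run through onnxruntime (the file name, tensor names, and single-output assumption are mine for illustration, not the actual exports):

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

# File and tensor names below are placeholders; the real ONNX exports may differ.
image_session = ort.InferenceSession("clip_vit_b32_image.onnx")

def embed_thumbnail(path: str) -> np.ndarray:
    """Embed an event thumbnail into CLIP's shared image/text space."""
    # Standard CLIP ViT-B/32 preprocessing: 224x224, [0, 1], CLIP mean/std, NCHW.
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0
    mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
    std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
    x = ((x - mean) / std).transpose(2, 0, 1)[np.newaxis, ...]  # (1, 3, 224, 224)

    input_name = image_session.get_inputs()[0].name
    embedding = image_session.run(None, {input_name: x})[0][0]
    # Normalize so cosine similarity reduces to a simple dot product.
    return embedding / np.linalg.norm(embedding)
```

The text model would be run the same way with the tokenized query as input, so text and thumbnail vectors can be compared directly.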

New Core Dependencies

onnxruntime
Used to generate embeddings from CLIP

ChromaDB
An open source embeddings database which allows you to store metadata along with embeddings and do vector similarity searches. It supports multiple embeddings models and allows you to implement your own embedding functions (more on this later). Unlike many other options, the base installation is relatively lightweight without a huge dependency tree. Most notably, it does not install Torch or TorchVision out of the box, though it can support models that require them, such as sentence-transformers. We could potentially offer a container that has those dependencies and allows users to self-select an embeddings model.
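
For reference, the ChromaDB surface this design leans on is roughly the following (the collection name, metadata fields, and path are illustrative placeholders, not final choices):

```python
import chromadb
from chromadb.utils import embedding_functions

# Persist next to the rest of Frigate's state; the path is illustrative.
client = chromadb.PersistentClient(path="/config/chroma")

# Chroma's default embedding function is an ONNX all-MiniLM-L6-v2 encoder,
# which is what the description collection would use.
descriptions = client.get_or_create_collection(
    name="event_description",
    embedding_function=embedding_functions.DefaultEmbeddingFunction(),
)

# Store the description along with metadata we can filter on later.
descriptions.upsert(
    ids=["1702729481.753491-7a7xb6"],
    documents=["A man with long hair and a beard is sitting in a chair"],
    metadatas=[{"camera": "office", "label": "person"}],
)

# Vector similarity search, constrained by existing event filters.
results = descriptions.query(
    query_texts=["person wearing a hoodie"],
    n_results=10,
    where={"camera": "office"},
)
```

A separate collection using a CLIP embedding function would hold the thumbnail embeddings, since the two models produce vectors in different spaces.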

Google Gemini

Gemini will be able to provide a more detailed description that will undoubtedly outperform CLIP's embeddings. Currently only images are supported in multi-modal prompts, but video support is coming soon. Initially, I would like to generate descriptions from the thumbnail and use ChromaDB's built-in ONNX sentence embeddings encoder based on the all-MiniLM-L6-v2 model. Google Gemini also provides an API for its own embeddings generator, but using this model will allow us to support multiple methods of adding descriptions to events (API, third-party integrations, etc.).
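
A minimal sketch of the captioning step with the google-generativeai client (the prompt, model name, and helper are placeholders; exact parameters would live behind the proposed gemini config):

```python
import google.generativeai as genai
from PIL import Image

# The API key would come from the new `gemini` config section.
genai.configure(api_key="<GEMINI_API_KEY>")
model = genai.GenerativeModel("gemini-pro-vision")

def describe_event_thumbnail(path: str) -> str:
    """Ask Gemini for a one-sentence description of an ended event's thumbnail."""
    prompt = (
        "Describe the main subject of this security camera snapshot "
        "in one short sentence."
    )
    response = model.generate_content([prompt, Image.open(path)])
    return response.text.strip()
```

The returned text would then be embedded with the all-MiniLM-L6-v2 encoder mentioned above and stored alongside the event.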

Next Steps

  • Add a "description" to the events table data field
  • Create an HTTP API for POST /api/events/<id>/description allowing external systems to create descriptions (a sketch follows this list)
  • Create a new config entry for semantic_search and gemini
    • Able to configure whether or not it is enabled, default disabled
    • Able to configure and enable Google Gemini support for descriptions
  • Create a new app service for embeddings
    • Instantiate a ChromaDB collection using CLIP
    • On startup, load existing ended event thumbnails into Chroma if they do not yet exist
      • Implemented via a .reindex_events file in the /config directory. It takes a while though... this is on a 12-core CPU: Embedded 10987 thumbnails and 851 descriptions in 730.0870163440704 seconds
      • Recommend just letting the index build up naturally as new events come in and old ones are removed.
    • Implement a processor for ended events to add them to ChromaDB
    • Create structure for additional captioning and embedding services like Gemini
    • Handle updates when descriptions change after the event has ended
  • Implement Similarity Search in Frontend
    • Take into account existing filters on metadata when returning results
    • Implement HTTP API on backend modifying /events endpoint to allow for similarity search as well
    • Search box for text similarity
    • Button for thumbnail similarity, pulling up events with very close thumbnails
  • Allow users to edit descriptions in the frontend
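
To make the description endpoint concrete, here is a minimal sketch of what it could look like on the existing Flask backend (the embeddings hook and the exact shape of the event's data field are assumptions, not the final implementation):

```python
from flask import Blueprint, jsonify, request

from frigate.models import Event  # existing peewee model

bp = Blueprint("events_description", __name__)

@bp.route("/api/events/<id>/description", methods=["POST"])
def set_event_description(id: str):
    """Allow external systems to attach a description to an ended event."""
    event = Event.get_or_none(Event.id == id)
    if event is None:
        return jsonify({"success": False, "message": "Event not found"}), 404

    description = (request.get_json(silent=True) or {}).get("description", "")

    # Store the description in the event's data field (assumed structure)...
    data = dict(event.data or {})
    data["description"] = description
    event.data = data
    event.save()

    # ...and notify the embeddings service (hypothetical helper) so ChromaDB
    # re-embeds the description for this event.
    # embeddings.upsert_description(event.id, description)

    return jsonify({"success": True}), 200
```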

Thanks in advance for any feedback or thoughts.

@NickM-27
Sponsor Collaborator

one thing to be noted here is that the UI is (in progress) being rewritten for 0.14, so the events page won't exist as it is today.

@hunterjm
Contributor Author

one thing to be noted here is that the UI is (in progress) being rewritten for 0.14, so the events page won't exist as it is today.

0.13 isn't even out and you're planning that far ahead! I'd love to work with you on envisioning what the UX could be. I don't pretend that frontend is a strength of mine.

@NickM-27
Sponsor Collaborator

NickM-27 commented Dec 15, 2023

to be clear it is a collaborative effort between myself, Blake, and @hawkeye217

there may be some outside help on the UX & design side as well

@hunterjm
Contributor Author

It was the collective "you". I'll make sure I base my work off of the 0.14 branch moving forward and if I implement any POC I'll do it in web-new.

@blakeblackshear
Owner

I think what you have laid out here is a strong backend foundation. I'm sure there are some good use cases here even if we haven't come up with them yet. The best way to figure them out is to start experimenting.

@hunterjm
Contributor Author

We've got progress! I've got the backend mostly hooked up locally now. Still need to add the Search API and frontend components. Embedding models are relatively fast on my dev machine. Will need to test still on embedded devices.

[2023-12-16 07:24:50] frigate.chromadb               INFO    : Embedded thumbnail for 1702729481.753491-7a7xb6 on office in 0.0683 seconds
[2023-12-16 07:24:55] frigate.chromadb               INFO    : Generated description for 1702729481.753491-7a7xb6 on office in 5.6522 seconds: A man with long hair and a beard is sitting in a chair looking to the right
[2023-12-16 07:24:56] frigate.chromadb               INFO    : Embedded description for 1702729481.753491-7a7xb6 on office in 0.2163 seconds

Chroma was a bit harder than I wanted. It requires a version of SQLite that Python 3.9 and Debian Bullseye don't provide. I went down the path of trying to upgrade the Docker image, but there is A LOT involved in that. I found a solution that lets us use a pre-compiled binary and swap it in at Python import time.
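
For anyone hitting the same wall, the workaround is the usual pysqlite3-binary swap that Chroma's troubleshooting docs suggest; something along these lines has to run before chromadb is imported:

```python
# Requires the pysqlite3-binary wheel, which bundles a modern SQLite build.
# Must execute before `import chromadb` anywhere in the embeddings process.
__import__("pysqlite3")
import sys

sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
```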

@mabed-fr

Who will be able to benefit from it?

Users of the default template?

Coral users?

frigate+ users?

@NickM-27
Sponsor Collaborator

anyone could use this regardless of what model / hardware they are using for object detection

@mabed-fr

Great !!!
