VuBot

Description

VuBot is an application that combines speech and gesture recognition to interact with objects in a real-time video feed. Using a webcam, users can point at objects and issue voice commands to perform actions such as detecting individual objects, recognizing all objects in the scene, or querying the color of a specific object. VuBot leverages libraries and models such as MediaPipe for gesture detection, OpenCV for video processing, and OpenAI Whisper for capturing and processing voice commands.
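At its core this is a standard OpenCV webcam loop. A minimal sketch of that loop (camera index, window name, and quit key are illustrative choices, not taken from the repository):

import cv2

cap = cv2.VideoCapture(0)                                # default webcam
while cap.isOpened():
    ok, frame = cap.read()                               # frame is a BGR image (numpy array)
    if not ok:
        break
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # MediaPipe expects RGB
    # ... run gesture recognition / object detection on frame_rgb here ...
    cv2.imshow("VuBot", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):                # press q to quit
        break
cap.release()
cv2.destroyAllWindows()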

Key Features

  • Gesture Recognition: Detects gestures such as pointing, closed fist, and victory using MediaPipe (a minimal sketch follows this list).
  • Speech Recognition: Processes voice commands to trigger actions like object detection and color recognition.
  • Object Detection: Identifies objects in the video feed and draws bounding boxes around them.
  • Color Recognition: Determines the color of objects by averaging the colors within the bounding boxes (a color-averaging sketch appears after the paragraph below).
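A minimal sketch of the gesture recognition step above, using the MediaPipe Tasks gesture recognizer. The model path is an assumption (the repository keeps its model under /utils/models), and frame_rgb is a webcam frame already converted to RGB as in the capture loop sketched earlier:

import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision

options = vision.GestureRecognizerOptions(
    base_options=mp_python.BaseOptions(
        model_asset_path="utils/models/gesture_recognizer.task"  # assumed location
    )
)
recognizer = vision.GestureRecognizer.create_from_options(options)

# Wrap the RGB frame and run recognition on a single image
mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame_rgb)
result = recognizer.recognize(mp_image)
if result.gestures:
    top_gesture = result.gestures[0][0].category_name  # e.g. "Pointing_Up", "Closed_Fist", "Victory"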

VuBot is designed to be intuitive and user-friendly, making it a versatile tool for various applications.
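As referenced in the Color Recognition feature above, an object's color can be approximated by averaging the pixels inside its bounding box. A short sketch, assuming the detector returns boxes in pixel coordinates:

import numpy as np

def average_color(frame_bgr, box):
    # box = (x1, y1, x2, y2) in pixel coordinates, as returned by the object detector
    x1, y1, x2, y2 = box
    roi = frame_bgr[y1:y2, x1:x2]                  # crop the object region
    b, g, r = roi.reshape(-1, 3).mean(axis=0)      # per-channel mean over all pixels
    return int(r), int(g), int(b)                  # average color as an RGB tuple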

Models used

Installation

1 - First, clone the project from the repository and navigate to the project root:

git clone https://github.com/darmangerd/vubot.git

cd vubot

2 - Next, install the project dependencies, preferably in a virtual environment. To do this, execute the following commands from the project root:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
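On Windows, the activation step differs: use venv\Scripts\activate (Command Prompt) or venv\Scripts\Activate.ps1 (PowerShell) instead of the source command above.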

3 - Finally, run the project:

python app.py
  • Be sure to have a functional microphone and webcam connected to your computer.

Guide

Gesture     | Trigger Word | Output
Pointing    | 'object'     | Return the object's name
Pointing    | 'color'      | Return the object's color
Closed Fist | 'every item' | Highlight all detected objects
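As an illustration only, the mapping in the table above could be dispatched on the recognized gesture label and the transcribed trigger word roughly as follows. The gesture labels follow MediaPipe's canned gesture names, and the Whisper model size, audio file name, and action names are hypothetical, not the repository's actual interface:

import whisper

# Transcribe the spoken command (model size and file name are assumptions)
model = whisper.load_model("base")
transcript = model.transcribe("command.wav")["text"].lower()

# Map (gesture label, trigger word) to a hypothetical action name
ACTIONS = {
    ("Pointing_Up", "object"):     "name_pointed_object",
    ("Pointing_Up", "color"):      "color_of_pointed_object",
    ("Closed_Fist", "every item"): "highlight_all_objects",
}

def dispatch(gesture_label, command_text):
    # Return the first action whose trigger word appears in the spoken command
    for (gesture, keyword), action in ACTIONS.items():
        if gesture_label == gesture and keyword in command_text:
            return action
    return None

print(dispatch("Pointing_Up", transcript))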

Project Structure

  • app.py: Main file to run the project.
  • /evaluation: Folder containing the material used for the evaluation phase of the project.
    • /evaluation/evaluation_keys.py: File containing the alternative keys method, used to evaluate the project (object and color names are manipulated).
    • /evaluation/evaluation_speech.py: File containing the alternative speech method, used to evaluate the project (object and color names are manipulated).
    • /evaluation/main_evaluation.csv: File containing the evaluation data obtained during the evaluation trials.
    • /evaluation/accuracy_evaluation.py: File containing the accuracy evaluation.
    • /evaluation/runtime_evaluation.py: File containing the runtime evaluation.
  • requirements.txt: File containing the project dependencies.
  • /images: Folder containing the images saved for debugging purposes.
  • /utils: Folder containing the utility functions used in the project.
    • /utils/models: Folder containing the gesture recognition model used in the project (MediaPipe).
  • /docs: Folder containing the project documentation. Includes the report, presentation and demo video.

Future Work

Future enhancements include developing a mobile version, improving audio speech handling, adding more interaction methods, integrating a large language model (LLM) for richer interactions, and implementing features to remember and locate specific objects.
