EC_601_Project_VQA

Introduction

Visual Question Answering (VQA) is the system that combines Knowledge Representation & Reasoning, Computer Vision, and Natural Language Processing. [1] In this system, it takes images and natural language as input and gives the natural language answers as output. VQA could be very helpful when it is applied to scenarios of object detections or acquisition of information.

Application

Visual Question Answering has been widely used in different real-life areas. The most direct application is to help visually impaired persons; thus, they could ask questions by just taking photos. Also, it could be used in the medical or biological field by detecting X-rays or some other medical images to tell information.

Users and Product MVP

Our main users are people with vision imparity.

Asking the VQA product question in voice with an image, model is able to give you the answer in voice.

A man get blind in a sudden, and he became unfamiliar with the world around him. With our work, he can understand his situation in a more convenient way, and “see” the world again.

A user want to figure out the situation in a certain picture, and VQA will help to find out.

Demo Structure

Next, this is the structure of VQA demo. Our demo supports conversion between text and speech by gTTS. There are two inputs, speech (question) and image, by using gTTS, the speech will transfer to text. Model takes text and image as input, giving text (answer) as output. Then the text output will be trasferred to speech (answer).

Figure 1. Demo Structure

Model Structure

In this VQA product, model is BLIP recaped by Salesforce, Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [2]. The structure is shown below:

Figure 1. Structure of BLIP Model changed to LAVIS, A Library for Language-Vision Intelligence, which support 2 kinds of VQA models: albef_vqa and blip_vqa

To change the model, you should change the name from "blip_vqa" to "albef_vqa"

model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="aokvqa", is_eval=True,
                                                                      device=device)

GUI tutorial:

Our GUI supports Windows 11 operating system. By click the "Push to Ask", the front camera of computer will take the image as input, then you need to speak to ask questions. The demo will speak the answers to you. At the same time, demo will automatically record the history of your question, answers given by model, and time. The demo image is shown below:

By clicking the "Push to Ask", demo will take the picture of the water bottle, then ask the demo "what is in the picture". Demo will speak the anwer "water bottle".

References

[1] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision (pp. 2425-2433).

[2]Li, J., Li, D., Xiong, C., & Hoi, S. (2022). Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086.

@inproceedings{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      booktitle={ICML},}

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
Codes		Codes
LAVIS		LAVIS
Screenshots		Screenshots
BLIP_model.png		BLIP_model.png
GUI_image.png		GUI_image.png
README.md		README.md
VQA_demo_structure.png		VQA_demo_structure.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Codes

Codes

LAVIS

LAVIS

Screenshots

Screenshots

BLIP_model.png

BLIP_model.png

GUI_image.png

GUI_image.png

README.md

README.md

VQA_demo_structure.png

VQA_demo_structure.png

Repository files navigation

EC_601_Project_VQA

Introduction

Application

Users and Product MVP

Demo Structure

Model Structure

GUI tutorial:

References

About

Releases

Packages

Contributors 2

Languages

CMander02/VQA_2022_A1_18

Folders and files

Latest commit

History

Repository files navigation

EC_601_Project_VQA

Introduction

Application

Users and Product MVP

Demo Structure

Model Structure

GUI tutorial:

References

About

Resources

Stars

Watchers

Forks

Languages