
A 3D CNN-based video classification Android application that transcribes the lip movements of a speaker in a silent video into text. The neural network captures the spatio-temporal information in the video needed to generate words. MLOps with Vertex AI was used to deploy the model to the Android app in a CI/CD fashion.


LipScribe: an application that converts the lip movements of a speaker in a silent video into text and displays the result in an Android app. Exploiting the ability of 3D CNNs to extract information from spatio-temporal data, this deep learning project aims at predicting words from a sequence of video frames. A sketch of such a network follows.
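
The network architecture itself is not shown in this README; below is a minimal sketch of a comparable 3D CNN, using the Keras API bundled with TensorFlow 1.15. The layer sizes, the 75-frame window, the 100x50 lip crops, and the 10-word vocabulary are illustrative assumptions, not values taken from this repository.

```python
# A minimal sketch of a 3D CNN word classifier for lip reading.
# Assumptions (not from the repo): 75 frames per clip, 50x100 RGB lip crops,
# a 10-word vocabulary, and these particular layer sizes.
import tensorflow as tf  # written against TF 1.15
from tensorflow.keras import layers, models

def build_lip_reader(num_words=10, frames=75, h=50, w=100):
    model = models.Sequential([
        # Conv3D slides over (time, height, width), so each filter sees how
        # the lip region changes across neighbouring frames.
        layers.Conv3D(32, (3, 3, 3), activation="relu",
                      input_shape=(frames, h, w, 3)),
        layers.MaxPooling3D((1, 2, 2)),  # downsample space, keep time
        layers.Conv3D(64, (3, 3, 3), activation="relu"),
        layers.MaxPooling3D((1, 2, 2)),
        layers.GlobalAveragePooling3D(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_words, activation="softmax"),  # one class per word
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```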

Installing and using the Android application:

  1. Install the Android application using the APK from https://drive.google.com/drive/folders/10pGHK0VYddb7Kn0rjqMDCVR4Nh-bR_U3?usp=sharing
  2. On launch, the application checks for the camera and requests permission to record videos and capture pictures.
  3. Allow the application to record videos and capture pictures.
  4. Click 'Allow' when the application requests access to files and media.
  5. Click 'Start camera' on the start page to begin recording a video.
  6. The recorded video is read from external storage and passed to the model for prediction; this happens in the background while a loading screen is displayed.
  7. The predicted word uttered by the speaker is displayed on the screen.

How the application works:

  1. Use the Android application to record a video.
  2. The video goes through preprocessing: frames are extracted from the video, and a Haar Cascade classifier then locates the speaker's lips in those frames (see the sketch after this list).
  3. The cropped lip sequence is fed to the 3D CNN model, which outputs the predicted word.
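
A minimal sketch of this preprocessing step is shown below. It assumes OpenCV's bundled frontal-face Haar cascade and a simple lower-third-of-the-face heuristic for the mouth; the crop size is an illustrative choice, and `cv2.VideoCapture` stands in for the ffmpeg frame extraction the app actually uses.

```python
# A preprocessing sketch: detect the face per frame with a Haar cascade and
# crop an approximate lip region. Crop size and the lower-third heuristic are
# assumptions for illustration, not the repository's exact pipeline.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_frames(video_path, size=(100, 50)):
    """Return a list of cropped lip regions, one per frame with a face."""
    cap = cv2.VideoCapture(video_path)
    lips = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5)
        if len(faces) == 0:
            continue  # the app shows "no face detected" in this case
        x, y, w, h = faces[0]
        # The mouth lies roughly in the lower third of the detected face box.
        mouth = frame[y + 2 * h // 3 : y + h, x : x + w]
        lips.append(cv2.resize(mouth, size))
    cap.release()
    return lips
```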

Model Evaluation:

[Model evaluation plot]

Android Application:

An application compatible with the Android operating system was developed to serve the model's predictions. The application requires a model built with TensorFlow 1.15. The ffmpeg library is used to extract frames from the video for data preprocessing. The mouth region is extracted, converted into embeddings, and passed as input to the model; a conversion sketch for on-device deployment follows.
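
To run on a phone, a TF 1.15 model is typically converted to TensorFlow Lite. Below is a hedged sketch of that conversion; the SavedModel path and output file name are hypothetical placeholders, not names from this repository.

```python
# A minimal conversion sketch, assuming TensorFlow 1.15 and that the trained
# 3D CNN was exported as a SavedModel. Paths are hypothetical.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("lipscribe_saved_model")
tflite_model = converter.convert()

# The resulting .tflite file is what the Android app would bundle and run.
with open("lipscribe.tflite", "wb") as f:
    f.write(tflite_model)
```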

Demo:

Detecting a word in the app: [screenshots]

No face detected in the app: [screenshots]
