My solutions to CS231N CNN assignments (Jupyter Notebook; updated Mar 14, 2018)
PyTorch code for the Findings of NAACL 2022 paper "Probing the Role of Positional Information in Vision-Language Models".
Arabic WordNet matches for synsets in ImageNet
Source code and documentation for the LREC-COLING'24 paper "Sharing the Cost of Success: A Game for Evaluating and Learning Collaborative Multi-Agent Instruction Giving and Following Policies"
Targeted semantic multimodal input ablation. Official implementation of the ablation method introduced in the paper: "What Vision-Language Models 'See' when they See Scenes"
Training and inference code for a model that extracts license plate numbers.
A (hopefully) relatively straightforward, easy to modify code base for running a variety of multi-task optimization setups, with a focus on gradient aggregation methods and model analysis.
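As a hypothetical illustration of the kind of gradient-aggregation method such a multi-task code base might implement, here is a minimal sketch of the simplest strategy, element-wise averaging of per-task gradients; the function name and shapes are illustrative, not taken from the repository:

```python
# Hypothetical sketch: aggregate per-task gradients by averaging.
# In multi-task optimization, each task produces its own gradient for
# the shared parameters; a simple aggregation rule combines them into
# one update direction. Names here are illustrative only.

def average_gradients(per_task_grads):
    """Element-wise average of a list of same-length gradient vectors."""
    num_tasks = len(per_task_grads)
    dim = len(per_task_grads[0])
    return [sum(g[i] for g in per_task_grads) / num_tasks for i in range(dim)]

# Two tasks pulling a 3-parameter model in different directions:
grads = [[1.0, 2.0, 3.0], [3.0, 0.0, -1.0]]
print(average_gradients(grads))  # -> [2.0, 1.0, 1.0]
```

More elaborate aggregation methods (e.g. conflict-aware projection schemes) replace the plain average with a rule that reconciles opposing per-task directions before updating.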
alt text for lazy people
A comparative study of two of the best-performing open-source Vision-Language Models: Google Gemini Vision and CogVLM
A multimodal model for language-guided socially compliant robot navigation.
Code and models for the paper 'Exploring Multi-Modal Representations for Ambiguity Detection & Coreference Resolution in the SIMMC 2.0 Challenge' published at AAAI 2022 DSTC10 Workshop
An open-source API built on FastAPI for visual question answering.
Reading group for Vision and Language research
[Frontiers in AI Journal] Implementation of the paper "Interpreting Vision and Language Generative Models with Semantic Visual Priors"
An end-to-end multimodal framework incorporating explicit knowledge graphs and OOD detection. (NeurIPS 2023)
"Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA"
VinVL+L: Enriching Visual Representation with Location Context in Visual Question Answering (VQA)
[INLG2023] The High-Level (HL) dataset is a Vision and Language (V&L) resource aligning object-centric descriptions from COCO with high-level descriptions crowdsourced along 3 axes: scene, action, rationale.
Vision-Controllable Natural Language Generation