Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models - 🔥 ICLR 2024 Spotlight - 🏆 Best Paper Award SoCal NLP 2023
🐢 Open-Source Evaluation & Testing for LLMs and ML models
A curated list of awesome responsible machine learning resources.
RuLES: a benchmark for evaluating rule-following in language models
Aira is a series of chatbots developed as an experimentation playground for value alignment.
Scan your AI/ML models for problems before you put them into production.
Website to track people, organizations, and products (tools, websites, etc.) in AI safety
Evaluation & testing framework for computer vision models
DPLL(T)-based verification tool for DNNs
Universal Neurons in GPT2 Language Models
Extended multi-agent and multi-objective (MaMoRL) environments based on DeepMind's AI Safety Gridworlds: a suite of reinforcement learning environments illustrating various safety properties of intelligent agents, made compatible with OpenAI Gym/Gymnasium and the Farama Foundation's PettingZoo.
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
A novel physical adversarial attack tackling the Digital-to-Physical Visual Inconsistency problem.
The official implementation of the paper "Data Contamination Calibration for Black-box LLMs" (ACL 2024)
Code for our paper "ModelObfuscator: Obfuscating Model Information to Protect Deployed ML-Based Systems", published at ISSTA'23
An attack that induces hallucinations in LLMs