Our research on LLM safety
- Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
  Paper • 2402.11746 • Published • 2
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
  Paper • 2308.09662 • Published • 3
- Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases
  Paper • 2310.14303 • Published • 1
- Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming
  Paper • 2406.11654 • Published • 6