Alignment and Unlearning
Learn Your Reference Model for Real Good Alignment (arXiv:2404.09656)
Aligning Teacher with Student Preferences for Tailored Training Data Generation (arXiv:2406.19227)
Self-Play Preference Optimization for Language Model Alignment (arXiv:2405.00675)
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues (arXiv:2404.03820)
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning (arXiv:2407.00617)
UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI (arXiv:2407.00106)
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (arXiv:2406.12624)
Simulating Classroom Education with LLM-Empowered Agents (arXiv:2406.19226)
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (arXiv:2406.18495)
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models (arXiv:2406.18510)
Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces (arXiv:2406.11614)
Large Language Model Unlearning via Embedding-Corrupted Prompts (arXiv:2406.07933)
Deep Bayesian Active Learning for Preference Modeling in Large Language Models (arXiv:2406.10023)
Transforming and Combining Rewards for Aligning Large Language Models (arXiv:2402.00742)
LongAlign: A Recipe for Long Context Alignment of Large Language Models (arXiv:2401.18058)
Learning to Refuse: Towards Mitigating Privacy Risks in LLMs (arXiv:2407.10058)
To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models (arXiv:2407.01920)
Rethinking Entity-level Unlearning for Large Language Models (arXiv:2406.15796)
The Art of Saying No: Contextual Noncompliance in Language Models (arXiv:2407.12043)
Instruction Following without Instruction Tuning (arXiv:2409.14254)
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation (arXiv:2410.09584)