Poster in the Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
Steering Fine-Tuning with Targeted Concept Ablation
Helena Casademunt · Caden Juang · Senthooran Rajamanoharan · Neel Nanda
Models often learn unintended behaviors during fine-tuning, such as adopting spurious correlations present in the training data. We present a novel technique for controlling what models learn during fine-tuning by identifying and ablating specific sparse autoencoder latents that represent undesired concepts. Our approach steers models toward intended generalizations even in cases where multiple policies correctly fit the training data. We evaluate our method on two tasks: a gender bias task containing spurious correlations and a multiple-choice task where models must learn to focus on intended questions while ignoring others. Our technique successfully guides models to learn the intended generalization in 13 of 17 cases, significantly outperforming random ablation baselines. These results demonstrate a practical application of interpretability techniques for ensuring the safe and reliable deployment of AI models.
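To illustrate the kind of intervention the abstract describes, the sketch below shows one plausible way to ablate chosen SAE latents during fine-tuning in PyTorch: a forward hook subtracts the ablated latents' decoder contributions from a layer's activations, so gradients flow through representations with the undesired concept removed. This is not the authors' implementation; the SAE class, layer choice, and latent indices are hypothetical placeholders.

```python
# Minimal sketch (assumptions, not the paper's code): ablate selected SAE
# latents in one transformer layer's output during fine-tuning.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE: ReLU encoder plus a linear decoder back to the residual stream.
    In practice the SAE would be pretrained on the model's activations."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_latent) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.W_dec = nn.Parameter(torch.randn(d_latent, d_model) * 0.02)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x @ self.W_enc + self.b_enc)


def make_ablation_hook(sae: SparseAutoencoder, latent_ids: list[int]):
    """Return a forward hook that removes the chosen latents' contribution:
    x <- x - f_i(x) * W_dec[i] for each ablated latent i."""

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        acts = sae.encode(hidden)                                # (batch, seq, d_latent)
        contrib = acts[..., latent_ids] @ sae.W_dec[latent_ids]  # (batch, seq, d_model)
        ablated = hidden - contrib
        if isinstance(output, tuple):
            return (ablated,) + output[1:]
        return ablated

    return hook


# Hypothetical usage: register the hook on a target layer before running the
# standard fine-tuning loop, then remove it afterwards.
#   layer = model.transformer.h[TARGET_LAYER]
#   handle = layer.register_forward_hook(make_ablation_hook(sae, undesired_latents))
#   ...fine-tune as usual...
#   handle.remove()
```

A design note on this sketch: subtracting each ablated latent's decoder direction (rather than zeroing the activation wholesale) leaves the rest of the representation intact, which matches the abstract's goal of removing only the undesired concept while preserving the signal needed to fit the training data.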