Unsupervised Identification and Removal of Spurious Correlations During Fine-Tuning

arXiv:2605.27676v1 Announce Type: cross Fine-tuning a pretrained language model on a curated dataset can produce spurious correlations between the fine-tuning task and unintended latent factors -- such as misaligned personas or political slant -- that the curation procedure has entangled with the task. The model can latch onto these spurious correlations, leading to bias and reduced out-of-distribution generalisation.