AI SAFETY & ETHICS
Synthetic Persona Pretraining: Alignment from Token Zero
LessWrong AI
Julian Minder, Viktor Moskvoretskii, Ragha Singhal, Difan Jiao, Kartik Bali, Yiderigun Borjigin, Shaobo Cui, Stefan Krsteski, Ashton Anderson, Roland Aydin, Robert West (equal contribution)

These are early results, but we wanted to share them with the community now. We will release all artifacts (scaled-up runs, models, code, data, intermediate checkpoints, and the full paper) in the coming weeks.

Figure 1: Mean attack success rate across five adversarial benchmarks. All models have 1.7B parameters and were pretrained on 100B tokens, then post-trained with identical SFT (except for SafeLM).