Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics

arXiv:2605.26840v1 Announce Type: new Reinforcement learning with evaluation metrics as rewards is widely used to enhance specific capabilities of language models. However, for tasks such as factually consistent summarisation, existing metrics remain underdeveloped, limiting their effectiveness as signals for shaping model behaviour. While individual factuality metrics are unreliable, their combination can effectively capture diverse factual errors. We leverage this insight to