OXRL Study: Post-Training Algorithm Rankings Invert with Model Scale, Loss Modifications Offer Negligible Gains

Source: DEV Community
A controlled study of 51 post-training algorithms across 240 runs finds that algorithm performance rankings completely invert between 1.5B and 7B parameter models, and that the choice of loss function provides less than 1 percentage point of leverage compared to model scale.

A comprehensive, controlled study posted to arXiv has delivered a sobering reality check for the post-training alignment community. The paper, "Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions," presents results from the OXRL framework, a unified system that implements 51 different post-training algorithms on identical infrastructure to enable the first true apples-to-apples comparison. The study, which required approximately 240 training runs on H100 GPUs, systematically evaluated 8 core algorithms across 4 model scales (0.5B
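
The article does not describe OXRL's actual API, but a controlled comparison of this kind typically holds data, infrastructure, and evaluation fixed while varying only the algorithm and model scale. Below is a minimal sketch of how such an experiment grid might be organized; all names (`Algorithm`, `RunConfig`, `run_grid`) are hypothetical, not from the paper.

```python
# Hypothetical sketch of a unified post-training comparison harness.
# OXRL's real interface is not public in this article; every name
# below is an illustrative assumption.
from dataclasses import dataclass
from typing import Callable, Iterator
import itertools

@dataclass(frozen=True)
class Algorithm:
    name: str
    loss_fn: Callable  # the only component that varies across algorithms

@dataclass(frozen=True)
class RunConfig:
    algorithm: Algorithm
    model_scale: str  # e.g. "0.5B", "1.5B", "7B"
    seed: int

def run_grid(
    algorithms: list[Algorithm],
    scales: list[str],
    seeds: list[int],
) -> Iterator[RunConfig]:
    """Enumerate the full controlled grid: every algorithm at every
    scale with every seed, all sharing identical data, tokenizer,
    and training infrastructure so results are directly comparable."""
    for algo, scale, seed in itertools.product(algorithms, scales, seeds):
        yield RunConfig(algorithm=algo, model_scale=scale, seed=seed)
```

A grid of 8 core algorithms by 4 scales, repeated over multiple seeds, is how a design like this would accumulate on the order of the 240 total runs the study reports.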