AI Daily – 2025-05-28 (Morning)

erroneous rewards, MATH-500, MATH-500 test set, model performance, Qwen2.5-Math-7B, random rewards, random rewards improve model performance, reinforcement learning, reinforcement learning signal learning, RLAIF, RLHF, the future of RLHF/RLAIF, training Qwen2.5-Math-7B with erroneous rewards