Tag: training Qwen2.5-Math-7B with erroneous rewards