How Direct Reasoning Optimization Teaches LLMs to Grade Their Own Thinking

Large language models have learned to reason well in math and coding thanks to reinforcement learning with verifiable rewards, where an answer can be checked automatically. Open-ended tasks like rewri...

chain-of-thought, FinQA, GRPO, ParaRev, R3, reinforcement learning, RLVR