Advancing Mathematical Reasoning Excellence via Self-Play Reinforcement Learning Frameworks for Recursive Logic Improvement in Large Language Models
DOI: https://doi.org/10.66280/ijair.v1i2.155

Abstract
The evolution of large language models has progressed from simple linguistic pattern recognition to complex cognitive task execution, yet consistent and verifiable mathematical reasoning remains a significant architectural challenge. This paper explores the systemic integration of self-play reinforcement learning frameworks as a primary mechanism for driving recursive logic improvement within transformer-based architectures. Unlike traditional supervised fine-tuning, which is inherently limited by the quality and volume of human-annotated data, self-play frameworks allow models to generate their own synthetic reasoning paths, evaluate them against logical ground truths, and iteratively refine their internal policy through competitive and collaborative cycles. We focus on the system-level implications of such frameworks, emphasizing the structural trade-offs between computational intensity and reasoning robustness, and we investigate the deployment of dual-agent systems in which a generator proposes solutions and a verifier provides nuanced feedback, creating a feedback loop that mimics human-like metacognitive reflection. The discussion extends to the infrastructure requirements for large-scale recursive training, the socio-technical implications of autonomous logic refinement, and the governance frameworks necessary to keep such systems transparent and fair. By analyzing the intersection of reinforcement learning and recursive logic, this study provides a comprehensive roadmap for developing infrastructures capable of sustained mathematical excellence without human intervention, while addressing the critical challenges of hallucination mitigation and algorithmic sustainability.
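The generator–verifier feedback loop described in the abstract can be sketched in miniature. The following is an illustrative toy only, assuming a simple arithmetic task, a small set of named candidate strategies, and a multiplicative-weights policy update; all names and the update rule are assumptions standing in for the full reinforcement learning machinery discussed in the paper, not its actual implementation.

```python
import random

def verifier(problem, answer):
    """Score a proposed answer against the logical ground truth (a + b)."""
    a, b = problem
    return 1.0 if answer == a + b else 0.0

class Generator:
    """Maintains a policy (a weight distribution) over candidate strategies."""

    def __init__(self, strategies):
        self.strategies = strategies
        n = len(strategies)
        self.weights = {name: 1.0 / n for name in strategies}

    def propose(self, problem, rng):
        # Sample a strategy in proportion to its current weight.
        names = list(self.weights)
        name = rng.choices(names, weights=[self.weights[k] for k in names])[0]
        return name, self.strategies[name](problem)

    def update(self, name, reward, lr=0.5):
        # Reinforce strategies the verifier accepts, suppress the rest,
        # then renormalize so the weights remain a proper distribution.
        self.weights[name] *= 1.0 + lr * (2.0 * reward - 1.0)
        total = sum(self.weights.values())
        self.weights = {k: v / total for k, v in self.weights.items()}

def self_play(rounds=500, seed=0):
    rng = random.Random(seed)
    strategies = {
        "add": lambda p: p[0] + p[1],       # the correct procedure
        "subtract": lambda p: p[0] - p[1],  # a systematically flawed one
        "guess": lambda p: rng.randint(0, 20),
    }
    gen = Generator(strategies)
    for _ in range(rounds):
        problem = (rng.randint(0, 10), rng.randint(0, 10))
        name, answer = gen.propose(problem, rng)
        gen.update(name, verifier(problem, answer))
    return gen.weights

weights = self_play()
```

Over repeated rounds the verifier's reward signal concentrates the generator's policy on the strategy that survives verification, which is the essential dynamic of the recursive refinement loop, here compressed into a bandit-style toy.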
License
Copyright (c) 2026 International Journal of Artificial Intelligence Research

This article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



