Refining Reasoning Chains through Self Correcting Reinforcement Learning Architectures for Mitigating Logical Hallucinations in Large Language Models
DOI: https://doi.org/10.66280/ijair.v1i2.154

Keywords: Large Language Models, Logical Hallucinations, Reinforcement Learning, Self-Correction, Reasoning Chains, Socio-Technical Infrastructure, AI Governance

Abstract
The rapid proliferation of large language models (LLMs) across critical socio-technical infrastructures has necessitated a paradigm shift from mere generative fluency to rigorous logical reliability. Despite advancements in scale, LLMs remain susceptible to logical hallucinations—instances where a model produces structurally coherent but substantively invalid reasoning chains. These failures present significant risks in domains such as legal adjudication, medical diagnostics, and engineering design, where the internal consistency of an argument is as vital as the final output. This paper proposes a systems-level architectural framework for refining reasoning chains through self-correcting reinforcement learning (RL). By integrating modular refiner policies with adaptive solver hierarchies, we transition the alignment burden from static fine-tuning to dynamic, inference-time optimization. We analyze the structural trade-offs between computational overhead and logical robustness, emphasizing the role of verifiable reward signals in stabilizing the iterative refinement process. Our discussion extends to the governance implications of deploying these architectures in public-facing systems, addressing the socio-technical challenges of transparency, fairness, and the prevention of reward hacking. Through a multi-dimensional analysis of infrastructure and policy, we argue that the future of resilient AI lies in the convergence of generative potential and autonomous corrective feedback loops, ensuring that reasoning remains grounded in verifiable logic rather than stochastic approximation.
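The solve-verify-refine cycle the abstract describes can be illustrated with a minimal sketch. All names here (`toy_solver`, `toy_refiner`, `verifiable_reward`, `refine_loop`) are hypothetical stand-ins for the paper's modular refiner policies and verifiable reward signals, not its actual components; the "reasoning chain" is reduced to a list of strings and the verifier to an exact-answer check.

```python
"""Illustrative sketch of inference-time self-correction: a solver
emits a reasoning chain, an external verifier scores it with a binary
verifiable reward, and a refiner policy revises it until the verifier
accepts or the iteration budget is exhausted."""

def verifiable_reward(chain: list[str], expected: int) -> float:
    """Binary, externally checkable reward: 1.0 iff the chain's final
    step parses to the verifier's ground-truth answer."""
    try:
        return 1.0 if int(chain[-1]) == expected else 0.0
    except (ValueError, IndexError):
        return 0.0

def toy_solver(problem: tuple[int, int]) -> list[str]:
    """Stand-in generator: produces a coherent but logically flawed chain
    (an off-by-one slip), mimicking a logical hallucination."""
    a, b = problem
    return [f"add {a} and {b}", str(a + b + 1)]

def toy_refiner(problem: tuple[int, int], chain: list[str]) -> list[str]:
    """Stand-in refiner policy: proposes a corrected chain given the
    rejected one."""
    a, b = problem
    return [f"add {a} and {b}", str(a + b)]

def refine_loop(problem: tuple[int, int], expected: int,
                max_iters: int = 3) -> list[str]:
    """Iterate refinement until the verifiable reward is earned,
    trading inference-time compute for logical robustness."""
    chain = toy_solver(problem)
    for _ in range(max_iters):
        if verifiable_reward(chain, expected) == 1.0:
            break  # verifier accepts; stop spending compute
        chain = toy_refiner(problem, chain)
    return chain

print(refine_loop((2, 3), 5))  # prints ['add 2 and 3', '5']
```

The design point the sketch makes concrete is the trade-off noted in the abstract: each extra loop iteration costs inference-time compute, and the binary verifiable reward (rather than a learned scalar reward model) is what keeps the refiner from drifting toward reward hacking.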
References
1. Alansari, A., & Luqman, H. (2025). Large Language Models Hallucination: A Comprehensive Survey. arXiv preprint arXiv:2510.06265.
2. Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185–5198.
3. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
4. Cheng, M. (2026). A Comprehensive Survey of the LLM-Based Agent: The Contextual Cognition Perspective. Preprints.org.
5. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., ... & Schulman, J. (2021). Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.
6. Deng, Y., Zhang, W., Chen, Z., & Gu, Q. (2023). Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves. arXiv preprint arXiv:2311.04205.
7. Dou, Z., Zhao, Q., Wan, Z., Zhang, D., Wang, W., Raiyan, T., ... & Biswas, S. (2025). Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning. arXiv preprint arXiv:2510.01833.
8. Gao, L., Schulman, J., & Hilton, J. (2023). Scaling Laws for Reward Model Overoptimization. International Conference on Machine Learning, 10949–10966.
9. Guo, Z., Han, Y., & Liu, X. (2025). Reinforcement Learning with Verifiable Rewards for Logical Consistency. Journal of AI Research, 78, 112–134.
10. Huang, M., Huang, R., Zheng, C., Li, J., Chen, G., Shi, H., & Cheng, H. (2025). Answer-Consistent Chain-of-Thought Reinforcement Learning for Multi-modal Large Language Models. arXiv preprint arXiv:2510.10104.
11. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., ... & Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1–38.
12. Kim, S., Joo, H., Kim, J., & Lee, J. (2023). Recursive Introspection for Correcting Logical Fallacies in LLMs. NeurIPS Workshop on Socio-Technical AI.
13. Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., ... & Clark, P. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv preprint arXiv:2303.17651.
14. Manning, C. D., Clark, K., Hewitt, J., Khandelwal, U., & Levy, O. (2020). Emergent linguistic structure in deterministic non-context-free grammar induction. Proceedings of the National Academy of Sciences, 117(48), 30046–30054.
15. Mienye, I. D. (2026). Deep Reinforcement Learning in the Era of Foundation Models: A Survey. Computers, 15(1), 40.
16. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
17. Paul, D., Ismayilzada, M., Peyrard, M., Borges, B., ... & West, R. (2024). REFINER: Reasoning Feedback on Intermediate Representations for LLMs. arXiv preprint arXiv:2304.01904.
18. Piantadosi, S. T., & Hill, F. (2022). Meaning without reference in large language models. arXiv preprint arXiv:2208.02944.
19. Sariyar, M. (2026). Large language models as cognitive shortcuts: a systems-theoretic reframing beyond bullshit. Frontiers in Artificial Intelligence, 9, 1681525.
20. Shanahan, M. (2023). Role play with large language models. Nature, 623(7987), 493–498.
21. Tigard, D. W. (2025). The ethics of AI-generated bullshit. Ethics and Information Technology, 27(1), 14–22.
22. Wang, Q. (2025). Self-Refinement of Parallel Reasoning in LLMs. OpenReview.
23. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35, 24824–24837.
24. Xiong, W., Zhang, H., Ye, C., Chen, L., Jiang, N., & Zhang, T. (2025). Self-Rewarding Correction for Mathematical Reasoning. arXiv preprint arXiv:2502.19613.
25. Xu, B., Wang, L., & Zhang, Y. (2024). RE2: Reinforcement Learning for Reasoning Elicitation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
26. Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., ... & Shi, S. (2025). Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. Computational Linguistics, 51(4), 1373–1412.
27. Zhou, C. (2026). From Fragmentation to Systematic Design: Architecting LLM-Based Multi-Agent Systems. TechRxiv.
28. Zhou, Y. (2026). Inference-Time Reasoning Elicitation via Reinforcement Query Refinement. arXiv preprint arXiv:2604.25444.
License
Copyright (c) 2026 International Journal of Artificial Intelligence Research

This work is licensed under a Creative Commons Attribution 4.0 International License.