Autonomous Carrier Landing Control Strategy for VTOL UAVs Based on Deep Deterministic Policy Gradient Reinforcement Learning
DOI: https://doi.org/10.66280/ijair.v1i1.5

Abstract
Autonomous shipboard recovery of vertical take-off and landing (VTOL) unmanned aerial vehicles (UAVs) is characterized by tight terminal constraints, rapidly varying wind disturbances, and deck motion induced by sea states. These factors lead to significant model uncertainty and render purely model-based designs brittle when the operating envelope broadens.
This paper develops an autonomous carrier-landing control strategy based on Deep Deterministic Policy Gradient (DDPG) for continuous control. Carrier recovery is formulated as a constrained Markov decision process (CMDP) using a deck-relative state representation and an action space consistent with common inner-loop attitude/thrust architectures. To improve training stability and reduce unsafe behaviors, we propose (i) a structured reward with explicit terminal touchdown constraints, (ii) constraint-aware termination and curriculum scheduling across approach phases, and (iii) domain randomization over aerodynamics, actuator dynamics, sensing latency/noise, wind gusts, and deck motion.
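As a minimal sketch, the structured reward (i) and constraint-aware termination (ii) might be organized as below; all weights and thresholds (e.g., MAX_TD_SPEED, MAX_TILT_DEG) are illustrative assumptions, not values from this work:

```python
# Minimal sketch of a structured landing reward with explicit terminal
# touchdown constraints. All names and numeric thresholds here are
# illustrative placeholders, not values reported in the paper.
import numpy as np

# Hypothetical terminal constraint thresholds (illustrative only).
MAX_TD_SPEED = 0.5      # m/s, deck-relative vertical speed at touchdown
MAX_LATERAL_ERR = 1.0   # m, horizontal offset from the deck target point
MAX_TILT_DEG = 8.0      # deg, attitude error relative to the deck plane

def shaped_reward(rel_pos, rel_vel, tilt_deg, action, touched_down):
    """Dense shaping toward the deck target plus sparse terminal terms.

    rel_pos, rel_vel: 3-vectors of deck-relative position and velocity.
    tilt_deg: attitude error w.r.t. the deck plane, in degrees.
    action: last commanded action, penalized to discourage chattering.
    """
    # Dense shaping: progress toward the target and smooth control effort.
    r = -0.1 * np.linalg.norm(rel_pos) - 0.05 * np.linalg.norm(rel_vel)
    r -= 0.01 * float(np.sum(np.square(action)))

    if touched_down:
        ok = (abs(rel_vel[2]) <= MAX_TD_SPEED
              and np.linalg.norm(rel_pos[:2]) <= MAX_LATERAL_ERR
              and tilt_deg <= MAX_TILT_DEG)
        # Sparse terminal term: reward a constraint-satisfying touchdown,
        # penalize a violating one; the episode terminates either way
        # (constraint-aware termination).
        r += 100.0 if ok else -100.0
    return r
```

The sparse terminal term mirrors the CMDP touchdown constraints, so the dense shaping can guide the approach without masking constraint violations at contact.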
Comprehensive simulation studies demonstrate that the learned policy achieves higher landing success rates and lower touchdown dispersion than tuned PID guidance–control baselines under a wide range of perturbations. We further report ablations on reward terms and randomization ranges, and discuss practical considerations for sim-to-real transfer.
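Likewise, the domain randomization in (iii) and the randomization ranges ablated above can be sketched as per-episode parameter sampling; every interval below is an illustrative assumption rather than a value used in the reported experiments:

```python
# Illustrative domain-randomization ranges of the kind ablated above.
# Every interval is a placeholder assumption, not a reported value.
import random

RANDOMIZATION_RANGES = {
    "mass_scale":       (0.9, 1.1),    # +/-10% vehicle mass
    "thrust_scale":     (0.85, 1.15),  # actuator effectiveness
    "actuator_lag_s":   (0.02, 0.10),  # first-order actuator time constant
    "sensor_delay_s":   (0.00, 0.05),  # sensing latency
    "sensor_noise_std": (0.00, 0.05),  # additive measurement noise
    "gust_speed_mps":   (0.0, 10.0),   # peak wind-gust magnitude
    "sea_state":        (1, 5),        # deck-motion severity index
}

def sample_episode_params(ranges=RANDOMIZATION_RANGES):
    """Draw one set of simulation parameters per training episode."""
    params = {}
    for name, (lo, hi) in ranges.items():
        if isinstance(lo, int) and isinstance(hi, int):
            params[name] = random.randint(lo, hi)   # discrete index
        else:
            params[name] = random.uniform(lo, hi)   # continuous scale
    return params
```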
License
Copyright (c) 2026 International Journal of Artificial Intelligence Research

This article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.