World Scientific

A Hybrid Online Off-Policy Reinforcement Learning Agent Framework Supported by Transformers

    https://doi.org/10.1142/S012906572350065X
    Cited by: 2 (Source: Crossref)

    Reinforcement learning (RL) is a powerful technique that allows agents to learn optimal decision-making policies through interaction with an environment. However, traditional RL algorithms suffer from several limitations, such as the need for large amounts of data and long-term credit assignment, i.e. the problem of determining which actions actually produce a given reward. Recently, Transformers have shown their capacity to address these constraints in the offline RL setting. This paper proposes a framework that uses Transformers to enhance the training of online off-policy RL agents and to address the challenges described above through self-attention. The proposal introduces a hybrid agent with a mixed policy that combines an online off-policy agent with an offline Transformer agent based on the Decision Transformer architecture. By sequentially exchanging the experience replay buffer between the agents, the hybrid agent's training efficiency is improved in the first iterations, as is the training of Transformer-based RL agents in situations with limited data availability or unknown environments.
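The replay-buffer exchange described in the abstract could be sketched roughly as follows. This is an illustrative outline under assumptions, not the paper's actual implementation: both agent classes are placeholder stubs (a real version would wrap a DQN-style learner and a Decision Transformer), and names such as `hybrid_training` are invented for this sketch. The point it shows is the structure: the online agent interacts with the environment to fill a shared experience replay buffer, and both agents then train on batches sampled from that same buffer.

```python
import random
from collections import deque


class ReplayBuffer:
    """Shared experience store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity=10_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Sample without replacement; cap at the current buffer size.
        return random.sample(self.storage, min(batch_size, len(self.storage)))

    def __len__(self):
        return len(self.storage)


class OnlineOffPolicyAgent:
    """Stand-in for an online off-policy learner (e.g. a DQN-style agent)."""

    def __init__(self):
        self.updates = 0

    def act(self, state):
        return random.choice([0, 1])  # placeholder policy

    def train(self, batch):
        self.updates += 1  # a gradient step on the batch would go here


class DecisionTransformerAgent:
    """Stand-in for an offline agent with a Decision Transformer architecture."""

    def __init__(self):
        self.updates = 0

    def train(self, batch):
        self.updates += 1  # a sequence-model update on the batch would go here


def hybrid_training(episodes=4, steps_per_episode=5, batch_size=8):
    """Alternate environment interaction with training on the shared buffer."""
    buffer = ReplayBuffer()
    online_agent = OnlineOffPolicyAgent()
    dt_agent = DecisionTransformerAgent()
    state = 0.0
    for _ in range(episodes):
        # 1) The online off-policy agent interacts and fills the shared buffer.
        for _ in range(steps_per_episode):
            action = online_agent.act(state)
            next_state, reward, done = state + 1.0, float(action), False  # toy env
            buffer.add((state, action, reward, next_state, done))
            state = next_state
        # 2) Both agents train on the same replayed experience.
        batch = buffer.sample(batch_size)
        online_agent.train(batch)
        dt_agent.train(batch)
    return buffer, online_agent, dt_agent
```

In this sketch each episode appends `steps_per_episode` transitions and triggers one update per agent, so the early-iteration data collected by the online agent immediately benefits the Transformer agent as well, which is the data-efficiency argument the abstract makes.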