
Applications of Reinforcement Learning: From Video Games to Self-Driving Cars



Authors: Clay Hsieh, Xander Matcuk, DeVaughn Prince, and Shiven Seth


Mentor: Harry Moore. Harry is a doctoral candidate in the Department of Engineering at the University of Cambridge with research specialization in Artificial Intelligence.

 

Abstract

Reinforcement learning (RL) has emerged as a pivotal framework in the field of machine learning, offering powerful solutions for complex decision-making problems where an agent learns to make sequences of decisions by interacting with an environment. This review paper synthesizes recent advancements in RL, tracing its evolution from foundational models like Q-learning and policy gradients to contemporary techniques such as deep reinforcement learning. We explore key algorithms, including value- and policy-based learning, which form the backbone of RL algorithms. The paper also highlights significant breakthroughs facilitated by deep neural networks, enabling RL to tackle high-dimensional state spaces previously considered intractable. Notable applications in robotics, autonomous driving, and game playing are examined to illustrate the practical impacts of RL. Additionally, the review addresses current challenges such as sample efficiency and exploration-exploitation trade-offs. By providing a comprehensive overview of RL's methodologies, applications, and future directions, this paper aims to evaluate the current state of research in RL.


Introduction

 

Reinforcement learning (RL) is a machine learning approach where an agent learns to make optimal decisions by interacting with its environment to maximize some objective (Byeon, 2023). The field has been evolving since the mid-twentieth century, with significant progress since the 1980s. It has gained significant popularity recently due to advancements in computing power and the application of deep neural networks.


The recent resurgence of interest in RL has been fueled by breakthroughs in both theoretical foundations and practical applications. The development of deep reinforcement learning, which combines deep neural networks with RL techniques, has allowed agents to learn complex behaviors from raw sensory inputs. This has expanded the scope of RL from simple, well-defined environments to more intricate and dynamic ones, making it a cornerstone of modern artificial intelligence research (Sutton & Barto, 2018).


One key area in which reinforcement learning algorithms have made large strides is strategy games, which involve complex decision-making scenarios. This has become a promising and rapidly developing area for RL algorithms. Initially, this technology was applied to simple games like Atari (Mnih, 2015). However, advancements in technology and research have enabled RL agents to perform exceptionally well in complex games such as Dota 2 (Berner, 2019) and StarCraft II (Vinyals, 2019). These games present challenges such as long-term planning, reasoning with incomplete information, and interactions with multiple agents, which remain areas of active research in reinforcement learning.


In these applications, RL agents learn efficiently from experience to develop effective strategies. These advancements have implications beyond their initial problem setting, extending to other fields where problem-solving under uncertain conditions is crucial. For instance, RL is being applied in robotics for tasks such as navigation, object manipulation, and human-robot interaction, where adaptability and learning from real-world feedback are essential (Ibarz, 2021).


In reinforcement learning, the "problem" refers to the task of training the agent and the elements that define that task. RL resembles trial-and-error learning: the agent explores the space of actions available from its current state, and because the strength of the reward signal reflects how closely an action aligns with the desired goal, the agent learns the series of actions that maximizes the total reward (Bhatt, 2018). Balancing this trial and error, known as the exploration-exploitation trade-off, is fundamental to RL, as it weighs the need to gather information about the environment against the need to use known information to make optimal decisions.


The versatility of RL has led to applications in diverse domains, including autonomous systems, healthcare, finance, and climate modeling. In healthcare, for example, RL can personalize treatment plans, optimize resource allocation in hospitals, and assist in drug discovery by modeling complex biological systems (Yu, 2019). In finance, RL is used to develop trading strategies, manage portfolios, and model market behaviors (Rao & Jelvis, 2022).


While reinforcement learning has had a considerable impact across diverse application areas, from strategic gameplay to the optimal control of physical systems, it does have well known limitations. RL often requires large amounts of data and computation to be effective. It needs to interact with the environment and explore different actions to find the optimal policy, which can be inefficient, time-consuming, and costly, depending on the size of the action space. RL is also heavily dependent on the quality of the reward function, the subject of much ongoing research (Sutton & Barto, 2018).


Generalization and transfer are other challenges in reinforcement learning. An agent trained in specific scenarios may not perform well in real-world environments with diverse conditions. For example, an agent trained to drive in a simulation may struggle on a real road with varying weather and traffic conditions (Lu, 2022). There are also issues with scalability and robustness, as agents may face continuous state and action spaces, leading to computational and memory challenges. Explainability and interpretability are also challenging, especially when RL agents are based on complex models like neural networks, making it difficult to understand their decision-making processes or provide feedback (Szepesvári, 2010).


Despite these challenges, reinforcement learning has proven to be capable and useful in many applications, including gaming, self-driving cars, quantitative finance, and healthcare. In healthcare, RL personalizes treatment plans based on individual health data. In finance, RL algorithms are used to predict stock prices. Engineers use RL to teach robots how to navigate and manipulate objects, and in self-driving cars it allows vehicles to learn effective driving policies and respond in real time to traffic conditions (Lu, 2022). While challenges exist, they do not diminish the fact that reinforcement learning is a central capability in the AI and engineering industries, and it will likely continue to play a significant role in these fields in the years to come.


Components of Reinforcement Learning

 

The agent is the central component of any reinforcement learning system, and serves as the decision maker that interacts with the environment to learn optimal behaviors. It takes actions based on observations, receives feedback through rewards and adjusts its strategy to maximize cumulative rewards. The agent’s goal is to discover the best actions to take in a given situation, ultimately learning a policy that achieves the desired outcomes (Sutton & Barto, 2018).


The environment is a similarly central component, and refers to the external context within which the agent operates. It presents states to the agent, receives actions from the agent, and assigns rewards based on the actions taken (Jaffry, 2020). Examples of RL environments include game simulations such as Chess and Go, robotic control simulations like those provided by OpenAI's Gym, and real-world settings such as self-driving cars. The environment dictates the modes of interaction available to the agent, thereby influencing the learning process and the strategies that the agent develops.


The environment provides the context in which the agent operates and learns. It supplies the agent with states and feedback in the form of rewards or penalties based on the agent's actions. The environment's dynamics and structure significantly influence the agent's learning process, shaping its ability to develop effective strategies and achieve desired outcomes. Essentially, the environment is the testing ground where the agent's decisions are evaluated and refined.


Furthermore, the state represents the current situation or context of the environment as perceived by the agent. It provides the agent with the necessary information to make decisions about which actions to take. The state helps track the environment's dynamics and progress towards the agent's goals, playing a key role in determining the appropriate actions and strategies for maximizing rewards.
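The interaction between agent, environment, state, and reward can be made concrete with a minimal interaction loop. The sketch below assumes the Gymnasium library (the maintained successor to OpenAI's Gym mentioned above) and its CartPole environment, with a random policy standing in for the agent's decision rule; a learning agent would replace the random sampling with an action chosen from its current policy.

```python
import gymnasium as gym

# Minimal agent-environment interaction loop (illustrative; assumes Gymnasium).
env = gym.make("CartPole-v1")
state, info = env.reset()            # the environment presents an initial state

for t in range(200):
    action = env.action_space.sample()   # a random policy stands in for the agent
    # The environment returns the next state, a reward, and end-of-episode flags.
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        state, info = env.reset()        # start a fresh episode

env.close()
```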


In the context of autonomous vehicle control, for example, a self-driving car presents a complex system with numerous interacting factors influencing its behavior. Key elements include the car's position (latitude and longitude), velocity (speed and direction), surrounding traffic (location and speed of other vehicles), and prevailing weather conditions (rain, snow, hail, sandstorm, etc.). Data from sensors such as cameras and lidars provide crucial inputs for the car's decision-making process (Sutton & Barto, 2018). These factors contribute to the state of the self-driving car in a particular environment.


The reward function provides immediate feedback by assigning a numerical value to each action, guiding the agent toward desirable outcomes. It acts as a signal to help the agent maximize cumulative rewards, making it a crucial element in shaping the learning process and achieving the desired goals (Sutton & Barto, 2018).

Whilst the reward function defines what is immediately good or bad, the value function captures what is good in the long term: it predicts the total amount of reward the agent expects to accumulate from the current state and all future states.
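In the standard notation of Sutton and Barto (2018), this long-term quantity is the expected discounted sum of future rewards. The formula below is the textbook definition of the state-value function under a policy π, where the discount factor γ (between 0 and 1) weighs immediate against future reward; the symbols follow convention rather than any notation introduced in this paper.

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_{0} = s \right]
```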


Key Algorithms

 

Value-Based Learning

Value-based learning chooses optimal actions by maximizing an expected future reward represented by a function Q(s, a): at any given state s, the agent selects the action a* = argmax_a Q(s, a) (Sutton & Barto, 2018). The function returns a Q value for each possible action a available in that state. This type of learning is similar to trial-and-error, in which the agent explores combinations of actions and states to estimate their values (Watkins & Dayan, 1992). Additionally, when presented with the same state, an agent acting greedily on its learned values will always perform the same action.
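To make the update rule concrete, here is a minimal sketch of tabular Q-learning. It assumes a small, hypothetical environment with discrete integer states and actions exposed through a Gymnasium-style reset()/step() interface; the hyperparameters are illustrative choices, not values from any of the cited works.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))          # Q(s, a) table
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy selection: explore occasionally, otherwise
            # exploit the current estimate argmax_a Q(s, a).
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
            target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```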


Value-based learning techniques are simple to implement and perform best in problems with a limited number of actions and outcomes, for example simple Atari games such as Pong and Breakout. Additionally, this technique ignores certain complexities by simplifying the possible inputs, for example using a limited set of discrete actions instead of continuous 360-degree motion, or simplifying financial trading by only estimating return instead of modeling the whole environment (Lillicrap, 2016).


However, value-based methods can be costly and inefficient, as the agent needs to explore many possibilities to determine the optimal actions. Additionally, this class of methods often falls victim to overestimation of the value function, which may lead to suboptimal decisions (Watkins & Dayan, 1992). Even so, techniques such as Continuous Deep Q-learning (Gu, 2016) and Double Q-learning (Van Hasselt, 2016) are effective approaches for reducing this overestimation.


Policy-Based Learning

By contrast, policy-based reinforcement learning is an approach in which the agent directly learns a policy that determines the sequence of actions to take in order to yield the highest total reward. Compared to value-based methods, which train a model of how much reward is expected from each state, these approaches focus on improving the agent's choice of action directly. This is particularly beneficial in continuous action spaces, where many value-based methods do not excel (Sutton & Barto, 2018).


Because policies are usually represented by learnable functions, typically neural networks, agents can in principle learn almost arbitrary behaviors. The policy is then adjusted gradually by an iterative process that follows the gradient of the expected return. This gradient-based approach to learning makes it possible for the model to learn in complex environments.
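The sketch below illustrates this gradient-based adjustment with a minimal REINFORCE (Monte Carlo policy gradient) loop on CartPole. It assumes PyTorch and Gymnasium are available; the network size, learning rate, and episode count are illustrative choices, not values from the works cited above.

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(300):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Compute the discounted return G_t that followed each step.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction

    # Policy gradient: raise the log-probability of actions in proportion
    # to the return that followed them.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```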


In other words, the main objective of policy-based methods is to maximize the chances of choosing an optimal behavioral pattern for an agent to follow in a given environment by directly enhancing the decision-making process.

 

The difference between continuous and discrete action spaces is crucial in reinforcement learning. Discrete action spaces involve a finite set of choices, whilst continuous action spaces allow for a continuous range of options (Lillicrap, 2016). Policy-based methods excel in continuous action spaces, to which value-based methods are not well suited. Similarly, environments can be categorized as stochastic or deterministic: in a stochastic environment there is randomness in which next state is reached, while in a deterministic environment the next state is fully determined by the current state-action pair. Knowledge of these attributes is critical when developing or selecting the right reinforcement learning scheme.
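The contrast between discrete and continuous action spaces is easy to see in standard benchmark environments. The snippet below assumes the Gymnasium library; the two named environments are common benchmarks and are not tied to the works cited above.

```python
import gymnasium as gym

discrete_env = gym.make("CartPole-v1")
print(discrete_env.action_space)      # Discrete(2): push the cart left or right

continuous_env = gym.make("Pendulum-v1")
print(continuous_env.action_space)    # Box(-2.0, 2.0, (1,)): a continuous torque value

# Sampling shows what an individual action looks like in each case.
print(discrete_env.action_space.sample())    # e.g. 0 or 1
print(continuous_env.action_space.sample())  # e.g. array([0.73], dtype=float32)
```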


Application of Reinforcement Learning

 

Atari Games

Deep Q Networks (DQN) represent a significant advancement in the application of reinforcement learning to complex environments, including Atari games (Mnih, 2015). DQNs utilize a neural network to approximate the Q-value function, which estimates the expected future rewards for each action in a given state. A key innovation in DQN is the use of experience replay, where the agent stores past experiences as tuples of state, action, reward, and next state. These experiences are then randomly sampled to train the network, breaking the temporal correlations between successive states and stabilizing the learning process.
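The replay mechanism itself is simple. Below is a minimal sketch of an experience replay buffer of the kind used in DQN-style training: transitions are stored as (state, action, reward, next state, done) tuples and sampled uniformly at random to break temporal correlations. The capacity and batch size are illustrative values.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling decorrelates successive training examples.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```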


In the application of DQN to Atari games, the state space was simplified by converting the game's color graphics into grayscale and downscaling the frames (Mnih, 2015). This preprocessing step reduced the complexity of the input data, making it more manageable for the network to process. The network architecture was designed to take the processed state as input and output Q-values for all possible actions in the game. To further streamline the learning process, a uniform reward clipping was applied, setting all positive rewards to +1, all negative rewards to -1, and leaving zero rewards unchanged. This approach, while simplifying the reward structure, prevented the agent from distinguishing between rewards of different magnitudes, potentially leading to suboptimal learning in scenarios where reward differentiation is crucial.


Additionally, frame skipping was implemented, where the agent only processed every nth frame instead of every single frame. This technique accelerated the training process by reducing the computational load and allowing the agent to focus on more meaningful state transitions. The distinct color coding of objects in Atari games facilitated this approach, as critical game elements could still be identified even with skipped frames.
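The preprocessing steps described in the previous two paragraphs can be sketched schematically as below. The sketch assumes an Atari-style environment that returns RGB frames through a Gymnasium-style step() interface; the crude 2x downscale and the skip value of 4 are illustrative stand-ins for the original pipeline, not the actual DQN implementation.

```python
import numpy as np

def preprocess_frame(rgb_frame):
    gray = rgb_frame.mean(axis=2)                 # collapse color channels to grayscale
    downscaled = gray[::2, ::2]                   # crude 2x spatial downscale
    return downscaled.astype(np.float32) / 255.0  # normalize pixel intensities

def clip_reward(reward):
    return float(np.sign(reward))                 # +1, -1, or 0 regardless of magnitude

def skipped_step(env, action, skip=4):
    """Repeat the chosen action for `skip` frames, accumulating clipped reward."""
    total_reward, done = 0.0, False
    for _ in range(skip):
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += clip_reward(reward)
        if terminated or truncated:
            done = True
            break
    return preprocess_frame(obs), total_reward, done
```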


The application of DQNs to Atari games, as demonstrated in (Mnih, 2015), yielded significant performance improvements over previous methods. The DQN algorithm was able to learn successful strategies and achieve high scores in various games such as Pong, Breakout, and Space Invaders, often surpassing human-level performance. Notably, these achievements were accomplished without incorporating prior knowledge specific to the games, underscoring the power of deep reinforcement learning methods to generalize across different tasks. This breakthrough highlighted the potential of DQNs to tackle a wide range of problems in diverse domains, marking a milestone in the field of reinforcement learning.


Complex Strategy Games

Whilst the breakthroughs in simple Atari-style video games demonstrated the ability of RL algorithms to meet or surpass human capabilities, a more demanding class of problem is the application of RL to complex strategy games. One of the first breakthroughs in this area targeted the game of backgammon.


TD-Gammon represents a pioneering application of Temporal Difference (TD) learning to backgammon, a complex board game characterized by elements of chance and strategy (Tesauro, 1995). The fundamental principle behind TD-Gammon is to use predicted future values of the game's state to guide the agent's learning process, a technique that allows for incremental updates to the estimated value of the current state based on the difference between predicted and actual future outcomes. This method helps refine the agent's strategy over time, increasing its accuracy and effectiveness in decision-making.


The neural network in TD-Gammon plays a crucial role by mapping the current game state to a value function that estimates the expected utility of that state, guiding the agent's choice of moves. This network is trained using data from self-play, where the agent plays games against itself. This self-play approach generates a rich dataset, allowing the agent to explore a wide variety of strategies and responses without external input. Through this iterative process, the neural network learns to predict the future outcomes of various moves, aligning present inputs with optimal future states.

TD-Gammon's training regimen relies heavily on the concept of bootstrapping, where the learning process is based on current estimates of future rewards, rather than waiting for the final outcome of a game. This technique accelerates the learning process, enabling the agent to make adjustments after each move rather than only at the end of each game. The agent continually refines its policy by comparing the predicted value of a state with the actual outcome, adjusting its strategy to minimize discrepancies and optimize performance.
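The bootstrapped update described above is the classic temporal-difference value update (Sutton & Barto, 2018): after each move, the value of the current state is nudged toward the reward received plus the discounted value estimate of the next state. The symbols below (step size α, discount factor γ) follow textbook convention rather than TD-Gammon's original notation.

```latex
V(s_t) \;\leftarrow\; V(s_t) \;+\; \alpha \,\bigl[\, r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \,\bigr]
```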


Notably, TD-Gammon achieved a level of play comparable to that of human experts in backgammon, a significant achievement given the game's complexity and the element of randomness introduced by dice rolls. This success demonstrated the potential of reinforcement learning algorithms, particularly those based on temporal difference learning, to excel in strategic decision-making tasks under uncertainty. TD-Gammon's development also highlighted the utility of neural networks in approximating value functions in complex environments, paving the way for further advancements in the field of game-playing AI.


The impact of TD-Gammon extends beyond backgammon, serving as a foundational case study for applying reinforcement learning in other domains. Its approach to training through self-play and its use of TD learning have influenced subsequent AI developments in various games and real-world applications, underscoring the versatility and power of these techniques in tackling complex decision-making challenges.


The methods employed in this work paved the way for AlphaGo, the first algorithmic system to beat grandmaster-level human Go players. This was achieved by a synergistic combination of deep neural networks, supervised learning, and reinforcement learning. Central to AlphaGo's success was self-play, a method that enabled the AI to learn and improve autonomously by playing countless games against itself. This iterative process generated a vast amount of data, fostering the development of sophisticated strategies (Silver, 2016). By mastering the complexities of Go, AlphaGo demonstrated the immense potential of reinforcement learning in tackling challenging problems and inspired subsequent advancements in the field.

Go, unlike chess, presents a uniquely complex challenge due to its vast search space and the subtle nature of its strategies. While chess involves a limited set of pieces with specific movement rules, Go utilizes simple black and white stones placed on a grid. This seemingly straightforward setup belies the game's immense complexity: the number of potential board configurations in Go is astronomically larger than in chess, making it exponentially more difficult for computers to evaluate and plan effectively.


A key component of the success of AlphaGo is self-play. The agent can learn and adapt more quickly by continuously competing against itself, which produces a large amount of diverse training data. This approach removes the need for external datasets, which makes it especially useful in fields where data is scarce (Silver, 2017). Self-play also encourages exploration, which helps the agent to invent new tactics and strengthens its general robustness.
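The overall structure of such training can be sketched as a simple loop in which the current agent plays a frozen copy of itself and learns from the resulting game records. The Agent and play_game stubs below are hypothetical placeholders used only to show that structure; this is not AlphaGo's actual training pipeline.

```python
import copy
import random

class Agent:
    def choose_move(self, position):
        return random.choice(position["legal_moves"])   # stand-in for a learned policy

    def update(self, trajectories):
        pass                                            # stand-in for a learning step

def play_game(player_a, player_b):
    # Stand-in for a full game simulation; returns a dummy trajectory.
    position = {"legal_moves": ["a", "b", "c"]}
    return [(position, player_a.choose_move(position)),
            (position, player_b.choose_move(position))]

def self_play_training(agent, n_iterations=10, games_per_iteration=5):
    for _ in range(n_iterations):
        opponent = copy.deepcopy(agent)     # frozen snapshot of the current policy
        trajectories = [play_game(agent, opponent) for _ in range(games_per_iteration)]
        agent.update(trajectories)          # learn from the self-play data
    return agent

self_play_training(Agent())
```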


Autonomous Vehicles

Reinforcement learning has emerged as a pivotal technique in the development of self-driving cars, revolutionizing the way autonomous vehicles (AVs) learn and make decisions in dynamic environments (Udugama, 2023). RL's unique ability to adapt to new conditions through trial and error makes it particularly suited for the complex task of autonomous driving.


Self-driving cars must navigate a myriad of scenarios, from straightforward highway driving to complex urban environments with unpredictable pedestrians and traffic. Traditional rule-based or supervised learning approaches fall short in these dynamic settings due to their inability to generalize well to new, unseen situations. RL excels in this regard, enabling AVs to learn from interactions and adapt to new conditions.


One of the most significant papers in the field was published by Shalev-Shwartz et al. (Shalev-Shwartz, 2016), in which the authors discuss strategies for enhancing autonomous driving through reinforcement learning in multi-agent settings. The study highlights the complexities of ensuring safety and managing the unpredictability of other agents on the road. The authors propose a novel approach by integrating policy gradient methods with a trajectory planner that adheres to stringent safety constraints. This combination allows the autonomous system to plan feasible and safe paths, thus avoiding potential collisions.


A significant innovation in this paper is the introduction of the "Option Graph," a hierarchical temporal abstraction framework. This method breaks down complex tasks into simpler subtasks, thereby improving the efficiency of the learning process and enhancing decision-making for long-term planning. By structuring the learning process, the Option Graph aids in reducing the sample complexity, which is a common challenge in RL applications.
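To give a flavor of hierarchical temporal abstraction, the toy skeleton below separates a high-level choice of driving "option" from a low-level controller that produces the concrete command. The option names and placeholder policies are hypothetical and purely illustrative; this is not the Option Graph implementation from Shalev-Shwartz et al.

```python
import random

OPTIONS = ["follow_lane", "change_lane_left", "change_lane_right", "yield"]

def high_level_policy(state):
    # Placeholder: a learned policy would map the traffic state to an option.
    return random.choice(OPTIONS)

def low_level_control(option, state):
    # Placeholder: each option has its own controller producing
    # steering and acceleration commands, subject to safety constraints.
    if option == "yield":
        return {"steering": 0.0, "acceleration": -1.0}
    return {"steering": 0.0, "acceleration": 0.5}

def act(state):
    option = high_level_policy(state)          # choose the subtask first
    return low_level_control(option, state)    # then the concrete action

print(act(state={"ego_speed": 20.0, "gap_ahead": 35.0}))
```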


The research also emphasizes empirical validation through simulations, demonstrating that their approach outperforms traditional methods in terms of both safety and efficiency. The proposed framework not only advances theoretical RL approaches but also aims to develop practical and reliable autonomous driving systems capable of safely interacting with other vehicles and pedestrians in diverse and complex environments.


Companies like Waymo, Tesla, and Uber have integrated RL techniques in their AV development. For example, Waymo uses a combination of imitation learning and RL to train their vehicles in a simulated environment before deploying them in the real world. This hybrid approach allows the cars to benefit from human driving data while still learning to handle novel situations autonomously (Lu, 2022).


Despite the promise, RL in self-driving cars faces several challenges. The most significant is the requirement for extensive and diverse training data to ensure robust performance. Simulators like CARLA (Dosovitskiy, 2017) are essential for generating this data, but the transfer of learning from simulation to real-world driving, known as sim-to-real transfer, remains a hurdle.


Future research in RL for self-driving cars is likely to focus on improving the efficiency and safety of learning processes. Techniques such as hierarchical RL, where complex tasks are broken down into simpler subtasks, and multi-agent RL, where multiple agents learn to interact within the same environment, hold promise for advancing the capabilities of autonomous vehicles.


Current Challenges in Reinforcement Learning

 

Exploration vs Exploitation

The exploration vs. exploitation problem in reinforcement learning is a fundamental challenge that revolves around the agent's decision-making strategy. Exploration involves trying new actions to discover more information about the environment, potentially leading to better long-term outcomes (Sutton & Barto, 2018). It helps the agent learn about the effects of actions in various states, which is crucial for developing an accurate understanding of the environment and finding the optimal policy.


On the other hand, exploitation focuses on using the agent's current knowledge to select actions that maximize immediate rewards. This approach leverages the information the agent has already gathered to achieve the best possible outcome based on its current understanding. The challenge lies in balancing these two strategies: too much exploration can be inefficient and delay achieving optimal rewards, while too much exploitation can cause the agent to miss out on potentially better strategies. Effective reinforcement learning algorithms strive to find a balance, often using techniques like epsilon-greedy policies, where the agent explores with a small probability and exploits most of the time, or using algorithms that adjust this balance dynamically based on the learning progress.
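The epsilon-greedy heuristic mentioned above is straightforward to state in code: with probability epsilon the agent explores a random action, and otherwise it exploits the action with the highest current value estimate. The decaying schedule shown is one common, illustrative choice rather than a prescription from the cited works.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: random action
    return int(np.argmax(q_values))               # exploit: best-known action

# Example: decay epsilon over training so the agent explores less as it learns.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
q_values = np.array([0.1, 0.5, 0.2])
for step in range(1000):
    action = epsilon_greedy(q_values, epsilon)
    epsilon = max(epsilon_min, epsilon * decay)
```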


Sparse and Delayed Rewards

In reinforcement learning, the challenge of delayed rewards arises when the consequences of actions are not immediately apparent. This situation complicates the agent's learning process, as it becomes difficult to discern which actions are beneficial or detrimental in achieving long-term goals. The agent must learn to attribute rewards to the appropriate actions, even when these rewards are delayed, requiring sophisticated strategies to link actions with outcomes that may occur much later.


The RUDDER algorithm addresses this challenge by redistributing the reward signal to highlight the most critical actions that influence future rewards (Arjona-Medina, 2019). RUDDER stands for "Return Decomposition for Delayed Rewards" and works by analyzing the trajectory of states and actions to identify key moments that significantly impact the final outcome. By reassigning the reward more precisely to these pivotal actions, RUDDER helps the agent learn more efficiently and effectively, reducing the complexity of the learning process. This method not only accelerates learning but also improves the stability and performance of reinforcement learning algorithms in environments where delayed rewards are prevalent.


Sample Efficiency

Sample efficiency refers to how effectively an agent learns from a limited amount of data or interactions with the environment (Mai, 2022). It is a critical aspect, especially in real-world applications where obtaining samples can be expensive, time-consuming, or even impractical. Sample-efficient algorithms are designed to maximize the learning gained from each interaction, enabling the agent to perform well with fewer experiences.


To improve sample efficiency, various techniques are employed, such as experience replay, where past interactions are stored and reused to learn more robustly, and off-policy learning, where the agent learns from actions that were not taken according to the current policy (Zhang & Sutton, 2017). Additionally, model-based reinforcement learning enhances sample efficiency by creating a model of the environment, allowing the agent to simulate interactions and plan actions without needing additional real-world data. These approaches, by better utilizing available information, help in achieving faster and more reliable learning outcomes with limited data.
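As a concrete illustration of how a learned model improves sample efficiency, the sketch below follows a Dyna-Q-style scheme (a classic model-based approach described by Sutton & Barto, 2018): each real transition both updates the value table and is stored in a simple tabular model, from which extra "imagined" updates are drawn without further environment interaction. The structure is schematic and the hyperparameters are illustrative.

```python
import random
import numpy as np

def dyna_q_update(Q, model, state, action, reward, next_state,
                  alpha=0.1, gamma=0.99, planning_steps=10):
    # (1) Direct RL update from the real experience.
    Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])

    # (2) Update the deterministic tabular model with the observed outcome.
    model[(state, action)] = (reward, next_state)

    # (3) Planning: replay simulated transitions drawn from the model,
    #     extracting extra learning from data already collected.
    for _ in range(planning_steps):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q, model
```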


Conclusion

 

Reinforcement learning (RL) has emerged as a transformative field within artificial intelligence, significantly influencing various domains through its capacity to enable agents to learn optimal behaviors from their interactions with environments. This review has explored the foundational concepts of RL, including the roles of the agent, state, actions, and rewards, as well as the critical importance of the environment in shaping learning outcomes. The discussion has highlighted how RL techniques, particularly those incorporating deep learning, have expanded the scope and applicability of RL, allowing for the successful application of these methods to complex, dynamic environments such as Atari games and strategic settings like backgammon.


The development of key algorithms, such as Deep Q Networks (DQN) (Mnih, 2015) and TD-Gammon (Tesauro, 1995), has demonstrated the potential of RL to achieve high levels of performance in decision-making tasks, often surpassing human capabilities, by leveraging approaches such as self-play (Silver, 2017). These advancements underscore the capability of RL to generalize across different tasks and environments, a feature that is crucial for its broader application in real-world scenarios.


However, it is important to acknowledge the challenges inherent in reinforcement learning, including the exploration vs. exploitation tradeoff, handling of sparse and delayed rewards, and issues related to sample efficiency. These challenges highlight the ongoing need for research and innovation to enhance the robustness, efficiency, and interpretability of RL systems. Techniques like the RUDDER algorithm and model-based learning approaches represent significant strides toward addressing these challenges, improving the practicality and effectiveness of RL in various applications (Arjona-Medina, 2019).


In summary, reinforcement learning stands as a critical component of modern AI, with applications ranging from gaming and robotics to healthcare and finance. While there are hurdles to overcome, the continuous advancements in this field promise to further expand its impact, paving the way for more sophisticated and adaptive AI systems. As research continues to evolve, RL is poised to play a central role in addressing increasingly complex and dynamic problems across diverse domains.


References

 

Arjona-Medina, J. A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., & Hochreiter, S. (2019). RUDDER: Return Decomposition for Delayed Rewards. arXiv preprint arXiv:1806.07857. Available at: https://arxiv.org/abs/1806.07857.


Berner, C., Brockman, G., Chan, B., Cheung, V., et al. (2019). Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.


Bhatt, S. (2018, March 9). Reinforcement Learning 101. Towards Data Science. https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292.


Chithra, K., Adukkathayar, A. C., Keshta, I., & Byeon, H. (2023). Reinforcement Learning Fundamentals: Learning Through Rewards and Punishments. Xoffencer publications.


Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. (2017). CARLA: An Open Urban Driving Simulator. Conference on Robot Learning (CoRL). Available at: https://arxiv.org/abs/1711.03938.


Gu, S., Lillicrap, T., Turner, R. E., Ghahramani, Z., Schoelkopf, B., & Levine, S. (2016). Continuous Deep Q-Learning with Model-based Acceleration. arXiv preprint arXiv:1603.00748. Available at: https://arxiv.org/abs/1603.00748.


Ibarz, J., Tan, J., Finn, C., Kalakrishnan, M., Pastor, P., & Levine, S. (2021). How to Train Your Robot with Deep Reinforcement Learning – Lessons We’ve Learned. arXiv preprint arXiv:2102.02915.


Jaffry, S. (2020, August 8). A simple reinforcement learning environment from scratch. Retrieved July 14th 2024 https://medium.com/analytics-vidhya/a-simple-reinforcement-learning-environment-from-scratch-72c37bb44843.


Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. arXiv preprint arXiv:1509.02971.


Lu, Y., Fu, J., Tucker, G., Pan, X., Bronstein, E., Roelofs, R., Sapp, B., White, B., Faust, A., Whiteson, S., Anguelov, D., & Levine, S. (2022). Imitation Is Not Enough: Robustifying Imitation with Reinforcement Learning for Challenging Driving Scenarios. arXiv preprint arXiv:2212.11419.


Mai, V., Mani, K., & Paull, L. (2022). Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation. arXiv preprint arXiv:2201.01666.


Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236.


Rao, A., & Jelvis, T. (2022). Foundations of Reinforcement Learning with Applications in Finance, Taylor & Francis.


Shalev-Shwartz, S., Shammah, S., & Shashua, A. (2016). Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving. arXiv preprint arXiv:1610.03295. Available at: https://arxiv.org/pdf/1610.03295.


Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961.


Silver, D., Schrittwieser, J., Simonyan, K., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359. https://doi.org/10.1038/nature24270.


Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Available at: https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf.


Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan and Claypool Publishers.


Tesauro, G. (1995). Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3), 58-68. https://doi.org/10.1145/203330.203343.


Udugama, B. (2023). Review of Deep Reinforcement Learning for Autonomous Driving. arXiv preprint arXiv:2302.06370.


Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. arXiv preprint arXiv:1509.06461. Available at: https://arxiv.org/abs/1509.06461.


Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., & Silver, D. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350-354.


Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292. https://doi.org/10.1007/BF00992698.


Yu, C., Liu, J., & Nemati, S. (2019). Reinforcement Learning in Healthcare: A Survey. arXiv preprint arXiv:1908.08796.


Zhang, S., & Sutton, R. S. (2017). A Deeper Look at Experience Replay. arXiv preprint arXiv:1712.01275.




