Reinforcement Learning (RL) is the science of decision making. It is about learning the optimal behavior in an environment to obtain maximum reward. This optimal behavior is learned through interactions with the environment and observations of how it responds, similar to children exploring the world around them and learning the actions that help them achieve a goal.
In the absence of a supervisor, the learner must independently discover the sequence of actions that maximizes the reward. This discovery process is akin to a trial-and-error search. The quality of actions is measured not just by the immediate reward they return, but also by the delayed rewards they might fetch. Because it can learn the actions that lead to eventual success in an unseen environment without the help of a supervisor, reinforcement learning is an exceptionally powerful paradigm.
The Reinforcement Learning problem involves an agent exploring an unknown environment to achieve a goal. RL is based on the hypothesis that all goals can be described by the maximization of expected cumulative reward. The agent must learn to sense and perturb the state of the environment using its actions to derive maximal reward. The formal framework for RL borrows from the problem of optimal control of Markov Decision Processes (MDP).
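Stated formally (a standard formulation rather than wording from this article), an MDP consists of states, actions, transition probabilities, rewards, and a discount factor $\gamma$, and the agent seeks a policy $\pi$ that maximizes the expected discounted cumulative reward:

$$
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_{t}\right]
$$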
The main elements of an RL system are the agent (the learner or decision maker), the environment it interacts with, the policy the agent follows when choosing actions, and the reward signal the environment returns after each action.
A useful abstraction of the reward signal is the value function, which faithfully captures the ‘goodness’ of a state. While the reward signal represents the immediate benefit of being in a certain state, the value function captures the cumulative reward that is expected to be collected from that state on, going into the future. The objective of an RL algorithm is to discover the action policy that maximizes the average value that it can extract from every state of the system.
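To make the notion of cumulative reward concrete, the short sketch below (an illustrative example, not from the original text; the discount factor and reward values are arbitrary) computes the discounted return along one sampled trajectory. The value of a state is the expectation of this quantity over many trajectories that start in that state.

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward collected from a state onward.

    The value function is the expectation of this quantity over many
    sampled trajectories that begin in the same state.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards observed along one trajectory: nothing for two steps, then a payoff of 1
print(discounted_return([0.0, 0.0, 1.0]))  # 0.9801 = 1.0 discounted over two steps
```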
RL algorithms can be broadly categorized as model-free and model-based. Model-free algorithms do not build an explicit model of the environment, or more rigorously, of the underlying MDP. They are closer to trial-and-error algorithms that run experiments on the environment using actions and derive the optimal policy from them directly. Model-free algorithms are either value-based or policy-based.

Value-based algorithms treat the optimal policy as a direct result of estimating the value function of every state accurately. Using the recursive relation described by the Bellman equation, the agent interacts with the environment to sample trajectories of states and rewards. Given enough trajectories, the value function of the MDP can be estimated. Once the value function is known, discovering the optimal policy is simply a matter of acting greedily with respect to the value function at every state of the process. Popular value-based algorithms include SARSA and Q-learning.

Policy-based algorithms, on the other hand, directly estimate the optimal policy without modeling the value function. By parametrizing the policy directly with learnable weights, they turn the learning problem into an explicit optimization problem. As in value-based algorithms, the agent samples trajectories of states and rewards, but this information is used to explicitly improve the policy by maximizing the average value function across all states. Popular policy-based RL algorithms include Monte Carlo policy gradient (REINFORCE) and deterministic policy gradient (DPG). Policy-based approaches suffer from high variance, which manifests as instability during training. Value-based approaches, though more stable, are not well suited to modeling continuous action spaces.

One of the most powerful RL algorithms, the actor-critic algorithm, combines the value-based and policy-based approaches. In it, both the policy (actor) and the value function (critic) are parametrized, enabling effective use of training data together with stable convergence.
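As a concrete illustration of the value-based family, here is a minimal tabular Q-learning sketch. The environment interface (a classic Gym-style env whose reset() returns a state and whose step() returns state, reward, done, info) and all hyperparameters are assumptions made for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: estimate action values from sampled transitions."""
    q = defaultdict(float)                     # (state, action) -> estimated value
    actions = list(range(env.action_space.n))  # assumes a discrete action space

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy behavior: mostly exploit current estimates, sometimes explore
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])

            next_state, reward, done, _ = env.step(action)

            # Move the estimate toward the immediate reward plus the discounted
            # value of acting greedily in the next state (the Bellman target)
            best_next = max(q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state

    return q  # acting greedily with respect to q yields the learned policy
```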
Model-based RL algorithms build a model of the environment by sampling states, taking actions, and observing the rewards. For every state and possible action, the model predicts the expected reward and the expected next state. The former is a regression problem, while the latter is a density estimation problem. Given a model of the environment, the RL agent can plan its actions without directly interacting with the environment, much like a thought experiment a human might run when trying to solve a problem. When the process of planning is interwoven with the process of policy estimation, the RL agent's ability to learn improves significantly.
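As a sketch of the planning step, suppose the agent has already fit a reward model, a transition model, and a value estimate (all hypothetical callables here, not part of the original text); one-step lookahead then chooses an action purely from the model's predictions, without touching the real environment.

```python
def plan_action(state, actions, reward_model, transition_model, value, gamma=0.99):
    """One-step lookahead planning with a learned model of the environment."""
    def lookahead(action):
        predicted_reward = reward_model(state, action)           # regression problem
        predicted_next_state = transition_model(state, action)   # density estimation problem
        return predicted_reward + gamma * value(predicted_next_state)

    # Pick the action whose predicted outcome looks best according to the model
    return max(actions, key=lookahead)
```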
Any real-world problem where an agent must interact with an uncertain environment to meet a specific goal is a potential application of RL. Here are a few RL success stories:
Reinforcement learning is applicable to a wide range of complex problems that cannot be tackled with other machine learning algorithms. RL is closer to artificial general intelligence (AGI) than those approaches, as it possesses the ability to seek a long-term goal while exploring various possibilities autonomously. Some of the benefits of RL include:
While RL algorithms have been successful in solving complex problems in diverse simulated environments, their adoption in the real world has been slow. Here are some of the challenges that have made their uptake difficult:
Supervised learning is a paradigm of machine learning that requires a knowledgeable supervisor to curate a labelled dataset and feed it to the learning algorithm. The supervisor is responsible for collecting this training data – a set of examples such as images, text snippets, or audio clips, each with a label that assigns the example to a specific class. In the RL setting, this training dataset would look like a set of situations and actions, each with a ‘goodness’ label attached to it. The core function of a supervised learning algorithm is to extrapolate and generalize, making predictions for examples that are not included in the training dataset.
RL is a separate paradigm of machine learning. RL does not require a supervisor or a pre-labelled dataset; instead, it acquires training data in the form of experience by interacting with the environment and observing its response. This crucial difference makes RL feasible in complex environments where it is impractical to separately curate labelled training data that is representative of all the situations that the agent would encounter. The only approach likely to work in these situations is one where the generation of training data is autonomous and integrated into the learning algorithm itself, which is exactly what RL does.
Although RL does not require a supervisor, it is important to point out that RL is not the same as unsupervised learning, yet another paradigm of machine learning. In unsupervised learning, the training data is not labelled, and the objective is to uncover the hidden structure in the data. Knowledge of this hidden structure lets the model group similar examples or estimate the distribution function that generated the examples. Uncovering this hidden structure does not solve the RL problem, which is to maximize the cumulative reward collected along a trajectory. However, knowledge of hidden structure in the agent’s experience can help speed up the learning process.
A challenge that is unique to RL algorithms is the trade-off between exploration and exploitation. This trade-off does not arise in either supervised or unsupervised machine learning. An RL agent must strike a careful balance between exploiting its past experience and exploring the unknown states of the environment. The right balance leads the agent to discover the optimal policy that yields maximal reward. If the agent only exploits its past experience, it is likely to get stuck in a local optimum and settle for a sub-optimal policy. On the other hand, if the agent keeps exploring without exploiting, it might never find a good policy.
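A small experiment illustrates the trade-off. The sketch below (purely illustrative; the arm rewards and parameters are made up) compares a greedy agent with an epsilon-greedy agent on a simple multi-armed bandit:

```python
import random

def run_bandit(epsilon, true_means, steps=10000):
    """Multi-armed bandit: balance exploring arms against exploiting the best estimate."""
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(len(true_means))                        # explore
        else:
            arm = max(range(len(true_means)), key=lambda a: estimates[a])  # exploit
        reward = random.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running average
        total += reward
    return total / steps

means = [0.2, 0.5, 0.8]
print("greedy only :", run_bandit(0.0, means))   # often locks onto a sub-optimal arm
print("epsilon 0.1 :", run_bandit(0.1, means))   # explores enough to find the best arm
```

The purely greedy agent tends to commit to whichever arm looked good early on, while a small amount of exploration is usually enough to identify the best arm and earn a higher average reward.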
In recent years, significant progress has been made in the area of deep reinforcement learning. Deep reinforcement learning uses deep neural networks to model the value function (value-based) or the agent’s policy (policy-based) or both (actor-critic). Prior to the widespread success of deep neural networks, complex features had to be engineered to train an RL algorithm. This meant reduced learning capacity, limiting the scope of RL to simple environments. With deep learning, models can be built using millions of trainable weights, freeing the user from tedious feature engineering. Relevant features are generated automatically during the training process, allowing the agent to learn optimal policies in complex environments.
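For example, a small neural network can replace the table of action values used in classical Q-learning. The sketch below uses PyTorch; the observation size, layer widths, and number of actions are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Deep value function: maps a raw observation to one value per action.

    The hidden layers learn relevant features automatically, replacing the
    hand-engineered features that earlier RL systems relied on.
    """
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Greedy action for a single (assumed 8-dimensional) observation
q_net = QNetwork(obs_dim=8, n_actions=4)
action = q_net(torch.randn(1, 8)).argmax(dim=1).item()
```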
Traditionally, RL is applied to one task at a time. Each task is learned by a separate RL agent, and these agents do not share knowledge. This makes learning complex behaviors, such as driving a car, inefficient and slow. Problems that share a common information source, have related underlying structure, and are interdependent can get a huge performance boost by allowing multiple agents to work together. Multiple agents can share the same representation of the system by training them simultaneously, allowing improvements in the performance of one agent to be leveraged by another. A3C (Asynchronous Advantage Actor-Critic) is an exciting development in this area, where related tasks are learned concurrently by multiple agents. This multi-task learning scenario is driving RL closer to AGI, where a meta-agent learns how to learn, making problem-solving more autonomous than ever before.
Synopsys taps into reinforcement learning for its DSO.ai™ (Design Space Optimization AI) solution, which is the semiconductor industry's first autonomous artificial intelligence application for chip design. Inspired by DeepMind's AlphaZero that mastered complex games like chess or Go, DSO.ai uses RL technology to search for optimization targets in very large solution spaces of chip design. DSO.ai revolutionizes chip design by massively scaling exploration of options in design workflows while automating less consequential decisions, allowing SoC teams to operate at expert levels and significantly amplifying overall throughput.