LLMs for Policy Interpretation

LLMs may be the key to making reinforcement learning policies more understandable.

Deep reinforcement learning has proven to be a highly successful method for creating intelligent systems in recent years. However, the policies it learns, which determine an agent's actions based on its environment, are often complex, opaque, and difficult to understand. This barrier prevents deep interrogation and understanding of the decision-making process utilised by the RL model.

On the other hand, natural language models have made significant progress in the past few years, providing a basis for human-AI interaction. These models may be the key to bridging the gap between reinforcement learning policies and human understanding. By providing natural language explanations for reinforcement learning policies, it becomes easier to understand and interpret them, leading to more effective AI systems that are better aligned with human objectives and preferences.

The main challenge of interpreting deep reinforcement learning policies is their complexity. Deep neural networks are inherently difficult to reason about, and because learned policies take into account numerous factors, such as the current state of the environment, the reward signal, and the agent's past experiences, the resulting decision-making process is similarly abstract. Existing methods for explaining reinforcement learning policies include visualising the learned policies as graphs or heat maps, or building saliency maps that highlight the factors which matter most in the decision-making process. While these methods are useful, they are limited in their ability to provide a natural, nuanced explanation of the learned policies.
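
To make the limitation concrete, here is a minimal sketch of gradient-based saliency for a policy network; the toy policy and random observation are stand-ins for a real trained agent rather than any particular method from the literature:

```python
import torch
import torch.nn as nn

# Toy policy network standing in for a trained RL policy (illustrative placeholder).
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))

# A single observation; requires_grad lets us trace the decision back to the inputs.
obs = torch.randn(1, 8, requires_grad=True)
logits = policy(obs)
action = logits.argmax(dim=-1).item()  # the action the policy prefers

# Gradient of the chosen action's logit with respect to the observation:
# large magnitudes mark the input features the decision is most sensitive to.
logits[0, action].backward()
saliency = obs.grad.abs().squeeze()

print(f"chosen action: {action}, per-feature saliency: {saliency.tolist()}")
```

The output is a per-feature sensitivity score: it shows which inputs the decision leans on, but says nothing about why that trade-off was made, which is exactly the gap a language-based explanation could fill.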

This is where language models could in the future play a crucial role. Consider an AI system that uses reinforcement learning to control a robot in a warehouse. The learned policy determines the actions the robot should take based on its environment and past experiences, optimising for efficiency and safety. A language model could provide a natural language explanation for this policy, making it easier to understand and interpret. For instance, the robot may pick up a package out of the expected order.
Traditionally, this would be difficult to explain, with a visualisation simply showing that the robot preferred this action over others. However, if an LLM were able to interrogate the robot's actions, it could give an explanation like:

“The robot chose to pick up a larger package out of order because this action will clear space and allow it to progress through the remaining packages faster.”

This type of explanation increases the transparency and accountability of the AI system, helping to ensure that it is aligned with human objectives and preferences.
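
A minimal sketch of how such an explanation might be produced, assuming a hypothetical query_llm helper in place of any particular LLM API (the function names and prompt wording here are illustrative, not an existing interface):

```python
def explain_action(state_summary: str, chosen_action: str, alternatives: list[str]) -> str:
    """Ask an LLM to explain, in plain language, why a policy chose one action over others."""
    prompt = (
        "A warehouse robot is controlled by a reinforcement learning policy.\n"
        f"Current state: {state_summary}\n"
        f"Action the policy chose: {chosen_action}\n"
        f"Actions it passed over: {', '.join(alternatives)}\n"
        "In one or two sentences, explain why the chosen action might be preferable "
        "for overall efficiency and safety."
    )
    return query_llm(prompt)


def query_llm(prompt: str) -> str:
    # Placeholder: substitute a call to whichever LLM client is actually available.
    raise NotImplementedError


# Example usage with the warehouse scenario above:
# explain_action(
#     state_summary="aisle 4 is congested; a large package blocks two smaller ones",
#     chosen_action="pick up the large package first",
#     alternatives=["pick up the nearest small package", "wait for the aisle to clear"],
# )
```

The explanation is only as grounded as the state summary passed in; the language model has no direct access to the policy's internals, which is part of the disconnect discussed below.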

While there has been some successful work in this realm, there is currently a disconnect and a lack of interface between the two fields. Reinforcement learning algorithms are typically designed to optimise reward signals and have no built-in mechanism for generating human-readable explanations, while language models cannot generate novel strategies or make decisions based on an explicit environment and reward signal.

The integration of deep reinforcement learning and large language models holds great potential for the future. However, until the two are unified in a generalised way, the current approaches to policy interpretation will have to do.

Tags: reinforcement learning, language models, deep learning