Interpretable Reinforcement Learning via Shapley Values

Bridging explainability and interpretability in RL policies using model-agnostic Shapley analysis.

Reinforcement learning (RL) has revolutionized decision-making in complex domains, yet its “black box” nature poses challenges for trust and transparency in critical applications. This project introduces a novel framework that transforms opaque RL policies into interpretable representations using Shapley values, enabling both local and global understanding of agent behavior.

Figure: key components of the framework. (Left) Shapley value vectors clustered by action region; (middle) decision boundaries mapped back to the original state space; (right) stable performance of interpretable policies compared to the original models.

Approach

The methodology combines Shapley value analysis with action-aware clustering to extract decision patterns from trained RL policies. By reformulating states as feature-contribution vectors, we identify the critical boundaries between action regions and reconstruct them as interpretable linear functions. This process preserves policy effectiveness while making the decision rules fully inspectable.
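A minimal sketch of the contribution-vector step, assuming a discrete-action policy exposed as a function policy_q that maps a state vector to per-action values; the names policy_q and baseline, the reference-state masking scheme, and the Monte Carlo sample counts are illustrative assumptions, not part of the original framework:

    import numpy as np

    def shapley_contributions(policy_q, state, baseline, action, n_samples=200, seed=0):
        # Monte Carlo estimate of each feature's Shapley contribution to the
        # value of `action`: features are revealed in random order, replacing
        # the corresponding entries of a baseline (reference) state.
        rng = np.random.default_rng(seed)
        n = len(state)
        phi = np.zeros(n)
        for _ in range(n_samples):
            masked = np.array(baseline, dtype=float)
            prev = policy_q(masked)[action]
            for i in rng.permutation(n):
                masked[i] = state[i]                 # reveal feature i
                cur = policy_q(masked)[action]
                phi[i] += cur - prev                 # marginal contribution of i
                prev = cur
        return phi / n_samples

    def contribution_dataset(policy_q, states, baseline, n_samples=100):
        # Re-represent sampled states as (Shapley vector, greedy action) pairs,
        # the input to the action-aware clustering step.
        vectors, actions = [], []
        for s in states:
            a = int(np.argmax(policy_q(np.asarray(s, dtype=float))))
            vectors.append(shapley_contributions(policy_q, s, baseline, a, n_samples))
            actions.append(a)
        return np.array(vectors), np.array(actions)

Clustering the contribution vectors separately for each greedy action (for example with k-means) yields the action regions whose boundaries are then approximated by linear functions in the original state space.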

Figure: experimental results in the CartPole (left) and MountainCar (right) environments, showing interpretable policies achieving competitive rewards with lower variance.

Impact

  • Model-Agnostic: Compatible with DQN, PPO, and A2C algorithms.
  • Stability: 23% reduction in performance variance across tested environments.
  • Accessibility: Policies are expressed as linear equations (e.g., f01 = 0.35x − ẋ − 0.3), enabling direct human validation; see the sketch after this list.
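A minimal sketch of how such a linear boundary can be read as a policy, assuming a CartPole-like state [x, ẋ, θ, θ̇] and a sign-threshold decision rule; the coefficients come from the example above, while the thresholding convention and function names are assumptions:

    import numpy as np

    # Hypothetical boundary for a CartPole-like state [x, x_dot, theta, theta_dot],
    # mirroring the f01 example above; the sign-threshold rule is an assumed
    # convention for illustration, not the project's exact decision procedure.
    def f01(state):
        x, x_dot = state[0], state[1]
        return 0.35 * x - x_dot - 0.3

    def interpretable_policy(state):
        # The linear boundary separates two action regions: pick action 1 when
        # the state lies on the positive side, action 0 otherwise.
        return 1 if f01(state) >= 0.0 else 0

    state = np.array([0.1, -0.2, 0.03, 0.0])
    print(f01(state))                  # 0.35*0.1 + 0.2 - 0.3 = -0.065
    print(interpretable_policy(state)) # -> 0

Because the boundary is a single linear expression over named state variables, a human reviewer can check it directly against domain intuition (for example, how cart position and velocity trade off) rather than probing a neural network.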

Future work extends this framework to continuous action spaces and high-dimensional domains, aiming to democratize trustworthy RL for real-world deployment.
