Reinforcement learning has achieved remarkable performance in complex domains, but the inherent black-box nature of deep neural network policies makes them difficult to interpret and trust, especially in critical applications. This project introduces a novel model-agnostic approach that leverages Shapley values to transform deep reinforcement learning policies into interpretable representations.
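As a rough sketch of the model-agnostic idea, the snippet below applies KernelSHAP to the Q-values of a DQN policy on CartPole, treating the policy as a black-box function of the state features. The use of `stable-baselines3`, `gymnasium`, and the `shap` package here is an assumption for illustration, not necessarily this project's actual implementation.

```python
import gymnasium as gym
import numpy as np
import shap
import torch
from stable_baselines3 import DQN

env = gym.make("CartPole-v1")
model = DQN("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)

# Gather states visited by the trained policy as a background dataset.
states = []
obs, _ = env.reset()
for _ in range(500):
    action, _ = model.predict(obs, deterministic=True)
    states.append(obs.copy())
    obs, _, terminated, truncated, _ = env.step(int(action))
    if terminated or truncated:
        obs, _ = env.reset()
states = np.array(states)

# Model-agnostic attribution: expose the Q-values as a plain function of
# the state features and explain them with KernelSHAP.
def q_values(x):
    with torch.no_grad():
        t = torch.as_tensor(x, dtype=torch.float32, device=model.device)
        return model.q_net(t).cpu().numpy()

explainer = shap.KernelExplainer(q_values, shap.sample(states, 50))
# Per-state, per-action feature attributions (nsamples kept small for speed).
shap_values = explainer.shap_values(states[:100], nsamples=100)
```

Aggregating `shap_values` across many visited states is what turns these per-decision attributions into the kind of global picture of the policy described above.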

The framework provides global insight into policy behavior, going beyond traditional local, per-decision explanations. It supports both off-policy and on-policy algorithms, including Deep Q-Network (DQN), Proximal Policy Optimization (PPO), and Advantage Actor-Critic (A2C). The resulting interpretable policies retain the performance of the original deep policies and exhibit more stable behavior in environments such as CartPole and MountainCar.
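One common way to obtain such a standalone interpretable policy is to distil the deep policy into a shallow surrogate whose structure is directly readable, for example a small decision tree. The sketch below shows that idea for a PPO policy on CartPole; it is an illustrative assumption, not necessarily this project's exact procedure, and the feature names are CartPole's four state variables.

```python
import gymnasium as gym
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=20_000)

# Roll out the deep policy to gather (state, action) pairs.
X, y = [], []
obs, _ = env.reset()
for _ in range(2_000):
    action, _ = model.predict(obs, deterministic=True)
    X.append(obs.copy())
    y.append(int(action))
    obs, _, terminated, truncated, _ = env.step(int(action))
    if terminated or truncated:
        obs, _ = env.reset()

# Fit a small surrogate tree: each split is a human-readable rule.
tree = DecisionTreeClassifier(max_depth=3).fit(np.array(X), np.array(y))
print(export_text(tree, feature_names=[
    "cart_pos", "cart_vel", "pole_angle", "pole_ang_vel"]))
```

The printed tree is itself a complete policy; comparing its episode returns against the original network is one way to check that effectiveness is maintained after the transformation.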

Figure: Shapley value analysis (left), action clustering (middle), and decision boundaries of the interpretable policies (right).
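For the action-clustering panel, one plausible recipe is to cluster states by their per-state Shapley attributions, so that states whose decisions are driven by similar features group together. In this hedged sketch, `shap_matrix` is a random stand-in for attributions computed as in the first example, and the choice of k-means with three clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder stand-in for per-state Shapley attributions of the chosen
# action (shape: n_states x n_features), e.g. computed as in the first sketch.
rng = np.random.default_rng(0)
shap_matrix = rng.normal(size=(100, 4))

# States with similar attribution patterns land in the same cluster,
# giving a coarse map of the policy's behavioral modes.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(shap_matrix)
print(np.bincount(labels))  # cluster sizes
```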

This work represents a step toward developing transparent, trustworthy AI solutions for high-stakes real-world applications.