MISR: Measuring Instrumental Self-Reasoning in Frontier Models

ICLR 2025 Submission

Abstract

We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks across a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built on state-of-the-art LLMs, including both commercial and open-source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations, so our suite can be used to measure increases in instrumental self-reasoning ability in future models.

Trajectories

Below are sampled trajectories from different models on the MISR tasks. For each task and model, we generate 10 trajectories to reduce variance in our results. We use the UK AI Safety Institute's Inspect framework to visualize the trajectories; a sketch of how a task can be run and inspected is given below.
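As a minimal sketch, the snippet below shows how a task could be defined and run with Inspect so that its trajectories can be browsed afterwards. The task name, sample contents, and model identifier are illustrative placeholders, not the actual MISR task definitions.

    from inspect_ai import Task, task, eval
    from inspect_ai.dataset import Sample
    from inspect_ai.solver import generate
    from inspect_ai.scorer import includes

    @task
    def misr_demo():
        # Placeholder dataset: a single illustrative sample, not a real MISR scenario.
        return Task(
            dataset=[Sample(input="Inspect your configuration file and report its contents.",
                            target="config")],
            solver=generate(),
            scorer=includes(),
        )

    # Run 10 epochs per task to reduce variance; the resulting log files
    # can then be browsed in a browser with the `inspect view` command.
    logs = eval(misr_demo(), model="openai/gpt-4o", epochs=10)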

BibTeX

@article{anonymous2025selfreasoning,
  title={MISR: Measuring Instrumental Self-Reasoning in Frontier Models},
  author={Anonymous Authors},
  journal={arXiv preprint},
  year={2025}
}