diff History for Neural Language Agents

New York University

Tune your LLM agent on text deltas. We show that even tiny LMs (~120M params) can be efficiently tuned into highly competitive and robust agents for hard decision-making settings like the video game NetHack, where they match SOTA despite being tuned on 1800x less data than prior work... just extend your model's context length and process text observations with the Unix diff command.

Abstract

Neural Language Models (LMs) offer an exciting solution for general-purpose embodied control. However, a key technical issue arises when using an LM-based controller: environment observations must be converted to text, which, coupled with history, results in long and verbose textual prompts. As a result, prior work in LM agents is limited to restricted domains with small observation sizes and minimal needs for interaction history or domain-specific instruction tuning.

In this paper, we introduce diff history, a simple and highly effective solution to these issues. By applying the Unix diff command on consecutive text observations in the interaction histories used to prompt LM policies, we can both abstract away redundant information and focus the content of textual input on the salient changes in the environment.

On NetHack, an unsolved video game that requires long-horizon reasoning for decision-making, LMs tuned with diff history match state-of-the-art performance for neural agents while needing 1800x less training data than prior work. Even on the simpler BabyAI-Text environment with concise text observations, we find that although diff history increases the length of prompts, the representation it provides yields a 25% improvement in the efficiency of instruction tuning. Further, we show that diff history scales favorably across tuning dataset sizes.

Method

A diff history is a text sequence summarizing a neural language agent's recent interactions in a decision-making setting, ordered from left (earliest) to right (most recent). A visualization is provided below.
Visualizing diff history.

As shown above, diff histories consist of three components: (1) an instruction describing the task; (2) an "anchor" full-text observation; and (3) the subsequent actions taken by the agent, each followed by the resulting text delta observed in the world state. At inference time, the history horizon provided to LM agents can be flexibly resized, up to the model's context length.
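
To make this construction concrete, here is a minimal sketch in Python. It uses difflib.unified_diff as a stand-in for the Unix diff command; the helper names, the "<action>" separator, and the zero-context-line diff setting are illustrative assumptions rather than the paper's exact prompt format.

import difflib
from itertools import islice

def text_diff(prev_obs: str, curr_obs: str) -> str:
    """Text delta between consecutive observations. difflib.unified_diff
    stands in for Unix diff here; n=0 keeps changed lines only."""
    delta = difflib.unified_diff(
        prev_obs.splitlines(), curr_obs.splitlines(), lineterm="", n=0
    )
    # The first two yielded lines are the ---/+++ file headers; drop them.
    return "\n".join(islice(delta, 2, None))

def build_diff_history(instruction: str, observations: list[str], actions: list[str]) -> str:
    """Assemble a prompt from the three components above: the task
    instruction, a full-text "anchor" observation, and alternating
    actions and observation deltas."""
    parts = [instruction, observations[0]]  # anchor: kept as full text
    for prev_obs, action, curr_obs in zip(observations, actions, observations[1:]):
        parts.append("<action> " + action)  # separator token is illustrative
        parts.append(text_diff(prev_obs, curr_obs))
    return "\n".join(parts)

Under this scheme, resizing the history horizon at inference time amounts to choosing a more recent anchor observation and keeping only the action/delta pairs that follow it.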

Examples of diff vs full-text observations in BabyAI-Text and NetHack.

Why diff history? Text deltas returned by diff have a rich but simple algebraic structure (see the examples above). In text-based decision-making settings with natural language observations, the matching algorithm underlying the diff operator can also localize the high-level changes in environment properties, attributes, and object states that occur between consecutive timesteps of interaction. Thus, diff history provides a dense learning signal for instruction tuning LMs on action prediction. In high-dimensional environments with complex and verbose per-timestep observations, diff history can also act as a soft-compression mechanism, preserving information while reducing token counts to yield agents with longer memory horizons at a fixed context length.
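
As a toy illustration of both properties, the snippet below diffs two consecutive observations written loosely in the style of NetHack's message and status lines (invented here for illustration, not taken from the paper's prompts). The unchanged status line drops out of the delta, and only the salient change survives.

import difflib

prev_obs = [
    "You see here a runed dagger.",
    "Dlvl:1  $:0  HP:12(12)  Pw:2(2)  AC:7",
]
curr_obs = [
    "f - a runed dagger.",  # the agent picked up the dagger
    "Dlvl:1  $:0  HP:12(12)  Pw:2(2)  AC:7",
]

delta = list(difflib.unified_diff(prev_obs, curr_obs, lineterm="", n=0))
print("\n".join(delta[2:]))  # skip the ---/+++ file headers
# Output:
# @@ -1 +1 @@
# -You see here a runed dagger.
# +f - a runed dagger.

On complex screens like NetHack's, most of the observation is unchanged from one timestep to the next, so the delta is typically far shorter than the full text.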

Results

In NetHack, LMs with diff history match prior state-of-the-art performance for data-driven agents despite being tuned on 1800x less labeled demonstration data.

Ablating diff observations from interaction histories in favor of full-text observations results in a 98% decline in mean LM agent score on withheld seeds of the game, suggesting that diff is responsible for the performance gains that we observe.

Introducing an auxiliary "world model" prediction objective somewhat reduces the gap between diff and full-text interaction history agents. LMs with diff history also outperform vision-language baselines trained on the same demonstration data by 780% in mean test-time score.


In the multi-task BabyAI-Text environment, introducing diff history improves the quality of inference-time LM generations resulting from low-resource instruction tuning, reducing FLOPs-to-convergence by 25% in both low-data and ultra-low-data tuning settings.

BibTeX

@misc{piterbarg2023diff,
      title={diff History for Neural Language Agents}, 
      author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
      year={2023},
      eprint={2312.07540},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}