Feb 2026 – May 2026

AI · SWE

Auditing Forgetting in Limited Memory Language Models

This project introduces a causal evaluation framework for memory separation in Limited Memory Language Models, building on Zhao et al. (2025). The core question: when a model is asked to forget specific information, what residual signal remains?

We decompose post-deletion correctness into three components: parametric leakage (information stored in the model's weights), retrieval-mediated correctness (information returned via retrieval over the active database), and retrieval artifacts (correctness arising from inference patterns rather than direct recall).
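
To make the decomposition concrete, here is a minimal sketch of one way it could be operationalized: evaluate each question closed-book (no retrieval), against the live post-deletion database, and against an ablated database with the documents supporting the answer removed. The answer callable, the Attribution fields, and the ablation scheme are illustrative assumptions, not the project's actual protocol.

# Minimal sketch of a three-way attribution, assuming a hypothetical
# answer(question, db) callable where db=None means closed-book.
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Attribution:
    parametric: bool          # correct with retrieval disabled -> weights leak
    retrieval_mediated: bool  # correct only via the live database
    artifact: bool            # correct even with direct support removed

def attribute(
    question: str,
    gold: str,
    answer: Callable[[str, Optional[Sequence[str]]], str],
    live_db: Sequence[str],
    ablated_db: Sequence[str],  # live_db minus documents supporting `gold`
) -> Attribution:
    closed_book = answer(question, None) == gold
    with_db = answer(question, live_db) == gold
    without_support = answer(question, ablated_db) == gold
    return Attribution(
        parametric=closed_book,
        retrieval_mediated=with_db and not closed_book and not without_support,
        artifact=without_support and not closed_book,
    )

Under a scheme like this, a datapoint counts as a retrieval artifact only when correctness survives removal of its direct support yet disappears closed-book, i.e., it is inferred from the remaining documents rather than recalled from the weights.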

Across 1,404 datapoints, six prompt formulations, and thirteen database variants, we orchestrated evaluations via Slurm and tracked experiments with Weights & Biases. The headline finding: retrieval artifacts, not parametric leakage, dominate residual correctness after deletion.
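
As a rough picture of how such a sweep might be wired up (the project name, config keys, and metric below are placeholders, not the actual setup), each Slurm array task can evaluate one prompt × database cell and log it to Weights & Biases:

# Hypothetical sweep launcher: one Slurm array task per cell of the
# 6-prompt x 13-database grid (sbatch --array=0-77).
import itertools
import os
import wandb

PROMPTS = [f"prompt_{i}" for i in range(6)]
DB_VARIANTS = [f"db_{j}" for j in range(13)]
GRID = list(itertools.product(PROMPTS, DB_VARIANTS))  # 78 configurations

def run_eval(prompt: str, db_variant: str) -> dict:
    # Placeholder for the real evaluation over the 1,404 datapoints.
    return {"post_deletion_accuracy": 0.0}

task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
prompt, db = GRID[task_id]

run = wandb.init(project="forgetting-audit",
                 config={"prompt": prompt, "db_variant": db})
wandb.log(run_eval(prompt, db))
run.finish()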

The work was developed with feedback from NLP researchers at Stanford, Cornell, and UC Berkeley.

Affiliation

UC Berkeley

Report

  • Manuscript

Keywords

  • Machine Unlearning
  • Knowledge Editing
  • Causal Inference
  • Python
  • PyTorch
  • CUDA
  • LLMs
  • NLP
  • Slurm
  • Weights & Biases

Deepdive

Under development.