Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

Authors: Jingwen Chen, Wenkai Yang, Shengda Fan, Wenbo Nie, Chenxing Sun, Shaodong Zheng, Yangen Hu, Lu Pan, Ke Zeng, Yankai Lin

Affiliations: Gaoling School of Artificial Intelligence (Renmin University of China), School of Software (Beihang University), Meituan

Code & Data: https://github.com/RUCBM/ExpInternalization

Abstract

Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predominantly focused on single-iteration transfer, we discover that under multi-iteration experience learning, existing methods suffer from a progressive capability collapse rather than compounding improvement. We systematically examine this failure through three vital dimensions of experience internalization:

(1) Experience Granularity: We find that principle-level experience is more durable than instance-level experience, as it effectively abstracts transferable strategies away from trajectory-specific details.

(2) Experience Injection Pattern: Our analysis reveals that step-wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long-horizon tool use.

(3) Internalization Regime: We demonstrate that off-policy context-distillation on high-quality teacher trajectories provides a substantially more stable training signal than on-policy context-distillation, which is inherently limited by local corrections on student-induced flawed states.

Together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, providing concrete guidance for engineering self-evolving and continually learning LLMs.

1 Introduction

The capability for continual learning is essential for building autonomous and adaptive LLM agents. Toward this end, learning from experience offers a promising path, enabling LLMs to acquire generalizable knowledge from past interactions and continuously improve through future interactions.

In-context learning (ICL) represents the most direct exploitation of experience by presenting it to the model as context. However, this paradigm is bounded by in-context capacity and prone to context collapse as the experience pool grows.

This motivates experience internalization, which converts context-dependent experience use into parametric capability. Most recent work on experience internalization adopts on-policy context-distillation and achieves strong performance in a single iteration of internalization. However, existing approaches largely overlook the necessity of iterative experience internalization, which is a cornerstone of the continual learning paradigm. Through a preliminary study, we reveal a critical vulnerability: current methods fail to sustain this self-evolving process, with performance collapsing as self-evolution proceeds.

In this study, we rethink why current experience internalization paradigms fail under multi-iteration experience learning. We attribute these failures to three stages of the transfer: how experience is represented, how it shapes teacher supervision, and which trajectory distribution is used to transfer the resulting behavior into the student.

Experience Granularity

We find that principle-level experience is more suitable for internalization than instance-level experience. By abstracting transferable strategies and failure patterns from trajectory-specific details, principle-level experience provides a more generalizable signal and reduces the risk of reinforcing instance-specific behaviors across iterations.

Experience Injection Pattern

We find that step-wise injection outperforms global injection by aligning relevant experience with intermediate decision states. This state-aligned use of experience is especially important in long-horizon tool-use tasks, where global injection can fail to preserve the model's ability to use newly generated experience in later self-evolution iterations. However, degradation can still occur under principle-level experience and step-wise injection, motivating us to examine the internalization regime.

Internalization Regime

We find that on-policy context-distillation delivers strong gains in a single iteration but fails to sustain them across multiple iterations. Since supervision is built on student-induced trajectories, the teacher is reduced to local corrections on flawed states, rather than coherent demonstrations of experience-guided behavior. Off-policy context-distillation, by contrast, trains on high-quality teacher-generated trajectories, providing a more stable signal for experience internalization and self-evolution.

Overall, we systematically study experience internalization across these three dimensions and propose a simple recipe for sustainable internalization. These findings provide practical guidance for designing LLM agents that can sustain experience-based self-evolution across iterations.

2 Related Work

2.1 Learning from Experience

Context-Based Experience Learning

The experience accumulated from the interaction trajectories of LLM agents provides a valuable resource for improving agent behavior. Recent work reuses such experience as contextual guidance without parameter updates. These methods can be broadly organized into storage, reflection, and abstraction: preserving trajectories for retrieval, refining stored experience through self-feedback, and generalizing experience into reusable forms such as skills, strategies, or summarized experiential knowledge.

However, context-based methods retain experience as inference-time context, leaving their benefits bounded by the model's in-context learning ability and vulnerable to context collapse when experience accumulates. This motivates our study of sustainable experience internalization beyond inference-time context.

Experience Internalization

Context distillation provides a way to internalize experience into model parameters by aligning an experience-free student with an experience-aware teacher. Early formulations are often off-policy, where the student is trained on teacher-generated trajectories but may suffer from training–inference mismatch. Recent work has therefore shifted toward on-policy context distillation, which supervises trajectories sampled from the student to improve distributional consistency.

However, existing works focus on single-round transfer, leaving the stability of multi-iteration internalization underexplored. We address this gap by studying sustainable experience internalization across self-evolution cycles.

2.2 Self-Evolving LLM Agents

Self-evolving LLM agents refer to agent systems that iteratively improve their behavior by leveraging interaction data, feedback signals, and self-generated experience. Existing work has explored self-evolution at both the policy and component levels. Policy-level methods update the agent model from interaction trajectories and feedback, whereas component-level methods evolve external structures such as memory, tools, skills, or experience libraries.

Recent work further couples model training with experience evolution in a closed loop, iteratively training from the experience pool and refreshing it with trajectories from the updated model. Effective experience-based self-evolution requires experience evolution and model improvement to reinforce each other across rounds. We therefore study how experience representation and internalization can strengthen this loop and support subsequent policy improvement.

3 Formulation

We formalize continual experience internalization and introduce the notation used in our analysis.

Agent Trajectories and Experience Pool

Following ReAct, an agent policy πθ interacts with an environment through interleaved reasoning and action steps, where 𝒜 denotes the action space. Given a user query x, at each step t, the agent generates a thought τₜ and an action aₜ ∈ 𝒜 conditioned on the history ℋₜ₋₁, where aₜ is either a tool call or a terminal answer. Tool calls return observations oₜ, forming a trajectory:

ℋₜ = (x, (τ₁, a₁, o₁), …, (τₜ, aₜ, oₜ))

evaluated by a task-level reward r(ℋₜ). Following prior work on experience extraction, we summarize trajectories into natural-language experience with DeepSeek-V4 unless otherwise specified, and denote the resulting pool as ℰ = {e₁, …, eₙ}.
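As an illustration of the notation above, the following minimal sketch represents a ReAct-style trajectory as plain data. The class names `Step` and `Trajectory` and the exact-match `task_reward` are assumptions introduced here for illustration; the paper does not specify how the reward r(ℋ) is implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    thought: str                # tau_t
    action: str                 # a_t: a tool call or a terminal answer
    observation: Optional[str]  # o_t: None when the action is a terminal answer


@dataclass
class Trajectory:
    query: str                                    # x
    steps: List[Step] = field(default_factory=list)

    def history(self, t: int) -> "Trajectory":
        """Return H_t: the query plus the first t steps."""
        return Trajectory(self.query, self.steps[:t])


def task_reward(traj: Trajectory, gold: str) -> float:
    """Hypothetical exact-match reward on the terminal answer (an assumption)."""
    if not traj.steps or traj.steps[-1].observation is not None:
        return 0.0  # the trajectory never produced a terminal answer
    return float(traj.steps[-1].action.strip().lower() == gold.strip().lower())
```

Under this sketch, `traj.history(t - 1)` plays the role of the history ℋₜ₋₁ on which the agent conditions its next thought and action.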

Experience Distillation

Experience internalization distills an experience-aware teacher πₜ into an experience-free student πθ (the subscript on πₜ labels the teacher policy and is distinct from the step index t). The teacher can access injected experience ℰₜ ⊆ ℰ during supervision construction, while the student acts without experience at deployment.

For brevity, let hₜ₋₁ = ℋₜ₋₁, pₜ = πₜ(·|hₜ₋₁, ℰₜ), and qₜ = πθ(·|hₜ₋₁).

We consider two internalization regimes.

Off-policy context-distillation: Trajectories are generated by the teacher and the student matches the teacher distribution with forward KL:

ℒ_off(θ) = 𝔼_ℋ∼πₜ ∑ₜ₌₁ᵀ D_KL(pₜ ∥ qₜ)

On-policy context-distillation: Trajectories are generated by the student and the teacher supervises student-induced states with reverse KL:

ℒ_on(θ) = 𝔼_ℋ∼πθ ∑ₜ₌₁ᵀ D_KL(qₜ ∥ pₜ)
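The two objectives differ in the direction of the KL term and in who generates the trajectory. The sketch below computes both for a single trajectory, assuming each per-step distribution is given as a list of probabilities over the same small support. In practice these would be distributions over the token vocabulary, and the function names are illustrative rather than taken from the released code.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """D_KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def off_policy_loss(teacher_steps, student_steps) -> float:
    """Forward KL, sum_t D_KL(p_t || q_t), on a teacher-generated trajectory."""
    return sum(kl(p, q) for p, q in zip(teacher_steps, student_steps))


def on_policy_loss(teacher_steps, student_steps) -> float:
    """Reverse KL, sum_t D_KL(q_t || p_t), on a student-generated trajectory."""
    return sum(kl(q, p) for p, q in zip(teacher_steps, student_steps))
```

Forward KL penalizes the student for assigning low probability where the teacher assigns high probability, whereas reverse KL penalizes student mass placed where the teacher assigns little; the two values generally differ for the same pair of distributions.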

Continual Experience Internalization

To study experience internalization beyond a single update, we consider an iterative process indexed by k = 0, 1, …, K. At iteration k, the current policy πθ₍ₖ₎ interacts with the environment and produces trajectories 𝒟⁽ᵏ⁾ = {ℋᵢ⁽ᵏ⁾}. These trajectories are summarized into an experience pool ℰ⁽ᵏ⁾. The same policy, when conditioned on ℰ⁽ᵏ⁾, serves as an experience-aware teacher for training the next experience-free student πθ₍ₖ₊₁₎:

θ⁽ᵏ⁺¹⁾ = Internalize(θ⁽ᵏ⁾, ℰ⁽ᵏ⁾)

This closed loop captures the promise of continual experience learning: an agent may transform accumulated experience into reusable capability as its policy evolves. Therefore, experience internalization should be evaluated not only by single-iteration gains, but also by whether such gains can be sustained across iterations.
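The iterative process can be summarized as a short loop. This is a schematic sketch only: `rollout`, `summarize`, and `internalize` are placeholder callables standing in for environment interaction, experience extraction, and the distillation update, none of which are specified at this level of detail in the text.

```python
from typing import Any, Callable, List, Tuple


def self_evolve(
    theta0: Any,
    rollout: Callable[[Any], List[Any]],
    summarize: Callable[[List[Any]], List[Any]],
    internalize: Callable[[Any, List[Any]], Any],
    num_iters: int,
) -> Tuple[Any, List[Tuple[int, List[Any]]]]:
    """Run num_iters rounds of theta^(k+1) = Internalize(theta^(k), E^(k))."""
    theta = theta0
    history = []
    for k in range(num_iters):
        trajectories = rollout(theta)     # D^(k): rollouts of the current policy
        pool = summarize(trajectories)    # E^(k): experience pool for this round
        theta = internalize(theta, pool)  # next experience-free student
        history.append((k, pool))
    return theta, history
```

Because the pool at iteration k is always built from the policy at iteration k, any degradation in the policy propagates into the next pool, which is why stability across iterations matters.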

Dimensions of Experience Internalization

In this framework, we study three dimensions that shape sustained experience internalization.

Experience Granularity specifies the abstraction level of the experience pool ℰ⁽ᵏ⁾. Instance-level experience preserves trajectory-specific details, while principle-level experience abstracts reusable strategies, decision rules, and failure patterns.

Experience Injection Pattern specifies how experience is provided to the teacher during supervision construction. Under global injection, the teacher uses a fixed experience context c^glob = [x; ℰ⁽ᵏ⁾] for the whole trajectory, inducing the teacher distribution pₜ^glob = πₜ(·|hₜ₋₁, c^glob). Under step-wise injection, an LLM-based selector Rφ selects experience according to the current interaction history, ℰₜ^step = Rφ(hₜ₋₁, ℰ⁽ᵏ⁾), inducing pₜ^step = πₜ(·|hₜ₋₁, ℰₜ^step).

Internalization Regime specifies the trajectory distribution on which experience-conditioned teacher behavior is transferred to the student, contrasting off-policy internalization on teacher-generated trajectories with on-policy internalization on student-induced trajectories.

Together, these dimensions define the design space for continual experience internalization in this work.
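To make the contrast between the two injection patterns concrete, the sketch below builds the teacher's experience context in both ways. The selector Rφ in the paper is LLM-based; here it is replaced by a simple word-overlap ranking (`keyword_selector`), which is an assumption made only to keep the example self-contained.

```python
from typing import List


def global_context(query: str, pool: List[str]) -> str:
    """c^glob = [x; E]: one fixed context reused at every step."""
    return "\n".join([query] + pool)


def keyword_selector(history: str, pool: List[str], top_k: int = 1) -> List[str]:
    """Stand-in for R_phi: rank experience by word overlap with the history."""
    hist_words = set(history.lower().split())
    ranked = sorted(
        pool,
        key=lambda e: len(hist_words & set(e.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def stepwise_context(history: str, pool: List[str], top_k: int = 1) -> str:
    """E_t^step = R_phi(h_{t-1}, E): context recomputed from the current history."""
    return "\n".join(keyword_selector(history, pool, top_k))
```

With global injection the teacher sees the same context at every step, whereas the step-wise context changes as the history moves from planning to verification.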

4 Experimental Setup

Models and Environment

We use Qwen3-4B-Instruct-2507 and Qwen3-8B (with thinking mode disabled) as student models. The agent follows the ReAct-style interaction format with five tools: Search, Visit, Python, Scholar, and File Parser.

Training Data and Experience

We construct a 15K-example training corpus from five public web-reasoning QA datasets: WebWalkerQA-silver, DeepDive, WebShaper, WebDancer, and SailorFog-QA. We use this corpus to generate agent trajectories, extract natural-language experience, and then use the resulting experience pools to construct experience-conditioned supervision under the internalization regimes defined in Section 3.

Benchmarks and Metrics

We evaluate on WebWalkerQA, GAIA-Text-103, and BrowseComp-ZH. Since WebWalkerQA-silver is included in our training corpus, we treat WebWalkerQA as in-domain and the other two as out-of-domain benchmarks. We report Pass@1 on WebWalkerQA and BrowseComp-ZH with one rollout per query, and average accuracy on GAIA-Text-103 over three rollouts. For brevity, we refer to GAIA-Text-103 as GAIA in tables.
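The two reporting conventions can be sketched as follows, assuming per-query correctness is available as booleans and that the three-rollout figure is the mean of per-rollout accuracies. The helper names are illustrative.

```python
from statistics import mean
from typing import List


def pass_at_1(correct: List[bool]) -> float:
    """Accuracy in percent with a single rollout per query."""
    return 100.0 * sum(correct) / len(correct)


def avg_accuracy(rollouts: List[List[bool]]) -> float:
    """Mean of per-rollout accuracies (percent) over repeated rollouts."""
    return mean(pass_at_1(r) for r in rollouts)
```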

Training and Inference

All methods are implemented with verl. We train students using a learning rate of 1×10⁻⁵, a batch size of 128, and 5 epochs on 8× NVIDIA A800 GPUs. During inference, we use temperature 0.7, allow at most Tₘₐₓ = 100 interaction steps, and set the context window to 32,768 tokens.

5 Toward Stable Continual Experience Internalization

5.1 Effect of Experience Granularity

We first examine how experience granularity shapes the reliability of experience internalization across iterations. We compare instance-level experience, which preserves trajectory-specific details, with principle-level experience, which abstracts reusable strategies, search principles, and failure patterns. Both are evaluated under in-context use and iterative internalization.

Instance-level experience yields only transient gains. Although it improves performance in the first iteration, these gains quickly diminish as self-evolution proceeds, and performance eventually falls below that of the base model. This fragility stems from the localized content profile of instance-level data. In our sampled pool, 74.4% of instance-level items contain specific URLs or domains, 57.3% contain concrete numbers, and 93.9% contain query- or entity-specific strings. Such trajectory-specific traces facilitate in-distribution exploitation but transfer poorly once the model encounters new queries or induces different trajectories.

Principle-level experience provides a durable signal by filtering out such local artifacts and retaining reusable decision rules. In our sample, 84.0% of principle-level items contain reusable strategy-like statements, compared with only 3.7% of instance-level items. This abstraction reduces dependence on source trajectories and better supports internalization across updated trajectory distributions.
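A content profile of this kind can be approximated with simple pattern matching. The sketch below is one possible heuristic, not the procedure used in the paper: the regular expressions and the list of top-level domains are assumptions, and it covers only the URL/domain and number categories.

```python
import re
from typing import Dict, List

# Heuristic patterns (assumptions): full URLs or bare domains, and any digit.
URL_RE = re.compile(r"https?://\S+|\b[\w-]+\.(?:com|org|net|edu|gov)\b", re.IGNORECASE)
NUMBER_RE = re.compile(r"\d")


def content_profile(items: List[str]) -> Dict[str, float]:
    """Percentage of experience items that match each heuristic pattern."""
    n = len(items)
    return {
        "url_or_domain": 100.0 * sum(bool(URL_RE.search(s)) for s in items) / n,
        "number": 100.0 * sum(bool(NUMBER_RE.search(s)) for s in items) / n,
    }
```

Running the same profile over instance-level and principle-level pools gives a quick check of how much trajectory-specific detail each representation retains.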

Overall, instance-level experience mainly provides short-term gains, whereas principle-level experience offers a more stable basis for sustained multi-iteration self-evolution.

5.2 Effect of Experience Injection Pattern

Having established that principle-level experience provides a more suitable signal for internalization, we next examine how such experience should be injected into the teacher prompt when constructing supervision. We fix the experience granularity to principle-level experience and study the two injection patterns under on-policy context-distillation, where trajectories are sampled from the student and the teacher supervises student-induced states.

Following Section 3, the two injection patterns induce different teacher distributions, pₜ^glob and pₜ^step, while the student remains experience-free with qₜ = πθ(·|hₜ₋₁).

Under on-policy distillation, both settings supervise the same student-induced trajectory distribution and differ only in the teacher distribution used as the distillation target. The global-injection objective is:

ℒ_on^glob(θ) = 𝔼_ℋ∼πθ ∑ₜ₌₁ᵀ D_KL(qₜ ∥ pₜ^glob)

Here, the teacher uses a fixed trajectory-level experience context, whereas step-wise injection uses a state-dependent teacher distribution:

ℒ_on^step(θ) = 𝔼_ℋ∼πθ ∑ₜ₌₁ᵀ D_KL(qₜ ∥ pₜ^step)

5.2.1 Injection Pattern in Single-Iteration Internalization

At iteration 1, step-wise injection consistently yields stronger internalization than global injection. Merely making experience accessible to the teacher is therefore insufficient: the injection pattern determines whether the experience can shape the teacher distribution used for distillation.

Injection	WebWalkerQA	GAIA	BrowseComp-ZH
Global	23.2	16.8	4.5
Step-wise	31.2 (+8.0)	22.7 (+5.9)	5.2 (+0.7)

This result suggests that the utility of experience is determined not only by the experience pool itself, but also by whether its content is selected and injected at the appropriate supervision state. Such state-specific selection is crucial in long-horizon tool-use tasks, because experience that helps search planning may become irrelevant, or even misleading, at later states where the model should verify evidence or decide whether to terminate.

Global injection treats experience as a fixed trajectory-level context, which can misalign the injected experience with the decision currently being supervised. Step-wise injection mitigates this issue by selecting experience according to the current interaction history, turning experience from static background context into decision-relevant supervision.

This advantage also holds when the experience is generated by the student-side model itself. As shown in the table above, under the Qwen self-generated setting, step-wise injection improves over global injection on all three benchmarks. Because this setting relies on the student-side model rather than a stronger external model for experience extraction and selection, it is a more challenging test of whether the injection pattern can exploit weaker experience. The result indicates that step-wise injection can extract useful supervision from self-generated experience, supporting experience-based self-evolution.

5.2.2 Injection Pattern in Iterative Internalization

While single-iteration gains are valuable, the critical question for continual experience learning is whether an injection pattern can sustain improvement as the model and the experience pool co-evolve. Global injection yields only transient improvements and degrades as self-evolution proceeds. In contrast, step-wise injection maintains stronger performance across iterations, especially on WebWalkerQA and GAIA.

This indicates that the experience injection pattern affects not only the current internalization step, but also the sustainability of experience internalization under iterative updates.

This distinction is particularly important under Qwen self-generated experience. Since the experience pool is produced by the student-side model, it provides a more challenging source of supervision than experience generated by a stronger external model. Step-wise injection better preserves the model's ability to benefit from explicit experience across iterations. After later internalization rounds, step-wise-trained models can still improve when the corresponding experience pool is provided in context, whereas global-injection models degrade in both in-context and internalized performance. This indicates that step-wise injection helps the updated model continue to use its newly generated experience pool when serving as the teacher in later iterations. Without this ability, the newly generated experience pool cannot provide effective supervision for subsequent internalization.

These results suggest that step-wise injection provides a viable path for experience-based self-evolution, while global injection fails to preserve the utility of experience as the model and experience pool co-evolve.

5.2.3 Why Step-wise Injection Supports Continual Experience Internalization

We further analyze why step-wise injection better sustains continual experience internalization. In iterative self-evolution, the model obtained from one internalization iteration is reused to construct supervision for the next. Thus, the updated model must not only perform well without inference-time experience, but also retain experience-use ability: the ability to further benefit from its corresponding experience pool at inference time, measured by the gap between in-context and experience-free inference. This ability is necessary because the next-round teacher must use the updated experience pool to produce supervision.

Step-wise models continue to benefit from experience across iterations, whereas global-injection models degrade both with and without experience context. This indicates that global injection not only fails to fully convert experience into parametric capability, but also weakens experience-use ability. When reused in the next iteration, the model may provide weaker experience-conditioned supervision and destabilize the model–experience loop.
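Experience-use ability, as defined above, is the gap between in-context and experience-free performance of the same model. A minimal sketch of tracking it across iterations follows; the function names and the `min_gap` threshold are illustrative assumptions.

```python
from typing import List, Tuple


def experience_use_gap(in_context_acc: float, experience_free_acc: float) -> float:
    """Gain from providing the experience pool in context at inference time."""
    return in_context_acc - experience_free_acc


def retains_experience_use(
    per_iter: List[Tuple[float, float]], min_gap: float = 0.0
) -> List[bool]:
    """For each iteration, check whether in-context use still helps.

    per_iter holds (in_context_acc, experience_free_acc) pairs, one per iteration.
    """
    return [experience_use_gap(ic, free) > min_gap for ic, free in per_iter]
```

A model whose gap shrinks to zero or below can no longer act as a useful experience-aware teacher, even if its experience-free accuracy has not yet dropped.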

We also observe a premature-answer failure mode caused by the injection pattern:

Setting	Premature-answer rate
Global	63.82%
Step-wise	0%

This failure stems not from the experience form itself, but from a mismatch between the injected experience and the current decision state. Under global injection, the teacher receives the same fixed experience context throughout the whole trajectory, regardless of whether the current state requires search planning, evidence verification, or termination. As a result, experience that is useful for later-stage decision making may be exposed too early, while experience relevant to the current state may not be emphasized. This misalignment can shift the teacher distribution toward premature answer generation rather than continued tool use.

In contrast, step-wise injection selects experience according to the current interaction history, making the injected experience more decision-relevant at each state. The global-injection model terminates before search, while the step-wise model continues evidence-seeking tool use.
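One way to measure this failure mode is to count trajectories that answer without issuing any tool call. The definition below is an assumption made for illustration, since the text does not state the exact criterion; trajectories are represented simply as lists of action names, using the five tools from the experimental setup.

```python
from typing import List

TOOLS = {"Search", "Visit", "Python", "Scholar", "File Parser"}


def is_premature(actions: List[str]) -> bool:
    """True when the trajectory ends without a single tool call (assumed criterion)."""
    return not any(a in TOOLS for a in actions)


def premature_answer_rate(trajectories: List[List[str]]) -> float:
    """Percentage of trajectories flagged as premature answers."""
    return 100.0 * sum(is_premature(t) for t in trajectories) / len(trajectories)
```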

Together, these analyses show that step-wise injection benefits both the current internalization round and the subsequent self-evolution loop. By preserving experience-use ability and reducing exposure to irrelevant terminal information, it helps the internalized model remain an effective experience-aware teacher in later iterations, whereas global injection can weaken this role and make the model–experience loop less sustainable.

5.3 Effect of Internalization Regime

The previous two dimensions improve experience internalization, but performance can still degrade across self-evolution iterations. We therefore revisit on-policy context-distillation, the dominant paradigm for experience internalization, and examine whether the transfer regime affects the stability of continual internalization.

5.3.1 Trajectory Distribution and Supervision Coherence

We compare on-policy and off-policy context-distillation under the same principle-level, step-wise experience configuration, differing only in the trajectory distribution used for supervision. On-policy context-distillation samples trajectories from the current experience-free student and queries the experience-aware teacher on the resulting student-induced states. Off-policy context-distillation instead samples trajectories directly from the experience-aware teacher (i.e., the student conditioned on step-wise experience) and applies rejection sampling to retain successful trajectories.

This difference in trajectory distribution affects the coherence of the resulting supervision signal.

For on-policy context-distillation, supervision is fundamentally reactive. Because the preceding trajectory is generated by the student without experience, the teacher can only provide corrections on states that may already be inefficient or off target. When the student has deviated substantially from a useful search path, the teacher may struggle to provide valid guidance on these degraded states. As a result, on-policy supervision can improve localized decisions, but it does not necessarily demonstrate how experience should guide a coherent trajectory. This limitation is especially important in long-horizon tool use, where search planning, evidence verification, and termination decisions must be coordinated.

Off-policy context-distillation instead provides proactive experience-guided supervision. Because the experience-aware teacher generates the full trajectory from the beginning, experience can shape the entire decision sequence, from initial search planning to final answering. After rejection sampling, the student is trained on compact and successful trajectories that directly demonstrate end-to-end experience-guided behavior. This yields a cleaner supervision signal that is better aligned with the behavior we aim to internalize.
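The rejection-sampling step can be sketched as below. The callables `sample_teacher` and `is_success`, the per-query sample budget, and the choice to keep only the first successful trajectory per query are all assumptions; the text states only that successful teacher trajectories are retained.

```python
from typing import Any, Callable, List


def rejection_sample(
    queries: List[str],
    sample_teacher: Callable[[str, int], Any],
    is_success: Callable[[str, Any], bool],
    samples_per_query: int = 4,
) -> List[Any]:
    """Keep at most one successful teacher trajectory per query."""
    kept = []
    for q in queries:
        for i in range(samples_per_query):
            traj = sample_teacher(q, i)  # rollout of the experience-aware teacher
            if is_success(q, traj):      # task-level reward check
                kept.append(traj)
                break                    # stop at the first success for this query
    return kept
```

Queries for which no sampled trajectory succeeds contribute no training data, so the resulting set contains only end-to-end successful demonstrations.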

5.3.2 Rollout Cost and Trajectory Efficiency

The two regimes also differ in effective rollout cost. We control the query-level rollout budget by using the same set of rollout queries for both regimes, but the actual interaction cost largely depends on trajectory length.

Setting	Avg. assistant turns
Base	2.5
Teacher	4.5
Updated Student	21.9

After a single update under on-policy context-distillation, the updated student produces substantially longer trajectories, averaging 21.9 assistant turns compared with only 2.5 for the base model and 4.5 for the experience-aware teacher. This trajectory inflation increases the practical interaction cost of the on-policy regime, even under an identical query budget.

In contrast, off-policy context-distillation avoids this overhead by sampling shorter trajectories directly from the experience-aware teacher and applying rejection sampling to filter low-quality variants. By leveraging concise teacher rollouts, off-policy context-distillation provides a more efficient supervision loop for iterative internalization.
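The turn statistics above can be computed from logged conversations as in the following sketch, which assumes each trajectory is a list of chat messages with a `role` field. This message format is an assumption, not a description of the released logs.

```python
from typing import Dict, List


def avg_assistant_turns(trajectories: List[List[Dict[str, str]]]) -> float:
    """Mean number of assistant messages per trajectory."""
    counts = [sum(m["role"] == "assistant" for m in traj) for traj in trajectories]
    return sum(counts) / len(counts)


def inflation(updated: float, reference: float) -> float:
    """Ratio of average turns between two settings."""
    return updated / reference
```

Applied to the reported averages, `inflation(21.9, 2.5)` is about 8.8 and `inflation(21.9, 4.5)` is about 4.9, so the updated student takes several times as many turns as either the base model or the teacher on the same queries.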

5.4 Stable Multi-Iteration Experience-Based Self-Evolution

Having analyzed the three dimensions separately, we evaluate whether their synthesis supports stable experience-based self-evolution. Our final configuration integrates principle-level experience, step-wise injection, and off-policy context-distillation.

This combined design successfully sustains robust performance gains across consecutive iterations. The internalized model consistently outperforms the vanilla base model, demonstrating that experience-conditioned behavior is reliably embedded into model parameters.

Furthermore, in-context evaluation reveals that the updated model retains its capacity to exploit the experience pool, ensuring that the student can effectively serve as the experience-aware teacher for the subsequent iteration. Unlike unstable baselines, this design simultaneously preserves standalone parametric execution and in-context responsiveness across iterative updates. Together, these three complementary dimensions form a stable recipe for multi-iteration experience internalization and sustainable self-evolution.

6 Conclusion

We study experience internalization beyond single-iteration transfer and show that existing methods can fail to sustain improvement across self-evolution iterations. Through three dimensions, we find that principle-level experience provides a more durable signal than instance-level experience, step-wise injection better aligns experience with intermediate decision states, and off-policy context-distillation offers more coherent supervision than on-policy context-distillation.

Combining these findings yields a stable recipe for multi-iteration experience internalization, enabling LLM agents to better transform accumulated experience into reusable capability across self-evolution cycles.

Limitations

Our experiments focus on web-reasoning agent tasks, so further evaluation is needed to assess whether the findings generalize to other domains, languages, and agent settings. In addition, while we study three key dimensions of experience internalization, other factors such as experience-pool size, selector quality, and filtering criteria may also affect stability. We leave a more comprehensive exploration of these factors to future work.

Broader Impact

This work studies stable experience internalization for self-evolving LLM agents. By analyzing why experience internalization can degrade across iterations, our findings may help build agents that more reliably transform accumulated experience into reusable model capability. This can benefit long-horizon tool-use applications such as web reasoning, information seeking, and research assistance, where agents must search, verify evidence, and update their behavior from past interactions.

At the same time, more stable internalization may also reinforce undesirable behaviors if the accumulated experience contains incorrect, biased, or unsafe patterns. This risk is especially relevant in self-evolving systems, where models repeatedly generate, internalize, and reuse their own experience. Our work focuses on improving the stability of experience internalization across self-evolution iterations; practical deployment should therefore still include trajectory filtering, experience-pool auditing, human oversight, and restrictions in high-risk settings.