Auto-generated from arXiv metadata + an LLM reading only titles/abstracts. Equations are interpretive; always verify with the PDF.

1) The Role of Feedback Alignment in Self-Distillation

Authors: Semih Kara, Oğuzhan Ersoy
arXiv: 2606.11173 · pdf
LLM context source: arXiv HTML (html)
Categories: cs.AI, cs.LG

Abstract

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model’s output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver’s reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver’s reasoning is a key driver of self-distillation effectiveness.

Formula and Experiment Notes (LLM)

Formula Walkthrough

Equation 1: π_{\theta}

π_{\theta} is the output distribution of the model with parameters θ: the probability distribution over possible outputs given the input x. In the self-distillation setup this is the student, which sees only the question.

Equation 2: π_{T}

π_{T} is the output distribution of the self-teacher T: the same model conditioned on both the input x and the additional context c (for example, feedback on a previous attempt).

Equation 3: \mathcal{L}_{\text{KD}}

\mathcal{L}_{\text{KD}} is the knowledge distillation loss. It measures the difference between the output distribution of the teacher model π_{T} and the output distribution of the student model π_{\theta}. The loss is calculated as the expected KL divergence between the two distributions.
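A minimal sketch of such a loss, assuming a forward KL direction, KL(π_{T} ‖ π_{\theta}), averaged over token positions. The direction, the weighting, and the toy distributions below are assumptions for illustration, not details taken from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_dists, student_dists):
    """Mean per-token KL between teacher and student next-token distributions."""
    assert len(teacher_dists) == len(student_dists)
    per_token = [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token), per_token

# Hypothetical toy data: 3 token positions, vocabulary of size 3.
# Teacher and student agree at positions 0 and 2 and disagree at position 1.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
student = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.3, 0.3, 0.4]]

loss, per_token = kd_loss(teacher, student)
print(loss, per_token)
```

In this toy case only the middle position contributes to the loss, which mirrors the abstract's point that the training signal depends on where the teacher and student distributions differ.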

Equation 4: \{y_{i}\}_{i=1}^{G} \sim \pi_{\theta}(\cdot \mid x)

This equation represents the generation of step-tagged reasoning traces by the solver model π_{\theta} for a given input x. The traces are generated by sampling from the output distribution of the model.

Equation 5: r_{i} \in \{0,1\}

r_{i} represents the reward or feedback received by the solver model for each generated trace. The reward is either 0 or 1, indicating whether the trace is correct or not.
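For the GRPO condition, binary rewards over a group of G sampled traces are commonly turned into group-relative advantages by standardizing within the group. The sketch below assumes that standard normalization; whether the paper uses exactly this form should be checked in the PDF, and the reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled traces (assumed GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of G = 8 traces for one question: 1 = correct, 0 = incorrect.
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every trace gets the same reward, the group carries no learning signal.
flat = group_relative_advantages([1, 1, 1, 1])
print(flat)
```

Note that each trace receives one scalar advantage shared by all of its tokens, in contrast to the per-token signal that the abstract attributes to step-aligned feedback.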

Method Summary

The authors implement self-distillation in a solver-critic setup, where the solver model generates step-tagged reasoning traces and the critic model provides feedback on the solver’s response.
The self-distillation method trains the solver model to retain the improvement in response quality when the context is not present.
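The authors compare three feedback conditions: GRPO (a binary reward), RefSol (self-distillation conditioned on the reference solution), and StepAlignFB (self-distillation conditioned on a step-by-step critique aligned to the solver’s reasoning trace).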

Experimental Overview

The authors evaluate the three feedback variants on a held-out 30-sample test split from OpenMathReasoning.
The tasks/datasets used are math questions with step-tagged reasoning traces.
The baselines/comparisons are the self-distillation variants (RefSol and StepAlignFB) and the GRPO method.
The main claimed finding is that step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points on Avg@12. Majority-Vote@12 and Pass@12 are also reported.
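The three metrics can be sketched for a single question as follows. The definitions used here (mean correctness over k samples, correctness of the most frequent answer, and whether any sample is correct) are common conventions and are assumed rather than taken from the paper; the sampled answers are made up.

```python
from collections import Counter

def avg_at_k(correct):
    """Fraction of the k sampled answers that are correct."""
    return sum(correct) / len(correct)

def pass_at_k(correct):
    """1.0 if at least one of the k sampled answers is correct."""
    return float(any(correct))

def majority_vote_at_k(answers, reference):
    """1.0 if the most frequent answer matches the reference.
    Ties resolve to the answer seen first (Counter.most_common ordering)."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return float(top_answer == reference)

# Hypothetical k = 12 sampled final answers for one question whose answer is "4".
answers = ["4", "4", "5", "4", "7", "4", "4", "5", "4", "4", "5", "4"]
reference = "4"
correct = [a == reference for a in answers]

avg12 = avg_at_k(correct)
maj12 = majority_vote_at_k(answers, reference)
pass12 = pass_at_k(correct)
print(avg12, maj12, pass12)
```

Dataset-level scores would then be the mean of each per-question value over the test split.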

What to Verify in the PDF

The implementation details of the solver-critic setup, including the architecture of the solver and critic models.
The training procedure for the self-distillation method, including the hyperparameter settings and the optimization algorithm used.
Whether the paper includes an ablation study (none appears in the extracted context) that isolates the contribution of each component of the self-distillation method.

2) Predicting Future Behaviors in Reasoning Models Enables Better Steering

Authors: Evgenii Kortukov, Piotr Komorowski, Florian Klein, Paula Engl, Gabriele Sarti, Seong Joon Oh, Sebastian Lapuschkin, Wojciech Samek
arXiv: 2606.11172 · pdf
LLM context source: arXiv HTML (html)
Categories: cs.LG

Abstract

Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitly relies on internal features that detect behavior in already generated text. We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target. Instead, we train activation probes to predict future behavior likelihoods from intermediate reasoning steps. These probes predict the most likely behavior with 64%-91% accuracy, revealing a separate type of internal prediction features. Building on these prediction features, we introduce a text-level steering method, Future Probe Controlled Generation. FPCG samples multiple candidate sentences and chooses the best one according to a probe predicting the future behavior likelihood. This enables steering with almost no output quality degradation. FPCG also enables steering in several evaluations where activation steering fails. These results show that distinguishing detection and prediction features enables a more nuanced approach to controlling LRM behaviors.

Formula and Experiment Notes (LLM)

Formula Walkthrough

Equation 1: T=1.0

Equation: T=1.0
Symbols: T (temperature)
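Why it matters: Temperature is a hyperparameter that controls the randomness of the model’s output. A temperature of 1.0 samples from the model’s unmodified output distribution; lower temperatures concentrate probability on the most likely tokens, and only the limit T → 0 yields deterministic (greedy) output.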
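A short sketch of temperature scaling, using made-up logits, shows what T = 1.0 does in practice: the logits are divided by T before the softmax, so T = 1.0 leaves the distribution unchanged, small T sharpens it, and large T flattens it.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits

p_t1 = softmax_with_temperature(logits, 1.0)    # unmodified distribution
p_cold = softmax_with_temperature(logits, 0.1)  # close to greedy decoding
p_hot = softmax_with_temperature(logits, 5.0)   # close to uniform
print(p_t1, p_cold, p_hot)
```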

Equation 2: B(q,r)=1

Equation: B(q,r)=1
Symbols: B (behavior function), q (query), r (response)
Why it matters: This equation defines the behavior function, which maps the query and response to a binary label (1 or 0) indicating whether the model exhibits the desired behavior.

Equation 3: B(q,r)=0

Equation: B(q,r)=0
Symbols: B (behavior function), q (query), r (response)
Why it matters: This equation is the negation of Equation 2, indicating that the model does not exhibit the desired behavior.

Equation 4: p_i

Equation: p_i
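Symbols: p_i (the i-th input to the behavior function; it occupies the query slot of B in Equation 7, so it is more likely a prompt or reasoning prefix than a probability)
Why it matters: The behavior statistics in Equation 7 are computed per p_i, so its exact definition should be confirmed in the PDF.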

Equation 7: \bar{B}(p_i)=\frac{1}{S}\sum_{j=1}^{S}B(p_i,r_{ij})

Equation: \bar{B}(p_i)=\frac{1}{S}\sum_{j=1}^{S}B(p_i,r_{ij})
Symbols: \bar{B} (average behavior), p_i (i-th input, in the query slot of B), S (number of sampled responses), B (behavior function), r_{ij} (j-th response sampled for p_i)
Why it matters: This equation averages the binary behavior label over S sampled responses, giving an empirical estimate of how often the model exhibits the behavior for p_i.
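The average in Equation 7 is a plain Monte Carlo estimate. The sketch below assumes a hypothetical keyword-based behavior function and made-up responses; the paper's actual B and sampling procedure are not specified in the extracted context.

```python
def average_behavior(p_i, responses, behavior_fn):
    """Empirical mean of the binary behavior label over S sampled responses."""
    return sum(behavior_fn(p_i, r) for r in responses) / len(responses)

def refuses(p_i, response):
    """Hypothetical behavior function B: 1 if the response contains a refusal."""
    return 1 if "cannot" in response.lower() else 0

# Hypothetical S = 4 responses sampled for the same input p_i.
p_i = "example input"
responses = ["I cannot help with that.", "Sure, here it is.",
             "I cannot do that.", "Okay."]

b_bar = average_behavior(p_i, responses, refuses)
print(b_bar)
```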

Method Summary

The authors propose a new method for steering large reasoning models (LRMs) called Future Probe Controlled Generation (FPCG).
FPCG uses activation probes to predict the likelihood of the model exhibiting a certain behavior.
The authors train these probes using a combination of linear and multi-layer perceptron (MLP) architectures.
FPCG samples multiple candidate sentences at each step and keeps the one that a probe predicting future behavior likelihood scores best.
The authors evaluate FPCG on six behavioral evaluation datasets and compare it to activation steering.
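The sample-then-select loop described in this summary can be sketched as follows. Everything here is a hypothetical stand-in: `toy_sampler` replaces the language model, and `toy_probe` scores text directly, whereas the paper's probes read model activations. The function names, the candidate count, and the stopping rule are assumptions.

```python
import random

def fpcg_generate(prompt, sample_candidates, probe,
                  num_candidates=4, max_sentences=3, maximize=True):
    """At each step, sample candidate sentences and keep the one the probe prefers."""
    text = prompt
    pick = max if maximize else min
    for _ in range(max_sentences):
        candidates = sample_candidates(text, num_candidates)
        if not candidates:
            break
        # Score each continuation by the probe's predicted future-behavior value.
        scores = [probe(text + " " + c) for c in candidates]
        best = pick(range(len(candidates)), key=scores.__getitem__)
        text = text + " " + candidates[best]
    return text

# Hypothetical stand-ins for the model and the probe.
_POOL = ["Let me double-check this.", "I will just guess.",
         "The answer follows directly.", "I should verify the units."]
_rng = random.Random(0)

def toy_sampler(text, n):
    return _rng.sample(_POOL, k=min(n, len(_POOL)))

def toy_probe(text):
    # Pretend that checking language predicts the target behavior later on.
    return text.count("verify") + text.count("double-check")

prompt = "Question: what is 2 + 2?"
steered_up = fpcg_generate(prompt, toy_sampler, toy_probe, maximize=True)
steered_down = fpcg_generate(prompt, toy_sampler, toy_probe, maximize=False)
print(steered_up)
print(steered_down)
```

Because selection happens over sampled text rather than by editing hidden states, every emitted sentence is one the model could have produced on its own, which is consistent with the abstract's claim of little quality degradation.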

Experimental Overview

The authors study four open-weight reasoning language models of various sizes and model families.
They evaluate the models on six behavioral evaluation datasets, including multiple-choice question, free-form generation, and prompt injection datasets.
The authors compare FPCG to activation steering and report the results in Appendix D.
The main claimed findings are that FPCG outperforms activation steering in terms of output quality and that the authors’ method can be used to steer models with almost no output quality degradation.

What to Verify in the PDF

The authors’ claim that the detection features used in prior steering work are poor predictors of future behavioral outcomes.
The authors’ method for training activation probes and how it compares to other architectures.
The authors’ evaluation of FPCG on the six behavioral evaluation datasets and how it compares to activation steering.

3) Algorithmic and Minimax Complexities in Kernel Bandits

Authors: Yunbei Xu
arXiv: 2606.11171 · pdf
LLM context source: arXiv HTML (html)
Categories: cs.LG, cond-mat.stat-mech, cs.IT, math.OC, math.ST

Abstract

Gaussian-process upper confidence bound (GP-UCB) and decision-estimation-coefficient (DEC) methods may appear, at first sight, to belong to different theories. This paper places the two viewpoints in a common algorithmic-information language for frequentist RKHS bandits. GP-UCB fixes an algorithmic, rather than true, Gaussian-process prior and exploits realized-trajectory complexity together with computational tractability, whereas MAMS optimizes a robust class-wide MAIR/DEC envelope. Through the unified MAIR framework and heterogeneous positive-semidefinite algorithmic priors, we generalize both the GP-UCB analysis and the MAMS algorithm, propose a safeguarded master that combines their advantages, and provide a kernel-bandit construction showing that algorithmic complexity can be more informative than class-wide minimax or DEC certificates in overparameterized models. The resulting message is that algorithmic information and class-wide minimax coefficients answer different questions and can lead to different gaps; kernel bandits provide a clean setting in which this distinction becomes mathematically visible.

Formula and Experiment Notes (LLM)

Formula Walkthrough

Equation 1: f^{\star}

f^{\star} = \max_{\pi \in \Pi} r_M(\pi)

Equation: The optimal value under model M, i.e. the largest reward attainable by any decision in \Pi.
Symbols: f^{\star} (optimal value), π (decision), \Pi (decision space), r_M (reward function of model M).
Why it matters: This equation gives the maximum expected reward under model M, the benchmark against which regret is measured.

Equation 2: M ∈ \mathcal{M}

M ∈ \mathcal{M} is a model in the model class \mathcal{M}

Equation: A model M belongs to the model class \mathcal{M}.
Symbols: M (model), \mathcal{M} (model class).
Why it matters: This equation defines the model class that the algorithms are operating on.

Equation 3: \pi ∈ \Pi

\pi ∈ \Pi is a decision in the decision space \Pi

Equation: A decision π belongs to the decision space \Pi.
Symbols: π (decision), \Pi (decision space).
Why it matters: This equation defines the decision space that the algorithms are operating on.

Equation 4: r_M(\pi)

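r_M(p) = \mathbb{E}_{\pi \sim p}[r_M(\pi)]

Equation: The expected reward under model M when the decision π is drawn from a distribution p. The extracted form had r_M(\pi) on both sides, which is circular, so the left-hand side is written here as a function of p; verify against the PDF.
Symbols: r_M (reward function of model M), π (decision), p (distribution over decisions).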
Why it matters: This equation represents the expected reward of a decision under model M.

Equation 5: P_{M,\pi}

P_{M,\pi} = \mathbb{P}_{\pi \sim p} (r_M(\pi) > r_M(\pi_{M}))

Equation: The probability that the reward of decision π is greater than the optimal reward under model M.
Symbols: P_{M,\pi} (probability), π (decision), p (prior distribution), r_M (reward function of model M), π_M (optimal decision).
Why it matters: This equation represents the probability that a decision is better than the optimal decision under model M.

Method Summary

The paper introduces a unified framework for kernel bandits using the MAIR (Model-Index Algorithmic Information Ratio) framework.
The framework allows for the comparison of different algorithms, including GP-UCB and MAMS, on the same instance.
The paper proposes a safeguarded master algorithm that combines the advantages of GP-UCB and MAMS.
The framework provides a clean setting for studying the relationship between algorithmic information and class-wide minimax coefficients.

Experimental Overview

The paper presents experiments on a finite grid for Matérn and squared-exponential kernels.
The experiments compare the performance of GP-UCB, MAMS, and the safeguarded master algorithm.
The paper plots cumulative regret and realized information traces to evaluate the performance of the algorithms.
The experiments aim to answer the following questions:
- Can GP-UCB collect more realized information than MAMS?
- Does the safeguarded master algorithm outperform GP-UCB and MAMS?
- How does the algorithmic complexity of the algorithms affect their performance?
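For orientation, a textbook GP-UCB loop on a finite grid with a squared-exponential kernel can be sketched as below. This is not the paper's generalized analysis, MAMS, or the safeguarded master: the fixed exploration weight `beta`, the noise level, the lengthscale, and the test function are all assumptions, and the information trace uses the standard Gaussian information-gain formula, which may differ from the paper's realized-information quantity.

```python
import math
import random

def se_kernel(x, y, lengthscale=0.2):
    """Squared-exponential kernel with unit prior variance."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)

def gp_ucb(grid, reward_fn, rounds=30, beta=2.0, noise_var=0.01, seed=0):
    rng = random.Random(seed)
    n = len(grid)
    mean = [0.0] * n                                        # posterior mean on the grid
    cov = [[se_kernel(a, b) for b in grid] for a in grid]   # posterior covariance
    chosen, info = [], 0.0
    for _ in range(rounds):
        ucb = [mean[j] + beta * math.sqrt(max(cov[j][j], 0.0)) for j in range(n)]
        i = max(range(n), key=ucb.__getitem__)
        y = reward_fn(grid[i]) + rng.gauss(0.0, math.sqrt(noise_var))
        # Gaussian information gain of this observation (assumed trace definition).
        info += 0.5 * math.log(1.0 + max(cov[i][i], 0.0) / noise_var)
        # Exact rank-one Gaussian conditioning on the noisy observation at grid[i].
        denom = cov[i][i] + noise_var
        col = [cov[j][i] for j in range(n)]
        resid = y - mean[i]
        mean = [mean[j] + col[j] * resid / denom for j in range(n)]
        cov = [[cov[j][k] - col[j] * col[k] / denom for k in range(n)]
               for j in range(n)]
        chosen.append(i)
    return chosen, mean, info

def f(x):
    """Hypothetical reward function for illustration."""
    return math.sin(3.0 * x)

grid = [i / 20 for i in range(21)]
chosen, mean, info = gp_ucb(grid, f)
best = max(f(x) for x in grid)
cumulative_regret = sum(best - f(grid[i]) for i in chosen)
print(chosen, cumulative_regret, info)
```

The first pull always lands on the first grid point because the prior upper confidence bound is identical everywhere; later pulls trade off posterior mean against remaining uncertainty.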

What to Verify in the PDF

The derivation of the MAIR objective and gradient bracket.
The proof of the safeguarded master algorithm’s optimality.
The experimental results and plots.
The theoretical analysis of the algorithmic complexity of the algorithms.

4) COGENT: Continuous Graph Emulators with Neural Ordinary Differential Equations for Long-Term Physical Forecasting

Authors: Zesheng Liu, Maryam Rahnemoonfar
arXiv: 2606.11162 · pdf
LLM context source: arXiv HTML (html)
Categories: cs.LG

Abstract

In this work, we present COGENT, a continuous graph emulator with Neural Ordinary Differential Equations for long-term physical forecasting on irregular geospatial meshes. COGENT encodes a finite history of system states and associated forcing fields and external forcings with a graph-based history encoder, producing node-wise context vectors that capture both local spatial interactions and temporal evolution. These context vectors initialize and condition a latent Neural Ordinary Differential Equation whose dynamics are driven by interpolated future forcings and explicit relative rollout time. By modeling the forecast trajectory as a continuous latent dynamical system, COGENT can generate predictions at arbitrary future times rather than being restricted to a fixed temporal discretization. A residual decoder maps the resulting latent trajectories back to future physical states, enabling direct multi-step forecasting without repeatedly feeding predicted states back into the model. This formulation combines graph-based spatial representation, history-conditioned latent dynamics, and continuous-time rollout in a unified framework for mesh-based physical simulation emulation. In order to stabilize training with long-horizon supervision, we also propose effective rollout-horizon sampling and a progressive rollout-horizon scheduling strategy. We evaluate COGENT on transient ice-sheet simulations generated by the Ice-sheet and Sea-level System Model, demonstrating improved long-range stability over autoregressive graph baselines. These results suggest that continuous graph Neural ODEs provide a promising methodology for scalable physical forecasting on irregular geospatial meshes, particularly in applications that require stable long-horizon predictions and the ability to query system states at arbitrary times.

Formula and Experiment Notes (LLM)

Formula Walkthrough

Equation 1: t+\Delta

Equation: t+\Delta
Symbols: t, \Delta
Why it matters: This equation represents the relative rollout time, which is injected into the ODE block. It allows the latent dynamics to depend on both future drivers and historical spatial-temporal interactions.

Equation 2: \Delta

Equation: \Delta
Symbols: \Delta
Why it matters: This equation represents the time step size, which is used to discretize the continuous-time rollout.

Equation 3: z_{0}

Equation: z_{0}
Symbols: z_{0}
Why it matters: This equation represents the initial latent state, which is constructed from the history context, the most recent physical state, and the static embedding.

Equation 4: t\!\to\!t+1

Equation: t\!\to\!t+1
Symbols: t, t+1
Why it matters: This equation represents the transition from one time step to the next, which is used to initialize the latent ODE.

Equation 5: t\!\to\!t+h

Equation: t\!\to\!t+h
Symbols: t, t+h
Why it matters: This equation represents the transition from one time step to a future time step, which is used to initialize the latent ODE for future predictions.

Method Summary

COGENT is a continuous graph emulator with Neural Ordinary Differential Equations (ODEs) for long-term physical forecasting on irregular geospatial meshes.
COGENT encodes a finite history of system states and associated forcing fields and external forcings with a graph-based history encoder.
The model uses a residual decoder to map the resulting latent trajectories back to future physical states, enabling direct multi-step forecasting without repeatedly feeding predicted states back into the model.
COGENT is evaluated on transient ice-sheet simulations generated by the Ice-sheet and Sea-level System Model.
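The continuous-time rollout idea can be sketched for a single node as follows: a latent state is integrated forward from relative time 0, driven by an interpolated forcing, and read out at arbitrary query times without feeding predictions back in. The dynamics function, the fixed-step Euler solver, the linear interpolation, and all numbers are stand-in assumptions; the paper's learned ODE function, solver, graph encoder, and residual decoder are not reproduced here.

```python
import math

def interpolate_forcing(times, values, t):
    """Piecewise-linear interpolation of a scalar forcing series at time t."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    for k in range(len(times) - 1):
        if times[k] <= t <= times[k + 1]:
            w = (t - times[k]) / (times[k + 1] - times[k])
            return (1.0 - w) * values[k] + w * values[k + 1]
    return values[-1]

def latent_dynamics(z, forcing, tau):
    """Hypothetical stand-in for the learned ODE function: relax toward the forcing."""
    return [-(zi - forcing) for zi in z]

def rollout(z0, forcing_times, forcing_values, query_times, dt=0.01):
    """Fixed-step Euler integration, recording the latent state at each query time."""
    z, tau, out = list(z0), 0.0, []
    for tq in sorted(query_times):
        while tau < tq - 1e-12:
            h = min(dt, tq - tau)
            forcing = interpolate_forcing(forcing_times, forcing_values, tau)
            dz = latent_dynamics(z, forcing, tau)
            z = [zi + h * di for zi, di in zip(z, dz)]
            tau += h
        out.append((tq, list(z)))
    return out

# Constant forcing of 1.0, so the exact solution is z(t) = 1 - exp(-t).
trajectory = rollout([0.0], [0.0, 2.0], [1.0, 1.0], query_times=[0.37, 1.0])
print(trajectory)
```

Query times need not lie on any training grid, which is the property the abstract highlights over fixed-step autoregressive emulators.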

Experimental Overview

Tasks/Datasets: COGENT is evaluated on transient ice-sheet simulations generated by the Ice-sheet and Sea-level System Model.
Baselines/Comparisons: COGENT is compared to single-step autoregressive models and multi-horizon emulators.
Main Claimed Findings: COGENT achieves the strongest whole-trajectory performance, reducing the error by 26.3% compared with the best non-COGENT model and by 63.7% compared with the single-step autoregressive model.

What to Verify in the PDF

The effectiveness of the rollout-horizon sampling and progressive rollout-horizon scheduling strategy.
The impact of the graph structure and node features on the performance of COGENT.
The robustness of COGENT to different forcing scenarios and initial conditions.

5) Itô maps for any-step SDEs

Authors: Zhengkai Pan, Peter Potaptchik, Wenxi Yao, Michael S. Albergo, Jakiw Pidstrigach
arXiv: 2606.11156 · pdf
LLM context source: arXiv HTML (html)
Categories: stat.ML, cs.LG

Abstract

Recent one-step generative models accelerate sampling by learning deterministic flow maps of the underlying dynamics. These methods rely on learning from ordinary differential equations, leaving open how to define an exact distillation procedure for stochastic dynamics. We introduce the Itô map, an any-step stochastic flow map that takes an intermediate state and Brownian path and predicts future states in a single pass. The Itô map formulation yields novel estimators for inference-time control by providing cheap, differentiable access to posterior samples. Empirically, Itô maps produce diverse, conditionally valid endpoint samples from fixed intermediate states and support strong steering performance on synthetic and image-generation benchmarks. These results establish any-step SDE integration as a useful primitive for posterior sampling and stochastic control.

Formula and Experiment Notes (LLM)

Formula Walkthrough

Equation 1: Not found in extracted context.

Equation 2: p_{1\mid t}(\cdot\mid x)

Equation: p_{1\mid t}(\cdot\mid x)
Symbols: p_{1\mid t}, \cdot, x
Why it matters: This equation represents the conditional probability distribution of the future state given the current state.

Equation 3: x_{t}

Equation: x_{t}
Symbols: x_{t}
Why it matters: This equation represents the state at time t.

Equation 4: x_{1}

Equation: x_{1}
Symbols: x_{1}
Why it matters: This equation represents the state at time 1.

Equation 5: X_{s}

Equation: X_{s}
Symbols: X_{s}
Why it matters: This equation represents the state at time s.

Equation 6: (W_{t})_{t\in[0,1]}

Equation: (W_{t})_{t\in[0,1]}
Symbols: W_{t}, t \in [0,1]
Why it matters: This equation represents a Brownian path from time 0 to 1.

Method Summary

The authors introduce the Itô map, an any-step stochastic flow map that takes an intermediate state and Brownian path and predicts future states in a single pass.
The Itô map formulation yields novel estimators for inference-time control by providing cheap, differentiable access to posterior samples.
The authors propose a progressive self-distillation (PSD) objective that can be used to train Itô maps, which is based on the semigroup property of the Itô map.
The authors also propose a reward-tilted distribution that can be used for inference-time steering, which is a natural downstream application of the Itô map.
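The semigroup property mentioned above can be illustrated with a conventional numerical integrator standing in for the learned map: for a fixed Brownian path, integrating from time 0 to 1 directly gives the same endpoint as integrating to an intermediate time and continuing from there. The Euler-Maruyama scheme, the drift, the noise scale, and the grid are assumptions for illustration; an Itô map would replace the inner loop with a single network evaluation.

```python
import math
import random

SIGMA = 0.5  # hypothetical constant diffusion coefficient

def drift(x, t):
    """Hypothetical drift (Ornstein-Uhlenbeck style pull toward zero)."""
    return -x

def brownian_increments(n_steps, dt, seed=0):
    """Increments of one Brownian path on a uniform grid."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]

def integrate(x, start, end, increments, dt):
    """Euler-Maruyama from grid index start to end along the given path."""
    for k in range(start, end):
        x = x + drift(x, k * dt) * dt + SIGMA * increments[k]
    return x

n, dt = 100, 0.01
dW = brownian_increments(n, dt, seed=0)
x0 = 1.0

direct = integrate(x0, 0, n, dW, dt)          # time 0 to 1 in one call
midpoint = integrate(x0, 0, 40, dW, dt)       # time 0 to 0.4
composed = integrate(midpoint, 40, n, dW, dt) # time 0.4 to 1 on the same path

# A different Brownian path from the same start gives a different endpoint.
other = integrate(x0, 0, n, brownian_increments(n, dt, seed=1), dt)
print(direct, composed, other)
```

Holding the intermediate state fixed and varying only the remaining part of the path yields different endpoints, which is the sense in which such a map gives access to posterior samples.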

Experimental Overview

The authors evaluate the Itô map on two tasks: same-path stochastic prediction and inference-time steering.
The tasks are performed on two kinds of data: synthetic low-dimensional Gaussian mixtures and image datasets (MNIST and ImageNet).
The authors compare the results of the Itô map with a baseline of deterministic flow maps.
The main claimed findings are that the Itô map produces diverse, conditionally valid endpoint samples and supports strong steering performance on synthetic and image-generation benchmarks.

What to Verify in the PDF

The mathematical derivations and proofs of the Itô map formulation and the PSD objective.
The details of the reward-tilted distribution and its application to inference-time steering.
The experimental results and the comparison with the baseline of deterministic flow maps.