Auto-generated from arXiv metadata + an LLM reading only titles/abstracts. Equations are interpretive; always verify with the PDF.
1) Reinforcement Learning without Ground-Truth Solutions can Improve LLMs
- Authors: Yingyu Lin, Qiyue Gao, Nikki Lijing Kuang, Xunpeng Huang, Kun Zhou, Tongtong Liang, Zhewei Yao, Yi-An Ma, Yuxiong He
- arXiv: 2606.27369 · pdf
- LLM context source: arXiv HTML (html)
- Categories: cs.LG
Abstract
Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically rely on ground-truth answers to assign rewards, limiting their applicability to tasks where the ground-truth solution is unknown. We introduce a Ranking-induced VERifiable framework (RiVER) that trains LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as continuous-valued supervision. When applying group-relative RL to such continuous rewards, we identify two key challenges: scale dominance, where uncalibrated score magnitudes across test instances distort policy updates, and frequency dominance, where repeatedly sampled suboptimal solutions can outweigh rare but stronger candidates. RiVER addresses these challenges with calibrated reward shaping that uses instance-wise comparisons and emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions. We train on 12 AtCoder Heuristic Contest tasks and evaluate on Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. RiVER advances Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank. More importantly, despite training exclusively on score-based tasks without any ground-truth solutions, RiVER also improves the backbones across exact-solution benchmarks such as LiveCodeBench and USACO by an absolute average improvement of 2.4% and 3.5%. By contrast, baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks. These results suggest that score-based optimization tasks, combined with proper reward calibration, can serve as effective training environments for general coding ability without ground-truth solutions.
Formula and Experiment Notes (LLM)
Formula Walkthrough
Equation 1: πθ
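πθ(o | q) (the policy symbol alone; no update rule is recoverable from the context)
- Symbols: πθ (policy being trained), πθ(old) (policy that sampled the responses), ε (clipping range), πref (reference policy)
- Why it matters: πθ is the policy that Group Relative Policy Optimization (GRPO) updates. GRPO does not form πθ as a mixture of πθ(old) and πref. It maximizes a clipped surrogate objective in which the ratio πθ(o | q) / πθ(old)(o | q) is clipped to [1 − ε, 1 + ε], typically with a KL penalty toward πref. Whether RiVER keeps the KL term is not stated in the context.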
Equation 2: o
o = (o1, …, oT)
- Symbols: o (response), o1, …, oT (the T tokens of the response)
- Why it matters: A response sampled from the policy is a sequence of T tokens, not a sequence of separate responses.
Equation 3: D
D = {T_m}, m = 1, …, M
- Symbols: D (dataset), T_m (m-th test instance), M (number of test instances)
- Why it matters: This defines the training dataset as a collection of M test instances.
Equation 4: R
R: Q × O → ℝ
- Symbols: R (reward function), Q (set of problems or prompts), O (set of responses), ℝ (real numbers)
- Why it matters: The reward function maps a problem-response pair to a real-valued reward. In the LLM setting the "state" is the problem given to the model and the "action" is the complete response.
Equation 5: (q, o)
(q, o) ∈ Q × O
- Symbols: q (problem), o (response)
- Why it matters: This is the problem-response pair on which the reward is evaluated.
Equation 6: R
R ∈ {0, 1}
- Symbols: R (reward), {0, 1} (the two possible reward values)
- Why it matters: Standard RLVR uses a binary reward: 1 when the response matches the ground-truth answer and 0 otherwise. This is the setting RiVER departs from by using continuous execution scores.
Equation 7: R(q, o)
R(q, o) ∈ {0, 1}
- Symbols: q (problem), o (response), R (reward)
- Why it matters: This is the same binary reward written as a function of the problem-response pair.
Equation 8: max_θ J(θ)
max_θ J(θ), where J(θ) = E_{q∼D} [ E_{o∼πθ(⋅|q)} [ R(q, o) ] ]
- Symbols: θ (policy parameters), J(θ) (expected reward), q (problem drawn from D), o (response sampled from πθ(⋅|q)), R (reward)
- Why it matters: This is the training objective: maximize the reward averaged over problems drawn from D and over responses sampled from the policy for each problem.
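The objective in Equation 8 and the group-relative baseline used by GRPO-style methods can be illustrated with a short sketch. The function names and the reward lists are hypothetical, and the mean/standard-deviation normalization shown is the common GRPO form; the context does not confirm that RiVER uses exactly this normalization.

```python
import statistics

def estimate_objective(rewards_per_prompt):
    """Monte Carlo estimate of J: mean over prompts of the mean sampled reward."""
    per_prompt_means = [statistics.fmean(r) for r in rewards_per_prompt]
    return statistics.fmean(per_prompt_means)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: the usual GRPO-style (r - mean) / (std + eps) form.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical binary rewards: 2 prompts, 4 sampled responses each.
rewards_per_prompt = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
objective = estimate_objective(rewards_per_prompt)
advantages = group_relative_advantages(rewards_per_prompt[0])
```

Because advantages are centered within each group, responses are only compared with other responses to the same prompt.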
Method Summary
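- RiVER trains LLMs on score-based optimization tasks that have no ground-truth solutions.
- Deterministic execution feedback provides continuous-valued supervision in place of binary correctness rewards.
- Applying group-relative RL to these continuous rewards raises two challenges: scale dominance (uncalibrated score magnitudes across test instances distort policy updates) and frequency dominance (repeatedly sampled suboptimal solutions outweigh rare but stronger candidates).
- RiVER counters both with calibrated reward shaping that compares solutions instance-wise and emphasizes top-ranked solutions while keeping bounded feedback for other valid ones.
- Training uses score-based optimization tasks; evaluation covers ALE-Bench and the exact-solution benchmarks LiveCodeBench and USACO.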
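Scale dominance can be seen in a toy comparison between averaging raw scores and averaging instance-wise ranks. This is a minimal sketch under stated assumptions: the score table is invented, and the rank rule (fraction of other candidates beaten on each instance) is a generic stand-in, not the shaping function RiVER actually uses.

```python
def mean_raw_scores(scores):
    """scores[i][j] = score of candidate i on test instance j (higher is better)."""
    return [sum(row) / len(row) for row in scores]

def mean_rank_rewards(scores):
    """Average, over instances, of the fraction of other candidates beaten.

    Hypothetical calibration rule: bounded in [0, 1] and insensitive to the
    scale of each instance's raw scores. Requires at least two candidates.
    """
    n_candidates = len(scores)
    n_instances = len(scores[0])
    rewards = [0.0] * n_candidates
    for j in range(n_instances):
        column = [scores[i][j] for i in range(n_candidates)]
        for i in range(n_candidates):
            beaten = sum(1 for other in column if other < column[i])
            rewards[i] += beaten / (n_candidates - 1)
    return [r / n_instances for r in rewards]

# Invented scores: instance 0 has small magnitudes, instance 1 has large ones.
scores = [
    [1.0, 3000.0],  # candidate 0: worst on instance 0, best on instance 1
    [2.0, 2000.0],  # candidate 1: middle on both
    [3.0, 1000.0],  # candidate 2: best on instance 0, worst on instance 1
]
raw = mean_raw_scores(scores)
ranked = mean_rank_rewards(scores)
```

With raw averaging, the large-magnitude instance alone decides the ordering; with instance-wise ranks, each instance contributes equally and the three candidates tie.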
Experimental Overview
- The training set consists of 12 AtCoder Heuristic Contest tasks.
- Baselines include raw-score baselines and instance-wise ranking baselines.
- The main claimed findings are:
  - RiVER improves both backbones on every benchmark.
  - RiVER raises the ALE rating by 142 points on Qwen3-8B and 157 points on GLM-Z1-9B-0414.
  - RiVER improves both backbones on LCB v5, LCB v6, and USACO (the abstract reports absolute average gains of 2.4% and 3.5%).
What to Verify in the PDF
- The details of the reward shaping mechanism used in the RiVER framework.
- The evaluation of the raw-score baselines and instance-wise ranking baselines on the exact-solution benchmarks.
- The analysis of the frequency dominance challenge and how the RiVER framework addresses it.
- The comparison of the RiVER framework with other score-based optimization approaches.
2) Autoregressive Boltzmann Generators
- Authors: Danyal Rehman, Charlie B. Tan, Yoshua Bengio, Avishek Joey Bose, Alexander Tong
- arXiv: 2606.27361 · pdf
- LLM context source: abstract only
- Categories: cs.LG, cs.AI
Abstract
Efficient sampling of molecular systems at thermodynamic equilibrium is a hallmark challenge in statistical physics. This challenge has driven the development of Boltzmann Generators (BGs), which allow rapid generation of uncorrelated equilibrium samples by combining a generative model with exact likelihoods and an importance sampling correction. However, modern BGs predominantly rely on normalizing flows (NFs), which either suffer from limited expressivity due to strict invertibility constraints (discrete time) or computationally expensive likelihoods (continuous time). In this paper, we propose Autoregressive Boltzmann Generators (ArBG) – a novel autoregressive modelling framework – that overcomes these limitations by departing from the flow-based BG paradigm. ArBG circumvents the topological constraints of flows and enables sequential inference-time interventions, while offering enhanced scalability by leveraging architectures effective in Large Language Models. We empirically demonstrate that ArBG leads to significant improvements over flow-based models across all benchmarks, but particularly in larger peptide systems such as the 10-residue Chignolin. Furthermore, we introduce Robin, a 132 million parameter transferable model trained with the ArBG framework which improves over the previous state-of-the-art, reducing the zero-shot energy error, E-W2, on 8-residue systems by over 60%. The code can be found at the following link: https://github.com/danyalrehman/autobg.
Formula and Experiment Notes (LLM)
Formula Walkthrough
- Importance Sampling Correction
  - Equation: Not explicitly provided in the context
  - Symbols: Not provided
  - Why it matters: The importance sampling correction is a crucial component of Boltzmann Generators, allowing for efficient sampling of molecular systems at thermodynamic equilibrium. The exact formulation of this correction is not provided in the context.
- Normalizing Flows (NFs)
  - Equation: Not explicitly provided in the context
  - Symbols: Not provided
  - Why it matters: NFs are a type of generative model used in Boltzmann Generators, but the context notes their limitations, including limited expressivity due to strict invertibility constraints and computationally expensive likelihoods.
- Autoregressive Boltzmann Generators (ArBG)
  - Equation: Not explicitly provided in the context
  - Symbols: Not provided
  - Why it matters: ArBG is a novel autoregressive modelling framework proposed in the paper, which overcomes the limitations of NFs and enables sequential inference-time interventions.
- Zero-Shot Energy Error (E-W2)
  - Equation: Not explicitly provided in the context
  - Symbols: Not provided
  - Why it matters: E-W2 is a metric used to evaluate the performance of Boltzmann Generators, and the context notes that ArBG improves over the previous state-of-the-art, reducing E-W2 by over 60% on 8-residue systems.
- Robin Model
  - Equation: Not explicitly provided in the context
  - Symbols: Not provided
  - Why it matters: Robin is a 132 million parameter transferable model trained with the ArBG framework, which improves over the previous state-of-the-art and demonstrates the effectiveness of ArBG in practice.
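The importance sampling correction named in the first entry above follows a generic pattern that can be sketched independently of the paper: draw samples from a model q with exact likelihoods, weight each by exp(−βU(x)) / q(x), and form self-normalized averages. The sketch below is an illustration under stated assumptions, not the ArBG implementation: the one-dimensional energy U(x) = x²/2, the Gaussian proposal, and all function names are hypothetical.

```python
import math
import random

def energy(x):
    """Toy energy U(x) = x^2 / 2, so the Boltzmann target at beta = 1 is N(0, 1)."""
    return 0.5 * x * x

def proposal_logpdf(x, sigma):
    """Exact log-likelihood of the stand-in generative model q = N(0, sigma^2)."""
    return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def snis_estimate(observable, n_samples=20000, sigma=2.0, beta=1.0, seed=0):
    """Self-normalized importance sampling estimate of E_target[observable]."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
    # Unnormalized log-weights: log exp(-beta * U(x)) - log q(x).
    log_w = [-beta * energy(x) - proposal_logpdf(x, sigma) for x in xs]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    estimate = sum(wi * observable(x) for wi, x in zip(w, xs)) / total
    effective_sample_size = total ** 2 / sum(wi * wi for wi in w)
    return estimate, effective_sample_size

second_moment, ess = snis_estimate(lambda x: x * x)
```

The target's normalizing constant cancels in the self-normalized ratio, which is why only the energy and the model's exact likelihood are needed. The effective sample size indicates how well the proposal covers the target.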
Method Summary
- The authors propose Autoregressive Boltzmann Generators (ArBG), a novel autoregressive modelling framework that overcomes the limitations of normalizing flows (NFs) in Boltzmann Generators.
- ArBG enables sequential inference-time interventions and offers enhanced scalability by leveraging architectures effective in Large Language Models.
- The authors introduce Robin, a 132 million parameter transferable model trained with the ArBG framework, which improves over the previous state-of-the-art.
Experimental Overview
- Tasks/Datasets: The authors evaluate ArBG on various benchmarks, including peptide systems such as the 10-residue Chignolin.
- Baselines/Comparisons: The authors compare ArBG to flow-based models and note that ArBG leads to significant improvements across all benchmarks.
- Main Claimed Findings: ArBG improves over flow-based models, particularly in larger peptide systems, and reduces the zero-shot energy error (E-W2) on 8-residue systems by over 60%.
What to Verify in the PDF
- The exact formulation of the importance sampling correction used in ArBG.
- The mathematical derivations of the ArBG framework and its advantages over NFs.
- The details of the Robin model, including its architecture and training procedure.
3) When are likely answers right? On Sequence Probability and Correctness in LLMs
- Authors: Johannes Zenn, Jonas Geiping
- arXiv: 2606.27359 · pdf
- LLM context source: arXiv HTML (html)
- Categories: stat.ML, cs.LG
Abstract
Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, models, and benchmarks at four levels: across decoding methods, across hyperparameters within a method, across prompt-answer pairs within a dataset, and across repeated responses to the same prompt. We find that higher sequence probability is often predictive of correctness across prompt-answer pairs within a fixed dataset. However, this relationship does not generally transfer to decoding decisions: increasing sequence probability by changing hyperparameters or methods does not reliably improve accuracy. Further, sequence probability is not a good indicator of correctness for responses to the same prompt. These findings clarify when decoding can and cannot be expected to improve correctness, and provide practical guidance for decoding, self-consistency, and verifier-free self-improvement.
Formula and Experiment Notes (LLM)
Formula Walkthrough
Equation 1: N→∞
- Equation: N→∞
- Symbols: N (not defined in the extracted context)
- Why it matters: This is a limit rather than an equation. Because Equation 3 uses T for sequence length, N is more plausibly a sample count, such as the number of responses drawn per prompt, than a token count; confirm in the PDF. The limit would then describe behavior as the number of samples grows without bound.
Equation 2: N=32
- Equation: N=32
- Symbols: N (same quantity as in Equation 1)
- Why it matters: This fixes N at 32 for an experiment or analysis, presumably as the finite counterpart of the N→∞ limit.
Equation 3: \mathbf{s}=(s_1,\dots,s_T)
- Equation: \mathbf{s}=(s_1,\dots,s_T)
- Symbols: \mathbf{s} (sequence), s_1, \dots, s_T (individual tokens), T (sequence length)
- Why it matters: This defines a continuation as an ordered sequence of T tokens, the object whose probability the paper studies.
Equation 4: \bar{\mathbf{s}}
- Equation: \bar{\mathbf{s}}
- Symbols: \bar{\mathbf{s}} (conditioning context, i.e. the prompt)
- Why it matters: Equation 5 conditions on \bar{\mathbf{s}}, and the abstract defines sequence probability as the conditional probability of a continuation given a prompt, so \bar{\mathbf{s}} denotes the prompt rather than an average sequence.
Equation 5: p(\mathbf{s}\mid\bar{\mathbf{s}})=\prod_{t=1}^{T}P(s_t\mid\bar{\mathbf{s}},\mathbf{s}_{<t})
- Equation: p(\mathbf{s}\mid\bar{\mathbf{s}})=\prod_{t=1}^{T}P(s_t\mid\bar{\mathbf{s}},\mathbf{s}_{<t})
- Symbols: p(\mathbf{s}\mid\bar{\mathbf{s}}) (probability of the continuation given the prompt), T (sequence length, not temperature), s_t (token at position t), \bar{\mathbf{s}} (prompt), \mathbf{s}_{<t} (tokens before position t)
- Why it matters: This is the autoregressive chain rule: the sequence probability is the product of the per-token conditional probabilities. It is the quantity whose alignment with correctness the paper measures.
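The chain-rule factorization in Equation 5 can be made concrete with a toy next-token table. The table, its token names, and the function name are hypothetical; a real model would supply the per-token probabilities.

```python
import math

# Hypothetical next-token table: context (prompt + tokens so far) -> P(next token).
TOY_MODEL = {
    ("Q",): {"A": 0.5, "B": 0.5},
    ("Q", "A"): {"x": 0.5, "y": 0.5},
    ("Q", "B"): {"x": 0.9, "y": 0.1},
}

def sequence_logprob(prompt, continuation, model=TOY_MODEL):
    """Sum of log P(s_t | prompt, s_<t) over the tokens of the continuation."""
    context = tuple(prompt)
    total = 0.0
    for token in continuation:
        total += math.log(model[context][token])
        context = context + (token,)  # the token becomes part of the prefix
    return total

logp = sequence_logprob(["Q"], ["B", "x"])
prob = math.exp(logp)
```

Summing log-probabilities instead of multiplying probabilities avoids numerical underflow on long sequences.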
Method Summary
- The paper discusses two classes of decoding methods for large language models: local decoding methods and global decoding methods.
- Local decoding methods change the next-token distribution at every prefix, while global decoding methods target the globally most likely sequences.
- The paper also discusses the variational objective for global decoding, which trades off expected sequence probability and entropy.
- The authors find that higher sequence probability is often predictive of correctness, but this relationship does not generally transfer to decoding decisions.
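The distinction between local and global decoding in the summary above can be illustrated on a two-step toy model. This is a sketch under stated assumptions: the probability tables are invented, "local" is modeled as raising each next-token distribution to a power α (equivalent to temperature 1/α), and "global" as raising the full sequence distribution to the same power. The paper's own methods may differ in detail.

```python
# Invented two-step model: first token, then second token given the first.
FIRST = {"A": 0.5, "B": 0.5}
SECOND = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.9, "y": 0.1}}

def sharpen(dist, alpha):
    """Raise probabilities to the power alpha and renormalize."""
    powered = {k: v ** alpha for k, v in dist.items()}
    z = sum(powered.values())
    return {k: v / z for k, v in powered.items()}

def local_distribution(alpha):
    """Sharpen each next-token distribution separately, then chain them."""
    first = sharpen(FIRST, alpha)
    return {
        (a, b): first[a] * sharpen(SECOND[a], alpha)[b]
        for a in FIRST
        for b in SECOND[a]
    }

def global_distribution(alpha):
    """Sharpen the joint distribution over complete sequences."""
    joint = {(a, b): FIRST[a] * SECOND[a][b] for a in FIRST for b in SECOND[a]}
    return sharpen(joint, alpha)

local = local_distribution(2.0)
globl = global_distribution(2.0)
```

The most likely sequence ("B", "x") receives more mass under the global rule than under the local one, because token-level sharpening cannot see that the prefix "B" leads to a more concentrated continuation.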
Experimental Overview
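- The extracted context names vLLM (an inference engine) and the decoding methods power-SMC and SPS; these are not datasets or models, so the specific benchmarks and models must be taken from the PDF.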
- The authors compare the correlation coefficients between sequence probability and correctness across different methods and datasets.
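- The main claimed findings are that higher sequence probability often predicts correctness across prompt-answer pairs within a fixed dataset, that raising sequence probability by changing hyperparameters or methods does not reliably improve accuracy, and that sequence probability is not a good indicator of correctness among responses to the same prompt.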
What to Verify in the PDF
- The authors’ implementation of the power-SMC algorithm and its effects on sequence probability and correctness.
- The results of the additional experiments and analysis, including the correlation coefficients and sequence lengths.
- The authors’ discussion of the limitations and potential biases of the experimental setup.
4) Error-Conditioned Neural Solvers
- Authors: Haina Jiang, Liam Wang, Peng-Chen Chen, Min Seop Kwak, Seungryong Kim, Brian Bell, Jeong Joon Park
- arXiv: 2606.27354 · pdf
- LLM context source: arXiv HTML (html)
- Categories: cs.LG, cs.AI, cs.CV, math.NA
Abstract
Neural surrogate models offer fast approximate mappings from PDE parameters to solutions, but they typically treat solving as a purely statistical task: once trained, they struggle to correct their own constraint violations and extrapolate beyond the training distribution. Recent hybrid methods promote physical correctness by targeting the PDE residual via gradient descent or Gauss–Newton steps, but inherit the compute cost and instability of the underlying classical optimizers. We show, theoretically and empirically, that numerically minimizing the PDE residual can be an unreliable proxy for reconstruction accuracy in ill-conditioned systems, explaining why these methods often do not make accurate predictions despite achieving low residuals. We propose error-conditioned Neural Solvers (ENS), built on a different principle: rather than an optimization target, the PDE residual field is passed as a direct input to the network at each iteration, enabling it to read the spatial structure of its own errors and learn an update policy to iteratively correct its predictions. Across four PDE families, ENS attains the highest prediction accuracy in the large majority of settings, with gains reaching $10\times$ on turbulent Kolmogorov flow, while avoiding the expensive compute cost of hybrid methods. ENS’s learned correction policy generalizes under distribution shift, including zero-shot parameter changes and cross-equation transfer, where its relative advantage is largest in the ill-conditioned regimes where residual minimization is least reliable. Project website: https://neuralsolver.github.io/.
Formula and Experiment Notes (LLM)
Formula Walkthrough
Equation 1: 10 ×
- Equation: 10 ×
- Symbols: × (times, as in a tenfold factor)
- Why it matters: This is a magnitude rather than an equation. The abstract reports that the accuracy gains of Error-Conditioned Neural Solvers (ENS) reach 10× on turbulent Kolmogorov flow.
Equation 2: L1
- Equation: L1
- Symbols: L1 (not defined in the extracted context)
- Why it matters: The symbol appears without a definition. It may denote an L1 loss or error norm; confirm its role in the PDF.
Equation 3: r(u) = F(u; f)
- Equation: r(u) = F(u; f)
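- Symbols: r(u) (PDE residual of a candidate solution u), F (PDE operator in residual form), f (problem data, such as PDE parameters or forcing)
- Why it matters: The residual measures how strongly a candidate u violates the PDE and vanishes at the exact solution. It is not the difference between the prediction and the true solution: the abstract stresses that a low residual can coexist with a large reconstruction error in ill-conditioned systems.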
Equation 4: r^(k)
- Equation: r^(k)
- Symbols: r^(k) (residual at iteration k)
- Why it matters: This is the residual of the iterate at step k. ENS passes this residual field to the network as a direct input at each iteration, so the network can read the spatial structure of its own error and produce a correction.
Equation 5: u^*
- Equation: u^*
- Symbols: u^* (exact solution)
- Why it matters: This equation represents the exact solution to the partial differential equation (PDE). The ENS method aims to approximate this solution.
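The gap between a small residual and a small error in ill-conditioned systems can be reproduced with a two-by-two linear example. This is a minimal sketch, assuming a linear stand-in F(u; f) = A u − f; the matrix, the candidate, and the helper names are invented for illustration and do not come from the paper.

```python
import math

# Nearly singular matrix: its two rows are almost identical (ill-conditioned).
A = [[1.0, 1.0], [1.0, 1.0001]]
U_STAR = [1.0, 1.0]  # exact solution u*

def matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def norm(vector):
    return math.sqrt(sum(x * x for x in vector))

F_RHS = matvec(A, U_STAR)  # right-hand side f chosen so that A u* = f

def residual(u):
    """Linear stand-in for r(u) = F(u; f): here A u - f."""
    return [a - b for a, b in zip(matvec(A, u), F_RHS)]

def error(u):
    """Reconstruction error u - u*."""
    return [a - b for a, b in zip(u, U_STAR)]

candidate = [2.0, 0.0]  # far from u*, yet it almost satisfies the equations
residual_norm = norm(residual(candidate))
error_norm = norm(error(candidate))
```

The candidate's residual norm is about 1e-4 while its error norm exceeds 1, so driving the residual down does not by itself guarantee an accurate solution.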
Method Summary
- The Error-Conditioned Neural Solvers (ENS) method is a novel approach to solving partial differential equations (PDEs) using neural networks.
- The method consists of a predictor and a recurrent corrector, which are trained separately to minimize the L2 reconstruction error and the PDE-residual MSE.
- The predictor is a Fourier Neural Operator (FNO) with CNN lifting and projection layers, while the corrector is applied recurrently and uses a transformer-based backbone.
- ENS is trained on in-distribution datasets and tested on held-out in-distribution data plus four out-of-distribution regimes, of which the context names three: super-resolution, parameter extrapolation, and cross-equation transfer.
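The predictor-corrector structure described above amounts to a loop in which the corrector receives the current residual as an input. The sketch below shows only that interface. Everything in it is an assumption: the scalar problem u³ + u = f replaces a PDE, and a hand-written Newton-style rule replaces the trained network so that the loop is runnable.

```python
def residual(u, f):
    """Toy scalar stand-in for r(u) = F(u; f): here F(u; f) = u**3 + u - f."""
    return u ** 3 + u - f

def toy_corrector(u, r):
    """Hand-written placeholder for the learned corrector.

    In ENS this role is played by a trained network; a Newton step is used
    here only so the example runs without any training.
    """
    return -r / (3.0 * u ** 2 + 1.0)

def error_conditioned_solve(u0, f, corrector, n_iters=50):
    """Iterate u <- u + corrector(u, r), feeding the residual r as an input."""
    u = u0
    residual_history = []
    for _ in range(n_iters):
        r = residual(u, f)
        residual_history.append(abs(r))
        u = u + corrector(u, r)
    return u, residual_history

solution, history = error_conditioned_solve(0.0, 10.0, toy_corrector)
```

Swapping `toy_corrector` for a trained model with the same `(u, r)` signature would leave the loop unchanged, which is the design point: the residual is an input to the update rule, not a loss being minimized at test time.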
Experimental Overview
- Tasks/Datasets:
  - Four PDE families: linear and nonlinear Helmholtz, Darcy flow, Poisson, and Navier-Stokes in vorticity form.
  - Four out-of-distribution regimes, of which the context names three: super-resolution, parameter extrapolation, and cross-equation transfer.
- Baselines/Comparisons:
  - FNO (Li et al., 2020)
  - PINO (Li et al., 2024) with and without test-time optimization (TTOP)
  - POSEIDON (Herde et al., 2024)
  - DiffusionPDE (Huang et al., 2024)
  - PCFM (Utkarsh et al., 2025)
- Main Claimed Findings:
  - ENS achieves the highest prediction accuracy in the large majority of settings.
  - ENS attains gains reaching 10× on turbulent Kolmogorov flow.
  - ENS avoids the expensive compute cost of hybrid methods.
What to Verify in the PDF
- The data generation protocols for both training and in-distribution test data.
- The configurations for extrapolation test data.
- The detailed analysis of the ENS method’s performance on each PDE family.
- The comparison of ENS with other methods in terms of reconstruction error and PDE-residual MSE.
5) All you need is log
- Authors: Akshay Balsubramani
- arXiv: 2606.27349 · pdf
- LLM context source: arXiv HTML (html)
- Categories: cs.IT, math.PR, math.ST, stat.ML
Abstract
Comparing two probability distributions is a basic building block of statistics and machine learning, and the right family is well understood: the Rényi divergences of order $α\in[0,\infty]$ are the unique family monotone under data processing and additive on independent products. Many problems instead compare more than two distributions at once – multi-population fairness, multi-prior PAC-Bayes bounds, multi-hypothesis testing – and the right multi-distribution generalization of the Rényi family has been an open question. We characterize it. Every functional of $W$-tuples of distributions that is monotone under data processing and additive on independent products is a positive integral of multi-way coincidence divergences $C_α(π_1,\dots,π_W) := -\log\int π_1^{α_1}\cdotsπ_W^{α_W}$ (with $\sum_k α_k = 1$) over a parameter space with four strata: the simplex interior; mixed-sign exponent cones (the analogue of Rényi orders $>1$); a tropical boundary at infinity carrying max-divergences; and pairwise Kullback-Leibler edges at the simplex vertices. Each stratum is necessary – the destination of an explicit data-processing-monotone, product-additive divergence the others cannot reproduce – and each is a clean limit of simplex-interior atoms. The same family arises from five independent routes – the structural axioms, Kolmogorov-Nagumo means with Rényi’s entropy axiomatics, classical entropy characterizations, multi-hypothesis testing error exponents, and a multi-lottery betting interpretation – structural evidence that this is the canonical multi-distribution Rényi calculus rather than an artefact of any one axiomatic input. The two-prior case recovers the standard Rényi result; a worked $W=3$ instance, numerical verification, and a conditional extension round out the treatment.
Formula and Experiment Notes (LLM)
Formula Walkthrough
Equation 1: α ∈ [0, ∞]
\alpha\in[0,\infty]
- Symbols: α (order of the Rényi divergence)
- Why it matters: This is the range of orders for the classical two-distribution Rényi divergences, the family the paper generalizes.
Equation 2: C_{α}(\pi_{1}, …, π_{W})
C_{\alpha}(\pi_{1},\dots,\pi_{W}):=-\log\!\int\pi_{1}^{\alpha_{1}}\cdots\pi_{W}^{\alpha_{W}},\qquad\textstyle\sum_{k}\alpha_{k}=1
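- Symbols: C_{α} (multi-way coincidence divergence), π_{1} to π_{W} (probability distributions), α_{k} (exponents), W (number of distributions)
- Why it matters: This defines the multi-way coincidence divergence C_{α}, the building block of the characterization: every data-processing-monotone, product-additive functional of W-tuples of distributions is a positive integral of these divergences. The exponents α_{k} sum to 1.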
Equation 3: α_{k} ∈ [0, 1]
\alpha_{k}\in[0,1]
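- Symbols: α_{k} (exponent attached to π_{k})
- Why it matters: Together with the constraint that the exponents sum to 1, this places the exponent vector on the probability simplex, whose interior is one of the four strata of the parameter space.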
Equation 4: α_{l} > 1
\alpha_{l}>1
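- Symbols: α_{l} (one exponent of the vector α)
- Why it matters: Because the exponents sum to 1, an exponent α_{l} > 1 forces at least one other exponent to be negative. This is the mixed-sign regime that the abstract describes as the analogue of Rényi orders above 1.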
Equation 5: ≤ 0
\leq 0
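- Symbols: none recoverable (the left-hand side is missing from the extraction)
- Why it matters: The fragment is an inequality whose subject was lost. Read next to Equation 4, it plausibly constrains the remaining exponents in the mixed-sign regime (α_{k} ≤ 0 for k ≠ l) rather than the divergence itself; confirm in the PDF.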
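For distributions on a finite alphabet, the integral in Equation 2 becomes a sum, which makes the definition easy to check numerically. The sketch below is an illustration with invented distributions and hypothetical function names. It checks three properties: additivity on independent products, nonnegativity for exponents in the simplex (a consequence of Hölder's inequality), and the two-distribution identity C_{(α, 1−α)}(P, Q) = (1 − α) D_α(P‖Q) with the standard Rényi divergence D_α.

```python
import math

def coincidence_divergence(dists, alphas):
    """C_alpha = -log sum_x prod_k pi_k(x)^alpha_k, strictly positive pmfs assumed."""
    if abs(sum(alphas) - 1.0) > 1e-12:
        raise ValueError("exponents must sum to 1")
    total = 0.0
    for probs in zip(*dists):
        term = 1.0
        for p, a in zip(probs, alphas):
            term *= p ** a
        total += term
    return -math.log(total)

def renyi_divergence(p, q, alpha):
    """Standard Renyi divergence of order alpha (alpha != 1) on a finite alphabet."""
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1.0)

def product_dist(p, r):
    """Independent product of two pmfs, flattened in row-major order."""
    return [pi * rj for pi in p for rj in r]

# Invented distributions on a 3-letter and a 2-letter alphabet.
P1, P2, P3 = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2], [0.1, 0.6, 0.3]
R1, R2, R3 = [0.7, 0.3], [0.5, 0.5], [0.2, 0.8]
ALPHAS = [0.5, 0.3, 0.2]

c_first = coincidence_divergence([P1, P2, P3], ALPHAS)
c_second = coincidence_divergence([R1, R2, R3], ALPHAS)
c_product = coincidence_divergence(
    [product_dist(P1, R1), product_dist(P2, R2), product_dist(P3, R3)], ALPHAS
)
c_pair = coincidence_divergence([P1, P2], [0.3, 0.7])
```

Here `c_product` equals `c_first + c_second` up to rounding, and `c_pair` equals `0.7 * renyi_divergence(P1, P2, 0.3)`.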
Method Summary
- The authors use a functional-analytic argument to generalize the Rényi divergences to multiple probability measures.
- They invoke the matrix-majorization spectrum result as a black box and assemble the three boundary strata as necessary.
- The authors also use a Riesz-Markov representation against the richer parameter space on a locally compact Hausdorff topology.
Experimental Overview
- Tasks:
  - Compare multiple probability distributions using the multi-way coincidence divergences.
  - Evaluate the proposed method on synthetic and real-world datasets (the abstract mentions only a worked W=3 instance and numerical verification).
- Datasets:
  - Synthetic datasets for controlled relative precision.
  - Real-world datasets for stress-testing the structural axioms.
- Baselines:
  - KL divergence.
  - L_{∞} aggregator.
- Main claimed findings:
  - The proposed method recovers the L_{∞} aggregator at α = ∞.
  - The method achieves high precision and low bias on synthetic and real-world datasets.
What to Verify in the PDF
- The proof of the Choquet linearity sweep across all atom-family cells.
- The spectral reconstruction of the representing measure (m^{D}, m^{D^{T}}, c_{kℓ}) from finite D-values.
- The high-W stress test confirming the per-atom structural identities survive at large prior count and alphabet.