Auto-generated from arXiv metadata + an LLM reading only titles/abstracts. Equations are interpretive; always verify with the PDF.
1) Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
- Authors: Albert Alcalde, Leon Bungert, Konstantin Riedl, Tim Roith
- arXiv: 2605.10931 · pdf
- LLM context source: arXiv HTML (html)
- Categories: math.AP, cs.LG, math.DS
Abstract
Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like $\sqrt{{\log(β+1)}/β}\exp(Ct)+\exp(-ct)$ in terms of the temperature parameter $β^{-1}\to 0$ and inference time $t\geq 0$. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as $t\to\infty$, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that for time scales of order $\logβ$ the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite $β$ and large $t$ the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
Formula and Experiment Notes (LLM)
Formula Walkthrough
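Equation 1: \sqrt{\log(\beta+1)/\beta}\,\exp(Ct)+\exp(-ct)
- Equation: \sqrt{\log(\beta+1)/\beta}\,\exp(Ct)+\exp(-ct)
- Symbols:
- β: inverse temperature (the temperature parameter is β^{-1})
- C, c: constants in the growth term and the decay term, respectively
- t: inference time
- Why it matters: This is the rate at which the Wasserstein distance between the token distribution at time t and the push-forward of the initial distribution under the projection map is stated to scale. The first term grows with t but vanishes as β → ∞, while the second term decays with t, so the bound is small only on an intermediate time window.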
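To see why the abstract singles out time scales of order log β, the two terms of the rate can be balanced numerically. The sketch below evaluates the rate and the time at which it is smallest; the values C = c = 1 are placeholders chosen for illustration, since the abstract does not give the constants, and the function names are hypothetical.

```python
import math

def rate(beta, t, C=1.0, c=1.0):
    """Stated rate sqrt(log(beta+1)/beta) * exp(C t) + exp(-c t).
    C and c are placeholder values, not taken from the paper."""
    a = math.sqrt(math.log(beta + 1.0) / beta)
    return a * math.exp(C * t) + math.exp(-c * t)

def balancing_time(beta, C=1.0, c=1.0):
    """Time minimising the rate: set the t-derivative to zero,
    a*C*exp(C t) = c*exp(-c t), which gives t = log(c / (a C)) / (C + c)."""
    a = math.sqrt(math.log(beta + 1.0) / beta)
    return max(0.0, math.log(c / (a * C)) / (C + c))

if __name__ == "__main__":
    for beta in (1e2, 1e4, 1e6, 1e8):
        t_star = balancing_time(beta)
        print(f"beta={beta:.0e}  t*={t_star:.3f}  rate(t*)={rate(beta, t_star):.4f}")
```

Because a behaves like β^{-1/2} up to a logarithmic factor, the balancing time grows like log β / (2(C + c)), which matches the log β time scale mentioned in the abstract.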
Equation 2: \beta^{-1} \to 0
- Equation: β^{-1} \to 0
- Symbols:
- β: inverse temperature (the temperature parameter is β^{-1})
- Why it matters: This is the low-temperature (zero-temperature) limit: the temperature β^{-1} tends to 0, which means β → ∞.
Equation 3: t ≥ 0
- Equation: t ≥ 0
- Symbols:
- t: inference time
- Why it matters: This equation represents the domain of the inference time t, which is non-negative.
Equation 4: t → ∞
- Equation: t → ∞
- Symbols:
- t: inference time
- Why it matters: This equation represents the limit of the inference time t as it approaches infinity, which is the long-term behavior of the system.
Equation 5: \log\beta
- Equation: \log\beta
- Symbols:
- β: inverse temperature (the temperature parameter is β^{-1})
- Why it matters: This is the time scale on which the result applies: according to the abstract, for times of order log β the token distribution concentrates at the identified limiting distribution.
Equation 6: \beta
- Equation: \beta
- Symbols:
- β: inverse temperature (the temperature parameter is β^{-1})
- Why it matters: β controls how sharply the attention weights concentrate. The analysis concerns the regime of large β, and the experiments examine finite β.
Equation 7: \{x_i\}_{i=1}^{n} \subset \mathbb{R}^{d}
- Equation: \{x_i\}_{i=1}^{n} \subset \mathbb{R}^{d}
- Symbols:
- x_i: tokens
- n: number of tokens
- d: dimensionality
- Why it matters: This equation represents the set of tokens x_i, which are elements of the input space \mathbb{R}^{d}.
Equation 8: \dot{x}_i(t)
- Equation: \dot{x}_i(t)
- Symbols:
- x_i: tokens
- t: inference time
- Why it matters: This equation represents the time derivative of the tokens x_i with respect to the inference time t.
Method Summary
- The authors study the evolution of tokens in deep encoder-only transformers at inference time using a mean-field continuity equation.
- They leverage ideas from the convergence analysis of interacting multi-particle systems to prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices.
- The authors establish Lyapunov-type estimates for the zero-temperature equation and employ a stability estimate in Wasserstein space to couple the two equations.
Experimental Overview
- Tasks/Datasets: The authors use an explicit Euler discretization of the mean-field continuity equation to simulate the evolution of tokens in deep encoder-only transformers.
- Baselines/Comparisons: No external baselines appear in the extracted context; the numerical experiments are compared against the theoretical predictions on concentration.
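- Main Claimed Findings: For time scales of order log β, the token distribution concentrates at the identified limiting distribution. The extracted context describes this as concentration near a dominant eigenspace E, but the associated matrix expression (extracted as V^TB^{\top}) appears garbled and should be checked in the PDF. For finite β and large t, the dynamics enter a different terminal phase dominated by the spectrum of the value matrix V.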
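The explicit Euler scheme mentioned in the experimental overview can be sketched on a finite set of tokens. The code below assumes the softmax self-attention dynamics commonly used in this literature, \dot{x}_i = \sum_j softmax_j(β ⟨Q x_i, K x_j⟩) V x_j; this specific form, the function names, and the step size are assumptions for illustration and are not confirmed by the abstract.

```python
import numpy as np

def attention_velocity(X, Q, K, V, beta):
    """Velocity of every token under softmax self-attention (assumed model).
    X has shape (n, d): one token per row."""
    logits = beta * (X @ Q.T) @ (X @ K.T).T       # logits[i, j] = beta * <Q x_i, K x_j>
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stabilisation
    W = np.exp(logits)
    W = W / W.sum(axis=1, keepdims=True)          # each row is a softmax distribution
    return W @ (X @ V.T)                          # row i: sum_j W[i, j] * V x_j

def euler_simulate(X0, Q, K, V, beta, dt, steps):
    """Explicit Euler: X <- X + dt * velocity(X), repeated `steps` times."""
    X = np.array(X0, dtype=float)
    for _ in range(steps):
        X = X + dt * attention_velocity(X, Q, K, V, beta)
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 64, 2
    X0 = rng.standard_normal((n, d))
    I = np.eye(d)
    for beta in (1.0, 10.0, 100.0):
        XT = euler_simulate(X0, I, I, I, beta, dt=0.01, steps=200)
        print(beta, np.linalg.norm(XT, axis=1).mean())
```

Running the same initial tokens for several values of β gives a quick qualitative view of how the temperature changes the evolution; it does not reproduce the paper's experiments.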
What to Verify in the PDF
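- The authors claim that the Wasserstein distance between the token distribution at time t and the push-forward of the initial distribution under the projection map scales like \sqrt{\log(\beta+1)/\beta}\,\exp(Ct)+\exp(-ct). Check the exact statement and the constants C and c.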
- The authors also claim that the time scales at which the two effects are present can be derived directly from Theorem 1.
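- The authors mention that the numerical experiments confirm the predicted concentration up to time scales of order log(β). The extracted context phrases this as concentration near a dominant eigenspace E of a matrix extracted as V^TB^{\top}; this expression appears garbled, so check which matrix is meant.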
- The authors also mention that the system enters a terminal phase for large t, dominated by the spectrum of the value matrix V.
2) Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges
- Authors: Usman A. Khan, Joseph W. Durham
- arXiv: 2605.10917 · pdf
- LLM context source: arXiv HTML (html)
- Categories: cs.LG, cs.MA, cs.RO
Abstract
We consider anonymous multi-agent path finding (MAPF) where a set of robots is tasked to travel to a set of targets on a finite, connected graph. We show that MAPF can be cast as a special class of multi-marginal optimal transport (MMOT) problems with an underlying Markovian structure, under which the exponentially large MMOT collapses to a linear program (LP) polynomial in size. Focusing on the anonymous setting, we establish conditions under which the corresponding LP is feasible, totally unimodular, and consequently, yields min-cost, integral $({0,1})$ transports that do not overlap in both space and time. To adapt the approach to large-scale problems, we cast the MAPF-MMOT in a probabilistic framework via Schrödinger bridges. Under standard assumptions, we show that the Schrödinger bridge formulation reduces to an entropic regularization of the corresponding MMOT that admits an iterative Sinkhorn-type solution. The Schrödinger bridge, being a probabilistic framework, provides a shadow (fractional) transport that we use as a template to solve a reduced LP and demonstrate that it results in near-optimal, integral transports at a significant reduction in complexity. Extensive experiments highlight the optimality and scalability of the proposed approaches.
Formula and Experiment Notes (LLM)
Formula Walkthrough
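Equation 1: \{0,1\}
- Equation: Not found in extracted context.
- Symbols: Not found in extracted context.
- Why it matters: In the abstract, \{0,1\} denotes the integral transports produced by the LP: under the stated conditions the LP is totally unimodular, so its min-cost solutions take only the values 0 and 1 and do not overlap in space or time.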
Equation 2: ε → 0
- Equation: Not found in extracted context.
- Symbols: ε (epsilon)
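- Why it matters: ε is most likely the entropic regularization strength of the Schrödinger bridge (Sinkhorn) formulation; this reading is inferred from the abstract, not from an extracted equation. In standard entropic optimal transport, ε → 0 recovers the unregularized transport problem, typically at the cost of slower and less stable Sinkhorn iterations.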
Equation 3: 10%
- Equation: Not found in extracted context.
- Symbols: Not found in extracted context.
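- Why it matters: Per the extracted context, the cost gap of the proposed approaches is reported as consistently below 10%; what the gap is measured against is not in the extracted context.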
Equation 4: 60–80%
- Equation: Not found in extracted context.
- Symbols: Not found in extracted context.
- Why it matters: Not found in extracted context.
Equation 5: P
- Equation: Not found in extracted context.
- Symbols: P (probability measure)
- Why it matters: P is a probability measure that represents the distribution of the robots and targets.
Equation 6: G
- Equation: Not found in extracted context.
- Symbols: G (reference diffusion process)
- Why it matters: G is a reference diffusion process that is used to define the Schrödinger bridge formulation.
Equation 7: \mathbf{P} \in \mathbb{R}^{K \times \cdots \times K}
- Equation: Not found in extracted context.
- Symbols: \mathbf{P} (probability tensor)
- Why it matters: \mathbf{P} is a probability tensor that represents the joint probability distribution of the robots and targets.
Method Summary
- The authors cast the MAPF problem as a special class of multi-marginal optimal transport (MMOT) problems with an underlying Markovian structure.
- The authors establish conditions under which the corresponding LP is feasible, totally unimodular, and consequently, yields min-cost, integral transports that do not overlap in both space and time.
- The authors cast the MAPF-MMOT in a probabilistic framework via Schrödinger bridges, which reduces to an entropic regularization of the corresponding MMOT that admits an iterative Sinkhorn-type solution.
- The Schrödinger bridge itself yields a shadow (fractional) transport. The authors use it as a template to solve a reduced LP and demonstrate that this results in near-optimal, integral transports at a significant reduction in complexity.
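The Sinkhorn-type iteration mentioned above can be illustrated on the simplest case. The sketch below is the standard two-marginal entropic optimal transport solver, not the paper's multi-marginal, Markov-structured version; the cost matrix, marginals, and the parameter name `eps` are assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(cost, mu, nu, eps=1.0, iters=500):
    """Two-marginal entropic OT via Sinkhorn scaling.
    Returns a fractional plan P whose row sums are mu and whose
    column sums approach nu as the iteration converges."""
    G = np.exp(-cost / eps)            # Gibbs kernel
    u = np.ones_like(mu, dtype=float)
    for _ in range(iters):
        v = nu / (G.T @ u)             # rescale to match column marginals
        u = mu / (G @ v)               # rescale to match row marginals
    return u[:, None] * G * v[None, :]

if __name__ == "__main__":
    # Three graph nodes on a line; cost is the hop distance |i - j|.
    cost = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
    mu = np.array([0.5, 0.3, 0.2])     # source distribution (e.g. robots)
    nu = np.array([0.2, 0.3, 0.5])     # target distribution
    P = sinkhorn(cost, mu, nu, eps=1.0)
    print(np.round(P, 3))
```

The resulting plan is fractional, which is consistent with the abstract's description of the Schrödinger bridge output as a shadow transport that still has to be turned into integral paths by a reduced LP.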
Experimental Overview
- Tasks/Datasets: The authors conduct experiments on grids of size K × H with N robots and M targets, where N = M ≤ K / 2.
- Baselines/Comparisons: The authors compare the proposed approaches with existing baselines, including the Sinkhorn-type solution.
- Main Claimed Findings: The authors demonstrate that the proposed approaches achieve near-optimal, integral transports at a significant reduction in complexity, with a runtime scaling of O(K^{1.68}) for the P1 approach and O(K^{1.15}) for the P2 + P3 pipeline.
What to Verify in the PDF
- The authors claim that the Schrödinger bridge formulation reduces to an entropic regularization of the corresponding MMOT that admits an iterative Sinkhorn-type solution. Verify the assumptions under which this holds, and that integrality and near-optimality are obtained from the reduced LP rather than from the fractional Sinkhorn solution itself.
- The authors claim that the P1 approach achieves a significant reduction in complexity compared to existing baselines. Verify that this is indeed the case and that the P1 approach is near-optimal and integral.
- The authors claim that the P2 + P3 pipeline scales nearly linearly with K. Verify that this is indeed the case and that the pipeline is efficient and scalable.
- The authors claim that the proposed approaches achieve a cost gap consistently below 10%. Verify that this is indeed the case and that the cost gap is a meaningful metric for evaluating the performance of the proposed approaches.
3) Equivariant Reinforcement Learning for Clifford Quantum Circuit Synthesis
- Authors: Richie Yeung, Aleks Kissinger, Rob Cornish
- arXiv: 2605.10910 · pdf
- LLM context source: arXiv HTML (html)
- Categories: quant-ph, cs.LG
Abstract
We consider the problem of synthesizing Clifford quantum circuits for devices with all-to-all qubit connectivity. We approach this task as a reinforcement learning problem in which an agent learns to discover a sequence of elementary Clifford gates that reduces a given symplectic matrix representation of a Clifford circuit to the identity. This formulation permits a simple learning curriculum based on random walks from the identity. We introduce a novel neural network architecture that is equivariant to qubit relabelings of the symplectic matrix representation, and which is size-agnostic, allowing a single learned policy to be applied across different qubit counts without circuit splicing or network reparameterization. On six-qubit Clifford circuits, the largest regime for which optimal references are available, our agent finds circuits within one two-qubit gate of optimality in milliseconds per instance, and finds optimal circuits in 99.2% of instances within seconds per instance. After continued training on ten-qubit instances, the agent scales to unseen Clifford tableaus with up to thirty qubits, including targets generated from circuits with over a thousand Clifford gates, where it achieves lower average two-qubit gate counts than Qiskit’s Aaronson-Gottesman and greedy Clifford synthesizers.
Formula and Experiment Notes (LLM)
Formula Walkthrough
Equation 1: $2^{n} \times 2^{n}$
- Equation: $2^{n} \times 2^{n}$
- Symbols: $n$ (qubit count)
- Why it matters: This is the size of the unitary matrix of an $n$-qubit circuit, which grows exponentially with $n$. Clifford circuits admit the much smaller $2n \times 2n$ binary symplectic representation used in this work.
Equation 2: $2n \times 2n$
- Equation: $2n \times 2n$
- Symbols: $n$ (qubit count)
- Why it matters: This is the size of each binary symplectic matrix, i.e., of each state in the reinforcement learning setup. The state space is the set of all such matrices.
Equation 3: $\frac{995}{1003}$
- Equation: $\frac{995}{1003}$
- Symbols: None
- Why it matters: This is the fraction of the 1003 benchmark six-qubit tableaus for which the agent finds an optimal circuit, i.e., 99.2%.
Equation 4: $99.2\%$
- Equation: $99.2\%$
- Symbols: None
- Why it matters: This equation represents the performance of the agent on the benchmark of 1003 optimal six-qubit tableaus, which is 99.2%.
Equation 5: $\frac{982}{1003}$
- Equation: $\frac{982}{1003}$
- Symbols: None
- Why it matters: This is the fraction of the 1003 benchmark six-qubit tableaus for which the baseline (symbolic peephole optimizer) matches the optimal entangling-gate count, i.e., 97.9%.
Equation 6: $97.9\%$
- Equation: $97.9\%$
- Symbols: None
- Why it matters: This equation represents the performance of the baseline (symbolic peephole optimizer) on the benchmark of 1003 optimal six-qubit tableaus, which is 97.9%.
Method Summary
- The authors approach the problem of synthesizing Clifford quantum circuits as a reinforcement learning problem.
- They use a novel neural network architecture that is equivariant to qubit relabelings of the symplectic matrix representation.
- The architecture is size-agnostic, allowing a single learned policy to be applied across different qubit counts without circuit splicing or network reparameterization.
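- The authors use a deterministic Markov decision process (MDP) formulation, where the state space is the set of binary symplectic matrices of size $2n \times 2n$ and the action space is the set of elementary Clifford gates (generator matrices). The extracted context describes the reward as the negative T-count of the resulting circuit; since Clifford circuits contain no T gates and the abstract evaluates two-qubit gate counts, the reward is more plausibly a negative gate count. Verify the exact definition in the PDF.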
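The state representation, the random-walk curriculum, and the relabeling symmetry can be made concrete in a few lines. The sketch below uses one common convention for binary symplectic matrices (X block first, Z block second, signs ignored) and H, S, and CNOT as generators; the convention, the gate set, and all function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def omega(n):
    """Symplectic form [[0, I], [I, 0]] over GF(2)."""
    I, Z = np.eye(n, dtype=int), np.zeros((n, n), dtype=int)
    return np.block([[Z, I], [I, Z]])

def is_symplectic(M):
    n = M.shape[0] // 2
    return np.array_equal((M.T @ omega(n) @ M) % 2, omega(n))

def hadamard(n, q):
    M = np.eye(2 * n, dtype=int)
    M[[q, n + q]] = M[[n + q, q]]      # swap the x and z components of qubit q
    return M

def phase(n, q):
    M = np.eye(2 * n, dtype=int)
    M[n + q, q] = 1                    # z_q <- z_q + x_q
    return M

def cnot(n, c, t):
    M = np.eye(2 * n, dtype=int)
    M[t, c] = 1                        # x_t <- x_t + x_c
    M[n + c, n + t] = 1                # z_c <- z_c + z_t
    return M

def random_walk(n, length, rng):
    """Random walk from the identity; returns the gate list and final matrix."""
    gates, M = [], np.eye(2 * n, dtype=int)
    for _ in range(length):
        kind = rng.integers(3)
        if kind == 0:
            G = hadamard(n, int(rng.integers(n)))
        elif kind == 1:
            G = phase(n, int(rng.integers(n)))
        else:
            c, t = rng.choice(n, size=2, replace=False)
            G = cnot(n, int(c), int(t))
        gates.append(G)
        M = (G @ M) % 2
    return gates, M

def relabel(M, perm):
    """Apply a qubit relabeling to both the X and Z blocks."""
    n = M.shape[0] // 2
    idx = np.concatenate([np.asarray(perm), np.asarray(perm) + n])
    return M[np.ix_(idx, idx)]
```

In this representation each generator is its own inverse modulo 2, so replaying a random walk in reverse order reduces the target back to the identity. That gives a valid, though generally not optimal, solution for every training instance, which is what makes the curriculum simple to generate. An equivariant policy should treat `M` and `relabel(M, perm)` consistently.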
Experimental Overview
- The authors apply their approach to the benchmark of 1003 optimal six-qubit tableaus considered by Bravyi et al.
- The baseline is the symbolic peephole optimizer, which matches the optimal entangling-gate count for 982/1003 circuits after 217 hours.
- The agent is reported to find optimal circuits for 99.2% of the benchmark instances (995/1003) within seconds per instance, compared with 97.9% (982/1003) for the baseline.
What to Verify in the PDF
- How the benchmark of optimal six-qubit tableaus was drawn from the exhaustive database of optimal six-qubit Clifford circuits attributed to Bravyi–Latone–Maslov, and how that database was generated.
- The implementation and settings of the baseline (symbolic peephole optimizer), reported at 97.9% (982/1003) optimal after 217 hours.
- How Qiskit’s Aaronson-Gottesman and greedy Clifford synthesizers were configured for the comparison of average two-qubit gate counts.
4) V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction
- Authors: Marcin Kostrzewa, Sebastian Tomczak, Roman Furman, Anna Poberezhna, Michał Furgała, Oleksii Furman, Maciej Zięba
- arXiv: 2605.10896 · pdf
- LLM context source: arXiv HTML (html)
- Categories: cs.LG
Abstract
Corporate bankruptcy prediction is a high-stakes financial task characterized by severe class imbalance and multi-horizon forecasting demands. Public datasets supporting it remain scarce and small: widely used free benchmarks contain between 6,000 and 80,000 company-year observations, while larger resources are behind subscription paywalls. To address this gap, we introduce V4FinBench, a benchmark of over one million company-year records from the Visegràd Group (V4) economies (2006-2021), with 131 financial and non-financial features, six prediction horizons, and a composite distress criterion jointly capturing solvency, profitability, and liquidity deterioration. V4FinBench is designed to support the evaluation of tabular and foundation-model methods under realistic class imbalance, with positive rates between 0.19% and 0.36%. We provide reference evaluations of standard tabular baselines, finetuned TabPFN, and QLoRA-finetuned Llama-3-8B. With imbalance-aware finetuning, TabPFN matches or exceeds gradient boosting at longer time horizons on both $F_1$-score and ROC-AUC. In contrast, Llama-3-8B trails gradient boosting on ROC-AUC at every horizon and is generally weaker on $F_1$-score, with the gap widening sharply beyond the immediate horizon. In an external evaluation on the American Bankruptcy Dataset, the V4FinBench-finetuned TabPFN checkpoint improves over vanilla TabPFN, suggesting that adaptation captures transferable financial-distress structure rather than only V4-specific patterns. V4FinBench is publicly released to support further evaluation and development of prediction methods on realistic financial data.
Formula and Experiment Notes (LLM)
Formula Walkthrough
Equation 1: F1-score
- Equation: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- Symbols: $F_1$, Precision, Recall
- Why it matters: The F1-score is the harmonic mean of precision and recall. Unlike plain accuracy, it is not inflated by the majority class, which matters at positive rates between 0.19% and 0.36%. It is one of the two metrics used to evaluate models on V4FinBench.
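Equation 2: ROC-AUC
- Equation: $\text{ROC-AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d\,\text{FPR} = \Pr\left(s(x^{+}) > s(x^{-})\right)$
- Symbols: TPR (true positive rate), FPR (false positive rate), $s$ (model score), $x^{+}$ and $x^{-}$ (a randomly drawn positive and negative example). This is the standard definition; the extracted context does not contain a formula for it.
- Why it matters: The ROC-AUC (area under the receiver operating characteristic curve) measures how well the model ranks positive examples above negative ones, independent of any decision threshold. It is used to evaluate the performance of the model on the V4FinBench dataset.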
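Both metrics are easy to compute directly, which helps when checking reported numbers under heavy class imbalance. The sketch below uses the textbook definitions, with ROC-AUC computed through its pairwise (Mann-Whitney) characterization: the probability that a random positive is scored above a random negative, counting ties as one half. The function names and toy data are illustrative and are not taken from the benchmark.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    if tp == 0:
        return 0.0                      # no true positives: F1 is defined as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2.
    Uses O(#pos * #neg) memory, so it is meant for small illustrations."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

if __name__ == "__main__":
    y = [0, 0, 0, 0, 1, 1]
    s = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
    print(f1_score(y, np.asarray(s) >= 0.5), roc_auc(y, s))
```

A classifier that always predicts the negative class reaches very high accuracy on data this imbalanced but has an F1-score of 0, which is why the benchmark reports F1 and ROC-AUC rather than accuracy.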
Equation 3: h ∈ {0, …, 5}
- Equation: $h \in \{0, \ldots, 5\}$
- Symbols: ( h )
- Why it matters: This equation represents the six prediction horizons used in the V4FinBench dataset, ranging from 0 to 5.
Equation 4: equity / total_assets < 0
- Equation: $\text{equity} / \text{total\_assets} < 0$
- Symbols: equity, total_assets
- Why it matters: This inequality (negative equity relative to total assets) appears to be the solvency component of the composite distress criterion, which per the abstract jointly captures solvency, profitability, and liquidity deterioration. The remaining components are not in the extracted context.
Method Summary
- The authors evaluate three families of methods on V4FinBench: tabular foundation models (TabPFN with imbalance-aware in-context finetuning), large language models (Llama-3-8B with QLoRA on serialized financial records), and classical tabular baselines.
- The authors compare the performance of these methods on six prediction horizons and report the results on F1-score and ROC-AUC.
- The authors also evaluate the performance of TabPFN on the American Bankruptcy Dataset and find that adaptation captures transferable financial-distress structure rather than only V4-specific patterns.
Experimental Overview
- Tasks: Bankruptcy prediction on V4FinBench dataset
- Datasets: V4FinBench, American Bankruptcy Dataset
- Methods compared: finetuned TabPFN, QLoRA-finetuned Llama-3-8B, and classical tabular baselines (gradient-boosted trees, logistic regression, random forest, multilayer perceptron)
- Main claimed findings: Imbalance-aware TabPFN finetuning can compete with strong classical baselines, and V4FinBench finetuning transfers beyond the original V4 distribution.
What to Verify in the PDF
- The authors’ choice of hyperparameter grids for standard methods benchmarked against the TabPFN model (Appendix B).
- The results of the Llama-3-8B experiment, which deviates from the full protocol due to compute constraints (Section 6.2).
- The evaluation protocol used for the American Bankruptcy Dataset, which is not explicitly stated in the paper but is mentioned in the context of the V4FinBench finetuned TabPFN checkpoint.
5) Predicting 3D structure by latent posterior sampling
- Authors: Azmi Haider, Dan Rosenbaum
- arXiv: 2605.10830 · pdf
- LLM context source: arXiv HTML (html)
- Categories: cs.CV, cs.LG
Abstract
The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.
Formula and Experiment Notes (LLM)
Formula Walkthrough
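Equation 1: x_{t}
\mu_{\theta}(x_{t},t) = \frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{\theta}(x_{t},t)\right)
- Symbols: x_{t}, \alpha_{t}, \bar{\alpha}_{t}, \epsilon_{\theta}(x_{t},t)
- Why it matters: x_{t} is the noisy latent at diffusion step t. The expression is the mean of the reverse (denoising) step that produces x_{t-1} from x_{t}; \mu_{\theta} is shorthand introduced here for that mean. In the standard DDPM convention, \alpha_{t} is the noise-schedule coefficient, \bar{\alpha}_{t} is the cumulative product of the \alpha_{s} up to step t, and \epsilon_{\theta}(x_{t},t) is the learned noise prediction.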
Equation 2: x_{t-1}
x_{t-1}
- Symbols: x_{t-1}
- Why it matters: x_{t-1} is the latent after one reverse (denoising) step from x_{t}. It is sampled from a Gaussian whose mean depends on x_{t} and on the predicted noise \epsilon_{\theta}(x_{t},t).
Equation 3: \hat{x}_{0}
\hat{x}_{0} = \frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(x_{t}-\sqrt{1-\bar{\alpha}_{t}}\epsilon_{\theta}(x_{t},t)\right)
- Symbols: \hat{x}_{0}, x_{t}, \bar{\alpha}_{t}, \epsilon_{\theta}(x_{t},t)
- Why it matters: This is the one-step estimate of the clean latent: \hat{x}_{0} is the denoised latent, x_{t} is the current noisy latent, \bar{\alpha}_{t} is the cumulative noise-schedule coefficient, and \epsilon_{\theta}(x_{t},t) is the learned noise prediction.
Equation 4: x_{t-1} \sim \mathcal{N}(\cdot)
x_{t-1} \sim \mathcal{N}\left(\frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{\theta}(x_{t},t)\right), \;\tilde{\beta}_{t}I\right)
- Symbols: x_{t-1}, x_{t}, \alpha_{t}, \bar{\alpha}_{t}, \epsilon_{\theta}(x_{t},t), \tilde{\beta}_{t}, I
- Why it matters: This is the reverse (ancestral sampling) transition of the diffusion prior, extracted as two separate fragments that form a single equation. x_{t} is the current latent, \alpha_{t} and \bar{\alpha}_{t} are the noise-schedule coefficients, \epsilon_{\theta}(x_{t},t) is the learned noise prediction, \tilde{\beta}_{t} is the variance of the step, and I is the identity matrix. It is not the likelihood term: per the abstract, the likelihood is computed from the reconstruction model with volumetric rendering and is combined with this prior step for posterior sampling.
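The denoised estimate and the reverse transition from the walkthrough can be written out as a short sketch. It assumes a linear noise schedule and the standard DDPM variance \tilde{\beta}_{t} = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}(1-\alpha_{t}); both are assumptions, since the extracted context does not define them. `eps_model` is a hypothetical stand-in for the trained noise predictor, and the likelihood gradient that the paper adds for posterior sampling is not included.

```python
import numpy as np

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear noise schedule (an assumption; the paper may use another)."""
    betas = np.linspace(beta_min, beta_max, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def predict_x0(x_t, eps, alpha_bar_t):
    """Denoised estimate: (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

def reverse_mean(x_t, eps, alpha_t, alpha_bar_t):
    """Mean of the reverse transition, as in the walkthrough."""
    return (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_t)

def posterior_variance(t, betas, alpha_bars):
    """Standard DDPM choice for the variance of the reverse step (assumed)."""
    prev = alpha_bars[t - 1] if t > 0 else 1.0
    return (1.0 - prev) / (1.0 - alpha_bars[t]) * betas[t]

def reverse_step(x_t, t, eps_model, schedule, rng):
    """Sample x_{t-1} from N(reverse_mean, variance * I); no noise at t = 0."""
    betas, alphas, alpha_bars = schedule
    eps = eps_model(x_t, t)
    mean = reverse_mean(x_t, eps, alphas[t], alpha_bars[t])
    if t == 0:
        return mean
    noise = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(posterior_variance(t, betas, alpha_bars)) * noise
```

A useful consistency check is that the reverse mean equals the usual DDPM posterior mean written in terms of the denoised estimate, \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_{t})}{1-\bar{\alpha}_{t}}\hat{x}_{0} + \frac{\sqrt{\alpha_{t}}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}x_{t}, so the two formulas in the walkthrough describe the same step.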
Method Summary
- The proposed method combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models.
- The method consists of two stages: (1) training the reconstruction model (RM) while optimizing the latent representation of the training scenes (auto-decoding), and (2) training a diffusion model over the latents as a prior.
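- The reconstruction model is a CNF (presumably a conditional neural field; the abbreviation is not expanded in the extracted context) followed by a volumetric renderer. Conditioned on the latent, it predicts the values at 3D positions within the scene.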
- The diffusion model is used to perform posterior sampling of the latents.
Experimental Overview
- Tasks: 3D reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data.
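- Baselines: Approaches that use 2D generative models for 3D generation.
- Main claimed findings: The proposed method can accurately predict 3D structure from diverse types of observations and can model the varying levels of uncertainty associated with each task. The extracted context also states that it outperforms 2D-generative-model approaches in 3D reconstruction quality; the abstract does not state this comparison.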
What to Verify in the PDF
- The implementation details of the reconstruction model and the diffusion model.
- The training procedure for the reconstruction model and the diffusion model.
- The evaluation metrics used to compare the performance of the proposed method with 2D generative models.