Session chairs
ML: Exploring the synergy between stochastic optimization, dynamics, sampling, inference, and optimal transport II
- Taiji Suzuki (The University of Tokyo / RIKEN AIP)
ML: Contributed talks
- Vladimir Norkin (V.M.Glushkov Institute of Cybernetics)
ML: Functional Models In Stochastic Optimization
- Andrzej Ruszczynski (Rutgers University)
ML: Contributed talks
- Sebastian Maier
ML: Machine Learning and Off-Policy Learning Robust to Distribution Shifts
- Phebe Vayanos (University of Southern California)
ML: Exploring the synergy between stochastic optimization, dynamics, sampling, inference, and optimal transport I
- Jia-Jie Zhu (Weierstrass Institute, Berlin)
ML: Contributed talks
- Philip Thompson (FGV-EMAp)
ML: Neural networks with combinatorial optimization layers: encoding policies for (multistage) stochastic combinatorial optimization
- Léo Baty (Ecole des Ponts)
ML: Stochastic gradient methods for reinforcement learning and optimal control
- Caleb Ju (Georgia Tech)
In this talk, I will present some recent results on distribution optimization methods. First, we will discuss a non-convex optimization technique for a Wasserstein gradient flow (WGF). While WGF is guaranteed to converge to a first-order stationary point, for nonconvex functionals the converged solution does not necessarily satisfy the second-order optimality condition; i.e., it could...
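As background (a standard formulation, with notation introduced here rather than taken from the talk), the Wasserstein gradient flow of a functional $F$ over probability measures evolves a measure $\mu_t$ via the continuity equation

\[ \partial_t \mu_t \;=\; \nabla \cdot \Big( \mu_t \, \nabla \tfrac{\delta F}{\delta \mu}(\mu_t) \Big), \]

and first-order stationarity corresponds to $\nabla \tfrac{\delta F}{\delta \mu}(\mu) = 0$ holding $\mu$-almost everywhere.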
The Polyak-Łojasiewicz (PL) constant for a given function exactly characterizes the exponential rate of convergence of gradient flow uniformly over initializations, and has been of major recent interest in optimization and machine learning because it is strictly weaker than strong convexity yet implies many of the same results. In the world of sampling, the log-Sobolev inequality plays an...
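For reference (standard definitions under one common normalization, not specific to this talk), a differentiable function $f$ with minimum value $f^\star$ satisfies the PL inequality with constant $\mu > 0$ if

\[ \tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\big(f(x) - f^\star\big) \quad \text{for all } x, \]

while a target measure $\pi$ satisfies a log-Sobolev inequality with constant $\lambda > 0$ if, for all probability densities $\rho$,

\[ \mathrm{KL}(\rho\,\|\,\pi) \;\le\; \frac{1}{2\lambda} \int \Big\|\nabla \log \tfrac{\rho}{\pi}\Big\|^2 \rho \,\mathrm{d}x . \]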
ADMM is a popular method for federated deep learning which originated in the 1970s and, even though many new variants of it have been proposed since then, its core algorithmic structure has remained unchanged. In this talk, we present a new way to derive and extend federated ADMM. We propose to use a structure called Bayesian Duality which exploits a duality of the posterior distributions...
Problem definition: A key challenge in supervised learning is data scarcity, which can cause prediction models to overfit to the training data and perform poorly out of sample. A contemporary approach to combat overfitting is offered by distributionally robust problem formulations that consider all data-generating distributions close to the empirical distribution derived from historical...
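For concreteness (a generic Wasserstein distributionally robust formulation, with notation chosen here for illustration), such problems take the form

\[ \min_{\theta} \; \sup_{Q \,:\, W(Q,\, \hat{P}_N) \le \varepsilon} \; \mathbb{E}_{\xi \sim Q}\big[\ell(\theta;\xi)\big], \]

where $\hat{P}_N$ is the empirical distribution of the $N$ historical samples, $W$ a Wasserstein distance, and $\varepsilon$ the radius of the ambiguity set.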
This work explores constrained Wasserstein barycenters (CWB), extending the applicability of the classical Wasserstein barycenter (WB) to settings with pre-specified geometric, statistical, or other constraints. While the WB problem has been extensively studied, constrained settings pose significant challenges, particularly for nonconvex constraint sets. Building upon the Method of Averaged Marginals (MAM), we...
The study proposes and explores a broad area of application of machine learning methods in applied mathematics, namely the parametric analysis of mathematical models, in particular by adaptive kernel support vector machines. The kernel support vector machine is extended by the ability to use a continuum (multivariate parametric) family of kernels to approximate the...
Naive Bayes is one of the most widely used machine learning algorithms, appreciated for its simplicity, efficiency, and ease of interpretation—qualities that make it appealing across various fields. However, Naive Bayes operates under the assumption that the explanatory variables are conditionally independent given the class label, an assumption that often does not hold true in practice.
To...
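For reference, the conditional-independence assumption referred to above yields the familiar Naive Bayes factorization

\[ \mathbb{P}(y \mid x_1,\dots,x_p) \;\propto\; \mathbb{P}(y)\,\prod_{i=1}^{p} \mathbb{P}(x_i \mid y), \]

which is exactly what breaks down when the explanatory variables are dependent given the class label $y$.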
Optimization problems with stochastic dominance constraints provide a possibility to shape risk by selecting a benchmark random outcome with a desired distribution. A difficulty arises when no feasible decision results in a distribution that dominates the benchmark. In this talk, we address the problem of choosing a tight relaxation of the stochastic dominance constraint by selecting a feasible...
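As a point of reference (second-order dominance is only one common choice of dominance relation, stated here for illustration), a constraint requiring the random outcome $G(x,\xi)$ to dominate a benchmark $Y$ in the second order reads

\[ \mathbb{E}\big[(\eta - G(x,\xi))_+\big] \;\le\; \mathbb{E}\big[(\eta - Y)_+\big] \quad \text{for all } \eta \in \mathbb{R}, \]

and infeasibility of this system for every $x$ is precisely the situation motivating the relaxation discussed above.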
We consider stochastic optimization problems involving an expected value of a nonlinear function of a base random vector and a conditional expectation of another function depending on the base random vector, a dependent random vector, and the decision variables. We call such problems conditional stochastic optimization problems. They arise in many applications, such as uplift modeling,...
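Schematically (with notation chosen here for illustration rather than taken from the talk), a conditional stochastic optimization problem has the form

\[ \min_{x} \; \mathbb{E}_{\xi}\Big[ f\big(x,\,\xi,\ \mathbb{E}_{\zeta}\big[g(x,\xi,\zeta)\,\big|\,\xi\big]\big) \Big], \]

where $\xi$ is the base random vector, $\zeta$ the dependent random vector, and $x$ the decision variables.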
Stochastic dominance rules are well-characterized and widely used. This work aims to describe and better understand the situations when they do not hold by developing measures of stochastic non-dominance. They quantify the error caused by assuming that one random variable dominates another one when it does not. To calculate them, we search for a hypothetical random variable that satisfies the...
In this paper, we introduce an approach for obtaining probabilistically guaranteed upper and lower bounds on the true optimal value of stopping problems. Bounds of existing simulation-and-regression approaches, such as those based on least squares Monte Carlo and information relaxation, are stochastic in nature and therefore do not come with a finite sample guarantee. Our data-driven approach...
Decision-makers often face complex problems under uncertainty. Statistical and Machine Learning (ML) tools can support these decisions by reducing the lack of knowledge. However, in many real-world scenarios, the uncertainty itself is induced by the decisions taken. In such cases, standard ML models that provide a priori insights into the problem environment become ineffective. In the context of...
Federated learning concerns training global models in a decentralized manner. Federated learning is important in many applications, especially when training data come from different sources that cannot be shared with the central server due to restrictions on data sharing. With the increasing capacity of data sources, training samples are usually collected and stored on a regular basis, which...
Chance-constrained optimization (CCO) problems are stochastic optimization problems with probabilistic constraints defined by a confidence level $\alpha$. A standard solution approach is to transform the CCO problem into a deterministic optimization problem, which is then solved by a numerical solver. However, this approach becomes computationally expensive when dealing with multiple...
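For concreteness (generic form, with notation introduced here), a chance-constrained program with confidence level $\alpha$ can be written as

\[ \min_{x \in X} \; f(x) \quad \text{subject to} \quad \mathbb{P}\big(g(x,\xi) \le 0\big) \;\ge\; \alpha, \]

and the deterministic reformulation mentioned above replaces the probabilistic constraint with a solver-friendly surrogate.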
Logistic regression models are widely used in the social and behavioral sciences and in high-stakes domains, due to their simplicity and interpretability properties. At the same time, such domains are permeated by distribution shifts, where the distribution generating the data changes between training and deployment. In this paper, we study a distributionally robust logistic regression problem...
In domains such as personalized medicine, historical data is used to learn what treatments to prescribe to maximize positive outcomes. Previous studies have proposed methods for creating prescriptive trees: human-interpretable diagrams that indicate what type of treatment an individual should get based on their measurements. However, a remaining problem is that the models perform worse over...
We consider the problem of learning classification trees that are robust to distribution shifts between training and testing/deployment data. This problem arises frequently in high stakes settings such as public health and social work where data is often collected using self-reported surveys which are highly sensitive to e.g., the framing of the questions, the time when and place where the...
Causal structure learning, the task of inferring causal relationships from data, is computationally challenging due to its combinatorial nature. State-of-the-art integer programming formulations suffer from exponential growth in the number of variables and constraints, while traditional column generation approaches struggle with the complexity of solving mixed-integer nonlinear programming...
The score function plays a central role in modern generative modeling, particularly in diffusion models and related score-based methods. Despite its theoretical appeal, learning the score function in practice is notoriously difficult: it is sensitive to hyperparameter choices and prone to various forms of instability that often require ad hoc corrections.
In this talk, we study score...
The supercooled Stefan problem describes freezing of supercooled water. In contrast to the Stefan problem that describes melting of ice, the supercooled problem exhibits unstable behaviour which makes the usual PDE methods break down. We will discuss some recent progress which employs a stochastic version of optimal transport, involving optimal stopping times of the Brownian motion.
Many problems in machine learning can be framed as variational problems that minimize the relative entropy between two probability measures. Many recent works have exploited the connection between the (Otto-)Wasserstein gradient flow of the Kullback–Leibler (KL) divergence and various sampling, Bayesian inference, and generative modeling algorithms. In this talk, I will first contrast the...
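As background (a standard identity, not a result of the talk), the Wasserstein gradient flow of $\mathrm{KL}(\rho\,\|\,\pi)$ with target $\pi \propto e^{-V}$ is the Fokker–Planck equation underlying Langevin dynamics:

\[ \partial_t \rho_t \;=\; \nabla \cdot \big( \rho_t \,\nabla \log \tfrac{\rho_t}{\pi} \big) \;=\; \Delta \rho_t + \nabla \cdot (\rho_t \nabla V). \]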
The notions of relative (weak) convexity and variation (e.g., Lipschitzness and smoothness) have been successfully applied to some optimization problems, including applications in machine learning. While typically harder to prove, these properties encode better dependence of the objective with respect to the intrinsic geometry of the problem. We review previous analysis of the mirror descent...
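For reference (one common definition, stated here as background), $f$ is said to be $L$-smooth relative to a reference function $h$ if

\[ f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x\rangle + L\, D_h(y,x) \quad \text{for all } x, y, \]

where $D_h(y,x) = h(y) - h(x) - \langle \nabla h(x),\, y - x\rangle$ is the Bregman divergence generated by $h$; mirror descent is the natural method for objectives with this structure.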
The gradient descent (GD) method is a fundamental and likely the most popular optimization algorithm in machine learning (ML), with a history tracing back to Cauchy (1847). In this paper, we provide an improved convergence analysis of gradient descent and its variants under generalized $(L_0,L_1)$-smoothness. In particular, we show that GD has the following behavior of...
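For reference, one common formulation of generalized $(L_0,L_1)$-smoothness (stated here as background for twice-differentiable $f$, not necessarily the paper's exact assumption) is

\[ \|\nabla^2 f(x)\| \;\le\; L_0 + L_1\,\|\nabla f(x)\| \quad \text{for all } x, \]

which reduces to standard $L$-smoothness when $L_1 = 0$.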
Polyak averaging is a well-known technique for achieving asymptotically optimal convergence in Stochastic Approximation. In this work, we establish the first high-probability bound for general Stochastic Approximation with Polyak Averaging. We take a black-box approach, assuming access to an anytime high-probability bound for a given Stochastic Approximation, and derive tight finite-time...
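As background (standard form, with notation introduced here), Polyak–Ruppert averaging simply returns the running average of the stochastic approximation iterates:

\[ \bar{x}_n \;=\; \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \text{or recursively} \quad \bar{x}_n \;=\; \bar{x}_{n-1} + \frac{1}{n}\big(x_n - \bar{x}_{n-1}\big). \]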
When facing contextual multi-stage optimization problems, training combinatorial optimization-enriched machine learning pipelines (ML-CO-pipelines) to date either requires imitating expert solutions or utilizing unstructured learning algorithms. While the former restricts the use of ML-CO-pipelines to problems with traceable offline solutions and relatively homogeneous state spaces, the latter...
Multistage stochastic optimization (MSO) is pivotal for sequential decision-making under uncertainty, with prominent approaches in stochastic optimal control and reinforcement learning. While methods like Stochastic Dual Dynamic Programming excel with moderate-dimensional states and large continuous control spaces, and reinforcement learning handles large state spaces with smaller control...
Federated learning faces severe communication bottlenecks due to the high dimensionality of model updates. Communication compression with contractive compressors (e.g., Top-K) is often preferable in practice but can degrade performance without proper handling. Error feedback (EF) mitigates such issues but has been largely restricted to smooth, unconstrained problems, limiting its real-world...
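A minimal sketch of classical error feedback with a Top-K contractive compressor, shown as generic background rather than the method proposed in this talk (function names and constants are illustrative):

```python
import numpy as np

def top_k(v, k):
    """Contractive compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_sgd_step(x, grad, error, lr, k):
    """One step of classical error-feedback SGD with Top-K compression.

    The mass dropped by the compressor is stored in `error` and re-injected
    at the next step, which is what keeps compressed training from diverging.
    """
    corrected = lr * grad + error        # add the memory of previously dropped updates
    update = top_k(corrected, k)         # only this compressed vector is communicated
    new_error = corrected - update       # remember what was dropped
    return x - update, new_error
```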
Network schedule optimization is one of the main applications of OR to the air transport industry. This process involves designing aircraft schedules that are both operationally feasible and cost-optimal. Tail assignment is the process of assigning specific aircraft (tails) to planned flights, typically occurring a few days to weeks before operations. Airlines routinely solve this problem...
We propose a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDP) and reinforcement learning (RL). By incorporating this advantage gap function into the design of step size rules and deriving a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy, we demonstrate that...
Sinkhorn’s algorithm is the go-to method for solving large-scale optimal transport problems. While its theoretical foundations are rich and still expanding, a blind spot has persisted regarding the widely used heuristic of ϵ-scaling (temperature annealing), which lacks convergence guarantees. In this talk I will present such guarantees as well as theoretical insights into the design of...
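A minimal log-domain sketch of Sinkhorn with the ε-scaling (annealing) heuristic mentioned above; the schedule, warm-starting scheme, and iteration counts are illustrative choices and do not reflect the guarantees presented in the talk:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_eps_scaling(C, a, b, eps_schedule, n_iter=200):
    """Entropic OT between histograms a and b with cost matrix C.

    The dual potentials (f, g) are warm-started across the decreasing
    temperature schedule; log-domain updates stay stable at small eps.
    """
    f, g = np.zeros_like(a), np.zeros_like(b)
    for eps in eps_schedule:
        for _ in range(n_iter):
            f = eps * np.log(a) - eps * logsumexp((g[None, :] - C) / eps, axis=1)
            g = eps * np.log(b) - eps * logsumexp((f[:, None] - C) / eps, axis=0)
    # transport plan at the final temperature
    return np.exp((f[:, None] + g[None, :] - C) / eps)

# example schedule: eps_schedule = np.geomspace(1.0, 1e-3, 8)
```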
Modern generative AI has developed along two distinct paths: autoregressive models for discrete data (such as text) and diffusion models for continuous data (like images). Bridging this divide by adapting diffusion models to handle discrete data represents a compelling avenue for unifying these disparate approaches. However, existing work in this area has been hindered by unnecessarily complex...
We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using any constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break...
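As background (the standard softmax gradient bandit update with constant step size $\eta$, stated here without a baseline), after sampling action $A_t$ from the softmax policy $\pi_{\theta_t}$ and observing reward $R_t$, the parameters are updated as

\[ \theta_{t+1,a} \;=\; \theta_{t,a} + \eta\, R_t\,\big(\mathbf{1}\{A_t = a\} - \pi_{\theta_t}(a)\big) \quad \text{for every arm } a. \]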
We study the problem of learning a treatment assignment policy based on observable covariates, where there are potential shifts in the distribution of covariates from historical data (training) to deployment (test). We formulate a distributionally robust policy optimization problem with the objective of maximizing the worst-case (out-of-sample) expected outcomes, considering all possible...
Federated learning enables training machine learning models while preserving the privacy of participants. Surprisingly, there is no differentially private distributed method for smooth, non-convex optimization problems. The reason is that standard privacy techniques require bounding the participants' contributions, usually enforced via clipping of the updates. Existing literature typically...
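A minimal sketch of the clipping-plus-Gaussian-noise step the abstract alludes to (the generic mechanism used to bound participants' contributions, with illustrative constants, not the method developed in this work):

```python
import numpy as np

def privatize_update(update, clip_norm, noise_multiplier, rng=None):
    """Clip a participant's update to bound its sensitivity, then add Gaussian noise.

    This is the standard Gaussian-mechanism recipe; calibrating noise_multiplier
    to a concrete (epsilon, delta) budget is deliberately left out of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise
```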