
Please join us immediately after the opening remarks in Emmanuel Candes' Invited Talk

Recent progress in machine learning provides us with many potentially effective tools to learn from datasets of ever increasing sizes and make useful predictions. How do we know that these tools can be trusted in critical and high-sensitivity systems? If a learning algorithm predicts the GPA of a prospective college applicant, what guarantees do I have concerning the accuracy of this prediction? How do we know that it is not biased against certain groups of applicants? This talk introduces statistical ideas to ensure that the learned models satisfy some crucial properties, especially reliability and fairness (in the sense that the models need to apply to individuals in an equitable manner). To achieve these important objectives, we shall not “open up the black box” and try understanding its underpinnings. Rather we discuss broad methodologies that can be wrapped around any black box to produce results that can be trusted and are equitable. We also show how our ideas can inform causal inference; for instance, we will answer counterfactual prediction problems, i.e., predict what the outcome of a treatment would have been given that the patient was actually not treated.

We make inroads into understanding the robustness of Variational Autoencoders (VAEs) to adversarial attacks and other input perturbations. While previous work has developed algorithmic approaches to attacking and defending VAEs, there remains a lack of formalization for what it means for a VAE to be robust. To address this, we develop a novel criterion for robustness in probabilistic models: $r$-robustness. We then use this to construct the first theoretical results for the robustness of VAEs, deriving margins in the input space for which we can provide guarantees about the resulting reconstruction. Informally, we are able to define a region within which any perturbation will produce a reconstruction that is similar to the original reconstruction. To support our analysis, we show that VAEs trained using disentangling methods not only score well under our robustness metrics, but that the reasons for this can be interpreted through our theoretical results.

WiML and the Caucus for Women in Statistics (CWS) are excited to announce a joint event at AISTATS 2021. The event has two components: community-driven mentoring, and a panel. The event will be held on the Icebreaker.video platform on Tuesday, April 13, 2021, 12.30pm – 2pm PT.

WiML Homepage

Caucus for Women in Statistics Homepage

Please use the Mementor portal to schedule virtual mentor sessions during AISTATS and beyond. The goal is to enable mentorship opportunities for researchers in machine learning, both as mentors and mentees, with a special focus on under-represented minorities.

The mentorship session serves as a platform to share experiences. These could be technical and research related (e.g., research topics and technical discussions), could be about scientific communication (e.g., paper writing, presentation, networking), or could concern mental health, burnout, work ethics, PhD life, etc. The goal is to facilitate sharing of experiences between members of the community which would not happen otherwise.

When equipped with efficient optimization algorithms, over-parameterized neural networks have demonstrated a high level of performance even though the loss function is non-convex and non-smooth. While many works have focused on understanding the loss dynamics of neural networks trained with gradient descent (GD), in this work we consider a broad class of optimization algorithms that are commonly used in practice. For example, we show from a dynamical system perspective that the Heavy Ball (HB) method can converge to the global minimum of the mean squared error (MSE) at a linear rate (similar to GD); however, Nesterov accelerated gradient descent (NAG) may only converge to the global minimum sublinearly.

Our results rely on the connection between the neural tangent kernel (NTK) and finitely-wide over-parameterized neural networks with ReLU activation, which leads to analyzing the limiting ordinary differential equations (ODE) for optimization algorithms. We show that optimizing the non-convex loss over the weights corresponds to optimizing some strongly convex loss over the prediction error. As a consequence, we can leverage classical convex optimization theory to understand the convergence behavior of neural networks. We believe our approach can also be extended to other optimization algorithms and network architectures.
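As a rough illustration of two of the optimizers compared above, the following sketch runs gradient descent and the Heavy Ball method on a small strongly convex quadratic in the prediction error. The matrix `H`, the step size, and the momentum value are hypothetical choices made for illustration; they are not taken from the paper, and the sketch does not reproduce the NTK analysis.

```python
def grad(H, e):
    """Gradient of 0.5 * e^T H e for a symmetric matrix H given as a list of rows."""
    return [sum(h_ij * e_j for h_ij, e_j in zip(row, e)) for row in H]


def loss(H, e):
    """Quadratic loss 0.5 * e^T H e."""
    return 0.5 * sum(e_i * g_i for e_i, g_i in zip(e, grad(H, e)))


def gradient_descent(H, e0, lr, steps):
    """Plain gradient descent: e <- e - lr * grad."""
    e = list(e0)
    for _ in range(steps):
        g = grad(H, e)
        e = [e_i - lr * g_i for e_i, g_i in zip(e, g)]
    return e


def heavy_ball(H, e0, lr, momentum, steps):
    """Heavy Ball: gradient step plus momentum * (current iterate - previous iterate)."""
    e_prev = list(e0)
    e = list(e0)
    for _ in range(steps):
        g = grad(H, e)
        e_next = [
            e_i - lr * g_i + momentum * (e_i - p_i)
            for e_i, g_i, p_i in zip(e, g, e_prev)
        ]
        e_prev, e = e, e_next
    return e


# Hypothetical positive-definite matrix standing in for a kernel-induced quadratic.
H = [[2.0, 0.0], [0.0, 0.5]]
e0 = [1.0, 1.0]
gd_loss = loss(H, gradient_descent(H, e0, lr=0.4, steps=100))
hb_loss = loss(H, heavy_ball(H, e0, lr=0.4, momentum=0.3, steps=100))
```

On this toy problem both methods contract the error geometrically, which is the kind of linear-rate behavior the abstract describes for GD and HB.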

We propose a novel estimator of the mutual information between two ordinal vectors $x$ and $y$. Our approach is inductive (as opposed to deductive) in that it depends on the data generating distribution solely through some nonparametric properties revealing associations in the data, and does not require having enough data to fully characterize the true joint distributions $P_{x, y}$. Specifically, our approach consists of (i) noting that $I\left(y; x\right) = I\left(u_y; u_x\right)$ where $u_y$ and $u_x$ are the \emph{copula-uniform dual representations} of $y$ and $x$ (i.e. their images under the probability integral transform), and (ii) estimating the copula entropies $h\left(u_y\right)$, $h\left(u_x\right)$ and $h\left(u_y, u_x\right)$ by solving a maximum-entropy problem over the space of copula densities under a constraint of the type $\bm{\alpha}_m = E\left[\phi_m(u_y, u_x)\right]$. We prove that, so long as the constraint is feasible, this problem admits a unique solution, it is in the exponential family, and it can be learned by solving a convex optimization problem. The resulting estimator, which we denote MIND, is marginal-invariant, always non-negative, unbounded for any sample size $n$, consistent, has MSE rate $O(1/n)$, and is more data-efficient than competing approaches.
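The empirical counterpart of the probability integral transform in step (i) can be sketched with normalized ranks. The helper name `copula_uniform` and the normalization `rank / (n + 1)` are assumptions made for illustration; this is not the MIND estimator itself. The example shows why such a representation is marginal-invariant: a strictly increasing transformation of the data leaves the ranks unchanged.

```python
import math


def copula_uniform(values):
    """Map a sample into (0, 1) through normalized ranks.

    Assumes there are no ties; rank / (n + 1) is an illustrative convention.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


x = [0.3, 2.5, 1.1, 7.0, 4.2]
u_x = copula_uniform(x)
# exp is strictly increasing, so the ranks (and hence u) do not change.
u_exp_x = copula_uniform([math.exp(v) for v in x])
```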

Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms. In the policy evaluation setting, it can model the linear and nonlinear temporal difference learning with gradient correction (TDC) algorithms as linear SA and nonlinear SA, respectively. In the policy optimization setting, two timescale nonlinear SA can also model the greedy gradient-Q (Greedy-GQ) algorithm. In previous studies, non-asymptotic analyses of linear TDC and Greedy-GQ have been carried out in the Markovian setting, with a single-sample update at each iteration. For the nonlinear TDC algorithm, only asymptotic convergence has been established. In this paper, we study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ under Markovian sampling and with mini-batch data for each update. For linear TDC, we provide a novel non-asymptotic analysis and our sample complexity result achieves the complexity $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$. For nonlinear TDC and Greedy-GQ, we show that both algorithms attain an $\epsilon$-accurate stationary solution with sample complexity $\mathcal{O}(\epsilon^{-2})$. This is the first non-asymptotic convergence result established for nonlinear TDC, and our result for Greedy-GQ improves on the previous result in order by a factor of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$.

In this paper, we study a number of well-known combinatorial optimization problems that fit in the following paradigm: the input is a collection of (potentially inconsistent) local relationships between the elements of a ground set (e.g., pairwise comparisons, similar/dissimilar pairs, or ancestry structure of triples of points), and the goal is to aggregate this information into a global structure (e.g., a ranking, a clustering, or a hierarchical clustering) in a way that maximizes agreement with the input. Well-studied problems such as rank aggregation, correlation clustering, and hierarchical clustering with triplet constraints fall in this class of problems.

We study these problems on stochastic instances with a hidden embedded ground truth solution. Our main algorithmic contribution is a unified technique that uses the maximum cut problem in graphs to approximately solve these problems. Using this technique, we can often get approximation guarantees in the stochastic setting that are better than the known worst case inapproximability bounds for the corresponding problem. On the negative side, we improve the worst case inapproximability bound on several hierarchical clustering formulations through a reduction to related ranking problems.

In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected sum of rewards obtained can be inappropriate. Motivated by the problem of optimizing job assignments to train novice workers of unknown quality in labor platforms, we consider a new objective in the classical setup. Instead of maximizing the expected total reward from $T$ pulls, we consider the vector of cumulative rewards earned from the $K$ arms at the end of $T$ pulls, and aim to maximize the expected value of the highest cumulative reward across the $K$ arms. This corresponds to the objective of training a single, highly skilled worker using a limited supply of training jobs. For this new objective, we show that any policy must incur an instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher instance-dependent constant compared to the traditional objective) and an instance-independent regret of $\Omega(K^{1/3}T^{2/3})$. We then design an explore-then-commit policy, featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). Our numerical experiments demonstrate the efficacy of this policy compared to several natural alternatives in practical …

We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for contextual bandits, most of them rely on finding the maximum likelihood estimator at each iteration, which requires $O(t)$ time at the $t$-th iteration and is memory inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant with respect to $t$, but a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson Sampling for exploration, achieves $\tilde{O}(\sqrt{dT})$ regret with a total time complexity that scales linearly in $T$ and $d$, where $T$ is the total number of rounds and $d$ is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets.

In many real-world applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the $\epsilon$-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence bound-based algorithm, RobustAgg($\epsilon$), that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise dissimilarities of reward distributions between players is known, we achieve instance-dependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise dissimilarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure.

We consider the problem of model selection for two popular stochastic linear bandit settings, and propose algorithms that adapt to the unknown problem complexity. In the first setting, we consider the $K$-armed mixture bandits, where the mean reward of arm $i \in [K]$ is $\mu_i+ \langle \alpha_{i,t},\theta^* \rangle $, with $\alpha_{i,t} \in \mathbb{R}^d$ being the known context vector and $\mu_i \in [-1,1]$ and $\theta^*$ being unknown parameters.
We define $\|\theta^*\|$ as the problem complexity and consider a sequence of nested hypothesis classes, each positing a different upper bound on $\|\theta^*\|$. Exploiting this, we propose Adaptive Linear Bandit (ALB), a novel phase based algorithm that adapts to the true problem complexity, $\|\theta^*\|$.
We show that ALB achieves regret scaling of $\widetilde{O}(\|\theta^*\|\sqrt{T})$, where $\|\theta^*\|$ is a priori unknown.
As a corollary, when $\theta^*=0$, ALB recovers the minimax regret for the simple bandit algorithm without such knowledge of $\theta^*$. ALB is the first algorithm that uses the parameter norm as a model selection criterion for linear bandits. Prior state-of-the-art algorithms achieve a regret of $\widetilde{O}(L\sqrt{T})$, where $L$ is the upper bound on $\|\theta^*\|$, fed as an input to the problem.
In the second setting, we consider the standard linear bandit problem (with possibly …

We consider the problem of listwise learning-to-rank (LTR) on data with \textit{partitioned preference}, where a set of items are sliced into ordered and disjoint partitions, but the ranking of items within a partition is unknown. The Plackett-Luce (PL) model has been widely used in listwise LTR methods. However, given $N$ items with $M$ partitions, calculating the likelihood of data with partitioned preference under the PL model has a time complexity of $O(N+S!)$, where $S$ is the maximum size of the top $M-1$ partitions. This computational challenge restrains existing PL-based listwise LTR methods to only a special case of partitioned preference, \textit{top-$K$ ranking}, where the exact order of the top $K$ items is known. In this paper, we exploit a random utility model formulation of the PL model and propose an efficient approach through numerical integration for calculating the likelihood. This numerical approach reduces the aforementioned time complexity to $O(N+MS)$, which allows training deep-neural-network-based ranking models with a large output space. We demonstrate that the proposed method outperforms well-known LTR baselines and remains scalable through both simulation experiments and applications to real-world eXtreme Multi-Label (XML) classification tasks. The proposed method also achieves state-of-the-art performance on XML datasets with relatively large numbers …
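For reference, the standard Plackett-Luce likelihood in the top-$K$ special case mentioned above can be sketched as follows. This is the textbook sequential-softmax form, not the numerical-integration approach proposed in the paper; the function name `pl_log_likelihood` and the example scores are hypothetical.

```python
import math
from itertools import permutations


def pl_log_likelihood(scores, top_k_order):
    """Log-probability, under Plackett-Luce, that the items in `top_k_order`
    occupy the top positions in exactly that order.

    `scores` holds one real-valued utility per item; items are indexed 0..N-1.
    """
    remaining = set(range(len(scores)))
    log_lik = 0.0
    for item in top_k_order:
        # Each pick is a softmax over the items that have not been ranked yet.
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_lik += scores[item] - log_norm
        remaining.remove(item)
    return log_lik


scores = [1.2, 0.3, -0.5]
# Probabilities of all full rankings of three items should sum to one.
total = sum(math.exp(pl_log_likelihood(scores, p)) for p in permutations(range(3)))
```

When only the partition membership is known, the likelihood sums this quantity over every ordering consistent with the partitions, which is where the factorial cost cited in the abstract comes from.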

Let $\mathbf{A} = \mathbf{L}_0 + \mathbf{S}_0$, where $\mathbf{L}_0 \in \mathbb{R}^{d\times d}$ is low rank and $\mathbf{S}_0$ is a perturbation matrix.
We study the principal subspace estimation of $\mathbf{L}_0$ through observations $\mathbf{y}_j = f(\mathbf{A})\mathbf{x}_j$, $j=1,\dots,n$, where
$f:\mathbb{R}\rightarrow \mathbb{R}$ is an unknown polynomial and $\mathbf{x}_j$'s are i.i.d. random input signals. Such models are widely used in graph signal processing
to model information diffusion dynamics over networks with applications in network topology inference and data analysis. We develop an estimation procedure based on nuclear norm penalization, and establish upper bounds on the principal subspace estimation error when $\mathbf{A}$ is the adjacency matrix of a random graph generated by $\mathbf{L}_0$.
Our theory shows that when the signal strength is strong enough, the exact rank of $\mathbf{L}_0$ can be recovered.
By applying our results to blind community detection, we show that consistency of spectral clustering can be achieved for some popular stochastic block models.
Together with the experimental results, our theory shows that there is a fundamental limit to using the principal components obtained from diffused graph signals, which is commonly adopted in current practice. Finally, under some structured perturbation $\mathbf{S}_0$, we build the connection between this model and the spiked covariance model and develop a …

In this work we consider the problem of online submodular maximization under a cardinality constraint with differential privacy (DP). A stream of T submodular functions over a common finite ground set U arrives online, and at each time-step the decision maker must choose at most k elements of U before observing the function. The decision maker obtains a profit equal to the function evaluated on the chosen set and aims to learn a sequence of sets that achieves low expected regret.
In the full-information setting, we develop an $(\varepsilon,\delta)$-DP algorithm with expected (1-1/e)-regret bound of $O( \frac{k^2\log |U|\sqrt{T \log k/\delta}}{\varepsilon} )$. This algorithm contains k ordered experts that learn the best marginal increments for each item over the whole time horizon while maintaining privacy of the functions. In the bandit setting, we provide an $(\varepsilon,\delta+ O(e^{-T^{1/3}}))$-DP algorithm with expected (1-1/e)-regret bound of $O( \frac{\sqrt{\log k/\delta}}{\varepsilon} (k (|U| \log |U|)^{1/3})^2 T^{2/3} )$. One challenge for privacy in this setting is that the payoff and feedback of expert i depends on the actions taken by her i-1 predecessors. This particular type of information leakage is not covered by post-processing, and new analysis is required. Our techniques for maintaining privacy with feedforward may …

The use of min-max optimization in the adversarial training of deep neural network classifiers, and the training of generative adversarial networks has motivated the study of nonconvex-nonconcave optimization objectives, which frequently arise in these applications. Unfortunately, recent results have established that even approximate first-order stationary points of such objectives are intractable, even under smoothness conditions, motivating the study of min-max objectives with additional structure. We introduce a new class of structured nonconvex-nonconcave min-max optimization problems, proposing a generalization of the extragradient algorithm which provably converges to a stationary point. The algorithm applies not only to Euclidean spaces, but also to general $\ell_p$-normed finite-dimensional real vector spaces. We also discuss its stability under stochastic oracles and provide bounds on its sample complexity. Our iteration complexity and sample complexity bounds either match or improve the best known bounds for the same or less general nonconvex-nonconcave settings, such as those that satisfy variational coherence or in which a weak solution to the associated variational inequality problem is assumed to exist.

We study probabilistic prediction games when the underlying model is
misspecified, investigating the consequences of predicting using an
incorrect parametric model. We show that for a broad class of loss
functions and parametric families of distributions, the regret of playing
a ``proper'' predictor---one from the putative model class---relative to
the best predictor in the same model class has lower bound scaling at
least as $\sqrt{\gamma n}$, where $\gamma$ is a measure of the model
misspecification to the true distribution in terms of total variation
distance. In contrast, using an aggregation-based (improper) learner, one
can obtain regret $d \log n$ for any underlying generating distribution,
where $d$ is the dimension of the parameter; we exhibit instances in which
this is unimprovable even over the family of all learners that may play
distributions in the convex hull of the parametric family. These results
suggest that simple strategies for aggregating multiple learners together
should be more robust, and several experiments conform to this hypothesis.

Random forests have become an important tool for improving accuracy in regression and classification problems since their inception by Leo Breiman in 2001. In this paper, we revisit a historically important random forest model, called centered random forests, originally proposed by Breiman in 2004 and later studied by G\'erard Biau in 2012, where a feature is selected at random and the split occurs at the midpoint of the node along the chosen feature. If the regression function is $d$-dimensional and Lipschitz, we show that, given access to $n$ observations, the mean-squared prediction error is $O((n(\log n)^{(d-1)/2})^{-\frac{1}{d\log2+1}})$. This positively answers an outstanding question of Biau about whether the rate of convergence for this random forest model could be improved beyond $O(n^{-\frac{1}{d(4/3)\log2+1}})$. Furthermore, by a refined analysis of the approximation and estimation errors for linear models, we show that our new rate cannot be improved in general. Finally, we generalize our analysis and improve current prediction error bounds for another random forest model, called median random forests, in which each tree is constructed from subsampled data and the splits are performed at the empirical median along a chosen feature.
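The centered splitting rule described above can be sketched in a few lines. This is a minimal illustration, assuming features in the unit cube $[0,1]^d$, a fixed tree depth, and a global-mean fallback for empty cells; the function names and these choices are assumptions, not details from the paper. For simplicity the sketch grows only the branch that contains the query point, which suffices for a single prediction.

```python
import random


def centered_tree_predict(X, y, query, depth, rng):
    """Predict at `query` with one centered tree on the unit cube [0, 1]^d."""
    d = len(query)
    lower, upper = [0.0] * d, [1.0] * d
    idx = list(range(len(X)))  # training points in the current cell
    for _ in range(depth):
        j = rng.randrange(d)               # feature selected uniformly at random
        mid = 0.5 * (lower[j] + upper[j])  # split at the midpoint of the cell
        if query[j] < mid:
            upper[j] = mid
            idx = [i for i in idx if X[i][j] < mid]
        else:
            lower[j] = mid
            idx = [i for i in idx if X[i][j] >= mid]
    if not idx:
        # Empty cell: fall back to the global mean (an assumption of this sketch).
        return sum(y) / len(y)
    return sum(y[i] for i in idx) / len(idx)


def centered_forest_predict(X, y, query, depth=3, n_trees=50, seed=0):
    """Average the predictions of `n_trees` independent centered trees."""
    rng = random.Random(seed)
    preds = [centered_tree_predict(X, y, query, depth, rng) for _ in range(n_trees)]
    return sum(preds) / n_trees


# Regular grid of 100 points in [0, 1]^2 with a constant response.
X = [[i / 10 + 0.05, j / 10 + 0.05] for i in range(10) for j in range(10)]
y_const = [2.0 for _ in X]
pred_const = centered_forest_predict(X, y_const, [0.3, 0.7])

# Response equal to the first coordinate.
y_lin = [p[0] for p in X]
pred_lin = centered_forest_predict(X, y_lin, [0.3, 0.7])
```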

The Gaussian-smoothed optimal transport (GOT) framework, recently proposed by Goldfeld et al., scales to high dimensions in estimation and provides an alternative to entropy regularization. This paper provides convergence guarantees for estimating the GOT distance under more general settings. For the Gaussian-smoothed $p$-Wasserstein distance in $d$ dimensions, our results require only the existence of a moment greater than $d + 2p$. For the special case of sub-gamma distributions, we quantify the dependence on the dimension $d$ and establish a phase transition with respect to the scale parameter. We also prove convergence for dependent samples, only requiring a condition on the pairwise dependence of the samples measured by the covariance of the feature map of a kernel space.
A key step in our analysis is to show that the GOT distance is dominated by a family of kernel maximum mean discrepancy (MMD) distances with a kernel that depends on the cost function as well as the amount of Gaussian smoothing. This insight provides further interpretability for the GOT framework and also introduces a class of kernel MMD distances with desirable properties. The theoretical results are supported by numerical experiments.

The problem of \emph{Offline Policy Evaluation} (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications. Existing work on OPE mostly focuses on evaluating a \emph{fixed} target policy $\pi$, which does not provide useful bounds for offline policy learning as $\pi$ will then be data-dependent. We address this problem by \emph{simultaneously} evaluating all policies in a policy class $\Pi$ --- uniform convergence in OPE --- and obtain nearly optimal error bounds for a number of global / local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an $\epsilon$-optimal policy under the \emph{time-inhomogeneous episodic} MDP model ($H$ is the planning horizon, $d_m$ is a quantity that reflects the exploration of the logging policy $\mu$). To the best of our knowledge, this is the first time the optimal rate is shown to be possible for the offline RL setting and the paper is the first that systematically investigates uniform convergence in OPE.

… ``density of states,'' the DoSE decision rule avoids direct comparison of model probabilities, and instead utilizes the ``probability of the model probability,'' or indeed the frequency of any reasonable statistic. The frequency is calculated using nonparametric density estimators (e.g., KDE and one-class SVM) which measure the typicality of various model statistics given the training data and from which we can flag test points with low typicality as anomalous. Unlike many other methods, DoSE requires neither labeled data nor OOD examples. DoSE is modular and can be trivially applied to any existing, trained model. We demonstrate DoSE's state-of-the-art performance against other unsupervised OOD detectors on previously established ``hard'' benchmarks.

We provide a general constrained risk inequality that applies to arbitrary
non-decreasing losses, extending a result of Brown and Low
[\emph{Ann.~Stat.~1996}]. Given two distributions $P_0$ and $P_1$, we find
a lower bound for the risk of estimating a parameter $\theta(P_1)$ under
$P_1$ given an upper bound on the risk of estimating the parameter
$\theta(P_0)$ under $P_0$. The inequality is a useful pedagogical tool, as
its proof relies only on the Cauchy-Schwarz inequality, it applies to
general losses, and it transparently gives risk lower bounds on
super-efficient and adaptive estimators.

Motivated by, e.g., sensitivity analysis and end-to-end learning, the demand for differentiable optimization algorithms has been increasing. This paper presents a theoretically guaranteed differentiable greedy algorithm for monotone submodular function maximization. We smooth the greedy algorithm via randomization, and prove that it almost recovers original approximation guarantees in expectation for the cases of cardinality and $\kappa$-extendible system constraints. We then present how to efficiently compute gradient estimators of any expected output-dependent quantities. We demonstrate the usefulness of our method by instantiating it for various applications.

In this paper we study the statistical properties of Laplacian smoothing, a graph-based approach to nonparametric regression. Under standard regularity conditions, we establish upper bounds on the error of the Laplacian smoothing estimator \smash{$\widehat{f}$}, and a goodness-of-fit test also based on \smash{$\widehat{f}$}. These upper bounds match the minimax optimal estimation and testing rates of convergence over the first-order Sobolev class $H^1(\mathcal{X})$, for $\mathcal{X} \subseteq \mathbb{R}^d$ and $1 \leq d < 4$; in the estimation problem, for $d = 4$, they are optimal modulo a $\log n$ factor. Additionally, we prove that Laplacian smoothing is manifold-adaptive: if $\mathcal{X} \subseteq \mathbb{R}^d$ is an $m$-dimensional manifold with $m < d$, then the error rate of Laplacian smoothing (in either estimation or testing) depends only on $m$, in the same way it would if $\mathcal{X}$ were a full-dimensional set in $\mathbb{R}^m$.
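For orientation, Laplacian smoothing is commonly written as a penalized least-squares problem with a closed-form solution. The display below is a sketch under assumed notation (responses $Y \in \mathbb{R}^n$, graph Laplacian $L$ of a neighborhood graph built on the design points, tuning parameter $\rho > 0$); none of these symbols appear in the abstract itself.

```latex
\widehat{f}
  = \operatorname*{argmin}_{f \in \mathbb{R}^n}
    \; \sum_{i=1}^n (Y_i - f_i)^2 + \rho\, f^\top L f
  = (I + \rho L)^{-1} Y .
```

The second equality follows by setting the gradient $2(f - Y) + 2\rho L f$ to zero.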

It is important to collect credible training samples $(x,y)$ for building data-intensive learning systems (e.g., a deep learning system). Asking people to report a complex distribution $p(x)$, though theoretically viable, is challenging in practice. This is primarily due to the cognitive load required for human agents to form the report of this highly complicated information. While classical elicitation mechanisms apply to eliciting a complex and generative (and continuous) distribution $p(x)$, we are interested in eliciting samples $x_i \sim p(x)$ from agents directly. We coin the above problem sample elicitation. This paper introduces a deep learning aided method to incentivize credible sample contributions from self-interested and rational agents. We show that with an accurate estimation of a certain $f$-divergence function we can achieve approximate incentive compatibility in eliciting truthful samples. We then present an efficient estimator with theoretical guarantees via studying the variational forms of the $f$-divergence function. We also show a connection between this sample elicitation problem and $f$-GAN, and how this connection can help reconstruct an estimator of the distribution based on collected samples. Experiments on synthetic data, MNIST, and CIFAR-10 datasets demonstrate that our mechanism elicits truthful samples. Our implementation is available at https://github.com/weijiaheng/Credible-sample-elicitation.git.

This paper addresses the meta-learning problem in sparse linear regression with infinite tasks. We assume that the learner can access several similar tasks. The goal of the learner is to transfer knowledge from the prior tasks to a similar but novel task. For $p$ parameters, size of the support set $k$, and $l$ samples per task, we show that $T \in O((k \log (p-k)) / l)$ tasks are sufficient in order to recover the common support of all tasks. With the recovered support, we can greatly reduce the sample complexity for estimating the parameter of the novel task, i.e., $l \in O(1)$ with respect to $T$ and $p$. We also prove that our rates are minimax optimal. A key difference between meta-learning and the classical multi-task learning, is that meta-learning focuses only on the recovery of the parameters of the novel task, while multi-task learning estimates the parameter of all tasks, which requires $l$ to grow with $T$. Instead, our efficient meta-learning estimator allows for $l$ to be constant with respect to $T$ (i.e., few-shot learning).

… ``robust control design'' (which assumes small model uncertainty) and ``chasing consistent models'', which can be solved using existing tools from control theory and online learning, respectively. We provide a learning convergence analysis that yields a finite mistake bound on the number of times performance requirements are not met and can provide strong safety guarantees, by bounding the worst-case state deviation. To the best of our knowledge, this is the first approach for online robust control of nonlinear systems with such learning theoretic and safety guarantees. We also show how to instantiate this framework for general robotic systems, demonstrating the practicality of our approach.

We provide a general and easy-to-implement method for reducing the number of parameters of Convolutional Neural Networks (CNNs) during the training and inference phases. We introduce a simple trainable auxiliary neural network which can generate approximate versions of ``slices'' of the sets of convolutional filters of any CNN architecture from a low dimensional ``code'' space. These slices are then concatenated to form the sets of filters in the CNN architecture. The auxiliary neural network, which we call “Convolutional Slice Generator” (CSG), is unique to the network and provides the association among its convolutional layers. We apply our method to various CNN architectures including ResNet, DenseNet, MobileNet and ShuffleNet. Experiments on CIFAR-10 and ImageNet-1000, without any hyper-parameter tuning, show that our approach reduces the network parameters by approximately $2\times$ while the reduction in accuracy is confined to within one percent and sometimes the accuracy even improves after compression. Interestingly, through our experiments, we show that even when the CSG takes random binary values for its weights that are not learned, acceptable performance is still achieved. To show that our approach generalizes to other tasks, we apply it to an image segmentation architecture, Deeplab V3, on the Pascal VOC 2012 …

We consider the problem of estimating a function from $n$ noisy samples whose discrete Total Variation (TV) is bounded by $C_n$. We reveal a deep connection to the seemingly disparate problem of \emph{Strongly Adaptive} online learning [Daniely et al 2015] and provide an $O(n \log n)$ time algorithm that attains the near minimax optimal rate of $\tilde O (n^{1/3}C_n^{2/3})$ under squared error loss. The resulting algorithm runs online and optimally \emph{adapts} to the \emph{unknown} smoothness parameter $C_n$. This leads to a new and more versatile alternative to wavelets-based methods for (1) adaptively estimating TV bounded functions; (2) online forecasting of TV bounded trends in time series.

We study the safe reinforcement learning problem using constrained Markov decision processes in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with function approximation where the Markov transition kernels have a linear structure but do not impose any additional assumptions on the sampling model. Designing safe reinforcement learning algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. We prove that the proposed algorithm achieves an $\tilde{O}(d H^{2.5}\sqrt{T})$ regret and an $\tilde{O}(d H^{2.5}\sqrt{T})$ constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the horizon of each episode, and $T$ is the total number of steps. These bounds hold when the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the …

Recently, invariant risk minimization (IRM) (Arjovsky et al. 2019) was proposed as a promising solution to address out-of-distribution (OOD) generalization. In Ahuja et al. (2020), it was shown that solving for the Nash equilibria of a new class of “ensemble-games” is equivalent to solving IRM. In this work, we extend the framework in Ahuja et al. (2020) for linear regressions by projecting the ensemble-game on an $\ell_{\infty}$ ball. We show that such projections help achieve non-trivial out-of-distribution guarantees despite not achieving perfect invariance. For linear models with confounders, we prove that Nash equilibria of these games are closer to the ideal OOD solutions than the standard empirical risk minimization (ERM) and we also provide learning algorithms that provably converge to these Nash Equilibria. Empirical comparisons of the proposed approach with the state-of-the-art show consistent gains in achieving OOD solutions in several settings involving anti-causal variables and confounders.

Given a set $\mathcal{C}=\{C_i\}_{i=1}^m$ of square matrices,
the matrix blind joint block diagonalization problem (BJBDP) is to
find a full column rank matrix $A$ such that $C_i=A\Sigma_iA^{\top}$ for all $i$,
where $\Sigma_i$'s are all block diagonal matrices with as many diagonal blocks as possible.
The BJBDP plays an important role in independent subspace analysis.
This paper considers the identification problem for BJBDP, that is,
under what conditions and by what means, we can identify the diagonalizer $A$ and the block diagonal structure of $\Sigma_i$,
especially when there is noise in $C_i$'s.
In this paper, we propose a ``bi-block diagonalization'' method to solve BJBDP,
and establish sufficient conditions for when the method is able to accomplish the task.
Numerical simulations validate our theoretical results.
To the best of the authors' knowledge, current numerical methods for BJBDP
have no theoretical guarantees for the identification of the exact solution,
whereas our method does.

This paper studies the generalization bounds for the empirical saddle point (ESP) solution to stochastic saddle point (SSP) problems. For SSP with Lipschitz continuous and strongly convex-strongly concave objective functions, we establish an $\mathcal{O}(1/n)$ generalization bound by using a probabilistic stability argument. We also provide generalization bounds under a variety of assumptions, including the cases without strong convexity and without bounded domains. We illustrate our results in three examples: batch policy learning in Markov decision process, stochastic composite optimization problem, and mixed strategy Nash equilibrium estimation for stochastic games. In each of these examples, we show that a regularized ESP solution enjoys a near-optimal sample complexity. To the best of our knowledge, this is the first set of results on the generalization theory of ESP.

This work investigates the predictive performance of the classical $k$ Nearest Neighbors ($k$-NN) algorithm when the testing data are corrupted by random perturbation. The impact of corruption level on the asymptotic regret is carefully characterized and we reveal a phase-transition phenomenon: when the corruption level of the random perturbation $\omega$ is below a critical order (i.e., small-$\omega$ regime), the asymptotic regret remains the same; when it is beyond that order (i.e., large-$\omega$ regime), the asymptotic regret deteriorates polynomially. More importantly, the regret of the $k$-NN classifier heuristically matches the rate of minimax regret for randomly perturbed testing data, which implies the strong robustness of $k$-NN against random perturbation on testing data. In fact, we show that the classical $k$-NN can achieve no worse predictive performance, compared to the NN classifiers trained via the popular noise-injection strategy. Our numerical experiments also illustrate that combining a $k$-NN component with modern learning algorithms inherits the strong robustness of $k$-NN. As a technical by-product, we prove that under different model assumptions, the pre-processed 1-NN proposed in \cite{xue2017achieving} will at most achieve a sub-optimal rate when the data dimension $d>4$ even if $k$ is chosen optimally in the pre-processing step.
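
The setting of a $k$-NN classifier evaluated on randomly perturbed test points can be sketched in a few lines of Python. Everything below is illustrative: the uniform noise model, the helper names `knn_predict` and `perturb`, and the toy training set are assumptions made for the example, not the perturbation model or experiments of the paper.

```python
import math
import random
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def perturb(point, omega, rng):
    """Add independent uniform noise in [-omega, omega] to each coordinate."""
    return tuple(x + omega * rng.uniform(-1.0, 1.0) for x in point)


# Hypothetical one-dimensional training set with two well-separated classes.
train_points = [(-2.0,), (-1.5,), (-1.0,), (1.0,), (1.5,), (2.0,)]
train_labels = [0, 0, 0, 1, 1, 1]

rng = random.Random(0)
clean_query = (-1.2,)
clean_prediction = knn_predict(train_points, train_labels, clean_query, k=3)

# Small corruption level: the perturbed query stays inside its own cluster,
# so every prediction agrees with the clean one.
small_omega_predictions = [
    knn_predict(train_points, train_labels, perturb(clean_query, 0.1, rng), k=3)
    for _ in range(100)
]
```

In this toy example a small $\omega$ leaves the neighbor set unchanged, while a large enough $\omega$ can push the query into the other cluster; the abstract's phase transition concerns how the regret behaves between such regimes.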

Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data. It has recently emerged as one of the leading learning paradigms in the absence of labels across many different domains (e.g. brain imaging, text, images). However, theoretical understanding of many aspects of training, both statistical and algorithmic, remains fairly elusive.

In this work, we study the setting of time series---more precisely, when we get data from a strong-mixing continuous-time stochastic process. We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case. Moreover, we give sample complexity bounds for solving this task and quantitatively characterize what the value of the contrastive loss implies for distributional closeness of the learned kernel. As a byproduct, we illuminate the appropriate settings for the contrastive distribution, as well as other hyperparameters in this setup.

This paper presents the first non-asymptotic result showing a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap. We prove that the optimistic Q-learning studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\!\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap of the optimal Q-function. This bound matches the information theoretical lower bound in terms of $S,A,T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.

Numerous tasks in machine learning and artificial intelligence have been modeled as submodular maximization problems. These problems usually involve sensitive data about individuals, and in addition to maximizing the utility, privacy concerns should be considered. In this paper, we study the general framework of non-negative monotone submodular maximization subject to matroid or knapsack constraints in both offline and online settings. For the offline setting, we propose a differentially private $(1-\frac{\kappa}{e})$-approximation algorithm, where $\kappa\in[0,1]$ is the total curvature of the submodular set function, which improves upon prior works in terms of approximation guarantee and query complexity under the same privacy budget. In the online setting, we propose the first differentially private algorithm, and we specify the conditions under which the regret bound scales as $\mathcal{O}(\sqrt{T})$, i.e., privacy could be ensured while maintaining the same regret bound as the optimal regret guarantee in the non-private setting.

There is an increasing use of algorithms to inform decisions in many settings, from student evaluations and college admissions to credit scoring. These decisions are made by applying a decision rule to individuals' observed features. Given the impacts of these decisions on individuals, decision makers are increasingly required to be transparent about their decision making in order to offer the ``right to explanation.'' Meanwhile, being transparent also invites potential manipulation, also known as gaming, in which individuals use this knowledge to strategically alter their features in order to receive a more beneficial decision.

In this work, we study the problem of \emph{robust} decision-making under strategic behavior. Prior works often assume that the decision maker has full knowledge of individuals' cost structure for manipulations. We study the robust variant that relaxes this assumption: The decision maker does not have full knowledge but knows only a subset of the individuals' available actions and associated costs. To approach this non-quantifiable uncertainty, we define robustness based on the worst-case guarantee of a decision, over all possible actions (including actions unknown to the decision maker) individuals might take. A decision rule is called \emph{robust optimal} if its worst case performance is (weakly) better than that of all …

This paper considers the problem of high-dimensional sparse precision matrix estimation under Laplacian constraints. We prove that the Laplacian constraints bring favorable properties for estimation: the Gaussian maximum likelihood estimator exists and is unique almost surely on the basis of one observation, irrespective of the dimension. We establish the optimal rate of convergence under Frobenius norm by the derivation of the minimax lower and upper bounds. The minimax lower bound is obtained by applying Le Cam-Assouad's method with a novel construction of a subparameter space of multivariate normal distributions. The minimax upper bound is established by designing an adaptive $\ell_1$-norm regularized maximum likelihood estimation method and quantifying the rate of convergence. We prove that the proposed estimator attains the optimal rate of convergence with an overwhelming probability. Numerical experiments demonstrate the effectiveness of the proposed estimator.

We investigate the misspecified linear contextual bandit (MLCB) problem, which is a generalization of the linear contextual bandit (LCB) problem. The MLCB problem is a decision-making problem in which a learner observes $d$-dimensional feature vectors, called arms, chooses an arm from $K$ arms, and then obtains a reward from the chosen arm in each round. The learner aims to maximize the sum of the rewards over $T$ rounds. In contrast to the LCB problem, the rewards in the MLCB problem may not be represented by a linear function in feature vectors; instead, they are approximated by a linear function with additive approximation parameter $\varepsilon \geq 0$. In this paper, we propose an algorithm that achieves $\tilde{O}(\sqrt{dT\log(K)} + \varepsilon\sqrt{d}T)$ regret, where $\tilde{O}(\cdot)$ ignores polylogarithmic factors in $d$ and $T$. This is the first algorithm that guarantees a high-probability regret bound for the MLCB problem without knowledge of the approximation parameter $\varepsilon$.

Optimal transport (OT) distances are increasingly used as loss functions for statistical inference, notably in the learning of generative models or supervised learning. Yet, the behavior of minimum Wasserstein estimators is poorly understood, notably in high-dimensional regimes or under model misspecification. In this work we adopt the viewpoint of projection robust (PR) OT, which seeks to maximize the OT cost between two measures by choosing a $k$-dimensional subspace onto which they can be projected. Our first contribution is to establish several fundamental statistical properties of PR Wasserstein distances, complementing and improving previous literature that has been restricted to one-dimensional and well-specified cases. Next, we propose the integral PR Wasserstein (IPRW) distance as an alternative to the PRW distance, by averaging rather than optimizing on subspaces. Our complexity bounds can help explain why both PRW and IPRW distances outperform Wasserstein distances empirically in high-dimensional inference tasks. Finally, we consider parametric inference using the PRW distance. We provide an asymptotic guarantee of two types of minimum PRW estimators and formulate a central limit theorem for max-sliced Wasserstein estimator under model misspecification. To enable our analysis on PRW with projection dimension larger than one, we devise a novel combination of variational analysis and …
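
The contrast between optimizing and averaging over subspaces can be illustrated with a small Python sketch for projection dimension $k=1$ in the plane. This is a toy under stated assumptions, not the estimators analyzed in the paper: the maximum is approximated by a grid search over directions, the average is taken as the root mean of squared projected distances (a sliced-Wasserstein-style normalization assumed here), and all names and data are hypothetical.

```python
import math


def wasserstein2_1d(xs, ys):
    """W2 distance between two equal-size 1-D empirical measures."""
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(xs))


def projected_distances(X, Y, num_directions):
    """W2 distances of the 1-D projections along a grid of directions."""
    distances = []
    for t in range(num_directions):
        theta = math.pi * t / num_directions
        u = (math.cos(theta), math.sin(theta))
        px = [x[0] * u[0] + x[1] * u[1] for x in X]
        py = [y[0] * u[0] + y[1] * u[1] for y in Y]
        distances.append(wasserstein2_1d(px, py))
    return distances


# Hypothetical data: Y is X shifted by 2 along the first axis.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
Y = [(x + 2.0, y) for x, y in X]

distances = projected_distances(X, Y, num_directions=180)

# Optimizing over directions (grid approximation of the k = 1 PRW distance).
prw_k1 = max(distances)
# Averaging over directions (assumed root-mean-square form of the integral).
iprw_k1 = math.sqrt(sum(d ** 2 for d in distances) / len(distances))
```

Here the maximizing direction is the shift direction itself, while the averaged quantity is smaller because most directions see only part of the shift.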

Using only samples from a probabilistic model, we predict properties of the model and of future observations. The prediction game continues in an online fashion as the sample size grows with new observations. After each prediction, the predictor incurs a binary (0-1) loss. The probability model underlying a sample is otherwise unknown except that it belongs to a known class of models. The goal is to make finitely many errors (i.e. loss of 1) with probability 1 under the generating model, no matter what it may be in the known model class.

Model classes admitting predictors that make only finitely many errors are eventually almost surely (eas) predictable. When the losses incurred are observable (the supervised case), we completely characterize eas predictable classes. We provide analogous results in the unsupervised case. Our results have a natural interpretation in terms of regularization. In eas-predictable classes, we study if it is possible to have a universal stopping rule that identifies (to any given confidence) when no more errors will be made. Classes admitting such a stopping rule are eas learnable. When samples are generated iid, we provide a complete characterization of eas learnability. We also study cases when samples are not generated …

We consider active learning (AL) in an uncertain environment in which the trade-off between multiple risk measures needs to be considered. As an AL problem in such an uncertain environment, we study the Mean-Variance Analysis in Bayesian Optimization (MVA-BO) setting. Mean-variance analysis was developed in the field of financial engineering and has been used to make decisions that take into account the trade-off between the average and variance of investment uncertainty. In this paper, we specifically focus on the BO setting with an uncertain component and consider multi-task, multi-objective, and constrained optimization scenarios for the mean-variance trade-off of the uncertain component. When the target blackbox function is modeled by a Gaussian Process (GP), we derive the bounds of the two risk measures and propose an AL algorithm for each of the above three scenarios based on the risk measure bounds. We show the effectiveness of the proposed AL algorithms through theoretical analysis and numerical experiments.

The Iterative Hard Thresholding (IHT) algorithm is one of the most popular and promising greedy pursuit methods for high-dimensional statistical estimation under cardinality constraint. The existing analysis of IHT mostly focuses on parameter estimation and sparsity recovery consistency. From the perspective of statistical learning theory, another fundamental question is how well the IHT estimation would perform on unseen samples. The answer to this question is important for understanding the generalization ability of IHT yet has remained elusive. In this paper, we investigate this problem and develop a novel generalization theory for IHT from the viewpoint of algorithmic stability. Our theory reveals that: 1) under natural conditions on the empirical risk function over $n$ samples of dimension $p$, IHT with sparsity level $k$ enjoys an $\mathcal{\tilde O}(n^{-1/2}\sqrt{k\log(n)\log(p)})$ rate of convergence in sparse excess risk; and 2) a fast rate of order $\mathcal{\tilde O}(n^{-1}k(\log^3(n)+\log(p)))$ can be derived for strongly convex risk function under certain strong-signal conditions. The results have been instantiated for sparse linear regression and logistic regression models, with numerical evidence provided to support our theory.
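
For readers unfamiliar with IHT, the following is a minimal pure-Python sketch of the standard iteration for sparse least squares: a gradient step followed by keeping the $k$ largest-magnitude coordinates. The function names, the step size, and the toy data are assumptions chosen for illustration; the sketch does not reproduce the paper's analysis or experiments.

```python
def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w and set the rest to zero."""
    keep = set(sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k])
    return [w[j] if j in keep else 0.0 for j in range(len(w))]


def iht_least_squares(X, y, k, step, iterations):
    """Gradient steps on 0.5 * ||Xw - y||^2, each followed by hard thresholding."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iterations):
        residual = [sum(X[i][j] * w[j] for j in range(p)) - y[i] for i in range(n)]
        gradient = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        w = hard_threshold([w[j] - step * gradient[j] for j in range(p)], k)
    return w


# Hypothetical toy problem: identity design, so the best 2-sparse fit simply
# keeps the two largest-magnitude responses.
X = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [3.0, 0.5, -2.0, 0.1]
w_hat = iht_least_squares(X, y, k=2, step=1.0, iterations=10)
```

With a general design matrix the step size has to be chosen relative to the curvature of the loss; the unit step here is only valid because the toy design is the identity.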

Decentralized optimization problems frequently appear in large-scale machine learning. However, few works address the difficult nonconvex nonsmooth case. In this paper, we propose a decentralized primal-dual algorithm to solve this type of problem, and the proposed algorithm achieves an $\mathcal{O}(1/\epsilon^2)$ iteration complexity to attain an $\epsilon$-solution, which is the well-known lower iteration complexity bound for nonconvex optimization. To our knowledge, it is the first algorithm achieving this rate under a nonconvex, nonsmooth decentralized setting.
Furthermore, to reduce communication overhead, we also modify our algorithm by compressing the vectors exchanged between agents.
The iteration complexity of the algorithm with compression is still $\mathcal{O}(1/\epsilon^2)$.
Besides, we apply the proposed algorithm to solve a nonconvex linear regression problem and to train a deep learning model, both of which demonstrate the efficiency and efficacy of the proposed algorithm.

Transformation based methods have been an attractive approach in non-parametric inference for problems such as unconditioned and conditional density estimation due to their unique hierarchical structure that models the data as flexible transformation of a set of common latent variables. More recently, transformation based models have been used in variational inference (VI) to construct flexible implicit families of variational distributions. However, their use in both non-parametric inference and variational inference lacks theoretical justification. In the context of non-linear latent variable models (NL-LVM), we provide theoretical justification for the use of these models in non-parametric inference by showing that the support of the transformation induced prior in the space of densities is sufficiently large in the $L_1$ sense and show that for this class of priors the posterior concentrates at the optimal rate up to a logarithmic factor. Adopting the flexibility demonstrated in the non-parametric setting we use the NL-LVM to construct an implicit family of variational distributions, deemed as GP-IVI. We delineate sufficient conditions under which GP-IVI achieves optimal risk bounds and approximates the true posterior in the sense of the Kullback-Leibler divergence. To the best of our knowledge, this is the first work on providing theoretical guarantees for implicit …

Modern machine learning and deep learning models are shown to be vulnerable when testing data are slightly perturbed. Theoretical studies of adversarial training algorithms mostly focus on their adversarial training losses or local convergence properties. In contrast, this paper studies the generalization performance of a generic adversarial training algorithm. Specifically, we consider linear regression models and two-layer neural networks (with lazy training) using squared loss in the low-dimensional and high-dimensional regimes. In the former regime, after overcoming the non-smoothness of adversarial training, the adversarial risk of the trained models will converge to the minimal adversarial risk. In the latter regime, we discover that data interpolation prevents the adversarial robust estimator from being consistent (i.e. converging in probability). Therefore, inspired by successes of the least absolute shrinkage and selection operator (LASSO), we incorporate the $\mathcal{L}_1$ penalty in the high dimensional adversarial learning, and show that it leads to consistent adversarial robust estimation. A series of numerical studies are conducted to demonstrate how the smoothness and $\mathcal{L}_1$ penalization help to improve the adversarial robustness of DNN models.

Recent years have witnessed the success of adaptive (or unified) approaches in estimating symmetric properties of discrete distributions, where the learner first obtains a distribution estimator independent of the target property, and then plugs the estimator into the target property as the final estimator. Several such approaches have been proposed and proved to be adaptively optimal, i.e. they achieve the optimal sample complexity for a large class of properties within a low accuracy, especially for a large estimation error $\varepsilon\gg n^{-1/3}$ where $n$ is the sample size.
In this paper, we characterize the high accuracy limitation, or the penalty for adaptation, for general adaptive approaches. Specifically, we obtain the first known adaptation lower bound: under a mild condition, no adaptive approach can achieve the optimal sample complexity for every $1$-Lipschitz property within accuracy $\varepsilon \ll n^{-1/3}$. In particular, this result disproves a conjecture in [Acharya et al. 2017] that the profile maximum likelihood (PML) plug-in approach is optimal in property estimation for all ranges of $\varepsilon$, and confirms a conjecture in [Han and Shiragur 2020] that their competitive analysis of the PML is tight.

In this work, we present a family of vector quantization schemes vqSGD (Vector-Quantized Stochastic Gradient Descent) that provide an asymptotic reduction in the communication cost with convergence guarantees in first-order distributed optimization.
In the process we derive the following fundamental information theoretic fact: $\Theta(\frac{d}{R^2})$ bits are necessary and sufficient (up to an additive $O(\log d)$ term) to describe an unbiased estimator $\hat{g}(g)$ for any $g$ in the $d$-dimensional unit sphere, under the constraint that $\|\hat{g}(g)\|_2\le R$ almost surely.
In particular, we consider a randomized scheme based on the convex hull of a point set, that returns an unbiased estimator of a $d$-dimensional gradient vector with almost surely bounded norm.
We provide multiple efficient instances of our scheme that are near optimal and require only $o(d)$ bits of communication at the expense of a tolerable increase in error. The instances of our quantization scheme are obtained using the properties of binary error-correcting codes and provide a smooth tradeoff between the communication and the estimation error of quantization. Furthermore, we show that vqSGD also offers some automatic privacy guarantees.
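
The idea of an unbiased randomized quantizer built from the convex hull of a point set can be sketched in Python as follows. The point set used here, the $2d$ vertices $\pm\sqrt{d}\,e_i$ of a scaled cross-polytope, is an assumption chosen for illustration because its convex hull contains the unit ball; the function names are hypothetical and the sketch is not claimed to match the paper's constructions or its code-based instances.

```python
import math
import random


def vertex_probabilities(g):
    """Convex weights of g over the 2d vertices +/- sqrt(d) * e_i.

    Vertex 2*i is +sqrt(d)*e_i and vertex 2*i + 1 is -sqrt(d)*e_i.
    Requires ||g||_2 <= 1, which implies ||g||_1 <= sqrt(d).
    """
    d = len(g)
    scale = math.sqrt(d)
    slack = max(0.0, 1.0 - sum(abs(x) for x in g) / scale)
    probs = [slack / (2 * d)] * (2 * d)  # uniform mass cancels in expectation
    for i, x in enumerate(g):
        probs[2 * i + (0 if x >= 0 else 1)] += abs(x) / scale
    return probs


def vertex(index, d):
    """Decode a vertex index into the corresponding d-dimensional point."""
    point = [0.0] * d
    point[index // 2] = math.sqrt(d) if index % 2 == 0 else -math.sqrt(d)
    return point


def quantize(g, rng):
    """Sample one vertex index according to the convex weights of g."""
    probs = vertex_probabilities(g)
    return rng.choices(range(len(probs)), weights=probs)[0]


def expected_estimate(g):
    """Exact expectation of the decoded vertex, to check unbiasedness."""
    d = len(g)
    probs = vertex_probabilities(g)
    return [sum(p * vertex(idx, d)[j] for idx, p in enumerate(probs))
            for j in range(d)]


g = [0.6, -0.8, 0.0]                      # hypothetical unit-norm gradient
index = quantize(g, random.Random(0))     # what a worker would send
g_hat = vertex(index, len(g))             # what the server decodes
mean_estimate = expected_estimate(g)      # equals g up to rounding
bits = math.ceil(math.log2(2 * len(g)))   # bits needed to name one vertex
```

In this instance every decoded vector has norm exactly $\sqrt{d}$, and a single index of about $\log_2(2d)$ bits is sent, which illustrates trading a larger estimator norm for fewer bits.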

Within the machine learning community, the widely-used uniform convergence framework has been used to answer the question of how complex, over-parameterized models can generalize well to new data. This approach bounds the test error of the \emph{worst-case} model one could have fit to the data, but it has fundamental limitations. Inspired by the statistical mechanics approach to learning, we formally define and develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers from several model classes. We apply our method to compute this distribution for several real and synthetic datasets, with both linear and random feature classification models. We find that test errors tend to concentrate around a small \emph{typical} value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model on the same datasets, indicating that ``bad'' classifiers are extremely rare. We provide theoretical results in a simple setting in which we characterize the full asymptotic distribution of test errors, and we show that these indeed concentrate around a value $\varepsilon^*$, which we also identify exactly. We then formalize a more general conjecture supported by our empirical findings. Our results show that the usual style of analysis in statistical learning theory …

We introduce graph gamma process (GGP) linear dynamical systems to model real-valued multivariate time series. GGP generates $S$ latent states that are shared by $K$ different communities, each of which is characterized by its own pattern of activation probabilities imposed on an $S\times S$ directed sparse graph, and allows both $S$ and $K$ to grow without bound. For temporal pattern discovery, the latent representation under the model is used to decompose the time series into a parsimonious set of multivariate sub-sequences generated by the formed communities. In each sub-sequence, different data dimensions often share similar temporal patterns but may exhibit distinct magnitudes, which allows the superposition of all sub-sequences to exhibit diverse behaviors at different data dimensions. On both synthetic and real-world time series, the proposed nonparametric Bayesian dynamic models, which are initialized at random, consistently exhibit good predictive performance in comparison to a variety of baseline models, revealing interpretable latent state transition patterns and decomposing the time series into distinctly behaved sub-sequences.