Invited Research Talks

Note: conference talks are not included here; they can be found on the publication page.

  1. "Why Does Deep Learning Perform Deep Learning".

Despite the success of deep learning, from a theoretical standpoint it remains absurdly unclear why deep learning performs better than shallow learning. The main difficulty comes from the fact that we do not actually know how the network features at intermediate layers are learned hierarchically. Why does the first layer learn edges, the second layer textures, and the third layer patterns? Or, do they? (Spoiler alert: not really.)

We present a theory towards explaining how hierarchical feature learning is accomplished, by breaking the learning process into three principles: forward feature learning, backward feature correction, and feature purification. Together, these show how network features evolve during training and elegantly match what happens in real-world deep learning tasks.

On the technical side, we give (1) the first theory showing that deep learning can be time efficient on certain learning tasks where NO KNOWN shallow learning algorithm is efficient; and (2) the first theory showing how adversarial training of a neural network can lead to provably robust models, whereas low-complexity models such as linear models MUST be vulnerable to adversarial attacks even with infinite training data.

MSR AI Seminar Aug/11/2020
  1. "Contemporary Theory of Deep Learning".

We will present an accessible overview of recent advances in the theory of deep learning. For example, we will try to address the following questions:

  1. How does SGD find global minima efficiently when training non-convex, non-smooth networks?
  2. What classes of functions can neural networks provably learn?
  3. Why do overparameterized networks generalize?
  4. Do neural networks provably generalize better than traditional kernel methods (including neural tangent kernel)?

MSR AI Seminar
Baidu Research
  1. "Nearly Linear-Time Packing and Covering LP Solvers".

Packing and covering linear programs (PC-LPs) form an important class of LPs across computer science, operations research, and optimization. In 1993, Luby and Nisan constructed an iterative algorithm for approximately solving PC-LPs in nearly linear time, where the time complexity scales nearly linearly in N, the number of nonzero entries of the matrix, and polynomially in \(1/\varepsilon\), where \(\varepsilon\) is the (multiplicative) approximation error.

Unfortunately, existing nearly linear-time algorithms [4, 10, 23, 31, 35, 36] for solving PC-LPs require time at least proportional to \(1/\varepsilon^2\). In this paper, we break this longstanding barrier by designing a solver that runs in time \(O(N/\varepsilon)\).
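As background, the classical \(1/\varepsilon^2\)-regime solvers that the abstract contrasts against follow a multiplicative-weights pattern. The sketch below is a generic Garg–Könemann-style loop on a toy packing LP, purely for illustration; it is NOT the \(O(N/\varepsilon)\) algorithm of the talk, and the instance and parameters are hypothetical.

```python
import numpy as np

def approx_packing_lp(A, eps=0.05, rounds=200):
    """Illustrative multiplicative-weights sketch for the packing LP
    max 1^T x  s.t.  Ax <= 1, x >= 0  (entries of A in [0, 1])."""
    m, n = A.shape
    x = np.zeros(n)
    w = np.ones(m)                       # one weight per constraint
    for _ in range(rounds):
        j = np.argmin(w @ A)             # cheapest column under current weights
        step = 1.0 / A[:, j].max()       # cap the per-round load increase at 1
        x[j] += step
        w *= 1.0 + eps * A[:, j] * step  # penalize constraints this column loads
    return x / (A @ x).max()             # rescale to exact feasibility

A = np.array([[1.0, 0.5], [0.5, 1.0]])   # optimum here is x = (2/3, 2/3)
x = approx_packing_lp(A)
```

On this instance the loop alternates between the two columns and, after rescaling, lands essentially on the optimum; in general the number of rounds needed grows like \(1/\varepsilon^2\), which is exactly the barrier the paper breaks.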

Joint work with Lorenzo Orecchia

MIT Mar/02/2018
  1. "How to Swing By Saddle Points: Faster Non-Convex Optimization Than SGD".

The diverse world of deep learning research has given rise to thousands of architectures for neural networks. However, to this day, the underlying training algorithms for neural networks remain stochastic gradient descent (SGD) and its heuristic variants.

In this talk, we present a new stochastic algorithm to train any smooth neural network to \(\varepsilon\)-approximate local minima, using \(O(\varepsilon^{-3.25})\) backpropagations. The best provable result was \(O(\varepsilon^{-4})\) by SGD before this work.

More broadly, it finds \(\varepsilon\)-approximate local minima of any smooth nonconvex function using only \(O(\varepsilon^{-3.25})\) stochastic gradient computations.
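The actual algorithm of the talk mixes SGD with negative-curvature exploitation and is more involved, but the core phenomenon (gradient noise lets first-order methods escape strict saddle points) can be illustrated with a minimal, hypothetical sketch:

```python
import numpy as np

def f_grad(w):
    # f(x, y) = x^2 + y^4/4 - y^2/2: strict saddle at the origin,
    # global minima at (0, +1) and (0, -1)
    x, y = w
    return np.array([2 * x, y**3 - y])

def perturbed_sgd(w, lr=0.05, noise=0.01, steps=2000, seed=0):
    """Toy illustration (not the talk's algorithm): isotropic gradient
    noise pushes plain gradient steps off the saddle point at (0, 0)."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        w = w - lr * f_grad(w) + noise * rng.standard_normal(2)
    return w

# started on the saddle's stable manifold, the noiseless gradient in y is 0;
# with noise, the iterate escapes and settles near one of the two minima
w = perturbed_sgd(np.array([0.1, 0.0]))
```

The instability at the saddle amplifies any injected noise exponentially, so escape is fast; the talk's point is that cleverer use of curvature information makes this provably faster than SGD's \(O(\varepsilon^{-4})\) rate.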

University of Washington
Princeton University
NY Academy of Science
  1. "Optimal Experimental Design via A New Regret Minimization Framework".

The experimental design problem concerns the selection of \(k\) points from a potentially very large design pool of \(p\)-dimensional vectors, so as to maximize the statistical efficiency of regression on the selected \(k\) design points.

We achieve \(1+\varepsilon\) approximations to all popular optimality criteria for this problem, including A-optimality, D-optimality, T-optimality, E-optimality, V-optimality and G-optimality, with only \(k = O(p/\varepsilon^2)\) design points.

In contrast, to the best of our knowledge, previously (1) no polynomial-time algorithm existed for achieving any \(1+\varepsilon\) approximation for D/T/E/G-optimality, and (2) the algorithms achieving \(1+\varepsilon\) approximations for A/V-optimality required \(k \geq \Omega(p^2/\varepsilon)\) design points.
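To make the objective concrete, here is a hedged sketch of the classical greedy heuristic for D-optimality (pick the point that most increases \(\log\det\) of the information matrix). This baseline is only for illustrating the criterion; the talk's regret-minimization algorithm is different and, unlike greedy, comes with \(1+\varepsilon\) guarantees.

```python
import numpy as np

def greedy_d_optimal(X, k, ridge=1e-6):
    """Greedy D-optimal design sketch: X has one candidate point per row.
    A small ridge term keeps the early information matrices nonsingular."""
    n, p = X.shape
    M = ridge * np.eye(p)
    chosen = []
    for _ in range(k):
        # matrix-determinant lemma: det(M + x x^T) = det(M) (1 + x^T M^{-1} x),
        # so the best point maximizes the quadratic form x^T M^{-1} x
        Minv = np.linalg.inv(M)
        gains = np.einsum('ij,jk,ik->i', X, Minv, X)
        j = int(np.argmax(gains))
        chosen.append(j)
        M = M + np.outer(X[j], X[j])
    return chosen, M

# toy pool: two informative axis-aligned points plus two near-duplicates;
# greedy picks the two orthogonal, long vectors
X = np.array([[2.0, 0.0], [0.0, 2.0], [0.1, 0.1], [0.1, -0.1]])
chosen, M = greedy_d_optimal(X, 2)
```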

Joint work with Yuanzhi Li, Aarti Singh, and Yining Wang.

UC Berkeley (BLISS)
University of Washington
Oct/09/2017 Nov/07/2017
  1. "Natasha 2: Faster Non-Convex Optimization Than SGD".

We design a stochastic algorithm to train any smooth neural network to \(\varepsilon\)-approximate local minima, using \(O(\varepsilon^{-3.25})\) backpropagations. The best result was essentially \(O(\varepsilon^{-4})\) by SGD.

More broadly, it finds \(\varepsilon\)-approximate local minima of any smooth nonconvex function in rate \(O(\varepsilon^{-3.25})\), with only oracle access to stochastic gradients and Hessian-vector products.

UC Berkeley (Simons) Oct/06/2017
  1. "Recent Advances in Stochastic Convex and Non-Convex Optimization".

In this tutorial, we will provide an accessible and extensive overview of recent advances in optimization methods based on stochastic gradient descent (SGD), for both convex and non-convex tasks.

In particular, this tutorial will try to answer the following questions with theoretical support. How can we properly use momentum to speed up SGD? What is the maximum parallel speedup we can achieve for SGD? When should we use a dual or primal-dual approach in place of SGD? What is the difference between coordinate descent (e.g., SDCA) and SGD? How does variance reduction affect the performance of SGD? Why does second-order information help improve the convergence of SGD?
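The first question (how momentum speeds things up) is easy to demonstrate numerically. The sketch below uses deterministic gradients on an ill-conditioned quadratic for clarity, with textbook Polyak heavy-ball parameters; it is an illustration of the momentum phenomenon, not any specific method from the tutorial.

```python
import numpy as np

# ill-conditioned quadratic f(x) = 0.5 x^T H x with H = diag(1, 100)
H = np.diag([1.0, 100.0])
L, mu = 100.0, 1.0            # smoothness and strong convexity constants
x0 = np.array([1.0, 1.0])

def gd(steps=200):
    x = x0.copy()
    for _ in range(steps):
        x = x - (1.0 / L) * (H @ x)
    return x

def heavy_ball(steps=200):
    # Polyak momentum with the classical optimal parameters for quadratics
    lr = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
    x, x_prev = x0.copy(), x0.copy()
    for _ in range(steps):
        x, x_prev = x - lr * (H @ x) + beta * (x - x_prev), x
    return x
```

Momentum contracts the error at rate roughly \((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\) per step versus \(1-1/\kappa\) for plain gradient descent, so with condition number \(\kappa=100\) the gap after 200 steps is many orders of magnitude.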

ICML 2017 Tutorial Aug/6/2017
  1. "Faster Stochastic Gradient Descent: From Nesterov's Momentum to Katyusha Momentum".

Nesterov's momentum is famously known for being able to accelerate "full-gradient descent", and is thus useful in building fast first-order algorithms. However, in the offline stochastic setting, counterexamples exist that prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex. In this talk, I shall discuss a Katyusha momentum framework that provides the first direct acceleration to stochastic gradient descent. This work is going to appear in STOC 2017.
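Katyusha couples momentum with variance reduction. As background only, here is a sketch of SVRG-style variance reduction, the non-accelerated precursor that Katyusha momentum accelerates, on a toy least-squares problem. All names and parameters below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm rows => L_i = 1
x_star = rng.standard_normal(d)
b = A @ x_star                                   # least squares, optimum x_star

def svrg(epochs=40, m=200, lr=0.1, seed=1):
    """SVRG sketch for f(x) = (1/2n) ||Ax - b||^2: a periodic full gradient
    at a snapshot point corrects each stochastic gradient, so the estimator
    stays unbiased while its variance shrinks as the iterates converge."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for _ in range(epochs):
        snapshot = x.copy()
        full_grad = A.T @ (A @ snapshot - b) / n    # anchor gradient
        for _ in range(m):
            i = rng.integers(n)
            g_x = A[i] * (A[i] @ x - b[i])
            g_snap = A[i] * (A[i] @ snapshot - b[i])
            x = x - lr * (g_x - g_snap + full_grad)  # unbiased, low variance
    return x

x = svrg()
```

Because the variance vanishes near the optimum, SVRG converges linearly where plain SGD would stall; Katyusha's contribution is to add a (negative) momentum term on top of this construction and obtain the accelerated rate directly.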

New York Theory Day Apr/28/2017
  1. "New Regret Minimization Framework and Its Applications".

In STOC 2015, we developed a regret minimization framework for matrix games, and applied it to give faster algorithms for constructing "linear-sized spectral sparsifiers" of graphs.

In this talk, I will review and upgrade this framework to obtain nearly optimal algorithms for (1) the online eigenvector problem and (2) experimental design, two famous problems in learning theory and statistics respectively.

University of Washington Mar/10/2017
  1. "From Linear Coupling to More Efficient Optimization Algorithms".

Developing novel optimization frameworks can often enrich the dimension of algorithm design. In this talk, I will present a "linear-coupling" framework, and apply it to obtain the fastest first-order algorithms for (1) packing linear programming, (2) stochastic convex optimization, and (3) stochastic non-convex optimization.

University of Washington Mar/09/2017
  1. "From Optimization to More Efficient Machine Learning Algorithms".

Many machine learning tasks can be solved via generic optimization methods such as gradient descent and stochastic gradient descent. In this talk, I will argue that this path of algorithm design may be restrictive and less efficient.

I will show instead how to develop novel optimization tools to enrich the dimension of algorithm design. In particular, I will present a "linear-coupling" framework, and apply it to obtain faster first-order algorithms for stochastic convex optimization (e.g., SVM, Lasso regression) and for stochastic non-convex optimization (e.g., deep learning).

Brown University Mar/06/2017
  1. "Towards More Efficient Optimization Algorithms".

Many computational tasks can be solved via classical optimization methods such as gradient descent, SGD, LP, or SDP. In this talk, I will argue that this path of algorithm design may be restrictive and less efficient.

I will show instead how to develop novel optimization tools to enrich the dimension of algorithm design. In particular, I will present a "linear-coupling" framework, and apply it to obtain the fastest first-order algorithms for (1) packing LP, (2) stochastic convex optimization, and (3) stochastic non-convex optimization.

Microsoft Research Redmond
Columbia University
Mar/14/2017 Feb/28/2017
  1. "Theory of Accelerated Methods".

In this talk I will show how to derive the fastest coordinate descent method [1] and the fastest stochastic gradient descent method [2], both from the linear-coupling framework [3]. I will relate them to linear system solving, conjugate gradient method, the Chebyshev approximation theory, and raise several open questions at the end. No prior knowledge is required on first-order methods.

[1] Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling.
[2] The First Direct Acceleration of Stochastic Gradient Methods.
[3] Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent.

Institute for Advanced Study Nov/22/2016
  1. "Sublinear Time Algorithms via Optimization".
In this talk, I will discuss how to use optimization to obtain sublinear algorithms, as well as an interpolation between sublinear-time and linear-time algorithms using optimization insights and acceleration techniques.
MSR-NE / MIT Oct/14/2016
  1. "Accelerated Stochastic Gradient Descent via New Model for First-Order Optimization".
Institute for Advanced Study Sep/26/2016
  1. "Optimization and Computation".

Some computer science literature creates new algorithms by reducing the problems they solve to existing optimization methods (such as LP solvers, gradient descent, or follow-the-regularized-leader). In this talk, I will argue that this path of algorithm design is somewhat restrictive and less likely to yield the best possible algorithm. I will show, on a conceptual level, how to design CS algorithms using optimization insights as a "guiding light" --- rather than applying optimization solvers as black boxes. With the liberty introduced in this way, one can better harvest the hidden potential of optimization-based algorithms.

More concretely, in two separate examples, I am going to demonstrate how to develop new optimization techniques in order to (1) solve positive linear programs faster, beating the state of the art that has stood for roughly 20 years, and (2) understand how the brain compresses data, contributing to the field of computational neuroscience.

Boston University
Simons A&G
Apr/2/2015 Mar/11/2016
  1. "Beyond Matrix Multiplicative Weight Updates".

Regret minimization over the PSD cone is a key component in many fast spectral algorithms, and yields nearly-linear time algorithms for many versions of graph partitioning problems. All applications of this idea rely on the matrix version of the classical multiplicative weight update method (MWU).

In this talk, I explain how Matrix MWU naturally arises as an instance of the Follow-the-Regularized-Leader method with the entropy regularizer. I will then generalize this approach to a larger class of regularizers. As an application, I will provide an alternative construction of the linear-sized spectral sparsifiers of Batson, Spielman and Srivastava, and show how to construct these sparsifiers in almost quadratic time.
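The first claim of the abstract (Matrix MWU is Follow-the-Regularized-Leader with the entropy regularizer) can be sketched directly: play the matrix exponential of the negated cumulative loss, normalized to trace one. A minimal illustration with random PSD losses follows; the test checks the standard Matrix MWU regret bound against the best fixed unit vector in hindsight.

```python
import numpy as np

def mat_exp(S):
    # matrix exponential of a symmetric matrix via eigendecomposition
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.exp(vals)) @ vecs.T

def matrix_mwu(losses, eta=0.1):
    """Matrix MWU = FTRL over trace-one PSD matrices with the (negative von
    Neumann) entropy regularizer: play W_t proportional to exp(-eta * sum of
    past losses), then suffer <W_t, L_t>."""
    d = losses[0].shape[0]
    cum = np.zeros((d, d))
    total = 0.0
    for L in losses:
        W = mat_exp(-eta * cum)
        W /= np.trace(W)              # density matrix: PSD, trace one
        total += np.trace(W @ L)      # loss suffered this round
        cum += L
    return total, cum

rng = np.random.default_rng(0)
d, T = 4, 300
losses = []
for _ in range(T):
    G = rng.standard_normal((d, d))
    S = G @ G.T
    losses.append(S / np.linalg.norm(S, 2))   # symmetric PSD, spectral norm 1

alg_loss, cum = matrix_mwu(losses)
best_fixed = np.linalg.eigvalsh(cum)[0]       # best unit vector in hindsight
```

The talk's point is that the entropy regularizer is just one choice; swapping in other regularizers yields the stronger sparsification guarantees mentioned above.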

UC Berkeley (Simons)
Princeton University
Nov/5/2014 Apr/29/2016
  1. "Linear Coupling of Gradient and Mirror Descent".

First-order methods play a central role in large-scale convex optimization. Even though many variations exist, almost all such methods fundamentally rely on two types of algorithmic steps: gradient-descent steps and mirror-descent steps. In this paper, we observe that the performances of these two types of steps are complementary, so that faster algorithms can be designed by linearly coupling the two. In particular, we show how to obtain a conceptually simple interpretation of Nesterov's accelerated gradient method, a cornerstone algorithm in convex optimization, as a simple linear coupling of the convergence analyses of the two underlying steps.

We believe that the complementary view and the linear coupling proposed in this paper will prove very useful in the design of first-order methods as it allows us to design fast algorithms in a conceptually easier way. For instance, our technique greatly facilitates the recent breakthroughs in solving packing and covering linear programs [AO14a, AO14b].
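In the Euclidean case the coupling is short enough to write out. The sketch below follows the paper's template under illustrative parameter choices (the step schedule is one standard instantiation, not necessarily the paper's exact constants), and compares against plain gradient descent on an ill-conditioned quadratic.

```python
import numpy as np

H = np.diag([1.0, 1.0e4])           # ill-conditioned quadratic, f(x) = 0.5 x^T H x
L = 1.0e4                            # smoothness constant
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x
x0 = np.array([1.0, 1.0])

def linear_coupling(T=500):
    """Couple a gradient-descent iterate y (local progress) with a
    mirror-descent iterate z (regret/dual progress); Euclidean mirror map."""
    y = z = x0.copy()
    for k in range(T):
        alpha = (k + 2) / (2.0 * L)   # growing mirror-descent step size
        tau = 1.0 / (alpha * L)       # coupling weight, equals 2/(k+2)
        x = tau * z + (1 - tau) * y   # play a convex combination of the two
        g = grad(x)
        y = x - g / L                 # gradient step
        z = z - alpha * g             # mirror step
    return y

def plain_gd(T=500):
    x = x0.copy()
    for _ in range(T):
        x = x - grad(x) / L
    return x
```

The coupled iterates achieve the accelerated \(O(L/T^2)\) rate, visibly beating plain gradient descent's \(O(L/T)\) on this instance.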

INFORMS Annual Meeting
  1. "First-Order Iterative Methods in the Design of Fast Graph Algorithms".

Fast iterative methods from Convex Optimization play a crucial role in a number of recent breakthroughs in the design of nearly-linear-time algorithms for fundamental graph problems, such as maximum flow and graph partitioning. Multiplicative Weight Updates, Euclidean and Non-Euclidean Gradient Descent, and Nesterov's Method have become a mainstay in the construction and analysis of fast algorithms.

However, until recently, the relation between these different methods and the reason for their success have been somewhat murky. What is the exact relation between Multiplicative Weight Updates and Gradient Descent? Why do Multiplicative Weight Updates show up in so many settings? What is the intuition behind Nesterov's iterative algorithm that achieves the asymptotically optimal iteration count for smooth convex functions (hint: it isn't just an algebraic trick)? The answer to these questions was not clear.

In this survey, we will provide answers by presenting a unified framework that reveals the power and limitations of each method and provides much needed geometric intuition. Among other insights, we will explain how Multiplicative Weight Updates are a particular instance of a dual iterative method, known as Mirror Descent, and how Nesterov's algorithm can be naturally derived as a combination of Mirror and Gradient Descent.
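The first insight (Multiplicative Weight Updates as Mirror Descent with the entropy regularizer) can be checked mechanically: written independently, the two update rules below produce identical iterates. A minimal sketch; the step size and loss vector are arbitrary.

```python
import numpy as np

def mwu_update(w, g, eta):
    # classical multiplicative weights: reweight by exp(-eta * loss), renormalize
    w = w * np.exp(-eta * g)
    return w / w.sum()

def mirror_descent_entropy(w, g, eta):
    # mirror descent with the entropy mirror map: move to the dual via log,
    # take a gradient step there, map back via exp, then Bregman-project
    # onto the simplex (for KL divergence, the projection is renormalization)
    y = np.log(w) - eta * g
    w = np.exp(y)
    return w / w.sum()

rng = np.random.default_rng(0)
w = np.ones(5) / 5          # uniform starting point on the simplex
g = rng.standard_normal(5)  # an arbitrary loss vector
```

Since \(\exp(\log w_i - \eta g_i) = w_i e^{-\eta g_i}\), the two rules coincide term by term; the "geometric intuition" of the survey is that the same derivation generalizes to other mirror maps.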

This talk is based on joint work and also co-lectured with Lorenzo Orecchia.

MSR-NE Apr/11/2014
  1. "A Simple, Combinatorial and Nearly-Linear-Time Algorithm for Solving Ax=b When A is Symmetric Diagonally Dominant".

Fast algorithms for solving linear systems Ax=b when A is symmetric diagonally dominant (SDD) or graph Laplacian have found broad applications across both the theory and practice of computer science, playing a foundational role in scientific computing, machine learning, random processes, image processing, network analysis, and computational biology, among others. More recently, they have emerged as a powerful tool in the design of graph algorithms, enabling researchers to break longstanding algorithmic barriers for a wide range of fundamental graph problems, including finding maximum flows and multicommodity flows, generating random spanning trees, sparsifying graphs, routing in distributed systems, and finding sparse cuts and balanced separators.

Finding fast solvers for these systems has been an active area of research, culminating in algorithms that solve them in nearly-linear time. These solvers are quite complex, and developing them required multiple fundamental innovations in spectral and combinatorial graph theory, graph algorithms, and computational linear algebra, which were used to construct and analyze an intricate recursively preconditioned iterative solver.

In this talk, I will give a very simple combinatorial algorithm that solves SDD systems in nearly-linear time. It uses very little of the machinery that previously appeared to be necessary for such an algorithm. It does not require recursive preconditioning, or even the Chebyshev Method or Conjugate Gradient. After constructing a "nice" spanning tree of a graph associated to the linear system, the entire algorithm consists of the repeated application of a simple (non-recursive) update rule, which it implements using a lightweight data structure. The algorithm is numerically stable and can be implemented without the increased bit-precision required by previous solvers. As such, the algorithm presented has the fastest known running time under the standard unit-cost RAM model.

This is joint work with Jonathan Kelner, Lorenzo Orecchia and Aaron Sidford.

ICML 2013 workshop
  1. "Local Graph Clustering".

Motivated by applications of large-scale graph clustering, we study random-walk-based local algorithms whose running times depend only on the size of the output cluster, rather than the entire graph. All previously known such algorithms guarantee an output conductance of \(\tilde{O}(\sqrt{\phi(A)})\) when the target set A has conductance \(\phi(A)\in[0,1]\). In this paper, we improve it to \[\tilde{O}\bigg( \min\Big\{\sqrt{\phi(A)}, \frac{\phi(A)}{\sqrt{\mathsf{Conn}(A)}} \Big\} \bigg)\enspace, \] where the internal connectivity parameter \(\mathsf{Conn}(A)\in[0,1]\) is defined as the reciprocal of the mixing time of the random walk over the induced subgraph on A.

For instance, using \(\mathsf{Conn}(A)=\Omega(\lambda(A)/\log n)\) where \(\lambda\) is the second eigenvalue of the Laplacian of the induced subgraph on A, our conductance guarantee can be as good as \(\tilde{O}(\phi(A)/\sqrt{\lambda(A)})\). This builds an interesting connection to the recent advance of the so-called improved Cheeger's Inequality [KKL+13], which says that global spectral algorithms can provide a conductance guarantee of \(O(\phi_{\mathsf{opt}}/\sqrt{\lambda_3})\) instead of \(O(\sqrt{\phi_{\mathsf{opt}}})\).

In addition, we provide theoretical guarantee on the clustering accuracy (in terms of precision and recall) of the output set. We also prove that our analysis is tight, and perform empirical evaluation to support our theory on both synthetic and real data.

It is worth noting that our analysis outperforms prior work when the cluster is well-connected. In fact, the better connected the cluster is internally, the more significant the improvement (in terms of both conductance and accuracy) we obtain. Our results shed light on why, in practice, some random-walk-based algorithms perform better than their previous theory predicts, and help guide future research on local clustering.
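A minimal end-to-end illustration of the objects involved (conductance, a random-walk diffusion from a seed, and a sweep cut) is below. Personalized PageRank here is a simple stand-in for the paper's random-walk procedure, and the planted two-clique graph is purely hypothetical.

```python
import numpy as np

def conductance(A, S):
    """phi(S) = cut(S, complement) / min(vol(S), vol(complement))."""
    S = sorted(S)
    notS = [v for v in range(len(A)) if v not in S]
    cut = A[np.ix_(S, notS)].sum()
    return cut / min(A[S].sum(), A[notS].sum())

# planted instance: two 5-cliques joined by the single edge (4, 5)
n = 10
A = np.zeros((n, n))
for block in (range(0, 5), range(5, 10)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[4, 5] = A[5, 4] = 1.0
deg = A.sum(axis=1)

# personalized PageRank diffusion from seed vertex 0
alpha = 0.15
e = np.zeros(n); e[0] = 1.0
p = e.copy()
for _ in range(300):
    p = alpha * e + (1 - alpha) * (A @ (p / deg))

# standard sweep cut: sort by degree-normalized mass, take the best prefix
order = np.argsort(-p / deg)
cands = [set(order[:k]) for k in range(1, n)]
phis = [conductance(A, S) for S in cands]
best = cands[int(np.argmin(phis))]     # recovers the seed clique {0,...,4}
```

On this toy graph the sweep recovers the seed clique with conductance \(1/21\) (one cut edge over volume 21); the running time of a genuinely local algorithm would depend only on the cluster's size, which is the point of the talk.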

This is joint work with Silvio Lattanzi and Vahab Mirrokni.

Google Research
  1. "Mechanism Design with Approximate Valuations (a.k.a. Knightian Auctions)".
Google Research Aug/28/2012
  1. "A Novel Click Model and Its Applications to Online Advertising".
MSR-Asia Aug/19/2009
  1. "Inverse Dependency on Training Data Size".
MSR-Asia Apr/30/2009
  1. "Parallel SVM on MLPlatform".
MSR-Asia Feb/03/2009

Lecture Talks

  • "Interior Point Method"
(1.5hrs) MIT / 6.854 Oct/21/2011
  • "Link-Cut Trees"
(1.5hrs) MIT / 6.854 Oct/14/2011
  • "Robust Mechanism Design"
     - Lecture 1: Introduction to Robust Mechanism Design
     - Lecture 2: Distinguishable Dominance and Robustness
     - Lecture 3: The 1/2 Revenue Mechanism and Open Problems
(7hrs) Microsoft Research Asia Jan/21/2010
  • "How to Become Outstanding in USA Mathematical Contest of Modeling"
(2.5hrs) Tsinghua University Dec/29/2009
  • "Introduction to Quantum Teleportation"
(25min) Tsinghua University Nov/03/2008
  • "Two Mergeable Data Structures: Disjoint-set and Leftist Tree"
(45min) Tsinghua University May/30/2008