Abstract

In this tutorial, we provide an accessible and extensive overview of recent advances in optimization methods based on stochastic gradient descent (SGD), for both convex and non-convex tasks.

In particular, this tutorial tries to answer the following questions with theoretical support. How can we properly use momentum to speed up SGD? What is the maximum parallel speedup we can achieve for SGD? When should we use a dual or primal-dual approach in place of SGD? What is the difference between coordinate descent (e.g. SDCA) and SGD? How does variance reduction affect the performance of SGD? Why does second-order information help improve the convergence of SGD?

Corrections and Discussions

I have opened a blog post for corrections and discussion of this talk. Please find it here on WordPress.

Since this is a survey talk, I have likely missed many interesting and relevant results in the field. I will try to include more references in the discussion post mentioned above.

Downloads

The video is now available to watch. Its audio comes from my personal recording, so it may not be of high quality. Please stay tuned until the organizers release their version of the video.

The list of papers mentioned during the talk can be downloaded here.

ICML 2017 Tutorial
