Course Content

This page includes all the content for the course thus far. We will update this page with all lecture materials, readings, and homework as the class goes on.

  1. Schedule and Main Content
  2. Resources
    1. LaTeX
    2. Python
    3. Linear Algebra Prerequisites
    4. Multivariable Calculus Prerequisites
    5. Probability Theory and Statistics Prerequisites

Schedule and Main Content

This class has six main modules, two for each “pillar” of machine learning: linear algebra, calculus and optimization, and probability and statistics. All class files will be available here. For a more detailed outline of the course thus far, see the Course Skeleton.

  • Lecture slides can be found by clicking on the lecture title for the appropriate day.
  • All the materials and readings in the right column are optional, but reading (a subset of) them before each lecture might help you digest the content during lecture.
  • Problem sets will be posted here, as well as their solutions.

This is a tentative schedule and is subject to change. Readings, slides, and assignments will be posted as the class goes on.

Optional readings. MML refers to Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. VMLS refers to Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares by Stephen Boyd and Lieven Vandenberghe.

Story of the course. As the lectures go on, the goal will be to develop two main ideas from machine learning: least squares regression (LS) and gradient descent (GD). During each lecture, we will build these ideas with the mathematical tools from that lecture; at the same time, we’ll gradually develop a “picture” of LS and GD as the course goes on. An evolving 3D rendering of each “picture” will be linked in each module below.

Problem sets. The problem sets will usually look relatively long, but much of that length is exposition – the problems in this course are mostly structured to guide you through the discovery or derivation of some result or perspective on a concept. As such, the problem sets serve a double purpose: “required reading” interspersed with problems where you fill in the gaps.

Lecture pace. It’s really easy, in my experience, to get lost in a math lecture when lots of derivations or proofs are involved. At the same time, though, it can often be intimidating to speak up for fear of asking a “dumb question” (no such thing!). To this end, during every lecture, I’ll have a fully anonymous interactive poll to keep an eye on how people are feeling, and I’ll check it intermittently, especially during proofs. When prompted to register, just click “Skip for now.” The poll link is here.

Linear Algebra I (matrices, vectors, bases, and orthogonality)

Linear Algebra II (singular value decomposition and eigendecomposition)

Calculus and Optimization I (differentiation and Taylor Series)

Jul 15
Lecture: Differentiation and vector calculus
“Peaks” Function, Derivative Ex. 1, Derivative Ex. 2, Derivative Ex. 3, MML 5.1 - 5.5, The Matrix Cookbook
Jul 17
Lecture: Taylor Series, Linearization, and Gradient Descent
GD Example 1 (big eta), GD Example 1 (small eta), GD Example 2 (big eta), GD Example 2 (small eta), Linearization in 3D, Polynomial 1, Polynomial 2, Beta-smooth function, 3Blue1Brown video on Taylor Series
Jul 18
PS 3 released, due July 29, 11:59 PM ET
ps3.pdf, ps3_template.zip, ps3.ipynb, ps3_tex.zip
LS (Story thus far)
Lecture 3.1, 3.2: We can derive the exact same OLS theorem from the linear algebra section using just the tools of optimization, by viewing the least squares error as an “objective function.”
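A minimal sketch of the optimization view, assuming the objective is the squared error ||Xw - y||^2 and that X has full column rank (the data, the names `objective` and `gradient`, and the problem sizes are made up for illustration). Setting the gradient 2 X^T (Xw - y) to zero gives the normal equations, and the solution should agree with NumPy's least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))   # synthetic data matrix (assumed full column rank)
y = rng.normal(size=n)        # synthetic targets

def objective(w):
    """Least squares error ||Xw - y||^2."""
    r = X @ w - y
    return r @ r

def gradient(w):
    """Gradient of the objective: 2 X^T (Xw - y)."""
    return 2 * X.T @ (X @ w - y)

# Gradient = 0  <=>  X^T X w = X^T y (the normal equations).
w_opt = np.linalg.solve(X.T @ X, X.T @ y)

# Compare against NumPy's built-in least squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

grad_norm = np.linalg.norm(gradient(w_opt))
print("gradient norm at w_opt:", grad_norm)
print("matches lstsq:", np.allclose(w_opt, w_lstsq))
```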
GD (Story thus far)
Lecture 3.1: We can now write down the algorithm for gradient descent. Intuitively, positive semidefinite or positive definite quadratic forms seem good for gradient descent.
Lecture 3.2: Using Taylor’s approximations and Taylor’s theorem for the first-order approximation (linearization), we can provide intuition and a formal guarantee that gradient descent makes the function values decrease. The behavior of gradient descent depends on the learning rate eta: an eta that is too big results in erratic behavior, but a small enough eta results in stable convergence.
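To see the effect of eta concretely, here is a small sketch on a positive definite quadratic form f(w) = 0.5 w^T A w. The matrix A, the starting point, and the two step sizes are assumptions chosen for illustration, not the examples linked above. For this particular quadratic the iterates shrink whenever eta < 2 / 10 = 0.2, where 10 is the largest eigenvalue of A:

```python
import numpy as np

A = np.diag([1.0, 10.0])  # positive definite; largest eigenvalue is 10

def f(w):
    """Quadratic form 0.5 * w^T A w."""
    return 0.5 * w @ A @ w

def grad_f(w):
    """Gradient of f, which is A w for symmetric A."""
    return A @ w

def gradient_descent(w0, eta, steps):
    """Run plain gradient descent and record f at every iterate."""
    w = np.array(w0, dtype=float)
    values = [f(w)]
    for _ in range(steps):
        w = w - eta * grad_f(w)
        values.append(f(w))
    return w, values

# eta = 0.05 is below 0.2, so the function values decrease steadily.
_, small_eta_values = gradient_descent([1.0, 1.0], eta=0.05, steps=100)

# eta = 0.25 is above 0.2, so the iterates overshoot and blow up.
_, big_eta_values = gradient_descent([1.0, 1.0], eta=0.25, steps=100)

print("small eta, final f:", small_eta_values[-1])
print("big eta, final f:  ", big_eta_values[-1])
```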

Calculus and Optimization II (optimization and convexity) -- SAM OUT OF TOWN

Jul 22
Lecture: Optimization and the Lagrangian (recording in three parts in Video Library)
Constrained least squares (ridge regression), MML 7.1 - 7.2
PS 2 due
Jul 24
Lecture: Convexity and convex optimization (recording in one part in Video Library)
MML 7.3, Convexity Definition in 3D, Convexity First-order Definition in 3D, Boyd and Vandenberghe’s Convex Optimization Chapters 1 - 3
PS 4 released, due Aug 6th, 11:59 PM ET
ps4.pdf, ps4_template.zip, ps4.ipynb, ps4_tex.zip
LS (Story thus far)
Lecture 4.1: In some applications, it may be favorable to regularize the least squares objective, trading off minimizing the objective against keeping the norm of the weights small.
Lecture 4.2: The least squares objective is a convex function (also: first-order definition); applying gradient descent takes us to a global minimum.
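A rough sketch of both points, assuming the penalized form ||Xw - y||^2 + lam * ||w||^2 for ridge regression (the lecture's exact formulation may differ; `ridge`, `lam`, and the data are hypothetical). It checks that the weight norm shrinks as lam grows, and uses the second-order test for convexity (a positive semidefinite Hessian) rather than the first-order definition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))  # synthetic data matrix
y = rng.normal(size=40)       # synthetic targets

def ridge(X, y, lam):
    """Minimizer of ||Xw - y||^2 + lam * ||w||^2 (assumed penalized form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Larger lam puts more weight on the norm penalty, so ||w|| shrinks.
lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in lambdas]
for lam, nrm in zip(lambdas, norms):
    print(f"lam = {lam:6.1f}   ||w|| = {nrm:.4f}")

# The Hessian of ||Xw - y||^2 is 2 X^T X. All eigenvalues being
# nonnegative means the least squares objective is convex.
hessian_eigs = np.linalg.eigvalsh(2 * X.T @ X)
print("smallest Hessian eigenvalue:", hessian_eigs.min())
```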
GD (Story thus far)
Lecture 4.1: Nothing new here.
Lecture 4.2: Applying gradient descent to beta-smooth, convex functions takes us to a global minimum. One such function is the least squares objective.
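As an illustrative check (not the lecture's own demo), the sketch below runs gradient descent on ||Xw - y||^2 with step size eta = 1 / beta, where beta is taken to be twice the largest eigenvalue of X^T X, and compares the result to the solution from NumPy's least squares solver. The data and iteration count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))  # synthetic data matrix (assumed full column rank)
y = rng.normal(size=50)       # synthetic targets

def grad(w):
    """Gradient of ||Xw - y||^2."""
    return 2 * X.T @ (X @ w - y)

# The Hessian is 2 X^T X, so its largest eigenvalue bounds how fast
# the gradient can change; we use it as the smoothness constant beta.
beta = 2 * np.linalg.eigvalsh(X.T @ X).max()
eta = 1.0 / beta

w = np.zeros(3)
for _ in range(2000):
    w = w - eta * grad(w)

# Global minimizer computed directly, for comparison.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print("distance to global minimizer:", np.linalg.norm(w - w_star))
```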

Probability and Statistics I (basic probability theory and statistical estimation)

Jul 29
Lecture: Basic Probability Theory, Models, and Data
Regression setup w/ randomness, MML 6.1-6.4, Blitzstein and Hwang’s Ch. 9 on Conditional Expectation
PS 3 due
Jul 31
Lecture: Bias, Variance, and Statistical Estimators
Regression (d = 2) with test point, SGD with batch size 1, SGD with batch size 10
Final paper reading evaluation released. Evaluation due August 12 11:59 PM ET
Aug 1
PS 5 released, due Aug 13th, 11:59 PM ET (no programming portion)
ps5.pdf, ps5_template.zip, ps5_tex.zip
LS (Story thus far)
Lecture 5.1: Modeled the regression problem with a linear model with random errors. Found that OLS’ conditional expectation is the true linear model and its variance scales with the variance of the random errors.
Lecture 5.2: OLS is the lowest variance unbiased linear estimator (Gauss-Markov Theorem). Derived expression for the risk (generalization error) of OLS.
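A small simulation can make these statements tangible. This sketch assumes a fixed design matrix X and independent Gaussian errors with standard deviation `sigma` (both assumptions, as are `w_true` and the number of trials). Averaging the OLS estimate over many redraws of the errors should recover the true weights, and the empirical covariance should be close to sigma^2 (X^T X)^{-1}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 2, 1.0
X = rng.normal(size=(n, d))        # fixed design, reused in every trial
w_true = np.array([2.0, -1.0])     # hypothetical true linear model

trials = 2000
estimates = np.empty((trials, d))
for t in range(trials):
    # Fresh random errors each trial; X and w_true stay fixed.
    y = X @ w_true + sigma * rng.normal(size=n)
    estimates[t] = np.linalg.lstsq(X, y, rcond=None)[0]

mean_estimate = estimates.mean(axis=0)
empirical_cov = np.cov(estimates, rowvar=False)
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)

print("mean of OLS estimates:", mean_estimate)
print("empirical covariance:\n", empirical_cov)
print("sigma^2 (X^T X)^{-1}:\n", theoretical_cov)
```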
GD (Story thus far)
Lecture 5.1: Nothing new here.
Lecture 5.2: Closed the story of gradient descent by defining stochastic gradient descent, where we use unbiased estimators of the gradient instead of the full gradient over all the data.
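The sketch below illustrates the "unbiased estimator" idea under some assumptions: the objective is the average squared error (1/n) ||Xw - y||^2, batches are sampled uniformly without replacement, and the step size and step count are arbitrary choices. Averaging the single-example gradients over all n examples gives exactly the full gradient, which is what unbiasedness means for batch size 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -0.5])            # hypothetical true weights
y = X @ w_true + 0.1 * rng.normal(size=n)

def full_gradient(w):
    """Gradient of (1/n) ||Xw - y||^2 over all the data."""
    return (2.0 / n) * X.T @ (X @ w - y)

def batch_gradient(w, idx):
    """Same gradient, computed only on the examples in idx."""
    Xb, yb = X[idx], y[idx]
    return (2.0 / len(idx)) * Xb.T @ (Xb @ w - yb)

# Unbiasedness check for batch size 1: the average of all n
# single-example gradients equals the full gradient.
w0 = np.zeros(d)
per_sample = np.array([batch_gradient(w0, [i]) for i in range(n)])
mean_per_sample = per_sample.mean(axis=0)
print("unbiased:", np.allclose(mean_per_sample, full_gradient(w0)))

def sgd(batch_size, eta=0.05, steps=2000):
    """SGD with a uniformly sampled batch at every step."""
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        w = w - eta * batch_gradient(w, idx)
    return w

w_b1 = sgd(batch_size=1)
w_b10 = sgd(batch_size=10)
print("batch size 1: ", w_b1)
print("batch size 10:", w_b10)
```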

Probability and Statistics II (Maximum likelihood and Gaussian distribution)

Resources

I’ll update this with additional resources as the class progresses. Feel free to use these or ignore them completely. If you know of any additional resources that you think would be helpful for the class, let me know and I’ll add them here!

LaTeX

Googling an issue you’re having with LaTeX usually turns up a plethora of solutions.

Python

  • Whirlwind Tour of Python should have most everything you need to get up to speed with the programming required in this course.

Linear Algebra Prerequisites

If you need to refresh any linear algebra, these may be good resources.

Multivariable Calculus Prerequisites

If you need to refresh any multivariable calculus, these may be good resources.

Probability Theory and Statistics Prerequisites

If you need to refresh any probability and statistics, these may be good resources.