Course Content

This page includes all the content for the course thus far. We will update this page with all lecture materials, readings, and homework as the class goes on.

Schedule and Main Content

This class has six main modules, two for each “pillar” of machine learning: linear algebra, calculus and optimization, and probability and statistics. All class files will be available here. For a more detailed outline of the course thus far, see the Course Skeleton.

Lecture slides can be found by clicking on the lecture title for the appropriate day.
All the materials and readings in the right column are optional, but reading (a subset of) these materials before each lecture might help you digest the content during lecture.
Problem sets will be posted here, as well as their solutions.

This is a tentative schedule and is subject to change. Readings, slides, and assignments will be posted as the class goes on.

Optional readings. MML refers to Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. VMLS refers to Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares by Stephen Boyd and Lieven Vandenberghe.

Story of the course. As the lectures go on, the goal will be to develop two main ideas from machine learning: least squares regression (LS) and gradient descent (GD). During each lecture, we will build these ideas with the mathematical tools from that lecture; at the same time, we’ll gradually develop a “picture” of LS and GD as the course goes on. An evolving 3D rendering of each “picture” will be linked in each module below.

Problem sets. The problem sets will usually look relatively long, but much of each one is exposition: the problems in this course are mostly structured to guide you through the discovery or derivation of some result or perspective on a concept. As such, the problem sets serve the double purpose of providing some “required reading” interspersed with problems for you to fill in the gaps.

Lecture pace. It’s really easy, in my experience, to get lost in a math lecture when lots of derivations or proofs are involved. At the same time, though, it can often be intimidating to speak up for fear of asking a “dumb question” (no such thing!). To this end, during every lecture, I’ll have a fully anonymous interactive poll to keep an eye on how people are feeling during lecture and I’ll check it intermittently, especially during proofs. Access the poll on the Pacing page.

Unit reviews. At the end of each “pillar” of the course, we will hold an optional unit review session to make sure that everyone is on the same page before moving on to the next one. These will be informal recitations where we recap the Course Skeleton to get a big-picture view and, more importantly, address any questions or confusion you might have. The dates/times/locations will be posted here and on the Calendar.

Linear Algebra I (matrices, vectors, bases, and orthogonality)

May 22

PS 0 released (due May 30 11:59 PM ET) + Ed Announcement: ps0_template.zip

May 27

Lecture: Vectors, matrices, and least squares: MML 2.1 - 2.8, 3.1 - 3.3, VMLS 1.1-1.5, 2.1-2.3, 3.1-3.4, 5.1, 5.2, 6.1-6.4, 12.1-12.4, Regression (d=2)

PS 1 released (due June 6 11:59 PM ET): ps1.pdf, ps1_student.zip, ps1.ipynb, ps1_tex.zip

Reading Project released (due June 3 11:59 PM ET): project instructions

May 29

Lecture: Subspaces, bases, and orthogonality: MML 2.1 - 2.8, 3.1 - 3.3, VMLS 1.1-1.5, 2.1-2.3, 3.1-3.4, 5.1, 5.2, 6.1-6.4, 12.1-12.4, Alternate basis, 3Blue1Brown video on bases, 3Blue1Brown video on matrices as linear transformations

May 30

DUE PS 0 due

LS (Story thus far)

Lecture 1.1: Least squares regression can be solved geometrically with the Pythagorean Theorem.

Lecture 1.2: Least squares regression has a simpler solution with orthonormal bases.
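As a small illustration of the two points above, here is a minimal sketch in plain Python. The basis vectors q1, q2 and the target y are made-up values, not taken from the lecture. With an orthonormal basis, the least squares coefficients reduce to dot products, and the fit and the residual satisfy the Pythagorean Theorem.

```python
import math

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical orthonormal basis for a 2D subspace of R^3 (the xy-plane, rotated 45 degrees).
s = 1 / math.sqrt(2)
q1 = [s, s, 0.0]
q2 = [-s, s, 0.0]
y = [3.0, 1.0, 2.0]   # made-up target vector

# With orthonormal basis vectors, the least squares coefficients are just dot products.
coeffs = [dot(q1, y), dot(q2, y)]

# y_hat is the projection of y onto the subspace; the residual is what is left over.
y_hat = [coeffs[0] * a + coeffs[1] * b for a, b in zip(q1, q2)]
residual = [yi - yh for yi, yh in zip(y, y_hat)]

# The residual is orthogonal to the subspace, so ||y||^2 = ||y_hat||^2 + ||residual||^2.
print(dot(residual, q1), dot(residual, q2))
print(dot(y, y), dot(y_hat, y_hat) + dot(residual, residual))
```

Up to floating-point rounding, both dot products with the residual are zero and the two squared lengths on the last line agree.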

GD (Story thus far)

Lecture 1.1, 1.2: Gradient descent with a “bowl-shaped” function gets us to the minimum.

Linear Algebra II (singular value decomposition and eigendecomposition)

Jun 3

Lecture: Singular Value Decomposition: 3D SVD (unprojected), 3D SVD (u1, u2), 3D SVD (u1), Orthogonal Complement, MML 4.2, 4.4, 4.5, Daniel Hsu’s Computational Linear Algebra (CLA) course notes on SVD, Daniel Hsu’s CLA interactive example of “best-fitting 1d subspace”

DUE Reading Project first evaluation due

PS 2 released (due June 13 11:59 PM ET): ps2.pdf, ps2_student.zip, ps2.ipynb, ps2_tex.zip

Jun 5

Lecture: Eigendecomposition and PSD Matrices: Positive Definite Quad. Form, Positive Semidefinite Quad. Form, Indefinite Quad. Form, Indefinite Quad. Form (another initialization), Quadratics are dominated by the degree-2 terms, MML 4.2, 4.4, 4.5, 3Blue1Brown on eigenvalues/eigenvectors

Jun 6

DUE PS 1 due

LS (Story thus far)

Lecture 2.1 & 2.2: The problem of least squares regression is unified under the pseudoinverse.
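One standard way to write this down is sketched below. The notation is an assumption and may differ from the slides: X is the data matrix, y is the target vector, and X = U Σ V^T is the compact SVD, so Σ is square and contains only the nonzero singular values.

```latex
X = U \Sigma V^\top, \qquad X^{+} = V \Sigma^{-1} U^\top, \qquad \hat{w} = X^{+} y
```

When X has full column rank, X^+ equals (X^T X)^{-1} X^T, the usual OLS formula. When the system has more unknowns than equations and X has full row rank, X^+ equals X^T (X X^T)^{-1}, which picks out the minimum-norm solution.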

GD (Story thus far)

Lecture 2.1 (nothing new): Gradient descent with a “bowl-shaped” function gets us to the minimum.

Lecture 2.2: On quadratic forms, it seems that gradient descent behaves differently on three types of shapes: positive definite, positive semidefinite, and indefinite.
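To make this behavior concrete, here is a minimal sketch in plain Python. The diagonal quadratic forms, starting points, step size, and step count are all made-up choices for illustration, not the examples from the lecture renderings.

```python
def gradient_descent(diag, x0, eta=0.1, steps=200):
    """Gradient descent on f(x) = 0.5 * sum(d_i * x_i**2), whose gradient is (d_i * x_i)."""
    x = list(x0)
    for _ in range(steps):
        x = [xi - eta * di * xi for xi, di in zip(x, diag)]
    return x

# Positive definite (a bowl): every coordinate shrinks toward the unique minimum at 0.
pd_end = gradient_descent([1.0, 2.0], [1.0, 1.0])

# Positive semidefinite (a trough): the flat direction never moves from its start.
psd_end = gradient_descent([1.0, 0.0], [1.0, 1.0])

# Indefinite (a saddle): the negative-curvature direction blows up.
indef_end = gradient_descent([1.0, -1.0], [1.0, 1.0])

# Indefinite again, but started exactly on the x1-axis: the iterates reach the saddle point.
saddle_end = gradient_descent([1.0, -1.0], [1.0, 0.0])

for name, end in [("PD", pd_end), ("PSD", psd_end),
                  ("indefinite", indef_end), ("indefinite, on axis", saddle_end)]:
    print(name, end)
```

The last two runs use the same function and differ only in their initialization, which is one way to see why the indefinite case can behave differently from run to run.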

Calculus and Optimization I (differentiation and Taylor Series)

Jun 10

Lecture: Differentiation and vector calculus: “Peaks” Function, Derivative Ex. 1, Derivative Ex. 2, Derivative Ex. 3, MML 5.1 - 5.5, The Matrix Cookbook, Annotated Slides

PS 3 released (due June 20 11:59 PM ET): ps3.pdf, ps3_student.zip, ps3.ipynb

Jun 12

Lecture: Gradient Descent, Linearization, and Taylor Series: 3Blue1Brown video on Taylor Series, MML 5.8, 3Blue1Brown video on Gradient Descent and Neural Networks

Jun 13

DUE PS 2 due

LS (Story thus far)

Lecture 3.1: We can derive the exact same OLS theorem from the linear algebra section using just the tools of optimization, by viewing the least squares error as an “objective function.”
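A sketch of that derivation, in assumed notation (data matrix X, targets y, weights w) that may differ from the slides: take the gradient of the squared error, set it to zero, and solve.

```latex
\begin{aligned}
f(w) &= \lVert Xw - y \rVert_2^2 \\
\nabla f(w) &= 2 X^\top (Xw - y) \\
\nabla f(\hat{w}) = 0 \;&\Longrightarrow\; X^\top X \hat{w} = X^\top y \\
\hat{w} &= (X^\top X)^{-1} X^\top y \quad \text{(when } X^\top X \text{ is invertible)}
\end{aligned}
```

The third line is the set of normal equations, the same condition that the geometric argument reaches through orthogonality of the residual.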

GD (Story thus far)

Lecture 3.1: We can now write down the algorithm for gradient descent. Intuitively, positive semidefinite or positive definite quadratic forms seem good for gradient descent.

Lecture 3.2: Using Taylor’s theorem for the first-order approximation (linearization), we can provide intuition and a formal guarantee that gradient descent makes the function values decrease. The behavior of gradient descent depends on the learning rate eta: an eta that is too big results in erratic behavior, while a small enough eta results in stable convergence. The right setting of eta depends intimately on the second-order information, or “smoothness,” of the function.
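The dependence on eta can be seen in a one-dimensional sketch. The function f(x) = 0.5 * beta * x**2, the curvature beta = 4, and the two step sizes are assumptions chosen for illustration. Each update multiplies x by (1 - eta * beta), so the iterates shrink exactly when eta is below 2 / beta.

```python
def run_gd(eta, beta=4.0, x0=1.0, steps=50):
    """Gradient descent on f(x) = 0.5 * beta * x**2, whose derivative is beta * x."""
    x = x0
    for _ in range(steps):
        x = x - eta * beta * x   # x_{t+1} = x_t - eta * f'(x_t)
    return x

small = run_gd(eta=0.1)   # eta < 2 / beta: each step multiplies x by 0.6
large = run_gd(eta=0.6)   # eta > 2 / beta: each step multiplies x by -1.4

print("eta = 0.1 ends at", small)
print("eta = 0.6 ends at", large)
```

With the smaller step size the iterate ends extremely close to the minimum at 0. With the larger one it flips sign at every step and grows in magnitude.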

Calculus and Optimization II (optimization and convexity)

Jun 17

Lecture: Optimization and the Lagrangian: Constrained least squares (ridge regression), MML 7.1 - 7.2

PS 4 released (due June 27 11:59 PM ET): ps4.pdf, ps4_student.zip, ps4.ipynb

Jun 19

Class rescheduled to Friday, June 20th due to Juneteenth

Jun 20

Lecture: Convexity and convex optimization (Changed time and location: 12:45pm - 4pm in CSB 451): MML 7.3, Convexity Definition in 3D, Convexity First-order Definition in 3D, Boyd and Vandenberghe’s Convex Optimization Chapters 1 - 3

Jun 20

DUE PS 3 due

LS (Story thus far)

Lecture 4.1: In some applications, it may be favorable to regularize the least squares objective by trading off minimizing the objective with the norm of the weights.

Lecture 4.2: The least squares objective is a convex function (also: first-order definition); applying gradient descent takes us to a global minimum.
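The regularization trade-off from Lecture 4.1 can be sketched with a single weight. This assumes the penalized form of ridge regression, with a penalty strength lam and made-up data; the lecture's constrained formulation may be stated differently.

```python
def ridge_1d(xs, ys, lam):
    """Minimize sum((w * x - y)**2) + lam * w**2 over a single weight w."""
    # Setting the derivative 2 * sum(x * (w * x - y)) + 2 * lam * w to zero
    # gives the closed form w = sum(x * y) / (sum(x * x) + lam).
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]   # made-up data, roughly y = 2x

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, ridge_1d(xs, ys, lam))
```

With lam = 0 this is ordinary least squares. As lam grows, the fitted weight shrinks toward zero, trading a worse fit for a smaller norm.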

GD (Story thus far)

Lecture 4.1: Nothing new here.

Lecture 4.2: Applying gradient descent to beta-smooth, convex functions takes us to a global minimum. One such function is the least squares objective.
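As a numerical sanity check (not a proof), the sketch below tests both definitions of convexity on a one-weight least squares objective at random points. The data, the sampling range, and the tolerance are made-up choices.

```python
import random

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]   # made-up data

def f(w):
    """Least squares objective for a single weight w."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def grad(w):
    """Derivative of f with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys))

random.seed(0)
chord_ok = True
tangent_ok = True
for _ in range(1000):
    a, b = random.uniform(-10, 10), random.uniform(-10, 10)
    t = random.random()
    # Definition: the function lies on or below the chord between any two points.
    chord_ok = chord_ok and f(t * a + (1 - t) * b) <= t * f(a) + (1 - t) * f(b) + 1e-9
    # First-order definition: the function lies on or above every tangent line.
    tangent_ok = tangent_ok and f(b) >= f(a) + grad(a) * (b - a) - 1e-9

print(chord_ok, tangent_ok)
```

The small tolerance only absorbs floating-point rounding; both inequalities hold exactly for this objective.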

Probability and Statistics I (basic probability theory and statistical estimation)

Jun 24

Lecture: Basic Probability Theory, Models, and Data: Regression setup w/ randomness, MML 6.1-6.4, Blitzstein and Hwang’s Ch. 9 on Conditional Expectation, Leo Breiman’s “Two Cultures” paper, Carlos Fernandez-Granda’s Probability for Data Science Overview

PS 5 released (due July 4 11:59 PM ET), no programming part: ps5.pdf, ps5_student.zip

Final paper reading evaluation released. Evaluation due July 8 11:59 PM ET

Jun 26

Lecture: Bias, Variance, and Statistical Estimators: Regression (d = 2) with test point, SGD with batch size 1, SGD with batch size 10

Jun 27

DUE PS 4 due

Unit 2 Calculus Review Session (in video library): Handwritten notes

LS (Story thus far)

Lecture 5.1: Modeled the regression problem with a linear model with random errors. In this model, OLS is itself a random variable, so we will analyze its statistical properties.

Lecture 5.2: Found two key statistical properties of OLS: OLS’ expectation is the true linear model and its variance scales with the variance of the random errors.
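Both properties can be checked by simulation. The sketch below assumes a one-weight model y = w * x + noise with fixed, made-up inputs and Gaussian noise; the true weight, sample sizes, and seed are illustrative choices only.

```python
import random
import statistics

def ols_1d(xs, ys):
    """OLS slope for a line through the origin: sum(x * y) / sum(x * x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def simulate(sigma, w_true=2.0, trials=5000, seed=0):
    """Redraw the noise many times and record the OLS estimate each time."""
    rng = random.Random(seed)
    xs = [0.5, 1.0, 1.5, 2.0]   # fixed, made-up inputs
    estimates = []
    for _ in range(trials):
        ys = [w_true * x + rng.gauss(0.0, sigma) for x in xs]   # y = w * x + noise
        estimates.append(ols_1d(xs, ys))
    return statistics.mean(estimates), statistics.variance(estimates)

mean_1, var_1 = simulate(sigma=1.0)
mean_2, var_2 = simulate(sigma=2.0)

print("sigma = 1: mean", mean_1, "variance", var_1)
print("sigma = 2: mean", mean_2, "variance", var_2)
```

In both runs the average estimate should sit near the true weight of 2. Doubling the noise standard deviation should roughly quadruple the variance of the estimate, in line with the variance formula sigma**2 / sum(x * x) for this one-weight model.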

GD (Story thus far)

Lecture 5.1 and 5.2: Nothing new here.

Probability and Statistics II (Maximum likelihood and Gaussian distribution)

Jul 1

Lecture: The Central Limit Theorem, “Named” Distributions, and MLE: MML 6.1-6.8, MML Ch. 8, 3Blue1Brown’s video on the Central Limit Theorem

Please fill out SEAS course evaluations on Courseworks!

Jul 3

Lecture: Multivariate Gaussian and Course Overview: 3Blue1Brown’s video on adding Gaussian distributions, 3Blue1Brown’s video on normalizing the Gaussian, MML Ch. 11 (Gaussian Mixture Models, not covered), OLS distribution with standard normal eps, true w = (1,1), MVN with mean (0, 0), Identity covariance, MVN with mean (0, 0), Diagonal covariance, MVN with mean (0, 0), Non-diagonal covariance, MVN with mean (1, 1), Non-diagonal covariance

This wasn’t covered in class; YouTube link here for mini-lecture.

Jul 4

DUE PS 5 due: Problem 2 is extra credit (see Ed)

Jul 8

DUE Final Project Evaluation due

Please fill out SEAS course evaluations on Courseworks!

LS (Story thus far)

Lecture 6.1: Completed our statistical analysis of OLS by deriving its mean-squared error to the true parameter vector and its risk, or test error.

Lecture 6.2: Under another paradigm for machine learning (maximum likelihood estimation), the OLS estimator corresponds to MLE on the Gaussian error model.
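The correspondence in Lecture 6.2 can be sketched as follows, in assumed notation that may differ from the slides: n data points (x_i, y_i), weights w, and independent Gaussian errors with a known variance σ².

```latex
\begin{aligned}
y_i &= w^\top x_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2) \\
\log L(w) &= -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 \\
\hat{w}_{\mathrm{MLE}} &= \arg\max_w \log L(w) = \arg\min_w \sum_{i=1}^{n} (y_i - w^\top x_i)^2
\end{aligned}
```

The first term of the log-likelihood does not depend on w, and the second is a negative multiple of the sum of squared errors, so maximizing the likelihood over w is the same problem as least squares.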

GD (Story thus far)

Lecture 6.1: Closed the story of gradient descent by defining stochastic gradient descent, where we use unbiased estimators of the gradient instead of the full gradient over all the data.
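A minimal sketch of stochastic gradient descent with batch size 1 on a one-weight least squares problem. The data, step size, epoch count, and seed are made-up choices for illustration.

```python
import random

def sgd_least_squares(xs, ys, eta=0.01, epochs=200, seed=0):
    """SGD with batch size 1 on the average squared error (1/n) * sum((w * x - y)**2)."""
    rng = random.Random(seed)
    w = 0.0
    n = len(xs)
    for _ in range(epochs * n):
        i = rng.randrange(n)                  # pick one example uniformly at random
        g = 2 * xs[i] * (w * xs[i] - ys[i])   # gradient of that single example's squared error
        w -= eta * g                          # in expectation over i, g equals the full gradient
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # noiseless made-up data with true slope 2
w_hat = sgd_least_squares(xs, ys)
print(w_hat)
```

Because this data is noiseless, every single-example step moves w toward the true slope, so the iterates settle at 2. With noisy data the iterates would instead keep fluctuating around the least squares solution unless the step size is decreased over time.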

Resources

I’ll update this with additional resources as the class progresses. Feel free to use these or ignore them completely. If you know of any additional resources that you think would be helpful for the class, let me know and I’ll add them here!

LaTeX

Overleaf, the Google Docs for LaTeX. Can be used for all the assignments in this class.
Overleaf’s guide to learn LaTeX in 30 minutes
David Xiao’s Beginner’s guide to LaTeX
Eddie Kohler’s LaTeX usage notes. These might be worth a browse to rectify common stylistic problems with using LaTeX.
Detexify, an applet to get the LaTeX command for any handwritten symbol.

In general, Googling an issue you’re having with LaTeX usually provides a plethora of solutions.

Python

Whirlwind Tour of Python should have most everything you need to get up to speed with the programming required in this course.
A condensed version of this Whirlwind Tour of Python can be found here: python_crashcourse.ipynb.
Here is a video going through this crash course in case you want to get up to speed in video format.

Linear Algebra Prerequisites

If you need to refresh any linear algebra, these may be good resources.

Linear Algebra and Its Applications by Gilbert Strang
Gilbert Strang’s MIT Course on Linear Algebra
Linear Algebra Done Wrong by Sergei Treil, available free as PDF here
Daniel Hsu’s course notes for Computational Linear Algebra
3Blue1Brown’s Essence of Linear Algebra videos

Multivariable Calculus Prerequisites

If you need to refresh any multivariable calculus, these may be good resources.

MIT OpenCourseware course on multivariable calculus
Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach by Barbara Burke Hubbard and John H. Hubbard.
Vector Calculus by Susan Jane Colley

Probability Theory and Statistics Prerequisites

If you need to refresh any probability and statistics, these may be good resources.

Introduction to Probability for Data Science by Stanley H. Chan
A First Course in Probability by Sheldon Ross.
Introduction to Probability by Joseph K. Blitzstein and Jessica Hwang.
Probability and Statistics for Engineers and Scientists by Ronald E. Walpole.