Course Skeleton

This is a lecture-by-lecture skeleton of the course, meant to organize and bullet-point what we’ve learned so far in each class. It is not meant to replace your notes or attending class; it is here as a resource to help you, the student, get a bird’s-eye view of the course’s main ideas, definitions, and results.

Table of contents

  1. Lecture 1.1: Vectors, matrices, and least squares (Tues May 27, 2025)
  2. Lecture 1.2: Subspaces, bases, and orthogonality (Thurs May 29, 2025)
  3. Lecture 2.1: Singular Value Decomposition (Tues Jun 3, 2025)
  4. Lecture 2.2: Eigendecomposition and Positive Semidefinite Matrices (Thurs Jun 5, 2025)
  5. Lecture 3.1: Differentiation and vector calculus (Tues Jun 10, 2025)
  6. Lecture 3.2: Gradient Descent, Linearization, and Taylor Series (Thurs Jun 12, 2025)
  7. Lecture 4.1: Optimization and the Lagrangian (Tues Jun 17, 2025)
  8. Lecture 4.2: Convexity and convex optimization (Thurs Jun 19, 2025)

Lecture 1.1: Vectors, matrices, and least squares (Tues May 27, 2025)

  • Definition (vector).
    • “vector-vector multiplication.”
      • Definition (dot product/Euclidean inner product).
        • Example of an inner product.
        • notion of angle.
        • whenever we have an inner product, we have a norm (notion of length).
  • Definition (matrix).
    • “matrix-vector multiplication.”
      • linear combination view
      • system of equations view
    • “matrix-matrix multiplication.”
      • inner product view (entry-by-entry)
      • matrix-vector view (d different matrix-vector multiplications)
      • outer product view (r different rank-1 matrices added up)
    • Definition (matrix inverse).
    • Definition (transpose of a matrix).
  • Definition (span).
    • the columnspace of a matrix X, col(X), is the span of its columns.
  • Definition (linear independence).
  • Definition (rank).
    • number of linearly independent columns/rows
  • Problem: Regression.
    • problem setup: n training examples, d “measurements” or “features.”
    • our approach: find a linear model, w.
    • find w by the principle of least squares regression.
    • Definition (sum of squared residuals).
  • Prop: rank(X^T X) = rank(X).
    • will prove next time
  • Theorem: Ordinary least squares solution.
    • ingredient 1: pythagorean theorem and orthogonality.
    • ingredient 2: invertibility
    • solve the normal equations (sketched in code after this outline).
  • Story 1: least squares regression
    • Got a solution to least squares regression by using geometric intuition and solving the normal equations.
  • Story 2: gradient descent
    • The sum of squared residuals looks like a “bowl.”
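
A minimal NumPy sketch of the least squares story above (illustrative only; the synthetic data, sizes, and names like w_true are assumptions, not from lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3                        # n training examples, d features
X = rng.standard_normal((n, d))      # design matrix (generically full column rank)
w_true = np.array([2.0, -1.0, 0.5])  # hypothetical "true" linear model
y = X @ w_true + 0.1 * rng.standard_normal(n)   # noisy labels

# Principle of least squares: minimize the sum of squared residuals ||Xw - y||^2.
# The minimizer solves the normal equations X^T X w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # close to w_true
```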

Lecture 1.2: Subspaces, bases, and orthogonality (Thurs May 29, 2025)

  • Note: to fit an intercept, add a “dummy” dimension of all 1’s to go from d to d+1.
  • Definition (subspace).
  • Definition (basis).
    • linear independence, span
  • Definition (columnspace).
  • Definition (rank).
  • missing parts of Theorem (OLS)
    • Theorem: invertibility of X^T X
      • ingredient 1: definition of rank/linear independence
      • ingredient 2: positive definiteness of inner product
    • Theorem: Pythagorean theorem
      • ingredient 1: all the properties of inner products
      • ingredient 2: definition of orthogonality
    • Theorem: projection minimizes distance
      • ingredient: Pythagorean theorem
  • Definition (projection).
    • equivalent to OLS
  • Definition (linearity).
    • matrices as linear transformations and vice versa (HW problem)
    • one such linear transformation: projection
  • Definition (unit vector).
  • Definition (orthonormal basis).
  • Definition (orthogonal matrix).
    • for rectangular matrices: Definition (semi-orthogonal matrix).
  • Theorem: Ordinary least squares with orthonormal basis.
    • main theorem in this lecture: simplifies the OLS solution greatly (no inverses; see the sketch after this outline).
  • Story 1: least squares regression
    • Filled in the “geometric intuition” with Theorem: invertibility of X^T X and Theorem: projection minimizes distance
    • Projection is equivalent to OLS.
    • When we have an orthonormal basis, we get a much simpler solution to OLS.
  • Story 2: gradient descent
    • Nothing new: the sum of squared residuals looks like a “bowl.”
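
A minimal sketch of the orthonormal-basis simplification (illustrative; it uses NumPy’s QR factorization, which is just an assumed convenience here for producing an orthonormal basis of col(X)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = rng.standard_normal(100)

# Q is semi-orthogonal (Q^T Q = I); its columns are an orthonormal basis of col(X)
Q, _ = np.linalg.qr(X)

# With an orthonormal basis, projecting y onto col(X) needs no inverse at all:
proj = Q @ (Q.T @ y)

# The same point via the normal equations, which do require inverting X^T X:
w = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(proj, X @ w))   # True: projection is equivalent to OLS
```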

Lecture 2.1: Singular Value Decomposition (Tues Jun 3, 2025)

  • Definition (orthogonal complement).
  • Properties (projection matrices).
    • Prop (Orthogonal decomposition).
    • Prop (Projection and orthogonal complement matrices).
    • Prop (Projecting twice doesn’t do anything).
    • Prop (Projections are symmetric).
    • Prop (1D Projection formula).
    • Proofs left as exercises – ask Sam if you can’t figure one of them out.
  • Problem: best-fitting 1D subspace.
    • Use all the properties of projection matrices to get the final form.
    • Final form gives 1st singular vector and singular value.
  • Definition (Full SVD).
    • Definition (left singular vectors). Give a basis for the columnspace.
    • Definition (right singular vectors). Give a basis for the rowspace (columnspace of X transpose).
    • Definition (singular values). Exactly r (rank of X) positive singular values.
  • Definition (Compact SVD).
  • Theorem (rank-k approximation).
    • “chop off” the SVD at k ≪ r to get the best rank-k approximation of the matrix.
  • Definition (pseudoinverse).
    • “generalization” of the inverse using the SVD.
  • Theorem (OLS with pseudoinverse).
    • Use the pseudoinverse as if it were the actual inverse to get the OLS solution, w (see the sketch after this outline).
  • Theorem (minimum norm solution).
    • Using the pseudoinverse gives us the minimum norm solution when d > n and rank(X) = n (infinitely many exact solutions).
  • Story 1: least squares regression
    • The pseudoinverse unified all situations where we want a least squares solution (when d > n or n > d).
  • Story 2: gradient descent
    • Nothing new: the sum of squared residuals looks like a “bowl.”
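
A sketch of the compact SVD, rank-k approximation, pseudoinverse, and minimum-norm solution (illustrative; assumes a generic random X with d > n, so rank(X) = n):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                        # d > n: infinitely many exact solutions
X = rng.standard_normal((n, d))    # generically rank(X) = n
y = rng.standard_normal(n)

# Compact SVD: X = U diag(s) V^T, with exactly r = rank(X) positive singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation: "chop off" the SVD at k
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Pseudoinverse built from the SVD: X^+ = V diag(1/s) U^T
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(X_pinv, np.linalg.pinv(X)))   # True

# Minimum-norm solution: w = X^+ y solves Xw = y exactly, with the smallest norm
w = X_pinv @ y
print(np.allclose(X @ w, y))                    # True
```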

Lecture 2.2: Eigendecomposition and Positive Semidefinite Matrices (Thurs Jun 5, 2025)

  • Definition (eigenvector).
  • Definition (eigenvalue).
  • Prop (Eigendecomposition of diagonalizable matrices).
    • When are matrices diagonalizable?
    • Connection with the SVD.
  • Definition (Positive Semidefinite Matrices).
    • Three equivalent definitions
      • All eigenvalues are nonnegative.
      • Associated quadratic form is nonnegative.
      • Can be “factored” as X^T X (checked numerically in the sketch after this outline).
  • Theorem (Spectral Theorem). All symmetric matrices are orthogonally diagonalizable (real eigenvalues, orthonormal eigenvectors).
  • Application of eigenvectors/eigenvalues: PCA (principal components analysis).
  • Quick analysis of errors in least squares with eigenvectors/eigenvalues.
  • Definition (Quadratic Forms). Closely related to symmetric matrices.
    • Three possibilities
      • Positive definite
      • Positive semidefinite
      • Indefinite
  • Story 1: least squares regression
    • Eigenvalues/eigenvectors allow us to analyze the errors in least squares regression.
  • Story 2: gradient descent
    • Positive semidefinite and positive definite quadratic forms seem ripe for gradient descent.
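
A quick numerical check of the three PSD characterizations and the spectral theorem (illustrative sketch, not from lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
A = X.T @ X                        # characterization 3: A "factors" as X^T X

# Characterization 1: all eigenvalues are nonnegative (eigh = symmetric eigensolver)
eigvals, Q = np.linalg.eigh(A)
print(np.all(eigvals >= -1e-12))   # True, up to floating-point error

# Characterization 2: the quadratic form v^T A v is nonnegative (one sampled v)
v = rng.standard_normal(4)
print(v @ A @ v >= 0)              # True

# Spectral theorem: symmetric A is diagonalized by an orthogonal Q
print(np.allclose(A, Q @ np.diag(eigvals) @ Q.T))   # True
```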

Lecture 3.1: Differentiation and vector calculus (Tues Jun 10, 2025)

  • Definition (difference quotient).
  • Definition (single-variable derivative).
  • Main idea: differential calculus allows us to replace nonlinear functions with linear approximations.
  • Definition (directional derivative).
  • Definition (partial derivative).
  • Definition (gradient). Only for scalar-valued functions.
  • Definition (Jacobian). For general vector-valued functions.
  • Definition (open ball/neighborhood). The points local to a point.
  • Definition (total derivative/differentiable). Notion of a multivariable derivative.
  • Definition (smoothness). Continuously differentiable, the class C^1.
  • Theorem (Sufficient criterion for differentiability).
    • If a function is smooth, then it is differentiable and its derivative is its Jacobian/gradient.
  • Theorem (directional derivatives from total derivative).
    • If a function is differentiable, we can get all of its directional derivatives from a matrix-vector product with the derivative (checked numerically in the sketch after this outline).
  • Big picture: if a function is smooth, then its derivative is its Jacobian/gradient (which we get from taking the partial derivatives).
    • Directional derivatives come from matrix-vector product with the Jacobian/gradient.
    • We’ll primarily concern ourselves with smooth functions.
  • Definition (Hessian).
  • Theorem (equality of mixed partials). All C^2 functions have symmetric Hessians.
  • Theorem (OLS from optimization).
  • Algorithm (gradient descent).
  • Story 1: least squares regression
    • We can obtain the same OLS theorem using only the tools of optimization/calculus.
  • Story 2: gradient descent
    • We can now properly write out the gradient descent algorithm, since we know what a gradient is.
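
A numerical check that directional derivatives come from a matrix-vector (here, dot) product with the gradient (illustrative sketch; f is the familiar least squares objective):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

# Scalar-valued f(w) = ||Xw - y||^2, whose gradient is 2 X^T (Xw - y)
f = lambda w: np.sum((X @ w - y) ** 2)
grad = lambda w: 2 * X.T @ (X @ w - y)

w = rng.standard_normal(3)
u = rng.standard_normal(3)
u /= np.linalg.norm(u)                     # unit direction

from_gradient = grad(w) @ u                # directional derivative via the gradient
h = 1e-6
from_quotient = (f(w + h * u) - f(w)) / h  # difference quotient with small h
print(from_gradient, from_quotient)        # nearly equal
```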

Lecture 3.2: Gradient Descent, Linearization, and Taylor Series (Thurs Jun 12, 2025)

  • Algorithm (Gradient Descent).
    • Finally defined gradient descent.
  • Definition (first-order approximation/linearization).
  • Definition (polynomial).
  • Definition (Taylor Series). Defined at a point, x_0.
    • pth order Taylor approximation
      • We will mostly concern ourselves with first-order and second-order approximations.
    • pth order Taylor polynomial
  • Definition (Remainder from Taylor Series).
    • The “leftover” after chopping off the Taylor series at some degree.
  • Theorem (Taylor’s Theorem).
    • Quantifying the remainder/error exactly.
  • Definition (beta-smooth functions/matrices).
    • Bound on the maximum eigenvalue of the Hessian.
  • Theorem (Gradient descent for beta-smooth functions).
    • For beta-smooth functions, gradient descent with a small enough step size makes the function value smaller at each iteration (see the sketch after this outline).
    • Ingredients:
      • Taylor’s Theorem (on the 1st order approximation/linearization).
      • beta-smooth function definition.
  • Story 1: least squares regression
    • Nothing too new here – just reviewed obtaining OLS solution via optimization.
  • Story 2: gradient descent
    • Formally wrote the algorithm and gave the first convergence proof of gradient descent for beta-smooth functions.
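
A sketch of the descent guarantee for beta-smooth functions, on the least squares objective (illustrative; beta = 2 * lambda_max(X^T X) is chosen to match this particular f, whose Hessian is 2 X^T X):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)

f = lambda w: np.sum((X @ w - y) ** 2)
grad = lambda w: 2 * X.T @ (X @ w - y)

# f is beta-smooth with beta = 2 * lambda_max(X^T X), since the Hessian is 2 X^T X
beta = 2 * np.linalg.eigvalsh(X.T @ X).max()

w = np.zeros(3)
for _ in range(100):
    w_next = w - (1.0 / beta) * grad(w)   # step size 1/beta is "small enough"
    assert f(w_next) <= f(w) + 1e-12      # function value shrinks every iteration
    w = w_next
print(f(w))
```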

Lecture 4.1: Optimization and the Lagrangian (Tues Jun 17, 2025)

  • Definition (optimization problem).
    • Definition (objective function).
    • Definition (constraint set).
  • Definition (neighborhood/open ball).
  • Definition (interior point).
  • Types of minima:
    • Definition (local minimum). AKA unconstrained local minimum.
    • Definition (constrained local minimum).
    • Definition (global minimum).
  • Theorem (Necessary conditions for unconstrained local minimum).
    • First-order condition.
    • Second-order condition.
  • Theorem (Sufficient conditions for unconstrained local minimum).
    • First-order condition.
    • Second-order condition.
  • Definition (Equality constrained optimization).
  • Definition (regular point).
  • Definition (Lagrangian).
  • Theorem (Lagrange Multiplier Theorem).
    • Necessary conditions for local minima in equality-constrained optimization (a toy example is worked in the sketch after this outline).
  • Definition (inequality constrained optimization).
  • Theorem (KKT Theorem).
    • Definition (complementary slackness).
  • Story 1: least squares regression
    • In some applications, it may be favorable to regularize the least squares objective by trading off minimizing the objective with the norm of the weights (PENDING – this will be covered in the probability lectures)
  • Story 2: gradient descent
    • Nothing new here (we still can only guarantee local minima).
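
A toy equality-constrained problem worked through the Lagrangian (an illustrative example, not from lecture; a, b, and the objective are made up):

```python
import numpy as np

# minimize f(w) = ||w||^2   subject to   h(w) = a^T w - b = 0
# Lagrangian:   L(w, lam) = ||w||^2 + lam * (a^T w - b)
# Stationarity: 2w + lam * a = 0   =>  w is a multiple of a
# Feasibility:  a^T w = b          =>  w* = (b / ||a||^2) a
a = np.array([1.0, 2.0, 2.0])
b = 3.0
w_star = (b / (a @ a)) * a

print(np.isclose(a @ w_star, b))                 # constraint satisfied
# Lagrange condition: grad f(w*) = 2 w* is parallel to grad h(w*) = a
print(np.allclose(np.cross(2 * w_star, a), 0))   # True
```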

Lecture 4.2: Convexity and convex optimization (Thurs Jun 19, 2025)

  • Definition (Convex Optimization Problem).
  • Definition (line segment).
  • Definition (convex set).
    • Example: lines
    • Example: hyperplane
    • Example: halfspace
  • Definition (convex function).
    • Example: quadratic forms
    • Example: affine functions
    • Example: exponential functions
  • Three characterizations of convex functions.
    • Original definition (secant lines always lie above the graph).
    • First-order definition (the linearization always lies below the graph).
    • Second-order definition (Hessian is PSD).
  • Theorem (Optimality for convex optimization).
    • All local minima are global minima.
  • Theorem (Convergence of GD for smooth, convex functions).
    • Convergence of GD to global minimum.
  • Theorem (GD applied to OLS).
    • Algorithm for OLS.
    • Verifying that OLS is a convex optimization problem (see the sketch after this outline).
  • Story 1: least squares regression
    • The least squares objective is a convex function; applying gradient descent takes us to a global minimum.
  • Story 2: gradient descent
    • Applying gradient descent to beta-smooth, convex functions takes us to a global minimum. One such function is the least squares objective.
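
Gradient descent applied to OLS, checked against the closed-form pseudoinverse solution (illustrative sketch; the step size and iteration count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = rng.standard_normal(100)

grad = lambda w: 2 * X.T @ (X @ w - y)

# The OLS objective is convex: its Hessian 2 X^T X is PSD.
beta = 2 * np.linalg.eigvalsh(X.T @ X).max()   # smoothness constant
w = np.zeros(3)
for _ in range(2000):
    w -= (1.0 / beta) * grad(w)

w_star = np.linalg.pinv(X) @ y                 # closed-form OLS solution
print(np.allclose(w, w_star))                  # True: GD reached the global minimum
```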