Wigner's Semicircle Law

The theory of random Hermitian matrices is well studied in the wireless communication literature and was recently applied to derive exact asymptotic risk in over-parameterized models (e.g. this paper) and to explain the so-called “double-descent” phenomenon. One key term in the bias and variance components of the risk is the empirical covariance matrix \hat{\Sigma} = X^\top X / n \in \mathbb{R}^{d \times d}, where X \in \mathbb{R}^{n \times d} stacks i.i.d. d-dimensional features along its rows. The empirical covariance is closely related to what’s known in the random matrix literature as a Wigner Hermitian matrix. In more precise terms, a matrix of this type is a random Hermitian matrix whose entries above (and including) the diagonal are i.i.d. with mean 0 and variance 1. The precise value of the variance is unimportant, since rescaling a Wigner matrix by a fixed scalar does not materially change its asymptotic properties.
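To make the setup concrete, here is a minimal numpy sketch (not from the paper; the dimensions n and d and the Gaussian features are illustrative choices of mine) that forms the empirical covariance from a feature matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 100                  # illustrative sample size and feature dimension
X = rng.standard_normal((n, d))  # rows are i.i.d. d-dimensional features
Sigma_hat = X.T @ X / n          # empirical covariance, a d x d matrix
print(Sigma_hat.shape)           # (100, 100)
```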

The theory around Wigner matrices is particularly well-developed. One object of central interest to theorists is the empirical spectral distribution (ESD), or in simple terms, the empirical distribution of the eigenvalues. To be precise, let M_n \in \mathbb{R}^{n \times n} be a Wigner Hermitian matrix. We shall operate with a rescaled version of this matrix, M_n' = M_n / \sqrt{n}, just so that we don’t get exploding spectral norms (the Bai-Yin theorem states that the spectral norm of M_n is of order \sqrt{n}). The ESD of M_n' can be written as \mu_n := \frac{1}{n} \sum_{i=1}^n \delta_{\lambda_i}, where \lambda_1 \le \lambda_2 \le \dots \le \lambda_n are the eigenvalues of M_n'.
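As a quick illustration, here is a sketch of my own (using Gaussian entries for concreteness) of building a real symmetric Wigner matrix, rescaling it by 1/\sqrt{n}, and forming the ESD:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))
M = np.triu(A) + np.triu(A, 1).T    # i.i.d. entries on and above the diagonal, symmetrized
M_prime = M / np.sqrt(n)            # rescaled matrix M_n' = M_n / sqrt(n)
eigs = np.linalg.eigvalsh(M_prime)  # eigenvalues in ascending order

# mu_n puts mass 1/n on each eigenvalue; e.g. its CDF evaluated at t is:
def esd_cdf(t):
    return np.mean(eigs <= t)

print(esd_cdf(0.0))                 # roughly 0.5, since the spectrum is symmetric
```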

Note that given an instantiation of the random matrix, we have a fixed empirical spectral measure; the ESD is thus, in essence, a random measure. What’s cool is that, in the asymptotic limit as n\to \infty, this random measure converges (in multiple senses) to a single deterministic measure, known as Wigner’s semicircular distribution, which can be written analytically as follows

\mu_{\mathrm{sc}} = \frac{1}{2 \pi}\left(4-|x|^{2}\right)_{+}^{1 / 2} d x.
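One can watch the semicircle emerge numerically. The following sketch (again my own, assuming matplotlib is available) overlays a histogram of the eigenvalues of M_n' on the density \frac{1}{2\pi}\left(4-x^{2}\right)_{+}^{1/2}:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
eigs = np.linalg.eigvalsh((np.triu(A) + np.triu(A, 1).T) / np.sqrt(n))

xs = np.linspace(-2.5, 2.5, 400)
density = np.sqrt(np.clip(4 - xs**2, 0.0, None)) / (2 * np.pi)  # semicircle density

plt.hist(eigs, bins=60, density=True, alpha=0.5, label="ESD of M_n'")
plt.plot(xs, density, label="semicircle density")
plt.legend()
plt.show()
```

Even at n = 2000 the histogram hugs the semicircle closely.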

Now the question arises: how should we reason about the convergence of a sequence of random measures, where the central objects of interest are not random variables, i.e. not measurable functions, but rather measures which are themselves governed by the underlying probability measure? To address this, people have defined at least three notions/modes of convergence:

  1. (convergence in probability) For every test function \phi \in C_c(\mathbb{R}), \int_{\mathbb{R}} \phi \, d\mu_n converges in probability to \int_{\mathbb{R}} \phi \, d\mu_{\mathrm{sc}} (see the numerical sketch after this list).
  2. (convergence almost surely) Same as above but replacing “in probability” with “almost surely”.
  3. (convergence in expectation) The “expected measure” \mathbb{E} [\mu_{n}] converges in the vague topology to \mu_{\mathrm{sc}}.
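Here is a small numerical sketch of the first mode (my own illustration; the tent function \phi and the use of scipy’s quadrature are choices of convenience, not from the notes): for a fixed test function \phi \in C_c(\mathbb{R}), the random quantity \int \phi \, d\mu_n concentrates around \int \phi \, d\mu_{\mathrm{sc}} as n grows.

```python
import numpy as np
from scipy import integrate

rng = np.random.default_rng(0)

def phi(x):
    return np.maximum(0.0, 1.0 - np.abs(x))   # a "tent" function in C_c(R)

def esd_integral(n):
    A = rng.standard_normal((n, n))
    eigs = np.linalg.eigvalsh((np.triu(A) + np.triu(A, 1).T) / np.sqrt(n))
    return phi(eigs).mean()                    # int phi d(mu_n) = (1/n) sum_i phi(lambda_i)

# int phi d(mu_sc), by numerical quadrature against the semicircle density
target, _ = integrate.quad(
    lambda x: phi(x) * np.sqrt(max(4 - x**2, 0.0)) / (2 * np.pi), -2, 2
)

for n in (100, 400, 1600):
    print(n, esd_integral(n), target)          # the gap shrinks as n grows
```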

Note that C_c(\mathbb{R}) denotes the space of real-valued continuous functions with compact support. Much of the reason we use this space of functions is that positive linear functionals on it can be nicely characterized as integrals against Radon measures. The following theorem gives the precise characterization.

Riesz Representation Theorem: If I is a positive linear functional on C_c(\mathbb{R}), then there is a unique Radon measure \nu on \mathbb{R} s.t. I(f) = \int f d \nu for all f \in C_c(\mathbb{R}).

A positive linear functional I is a mapping from the function space to the reals, satisfying the additional constraint that I(f) \ge 0 whenever f \ge 0. Note that the Riesz Representation Theorem is also what formally defines the “expected measure” \mathbb{E}[\mu_{n}]: the map \phi \mapsto \mathbb{E}[\int_{\mathbb{R}} \phi \, d\mu_n] is a positive linear functional on C_c(\mathbb{R}), so the theorem yields a unique Radon measure representing it, which is what we denote \mathbb{E}[\mu_{n}].

Going back to the original topic, there are multiple ways to prove convergence to the semicircular distribution. I found the moment method to be by far the easiest to absorb. For instance, to prove convergence in expectation, the two crucial steps in this method are 1) show that the “expected measures” \mathbb{E}[\mu_{n}] are uniformly sub-Gaussian in n, and 2) show that \int_{\mathbb{R}} x^k \, d\mathbb{E}[\mu_{n}] \to \int_{\mathbb{R}} x^{k} \, d\mu_{\mathrm{sc}} as n \to \infty for each k \in \mathbb{N}.
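Before unpacking why both steps are needed, here is a quick numerical check of step 2 (my own sketch; the matrix size and trial count are arbitrary). The kth semicircle moment is 0 for odd k and the Catalan number C_{k/2} for even k, and the averaged ESD moments approach these values:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def expected_moment(n, k, trials=20):
    # Monte Carlo estimate of int x^k d(E[mu_n]), averaging over independent draws
    vals = []
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        eigs = np.linalg.eigvalsh((np.triu(A) + np.triu(A, 1).T) / np.sqrt(n))
        vals.append(np.mean(eigs**k))
    return np.mean(vals)

def semicircle_moment(k):
    # 0 for odd k; Catalan number C_{k/2} = binom(k, k/2) / (k/2 + 1) for even k
    return comb(k, k // 2) // (k // 2 + 1) if k % 2 == 0 else 0

for k in (1, 2, 3, 4, 6):
    print(k, expected_moment(400, k), semicircle_moment(k))
```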

Note that 1) is required to show that the kth moments of the “expected measures” are uniformly bounded in n by a constant of order k^{k/2}. This in turn allows us to expand the characteristic function of the random variable associated with each “expected measure” as a series whose kth term is determined by the kth moment. Then, using 2), we know that the moments converge, and consequently the characteristic functions converge pointwise. Weak convergence then follows from Lévy’s Continuity Theorem. Equivalently, Tao condensed the latter portion of this procedure using the Carleman Continuity Theorem.
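For intuition on why the Carleman route works: the Carleman condition asks that \sum_k m_{2k}^{-1/(2k)} = \infty, where m_{2k} is the (2k)th moment of the limit measure. For the semicircle law m_{2k} = C_k, and since C_k^{1/(2k)} \to 2, the terms of the series tend to 1/2 and the sum diverges. A tiny sketch (my own check, not from the notes) makes this concrete:

```python
from math import comb

def catalan(k):
    return comb(2 * k, k) // (k + 1)   # m_{2k} for the semicircle law

terms = [catalan(k) ** (-1 / (2 * k)) for k in range(1, 60)]
print(terms[0], terms[-1])             # terms decrease toward 1/2, so the series diverges
```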

Note: This post is based on Terence Tao’s notes on random matrix theory, various Wikipedia articles, my real analysis textbook (Folland), and the paper linked in the first paragraph.