Introduction to Stochastic Gradient Descent

Categories: Math, Optimization, Machine-Learning

Author: Perry

Published: December 23, 2025

This article provides a preliminary introduction to the Stochastic Gradient Descent (SGD) algorithm and its mini-batch variant, covering their basic principles, assumptions, and mathematical properties.

1 Introduction

Generally, the training process of a machine learning model can be modeled as the following Stochastic Optimization Problem, whose objective is to minimize the Expected Risk over the data distribution:

\[ \min_{\theta \in \mathbb{R}^d} F(\theta) = \mathbb{E}_{z \sim \mathcal{D}} [\ell(z; \theta)] \]

where:

  • \(\theta \in \mathbb{R}^d\) represents the model parameters.
  • \(z\) is a random sample drawn from distribution \(\mathcal{D}\), often represented as a sample-label pair \((x, y)\) in supervised learning.
  • \(\ell(\cdot; \theta)\) is the per-sample loss function.

Since the data distribution \(\mathcal{D}\) is usually unknown or the dataset is extremely large, directly computing the above expectation is often infeasible. The traditional deterministic Gradient Descent Method requires iterating over all data to compute the exact gradient \(\nabla F(\theta)\), making it unsuitable for large-scale data scenarios.

So, is there a good way to solve this type of optimization problem? Let’s look at this problem from a different perspective. We know that “minimizing the objective function” is often related to “finding the zero point of the objective function’s gradient.” That is, finding the root \(\theta^*\) of the equation: \[ M(\theta) \triangleq \nabla F(\theta) = 0 \] As mentioned above, since the data distribution \(\mathcal{D}\) is usually unknown or the dataset is extremely large, we generally cannot obtain its analytical form or exact value, and general numerical methods for solving nonlinear equations cannot be directly applied.

In 1951, Robbins and Monro proposed a Stochastic Approximation method to solve such problems (Robbins and Monro 1951). This is the theoretical foundation of the Stochastic Gradient Descent (SGD) algorithm.

To deepen understanding, we can first look at the form of the original problem in the paper and then map it back to our stochastic optimization problem. Specifically, for any given parameter \(\theta\in \mathbb{R}\), if it corresponds to a random variable \(Y= Y(\theta)\) with distribution function \(H(y \mid \theta)\), we can define \[ M(\theta) = \mathbb{E}[Y(\theta)] = \int_{-\infty}^{\infty} y \, dH(y \mid \theta) \] and then define a nonlinear equation \[ M(\theta) = \alpha \] When we have little knowledge about \(H(y \mid \theta)\) and \(M(\theta)\), traditional numerical methods struggle to find the root of this equation. However, Robbins and Monro proposed an iterative algorithm and proved that, under certain assumptions, the parameter sequence generated by this algorithm converges in mean square to the root of the equation \(M(\theta) = \alpha\), and thus converges in probability. This algorithm first initializes a parameter value \(\theta_0\) and a step size sequence \(\{\eta_t\}\), then updates the parameter via the following iterative formula: \[ \theta_{t+1} = \theta_t + \eta_t(\alpha - y_t) \] where \(y_t\) satisfies \(\Pr\{y_t \leq y \mid \theta_t\} = H(y \mid \theta_t)\). In practical use, \(y_t\) can be obtained by sampling from the random variable \(Y(\theta_t)\).
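As a concrete illustration, the Robbins–Monro iteration can be sketched in a few lines of Python. The regression function \(M(\theta) = 2\theta\) and the Gaussian noise model below are assumptions chosen purely for this toy example; with \(\alpha = 3\), the root of \(M(\theta) = \alpha\) is \(\theta^* = 1.5\).

```python
import random

def robbins_monro(alpha, theta0=0.0, steps=20000, seed=0):
    """Robbins-Monro iteration: theta_{t+1} = theta_t + eta_t * (alpha - y_t).

    Toy setting (assumed for illustration only): the unknown regression
    function is M(theta) = 2 * theta, observed through additive Gaussian
    noise, so the root of M(theta) = alpha is theta* = alpha / 2.
    """
    rng = random.Random(seed)
    theta = theta0
    for t in range(steps):
        y = 2.0 * theta + rng.gauss(0.0, 1.0)  # sample y_t ~ H(. | theta_t)
        eta = 1.0 / (t + 1)  # satisfies sum eta_t = inf, sum eta_t^2 < inf
        theta += eta * (alpha - y)
    return theta

theta_hat = robbins_monro(alpha=3.0)  # drifts toward theta* = 1.5
```

Note that only noisy observations \(y_t\) are ever used; the iteration never evaluates \(M(\theta)\) itself, which is exactly what makes the scheme applicable when \(H(y \mid \theta)\) is unknown.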

Applying this theoretical framework to the stochastic optimization problem in machine learning, we can make the following correspondences:

  • We aim to solve \(M(\theta) \triangleq \nabla F(\theta) =0\), where \(F(\theta) = \mathbb{E}_{z \sim \mathcal{D}} [\ell(z; \theta)]\).

  • Since \(\nabla F(\theta) = \nabla_\theta \mathbb{E}_{z \sim \mathcal{D}} [\ell(z; \theta)] = \mathbb{E}_{z \sim \mathcal{D}} [\nabla_\theta \ell(z; \theta)]\), the random variable \(Y(\theta)\) can correspond to the gradient of the loss function for a random sample \(\nabla \ell(z; \theta)\), and \(y_t\) corresponds to a stochastic estimate of the gradient, i.e., \(y_t =\nabla \ell(z_t; \theta)\), where \(z_t\) is a random sample drawn from the data distribution \(\mathcal{D}\).

Under this correspondence, the iterative formula becomes: \[ \theta_{t+1} = \theta_t - \eta_t \nabla \ell(z_t; \theta_t) \]

This is precisely the basic form of the Stochastic Gradient Descent (SGD) algorithm.

In practice, data is usually provided as a finite (but large) dataset \(\mathcal{D} = \{z_i\}_{i=1}^N\). In this case, the expected risk can be approximated by the empirical risk: \[ F(\theta) \approx \hat{F}(\theta) = \frac{1}{N} \sum_{i=1}^N \ell(z_i; \theta) \]

2 Stochastic Gradient Descent Algorithm

It should be noted that in the paper (Robbins and Monro 1951), to prove that the iterative sequence converges to the unique root of the equation \(M(\theta)=\alpha\), the authors adopted a series of somewhat “stringent” conditions: the problem is one-dimensional, \(M(\theta)\) is monotonic in an appropriate sense and has a unique root, \(Y(\theta)\) is almost surely uniformly bounded for all \(\theta\), and the step-size sequence \(\{\eta_t\}\) satisfies \[ \sum_{t=0}^\infty \eta_t = \infty,\qquad \sum_{t=0}^\infty \eta_t^2 < \infty. \] Under these conditions, classical stochastic approximation theory can prove that \(\{\theta_t\}\) converges in mean square.

Today, the dimensionality of problems to which the Stochastic Gradient Descent algorithm is applied is no longer limited to one, and the statistical properties of data are far more complex than “uniformly bounded”. The stochastic gradient used at the \(t\)-th iterate \(\theta_t\) is also no longer limited to the gradient of a single randomly drawn sample, but is generalized to a gradient estimator (an oracle): a mechanism that, given the \(t\)-th iterate \(\theta_t\) and its associated randomness \(z_t\) as input, returns an estimate of the gradient. It can be a single-sample gradient, a mini-batch gradient (see below), or another variant. We denote this mechanism as \(g(\theta_t;z_t)\). For notational simplicity, when no confusion arises, we will sometimes write \(g(\theta_t)\) in what follows.

To analyze the convergence of the SGD algorithm, the following assumptions are commonly used:

  1. Assumptions on Smoothness and (Strong) Convexity of the Objective Function: \(F\) is differentiable over the entire parameter space, and its gradient satisfies Lipschitz continuity. Sometimes \(F\) is required to be a (strongly) convex function. In the strongly convex case, if a minimum exists, it is unique. It is worth noting that the gradient of a convex function satisfies the monotonicity condition, i.e., for any \(\theta_1, \theta_2 \in \mathbb{R}^d\), we have \[ (\nabla F(\theta_1) - \nabla F(\theta_2))^\top (\theta_1 - \theta_2) \geq 0. \]

  2. Unbiasedness Assumption for the Stochastic Gradient: Given the current parameter \(\theta_t\), the gradient estimator \(g(\theta_t)\) satisfies \[ \mathbb{E}[g(\theta_t)] = \nabla F(\theta_t), \] meaning that in the sense of conditional expectation, the stochastic gradient coincides with the true gradient.

  3. Bounded Noise Variance (or Second Moment) Assumption: The deviation between the stochastic gradient and the true gradient has a bounded second moment, i.e., there exists a constant \(\sigma^2\) such that \[ \mathbb{E}\bigl[\|g(\theta_t) - \nabla F(\theta_t)\|^2 \mid \theta_t\bigr] \leq \sigma^2, \] or more generally, there exist constants \(A, B \geq 0\) such that \[ \mathbb{E}\bigl[\|g(\theta_t) - \nabla F(\theta_t)\|^2 \mid \theta_t\bigr] \leq A + B \|\nabla F(\theta_t)\|^2. \]

These assumptions can be seen as preserving the core structure of the Robbins–Monro framework while systematically generalizing and relaxing the convergence conditions.

The iterative formula of the Stochastic Gradient Descent algorithm can be written as: \[ \theta_{t+1} = \theta_t - \eta_t \, g(\theta_t), \]

For example, in the classical single-sample SGD, the gradient estimator \(g(\theta_t)\) is taken as the gradient of a single random sample: \[ g(\theta_t) = \nabla_\theta \ell(z_t; \theta_t), \] where \(z_t\) is a random sample drawn from the data distribution \(\mathcal{D}\).
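A minimal single-sample SGD loop can be sketched on a toy problem of our own choosing (an assumption for illustration): the quadratic loss \(\ell(z;\theta) = \tfrac12(\theta - z)^2\) with \(z \sim \mathcal{N}(2, 1)\), whose expected risk is minimized at \(\theta^* = \mathbb{E}[z] = 2\).

```python
import random

def sgd_single_sample(sample, grad, theta0=0.0, steps=50000, seed=0):
    """Single-sample SGD: theta_{t+1} = theta_t - eta_t * grad l(z_t; theta_t).

    `sample` draws z_t ~ D and `grad` evaluates the per-sample gradient;
    both are placeholders supplied by the caller.
    """
    rng = random.Random(seed)
    theta = theta0
    for t in range(steps):
        z = sample(rng)
        eta = 1.0 / (t + 1)  # Robbins-Monro step sizes
        theta -= eta * grad(z, theta)
    return theta

# Toy problem (an assumption for this sketch): l(z; theta) = (theta - z)^2 / 2
# with z ~ N(2, 1), so the expected risk is minimized at theta* = E[z] = 2.
theta_hat = sgd_single_sample(
    sample=lambda rng: rng.gauss(2.0, 1.0),
    grad=lambda z, theta: theta - z,
)
```

For this particular loss and step-size choice, the iterate is exactly the running mean of the samples seen so far, which makes the convergence to \(\theta^*\) easy to see.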

2.1 Mini-batch SGD: The Case of Sampling with Replacement

Classical single-sample SGD uses only one sample per iteration to compute the gradient. While the computational cost is low, the introduced noise can be large (variance \(\sigma^2\) is large), which can easily lead to severe oscillations in the iterative trajectory and makes it difficult to leverage the parallel computing capabilities of modern hardware. Mini-batch Stochastic Gradient Descent (Mini-batch SGD) is a natural generalization of single-sample SGD, using \(M\) samples per iteration to compute the gradient.

Suppose at the \(t\)-th iteration, \(M\) samples are independently drawn with replacement: \[ \mathcal{B}_t = \{z_{t,1}, \ldots, z_{t,M}\}, \] where \(z_{t,1},\dots,z_{t,M} \overset{\text{i.i.d.}}{\sim} \mathcal{D}\). Define the gradient estimator as the average of the gradients of this batch of samples: \[ g^{(M)}(\theta_t) = \frac{1}{M} \sum_{i=1}^M \nabla_\theta \ell(z_{t,i}; \theta_t). \]
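This estimator can be sketched in plain Python; the quadratic toy loss and the ten-point dataset below are assumptions for illustration (`random.choices` draws with replacement):

```python
import random

def minibatch_gradient(data, grad, theta, M, rng):
    """Mini-batch estimator g^(M)(theta): the average per-sample gradient
    over M samples drawn i.i.d. with replacement from `data`."""
    batch = rng.choices(data, k=M)  # sampling WITH replacement
    return sum(grad(z, theta) for z in batch) / M

# Toy setting: l(z; theta) = (theta - z)^2 / 2, so grad l = theta - z, and
# the full gradient at theta = 0 over data {0, ..., 9} is -4.5.
rng = random.Random(0)
data = [float(i) for i in range(10)]
g = minibatch_gradient(data, lambda z, th: th - z, theta=0.0, M=4, rng=rng)
```

Averaging this estimator over many independent draws recovers the full gradient, which is the unbiasedness property proved next.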

We can prove: If the single-sample gradient satisfies the aforementioned unbiasedness and bounded variance assumptions, then the Mini-batch gradient estimator \(g^{(M)}(\theta_t)\) necessarily satisfies the same unbiasedness assumption, and its variance upper bound is smaller. To simplify notation, in the following derivation, without loss of generality, we fix a certain iteration step \(t\) and parameter \(\theta_t\), and denote the single-sample gradient and the full gradient as: \[ g(z) \triangleq \nabla_\theta \ell(z;\theta_t),\qquad \nabla F \triangleq \nabla F(\theta_t). \] Thus, \[ g^{(M)}(\theta_t) = \frac{1}{M}\sum_{i=1}^M g(z_{t,i}). \]

2.1.1 Preservation of Unbiasedness

Assume the single-sample gradient is unbiased, i.e., \[ \mathbb{E}_{z\sim\mathcal{D}}[g(z)] = \nabla F. \] Since expectation is linear and the samples \(z_{t,1},\dots,z_{t,M}\) are independent and identically distributed according to \(\mathcal{D}\), we have: \[ \begin{aligned} \mathbb{E}[g^{(M)}(\theta_t)] &= \mathbb{E}\left[ \frac{1}{M} \sum_{i=1}^M g(z_{t,i}) \right] \\ &= \frac{1}{M} \sum_{i=1}^M \mathbb{E}[g(z_{t,i})] \\ &= \frac{1}{M} \cdot M \cdot \nabla F \\ &= \nabla F. \end{aligned} \]

This shows that the Mini-batch gradient inherits unbiasedness and remains a correct estimator of the true gradient: \[ \boxed{\mathbb{E}[g^{(M)}(\theta_t)] = \nabla F(\theta_t).} \]

2.1.2 Boundedness of Variance

Assume the noise variance of the single-sample gradient is bounded, i.e., there exists a constant \(\sigma^2\) such that \[ \mathbb{E}\bigl[\|g(z) - \nabla F(\theta_t)\|^2\bigr] \leq \sigma^2. \]

Define the gradient noise of the \(i\)-th sample as \[ X_i \triangleq g(z_{t,i}) - \nabla F(\theta_t). \] Then \(\mathbb{E}[X_i]=0\), and \(X_1,\dots,X_M\) are independent and identically distributed. In this case, \[ g^{(M)}(\theta_t) - \nabla F(\theta_t) = \frac1M \sum_{i=1}^M X_i. \]

We are concerned with
\[ \mathbb{E}\bigl[\|g^{(M)} - \nabla F\|^2\bigr] = \mathbb{E}\left\|\frac1M \sum_{i=1}^M X_i\right\|^2. \]

Note the identity \[ \left\|\sum_{i=1}^M X_i\right\|^2 = \sum_{i=1}^M \|X_i\|^2 + 2\sum_{i<j} \langle X_i,X_j\rangle. \]

Taking expectation on both sides: \[ \mathbb{E}\left\|\sum_{i=1}^M X_i\right\|^2 = \sum_{i=1}^M \mathbb{E}\|X_i\|^2 + 2\sum_{i<j} \mathbb{E}\langle X_i,X_j\rangle. \]

Since \(X_i\) and \(X_j\) are independent and \(\mathbb{E}[X_i]=0\), we have \[ \mathbb{E}\langle X_i,X_j\rangle = \mathbb{E}\bigl[X_i^\top X_j\bigr] = \mathbb{E}\bigl[X_i^\top \mathbb{E}(X_j\mid X_i)\bigr] = \mathbb{E}\bigl[X_i^\top \mathbb{E}(X_j)\bigr] = \mathbb{E}\bigl[X_i^\top 0\bigr] = 0. \] Thus, the expectations of all cross terms are \(0\), and hence \[ \mathbb{E}\left\|\sum_{i=1}^M X_i\right\|^2 = \sum_{i=1}^M \mathbb{E}\|X_i\|^2. \]

Since all \(X_i\) are identically distributed, \[ \mathbb{E}\|X_i\|^2 = \mathbb{E}\|X_1\|^2 = \mathbb{E}\|g(z)-\nabla F(\theta_t)\|^2 \le \sigma^2. \] Therefore, \[ \mathbb{E}\left\|\sum_{i=1}^M X_i\right\|^2 \le M \sigma^2. \]

Returning to the quantity we want to compute: \[ \begin{aligned} \mathbb{E}\bigl[\|g^{(M)}(\theta_t) - \nabla F(\theta_t)\|^2\bigr] &= \mathbb{E}\left\|\frac1M \sum_{i=1}^M X_i\right\|^2 \\ &= \frac{1}{M^2} \mathbb{E}\left\|\sum_{i=1}^M X_i\right\|^2 \\ &\le \frac{1}{M^2}\cdot M \sigma^2 \\ &= \frac{\sigma^2}{M}. \end{aligned} \]

Thus, under sampling with replacement, if the single-sample gradient noise satisfies \[ \mathbb{E}\bigl[\|g(z)- \nabla F(\theta_t)\|^2\bigr] \le \sigma^2, \] then the corresponding mini-batch gradient estimator satisfies \[ \boxed{ \mathbb{E}\bigl[\|g^{(M)}(\theta_t) - \nabla F(\theta_t)\|^2 \mid \theta_t\bigr] \le \frac{\sigma^2}{M}. } \]

Since \(M \ge 1\), clearly \(\frac{\sigma^2}{M} \le \sigma^2\). This shows that Mini-batch SGD not only satisfies the bounded variance assumption but also reduces the variance of the gradient estimate by a factor of \(M\). This means that for the same number of iterations, the Mini-batch algorithm can provide a more stable gradient direction, thereby allowing the use of larger step sizes \(\eta_t\) and accelerating convergence.
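This \(1/M\) variance reduction is easy to see numerically. Below, each single-sample gradient is modeled (an assumption for the sketch) as the true gradient plus standard Gaussian noise, so \(\sigma^2 = 1\) and the mini-batch noise should have second moment about \(1/M\):

```python
import random

def empirical_grad_noise(M, trials=20000, seed=0):
    """Monte Carlo estimate of E||g^(M) - grad F||^2 when every single-sample
    gradient equals grad F plus N(0, 1) noise (so sigma^2 = 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        batch_noise = sum(rng.gauss(0.0, 1.0) for _ in range(M)) / M
        total += batch_noise ** 2
    return total / trials

v1, v16 = empirical_grad_noise(M=1), empirical_grad_noise(M=16)
# v1 comes out close to 1 and v16 close to 1/16, matching sigma^2 / M.
```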

2.2 Sampling Without Replacement and Finite Population Correction (FPC)

The variance bound \(\frac{\sigma^2}{M}\) above is widely used in theoretical analysis, but it implicitly relies on the assumption of sampling with replacement, meaning the samples drawn each time are independent of each other.

As mentioned earlier, in practical scenarios data is usually given as a finite (but large) dataset \(\mathcal{D} = \{z_i\}_{i=1}^N\). In this case, we typically consider the empirical risk: \[ F(\theta) \approx \hat{F}(\theta) = \frac{1}{N} \sum_{i=1}^N \ell(z_i; \theta) \] In the engineering implementations of mainstream deep learning frameworks (such as PyTorch and TensorFlow), a sampling-without-replacement strategy is usually adopted: the dataset is shuffled at the beginning of each epoch and then split into several non-overlapping mini-batches.

This leads to the following observation:

If we strictly apply the formula \(\frac{\sigma^2}{M}\), then when the batch size \(M\) equals the dataset size \(N\), the variance bound for the gradient estimate becomes \(\frac{\sigma^2}{N}\). Since \(\sigma^2 > 0\), this implies the variance upper bound for the full-batch gradient is still strictly positive. In fact, however, the full-batch gradient \[ \nabla \hat F(\theta) = \frac{1}{N}\sum_{i=1}^N \nabla \ell(z_i;\theta) \] is a deterministic vector, and its variance should be exactly \(0\).

This discrepancy arises because, under sampling without replacement, the sampled points are not independent. To more accurately describe the behavior of Mini-batch SGD in practical engineering, we need to introduce the statistical concept of Finite Population Correction (FPC).

2.2.1 Mathematical Setup

Consider a finite dataset \[ \mathcal{D} = \{z_1,\dots,z_N\}. \] For fixed parameters \(\theta\), the gradients of all samples constitute a finite population: \[ \mathcal{G} = \{g_1,\dots,g_N\}, \qquad g_i \triangleq \nabla \ell(z_i;\theta). \]

Define the population mean (i.e., the full-batch gradient of the empirical risk) as \[ \nabla \hat F(\theta) = \frac{1}{N}\sum_{i=1}^N g_i, \] and define the population variance (here with the \(\frac{1}{N-1}\) normalization, as in the unbiased sample variance): \[ \sigma^2(\theta) \triangleq \frac{1}{N-1}\sum_{i=1}^N \|g_i - \nabla \hat F(\theta)\|^2. \]

We are now interested in the following: we draw, without replacement, a subset \(S\) of size \(M\) from \(\{1,\dots,N\}\), and define the mini-batch gradient as \[ g_t^{(M)} = \frac{1}{M}\sum_{i\in S} g_i. \] What, then, is the exact form of \[ \mathbb{E}\bigl[\|g_t^{(M)} - \nabla \hat F(\theta)\|^2\bigr]? \]

2.2.2 Finite Population Correction in the Scalar Case

Since sample gradients are generally vectors (\(d>1\)), to clarify the reasoning we first examine the one-dimensional scalar case. Suppose we have fixed scalars \[ x_1,\dots,x_N. \] Define the population mean \[ \bar{x} = \frac{1}{N}\sum_{i=1}^N x_i, \] and the (Bessel-corrected) variance \[ S^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \bar{x})^2. \]

From these, we draw \(M\) indices without replacement to form a subset \(S\). The subset mean is \[ \bar X = \frac{1}{M}\sum_{i\in S} x_i. \]

Introduce indicator variables \[ \xi_i \triangleq \begin{cases} 1, & i\in S,\\ 0, & i\notin S, \end{cases} \qquad i=1,\dots,N. \] Then \[ \bar X = \frac{1}{M}\sum_{i=1}^N \xi_i x_i. \]

Note that \[ \mathbb{E}[\xi_i] = \Pr(i\in S) = \frac{M}{N}. \] A simple calculation verifies unbiasedness: \[ \mathbb{E}[\bar X] = \frac{1}{M}\sum_{i=1}^N x_i\,\mathbb{E}[\xi_i] = \frac{1}{M}\cdot\frac{M}{N}\sum_{i=1}^N x_i = \bar{x}. \]

Now we compute \(\mathbb{E}[(\bar X - \bar{x})^2]\). Let \[ y_i \triangleq x_i - \bar{x},\qquad \sum_{i=1}^N y_i = 0, \] then \[ \bar X - \bar{x} = \frac{1}{M}\sum_{i=1}^N \xi_i y_i. \]

Thus, \[ (\bar X - \bar{x})^2 = \frac{1}{M^2}\left(\sum_{i=1}^N \xi_i y_i\right)^2 = \frac{1}{M^2}\left( \sum_{i=1}^N \xi_i y_i^2 + 2\sum_{1\le i<j\le N} \xi_i\xi_j y_i y_j \right). \]

Taking the expectation of the above: \[ \mathbb{E}[(\bar X - \bar{x})^2] = \frac{1}{M^2}\left( \sum_{i=1}^N \mathbb{E}[\xi_i] y_i^2 + 2\sum_{i<j} \mathbb{E}[\xi_i\xi_j] y_i y_j \right). \]

We already know \(\mathbb{E}[\xi_i]=\frac{M}{N}\). For \(i\neq j\), \(\xi_i\xi_j=1\) if and only if both \(i\) and \(j\) are selected. Note that under simple random sampling without replacement, the probability that any two specified elements are both selected is \[ \Pr(i\in S,\;j\in S) = \frac{\binom{N-2}{M-2}}{\binom{N}{M}} = \frac{M(M-1)}{N(N-1)}, \] Therefore, \[ \mathbb{E}[\xi_i\xi_j]= \mathbb{E}[\mathbb{1}_{\{i\in S,j\in S\}}] = \Pr(i\in S,\;j\in S) = \frac{M(M-1)}{N(N-1)},\qquad i\neq j. \]
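Both inclusion probabilities can be verified by exhaustive enumeration over a small population (the values \(N=6\), \(M=3\) are arbitrary choices for the check):

```python
from itertools import combinations

def inclusion_probs(N, M):
    """Exact Pr(i in S) and Pr(i in S, j in S) under simple random sampling
    without replacement, by enumerating all size-M subsets of {0, ..., N-1}."""
    subsets = list(combinations(range(N), M))
    p_i = sum(1 for s in subsets if 0 in s) / len(subsets)
    p_ij = sum(1 for s in subsets if 0 in s and 1 in s) / len(subsets)
    return p_i, p_ij

p_i, p_ij = inclusion_probs(N=6, M=3)
# p_i = M/N = 0.5 and p_ij = M(M-1)/(N(N-1)) = 0.2, matching the formulas.
```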

Substituting, we get \[ \begin{aligned} \mathbb{E}[(\bar X - \bar{x})^2] &= \frac{1}{M^2}\left( \frac{M}{N}\sum_{i=1}^N y_i^2 + 2\cdot\frac{M(M-1)}{N(N-1)}\sum_{i<j} y_i y_j \right). \end{aligned} \]

Using the identity \[ \left(\sum_{i=1}^N y_i\right)^2 = \sum_{i=1}^N y_i^2 + 2\sum_{i<j} y_i y_j = 0, \] we obtain \[ \sum_{i<j} y_i y_j = -\frac12\sum_{i=1}^N y_i^2. \]

Substituting back: \[ \begin{aligned} \mathbb{E}[(\bar X - \bar{x})^2] &= \frac{1}{M^2}\left( \frac{M}{N}\sum_i y_i^2 + 2\cdot\frac{M(M-1)}{N(N-1)}\cdot\left(-\frac12\sum_i y_i^2\right) \right) \\ &= \frac{1}{M^2}\left[ \frac{M}{N} - \frac{M(M-1)}{N(N-1)} \right]\sum_i y_i^2 \\ &= \frac{1}{M^2}\cdot\frac{M}{N}\cdot\frac{N-M}{N-1}\sum_i y_i^2 \\ &= \frac{1}{M}\cdot\frac{N-M}{N}\cdot\frac{1}{N-1}\sum_i y_i^2. \end{aligned} \]

Note that
\[ S^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \bar{x})^2 = \frac{1}{N-1}\sum_i y_i^2, \] thus we obtain the well-known finite population correction formula: \[ \boxed{ \mathbb{E}[(\bar X - \bar{x})^2] = \frac{1}{M}\left(1 - \frac{M}{N}\right)\,S^2. } \]
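Because the derivation is exact, the finite population correction formula can be verified to machine precision by enumerating every size-\(M\) subset of a small, arbitrarily chosen population:

```python
from itertools import combinations

def fpc_check(xs, M):
    """Compare the exact E[(Xbar - xbar)^2], computed by averaging over all
    size-M subsets, against the FPC formula (1/M)(1 - M/N) S^2."""
    N = len(xs)
    xbar = sum(xs) / N
    S2 = sum((x - xbar) ** 2 for x in xs) / (N - 1)
    subsets = list(combinations(xs, M))
    mse = sum((sum(s) / M - xbar) ** 2 for s in subsets) / len(subsets)
    fpc = (1.0 / M) * (1.0 - M / N) * S2
    return mse, fpc

mse, fpc = fpc_check([1.0, 4.0, 2.0, 8.0, 5.0, 7.0], M=3)
# The two values agree up to floating-point rounding.
```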

Compared with the \(\frac{S^2}{M}\) under sampling with replacement, here there is an additional correction factor \[ 1 - \frac{M}{N} \in (0,1]. \] When \(M\ll N\), this factor is close to \(1\), and sampling without replacement is almost indistinguishable from sampling with replacement; when \(M\to N\), the factor approaches \(0\), and the variance also tends to \(0\), which is fully consistent with the fact that “the full-batch gradient is deterministic.”

2.2.3 Finite Population Correction for the Vector Gradient Case

Returning to the vector gradient case. Recall that we defined \[ g_i \triangleq \nabla \ell(z_i;\theta),\qquad \nabla \hat F(\theta) = \frac{1}{N}\sum_{i=1}^N g_i. \] Let \(S\) be a sample set of size \(M\) drawn without replacement. Then the mini-batch gradient is \[ g_t^{(M)} = \frac{1}{M}\sum_{i\in S} g_i. \]

We are interested in what form \[ \mathbb{E}\bigl[\|g_t^{(M)} - \nabla \hat F(\theta)\|^2\bigr] \] takes.

Note that \[ \|g_t^{(M)} - \nabla \hat F\|^2 = \sum_{k=1}^d \bigl((g_t^{(M)})_k - \nabla \hat F_k\bigr)^2, \] where the subscript \(k\) denotes the \(k\)-th coordinate of the vector. Therefore, \[ \mathbb{E}\bigl[\|g_t^{(M)} - \nabla \hat F\|^2\bigr] = \sum_{k=1}^d \mathbb{E}\bigl[((g_t^{(M)})_k - \nabla \hat F_k)^2\bigr]. \]

Without loss of generality, consider the \(k\)-th coordinate component. Let \[ x_i = (g_i)_{k},\qquad \bar{x}_k = \frac{1}{N}\sum_{i=1}^N x_i = \nabla \hat F_k. \] Applying the one-dimensional result from the previous section, we obtain \[ \mathbb{E}\bigl[((g_t^{(M)})_k - \nabla \hat F_k)^2\bigr] = \frac{1}{M}\left(1 - \frac{M}{N}\right) S_k^2, \] where \[ S_k^2 = \frac{1}{N-1}\sum_{i=1}^N ((g_i)_k - \nabla \hat F_k)^2. \]

Thus, \[ \begin{aligned} \mathbb{E}\bigl[\|g_t^{(M)} - \nabla \hat F\|^2\bigr] &= \sum_{k=1}^d \frac{1}{M}\left(1 - \frac{M}{N}\right) S_k^2 \\ &= \frac{1}{M}\left(1 - \frac{M}{N}\right)\sum_{k=1}^d S_k^2. \end{aligned} \]

Now, \[ \sum_{k=1}^d S_k^2 = \sum_{k=1}^d \frac{1}{N-1}\sum_{i=1}^N ((g_i)_k - \nabla \hat F_k)^2 = \frac{1}{N-1}\sum_{i=1}^N \sum_{k=1}^d ((g_i)_k - \nabla \hat F_k)^2 = \frac{1}{N-1}\sum_{i=1}^N \|g_i - \nabla \hat F\|^2. \]

Therefore, we obtain the vector version of the finite population correction formula: \[ \boxed{ \mathbb{E}\bigl[\|g_t^{(M)} - \nabla \hat F(\theta)\|^2\bigr] = \frac{1}{M}\left(1 - \frac{M}{N}\right) \cdot \frac{1}{N-1}\sum_{i=1}^N \|g_i - \nabla \hat F(\theta)\|^2. } \]
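The vector version can be checked by the same exhaustive enumeration; the five 2-dimensional “gradients” below are an arbitrary toy population chosen for the check:

```python
from itertools import combinations
from math import comb

def fpc_check_vec(grads, M):
    """Exact E||g^(M) - grad Fhat||^2 over all size-M subsets, versus the
    vector FPC formula (1/M)(1 - M/N) * (1/(N-1)) * sum_i ||g_i - mean||^2."""
    N, d = len(grads), len(grads[0])
    mean = [sum(g[k] for g in grads) / N for k in range(d)]

    def sqnorm(v):
        return sum(c * c for c in v)

    pop_var = sum(sqnorm([g[k] - mean[k] for k in range(d)])
                  for g in grads) / (N - 1)
    mse = 0.0
    for s in combinations(grads, M):
        bmean = [sum(g[k] for g in s) / M for k in range(d)]
        mse += sqnorm([bmean[k] - mean[k] for k in range(d)])
    mse /= comb(N, M)
    fpc = (1.0 / M) * (1.0 - M / N) * pop_var
    return mse, fpc

grads = [(1.0, 0.0), (2.0, 3.0), (0.0, 1.0), (4.0, 2.0), (3.0, 5.0)]
mse, fpc = fpc_check_vec(grads, M=2)
```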

If we assume that the population variance of the per-sample gradients is uniformly bounded, i.e., there exists a constant \(\sigma^2\) such that for all \(\theta\), \[ \frac{1}{N-1}\sum_{i=1}^N \|\nabla \ell(z_i;\theta) - \nabla \hat F(\theta)\|^2 \le \sigma^2, \] then the without-replacement mini-batch gradient satisfies \[ \boxed{ \mathbb{E}\bigl[\|g_t^{(M)} - \nabla \hat F(\theta)\|^2\bigr] \le \frac{1}{M}\left(1 - \frac{M}{N}\right)\sigma^2. } \]

Compared with the \(\frac{\sigma^2}{M}\) under sampling with replacement, sampling without replacement has an additional finite population correction factor \[ \left(1 - \frac{M}{N}\right). \] It has three direct implications:

  • When \(M=1\), the correction factor is \(1\), reducing to the variance \(\sigma^2\) of single-sample SGD;
  • When \(M\ll N\), \(\frac{M}{N}\) is small, the correction factor is close to \(1\), and sampling without replacement is almost indistinguishable from sampling with replacement, which is a common approximation in big-data scenarios;
  • When \(M=N\), the correction factor is \(0\), so \(\mathbb{E}\|g_t^{(N)} - \nabla \hat F(\theta)\|^2 = 0\), meaning the full-batch gradient is deterministic with no randomness, which more accurately describes the behavior of Mini-batch SGD in practical engineering.

In summary, we can unify the unbiasedness and variance assumptions of mini-batch SGD in the finite sample case as follows:

  • For sampling with replacement: \[ \mathbb{E}[g_t^{(M)}(\theta_t)] = \nabla \hat F(\theta_t), \qquad \mathbb{E}\bigl[\|g_t^{(M)}(\theta_t) - \nabla \hat F(\theta_t)\|^2\bigr] \le \frac{\sigma^2}{M}; \]
  • For sampling without replacement: \[ \mathbb{E}[g_t^{(M)}(\theta_t)] = \nabla \hat F(\theta_t), \qquad \mathbb{E}\bigl[\|g_t^{(M)}(\theta_t) - \nabla \hat F(\theta_t)\|^2\bigr] \le \frac{1}{M}\left(1 - \frac{M}{N}\right)\sigma^2. \]
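The without-replacement scheme summarized above corresponds to the shuffle-then-split epoch loop used by mainstream frameworks. A minimal sketch in plain Python (not actual PyTorch `DataLoader` code):

```python
import random

def epoch_minibatches(data, M, rng):
    """One epoch of without-replacement mini-batches: shuffle the index set
    once, then split it into non-overlapping batches of size M (the last
    batch may be smaller when M does not divide len(data))."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    for start in range(0, len(idx), M):
        yield [data[i] for i in idx[start:start + M]]

rng = random.Random(0)
batches = list(epoch_minibatches(list(range(10)), M=4, rng=rng))
# Each sample appears exactly once per epoch, so the batches are disjoint.
```

Since every sample is visited exactly once per epoch, the batches within an epoch are not independent, which is precisely why the finite population correction factor \(\left(1 - \frac{M}{N}\right)\) appears.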

References

Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” The Annals of Mathematical Statistics 22 (3): 400–407. https://doi.org/10.1214/aoms/1177729586.