Cross-Entropy, Label Smoothing, and Focal Loss

Connections between cross-entropy loss, label smoothing, and focal loss.


The cross-entropy loss is one of the most popular loss functions in modern machine learning, most often used for classification problems.

One way to derive the cross-entropy loss is by thinking in terms of a true but unknown data distribution $p$ and an estimated distribution $q$. Using the KL-divergence to compare the two distributions, our learning objective is to find the distribution $q^\star$ that minimizes the KL-divergence w.r.t. $q$,

$$q^\star = \mathrm{argmin}_q~ KL(p \parallel q)$$

If $q^\star$ perfectly models the true underlying data distribution $p$, then we achieve the global minimum $KL(p \parallel q^\star) = 0$.

Now, let us unpack the KL-divergence term starting with the definition.

$$
\begin{aligned}
KL(p \parallel q) &= \mathbb{E}_p\left[\log{\frac{p}{q}}\right] \\
&= \mathbb{E}_p[\log{p}] - \mathbb{E}_p[\log{q}] \\
KL(p \parallel q) &= - H[p] + CE(p \parallel q)
\end{aligned}
$$

where $H[p]$ is the entropy of distribution $p$, and $CE(p \parallel q)$ is the cross-entropy loss between distributions $p$ and $q$.

It can now be seen that minimizing the KL-divergence is equivalent to minimizing the cross-entropy loss: the entropy term $H[p]$ is a constant outside our control (a property of the true data-generating process) and, more importantly, independent of $q$ for optimization.
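
As a quick numerical sanity check of this decomposition, here is a minimal NumPy sketch with two made-up distributions $p$ and $q$ (the specific numbers are arbitrary):

```python
import numpy as np

# Two hypothetical distributions over K = 3 classes (arbitrary numbers).
p = np.array([0.7, 0.2, 0.1])   # "true" distribution p
q = np.array([0.5, 0.3, 0.2])   # modeled distribution q

kl = np.sum(p * np.log(p / q))          # KL(p || q)
ce = -np.sum(p * np.log(q))             # CE(p || q)
entropy_p = -np.sum(p * np.log(p))      # H[p]

# KL(p || q) = -H[p] + CE(p || q), so minimizing CE minimizes KL.
assert np.isclose(kl, ce - entropy_p)
print(f"KL={kl:.4f}, CE={ce:.4f}, H[p]={entropy_p:.4f}")
```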

In practice, for a dataset $\mathcal{D}$ of input-label observations $\{x,y\}$, we compute the average cross-entropy loss for a $K$-way classification problem as,

$$\mathcal{L}_{\mathrm{CE}} = - \frac{1}{\lvert\mathcal{D}\rvert} \sum_{\{x,y\} \in \mathcal{D}} \sum_{k=1}^K \delta_{y=k} \log{q(k \mid x)},$$

where the outer sum is over all the observations, and the inner sum is the cross-entropy between the true conditional distribution $p(y \mid x)$ and the modeled conditional distribution $q(y\mid x)$. $p(y \mid x)$ is represented as a delta distribution which puts all its mass on the true label, i.e. $k = y$.
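
As a concrete (hypothetical) example, a minimal NumPy implementation of this averaged loss for integer labels might look as follows; the `softmax` and `cross_entropy` helpers and the toy batch are my own for illustration:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis, giving q(y | x).
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Average cross-entropy against delta (one-hot) targets:
    # -1/|D| * sum over observations of log q(y | x) at the true label.
    q = softmax(logits)
    rows = np.arange(labels.shape[0])
    return -np.mean(np.log(q[rows, labels]))

# Toy batch: |D| = 4 observations, K = 3 classes.
logits = np.array([[ 2.0, 0.5, -1.0],
                   [ 0.1, 1.2,  0.3],
                   [-0.5, 0.0,  2.5],
                   [ 1.0, 1.0,  1.0]])
labels = np.array([0, 1, 2, 0])
print(cross_entropy(logits, labels))
```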

Label Smoothing

Label smoothing [1] is a common trick used when training neural network classifiers to ensure that the network does not become over-confident and is better calibrated.

Instead of the delta distribution $p(y\mid x) = \delta_{y=k}$ we noted earlier, the key idea of label smoothing is to use a smoothed target distribution $\widetilde{p}(y\mid x)$ such that, with probability $\epsilon < 1$, the target is resampled uniformly at random, i.e.

$$\widetilde{p}(y\mid x) = \epsilon \cdot \frac{1}{K} + (1-\epsilon)\cdot \delta_{y=k}$$

The implied loss function is now $CE(\widetilde{p} \parallel q)$.

$$
\begin{aligned}
CE(\widetilde{p}\parallel q) &= - \mathbb{E}_{\widetilde{p}}\left[\log{q}\right] \\
&= \epsilon \cdot CE(U \parallel q) + (1-\epsilon) \cdot CE(p \parallel q)
\end{aligned}
$$

Therefore, with a few rearrangements, what we get is a weighted objective: the first term $CE(U \parallel q)$ nudges our model towards the uniform distribution over labels $U$, and the remainder is the same old cross-entropy loss, reweighted by $1-\epsilon$ [2].
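
To make this concrete, here is a small NumPy sketch (with made-up probabilities and helper names of my own) that builds the smoothed targets and checks that the resulting loss matches the weighted objective $\epsilon \cdot CE(U \parallel q) + (1-\epsilon) \cdot CE(p \parallel q)$:

```python
import numpy as np

def smooth_targets(labels, num_classes, eps=0.1):
    # Replace one-hot targets with eps * 1/K + (1 - eps) * delta_{y=k}.
    one_hot = np.eye(num_classes)[labels]
    return eps / num_classes + (1.0 - eps) * one_hot

# Hypothetical modeled distributions q(y | x) for two observations, K = 3.
q = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2]])
labels = np.array([0, 1])
eps, K = 0.1, 3

p_tilde = smooth_targets(labels, K, eps)
ce_smoothed = -np.mean(np.sum(p_tilde * np.log(q), axis=-1))

# Weighted decomposition: eps * CE(U || q) + (1 - eps) * CE(p || q).
ce_uniform = -np.mean(np.sum(np.log(q) / K, axis=-1))
ce_hard = -np.mean(np.log(q[np.arange(len(labels)), labels]))
assert np.isclose(ce_smoothed, eps * ce_uniform + (1 - eps) * ce_hard)
print(f"smoothed CE = {ce_smoothed:.4f}")
```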

This objective makes intuitive sense. We want to match the true distribution $p$, but we regularize the classifier by also matching the uniform distribution, which smooths it out. Label smoothing demonstrably leads to better generalization and calibration, although it hurts model distillation: by encouraging representations of the same label to cluster tightly, it loses information at the penultimate layer [3].

Focal Loss

Another proposal to improve the calibration of neural networks is focal loss [4], originally developed for object detection [5].

Focal loss modifies the original cross-entropy loss such that, for $\gamma \geq 1$ [6]:

$$CE_\gamma(p \parallel q) = -\mathbb{E}_{p}\left[(1-q)^\gamma \log{q} \right].$$

This objective implies that as soon as $q$ starts modeling the original distribution $p$ well, we artificially downweight the loss incurred. Again, this makes intuitive sense, since the cross-entropy loss has a tendency to keep fitting until $q$ collapses to the degenerate $\delta_{y=k}$ distribution.
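
A minimal sketch of this downweighting effect, assuming we already have the probability $q$ assigned to the true label (the numbers are made up):

```python
import numpy as np

def focal_term(q_true, gamma=3.0):
    # Per-example focal loss (1 - q)^gamma * (-log q) for the true-label
    # probability q; gamma = 0 recovers the plain cross-entropy term.
    q_true = np.asarray(q_true)
    return (1.0 - q_true) ** gamma * -np.log(q_true)

# True-label probabilities for a hard example (0.3) and an easy one (0.95).
q_true = np.array([0.3, 0.95])
print(-np.log(q_true))       # cross-entropy terms: ~[1.204, 0.051]
print(focal_term(q_true))    # focal terms: the easy example is heavily downweighted
```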

With a bit of algebraic massaging, we can understand the connection of focal loss to cross-entropy loss.

$$
\begin{aligned}
CE_\gamma(p \parallel q) &= -\mathbb{E}_{p}\left[(1-q)^\gamma \log{q} \right] \\
&\geq -\mathbb{E}_p[\log{q}] + \gamma \mathbb{E}_{p}[q\log{q}] \\
&= CE(p \parallel q) - \gamma \left\lvert \sum_{k=1}^K p_k q_k\log{q_k} \right\rvert \\
&= CE(p \parallel q) - \gamma \left\lvert P \cdot Q \right\rvert \\
&\geq CE(p \parallel q) - \gamma \lVert P \rVert_{\infty} \lVert Q \rVert_{1} \\
&= CE(p \parallel q) - \gamma \sum_{k=1}^K \left\lvert q_k\log{q_k} \right\rvert \\
&= CE(p \parallel q) + \gamma \mathbb{E}_q[\log{q}] \\
&= CE(p \parallel q) - \gamma H[q]
\end{aligned}
$$

where the second line follows from Bernoulli's inequality, $(1-q)^\gamma \geq 1 - \gamma q$ for $\gamma \geq 1$ (applied inside the expectation, where $-\log{q} \geq 0$), and the third line follows from the definition of the modulus $\lvert\cdot\rvert$ operator, since the terms inside the expectation are always non-positive. $P = [p_1,\dots,p_K]$ is the vector of probabilities from the true distribution, whose infinity norm is $\lVert P \rVert_{\infty} = 1$ since it is one-hot encoded, and $Q = [q_1\log{q_1},\dots,q_K\log{q_K}]$ is the vector constructed from our modeled distribution $q$, so that we can apply Hölder's inequality. We can then drop the modulus since each term is non-positive, and the last term is simply the negative entropy of $q$.

Therefore, the focal loss minimizes an upper bound of the entropy-regularized cross-entropy loss. Regularizing with the entropy of $q$ nudges the learned distribution towards higher entropy, i.e. smoother predictive distributions, which demonstrably leads to better calibration [4].
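
The bound is easy to probe numerically. The sketch below (my own check, not from the paper) draws random one-hot targets and random modeled distributions and verifies that the focal loss never drops below $CE(p \parallel q) - \gamma H[q]$:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, K = 3.0, 5

for _ in range(1000):
    q = rng.dirichlet(np.ones(K))      # random modeled distribution q
    y = rng.integers(K)                # random true label (one-hot p)
    focal = (1.0 - q[y]) ** gamma * -np.log(q[y])
    ce = -np.log(q[y])                 # CE(p || q) with one-hot p
    h_q = -np.sum(q * np.log(q))       # entropy H[q]
    # Focal loss upper-bounds the entropy-regularized cross-entropy.
    assert focal >= ce - gamma * h_q - 1e-12
print("bound held on all samples")
```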

Remarks

It is intuitive that learning smoother classifier distributions improves calibration. Both label smoothing and focal loss bear neat connections to the original cross-entropy loss, via a reweighted objective and an entropy-regularized objective respectively. More importantly, alongside calibration, these methods often improve generalization. I wonder what other objectives lead to similar enhancements.

Footnotes

  1. Christian Szegedy et al. “Rethinking the Inception Architecture for Computer Vision.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016): 2818-2826. https://ieeexplore.ieee.org/document/7780677

  2. A canonical choice of $\epsilon$ is $0.1$.

  3. Rafael Müller et al. “When Does Label Smoothing Help?” Neural Information Processing Systems (2019). https://arxiv.org/abs/1906.02629

  4. Jishnu Mukhoti et al. “Calibrating Deep Neural Networks using Focal Loss.” ArXiv abs/2002.09437 (2020). https://arxiv.org/abs/2002.09437

  5. Tsung-Yi Lin et al. “Focal Loss for Dense Object Detection.” 2017 IEEE International Conference on Computer Vision (ICCV) (2017): 2999-3007. https://arxiv.org/abs/1708.02002

  6. A canonical choice of $\gamma$ is $3$.