Cross-Entropy, Label Smoothing, and Focal Loss

Connections between cross-entropy loss, label smoothing, and focal loss.


The cross-entropy loss is one of the most popular loss functions in modern machine learning, often used with classification problems.

One way to derive the cross-entropy loss is by thinking in terms of a true but unknown data distribution $p$, and an estimated distribution $q$. Using KL-divergence to compare two distributions, our learning objective is to find a distribution $q^\star$ such that the KL-divergence is minimized w.r.t. $q$ as,

$$
q^\star = \mathrm{argmin}~ KL(p \parallel q)
$$

If $q^\star$ perfectly models the true underlying data distribution $p$, then we achieve the global minimum of $KL(p \parallel q^\star) = 0$.

Now, let us unpack the KL-divergence term starting with the definition.

$$
\begin{aligned}
KL(p \parallel q) &= \mathbb{E}_p\left[\log{\frac{p}{q}}\right] \\
&= \mathbb{E}_p[\log{p}] - \mathbb{E}_p[\log{q}] \\
KL(p \parallel q) &= - H[p] + CE(p \parallel q)
\end{aligned}
$$

where $H[p]$ is the entropy of distribution $p$, and $CE(p \parallel q)$ is the cross-entropy loss between distributions $p$ and $q$.

It can now be seen that minimizing the KL-divergence is equivalent to minimizing the cross-entropy loss: the entropy term $H[p]$ is a constant outside our control (a property of the true data-generating process) and, more importantly, is independent of $q$ during optimization.
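The decomposition above can be checked numerically. The following is a minimal sketch using only the Python standard library; the two distributions `p` and `q` are made-up numbers chosen for illustration, and the helper names are hypothetical.

```python
import math

def entropy(p):
    """H[p] = -sum_k p_k log p_k, skipping zero-probability entries."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    """CE(p || q) = -sum_k p_k log q_k."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k log(p_k / q_k)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Illustrative distributions over K = 3 classes (assumed values).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

lhs = kl_divergence(p, q)
rhs = -entropy(p) + cross_entropy(p, q)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Since `entropy(p)` does not depend on `q`, any `q` that lowers `cross_entropy(p, q)` lowers `kl_divergence(p, q)` by the same amount.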

In practice, for a dataset $\mathcal{D}$ of input-label observations $\{x,y\}$, we compute the average cross-entropy loss for a $K$-way classification problem as,

$$
\mathcal{L}_{\mathrm{CE}} = - \frac{1}{\lvert\mathcal{D}\rvert} \sum_{\{x,y\} \in \mathcal{D}} \sum_{k=1}^K \delta_{y=k} \log{q(y \mid x)},
$$

where the outer sum is over all the observations, and the inner sum is the cross-entropy between the true conditional distribution $p(y \mid x)$ and the modeled conditional distribution $q(y\mid x)$. Here $p(y \mid x)$ is represented as a delta distribution which puts all its mass on the true label, i.e. $k = y$.
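As a rough sketch of this computation, the delta target means only the $k = y$ term of the inner sum survives, so the loss reduces to the mean negative log-probability of the true label. The names `probs` and `labels` and the numbers below are assumptions made for illustration.

```python
import math

def average_cross_entropy(probs, labels):
    """Mean of -log q(y | x) over the dataset.

    probs[i][k] is the modeled probability q(k | x_i) and labels[i] is the
    true class index y_i. The delta target keeps only the k = y term.
    """
    total = 0.0
    for q_x, y in zip(probs, labels):
        total += -math.log(q_x[y])
    return total / len(labels)

# Hypothetical predictions for 3 observations and K = 3 classes.
probs = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]

loss = average_cross_entropy(probs, labels)
print(loss)
```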

Label Smoothing

Label smoothing [1] is a common trick used in training neural network classifiers to keep the network from becoming over-confident and to improve its calibration.

Instead of the delta distribution $p(y\mid x) = \delta_{y=k}$ we noted earlier, the key idea of label smoothing is to use a smoothed target distribution $\widetilde{p}(y\mid x)$ such that, with probability $\epsilon < 1$, the target is resampled uniformly at random, i.e.

$$
\widetilde{p}(y\mid x) = \epsilon \cdot \frac{1}{K} + (1-\epsilon)\cdot \delta_{y=k}
$$

The implied loss function now is $CE(\widetilde{p} \parallel q)$.

$$
\begin{aligned}
CE(\widetilde{p}\parallel q) &= - \mathbb{E}_{\widetilde{p}}\left[\log{q}\right] \\
&= \epsilon \cdot CE(U \parallel q) + (1-\epsilon) \cdot CE(p \parallel q)
\end{aligned}
$$

Therefore, with a few rearrangements, what we get is a weighted objective where the first term $CE(U \parallel q)$ nudges our model towards the uniform distribution $U$ over the labels, and the remainder is the same old cross-entropy loss but reweighted with $1-\epsilon$ [2].
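The weighted form of the objective can be verified on a single example. This is a minimal sketch, not a reference implementation: `smooth_targets` is a hypothetical helper, and the prediction `q`, the label `y`, and $\epsilon = 0.1$ are assumed values.

```python
import math

def smooth_targets(y, num_classes, eps):
    """Smoothed target: eps / K on every class, plus (1 - eps) on the true label."""
    return [
        eps / num_classes + (1.0 - eps) * (1.0 if k == y else 0.0)
        for k in range(num_classes)
    ]

def cross_entropy(p, q):
    """CE(p || q) = -sum_k p_k log q_k."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

# Assumed model prediction over K = 3 classes, with true label y = 0.
q = [0.7, 0.2, 0.1]
y, num_classes, eps = 0, 3, 0.1

p_tilde = smooth_targets(y, num_classes, eps)
one_hot = [1.0 if k == y else 0.0 for k in range(num_classes)]
uniform = [1.0 / num_classes] * num_classes

lhs = cross_entropy(p_tilde, q)
rhs = eps * cross_entropy(uniform, q) + (1.0 - eps) * cross_entropy(one_hot, q)
print(lhs, rhs)  # both sides agree up to floating-point error
```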

This objective makes sense intuitively. We want to match the true distribution $p$, but we regularize the classifier by also matching the uniform distribution, which smooths out its predictions. Label smoothing demonstrably leads to better generalization and calibration, although it leads to worse model distillation: by encouraging the representations of the same label to cluster tightly, it loses information at the penultimate layer [3].

Focal Loss

Another proposal to improve calibration of neural networks is focal loss [4], originally proposed for object detection [5].

Focal loss modifies the original cross-entropy loss, such that for $\gamma \geq 1$ [6]:

$$
CE_\gamma(p \parallel q) = -\mathbb{E}_{p}\left[(1-q)^\gamma \log{q} \right].
$$

This objective implies that as soon as $q$ starts modeling the original distribution $p$ well, we will artificially downweight the loss incurred. Again, intuitively this makes sense since the cross-entropy loss has a tendency to keep fitting until we reach the degenerate $\delta_{y=k}$ distribution.
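To make the downweighting concrete, here is a small sketch for a one-hot target, where the expectation reduces to a single term in the probability assigned to the true label. The function name `focal_loss`, the probabilities in the loop, and the choice $\gamma = 3$ are assumptions for illustration.

```python
import math

def focal_loss(q_true, gamma):
    """-(1 - q)^gamma * log q, where q is the probability of the true label."""
    return -((1.0 - q_true) ** gamma) * math.log(q_true)

gamma = 3.0
for q_true in (0.1, 0.5, 0.9):
    ce = -math.log(q_true)           # plain cross-entropy term
    fl = focal_loss(q_true, gamma)   # focal loss term
    # fl / ce equals the modulating factor (1 - q)^gamma, which shrinks
    # towards zero as the model grows confident in the true label.
    print(q_true, ce, fl, fl / ce)
```

With `gamma = 0` the modulating factor is 1 and the plain cross-entropy term is recovered.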

With a bit of algebraic massaging, we can understand the connection of focal loss to cross-entropy loss.

$$
\begin{aligned}
CE_\gamma(p \parallel q) &= -\mathbb{E}_{p}\left[(1-q)^\gamma \log{q} \right] \\
&\geq -\mathbb{E}_{p}[\log{q}] + \gamma \mathbb{E}_{p}[q\log{q}] \\
&= CE(p \parallel q) - \gamma \left\lvert \sum_{k=1}^K p_k q_k\log{q_k} \right\rvert \\
&= CE(p \parallel q) - \gamma \left\lvert P \cdot Q \right\rvert \\
&\geq CE(p \parallel q) - \gamma \lVert P \rVert_{\infty} \lVert Q \rVert_{1} \\
&= CE(p \parallel q) - \gamma \sum_{k=1}^K \left\lvert q_k\log{q_k} \right\rvert \\
&= CE(p \parallel q) + \gamma \mathbb{E}_q[\log{q}] \\
&= CE(p \parallel q) - \gamma H[q]
\end{aligned}
$$

where the second line follows from Bernoulli's inequality, and the third line follows from the definition of the modulus operator $\lvert\cdot\rvert$ (the terms inside the expectation are always non-positive). $P = [p_1,\dots,p_K]$ represents the vector of probabilities from the true distribution, whose infinity norm is $\lVert P \rVert_{\infty} = 1$ since we represent it as a one-hot encoded vector, and $Q = [q_1\log{q_1},\dots,q_K\log{q_K}]$ represents the vector constructed from our modeled distribution $q$, so that we can apply Hölder's inequality. We can then remove the modulus since each term is non-positive, such that the last term is simply the negative entropy of $q$.

Therefore, the focal loss is an upper bound on the entropy-regularized cross-entropy loss, and minimizing it minimizes that bound. Regularizing with the entropy of $q$ nudges the learned distribution towards higher entropy, leading to smoother learned distributions, which demonstrably leads to better calibration [4].
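The bound can be spot-checked on random distributions. The sketch below is a sanity check under assumed settings (5 classes, $\gamma = 3$, a fixed random seed, hypothetical helper names), not a proof; it counts how often the focal loss falls below the entropy-regularized cross-entropy for a one-hot target.

```python
import math
import random

def entropy(q):
    """H[q] = -sum_k q_k log q_k."""
    return -sum(qk * math.log(qk) for qk in q if qk > 0)

def focal_loss(q, y, gamma):
    """Focal loss for a one-hot target on class y."""
    return -((1.0 - q[y]) ** gamma) * math.log(q[y])

def entropy_regularized_ce(q, y, gamma):
    """CE(p || q) - gamma * H[q] for a one-hot target on class y."""
    return -math.log(q[y]) - gamma * entropy(q)

def random_distribution(num_classes, rng):
    """Normalize strictly positive random weights into a distribution."""
    weights = [rng.random() + 1e-3 for _ in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
gamma, num_classes = 3.0, 5
violations = 0
for _ in range(1000):
    q = random_distribution(num_classes, rng)
    y = rng.randrange(num_classes)
    # Small tolerance guards against floating-point round-off.
    if focal_loss(q, y, gamma) < entropy_regularized_ce(q, y, gamma) - 1e-12:
        violations += 1

print("violations:", violations)
```

Since the derivation holds for any $\gamma \geq 1$, no violations are expected in this check.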


It is intuitive to expect calibration to improve by learning smoother classifier distributions. Both label smoothing and focal loss bear neat connections to the original cross-entropy loss, via a reweighted objective and an entropy-regularized objective respectively. More importantly, alongside calibration, these methods often improve generalization. I wonder what other objectives lead to similar enhancements.


  1. Christian Szegedy et al. "Rethinking the Inception Architecture for Computer Vision." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015): 2818-2826.

  2. A canonical choice of $\epsilon$ is $0.1$.

  3. Rafael Müller et al. "When Does Label Smoothing Help?" Neural Information Processing Systems (2019).

  4. Mukhoti, Jishnu et al. "Calibrating Deep Neural Networks using Focal Loss." ArXiv abs/2002.09437 (2020).

  5. Lin, Tsung-Yi et al. "Focal Loss for Dense Object Detection." 2017 IEEE International Conference on Computer Vision (ICCV) (2017): 2999-3007.

  6. A canonical choice of $\gamma$ is $3$.