Cross-Entropy, Label Smoothing, and Focal Loss
Connections between cross-entropy loss, label smoothing, and focal loss.
The cross-entropy loss is one of the most popular loss functions in modern machine learning, often used with classification problems.
One way to derive the cross-entropy loss is by thinking in terms of a true but unknown data distribution $p$, and an estimated distribution $q$. Using the KL-divergence to compare the two distributions, our learning objective is to find a distribution $q^\star$ that minimizes the KL-divergence with respect to $q$,
$$q^\star = \mathrm{argmin}_{q}~ KL(p \parallel q).$$
If $q^\star$ perfectly models the true underlying data distribution $p$, then we achieve the global minimum of $KL(p \parallel q^\star) = 0$.
Now, let us unpack the KL-divergence term starting with the definition.
$$\begin{aligned} KL(p \parallel q) &= \mathbb{E}_p\left[\log{\frac{p}{q}}\right] \\ &= \mathbb{E}_p[\log{p}] - \mathbb{E}_p[\log{q}] \\ KL(p \parallel q) &= -H[p] + CE(p \parallel q), \end{aligned}$$
where $H[p] = -\mathbb{E}_p[\log{p}]$ is the entropy of distribution $p$, and $CE(p \parallel q) = -\mathbb{E}_p[\log{q}]$ is the cross-entropy loss between distributions $p$ and $q$.
It can now be seen that minimizing the KL-divergence is equivalent to minimizing the cross-entropy loss: the entropy term $H[p]$ is a constant outside our control (a property of the true data-generating process) and, more importantly, independent of $q$ for optimization.
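The decomposition above can be checked numerically. The following is a minimal sketch in plain Python; the two distributions `p` and `q` are made-up values chosen only for illustration, and the helper names are not from any particular library.

```python
import math

def entropy(p):
    """H[p] = -sum_k p_k log p_k, treating 0 log 0 as 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    """CE(p || q) = -sum_k p_k log q_k."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k log(p_k / q_k)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]  # hypothetical "true" distribution
q = [0.5, 0.3, 0.2]  # hypothetical model distribution

# KL(p || q) and CE(p || q) - H[p] should agree up to floating-point error.
gap = kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))
print(abs(gap) < 1e-12)
```

Since `entropy(p)` does not depend on `q`, any `q` that lowers `cross_entropy(p, q)` lowers `kl_divergence(p, q)` by the same amount.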
In practice, for a dataset $\mathcal{D}$ of input-label observations $\{x,y\}$, we compute the average cross-entropy loss for a $K$-way classification problem as,
$$\mathcal{L}_{\mathrm{CE}} = - \frac{1}{\lvert\mathcal{D}\rvert} \sum_{\{x,y\} \in \mathcal{D}} \sum_{k=1}^K \delta_{y=k} \log{q(k \mid x)},$$
where the outer sum is over all the observations, and the inner sum is the cross-entropy between the true conditional distribution $p(y \mid x)$ and the modeled conditional distribution $q(y\mid x)$. $p(y \mid x)$ is represented as a delta distribution which puts all its mass on the true label, i.e. $k = y$, so only the term $\log{q(y \mid x)}$ survives the inner sum.
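As a rough illustration of the average loss, here is a small sketch in plain Python. It assumes the model's predicted probabilities are already available as lists; `probs` and `labels` are hypothetical toy values, and `average_cross_entropy` is an illustrative helper name.

```python
import math

def average_cross_entropy(probs, labels):
    """Mean of -log q(y | x) over the dataset.

    probs[i][k] is the modeled probability q(k | x_i) and labels[i] is the
    true class index y_i. Because the target is one-hot, only the k = y_i
    term of the inner sum contributes.
    """
    total = 0.0
    for q, y in zip(probs, labels):
        total += -math.log(q[y])
    return total / len(labels)

# Hypothetical predictions for three inputs in a 3-way problem.
probs = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
labels = [0, 1, 2]
loss = average_cross_entropy(probs, labels)
print(loss)
```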
Label Smoothing
Label smoothing^{1} is a common trick used when training neural network classifiers to keep the network from becoming overconfident and to improve its calibration.
Instead of the delta distribution $p(y\mid x) = \delta_{y=k}$ we noted earlier, the key idea of label smoothing is to use a smoothed target distribution $\widetilde{p}(y\mid x)$ such that, with probability $\epsilon < 1$, the target is resampled uniformly at random, i.e.
$$\widetilde{p}(y\mid x) = \epsilon \cdot \frac{1}{K} + (1-\epsilon)\cdot \delta_{y=k}.$$
The implied loss function now is $CE(\widetilde{p} \parallel q)$.
$$\begin{aligned} CE(\widetilde{p}\parallel q) &= - \mathbb{E}_{\widetilde{p}}\left[\log{q}\right] \\ &= \epsilon \cdot CE(U \parallel q) + (1-\epsilon) \cdot CE(p \parallel q) \end{aligned}$$
Therefore, with a few rearrangements, what we get is a weighted objective where the first term $CE(U \parallel q)$ nudges our model towards a uniform distribution over the labels $U$, and the remainder is the same old cross-entropy loss but reweighted with $1-\epsilon$.^{2}
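The weighted form can be verified on a toy example. This is a sketch in plain Python under assumed values: the number of classes `K`, the true label `y`, and the model distribution `q` are all hypothetical, and `smooth_labels` is an illustrative helper rather than a library function.

```python
import math

def cross_entropy(p, q):
    """CE(p || q) = -sum_k p_k log q_k."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def smooth_labels(y, num_classes, eps=0.1):
    """Mix the one-hot target for class y with the uniform distribution."""
    return [eps / num_classes + (1.0 - eps) * (1.0 if k == y else 0.0)
            for k in range(num_classes)]

K, y, eps = 4, 2, 0.1
q = [0.1, 0.2, 0.6, 0.1]  # hypothetical model distribution
one_hot = [1.0 if k == y else 0.0 for k in range(K)]
uniform = [1.0 / K] * K
smoothed = smooth_labels(y, K, eps)

# Cross-entropy against the smoothed target ...
lhs = cross_entropy(smoothed, q)
# ... equals the eps-weighted mix of the uniform and one-hot cross-entropies.
rhs = eps * cross_entropy(uniform, q) + (1.0 - eps) * cross_entropy(one_hot, q)
print(abs(lhs - rhs) < 1e-12)
```

The equality holds because the cross-entropy is linear in its first argument, which is the only rearrangement the derivation needs.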
This objective makes sense intuitively. We want to match the true distribution $p$, but we regularize it such that our classifier is smoothed out by also matching to the uniform distribution. Label smoothing demonstrably leads to better generalization and calibration, although it hurts model distillation: by encouraging the representations of the same label to cluster tightly, it loses information at the penultimate layer.^{3}
Focal Loss
Another proposal to improve calibration of neural networks is focal loss,^{4} originally proposed for object detection.^{5}
Focal loss modifies the original cross-entropy loss such that, for $\gamma \geq 1$,^{6}
$$CE_\gamma(p \parallel q) = -\mathbb{E}_{p}\left[(1-q)^\gamma \log{q} \right].$$
This objective implies that as soon as $q$ starts modeling the original distribution $p$ well, we artificially down-weight the loss incurred. Again, this makes sense intuitively, since the cross-entropy loss has a tendency to keep fitting until we reach the degenerate $\delta_{y=k}$ distribution.
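To see the down-weighting concretely, the sketch below compares the per-example focal loss with the plain cross-entropy for a one-hot target. It is plain Python; `q_true` stands for the modeled probability of the true label, the probed values are arbitrary, and $\gamma = 3$ is used only as an example setting.

```python
import math

def cross_entropy_loss(q_true):
    """Per-example cross-entropy -log(q) for a one-hot target."""
    return -math.log(q_true)

def focal_loss(q_true, gamma=3.0):
    """Per-example focal loss -(1 - q)^gamma * log(q) for a one-hot target."""
    return -((1.0 - q_true) ** gamma) * math.log(q_true)

# The ratio focal / CE equals (1 - q)^gamma, so it shrinks as q approaches 1.
for q_true in (0.1, 0.5, 0.9, 0.99):
    ce = cross_entropy_loss(q_true)
    fl = focal_loss(q_true)
    print(f"q={q_true:.2f}  CE={ce:.4f}  focal={fl:.6f}  ratio={fl / ce:.6f}")
```

Setting `gamma=0.0` makes the modulating factor equal to one, so the function falls back to the ordinary cross-entropy.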
With a bit of algebraic massaging, we can understand the connection of focal loss to cross-entropy loss.
$$\begin{aligned} CE_\gamma(p \parallel q) &= -\mathbb{E}_{p}\left[(1-q)^\gamma \log{q} \right] \\ &\geq -\mathbb{E}_p[\log{q}] + \gamma \mathbb{E}_{p}[q\log{q}] \\ &= CE(p \parallel q) - \gamma \left\lvert \sum_{k=1}^K p_k q_k\log{q_k} \right\rvert \\ &= CE(p \parallel q) - \gamma \left\lvert P \cdot Q \right\rvert \\ &\geq CE(p \parallel q) - \gamma \lVert P \rVert_{\infty} \lVert Q \rVert_{1} \\ &= CE(p \parallel q) - \gamma \sum_{k=1}^K \left\lvert q_k\log{q_k} \right\rvert \\ &= CE(p \parallel q) + \gamma \mathbb{E}_q[\log{q}] \\ &= CE(p \parallel q) - \gamma H[q], \end{aligned}$$
where the second line follows from Bernoulli's inequality, $(1-q)^\gamma \geq 1 - \gamma q$ for $\gamma \geq 1$, and the third line follows from the definition of the modulus operator $\lvert\cdot\rvert$ (the terms inside the expectation are always non-positive). $P = [p_1,\dots,p_K]$ represents the vector of probabilities from the true distribution, with infinity norm $\lVert P \rVert_{\infty} = 1$ since we represent it as a one-hot encoded vector, and $Q = [q_1\log{q_1},\dots,q_K\log{q_K}]$ represents the vector constructed from our modeled distribution $q$, so that we can apply Hölder's inequality, $\lvert P \cdot Q \rvert \leq \lVert P \rVert_{\infty} \lVert Q \rVert_{1}$. We can then drop the modulus since each term is non-positive, so that the last term is simply the negative entropy of $q$.
Therefore, the focal loss is an upper bound on the entropy-regularized cross-entropy loss, and minimizing it minimizes that bound. Regularizing with the entropy of $q$ nudges the learned distribution towards higher entropy, leading to smoother learned distributions, which demonstrably leads to better calibration.^{4}
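The bound can also be spot-checked empirically. The sketch below draws random model distributions and confirms that the focal loss never falls below the entropy-regularized cross-entropy. It is a sanity check under assumed settings (five classes, $\gamma = 3$, a fixed seed, true label at index 0), not a proof, and all helper names are illustrative.

```python
import math
import random

def focal_loss(q, y, gamma):
    """Focal loss for a one-hot target on class y."""
    return -((1.0 - q[y]) ** gamma) * math.log(q[y])

def entropy_regularized_ce(q, y, gamma):
    """CE(p || q) - gamma * H[q] for a one-hot target on class y."""
    ce = -math.log(q[y])
    entropy = -sum(qk * math.log(qk) for qk in q if qk > 0)
    return ce - gamma * entropy

def random_distribution(num_classes, rng):
    """Normalize positive random weights into a probability vector."""
    weights = [rng.random() + 1e-6 for _ in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
gamma = 3.0
samples = [random_distribution(5, rng) for _ in range(1000)]

# Small tolerance guards against floating-point round-off only.
holds = all(
    focal_loss(q, 0, gamma) >= entropy_regularized_ce(q, 0, gamma) - 1e-12
    for q in samples
)
print(holds)
```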
Remarks
It is intuitive to expect calibration to improve by learning smoother classifier distributions. Both label smoothing and focal loss bear neat connections to the original cross-entropy loss, via a reweighted objective and an entropy-regularized objective respectively. More importantly, alongside calibration, these methods often improve generalization. I wonder what other objectives lead to similar enhancements.
Footnotes

1. Szegedy, Christian et al. “Rethinking the Inception Architecture for Computer Vision.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015): 2818-2826. https://ieeexplore.ieee.org/document/7780677

2. A canonical choice of $\epsilon$ is $0.1$.

3. Müller, Rafael et al. “When Does Label Smoothing Help?” Neural Information Processing Systems (2019). https://arxiv.org/abs/1906.02629

4. Mukhoti, Jishnu et al. “Calibrating Deep Neural Networks using Focal Loss.” ArXiv abs/2002.09437 (2020). https://arxiv.org/abs/2002.09437

5. Lin, Tsung-Yi et al. “Focal Loss for Dense Object Detection.” 2017 IEEE International Conference on Computer Vision (ICCV) (2017): 2999-3007. https://arxiv.org/abs/1708.02002

6. A canonical choice of $\gamma$ is $3$.