Understanding Cross Entropy Loss - Information Theory
Updated: Dec 28, 2019
Most tutorials and articles tell us that a loss function quantifies the distance between the ground truth and the predicted output. Mean Squared Error (MSE) loss expresses this notion very intuitively.
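The intuition is easy to see in a few lines of NumPy (the ground-truth and prediction vectors below are made up for illustration):

```python
import numpy as np

# Hypothetical one-hot ground truth and a model's prediction,
# invented for this example.
y_true = np.array([1.0, 0.0, 0.0])
y_pred = np.array([0.7, 0.2, 0.1])

# MSE is literally the average squared distance between the two vectors.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # ~0.0467
```

The closer the prediction gets to the ground truth, the smaller the distance, and hence the smaller the loss.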
However, this distance intuition is not obvious in cross-entropy loss. This article aims to explain where it comes from.
What intrigued me most about cross-entropy loss is the word "entropy" in it. It is a term one first encounters in information theory.
Entropy - the average amount of information emitted by a stochastic source of data while producing the observed random variable: H(p) = -Σₓ p(x) log₂ p(x).
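A minimal sketch of that definition (the coin-toss distributions are my own examples, not from the article):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) = -sum_x p(x) * log2 p(x), in bits.
    eps guards against log2(0) for zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p + eps))

# A fair coin is maximally uncertain: each toss emits 1 bit on average.
print(entropy([0.5, 0.5]))   # ~1.0
# A biased coin is more predictable, so it emits less information.
print(entropy([0.9, 0.1]))   # ~0.469
```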
On closer inspection, cross-entropy loss looks a lot like relative entropy (better known as KL divergence).
Essentially, cross-entropy solves the problem of counting the bits (the unit of information) needed to identify that a particular event has occurred.
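Concretely, the cross-entropy H(p, q) = -Σₓ p(x) log₂ q(x) is the average number of bits needed to identify an event drawn from p when using a code optimized for q. A small sketch, with distributions invented for illustration:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log2 q(x): average bits needed to
    identify events drawn from p using a code optimized for q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q + eps))

p = np.array([1.0, 0.0])    # the event actually occurred
q = np.array([0.8, 0.2])    # the model's belief about the event
print(cross_entropy(p, q))  # -log2(0.8) ~ 0.32 bits
```

The more confident and correct q is, the fewer bits it costs to describe what happened, so the loss shrinks.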
Let's say the input to the neural network is an image of a dog. Then the ground-truth probability of the "dog" event is p(x) = 1.
The output q(x) will be some positive value, because the soft-max layer outputs probabilities. There has to be a way in which we can compare the values of p(x) and q(x).
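Putting the pieces together, here is a hypothetical sketch of comparing p(x) and q(x) for a 3-class network (the class names and logit values are made up):

```python
import numpy as np

def softmax(z):
    """Turn raw network scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical raw scores for the classes (dog, cat, bird).
logits = np.array([2.0, 0.5, -1.0])
q = softmax(logits)            # q(x): predicted probabilities
p = np.array([1.0, 0.0, 0.0])  # p(x): one-hot ground truth ("dog")

# Cross-entropy compares p and q; with one-hot p it reduces to
# -log q(dog). Natural log here (nats), as most frameworks use.
loss = -np.sum(p * np.log(q))
print(loss)  # ~0.241
```

Because p is one-hot, only the probability the network assigns to the true class matters: the loss is simply how "surprised" the model was by the correct answer.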
Note - Since we are dealing with a discrete space, we use a summation instead of an integration, so the KL divergence looks slightly different here. Why is the space discrete? Think about that ... ;)
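For completeness, the discrete KL divergence D_KL(p ∥ q) = Σₓ p(x) log₂(p(x)/q(x)) can be sketched as a plain sum (the two distributions are invented for illustration):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_x p(x) * log2(p(x) / q(x)).
    A sum, not an integral: class labels form a finite, discrete set."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log2((p + eps) / (q + eps)))

p = np.array([0.9, 0.1])
q = np.array([0.6, 0.4])
print(kl_divergence(p, q))  # ~0.326, positive whenever p != q
print(kl_divergence(p, p))  # 0.0: a distribution never diverges from itself
```

Note the identity D_KL(p ∥ q) = H(p, q) − H(p): since the ground truth p is fixed during training, minimizing cross-entropy is equivalent to minimizing the KL divergence.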