Log Loss Function
Alex Dyakonov, Chief Research Scientist18 minute read
The log loss function is also commonly called logistic loss or cross-entropy loss (or simply cross entropy), and is often used in classification problems. Let's figure out why it is used and what meaning it has. To fully understand this post, you need a good ML and math background, yet I would still recommend ML beginners to read it (even though the material in this post wasn’t written in a step-by-step format).
Let's start from afar…
Let's think of how the linear regression problem is solved. We want to get a linear function (i.e. weights w) that approximates the target value up to error:
We assumed that the error is normally distributed, x is the feature description of the object (it may also contain a fictitious constant feature so that the linear function has a bias term). Then we know how the responses of our function are distributed and we can write the likelihood function of the sample (i.e., the product of the densities into which the values from the training sample are substituted) and use the maximum likelihood estimation method (in which the maximum likelihood is taken to determine the values of the parameters, and more often its logarithm):
As a result, it turns out that maximizing the likelihood is equivalent to minimizing the mean square error (MSE), i.e. this error function is widely used in regression problems for a reason. In addition to being quite logical, easily differentiable by parameters and easily minimized, it is also theoretically substantiated using the maximum likelihood estimation method if the linear model matches the data accurate to normal noise.
Let's also see how the stochastic gradient (SGD) method is implemented to minimize MSE: we need to take the derivative of the error function for a specific object and write the weight correction formula in the form of a "step towards the antigradient":
We found that the weights of the linear model, when trained by the SGD method, are corrected by adding a feature vector. The coefficient with which they are added depends on the "aggressiveness of the algorithm" (the alpha parameter, which is called the learning rate) and the difference "the algorithm's answer is the correct answer." By the way, if the difference is zero (that is, the algorithm gives an exact answer for a given object), then the weights are not corrected.
Now let's finally talk about log loss. We consider the classification problem with two classes: 0 and 1. The training set can be considered as an implementation of the generalized Bernoulli scheme: for each object, a random variable is generated, which with probability p (its own for each object) takes the value 1 and with probability (1–p) - 0. Suppose that we are building our model so that it generates the correct probabilities, but then we can write the likelihood function:
After taking the logarithm of the likelihood function, we found that maximizing it is equivalent to minimizing the last recorded expression. This is what is called the "logistic error function". For a binary classification problem, in which the algorithm must output the probability of belonging to class 1, it is logical just as much as the MSE is logical in a linear regression problem with normal noise (since both error functions are derived from the maximum likelihood estimation method).
It is often much more understandable to record a log loss error on a single object like this:
Note the unpleasant property of log loss: if for an object of the 1st class we predict a zero probability of belonging to this class or, conversely, for an object of the 0th we predict a unit probability of belonging to class 1, then the error is equal to infinity! Thus, a blunder on one object immediately renders the algorithm useless. In practice, log loss is often limited to some large number (so as not to mess with infinities).
If we ask ourselves which constant algorithm is optimal for a sample of q_1 representatives of class 1 and q_0 representatives of class 0, q_1+q_0 = q, then we get the following:
The last answer is obtained by taking the derivative and equating it to zero. The described problem has to be solved, for example, when constructing decision trees (what label to assign to a leaf if representatives of different classes are included in it). The figure below shows a graph of log loss errors of the constant algorithm for a sample of four objects of class 0 and 6 objects of class 1.
Now imagine that we know that the object belongs to class 1 with probability p, let's see which answer is optimal on this object from the point of view of log_loss: the expected value of our error is as follows:
To minimize the error, we again took the derivative and equated it to zero. We learned that it is optimal for each object to give its probability of belonging to class 1! Thus, to minimize log_loss, one must be able to calculate (estimate) the probabilities of belonging to classes.
If we substitute the obtained optimal solution into the functional to be minimized, then we get the entropy:
This explains why the entropy criterion of splitting (branching) is used when constructing decision trees in classification problems (as well as random forests and trees in boosting). The fact is that the assessment of belonging to class 1 is often made using the arithmetic mean of marks in the leaf. In any case, for a particular tree, this probability will be the same for all objects in the leaf, i.e. constant. Thus, the entropy in the leaf is approximately equal to the log loss error of the constant solution. Using the entropy test we are implicitly optimizing the log loss!
To what extent can log loss vary? It is clear that the minimum value is 0, the maximum is +∞, but the effective maximum can be considered an error when using a constant algorithm (we can hardly come up with an algorithm worse than a constant in solving the problem), i.e.
It is notable that if we take a logarithm with a base of 2, then on a balanced sample it is the segment [0,1].
Connection with logistic regression
The word "logistic" in the name of the error hints at a connection with logistic regression - this is just a method for solving the problem of binary classification, which gets the probability of belonging to class 1. But so far we have been proceeding from the general assumptions that our algorithm generates this probability (the algorithm can be, for example, a random forest or boosting over trees). Let's see that there is still a close connection with logistic regression… let's see how the logistic regression (i.e. sigmoid from a linear combination) is adjusted to this error function by the SGD method.
As we can see, adjusting the weights is exactly the same as when adjusting linear regression! In fact, this indicates the relationship of different regressions: linear and logistic, or rather, the relationship of distributions: normal and Bernoulli.
In many books, another expression goes by the name of log loss function (that is, precisely "logistic loss"), which we can get by substituting the expression for the sigmoid in log loss and redesignating: we assume that the class labels are now -1 and +1, then we get the following:
It is useful to look at the graph of the function central to this view:
As you can see, this is a smoothed (differentiable everywhere) analog of the max (0,x) function, which in deep learning is commonly called ReLU (Rectified Linear Unit). If, when setting the weights, we minimize the log loss, then in this way we set up the classic logistic regression, but if we use ReLU, slightly correct the argument and add regularization, then we get the classic SVM setting:
the expression under the sum sign is usually called Hinge loss. As you can see, often seemingly completely different methods can be obtained by "slightly correcting" the optimized functions to resemble similar ones. By the way, when training RVM (Relevance vector machine), very similar functionality is also used:
Relationship with the Kullback-Leibler divergence
Kullback-Leibler divergence is often used (especially in machine learning, Bayesian approach, and information theory) to calculate the dissimilarity of two distributions. It is determined by the following formula:
P and Q are distributions (the first is usually "true", and the second is the one about which we are interested in how much it looks like true), p and q are the densities of these distributions. The KL divergence is often referred to as distance, although it is not symmetric and does not satisfy the triangle inequality. For discrete distributions, the formula is written as follows:
P_i, Q_i are the probabilities of discrete events. Let's take a look at a specific object x labeled y. If the algorithm gives the probability of belonging to the first class - a, then the assumed distribution on the events "class 0", "class 1" is (1–a, a), and the true one is (1–y, y), therefore the Kullback-Leibler discrepancy between them is as in the function below, which exactly coincides with the log loss.
Setting up on log loss
One way to "fit" the algorithm's responses to log loss is Platt calibration. The idea here is very simple. Let the algorithm generate some estimates of belonging to the 1st class - a. The method was originally developed to calibrate the responses of the support vector machines algorithm (SVM), this algorithm in its simplest implementation separates objects by a hyperplane and simply outputs a class number of 0 or 1, depending on which side of the hyperplane the object is located on. But if we have built a hyperplane, then for any object we can calculate the distance to it (with a minus sign if the object lies in the zero-class half-plane). It is these distances with the sign r that we will convert into probabilities using the following formula:
unknown parameters α, β are usually determined by the method of maximum likelihood estimation on calibration set.
Let’s illustrate the application of the method on a real problem, which the author solved recently. The figure below shows the answers (in the form of probabilities) of two algorithms: gradient boosting (lightgbm) and a random forest (random forest).
It can be seen that the quality of the forest is much lower and it is rather cautious: it underestimates the probabilities for objects of class 1 and overestimates for objects of class 0. Let us arrange all objects in increasing probability (RF), divide them into k equal parts, and for each part calculate the average of all the responses of the algorithm and average of all correct answers. The result is shown below - the points are shown exactly in these two coordinates.
It is easy to see that the points are located on a line similar to a sigmoid - you can estimate the compression-tension parameter in it, it can be seen on the image below. The optimal sigmoid is shown in pink in the figure above. If the responses are subjected to such a sigmoid deformation, then the log loss error of the random forest decreases from 0.37 to 0.33.
Note that here we have deformed the answers of the random forest (these were probability estimates - and they all lay on the segment [0,1]), but from the probability ratios image it can be seen that the sigmoid is needed for deformation. Practice shows that in 80% of situations, in order to improve the log loss error, it is necessary to deform the answers with the help of the sigmoid (as for me, this is also part of the explanation why exactly such functions are successfully used as activation functions in neural networks).
Another calibration option is isotonic regression.
Multi-class log loss
For the sake of completeness, note that log loss is generalized to the case of several classes in a natural way:
q here is the number of elements in the sample, l is the number of classes, a_ij is the answer (probability) of the algorithm on the i-th object to the question of whether it belongs to the j-th class, y_ij = 1 if the i-th object belongs to the j-th class, otherwise y_ij = 0.