Sentim

Loss Functions: What are they and why are they important?

2020-07-04T00:00:00-04:00

Loss functions tell us how wrong our predictions are during training. We then use that information to optimize our machine learning model. But - wait, what about accuracy, precision, and recall - can’t we use those to figure out how wrong our predictions are?

While we can use metrics such as accuracy to get an idea of how wrong our predictions are and to compare various methods and models, they often fail during optimization because they aren’t usefully differentiable. That is, a lot of machine learning methods use gradient based optimizers, which means that the function they optimize has to be differentiable in order to learn. Accuracy, precision, and recall are computed from counts of right and wrong answers, so a small change to the model usually doesn’t change them at all: their gradient is zero almost everywhere and undefined at the points where a prediction flips. That gives the optimizer nothing to follow, so we can’t use them directly to optimize our machine learning models.
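To see this concretely, here is a small Python sketch. The labels, outputs, and function names are made up for illustration, and mean absolute error is only a stand-in for a loss. Nudging one output from 0.6 to 0.61 leaves accuracy exactly where it was, while the loss moves a little.

```python
# Hypothetical toy data: true labels and model outputs between 0 and 1.
labels = [1, 0, 1, 1]
outputs = [0.6, 0.3, 0.8, 0.4]


def accuracy(labels, outputs):
    """Fraction of examples whose thresholded output matches the label."""
    correct = sum(1 for y, p in zip(labels, outputs) if (p >= 0.5) == (y == 1))
    return correct / len(labels)


def mean_absolute_error(labels, outputs):
    """Average distance between each output and its label."""
    return sum(abs(y - p) for y, p in zip(labels, outputs)) / len(labels)


# Nudge the first output slightly toward its correct label.
nudged = [0.61] + outputs[1:]

# Accuracy is identical before and after the nudge; the loss is slightly lower.
print(accuracy(labels, outputs), accuracy(labels, nudged))
print(mean_absolute_error(labels, outputs), mean_absolute_error(labels, nudged))
```

Because accuracy only changes when an output crosses the 0.5 threshold, it gives no signal about which direction a small change should go.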

A loss function is any function used to evaluate how well our algorithm models our data. The higher the loss, the worse our model is performing. We then try to minimize that function in order to ‘learn’ how to solve the task at hand. In supervised learning, most loss functions compare the predicted output with the label. That is, most loss functions measure how far off our output was from the actual answer.

For example, if you are trying to classify whether or not a picture has a dog in it (0 not a dog, 1 dog), your algorithm might output 0.6. After rounding, you see that you predicted this was a dog. However, during training we are trying to get better and better predictions, e.g. if something is a dog, then we want the algorithm to output as close to 1 as possible. A loss function that might make sense is the absolute value of the difference between the label and the output – i.e. the loss is equal to exactly how far off our prediction is. In the dog example, this loss would be 0.4. You then modify your model based on the size of your loss – if you have a high loss, the model will change more than when you have a low loss. As a side note, it’s important to choose a good loss function – if you penalize the wrong things, then your model might not learn at all or, worse, learn the wrong thing.
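The dog example can be written out in a few lines of Python. This is only a sketch of the absolute difference loss described above; the function and variable names are mine.

```python
def absolute_error(label, output):
    """Loss = how far the output is from the label."""
    return abs(label - output)


label = 1      # the picture really is a dog
output = 0.6   # what the model said

predicted_class = round(output)        # 1, i.e. "dog"
loss = absolute_error(label, output)   # about 0.4

# As the output gets closer to the label, the loss shrinks toward zero.
for better_output in (0.6, 0.8, 0.95):
    print(better_output, absolute_error(label, better_output))
```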

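In general, loss functions have two properties: they are continuous and differentiable. This basically just means that the function you use can’t jump, it is defined at every point, and it has no sharp turns or vertical tangents. In practice a few isolated sharp points are tolerated: the absolute value loss from the dog example has a sharp turn where the prediction exactly equals the label, but it is differentiable everywhere else.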

A useful property of many loss functions is that they are symmetric; that is, loss(actual_output, predicted_output) = loss(predicted_output, actual_output). Most of the time this is nice to have, since it makes sense that the loss for predicting one when the answer is actually zero and the loss for predicting zero when the answer is actually one should be the same.

Overall, loss functions are just functions we use to measure our performance and optimize our machine learning models.

What is a confusion matrix? How does it work? Why do we care?

2020-06-26T00:00:00-04:00

A confusion matrix is a nice way of visualizing the performance of your models. In my video last week on accuracy, precision and recall, I made a mistake while drawing my confusion matrix. Thanks to Reddit users u/MlecznyHotS, u/Alouis07, u/dafeviizohyaeraaqua, and u/wasperen for pointing it out. Let’s fix my mistake.

Last week, when drawing the confusion matrix, I mixed up the locations of where true positives, true negatives, false positives, and false negatives go on the confusion matrix. In reality, a confusion matrix looks like this – but how do you read it? There are four boxes, and each one holds the number of examples where your predicted value matches the actual value, or not, depending on the square. The top row is predicting positive and the bottom row is predicting negative. Similarly, the first column is what is actually positive, and the second column is what is actually negative. So to understand what the top right box means: it’s in the predicted positive row and the actually negative column, so it’s the number of examples where your model guessed positive when in reality it was negative – the number of false positives. Quickly going through the rest of the boxes: the top left and bottom right are simple. Top left is where you predicted positive and it was positive, i.e., the true positive square, and bottom right is where you predicted negative and it was negative, so that’s the true negative square. Finally, the bottom left is where you predicted negative but it was actually positive, so that’s the false negative square.
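If it helps to see the layout as code, here is a minimal Python sketch that counts the four boxes using the same arrangement: rows are predicted, columns are actual. The function name and the toy lists are made up for illustration.

```python
def confusion_matrix(actual, predicted):
    """Return [[TP, FP], [FN, TN]]: rows are predicted, columns are actual."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1      # predicted positive, actually positive
        elif p and not a:
            fp += 1      # predicted positive, actually negative
        elif not p and a:
            fn += 1      # predicted negative, actually positive
        else:
            tn += 1      # predicted negative, actually negative
    return [[tp, fp], [fn, tn]]


# Hypothetical labels: 1 is positive, 0 is negative.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0]

matrix = confusion_matrix(actual, predicted)
print(matrix[0])  # predicted positive row: [TP, FP]
print(matrix[1])  # predicted negative row: [FN, TN]
```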

Confusion matrices can be confusing to read sometimes for two reasons. One, a lot of tables will just label the axes as actual and predicted instead of labeling the values – while identical in function, I find that the former takes up less space but the latter is way easier to read. The second reason is that people will sometimes flip the axes – the x axis will become predicted instead of actual and the y axis will become actual instead of predicted. Even on Wikipedia (article linked in the description), all of the examples have rows represent different actual classes, but the last non-example confusion matrix, that is, the table that defines very clearly what is a true positive, what is a false positive, et cetera, uses rows to represent different predicted classes instead! So you just have to be careful and check the axes when you are reading someone’s results.

Why do we care about confusion matrices? Why do we care about the individual values of true positives, false positives, false negatives, and true negatives? Because different problems care more about certain values than others. Let’s say that you are responsible for developing a model that does drug testing. After initial development, you see you are 99% accurate! Yay!

	Actual Positive	Actual Negative
Predicted Positive	100	400
Predicted Negative	100	59400

However, if you look at the confusion matrix, you see that in reality, you have 4 times as many false positives as true positives. That means that if every person who takes your test and gets a positive result gets thrown in jail for drug use, 4 out of every 5 people in jail, or 80% of the people who test positive, are actually innocent. So you have to carefully consider your evaluation metrics and always double check the confusion matrix for weird or interesting results that your model outputs.
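The arithmetic behind those two numbers is short enough to check directly. The sketch below uses the four counts from the table above; the variable names are mine.

```python
# Counts from the drug testing confusion matrix.
tp, fp = 100, 400      # predicted positive row
fn, tn = 100, 59400    # predicted negative row

total = tp + fp + fn + tn
accuracy = (tp + tn) / total

# Of everyone who tested positive, what share was actually negative?
innocent_share = fp / (tp + fp)

print(f"accuracy: {accuracy:.1%}")
print(f"positive results that are actually negative: {innocent_share:.0%}")
```

The accuracy looks great because the 59400 true negatives dominate the total, which is exactly why the single accuracy number hides the problem.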

Intuition: What is Accuracy, Precision, and Recall in machine learning, and how do they work?

2020-06-18T00:00:00-04:00

You have a model, and now you want to judge how well it performs. How do you measure model effectiveness?

There are several metrics you could use to judge how good a classification model is, the most common of which are accuracy, precision, and recall. Accuracy measures how much of the data you labeled correctly. That is, accuracy is the ratio of the number labeled correctly to the total number. If you are trying to classify just one thing (e.g. hot dog or not), accuracy can be written as (number of true positives + number of true negatives) / (number of true positives + number of true negatives + number of false positives + number of false negatives). True positives are examples with a positive label that you labeled as positive, e.g. you labeled a hot dog as a hot dog. Similarly, true negatives are examples with a negative label that you labeled as negative, e.g. you labeled a cat as not a hot dog. On the other hand, false positives are examples that were negative that you labeled positive, e.g. you labeled a cat as a hot dog (how could you!?) and similarly false negatives are examples that were positive that you labeled negative, e.g. you labeled a hot dog as not a hot dog. The true or false in true positive, false negative, etc., indicates whether you labeled it correctly or not, and the positive or negative is what you labeled it as. So accuracy is just the number of things you correctly labeled as positive and negative divided by the total number of things you labeled. In the case where you are trying to classify a lot of things instead of just one, the overall accuracy is just the number of things you correctly labeled, summed over all the categories, divided by the total number of things you labeled.

Precision is a measure that tells you how often something you label as positive is actually positive. More formally, using the notation from earlier, precision is the number of true positives / (the number of true positives + the number of false positives). On the other hand, recall is the measure that tells you the percentage of positives you label correctly. That is, recall is the number of true positives / (the number of true positives + the number of false negatives). The difference between precision and recall is kind of subtle, so let me reiterate: precision is the number of positive examples you labeled correctly over the total number of times you labeled something positive, whereas recall is the number of positive examples you labeled correctly over the total number of things that were actually positive. You can think of precision as the proportion of times that, when you predict it’s positive, it actually turns out to be positive. Recall, on the other hand, can be thought of as accuracy over just the positives – it’s the number of times you correctly labeled something positive over the number of times it was actually positive.
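Here are the three formulas as a small Python sketch. The hot dog counts are invented just to have numbers to plug in, and the functions assume the denominators are not zero.

```python
def accuracy(tp, tn, fp, fn):
    """Correctly labeled examples over all examples."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Of everything labeled positive, how much was actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything actually positive, how much was labeled positive."""
    return tp / (tp + fn)


# Hypothetical hot dog classifier results.
tp, tn, fp, fn = 8, 80, 2, 10

print("accuracy:", accuracy(tp, tn, fp, fn))
print("precision:", precision(tp, fp))
print("recall:", recall(tp, fn))
```

With these made-up counts, precision is high because most things labeled hot dog really are hot dogs, while recall is lower because more than half of the actual hot dogs were missed.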

In the multi-label case, precision and recall are usually applied on a per-category basis. That is, if you are trying to guess whether a picture has a cat or dog or other animals, you would get precision and recall for your cats and dogs separately. Then it’s just the binary case again – if you want the precision for cats, you take the number of times you guessed correctly that it was a cat / the total number of times that you guessed anything was a cat. Similarly, if you want to get recall for cats, you take the number of times you guessed correctly it was a cat over the total number of times it was actually a cat.
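A rough Python sketch of the per-category version, assuming for simplicity that each picture gets exactly one guess. The function name and the toy animal labels are made up for illustration.

```python
def per_category_precision_recall(actual, predicted, category):
    """Treat one category as 'positive' and everything else as 'negative'."""
    correct = sum(
        1 for a, p in zip(actual, predicted) if a == category and p == category
    )
    guessed = sum(1 for p in predicted if p == category)   # times we said "cat"
    truly = sum(1 for a in actual if a == category)        # times it was a cat
    precision = correct / guessed if guessed else 0.0
    recall = correct / truly if truly else 0.0
    return precision, recall


actual = ["cat", "cat", "dog", "bird", "cat", "dog"]
predicted = ["cat", "dog", "dog", "cat", "cat", "dog"]

for category in ("cat", "dog"):
    print(category, per_category_precision_recall(actual, predicted, category))
```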

Intuition: How does the Heaviside Activation Function work?

2020-06-11T00:00:00-04:00

Early on in the development of neural networks, most activation functions were created to represent the action potential firing in a neuron, because after all, neural networks were originally inspired by how the brain works.

The easiest way to represent the action potential is by having a function that is either active or not, that is, zero if the neuron isn’t active and one if it is active. That is called the Heaviside step function. The problem we have here is that gradient based methods can’t use it to learn because it’s not differentiable at 0 and the slope is 0 at all other values. We can try to fix this by modifying it so that instead of having flat lines, the lines have a small slope, like this. However, it turns out that because this has the same slope the whole way through except for right here at the jump at x=0, this is functionally equivalent to using a linear activation function – that is, if you use this activation function for your whole neural network the output will be a linear combination of the inputs.

This is less than ideal, as not all functions are linear, so we want a function that looks kind of like the Heaviside step function, is nonlinear, and is differentiable at all points. AKA, we want something that looks like this. There are a few different functions that look like this and have these properties, but the most commonly used one would be the sigmoid function.
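For reference, here are the two functions as a short Python sketch. The value of the step function exactly at 0 is a convention I am assuming here; the helper names are mine.

```python
import math


def heaviside(x):
    """Step function: 0 when inactive, 1 when active (1 at x == 0 by assumption)."""
    return 1.0 if x >= 0 else 0.0


def sigmoid(x):
    """Smooth, nonlinear curve that rises from near 0 to near 1."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_slope(x):
    """Derivative of the sigmoid; it is defined and nonzero at every point."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-4.0, -1.0, 0.5, 4.0):
    print(x, heaviside(x), round(sigmoid(x), 3), round(sigmoid_slope(x), 3))
```

The step function gives a gradient based method nothing to work with, while the sigmoid has a usable slope everywhere, even though the slope gets small far from zero.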

Terms: Nonlinearity

2020-06-04T00:00:00-04:00

When learning about neural networks for the first time, you might hear the term nonlinearity around the time you learn about activation functions. Basically, nonlinearity just means not linear.

While nonlinear does mean not linear, there are a couple of small catches that aren’t obvious right away. For example, if the output function of a network could be described by the function sqrt(2)*x^2 + pi^3*sin(x), you might think that the function is doing a nonlinear transformation. However, if the inputs into the network are x^2 and sin(x), then you know that the network did a linear transformation. It’s a linear transformation since the output can be written as a linear combination of the inputs – that is, if we say y = x^2 and z = sin(x), we can immediately see that sqrt(2)*y + pi^3*z is a linear function of y and z. Note that we don’t care that the constants are sqrt(2) and pi^3; those are just constants that don’t depend on the input – we can rewrite this function as f(y, z) = a*y + b*z and see more clearly that this is a linear function. You might now be confused – so if the output can be nonlinear but it isn’t a nonlinear transformation, then what is a nonlinear transformation?

A nonlinear transformation is anything that can’t be written as a sum of the inputs, each multiplied by some constant. That is, f(x1, x2, x3, …) is linear only if it can be written as f(x1, x2, x3, …) = a1*x1 + a2*x2 + a3*x3 + … + c for some constants a1, a2, a3, … and c.

Going back to the earlier example, that function is only a linear transformation if the inputs are x^2 and sin(x). If the input is x by itself, x^2 by itself, or sin(x) by itself, then it’s a nonlinear transformation. For instance, when the only input is x^2, we know it’s not a linear transformation because there is no way to multiply x^2 by a constant or add a constant to it to get sin(x), or vice versa.
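One quick way to poke at this numerically: a linear function with no constant term has to satisfy f(a + b) = f(a) + f(b). The Python sketch below checks that for the example, once in terms of the inputs y and z and once in terms of x alone. The names f and g and the test points are my own choices, and passing the check at a few points is a sanity check, not a proof.

```python
import math

A = math.sqrt(2)
B = math.pi ** 3


def f(y, z):
    """The output written in terms of the inputs y = x^2 and z = sin(x)."""
    return A * y + B * z


def g(x):
    """The same output written in terms of the single input x."""
    return A * x ** 2 + B * math.sin(x)


# In terms of (y, z): adding the inputs adds the outputs.
linear_in_yz = math.isclose(f(1 + 3, 2 + 4), f(1, 2) + f(3, 4))

# In terms of x alone: g(1 + 2) is not g(1) + g(2).
linear_in_x = math.isclose(g(1 + 2), g(1) + g(2))

print(linear_in_yz, linear_in_x)
```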

Terms: Activation functions

2020-06-03T00:00:00-04:00

The original idea behind the activation function is to only propagate signals that are important and ignore signals that aren’t – similar to how neurons in our brain propagate signals. This is why originally, most activation functions looked like this (e.g. a sigmoid curve) – where they are close to zero in the beginning and then all of a sudden, when they hit some threshold, they jump up close to one.

An activation function is the function that a neuron applies to the weighted sum of its inputs. The neuron basically creates a more general feature from the inputs. With enough of these neurons, we can construct enough features, then make more general features out of the previous layer’s neurons, and so on until we have solved the problem. In image classification, this would be like having the first layer make lines from the pixels, the second layer combine the lines into different shapes, the third combine the shapes into more complicated shapes, and so on until we have built up internal representations for cat, mountain, tree, etc.

In the intro I said that most activation functions looked like this in the beginning. In reality, activation functions usually only need two properties: they are differentiable and nonlinear. This means that they can have all sorts of shapes! We want the function to be differentiable so that we can use backpropagation or another gradient based method of learning, but I will talk more about that in a future video.

We don’t usually want to use a linear activation function because with a linear activation function, regardless of how many layers or neurons you use, the final function will always be a linear combination of the inputs. Remember, we assume that there is some actual function that describes how to solve our problem. If we only use linear combinations, we are limiting what our function can be, whereas if we use a nonlinear activation function then we can guarantee that it is possible to approximate our actual function – assuming that we use enough layers and nodes.
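Here is a tiny Python sketch of why stacking linear layers doesn’t buy you anything. It uses one input and one neuron per layer to keep the algebra visible, and it includes a constant offset in each layer, which is my own addition; the weights are arbitrary.

```python
import math


def linear_layer(weight, offset):
    """Return a one-input 'layer' with a linear activation: x -> weight * x + offset."""
    def layer(x):
        return weight * x + offset
    return layer


layer1 = linear_layer(2.0, 1.0)
layer2 = linear_layer(-3.0, 0.5)


def stacked(x):
    """Two linear layers applied one after the other."""
    return layer2(layer1(x))


# -3 * (2x + 1) + 0.5 simplifies to -6x - 2.5, which is a single linear layer.
collapsed = linear_layer(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

same = all(math.isclose(stacked(x), collapsed(x)) for x in (-2.0, 0.0, 1.5, 10.0))
print(same)
```

The same simplification works with any number of inputs, neurons, and layers, which is why a network with only linear activations can never represent more than one linear layer could.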

Neural Network Caveats (Intuition: Artificial Neural Networks Follow-up)

2020-06-02T00:00:00-04:00

A worry that you might have is that our initial function is unable to represent the actual function. For example, if we create an artificial neural network and we try to model the sine function, how can we be sure that the network will learn a close approximation of the function and not just create a linear regression? Well, it’s been proven that as long as you have enough nodes and enough layers, and the function you are using at each of your nodes is nonlinear, a neural network can approximate any continuous function as closely as you want.

However, this doesn’t guarantee that the network is modeling the actual function. It guarantees that it is possible to model that function, but in reality, your neural network is modeling the function over your data set. For instance, if you were trying to model a modified sine function with a collection of data points between [0, 2*pi], it’s entirely possible that the function the network actually finds looks like this, where before this range it always predicts zero and after this range it always predicts one, even though we know the sine function actually looks roughly like this.

This is why it’s better to have as much data as possible: the more data, the more likely the output function is modeling the actual function you are looking for instead of some random function describing your data set.

What is the difference between a Deep Neural Network and an Artificial Neural Network?

2020-06-01T00:00:00-04:00

Technically, an artificial neural network (ANN) that has a lot of layers is a Deep Neural Network (DNN). In practice though, a deep neural network is just a normal neural network where the layers of the network are abstracted out, or a network that uses functions not typically found in an artificial neural network. For example, you could draw an artificial neural network like this. Then, you could abstract this as an input, 3 fully connected layers, and an output. You could then reasonably say that this is a deep neural network. For most real-life applications, you would want more layers or at least more nodes per layer than what I’ve drawn here, but that’s the idea. Additionally, a fully connected layer is just a normal layer of an artificial neural network, where the layer is made up of neurons, each of which takes all the previous layer’s neurons as input.

A deep neural network doesn’t have to be an artificial neural network though. In a deep neural network, you can use whatever formulas and techniques you want. For instance, a deep neural network could have three fully connected blocks attached to the input, take the hyperbolic tangent of the output of the first block and the sigmoid of the output of the second, sum those two results together, and then multiply that sum by the output of the third block. In a deep neural network, you can use whatever techniques or mathematical formulas you want to shape each layer or section of your network, but in artificial neural networks you are limited to just using fully connected layers.
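The three-block example can be sketched in plain Python like this. The function names, the weights, and the choice to leave out any constant offsets are all assumptions made to keep the sketch small.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fully_connected(inputs, weight_rows):
    """Each output is a weighted sum of all the inputs (one row of weights per output)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_rows]


def three_block_section(inputs, w1, w2, w3):
    """(tanh(block 1) + sigmoid(block 2)) * block 3, element by element."""
    first = [math.tanh(v) for v in fully_connected(inputs, w1)]
    second = [sigmoid(v) for v in fully_connected(inputs, w2)]
    third = fully_connected(inputs, w3)
    return [(a + b) * c for a, b, c in zip(first, second, third)]


# Arbitrary weights: each block has two inputs and one output.
inputs = [1.0, 0.0]
result = three_block_section(
    inputs,
    w1=[[0.0, 1.0]],   # block 1 outputs 0, so tanh gives 0
    w2=[[0.0, 2.0]],   # block 2 outputs 0, so sigmoid gives 0.5
    w3=[[4.0, 0.0]],   # block 3 outputs 4
)
print(result)
```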

Intuition: Artificial Neural Networks

2020-05-30T00:00:00-04:00

Artificial Neural Networks consist of layers of nodes where every node represents a function on its input. Similarly, every connection in this network represents some coefficient of our function. When dealing with neural networks, we call these coefficients weights. Additionally, there is an activation function that takes in the sum of these weighted inputs and modifies it in some way. All together, the output for a single node is just the activation function of the sum of the weighted inputs. Every layer in our network just adds another layer of nesting to the function we are building. So when we create an artificial neural network (ANN), all we are doing is creating a complicated function on the inputs. So how does creating this function help solve our problem?
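As a rough Python sketch of that nesting, here is one node, one layer, and a tiny two-layer network. The sigmoid activation and all of the weights are arbitrary choices for illustration, and there are no constant offsets because the description above doesn’t mention any.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs, weights, activation):
    """Activation function applied to the sum of the weighted inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


def layer_output(inputs, weight_rows, activation):
    """One node per row of weights; every node sees all of the inputs."""
    return [node_output(inputs, weights, activation) for weights in weight_rows]


inputs = [1.0, 2.0]

# Hidden layer with two nodes, then a single output node on top of it.
hidden = layer_output(inputs, [[0.5, -0.25], [1.0, 1.0]], sigmoid)
output = node_output(hidden, [1.0, -1.0], sigmoid)

print(hidden, output)
```

Written out, the output is sigmoid(1.0 * sigmoid(...) - 1.0 * sigmoid(...)), which is the nested function the paragraph above describes.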

First off, we assume there is some underlying function that maps the data we have to the solution. This is pretty reasonable. For example, given any picture, there should be some function that says whether or not there is a face in that picture.

So given our neural network, how do we modify it so that it will learn a function to map the input to the associated output? How do we train the network? To train our neural network, the neural network looks at every example in the data set, and if it produced the incorrect answer, we slightly adjust the weights of the network so that the output of the function is slightly closer to the desired output. Then, by repeating this process over and over for lots of examples, the network will slowly learn a function that produces better and better results over the data set.
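Here is about the smallest training loop I can write in Python to show the idea. It is a sketch with a single weight and no activation function, and the update rule, learning rate, and data (which follow y = 2x) are my own choices rather than the only way to do it.

```python
# Hypothetical data set: (input, desired output) pairs that follow y = 2x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

weight = 0.0          # starting weight
learning_rate = 0.05  # how big each adjustment is

for epoch in range(200):
    for x, target in examples:
        output = weight * x
        error = target - output
        # Nudge the weight so the output moves slightly toward the target.
        weight += learning_rate * error * x

print(weight)
```

Each pass only moves the weight a little, but after enough passes over the examples it settles very close to 2, the value that maps every input to its desired output.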

One question that remains is: given our initial function, how do we choose the coefficients? In other words, when we create neural networks, how do we choose the starting weights? Well, since we already know that we will change the weights to better represent the data, it doesn’t matter much what they are initially, so most of the time we just make them random.

Intuition: Perceptrons and Artificial Neural Networks

2020-05-29T00:00:00-04:00

Perceptrons are just neural networks with a single output node, so a perceptron and a neuron in a neural network work in exactly the same way. Any specific neuron in a neural network takes in the weighted sum of all of its inputs and then applies a function to that sum. The function applied to a neuron’s inputs is called an activation function. In the classical formulation of a perceptron, this function is the Heaviside function, but for a neural network, this function can be whatever you want.
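A classical perceptron fits in a few lines of Python. In this sketch the weights are picked by hand so that it computes a logical AND, and I have added a constant offset (a bias) to set the threshold; the bias, the weights, and the value of the step function at exactly 0 are all my assumptions.

```python
def heaviside(x):
    """Step function: 1 if x is at or above 0, otherwise 0."""
    return 1 if x >= 0 else 0


def perceptron(inputs, weights, bias):
    """Heaviside function applied to the weighted sum of the inputs plus a bias."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return heaviside(weighted_sum)


# Hand-picked values: the sum only reaches 0 when both inputs are 1.
weights, bias = [1.0, 1.0], -1.5

outputs = [perceptron([a, b], weights, bias) for a in (0, 1) for b in (0, 1)]
print(outputs)
```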

We can expand upon the perceptron by adding additional layers to it. This is called a multilayer perceptron or an artificial neural network. For every layer in the network, we connect each neuron to every neuron in the layer before it. So for this example, the output node connects to every node in the hidden layer, and every node in the hidden layer connects to every input.

Layers like these, where each neuron connects to every neuron in the layer before it, are called fully connected (FC) layers, and if you combine all of these layers together you get an artificial neural network.