Machine and human error

While studying machine learning, I've been struck by the many philosophical parallels to human learning and cognition—particularly in the ways that both humans and machines make errors. We're all familiar with the fact that our human mental models of the world, shaped by our experiences, can get things wrong. We're also familiar with some of the causes of these failures, which can give us insight into similar failures of machines.

Consider a hypothetical middle school algebra student; call him Johnny. For concreteness, let's imagine it's the late '90s and Johnny isn't very interested in learning algebra—he's much more interested in learning how to hack into his friends' computers through AOL Instant Messenger. Anyway, he's seeing lots of stuff about lines like \(y=mx+b\), which he has a pretty good handle on. And he's also hearing stuff about quadratics like \(y=ax^2+bx+c\), which he doesn't have a good handle on yet. When faced with an exam question on quadratics that he's unsure about, he falls back on his knowledge of lines to try to answer, thinking that the two are similar. But this strategy doesn't work out well. His mental model of a quadratic is too simple—it doesn't account for the fact that a quadratic is intrinsically more complicated than a line.

Tired of hearing crap from his teacher and parents about the importance of his grades and his future, he devises a new strategy for the next exam. This time he memorizes various facts about quadratics, as well as solutions to sample problems in the book. He still doesn't really understand the underlying concepts, but he hopes to be able to regurgitate enough to do well on the exam. Unfortunately, this strategy doesn't work out well either. When faced with new, previously unseen problems on the exam, he's not able to solve them. Even some minor variations in notation throw him off. His mental model is too tightly fit to the material that he memorized, and he's not able to apply it successfully in a novel situation.

Both of these types of human learning failures occur with machine learning as well. In order to understand this, we need to understand what it means for a machine to learn something—anything—at all.

Inside a machine

What is machine learning? Broadly speaking, it's when a machine—a computer—obtains knowledge of something from data that it can apply in new situations without being given new instructions. Machine learning systems stand in contrast to more traditional rule-based systems in which predefined, static rules exist to process every input and produce every output.

Consider a hypothetical system for the admissions department at a large university which predicts an applicant's future GPA based on their submitted application materials. One way to design such a system would be to concoct a formula by hand for computing a predicted future GPA from various features available in an application. For example, we might think that an applicant's high school GPA, together with their test scores like SAT or ACT, could be combined in a weighted average to compute an accurate prediction. Maybe that's too simplistic, and we'd also need to account for other factors—like the difficulty of their high school classes (honors, AP), their other academic achievements, and so on. Maybe we'd also need special cases for applicants with non-traditional backgrounds. In any case, conceptually we'd be building a formula of the form \[y=f(x_1,x_2,\ldots,x_n)\] where \(x_1,\ldots,x_n\) represent the various features in the application (high school GPA, test scores, and so on), \(y\) represents the predicted future GPA, and \(f\) is the formula or function defining some relationship between these. This is the type of thing we could plug into a spreadsheet, provided we put all of the application features into the spreadsheet.
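
To make this concrete, here's what such a hand-built formula might look like as a small Python function. The feature names, scalings, and weights below are pure gut instinct on my part, made up for illustration rather than borrowed from any real admissions office:

    def predict_gpa(hs_gpa, sat, num_ap_classes):
        """A hand-crafted, rule-based prediction: a weighted average of a few
        hypothetical application features, with weights chosen by gut feel."""
        # Rescale each feature to a rough 0-4 range so the weighted average
        # behaves like a GPA.
        sat_scaled = 4.0 * sat / 1600.0
        ap_scaled = min(num_ap_classes, 8) / 2.0
        return 0.6 * hs_gpa + 0.3 * sat_scaled + 0.1 * ap_scaled

    print(predict_gpa(hs_gpa=3.7, sat=1350, num_ap_classes=4))  # about 3.43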

But this approach faces several challenges. First, how would we actually select features and weights for our formula? And how would we know that these choices are any good? If we believed in God, we might think that he has the "true" formula for future GPA tucked away in some spreadsheet of his own, which might use infinitely many other features in addition to those we could choose: \[y=F(x_1,x_2,\ldots,x_n,x_{n+1},\ldots)\] The accuracy of our predictions would depend on the extent to which our formula \(f\) resembled God's formula \(F\). But sadly, he hasn't shared his spreadsheet with us. What can we do?

Let's turn things around for a moment and address the second question first. If we managed to cobble together a formula somehow—maybe just using our gut instincts—how would we know if it's any good? Put differently, how would we know if the formula really "knows" about a relationship between application features and future GPAs? Our hypothetical middle school teacher from earlier has the answer: we could test it. One way to do this would be to use the formula to predict GPAs for admitted applicants, and then check on their actual GPAs a few years later. This takes time. Another way is to make use of data we already have.

We could gather application data for a sample of students who were admitted in the past, as well as the actual GPAs they achieved. If we represent the application features \(x_1,\ldots, x_n\) of a student in a single feature "vector" \(\vec{x}=(x_1,\ldots,x_n)\) and represent their actual GPA as \(y\), then we'd be building a data set \[(\vec{x}_1, y_1), (\vec{x}_2, y_2),\ldots,(\vec{x}_m, y_m)\] of pairs of applications and corresponding actual GPAs. Using our formula \(f\), we could compute the predicted GPAs \(f(\vec{x}_1),\ldots,f(\vec{x}_m)\) and compare them to the actual GPAs \(y_1,\ldots,y_m\) to see how far off they are. This would give us a measure of the error of our formula on the data set, and some sense of how good our formula might be.
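
In Python, that measurement might look something like the sketch below, with a made-up handful of past students and mean squared error as the measure of how far off the predictions are (mean absolute error would serve just as well):

    # Hypothetical historical data: (feature vector, actual GPA) pairs, where each
    # feature vector is (high school GPA, SAT scaled to 0-4, AP classes scaled to 0-4).
    past_students = [
        ((3.9, 3.6, 3.0), 3.8),
        ((3.2, 2.8, 0.5), 2.9),
        ((3.6, 3.3, 1.5), 3.5),
    ]

    def mean_squared_error(formula, data):
        """Average squared gap between the formula's predictions and actual GPAs."""
        errors = [(formula(x) - y) ** 2 for x, y in data]
        return sum(errors) / len(errors)

    def gut_formula(x):
        # A gut-instinct weighted average of the three features.
        return 0.6 * x[0] + 0.3 * x[1] + 0.1 * x[2]

    print(mean_squared_error(gut_formula, past_students))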

A machine learning approach to the problem takes this as its starting point. The basic idea is to begin with a data set like the one above, and some way to measure error in prediction on that data set, then search for a function which minimizes the error. The search is systematic and carried out by machine: an algorithm is used to locate such a function within a space of possible functions. Once a function is found, it can be applied to new, previously unseen data to make predictions.

In the machine learning lingo, such data is called training data. The space of possible functions is determined by the model chosen for the problem. For example if we still thought the relationship between application features and future GPAs was likely to be expressible by a weighted average, we could choose a linear model of the form \[f(x_1,\ldots,x_n)=w_1x_1+\cdots+w_nx_n\] where the \(w\)'s are unknown feature weights. In this case the search algorithm, called a training algorithm, would search for the weights which minimize the prediction error of \(f\) on the training data. If we thought the relationship was likely to be more complicated than a linear one, we could choose a more complex type of model. In general, a training algorithm allows a model to "learn" about a relationship between input features and output values in training data. In doing so, it effectively derives rules which can be used to predict output values for new inputs. But unlike the rules in a traditional system, these rules are dynamically derived from the training data by the machine.
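
Here's a minimal sketch of that training step, assuming NumPy is available. The data is made up, and ordinary least squares, via numpy.linalg.lstsq, stands in for the training algorithm that searches for the error-minimizing weights:

    import numpy as np

    # Hypothetical training data: rows are past applicants, columns are features
    # (high school GPA, SAT scaled to a 0-4 range, AP classes taken).
    X = np.array([
        [3.9, 3.6, 6],
        [3.2, 2.8, 1],
        [3.6, 3.3, 3],
        [2.9, 2.5, 0],
        [3.8, 3.9, 5],
    ])
    y = np.array([3.8, 2.9, 3.5, 2.6, 3.9])  # the GPAs they actually earned

    # "Training" the linear model: find the weights minimizing squared error.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("learned weights:", w)

    # The trained model can now predict a GPA for a new, unseen applicant.
    new_applicant = np.array([3.5, 3.1, 2])
    print("predicted GPA:", new_applicant @ w)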

Making mistakes

Much can go wrong in the basic machine learning plot outlined above. In general if the training data is garbage, then the resulting model predictions will be garbage too—garbage in, garbage out! And there are many reasons why training data might be garbage. In the case of the university, maybe the data available is incomplete and doesn't have the information we want. Maybe when the registrar's office exports the data for us, they accidentally filter a bunch of records out, or corrupt a bunch of records, and this isn't immediately obvious. Any of these things would jeopardize our ability to successfully use the data. But there are also other, deeper issues to worry about.

First, even if we manage to get a pristine export which has all past application data and GPAs correctly recorded, it may not include all of the features relevant to making good future GPA predictions. Remember God's "true" function for predicting future GPAs? We saw that it may use infinitely many other features, beyond those available to us in our data. By definition it's impossible for us to make predictions using features we don't have access to. And we don't know a priori how predictive those features are, compared to the features we do have access to. The best thing we can do with the data we have is to look at it. There are mathematical techniques that can be used to analyze the data and help us determine whether the features in it are likely to be—to some degree—predictive. And if they are, then we can train a model with them. But the effects of other features will still be felt in the real world, and reflected in God's spreadsheet, even though they aren't captured in our model.

Assuming we find a set of somewhat predictive features for our training data, our model must be capable of learning from them. Earlier we considered choosing a linear model. But what if the true relationship between application features and future GPA isn't linear? For example, what if there are significant interaction effects between multiple application features which a linear model isn't able to describe? Then a linear model will be too simple. It will perform like the middle school student from earlier trying to answer questions about quadratics using facts about lines. In general our model needs to be complex enough that it can, in principle, learn about the relationships present in the training data.
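
Here's a toy illustration of that failure mode, assuming NumPy is available. The synthetic "true" relationship below is a pure interaction between two features, and no choice of weights lets a plain linear model capture it, while a model that also includes the product term fits it easily:

    import numpy as np

    rng = np.random.default_rng(0)
    x1, x2 = rng.uniform(0, 4, 200), rng.uniform(0, 4, 200)
    y = x1 * x2  # the true relationship is an interaction, not a weighted sum

    # The best possible linear fit on (x1, x2) still misses badly...
    X_linear = np.column_stack([x1, x2, np.ones_like(x1)])
    w, *_ = np.linalg.lstsq(X_linear, y, rcond=None)
    print("linear model error:", np.mean((X_linear @ w - y) ** 2))

    # ...while adding the interaction feature lets the same machinery nail it.
    X_interact = np.column_stack([x1, x2, x1 * x2, np.ones_like(x1)])
    w, *_ = np.linalg.lstsq(X_interact, y, rcond=None)
    print("with interaction term:", np.mean((X_interact @ w - y) ** 2))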

Why not just use the most complex model we can get our hands on, then? This question forces us to grapple with another fundamental issue: so far we've only measured error in prediction on the training data. But our end goal isn't to use our model on the training data—it's to use our model on new, previously unseen data. The success of our entire enterprise hinges on our model's ability to learn something which generalizes beyond the training data, and that's far from guaranteed. To illustrate, let's take an extreme example. Suppose we have a relatively small number of training samples, and we choose a model so complex that it's effectively able to "memorize" the GPA in each sample. This model has zero error on the training data, which sounds like a good thing. But the "rules" that it derives conceptually look something like

  1. If the applicant is Art Vandelay, then the future GPA is 3.87.
  2. If the applicant is Ferris Bueller, then the future GPA is 3.0.
  3. ...

These rules are totally useless to us, because those students aren't applying to the university again. The model performs like the middle school student after memorizing sample problems from his math book. Because the model hasn't learned any general, underlying relationship between application features and GPAs, it isn't able to tell us anything useful about new applicants. This is an extreme example, but the lesson is real. A more complex model is more likely to overfit to training data, learning about relationships which are incidental to the particular samples in the data and which fail to generalize beyond them. The problem is exacerbated when there are many features in the training data—especially uninformative ones—relative to the number of samples. Therefore training a complex model until it has zero error on training data generally isn't a good strategy.
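
In code, such a memorizing "model" amounts to nothing more than a lookup table, with zero error on the training data and nothing at all to say about anyone it hasn't seen before:

    # A "model" so complex that it simply memorizes the training samples.
    memorized_gpas = {
        "Art Vandelay": 3.87,
        "Ferris Bueller": 3.0,
    }

    def memorizing_model(applicant_name):
        # Perfect on the training data, useless for anyone new.
        return memorized_gpas.get(applicant_name)

    print(memorizing_model("Art Vandelay"))      # 3.87, exactly right in training
    print(memorizing_model("A. New Applicant"))  # None: no prediction at all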

How do we mitigate this risk? A fundamental technique is to do more testing, with more data—or rather, different data. Instead of training the model on all of the samples available to us, we train it on only a large subset. After training, we then test the trained model on the rest. The latter data—unsurprisingly—is called test data, and our model's performance on this data gives us a better indication of what its performance might be in the wild on new, previously unseen data. If our model performs poorly in testing, it might be because the model is too simple. But as we've seen, it might equally be because the model is too complex. By adjusting the model complexity, retraining, and retesting, we can try to find a "sweet spot" where the model is more likely to perform well on new data.
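
In practice the split is only a few lines of code. Here's a sketch using scikit-learn, assuming it's installed, with synthetic data standing in for the university's records:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the historical data: two informative features
    # plus one that carries no signal at all.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 4, size=(200, 3))
    y = 0.5 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0, 0.2, 200)

    # Hold out a quarter of the samples as test data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("training error:", mean_squared_error(y_train, model.predict(X_train)))
    print("test error:    ", mean_squared_error(y_test, model.predict(X_test)))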

Bias and variance

This phenomenon is so important in machine learning that it has a name: the bias-variance tradeoff. Visually, it looks something like this:

When the model is too simple, there's a high amount of error on both training and test data in the form of bias, which is the model's tendency to be systematically wrong. This is reduced as the model is made more complex, and both training and test error decrease. But if the model is made too complex, while the training error continues to decrease, the test error begins to increase again due to variance, which is the model's tendency to learn too much about the particular samples in the training data. Therefore the goal is to find the spot in the middle where the test error is minimized.
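
We can reproduce the shape of that curve numerically. The sketch below (again assuming NumPy) fits polynomial models of increasing degree to noisy samples of a quadratic and prints the training and test error at each degree; the exact numbers depend on the random seed, but typically the training error keeps falling as the degree grows while the test error bottoms out near the true degree and then creeps back up:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-3, 3, 40)
    y = x ** 2 + rng.normal(0, 1, 40)            # a quadratic "truth" plus noise
    x_test = rng.uniform(-3, 3, 200)
    y_test = x_test ** 2 + rng.normal(0, 1, 200)

    for degree in [1, 2, 5, 10, 15]:
        coeffs = np.polyfit(x, y, degree)        # model complexity = polynomial degree
        train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree:2d}: train {train_err:.2f}, test {test_err:.2f}")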

There's an illuminating way to describe this using ideas from statistics. Let's imagine for a moment that we have an infinite amount of data to draw from—the university has been around forever and charges enough tuition to afford an infinite database. We repeatedly select a finite number of records at random to train models of a fixed type. Through this process we produce an infinite sequence of models \(f_1,f_2,\ldots\) of that type, each of which has a certain amount of error. How do these models perform on average? One way to get at this is to look at the mean model \[f=\lim_{n\to\infty}\tfrac{1}{n}(f_1+\cdots+f_n)\] which is the limiting average of the individual models. The bias can be thought of as the error of the mean model—the extent to which the mean model differs from God's model when God's model is restricted to the finitely many input features used by our models. This is really telling us how our chosen type of model performs on average, given its intrinsic level of complexity. On the other hand, the variance can be thought of as the average amount by which any of our individual models \(f_i\) differs from the mean model. It tells us how much our individual models are influenced by the particular samples used to train them, on average. Of course in the real world we don't have an infinite amount of data, and we're not training an infinite number of models. But by viewing our training of a single model on a single selection of training data within the broader context of this hypothetical random experiment, we gain insight into the sources of error in our model's predictions.
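
That hypothetical experiment is easy to approximate in code. The sketch below simplifies things by holding the input values fixed and redrawing only the noisy outcomes, trains many models of each degree, and then examines the predictions at a single test input: the gap between the average prediction and the truth plays the role of bias, and the scatter of individual predictions around that average plays the role of variance. Typically the too-simple model shows large bias with small variance, and the too-complex one shows the reverse:

    import numpy as np

    rng = np.random.default_rng(2)
    x_grid = np.linspace(-3, 3, 20)   # fixed inputs; only the noisy outcomes vary
    x0 = 1.5                          # a single test input to examine

    def true_f(x):
        # Standing in for the "true" relationship on the features we can see.
        return x ** 2

    for degree in [1, 2, 10]:
        preds = []
        for _ in range(1000):         # many hypothetical draws of training data
            y = true_f(x_grid) + rng.normal(0, 1, x_grid.size)
            preds.append(np.polyval(np.polyfit(x_grid, y, degree), x0))
        preds = np.array(preds)
        bias = preds.mean() - true_f(x0)   # how far the average model is off at x0
        variance = preds.var()             # how much individual models scatter at x0
        print(f"degree {degree:2d}: bias {bias:+.2f}, variance {variance:.3f}")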

Are bias and variance the only possible sources of error in our model? No. There are two other important sources. One of them we've already discussed—the possible influence of other features in the real world which are unavailable to us in our data. The other is measurement error. In the real world, data is never pristine, and there are small—and sometimes not so small—random errors in the data that we have access to. By analyzing the data we can try to identify anomalies that might indicate measurement error, but we will never fully eliminate it. Both of these types of error are lumped together under the heading of noise. Because noise is unpredictable on the basis of the features we have access to, it's not something we can control or reduce through model selection and training. In summary, we have the important informal equation \[\text{Error}=\text{Bias}+\text{Variance}+\text{Noise}\] Can something like noise help to explain little Johnny's poor performance on his middle school algebra tests? Maybe. For example, if a post-lunch food coma caused him to have mild brain fog during one of his exams, that certainly could have hurt his score. But it probably wasn't the proximate cause of his issues.
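
For the mathematically inclined, there's a standard way to make this informal equation precise when error is measured by squared differences, and it's worth noting that the bias then enters as a square. Write \(F\) for God's model restricted to the features we can see, \(f\) for the mean model, \(f_i\) for a model trained on one random draw of training data, and \(\sigma^2\) for the variance of the noise. Averaged over the random draws and the noise, the squared error of a prediction at an input \(\vec{x}\) breaks apart as \[\underbrace{\big(F(\vec{x})-f(\vec{x})\big)^2}_{\text{Bias}^2}+\underbrace{\operatorname{avg}_i\big(f_i(\vec{x})-f(\vec{x})\big)^2}_{\text{Variance}}+\underbrace{\sigma^2}_{\text{Noise}}\] where \(\operatorname{avg}_i\) denotes the same kind of limiting average over models that we used to define \(f\).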

All of this only scratches the surface of machine learning. But even just understanding these different sources of error, and their relationship to similar sources of error in human learning and cognition, gives us greater insight into how and why machines can fail.