Leaky faucets and leaky data

As I write this post, my bathtub faucet is dripping ever so slightly in the distance. The plumbing is very old and squeaky, and the valves don't work very well. When you want to change the water temperature, you have to carefully turn both the hot and cold fixtures in tandem to avoid a trip to the burn center—turning just one causes the opposite of the desired effect with 50% probability. When you want to turn the water off, you have to emotionally accept that it will never be fully off, just minimized. If you try to tighten the fixtures too much, you'll find yourself in a paradoxical world—like Alice through the looking glass—with more water, not less.

All of this has me thinking about machine learning again. In a previous post, I described how testing is a fundamental strategy in machine learning to help control the generalization error of a model. Running a trained model against test data gives us some indication of what the model's performance might be in the wild on new, previously unseen data—which is what we're interested in. But testing isn't a foolproof strategy, and there are many reasons why a model's test performance might not reflect its real-world performance. One particularly interesting reason is data leakage. Broadly speaking, data leakage occurs whenever data that either will not or should not be available to a model at the time of prediction inadvertently makes its way into the training data. This can occur in multiple ways.
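Before looking at the ways leakage can occur, here's a minimal sketch of the basic testing strategy itself, using scikit-learn on synthetic data; the dataset and model are placeholders rather than anything from the examples below:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for a real problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Test accuracy estimates performance on new, unseen data -- the very
# estimate that data leakage can quietly inflate.
print("test accuracy:", model.score(X_test, y_test))
```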

Suppose, for example, that it's the beginning of the COVID-19 pandemic and you're trying to train a model to predict COVID infection from an X-ray image of a patient's lungs. The training and test data you have available consist of chest X-rays from pediatric patients without COVID and from older adult patients with COVID. Besides the lungs, these X-ray images show other anatomical features like shoulders and ribs. After training your model, you find that it performs well on the test data, but several months later you discover that it doesn't perform as well in the wild on real patients. What happened?

Upon further investigation, your colleague discovers that the model performs almost equally well on the test data when the lungs are blacked out in the images! How can that be? It turns out that your model mostly learned about other anatomical differences between children and adults which were visible in the training images—the sizes and shapes of shoulder bones, etc.—and it's effectively predicting whether the patient in an image is a child or an adult based on those features. Since that happens to correlate exactly with COVID infection in the training and test data, the model performs well on that data even when it can't see the lungs. But it doesn't perform as well for a more realistic patient population, where adults don't always have COVID and children sometimes do.
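A check like your colleague's might be sketched roughly as follows, assuming a scikit-learn-style classifier trained on flattened images and a hypothetical lung_mask array marking the lung pixels:

```python
import numpy as np

def score_with_lungs_blacked_out(model, images, labels, lung_mask):
    """Re-evaluate a trained classifier after zeroing out the lung region.

    Assumptions for this sketch: `images` has shape (n, height, width),
    `lung_mask` is a boolean (height, width) array that is True over the
    lungs, and `model` was trained on flattened images.
    """
    occluded = images.copy()
    occluded[:, lung_mask] = 0  # black out the lung pixels in every image
    return model.score(occluded.reshape(len(images), -1), labels)

# If accuracy barely drops when the lungs are hidden, the model is probably
# keying on something else in the images -- a red flag for leakage.
```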

There are really two underlying problems in this example. The first is that the sample population in your training and test data—where children never have COVID and adults always do—isn't representative of the real population that you're targeting. The second is that there are features in your training data other than the lungs—the shoulder bones, etc.—which your model is capable of learning about and which, by virtue of the first problem, happen to correlate exactly with COVID infection in the data set but not in the real world. As a result, your model isn't learning any general relationship between features of the lungs and the presence or absence of COVID, as you wanted. Instead, it's learning something incidental to the training data from other features you don't even want it to look at—features which "leaked" in. (By the way, things like this really did occur.)

Leakage can occur in other ways. Suppose you run a website for book lovers and are trying to train a model to predict whether a user will like a given book based on their likes and dislikes of other books. You create training and test data sets from several years' worth of user likes and dislikes—including various features of the users and the books which will be available to your model at prediction time. After training, your model performs well on the test data, and performs well initially on the website. But six months later, the performance of the model starts to degrade. What went wrong?

Through digging, you discover that the model isn't working well for newer users—even those who have a large number of likes and dislikes in their accounts—although it's still working well for older users who were present in the training and test data. You wonder how this could be, since you trained the model on many features of users and books that are available for both new and old users on the website.

It turns out, however, that your model didn't make good use of many of those features during training. Among the features you included for users were things like username, first name, and last name, which allowed the model to discern a unique "fingerprint" of each user in the training data and then to "memorize" which sorts of books each of those users likes—for example, the model might have learned that the user blargoner generally likes math and philosophy from certain authors, that the user texasteen2008 generally likes science fiction and fantasy, and so on. When the model was run on the test data—which included the same users—it simply identified the users by their fingerprints and made reasonably accurate predictions based on what it had memorized about them. It did the same thing for those users on the website. But importantly, the model never learned any general relationship between the types of books a user liked in the past and the types they are likely to like now. This is why it performs poorly for new users, for whom it doesn't have fingerprints and can't just rely on facts that it memorized about them.

In this example, the data has what's called group structure. Each like and dislike of a book belongs to a user, so likes and dislikes can be grouped together by user. The problem is that data from the same groups—for example, blargoner's group—appears in both the training and test data. Since the model is being tested on groups that it already has knowledge about, this provides an overly optimistic estimate of what its performance will be for new, previously unseen groups. The solution is to ensure that no group appears in both the training and test data—in other words, that the training and test data respect the group structure. Additionally, features like username, which only serve to identify a group, should be removed from the data.
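In scikit-learn terms, the fix might look roughly like this; the file and column names are hypothetical, but GroupShuffleSplit is a standard way to keep each group entirely on one side of the split:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical table of ratings: one row per (user, book) like/dislike.
ratings = pd.read_csv("ratings.csv")

# Drop features that only serve to identify the group (the user).
X = ratings.drop(columns=["liked", "user_id", "username", "first_name", "last_name"])
y = ratings["liked"]
groups = ratings["user_id"]

# Split so that no user appears in both the training and test sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# Sanity check: the training and test users should be disjoint.
assert set(groups.iloc[train_idx]).isdisjoint(groups.iloc[test_idx])
```

(GroupKFold works the same way if you want cross-validation rather than a single split.)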

What's very interesting about this example is that group leakage is only a problem because you want to use the model on new, previously unseen groups—that is, new users. If you had instead created a private website for your family members and just needed the model to make accurate predictions for those members, there would be no problem—the model could learn what Aunt Fanny likes and use that to make good predictions for Aunt Fanny. The point is that whether a particular use of data constitutes leakage depends, in part, on how the model will be used.

In both of the above examples, data leakage was only discovered after the models were already deployed in the wild. How might we detect signs of leakage sooner, before a model is deployed? One telltale sign is a model that performs well in testing while apparently learning from a feature that intuitively shouldn't be associated with the target in the real world. This may indicate that the feature in question is only incidentally associated with the target in the training and test data. For example, consider a simple linear model of the form \[f(x_1,\ldots,x_n)=w_1x_1+\cdots+w_nx_n\] where the features \(x_1,\ldots,x_n\) have been "standardized" so that their values fall within the same numeric range. If the trained weight \(w_1\) turns out to be large relative to the other weights \(w_2,\ldots,w_n\), but intuitively we wouldn't expect feature \(x_1\) to be associated with the target values of \(f\) in the world, then we might have leakage. We should look closely at our data—including the processes that generated it—for a possible source of a leak.
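Here's roughly what that weight check might look like with scikit-learn; the data and feature names are hypothetical, with the first feature made to dominate on purpose so there's something to flag:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features and target; x1 is made to dominate by construction.
feature_names = ["x1", "x2", "x3", "x4"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))
y = 5.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Standardize the features, then fit the linear model.
pipeline = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
weights = pipeline.named_steps["linearregression"].coef_

# With standardized features the weights are directly comparable; a weight
# that dwarfs the others on a feature we don't expect to matter is a hint
# to go looking for a leak.
for name, w in sorted(zip(feature_names, weights), key=lambda t: -abs(t[1])):
    print(f"{name}: {w:+.3f}")
```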

It's even better to try to prevent data leakage ahead of time by thinking critically about how our model will be used and how the training and test data will help or hinder that use. Leakage can be difficult to prevent completely, but with care it can be minimized—much like the drip of water from my bathtub faucet.