The reproducibility crisis in Machine Learning

What is the purpose of this post?

I want to argue that some of the code and ideas produced in current Machine Learning (ML) research are not reproducible, and explain why this is a problem. Specifically, I want to quantify, through back-of-the-envelope approximations, how much time and money is lost because of this.

To be clear, by non-reproducible results I mean any of the following scenarios: a) a state-of-the-art (SOTA) result is claimed in the original paper but redoing the same experiment under similar circumstances does not yield the promised results; b) a method works on one or a small number of datasets but does not generalize to others of the same kind, without clear reasons; c) running the same code twice yields different results, for example due to parallelization issues.

UPDATE (June 21st, 2020): Denny Britz has written a detailed blog post with very similar content, but I think it is more precise and has better visualizations. Even though I have put a lot of effort into my post, I would recommend reading his instead of mine if you want to read only one article, since it is just the better post. The only thing that is exclusive to mine is the back-of-the-envelope estimation of the harm that non-reproducibility does to the community, which I have also updated based on the findings of a new paper.

Why do we have a reproducibility crisis?

Most people who have spent some time on ML projects have either experienced it themselves or know someone who has: a project that was just not reproducible. You double-check the code and look at all of the equations and descriptions in the paper, you run a grid search over the hyperparameters on a GPU cluster overnight, you do everything that comes to mind, but still the end result is worse than the promised one. This situation is very frustrating because you are constantly uncertain whether the mistake lies with you or with the original paper. After a month of trying, a colleague tells you that they also had trouble reproducing this project and have also wasted a month on it. Finally, you give up on it and try another approach. Half a year later, out of curiosity, you look up the paper in question and find that it has been retracted due to a mistake. A retraction is a good sign because it shows that the scientific apparatus correctly weeds out dysfunctional theories. However, many times this doesn’t happen. You have wasted your time, your colleague has wasted their time, and future ML researchers will waste their time trying to recreate a result that just does not stand up to scrutiny. In the following, I want to discuss some of the reasons why these papers exist.

People could act with malicious intent. They might want their five minutes of fame, or some other reason might drive them to straight-up lie, make up results and try to publish them.

I think these are a small minority of the actual cases, and most people who have dedicated (part of) their life to science have good intentions. Therefore, I want to lay out some explanations that hold even if the researcher has genuinely good intentions.

I think the fundamental problem of current research is misaligned rewards and incentives. There are actually multiple problems that need to be disentangled a bit:
1. Your reputation and your funding as a researcher are to a large part linked to your publications and your citations. If you don’t publish successfully your career or your self-worth might take a large hit.
2. The community values positive results (i.e. an improvement on a given metric) very highly, while negative results (showing that something does not work) get almost no attention at all. Trying to publish a paper along the lines of “we have tried everything to make this promising method work but nothing did” is nearly impossible.
3. Re-implementations are not valued a lot in the community. Reducing the complexity of already existing work, cleaning up code, or extending it to another framework or a different dataset might give you some stars on GitHub and a nice comment from your colleagues, but it will certainly not be rewarded with publications or funds.
4. Even though the tide is slowly turning, there is currently no strong norm of publishing code along with the paper. That often means it is hard to verify whether you actually tried everything the original authors did, and it is easy to blame an undergrad or postgrad student for being inexperienced. Therefore, many people who are tasked with a re-implementation do not publicly announce that they tried to re-implement something, and their effort goes entirely unnoticed.

These reasons taken together lead to a situation where people desperately need to publish something and are not held sufficiently accountable for what they publish. This kind of environment is a breeding ground for practices that have enough plausible deniability to not be called cheating but sit in some kind of grey zone.

In the social sciences, such a practice would be p-hacking. In many journals, papers are only published if their results are statistically significant, which is measured by a p-value. In simple terms, this value is the probability of seeing an effect at least as strong as the one in the data, assuming that the effect does not exist in reality. Often, this p-value has to be lower than 0.05 or 0.01 for a paper to be accepted. While this is clearly a good attempt to only report effects that exist, it leads to bad practices. Assume, for example, you are a researcher and you have tested 50 people in your experiment. You run the test and get a p-value of 0.07. You could either say “well, the last 6 months have been for nothing, no journal is going to accept my paper and I can move on to the next experiment” or you start bending the results “just a bit” by changing the criteria for outliers or by conducting the experiment 10 more times in the hope of getting better data. Even though this is clearly suboptimal, it is not clearly cheating. There are valid reasons to change the criteria for outliers if you have made wrong assumptions about the data. And 50 people is a small sample anyway, so using more participants is a plus, right?

While the ML community has no established standard for statistical significance and our ways of comparing results are rudimentary at best, there are still some shady practices. Setting seeds for your experiments is generally a good idea because it enhances reproducibility. However, which seed you choose is up to you. Therefore, people might search intensively for “the right” seeds to improve their results. While this looks good on paper, it leads to a skewed result, since only the best experimental results are presented and all others are discarded silently.

However, there are even more innocent reasons why a method or result might not be reproducible. In these cases, even the authors do not know that their method works less well than reported.
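
The effect of searching for “the right” seed can be illustrated with a small simulation. This is only a sketch under stated assumptions: `noisy_accuracy` is a hypothetical stand-in for a full training run, and the true accuracy of 0.80 and the noise level of 0.01 are made-up numbers, not measurements.

```python
import random
import statistics


def noisy_accuracy(seed, true_accuracy=0.80, noise=0.01):
    """Hypothetical stand-in for one training run.

    The method itself never changes; the seed only changes the noise draw.
    """
    rng = random.Random(seed)
    return true_accuracy + rng.gauss(0.0, noise)


def summarize(seeds):
    """Compare the honest summary (mean, stdev) with the cherry-picked best run."""
    scores = [noisy_accuracy(seed) for seed in seeds]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "best": max(scores),
    }


report = summarize(range(20))
print(f"mean over 20 seeds: {report['mean']:.4f} +/- {report['stdev']:.4f}")
print(f"best single seed:   {report['best']:.4f}")
```

Reporting only the best seed overstates the method by construction, since the maximum of several noisy runs can never be lower than their mean. Reporting the mean and spread over several seeds avoids this.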

Wrong assumptions about the data and their origins: In nearly all cases the datasets used to test a method have not been collected by the authors themselves but are imported from other sources. This is reasonable because it saves immense amounts of time and improves comparability through benchmark sets. However, in this process, some information about the data is lost or not included in the metadata. Let’s say, for example, that a poll has been collected by a university. They called individual voters on their phones and asked some questions about their voting behavior. This information is stored with some demographics like age, zip code, background, etc. A group of ML researchers wants to build a model that predicts voting behavior based on all of these features, and it runs successfully on this dataset. They publish a paper, publish the code and their work is done. After a while, a company wants to reuse that model to predict voting behavior and realizes that it doesn’t work at all on their dataset, which has been collected via online surveys. A possible explanation for this might be a selection bias that was introduced through the means of polling. Most young people don’t have landlines anymore and the university only called via those, so they do not appear in the first study. Many old people do not use the internet and therefore do not appear in the second study. The model of the researchers might have just overfitted to the preferences of older people, which might be more homogeneous than those of younger voters. Therefore the model is not reproducible due to this selection bias.
There are more selection biases that you might not be aware of. Many experiments in medicine are only conducted on a very unrepresentative sample of society, many experiments in cognitive science only include right-handed individuals, and many experiments in psychology or other social sciences are only conducted on students. The datasets used by credit rating agencies include disproportionately few people from historically oppressed groups; the datasets that are used to train self-driving cars include mostly images from US west-coast traffic scenes, and the systems trained on them therefore respond worse to people of color; the datasets used for speech recognition mostly include US west-coast pronunciations, and the resulting systems therefore discriminate against other accents. While it is easy to point out these selection biases, it is not always easy to fix them. The first thing that one could do is to report the biases that the dataset or experiment has.
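
The polling example can be made concrete with a toy simulation. This is a sketch on purely synthetic data: the shares of older voters in each sample (90% for the landline poll, 10% for the online survey) and the voting preferences (80% vs. 30% for a hypothetical party A) are invented for illustration, and the "model" is deliberately the crudest one possible.

```python
import random


def sample_voters(n, share_old, rng):
    """Draw a synthetic poll as a list of (is_old, votes_a) tuples."""
    voters = []
    for _ in range(n):
        is_old = rng.random() < share_old
        p_vote_a = 0.8 if is_old else 0.3  # assumed preferences, not real data
        voters.append((is_old, rng.random() < p_vote_a))
    return voters


def majority_vote(voters):
    """'Train' a trivial model: always predict the majority answer."""
    votes_a = sum(vote for _, vote in voters)
    return votes_a * 2 >= len(voters)


def accuracy(prediction, voters):
    """Fraction of voters whose vote matches the constant prediction."""
    return sum(vote == prediction for _, vote in voters) / len(voters)


rng = random.Random(0)
landline = sample_voters(10_000, share_old=0.9, rng=rng)  # mostly older voters
online = sample_voters(10_000, share_old=0.1, rng=rng)    # mostly younger voters

model = majority_vote(landline)
print(f"accuracy on landline poll: {accuracy(model, landline):.2f}")
print(f"accuracy on online survey: {accuracy(model, online):.2f}")
```

The model looks fine on the sample it was built from and does worse than a coin flip on the other one, even though nothing about the code changed. Only the way the data was collected differs.
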
Stochasticity of ML: In some Machine Learning setups you do not know for certain what is going on most of the time. Say the dataset contains a couple of million images belonging to more than a thousand classes. Calculating the gradient over the entire dataset is neither feasible nor optimal, and therefore you use some variant of Stochastic Gradient Descent (SGD). The order in which the mini-batches are picked is random; this is the first source of randomness. Not all Neural Network architectures are completely deterministic either: some contain stochastic elements such as dropout. This is the second source of randomness. Most people, therefore, use a seed such that the randomness is at least reproducible every time you rerun the project. However, seeds only give fully reliable results when you train on the CPU, which nobody does because it takes ages. As an alternative, a GPU is used, where you can easily parallelize most of the training procedure. Unfortunately, using the GPU means that seeding alone does not work reliably anymore. GPU libraries optimize their parallelization under the hood, and this is hard to control from outside. This means that even if you set all your seeds correctly and get some result, the exact same setup could yield a different result for someone else. This argument is further explained in Rachael Tatman’s Medium post.
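
One reason seeds are not enough on parallel hardware is that floating-point addition is not associative: when many threads add up partial results, the order of the additions can vary from run to run, and so can the rounded result. The following is a minimal CPU-only sketch of that effect, not a GPU example; the input values are chosen by hand to make the difference obvious.

```python
def sequential_sum(values):
    """Add the values strictly left to right, as a single thread would."""
    total = 0.0
    for value in values:
        total += value
    return total


# The same three numbers, added in two different orders.
order_a = [1e16, 1.0, -1e16]
order_b = [1e16, -1e16, 1.0]

print(sequential_sum(order_a))  # prints 0.0: the 1.0 is lost when added to 1e16
print(sequential_sum(order_b))  # prints 1.0: the large terms cancel first

# Even everyday decimals depend on the grouping of the additions.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # prints False
```

In real training runs the differences per operation are tiny, but they can accumulate over millions of updates, which is how two runs with identical seeds can end up with different final metrics.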

There are further problems that Pete Warden outlines in their Blog Post on the reproducibility crisis.

Why is this a problem?

I want to make a back-of-the-envelope approximation of the harms that come from this problem. The assumptions I make in the following are not based on real data since I was unable to find any statistics on them. I will, therefore, use the lower bounds of my guesstimates. At four conferences (NeurIPS, ICLR, ICML and ICCV) there were in total 6743 + 2594 + 3432 + 4303 = 17072 papers submitted, of which 1428 + 678 + 774 + 1075 = 3955 were accepted. Let’s assume that the above reasons apply to between 1 and 5 percent of the accepted papers, i.e. roughly 40-200. Let’s additionally assume that two people attempt to reproduce each paper that is published at any of these four conferences (I am very uncertain about this number! Two is my lower-bound estimate). Let’s say an attempt to reproduce a paper consists of one month of work; we, therefore, have 80-400 months wasted per year. Assuming that these people work in academia and that a PhD takes roughly 40 months, this means between 2 and 10 PhDs per year get wasted collectively. If they work in industry, this implies 80-400 monthly salaries for nothing. According to LinkedIn, the median salary of an ML engineer in the USA is 125,000 US dollars per year, i.e. a bit more than 10,000 dollars per month. Rounding a bit, we get to 1 to 4 million dollars wasted per year due to these mistakes. Obviously, these are very rough estimates and some of the numbers can be very wrong. Maybe more or fewer people try to reproduce a paper, or maybe more or fewer papers are actually not reproducible. However, there are effects that should be further elaborated independently of the economic ramifications.

UPDATE (June 21st, 2020): Through Twitter, I found this single-author paper by Edward Raff in which he tried to replicate 255 (!) machine learning algorithms by himself and published a detailed version of his findings. There is a lot of interesting analysis and you should read the paper, but the main fact that I want to use to update my estimate is that he was able to replicate 63 percent of the results. This means that my estimates of 1-5% were, in fact, rather conservative. Using Raff’s result of 37% non-replicated as a starting point and rounding the number of accepted papers up to 4000, we get 1480 non-reproducible papers per year and therefore around 3000 months wasted, implying 3000 monthly salaries for nothing or 74 PhDs collectively wasted. If it is wasted in the private sector, with the assumptions from above (3000 months at 125,000 / 12 dollars each), we get a loss of around 31 million US dollars per year.
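
The arithmetic behind both the original and the updated estimate can be written down as a short script, which makes it easy to plug in different guesses. This is a sketch using the numbers from the text; the 40 months per PhD is an assumption inferred from the conversion of wasted months into wasted PhDs. Because it uses the unrounded 3955 accepted papers, its output can differ somewhat from the rounded figures in the text.

```python
ACCEPTED_PAPERS = 1428 + 678 + 774 + 1075  # NeurIPS, ICLR, ICML, ICCV = 3955
ATTEMPTS_PER_PAPER = 2       # lower-bound guess from the text
MONTHS_PER_ATTEMPT = 1       # guess from the text
MONTHS_PER_PHD = 40          # assumption: implied by "80 months = 2 PhDs"
ANNUAL_SALARY_USD = 125_000  # median ML engineer salary quoted in the text


def estimate_waste(non_reproducible_share):
    """Return the yearly waste for a given share of non-reproducible papers."""
    papers = ACCEPTED_PAPERS * non_reproducible_share
    months = papers * ATTEMPTS_PER_PAPER * MONTHS_PER_ATTEMPT
    return {
        "papers": papers,
        "months": months,
        "phds": months / MONTHS_PER_PHD,
        "usd": months * ANNUAL_SALARY_USD / 12,
    }


for share in (0.01, 0.05, 0.37):
    result = estimate_waste(share)
    print(
        f"{share:.0%}: {result['papers']:.0f} papers, "
        f"{result['months']:.0f} months, "
        f"{result['phds']:.1f} PhDs, "
        f"{result['usd'] / 1e6:.1f} million USD"
    )
```

Every input here is a guess, so the useful part is less the absolute numbers than how strongly the result scales with the assumed share of non-reproducible papers and the number of reproduction attempts.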

The individual effects: If you are working on your Bachelor’s or Master’s thesis, you have clear deadlines to meet. Wasting a month or two can kill your entire project and feels absolutely terrible. During your PhD you have more time, but the pressure to publish exists, and spending one or two months on a project that yields no results makes neither you nor your supervisor very happy. If this happens multiple times during your PhD, it might hurt your future career because you published less than other PhDs who did not run into similar problems. In short - the experience sucks.
The most vulnerable are hit the hardest: If you have the privilege to work in a big ML lab or a university that has lots of researchers working in your field, this might not be too much of a problem. Other people talk to you about their experiences with the topic, and someone else might have tried to reproduce this paper already and can tell you that it doesn’t work. To do a hyperparameter search you can use a very expensive GPU cluster, and after a night or a week you know whether the problem is the architecture or your hyperparameters. This does not hold for people working in a less privileged environment. Their labs might be smaller and the people working there have, on average, less experience. They might train on the single GPU they have in their computer, and searching for hyperparameters can take a long time. For individuals in these circumstances, all harms are multiplied. What might be a wasted week at Google might be a wasted six months in a smaller university’s ML lab.

What can be done about this?

There are a lot of different approaches to improving reproducibility. They mostly concern changing the protocol we use to do research and changing the norms within the community, e.g. to always publish with code. I think, however, that these approaches will be mitigatory at best if the incentives within academia stay the way they currently are. Since it will take a lot of time and effort to change them, I suggest a slightly different approach. The work that people do when they reproduce an architecture or paper should be recognized as such. If you are able to reproduce a paper, that should be worth something. If you are able to show that it is not reproducible, that should definitely be recognized as a legitimate improvement to the field of ML. In this post I will, therefore, discuss how a system to recognize this work could be established.

One last note

If you have any feedback regarding anything (e.g. layout or opinions), please tell me in a constructive manner via your preferred means of communication.