AI Weekly: AI research still has a reproducibility issue


Many systems, such as autonomous vehicle fleets and drone swarms, can be modeled as multi-agent reinforcement learning (MARL) tasks, which deal with how multiple machines can learn to collaborate, coordinate, compete, and collectively learn. Machine learning algorithms, particularly reinforcement learning algorithms, have been shown to be well suited to MARL tasks. But it is often challenging to scale them efficiently to hundreds or even thousands of machines.

One solution is a technique called centralized training and decentralized execution (CTDE), which allows an algorithm to train using data from multiple machines but make predictions for each machine individually (for example, when a driverless car should turn left). QMIX is a popular algorithm that implements CTDE, and many research groups claim to have designed QMIX variants that perform well on difficult benchmarks. But a new paper claims that the improvements in these variants might only be the result of code-level optimizations or “tricks” rather than design innovations.
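To make the CTDE idea concrete, here is a minimal sketch of the QMIX-style split between acting and training. It is heavily simplified: the published QMIX uses a learned, state-dependent mixing network, while this version assumes a single linear layer with fixed weights. The function names `greedy_action`, `mix`, and `best_joint_action` are hypothetical and chosen for illustration only.

```python
from itertools import product
from typing import Sequence, Tuple


def greedy_action(q_values: Sequence[float]) -> int:
    """Decentralized execution: an agent picks its best action from its own Q-values."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def mix(chosen_qs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Centralized training: combine per-agent values into one joint value.

    abs() keeps every weight non-negative, so the joint value never decreases
    when an individual agent's value increases (the monotonicity constraint).
    """
    return sum(abs(w) * q for w, q in zip(weights, chosen_qs)) + bias


def best_joint_action(
    agents_q: Sequence[Sequence[float]], weights: Sequence[float], bias: float
) -> Tuple[int, ...]:
    """Brute-force search over every joint action; used here only as a check."""
    joint_actions = product(*(range(len(q)) for q in agents_q))
    return max(
        joint_actions,
        key=lambda joint: mix([q[a] for q, a in zip(agents_q, joint)], weights, bias),
    )


if __name__ == "__main__":
    # Illustrative numbers only: two agents with two and three actions.
    agents_q = [[0.1, 0.9], [0.4, 0.2, 0.3]]
    weights, bias = [-2.0, 0.5], 0.1
    print([greedy_action(q) for q in agents_q])
    print(best_joint_action(agents_q, weights, bias))
```

Because the mixing weights are non-negative, each agent acting greedily on its own values selects the same joint action that the centralized brute-force search finds, which is what lets training be centralized while execution stays per-machine.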

In reinforcement learning, algorithms are trained to make a sequence of decisions. AI-guided machines learn to achieve a goal through trial and error, receiving rewards or penalties for the actions they take. But “tricks” like learning rate annealing, which has an algorithm learn quickly at first before slowing the process down, can yield misleadingly competitive results on benchmark tests.
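As a rough illustration of learning rate annealing, the sketch below decays a learning rate linearly from a high starting value to a low final one. The function name `annealed_lr` and the default values are assumptions made for this example; the paper's actual schedule is not described here.

```python
def annealed_lr(
    step: int, total_steps: int, lr_start: float = 1e-3, lr_end: float = 1e-5
) -> float:
    """Linearly anneal the learning rate from lr_start down to lr_end."""
    # Clamp progress to [0, 1] so steps beyond total_steps stay at lr_end.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + (lr_end - lr_start) * progress


if __name__ == "__main__":
    for step in (0, 5_000, 10_000):
        print(step, annealed_lr(step, total_steps=10_000))
```

A schedule like this lives in the training code rather than in the algorithm's design, which is why it can boost one variant over another if only one of them uses it.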

In experiments, the co-authors tested proposed variations of QMIX on the StarCraft Multi-Agent Challenge (SMAC), which focuses on micromanagement challenges in Activision Blizzard’s real-time strategy game StarCraft II. They found that QMIX algorithms from teams at the University of Virginia, the University of Oxford, and Tsinghua University managed to solve all of SMAC’s scenarios when using a list of common tricks, but that when the QMIX variants were normalized, their performance was significantly worse.

One QMIX variant, LICA, was trained on substantially more data than QMIX, but in their research, its creators compared its performance to a “basic” QMIX model without code-level optimizations. The researchers behind another variant, QPLEX, used test results from version 2.4.10 of SMAC to compare against QMIX results on version 2.4.6, which is known to be more difficult than 2.4.10.

“[S]ome of the things mentioned are endemic to machine learning, such as selectively reporting results or making inconsistent comparisons with other systems. It’s not exactly ‘cheating’ (or at least, sometimes it isn’t) as much as it’s lazy science that should be picked up by someone reviewing. Unfortunately, peer review is a pretty lax process,” Mike Cook, an artificial intelligence researcher at Queen Mary University of London, told VentureBeat by email.

In a Reddit thread discussing the study, one user argues that the results point to the need for ablation studies, which remove the components of an AI system one by one to audit their contribution to performance. The problem is that large-scale ablations can be expensive in the reinforcement learning domain, the user notes, because they require a lot of computing power.
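The ablation idea can be sketched in a few lines. In the example below, `ablate` removes one trick at a time and records how much the score drops relative to the full system. The `toy_evaluate` function is a hypothetical stand-in for a real (and expensive) training run, the trick names other than learning rate annealing are placeholders, and every number in it is invented purely so the example runs.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def ablate(
    tricks: Iterable[str], evaluate: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Return how much the score drops when each trick is removed on its own."""
    full = frozenset(tricks)
    baseline = evaluate(full)
    # Remove one component at a time and compare against the full system.
    return {trick: baseline - evaluate(full - {trick}) for trick in sorted(full)}


def toy_evaluate(enabled: FrozenSet[str]) -> float:
    """Stand-in for a training run; the gains below are invented, not measured."""
    made_up_gain = {"lr_annealing": 0.10, "trick_b": 0.05, "trick_c": 0.0}
    return 0.5 + sum(made_up_gain[t] for t in enabled)


if __name__ == "__main__":
    print(ablate(["lr_annealing", "trick_b", "trick_c"], toy_evaluate))
```

Each entry in the result costs one extra training run, which is where the computing expense the Reddit user mentions comes from: ablating n tricks individually needs n + 1 full runs, before accounting for multiple random seeds.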

More generally, the findings underscore the reproducibility problem in AI research. Studies often provide benchmark results in lieu of source code, which becomes problematic when the thoroughness of the benchmarks is called into question. A recent report found that 60% to 70% of the answers given by natural language processing models were embedded somewhere in the benchmark training sets, indicating that the models were often simply memorizing answers. Another study, a meta-analysis of more than 3,000 AI papers, found that the metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.
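One naive way to check for this kind of memorization is to measure how many test answers appear verbatim in the training text. The sketch below does exactly that; it is not the method used in the cited report, and the function name `memorization_rate` and the sample strings are assumptions made for illustration.

```python
from typing import Sequence


def memorization_rate(
    test_answers: Sequence[str], training_texts: Sequence[str]
) -> float:
    """Fraction of test answers found verbatim (case-insensitive) in training text."""
    if not test_answers:
        return 0.0
    lowered_training = [text.lower() for text in training_texts]
    hits = sum(
        1
        for answer in test_answers
        if any(answer.lower() in text for text in lowered_training)
    )
    return hits / len(test_answers)


if __name__ == "__main__":
    # Made-up strings, used only to exercise the function.
    training = ["Paris is the capital of France.", "Water boils at 100 C."]
    print(memorization_rate(["paris", "Berlin"], training))
```

A high rate does not prove a model memorized its answers, but it does indicate that the benchmark cannot distinguish memorization from generalization.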

“In some ways, the general state of reproducibility, validation, and review in computer science is quite appalling. And I suppose that larger problem is quite serious given that this field is now affecting people’s lives quite significantly,” Cook continued.

Reproducibility challenges

In a 2018 blog post, Google engineer Pete Warden talked about some of the top reproducibility issues facing data scientists. He referenced the iterative nature of current approaches to machine learning and the fact that researchers cannot easily record their steps at each iteration. Slight changes in factors such as training or validation data sets can affect performance, he noted, making it difficult to uncover the root cause of differences between expected and observed results.

“If [researchers] can’t get the same accuracy that the original authors did, how can they tell if their new approach is an improvement? It’s also clearly concerning to rely on models in production systems if you don’t have a way of rebuilding them to cope with changed requirements or platforms,” Warden wrote. “It’s also stifling for research experimentation; since making changes to code or training data can be hard to roll back, it’s a lot more risky to try different variations, just like coding without source control raises the cost of experimenting with changes.”

Data scientists like Warden say AI research needs to be presented in a way that lets third parties step in, train the novel models, and get the same results within a margin of error. In a recent letter published in the journal Nature, a response to an algorithm Google detailed in 2020, the co-authors laid out a number of reproducibility expectations, including descriptions of model development, data processing, and training pipelines; open-sourced code and training data sets, or at least model predictions and labels; and a disclosure of the variables used to augment the training data set, if any. A failure to include these “undermines [the] scientific value” of the research, they say.
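In practice, “the same results within a margin of error” usually means fixing random seeds and comparing an averaged score against the reported one. The sketch below shows the shape of such a check. The function `toy_train` is a hypothetical stand-in for a real training run, and the scores and tolerance are invented for illustration.

```python
import random
import statistics
from typing import Iterable


def toy_train(seed: int) -> float:
    """Stand-in for a training run; the score depends only on the seed."""
    rng = random.Random(seed)  # local generator, so global state cannot leak in
    return 0.8 + rng.uniform(-0.02, 0.02)


def reproduces(reported: float, seeds: Iterable[int], tolerance: float) -> bool:
    """Check whether the mean score over the seeds is within tolerance of the report."""
    scores = [toy_train(seed) for seed in seeds]
    return abs(statistics.mean(scores) - reported) <= tolerance


if __name__ == "__main__":
    print(toy_train(7) == toy_train(7))  # same seed, same score
    print(reproduces(reported=0.8, seeds=range(5), tolerance=0.02))
```

A check like this is only possible when the authors publish their code, seeds, and data, which is the point of the expectations listed in the letter.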

“Researchers have more incentive to publish their findings than to spend time and resources ensuring their study can be replicated … Scientific progress depends on the ability of researchers to scrutinize the results of a study and reproduce the main finding to learn from,” the letter reads. “Ensuring that [new] methods fulfill their potential … requires that [the] studies be reproducible.”

For AI coverage, send news tips to Kyle Wiggers, and be sure to sign up for the AI Weekly newsletter and bookmark our AI channel, The Machine.

Thank you for reading,

Kyle Wiggers

AI Writer

