THE LAB WITHOUT SCIENTISTS
Inside 2026’s AI Research Revolution: How Autonomous Systems Are Rewriting the Rules of Discovery
By Shanaka Anslem Perera | January 3, 2026
In the basement of Lawrence Berkeley National Laboratory, a robotic arm pivots with mechanical precision, depositing a fine gray powder into a ceramic crucible. No human stands nearby. No graduate student monitors the furnace display. The room hums with the sound of automated instrumentation: X-ray diffractometers, spectroscopes, and liquid handlers, all of it orchestrated by an artificial intelligence that decided, moments earlier, which combination of elements might yield a novel superconductor. This is the A-Lab, and it represents something unprecedented in the four-century history of experimental science: a laboratory where the scientist making the decisions is not human.
Seventeen days. That is how long it took the A-Lab to synthesize 41 new inorganic compounds from 58 attempted targets, according to a paper published in Nature on November 29, 2023. The system selected candidate recipes from a database of millions, directed robots to mix and heat the precursors, analyzed the resulting crystal structures via X-ray diffraction, and autonomously refined its approach based on what it learned. The success rate of 71 percent seemed extraordinary, a vindication of the premise that artificial intelligence could not merely assist scientific research but conduct it.
Yet six months later, independent researchers published a devastating critique. Robert Palgrave of University College London and Leslie Schoop of Princeton published an analysis of the A-Lab’s claimed discoveries in PRX Energy, concluding that the system had not, in fact, synthesized any truly novel materials. The AI had misinterpreted its own data, mistaking known compounds for new ones due to a failure to account for compositional disorder, a phenomenon where atoms can substitute randomly within crystal structures. The machine learning model analyzing the X-ray patterns exhibited what Palgrave called “very bad, very beginner, completely novice human level” quality. The A-Lab controversy crystallized a tension that now defines the frontier of scientific research: artificial intelligence has achieved genuine, verified breakthroughs that earned Nobel Prizes and solved problems that stymied human mathematicians for half a century. Yet it simultaneously produces confident failures that experts must laboriously debunk.
This is the paradox at the heart of the AI scientist revolution. The same architectures that predicted over 200 million protein structures with experimental-level accuracy hallucinate ordered helices in proteins that naturally lack fixed structure. The same evolutionary algorithms that discovered matrix multiplication methods superior to anything humans devised in 56 years struggle to grasp the number-theoretic principles underlying their own solutions. The same language models that autonomously plan and execute Nobel Prize-winning chemistry experiments fabricate citations to papers that do not exist.
The question confronting science in 2026 is not whether AI will transform research. That transformation is already underway, documented in peer-reviewed journals and validated by the Nobel Committee itself. The question is more fundamental: What happens when machines that cannot truly understand begin generating knowledge faster than humans can verify it?
I. The Verification That Changed Everything
On October 9, 2024, the Royal Swedish Academy of Sciences announced that half of the Nobel Prize in Chemistry would go to Demis Hassabis and John Jumper of Google DeepMind for the development of AlphaFold, an artificial intelligence system that predicts the three-dimensional structures of proteins from their amino acid sequences. The other half went to David Baker for his work on computational protein design. It was the first time the Nobel Committee had recognized an AI system’s contribution at the highest level of scientific achievement, and it validated a premise that had seemed speculative just five years earlier: machines could make discoveries worthy of humanity’s most prestigious scientific honor.
The protein folding problem had tormented biologists for half a century. In 1969, Cyrus Levinthal calculated that a typical protein, exploring all possible conformations to find its functional shape, would require longer than the age of the universe to succeed. Yet real proteins fold in milliseconds. This paradox implied the existence of a folding code encoded in the amino acid sequence, but deciphering that code proved extraordinarily difficult. Experimental methods like X-ray crystallography and cryo-electron microscopy could determine structures, but each one required months or years of painstaking work. By early 2021, scientists had characterized the structures of roughly 170,000 proteins in the Protein Data Bank. The human body contains more than 20,000 distinct proteins, and nature contains billions.
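The arithmetic behind Levinthal’s argument is easy to reproduce. Here is a back-of-envelope sketch in Python, using illustrative figures (a 100-residue chain, three conformations per residue, a generous sampling rate) rather than Levinthal’s original numbers:

```python
# Back-of-envelope version of Levinthal's paradox (illustrative assumptions).
residues = 100                # a modest protein chain
states_per_residue = 3        # assumed conformations per backbone unit
sampling_rate = 1e13          # assumed conformations explored per second

conformations = states_per_residue ** residues        # about 5 x 10^47
years_needed = conformations / sampling_rate / (3600 * 24 * 365)

print(f"Exhaustive search: ~{years_needed:.1e} years")   # roughly 1.6e27 years
print("Age of the universe: ~1.4e10 years")
# Real proteins fold in milliseconds, which is the paradox.
```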
AlphaFold2 solved the prediction problem with unprecedented accuracy. At the 2020 CASP14 competition, the biennial Olympics of protein structure prediction, the system achieved a median Global Distance Test score of 92.4 across all 92 domain targets, routinely reaching experimental-level precision. For the most difficult Free Modeling category, where no similar known structures exist to guide prediction, AlphaFold achieved a median score of 87.0. The architecture combined evolutionary information from related protein sequences with geometric deep learning to infer three-dimensional arrangements. By 2024, DeepMind had released predictions for 214,684,311 proteins, essentially the entire known protein universe, through a freely accessible database. At the time of the Nobel announcement, over 2 million researchers had accessed the database; that number has since grown to more than 3 million across 190 countries.
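The Global Distance Test figure behind that 92.4 has a concrete, if slightly simplified, definition. A sketch of the commonly used GDT_TS variant, ignoring the search over structural superpositions that the full CASP metric performs:

```python
# Simplified GDT_TS: the average, over four distance cutoffs, of the percentage
# of residues whose predicted C-alpha atoms fall within the cutoff of the
# experimental positions. The real CASP metric also optimizes over structural
# superpositions, which this sketch omits.
def gdt_ts(ca_distances_angstrom, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    n = len(ca_distances_angstrom)
    fractions = [
        sum(d <= cutoff for d in ca_distances_angstrom) / n
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)

# Invented example: most residues within one or two angstroms of experiment.
print(gdt_ts([0.4, 0.6, 0.8, 0.9, 1.5, 0.7, 1.1, 0.9]))   # 93.75
```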
The real-world impact has been substantial and documented. Insilico Medicine, a biotechnology company specializing in AI-driven drug discovery, used AlphaFold-predicted structures to rapidly identify a CDK20 inhibitor with a binding affinity of 566.7 nanomolar (plus or minus 256.2 nanomolar). The team synthesized 13 compounds across two rounds of optimization, achieving a viable drug candidate within roughly 60 days from target selection. Over 40,000 papers now cite AlphaFold, spanning applications from malaria vaccine development to the engineering of plastic-degrading enzymes. When researchers need to understand how a protein interacts with a drug candidate or how a mutation might cause disease, they increasingly turn first to AlphaFold’s predictions rather than launching multi-year crystallography campaigns.
Yet the Nobel-winning system exhibits a limitation that illuminates the deeper challenges facing AI in science. AlphaFold predicts static structures, the final folded state of a protein, but provides no insight into the folding process itself. George Rose, an emeritus professor of biophysics at Johns Hopkins University, has argued that this distinction matters profoundly. “AlphaFold can recognize patterns,” Rose observed, “but it can’t tell scientists anything about the protein folding process. For many people, you don’t need to know. They don’t care. But science, at least for the past 500 years or so, has been involved with trying to understand the process by which things occur.”
This epistemological tension, between prediction and understanding, runs through every domain where AI has achieved success. The systems excel at recognizing patterns in data and exploiting those correlations to make accurate forecasts. What they do not do, and perhaps cannot do with current architectures, is grasp the causal mechanisms that generate those patterns. AlphaFold extracted correlations from millions of evolutionary sequences and learned that certain combinations of amino acids tend to produce certain spatial arrangements. It did not discover the laws of thermodynamics governing protein folding or derive principles that would allow scientists to predict structures in entirely novel situations.
The AlphaFold3 release on May 8, 2024, published in Nature, extended the system’s capabilities to predict interactions between proteins and other biomolecules, including DNA, RNA, small molecules, and ions. This expansion employed a diffusion-based architecture similar to image generation models and enabled new applications in drug discovery and vaccine design. But the diffusion architecture introduced new failure modes. Independent analysis of 72 proteins from the DisProt database, published as a preprint in October 2025, revealed that AlphaFold3 hallucinated structured conformations in 22 percent of intrinsically disordered proteins, imposing order where none naturally exists. A 4.4 percent chirality violation rate persisted on the PoseBusters benchmark, meaning the model sometimes predicted molecules with the wrong three-dimensional handedness.
These are not merely technical glitches to be fixed in the next version. They reflect something fundamental about how current AI systems relate to physical reality. The models learn statistical regularities from data and generate outputs consistent with those regularities. When a test case differs significantly from the training distribution, as when an intrinsically disordered protein is handed to a model trained largely on ordered structures, the system’s confident predictions become unreliable. The machine cannot recognize that it has ventured beyond the bounds of its competence because it lacks the conceptual framework to understand what competence means.
II. The Mathematics of Silicon Creativity
If AlphaFold represented AI solving a problem that humans could not solve quickly enough, AlphaEvolve represents something more unsettling: AI discovering solutions that humans had not imagined in over half a century of trying.
On May 14, 2025, Google DeepMind announced that its evolutionary coding agent had found a novel algorithm for multiplying 4 by 4 matrices over the complex numbers using only 48 scalar multiplications. This may sound esoteric, but matrix multiplication is among the most fundamental operations in computing, underlying everything from graphics rendering to neural network training. The standard method requires 64 multiplications. In 1969, Volker Strassen discovered a method requiring only 49 over general fields, launching decades of research into faster matrix multiplication algorithms. For 56 years, no one improved on Strassen’s bound for complex-valued matrices, not for lack of trying, but because the search space of possible algorithms is vast and the constraints exacting.
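The counts themselves are straightforward to verify. The textbook algorithm spends one multiplication per term of each dot product, and Strassen’s 49 comes from applying his seven-multiplication trick for 2 by 2 matrices recursively, as this short sketch shows:

```python
# Textbook n x n matrix multiplication: every entry of the result is a dot
# product of n terms, so the count is n**3 scalar multiplications.
def naive_count(n):
    return n ** 3

# Strassen's 1969 scheme multiplies 2 x 2 matrices with 7 multiplications
# instead of 8. Viewing a 4 x 4 matrix as a 2 x 2 grid of 2 x 2 blocks and
# recursing gives 7 * 7 = 49, the bound AlphaEvolve's 48-multiplication
# algorithm finally beat for complex-valued matrices.
def strassen_count(recursion_levels):
    return 7 ** recursion_levels

print(naive_count(4))      # 64
print(strassen_count(2))   # 49
```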
AlphaEvolve did not solve this problem through brute-force enumeration. The system employed what its creators termed digital Darwinism, an evolutionary framework where code itself is the organism subject to mutation and selection. Two large language models work in tandem: Gemini Flash generates diverse mutations to existing algorithms, exploring the search space with high creativity, while Gemini Pro evaluates the plausibility of proposed solutions, filtering out hallucinations before expensive execution. A controller orchestrates the flow, and crucially, every candidate algorithm is compiled and run against mathematical verification suites. If the code produces incorrect results, it is discarded regardless of how elegant the logic appears.
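In outline, the loop looks like the sketch below. The function names are placeholders, and in the real system the mutation, screening, and verification steps are performed by Gemini models and exact test suites rather than the stubs shown here; this is an interpretation of the published description, not DeepMind’s code.

```python
import random

def mutate(program):
    # Stub: in AlphaEvolve, a fast LLM (Gemini Flash in the paper) rewrites
    # marked regions of the candidate code.
    return program + f"\n# mutation {random.randint(0, 10**6)}"

def looks_plausible(program):
    # Stub: a stronger LLM (Gemini Pro) screens out obviously broken candidates
    # before the expensive execution step.
    return True

def evolve(seed_program, evaluate, generations=100, population_size=20):
    """Minimal evolutionary-search loop in the spirit of AlphaEvolve.

    `evaluate` compiles and runs a candidate against a verification suite and
    returns a numeric score, or None if the output is incorrect."""
    population = [(seed_program, evaluate(seed_program))]
    for _ in range(generations):
        parent, _ = population[0]                                    # best so far
        children = [mutate(parent) for _ in range(population_size)]
        children = [c for c in children if looks_plausible(c)]      # cheap filter
        scored = [(c, evaluate(c)) for c in children]               # exact verification
        population += [(c, s) for c, s in scored if s is not None]  # discard wrong code
        population.sort(key=lambda pair: pair[1] or 0, reverse=True)
        population = population[:population_size]                   # selection pressure
    return population[0]
```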
The key innovation is treating algorithm discovery as a software engineering problem rather than a mathematical one. DeepMind’s earlier AlphaTensor system, published in Nature in October 2022, had already achieved 47 multiplications for 4 by 4 matrices, but only for modular arithmetic over Z/2Z (the field with two elements, where 1+1=0). That breakthrough did not extend to the real or complex numbers used in scientific computing and machine learning. AlphaEvolve operates directly on Python and Verilog code, evolving not just numerical parameters but the logic of optimizers, weight initialization strategies, and loss functions themselves. A single evolutionary step might introduce 15 distinct structural mutations, navigating a functionally infinite search space through selective pressure toward verified correctness.
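The restriction matters because arithmetic mod 2 allows cancellations that have no counterpart over the reals, so a multiplication scheme verified only in Z/2Z need not work on ordinary numbers. A small illustration of such a cancellation:

```python
# Over Z/2Z the cross term in (a + b)^2 vanishes, because 2ab = 0 (mod 2),
# so (a + b)^2 == a^2 + b^2 there. Over the reals the identity is simply
# false, which is why an algorithm verified only mod 2 need not carry over
# to the real or complex matrices used in scientific computing.
for a in range(2):
    for b in range(2):
        assert (a + b) ** 2 % 2 == (a * a + b * b) % 2   # holds mod 2

a, b = 1.0, 1.0
print((a + b) ** 2, a * a + b * b)   # 4.0 versus 2.0: fails over the reals
```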
The collaboration with Fields Medalist Terence Tao revealed both the power and the boundaries of this approach. Tao worked with DeepMind to test AlphaEvolve on a battery of mathematical problems. The original system paper reported results on over 50 problems. A follow-up collaboration with Tao, published on arXiv in November 2025, extended testing to 67 problems across diverse mathematical domains. The system rediscovered optimal known solutions in 75 percent of cases, an impressive but expected result for well-studied problems. More significantly, it improved upon the best-known solutions in roughly 20 percent of cases, each representing a genuine new discovery. The kissing number lower bound in 11 dimensions rose from 592 to 593, the first improvement in decades, through the discovery of a novel geometric arrangement of spheres.
Yet Tao’s analysis also exposed fundamental limitations. He noted that AlphaEvolve proved extremely good at locating exploits in verification code, finding degenerate solutions that technically satisfied constraints but did not represent genuine mathematical progress. When tasked with finding geometric configurations, the AI would sometimes place points at virtually identical coordinates to bypass floating-point checks. Human experts had to design robust verifiers using exact arithmetic to force the system toward meaningful solutions.
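The failure mode, and the fix, can be seen in a toy verifier. The example below is invented for illustration and is not one of the actual problem checkers; it shows how a float-based constraint can be technically satisfied by a degenerate configuration, and how an exact-arithmetic version with an explicit separation requirement closes the loophole.

```python
from fractions import Fraction

# Toy task: place points in the unit square with no two coinciding. A naive
# check that only asks for nonzero distance is technically satisfied by
# near-duplicate points, the kind of degenerate "solution" Tao reported.
def naive_verifier(points):
    return all(
        (px - qx) ** 2 + (py - qy) ** 2 > 0.0
        for i, (px, py) in enumerate(points)
        for qx, qy in points[i + 1:]
    )

# A robust verifier works in exact rational arithmetic and demands a genuine
# minimum separation, forcing the search toward meaningful configurations.
def robust_verifier(points, min_sq_sep=Fraction(1, 10_000)):
    exact = [(Fraction(x), Fraction(y)) for x, y in points]
    return all(
        (px - qx) ** 2 + (py - qy) ** 2 >= min_sq_sep
        for i, (px, py) in enumerate(exact)
        for qx, qy in exact[i + 1:]
    )

degenerate = [(0.3, 0.7), (0.3 + 1e-9, 0.7)]   # two virtually identical points
print(naive_verifier(degenerate))    # True: the literal constraint is met
print(robust_verifier(degenerate))   # False: the exact check rejects it
```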
More tellingly, Tao observed that AlphaEvolve excelled at finding needles in haystacks, solutions that are technically accessible but buried in vast combinatorial spaces, while providing no deep insight or conceptual understanding. The AI delivered the candidate, the what, but the human mathematician had to supply the proof and the meaning, the why. When Tao examined AlphaEvolve’s performance on number-theoretic problems, he found the system struggled to take advantage of number-theoretic structure. It did not discover new patterns or principles; it searched exhaustively and selected candidates that passed verification.
This distinction between search and understanding may seem academic, but it has practical implications. DeepMind has deployed AlphaEvolve to optimize Google’s internal infrastructure with concrete results. A scheduling heuristic discovered for the Borg cluster management system now runs in production, recovering 0.7 percent of Google’s worldwide compute resources, a gain worth hundreds of millions of dollars annually at hyperscale. Modifications to the Verilog code describing Tensor Processing Unit arithmetic circuits have been verified and will appear in upcoming hardware generations. An improvement of up to 32.5 percent in the FlashAttention kernel directly accelerates the training of large language models.
These are genuine engineering achievements, instances where the AI discovered practical solutions that human engineers had missed. But they are optimizations within well-defined frameworks, not conceptual breakthroughs that reshape how we understand the problems themselves. The silicon creativity of AlphaEvolve operates through exhaustive exploration verified by formal methods, a powerful capability, but one that differs in kind from the human creativity that invented matrix multiplication in the first place.
III. The Autonomous Laboratory: Promise and Illusion
The dream of the self-driving laboratory is older than large language models. For decades, researchers have envisioned facilities where robots conduct experiments around the clock, accumulating data at rates no human team could match. What changed in 2023 and 2024 was the integration of AI systems capable of making research decisions, choosing which experiments to run, interpreting results, and adapting strategies in real time.
The Rainbow system at North Carolina State University exemplifies this new paradigm. Published in Nature Communications on August 22, 2025, Rainbow is a multi-robot facility for optimizing perovskite quantum dots, semiconductor nanocrystals with applications in solar cells and displays. The system prepares chemical precursors, mixes them in 96 parallel micro-reactors, characterizes the products through spectroscopy, and uses machine learning to decide which formulations to try next. Operating continuously, Rainbow executes up to 1,000 experiments per day, compressing years of traditional research into weeks.
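The closed loop at the heart of such a facility is conceptually simple even where the robotics are not: propose a batch of recipes, run them in parallel, fit a model to everything measured so far, and let the model bias the next batch. A minimal sketch follows, with `run_batch_on_robot` standing in for the instrument-control layer; the function names and acquisition rule are illustrative assumptions, not Rainbow’s actual software.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def run_campaign(run_batch_on_robot, n_rounds=10, batch_size=96, n_params=4):
    """Sketch of a self-driving-lab loop: propose formulations, have the
    hardware measure them in parallel, fit a surrogate model to all data so
    far, and steer the next batch toward promising recipes."""
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(batch_size, n_params))   # initial random recipes
    y = run_batch_on_robot(X)                             # measured quality scores

    for _ in range(n_rounds):
        surrogate = GaussianProcessRegressor().fit(X, y)
        candidates = rng.uniform(0, 1, size=(5000, n_params))
        mean, std = surrogate.predict(candidates, return_std=True)
        scores = mean + 1.0 * std        # favor predicted-good but uncertain recipes
        next_batch = candidates[np.argsort(scores)[-batch_size:]]
        X = np.vstack([X, next_batch])
        y = np.concatenate([y, run_batch_on_robot(next_batch)])

    best = np.argmax(y)
    return X[best], y[best]
```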
The results have been tangible. Rainbow identified Pareto-optimal formulations for perovskite nanocrystals, materials that achieve the best possible trade-off between competing properties like efficiency and stability. Crucially, the system’s machine learning component does not merely find optimal recipes; it identifies which chemical parameters matter most, generating interpretable insights rather than opaque predictions. When human researchers examined Rainbow’s discoveries, they could understand why certain formulations worked, enabling them to apply the principles to related problems.
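Pareto optimality has a precise meaning in this context: a formulation survives only if no other formulation matches or beats it on every property and strictly beats it on at least one. A short sketch of that filter over invented (efficiency, stability) pairs:

```python
# A formulation is Pareto-optimal if no other formulation is at least as good
# on every objective and strictly better on at least one. Both objectives here
# are to be maximized; the values are invented for illustration.
def pareto_front(candidates):
    front = []
    for name, scores in candidates:
        dominated = any(
            all(o >= s for o, s in zip(other, scores))
            and any(o > s for o, s in zip(other, scores))
            for _, other in candidates
        )
        if not dominated:
            front.append((name, scores))
    return front

formulations = [
    ("A", (0.92, 0.40)),   # efficient but fragile
    ("B", (0.85, 0.70)),   # balanced
    ("C", (0.60, 0.90)),   # stable but dim
    ("D", (0.80, 0.60)),   # dominated by B on both axes
]
print(pareto_front(formulations))   # A, B, and C survive; D does not
```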
Carnegie Mellon’s Coscientist, published in Nature on December 20, 2023, demonstrated similar capabilities in a different domain. Primarily driven by GPT-4 but also incorporating other large language models including Claude, Coscientist can autonomously plan and execute complex organic chemistry experiments. In its most impressive demonstration, the system designed and carried out a palladium-catalyzed cross-coupling reaction, a class of chemistry that earned Richard Heck, Ei-ichi Negishi, and Akira Suzuki the Nobel Prize in 2010, without human intervention during the synthesis itself. The AI browsed scientific literature, selected appropriate reagents and conditions, wrote Python code to control liquid-handling robots, and directed the synthesis. Human researchers subsequently analyzed the results through gas chromatography-mass spectrometry (GC-MS) to verify success.
The architecture reveals how these systems bridge language and action. Coscientist includes modules for web search, documentation reading, code execution, and robotic control. When the AI needed to operate an unfamiliar liquid handler, it located the API documentation online, studied the syntax, and wrote valid control code. When a first attempt failed, the system consulted the technical manual, identified its error, and corrected the code without human prompting. This ability to learn new tools on the fly, reading manuals and translating them into executable instructions, represents a qualitative advance over pre-programmed automation.
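The pattern is the now-standard tool-using agent loop: the language model names an action, software executes it, and the result, including any error message, is appended to the transcript the model sees next. A schematic sketch, with stub tool implementations and a placeholder `call_llm` function rather than Coscientist’s actual interfaces:

```python
# Schematic agent loop in the style of Coscientist. The tool bodies are stubs;
# in the real system they wrap web search, documentation retrieval, a Python
# sandbox, and the liquid handler's control API.
def web_search(query):        return f"(stub) search results for: {query}"
def read_documentation(url):  return f"(stub) contents of: {url}"
def run_python(code):         return f"(stub) output of executing: {code!r}"
def drive_robot(script):      return f"(stub) robot executed: {script!r}"

TOOLS = {"SEARCH": web_search, "READ_DOCS": read_documentation,
         "RUN_PYTHON": run_python, "ROBOT": drive_robot}

def agent_loop(goal, call_llm, max_steps=25):
    """`call_llm` is a placeholder for the underlying language model. It sees
    the transcript so far and answers either 'DONE' or 'TOOL_NAME: argument'."""
    transcript = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        reply = call_llm("\n".join(transcript))
        if reply.strip().startswith("DONE"):
            break
        tool_name, _, argument = reply.partition(":")
        try:
            observation = TOOLS[tool_name.strip()](argument.strip())
        except Exception as error:             # a failed call becomes feedback,
            observation = f"ERROR: {error}"    # letting the model self-correct
        transcript += [f"ACTION: {reply}", f"OBSERVATION: {observation}"]
    return transcript
```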
ChemCrow, published in Nature Machine Intelligence on May 8, 2024, by researchers at EPFL and IBM, extended this capability further. Integrating GPT-4 with 18 expert-designed tools for chemistry, including structure drawing, property prediction, and safety screening, the system autonomously synthesized DEET (insect repellent), three thiourea organocatalysts, and a novel chromophore molecule. The synthesis was executed through IBM’s RoboRXN cloud laboratory, where robotic systems carried out the AI’s instructions remotely. Human expert evaluation found ChemCrow outperformed raw GPT-4 on chemical accuracy, though both exhibited similar reasoning quality since both relied on the same underlying language model.
These successes are real, documented, and reproducible. But they occurred in carefully constrained domains where success criteria are unambiguous: a reaction either yields the target molecule or it does not, a formulation either achieves specified optical properties or it does not. The AI systems excel at optimization within defined boundaries, iterating rapidly toward known objectives through closed-loop feedback.
The A-Lab controversy reveals what happens when autonomous systems venture into territory where ground truth is ambiguous. The Berkeley facility claimed to have discovered 41 novel inorganic compounds, not optimizations of known materials but genuinely new chemical entities. The AI system analyzed X-ray diffraction patterns and concluded that the synthesized powders matched predicted structures not present in existing databases.
The independent critique by Palgrave and Schoop, published in PRX Energy on March 7, 2024, identified multiple failure modes. The machine learning model analyzing diffraction data exhibited optimism bias, a predisposition to find target signals amid noise. It failed to recognize compositional disorder, a common phenomenon in solid-state chemistry where atoms can substitute randomly within ordered structures. Compounds that appeared novel were actually known materials absent from the reference database the AI consulted. The research team at Berkeley published a response defending their work, providing additional energy-dispersive spectroscopy data to support their claims. But Palgrave’s verdict was unequivocal: the main claim of discovery of new materials is wrong.
This episode illustrates a principle that extends beyond materials science. AI systems operating in the physical world must interpret messy, ambiguous data, diffraction patterns that could indicate multiple structures, spectroscopic signals contaminated by impurities, measurements affected by instrumental drift. Current AI lacks the contextual knowledge that allows human experts to distinguish genuine discoveries from artifacts. It cannot recognize that a proposed compound violates basic chemical principles because it does not understand chemical principles; it only correlates patterns.


