Max Tegmark and Colleagues Release Open-Source Tool Using AI to Check Code

October 30, 2024
by the Foundational Questions Institute (FQxI)

FQxI's Max Tegmark and colleagues have created DafnyBench, an open-source benchmark for training and evaluating AI systems that formally verify code is free of bugs.

From the team's new preprint:

We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.
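To make the task concrete: the "hints" the LLMs must generate are verification annotations, such as loop invariants, that the Dafny engine needs in order to prove a program's stated specification. The snippet below is an illustrative sketch (not taken from the DafnyBench suite): without the two `invariant` lines, Dafny cannot prove the `ensures` postcondition; with them, verification succeeds.

```dafny
// Sums the integers 0, 1, ..., n-1 and proves the closed-form result.
method SumUpTo(n: nat) returns (s: nat)
  ensures s == n * (n - 1) / 2
{
  s := 0;
  var i := 0;
  while i < n
    invariant 0 <= i <= n           // hint: loop counter stays in range
    invariant s == i * (i - 1) / 2  // hint: partial sums match the formula
  {
    s := s + i;
    i := i + 1;
  }
}
```

Generating such invariants automatically is the benchmark's core challenge: they are not needed to run the code, only to convince the verifier that the code meets its specification.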

In a separate AI study, Tegmark and other colleagues report a surprising geometric structure in the concepts learned by LLMs. On X/Twitter, Tegmark notes that these concepts form brain-like "lobes" and "semantic crystals." Read more in their preprint, "The Geometry of Concepts."