Scale AI to set the Pentagon’s path for testing and evaluating large language models

The company will create a comprehensive T&E framework for generative AI within the Defense Department.

February 20, 2024

CEO of Scale AI Alexandr Wang testifies during a House Armed Services Subcommittee on Cyber, Information Technologies and Innovation hearing about artificial intelligence on Capitol Hill July 18, 2023 in Washington, DC. (Photo by Drew Angerer/Getty Images)

The Pentagon’s Chief Digital and Artificial Intelligence Office (CDAO) tapped Scale AI to produce a trustworthy means for testing and evaluating large language models that can support — and potentially disrupt — military planning and decision-making.

According to a statement the San Francisco-based company shared exclusively with DefenseScoop, the outcomes of this new one-year contract will supply the CDAO with “a framework to deploy AI safely by measuring model performance, offering real-time feedback for warfighters, and creating specialized public sector evaluation sets to test AI models for military support applications, such as organizing the findings from after action reports.”

Large language models and the overarching field of generative AI include emerging technologies that can generate (convincing but not always accurate) text, software code, images and other media, based on prompts from humans.

This rapidly evolving realm holds a lot of promise for the Department of Defense, but also poses unknown and serious potential challenges. Last year, Pentagon leadership launched Task Force Lima within the CDAO’s Algorithmic Warfare Directorate to accelerate its components’ grasp, assessment and deployment of generative artificial intelligence.

The department has long leaned on test-and-evaluation (T&E) processes to assess and ensure its systems, platforms and technologies perform in a safe and reliable manner before they are fully fielded. But AI safety standards and policies have not yet been universally set, and the complexities and uncertainties associated with large language models make T&E even more complicated when it comes to generative AI.

Broadly, T&E enables experts to determine the baseline performance of a specific model.

For instance, to test and evaluate a computer vision algorithm that differentiates between images of dogs and cats and things that are not dogs or cats, an official might first train it with millions of different pictures of those types of animals, as well as objects that aren’t dogs or cats. In doing so, the expert will also hold back a diverse subset of data that can then be presented to the algorithm down the line.

They can then compare the algorithm’s outputs on that held-back evaluation dataset against the known correct labels, or “ground truth,” and ultimately determine failure rates: how often the model is unable to determine whether something is or is not one of the classes it is trying to identify.
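As a rough illustration of that comparison, the sketch below computes an overall failure rate and a per-class failure rate from a held-back set. The labels, predictions and the function name `failure_rates` are hypothetical and are not drawn from Scale AI's or CDAO's tooling.

```python
from collections import Counter

def failure_rates(ground_truth, predictions):
    """Return (overall, per_class) failure rates on a held-back evaluation set."""
    if len(ground_truth) != len(predictions):
        raise ValueError("ground_truth and predictions must be the same length")
    if not ground_truth:
        raise ValueError("evaluation set is empty")
    # How many examples of each true class the evaluation set contains.
    totals = Counter(ground_truth)
    # How many of those examples the model got wrong, keyed by true class.
    misses = Counter(t for t, p in zip(ground_truth, predictions) if t != p)
    overall = sum(misses.values()) / len(ground_truth)
    per_class = {label: misses[label] / n for label, n in totals.items()}
    return overall, per_class

# Hypothetical held-back labels and the model's answers for the same images.
truth = ["dog", "cat", "other", "dog", "cat", "other", "dog", "cat"]
preds = ["dog", "cat", "other", "cat", "cat", "dog", "dog", "other"]
overall, per_class = failure_rates(truth, preds)
print(f"overall failure rate: {overall:.3f}")
for label, rate in sorted(per_class.items()):
    print(f"  {label}: {rate:.3f}")
```

Breaking the rate out by class shows where the model fails, not just how often, which is the kind of baseline the T&E process is meant to establish.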

Experts at Scale AI will adopt a similar approach for T&E with large language models, but because the models are generative in nature and the English language can be hard to evaluate, there isn’t that same level of “ground truth” for these complex systems. For example, if prompted to supply five different responses, an LLM might be generally factually accurate in all five, yet contrasting sentence structures could change the meaning of each output.

So, part of the company’s effort to develop the framework, methods and technology CDAO can use to test and evaluate large language models will involve creating “holdout datasets.” For these, the company will bring in DOD insiders to create prompt-and-response pairs, adjudicate them through layers of review, and ensure that each response is as good as would be expected from a human in the military.

The entire process will be iterative in nature.

Once datasets that are germane to the DOD for world knowledge, truthfulness, and other topics are made and refined, the experts can then evaluate existing large language models against them.
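A minimal sketch of what evaluating a model against such a holdout dataset could look like follows. Everything here is an assumption made for illustration: the `HoldoutItem` structure, the `min_approvals` review threshold, and the token-overlap F1 score, which stands in for whatever scoring method the actual framework adopts.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class HoldoutItem:
    prompt: str
    reference: str   # response vetted by human reviewers
    approvals: int   # hypothetical count of reviewers who approved the pair

def token_f1(candidate, reference):
    """Token-overlap F1 between a model response and the vetted reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(model, items, min_approvals=2):
    """Average score of `model` over pairs that passed enough review layers."""
    vetted = [item for item in items if item.approvals >= min_approvals]
    if not vetted:
        raise ValueError("no holdout items passed review")
    scores = [token_f1(model(item.prompt), item.reference) for item in vetted]
    return sum(scores) / len(scores)

# Toy data and a toy "model" that looks answers up in a dictionary.
items = [
    HoldoutItem("What does T&E stand for?", "test and evaluation", 3),
    HoldoutItem("What does CDAO stand for?",
                "Chief Digital and Artificial Intelligence Office", 2),
    HoldoutItem("Unreviewed prompt", "unreviewed answer", 1),  # filtered out
]
answers = {item.prompt: item.reference for item in items}
perfect_model = lambda prompt: answers.get(prompt, "")
silent_model = lambda prompt: ""
print(evaluate(perfect_model, items), evaluate(silent_model, items))
```

Token overlap is a crude proxy: as noted above, two responses can share most of their words yet differ in meaning, which is one reason human adjudication stays in the loop.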

Eventually, once they have these holdout datasets, experts will be able to run evaluations and establish model cards, or short documents that supply details on the best context for use of various machine learning models and information for measuring their performance.
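For readers unfamiliar with the format, a model card can be thought of as a small structured record. The sketch below is illustrative only; the field names and example values are assumptions, not a schema that CDAO or Scale AI has published.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ModelCard:
    model_name: str
    intended_use: str
    known_limitations: List[str] = field(default_factory=list)
    metrics: Dict[str, float] = field(default_factory=dict)  # holdout scores

    def render(self) -> str:
        """Format the card as a short plain-text document."""
        lines = [
            f"Model: {self.model_name}",
            f"Intended use: {self.intended_use}",
            "Known limitations:",
        ]
        lines += [f"  - {item}" for item in self.known_limitations] or ["  (none recorded)"]
        lines.append("Holdout metrics:")
        lines += [f"  {name}: {value:.3f}" for name, value in sorted(self.metrics.items())]
        return "\n".join(lines)

# Hypothetical card for an imaginary model.
card = ModelCard(
    model_name="example-llm",
    intended_use="organizing findings from after action reports",
    known_limitations=["not evaluated outside tested domains"],
    metrics={"truthfulness": 0.91, "world_knowledge": 0.87},
)
text = card.render()
print(text)
```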

Officials plan to automate as much of this development as possible, so that as new models come in, there can be some baseline understanding of how they will perform, where they will perform best, and where they will probably start to fail.

Further along in the process, the ultimate intent is for models to essentially send signals to the CDAO officials who engage with them if they start to waver from the domains they have been tested against.
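One very simplified way to picture such a signal is a check that flags prompts falling outside the domains a model was evaluated on. The sketch below uses keyword overlap purely as a stand-in; the domain names, vocabularies and `min_hits` threshold are hypothetical, and a fielded system would presumably rely on something far more robust.

```python
# Hypothetical vocabularies for the domains a model has been tested against.
TESTED_DOMAINS = {
    "after_action_reports": {"after", "action", "report", "lessons", "findings"},
    "logistics": {"supply", "convoy", "fuel", "maintenance"},
}

def domain_coverage(prompt, tested_domains):
    """Return the tested domain sharing the most words with the prompt."""
    words = set(prompt.lower().split())
    best_domain, best_hits = None, 0
    for domain, vocabulary in tested_domains.items():
        hits = len(words & vocabulary)
        if hits > best_hits:
            best_domain, best_hits = domain, hits
    return best_domain, best_hits

def check_prompt(prompt, tested_domains, min_hits=2):
    """Flag prompts that do not clearly match any tested domain."""
    domain, hits = domain_coverage(prompt, tested_domains)
    if hits < min_hits:
        return {"in_domain": False, "domain": None,
                "signal": "prompt falls outside tested domains"}
    return {"in_domain": True, "domain": domain, "signal": None}

covered = check_prompt("Summarize the findings from this after action report",
                       TESTED_DOMAINS)
uncovered = check_prompt("Write a poem about autumn leaves", TESTED_DOMAINS)
print(covered)
print(uncovered)
```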

“This work will enable the DOD to mature its T&E policies to address generative AI by measuring and assessing quantitative data via benchmarking and assessing qualitative feedback from users. The evaluation metrics will help identify generative AI models that are ready to support military applications with accurate and relevant results using DoD terminology and knowledge bases. The rigorous T&E process aims to enhance the robustness and resilience of AI systems in classified environments, enabling the adoption of LLM technology in secure environments,” Scale AI’s statement reads.

Beyond the CDAO, the company has also partnered with Meta, Microsoft, the U.S. Army, the Defense Innovation Unit, OpenAI, General Motors, Toyota Research Institute, Nvidia, and others.

“Testing and evaluating generative AI will help the DoD understand the strengths and limitations of the technology, so it can be deployed responsibly. Scale is honored to partner with the DoD on this framework,” Alexandr Wang, Scale AI’s founder and CEO, said in the statement.
