Pentagon, IC want industry to provide an ‘evaluation harness’ to standardize testing of AI systems

The Defense Innovation Unit (DIU) and the Office of the Director of National Intelligence (ODNI) launched a program called MYSTIC DEPOT.
A U.S. Air Force XQ-58A Valkyrie, an autonomous, low-cost tactical unmanned air vehicle, flies over Eglin Air Force Base’s Gulf Test and Training Range. (U.S. Air Force photo by Ilka Cole)

The Defense Department and Intelligence Community are on the hunt for an “evaluation harness” to test vendors’ AI technologies for government use.

The Pentagon’s Defense Innovation Unit, headquartered in Silicon Valley, released a solicitation Wednesday for the effort, dubbed “MYSTIC DEPOT,” which will be pursued via a commercial solutions opening contracting mechanism.

The release comes as Defense Secretary Pete Hegseth and Pentagon CTO Emil Michael are pushing the department to accelerate the widespread integration of artificial intelligence capabilities for warfighting and back-office functions.

To keep pace with the fast-moving field of AI, agencies need to be able to assess new models against defined benchmarks as they are released.

“The Department of War (DoW), in partnership with the Office of the Director of National Intelligence (ODNI), seeks an evaluation harness and government-specific benchmarks that together enable rigorous, reproducible, vendor-agnostic assessment of any AI system against government-defined criteria,” officials wrote in the solicitation, using a secondary name authorized by the Trump administration to refer to the Department of Defense. “The Government intends to use this harness across multiple programs. Solutions should be designed for broad applicability rather than single-program optimization.”

For the benchmark development portion of the program, the government seeks solutions from vendors that can be applied across unclassified, secret and top secret workflows, with a methodology that addresses requirements elicitation, task decomposition, input design, scoring criteria development, baseline establishment, validation, maintenance and “gaming resistance.”

For the evaluation harness, agencies want an “integrated infrastructure of an execution environment, tooling, and methodology” for AI system assessment that’s deployable across unclassified, classified cloud, and air-gapped environments.

The envisioned architecture would include a model interface, execution engine, measurement and scoring system, output and reporting layer, and a continuous monitoring and analytics capability that automates model ingestion and evaluation, among other desired attributes.
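The layered design described above — a vendor-agnostic model interface feeding an execution engine, scorer, and reporting layer — can be illustrated with a minimal sketch. All class and function names here are hypothetical, invented for illustration; they are not drawn from the MYSTIC DEPOT solicitation itself.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable

class ModelInterface(ABC):
    """Vendor-agnostic adapter: any AI system plugs in behind this."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

@dataclass
class BenchmarkTask:
    """One government-defined benchmark item with its scoring criterion."""
    prompt: str
    scorer: Callable[[str], float]  # maps model output -> score in [0, 1]

class EvaluationHarness:
    """Execution engine plus measurement/scoring and reporting layers."""
    def __init__(self, tasks: list[BenchmarkTask]):
        self.tasks = tasks

    def run(self, model: ModelInterface) -> dict:
        # Run every benchmark task against the model and aggregate scores.
        scores = [task.scorer(model.generate(task.prompt)) for task in self.tasks]
        return {"mean_score": sum(scores) / len(scores), "n_tasks": len(scores)}

# Example: a trivial echo "model" evaluated against one toy task.
class EchoModel(ModelInterface):
    def generate(self, prompt: str) -> str:
        return prompt

harness = EvaluationHarness(
    [BenchmarkTask("ping", lambda out: 1.0 if out == "ping" else 0.0)]
)
print(harness.run(EchoModel()))  # {'mean_score': 1.0, 'n_tasks': 1}
```

Because every vendor system sits behind the same abstract interface, the same task suite and scoring logic can be reused across programs — the "broad applicability rather than single-program optimization" the solicitation calls for.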

To evaluate how AI systems and models perform in difficult conditions, officials seek a capability that simulates “operational stress and network degradation in a controlled, reproducible environment” to enable assessment of their resilience “in mission-critical denied, degraded, intermittent, or limited (DDIL) environments,” according to the solicitation.

The government also wants a solution that supports automated red-teaming, “including the execution of adversarial prompts and attack patterns.”

With an eye toward human-machine teaming, officials are keen on interfaces that enable reviews by subject matter experts to assess “human workload, usability, and mission performance across human-only, AI-only, and human-AI team scenarios.”

Responses to the solicitation are due by March 24.
