Pentagon, IC want industry to provide an ‘evaluation harness’ to standardize testing of AI systems

The Defense Innovation Unit (DIU) and the Office of the Director of National Intelligence (ODNI) launched a program called MYSTIC DEPOT.
A U.S. Air Force XQ-58A Valkyrie, an autonomous, low-cost tactical unmanned air vehicle, flies over Eglin Air Force Base’s Gulf Test and Training Range. (U.S. Air Force photo by Ilka Cole)

The Defense Department and Intelligence Community are on the hunt for an “evaluation harness” to test vendors’ AI technologies for government use.

The Pentagon’s Defense Innovation Unit, headquartered in Silicon Valley, released a solicitation Wednesday for the effort, dubbed “MYSTIC DEPOT,” which will be pursued via a commercial solutions opening contracting mechanism.

The release comes as Defense Secretary Pete Hegseth and Pentagon CTO Emil Michael are pushing the department to accelerate the widespread integration of artificial intelligence capabilities for warfighting and back-office functions.

To keep pace with the fast-moving field of AI, agencies need to be able to assess new models against defined benchmarks as they are released.

“The Department of War (DoW), in partnership with the Office of the Director of National Intelligence (ODNI), seeks an evaluation harness and government-specific benchmarks that together enable rigorous, reproducible, vendor-agnostic assessment of any AI system against government-defined criteria,” officials wrote in the solicitation, using a secondary name authorized by the Trump administration to refer to the Department of Defense. “The Government intends to use this harness across multiple programs. Solutions should be designed for broad applicability rather than single-program optimization.”

For the benchmark development portion of the program, the government seeks solutions from vendors that can be applied across unclassified, secret and top secret workflows, with a methodology that addresses requirements elicitation, task decomposition, input design, scoring criteria development, baseline establishment, validation, maintenance and “gaming resistance.”

For the evaluation harness, agencies want an “integrated infrastructure of an execution environment, tooling, and methodology” for AI system assessment that’s deployable across unclassified, classified cloud, and air-gapped environments.

The envisioned architecture would include a model interface, execution engine, measurement and scoring system, output and reporting layer, and a continuous monitoring and analytics capability that automates model ingestion and evaluation, among other desired attributes.
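The layered design described above — a vendor-agnostic model interface feeding an execution engine, scorer, and reporting layer — can be illustrated with a minimal sketch. All class and function names here are hypothetical, invented for illustration; they are not drawn from the MYSTIC DEPOT solicitation itself.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable

class ModelInterface(ABC):
    """Vendor-agnostic adapter: any AI system plugs in behind this."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

@dataclass
class BenchmarkTask:
    """One government-defined benchmark item with its scoring criterion."""
    prompt: str
    scorer: Callable[[str], float]  # maps model output -> score in [0, 1]

class EvaluationHarness:
    """Execution engine plus measurement/scoring and reporting layers."""
    def __init__(self, tasks: list[BenchmarkTask]):
        self.tasks = tasks

    def run(self, model: ModelInterface) -> dict:
        # Run every benchmark task against the model and aggregate scores.
        scores = [task.scorer(model.generate(task.prompt)) for task in self.tasks]
        return {"mean_score": sum(scores) / len(scores), "n_tasks": len(scores)}

# Example: a trivial echo "model" evaluated against one toy task.
class EchoModel(ModelInterface):
    def generate(self, prompt: str) -> str:
        return prompt

harness = EvaluationHarness(
    [BenchmarkTask("ping", lambda out: 1.0 if out == "ping" else 0.0)]
)
print(harness.run(EchoModel()))  # {'mean_score': 1.0, 'n_tasks': 1}
```

Because every vendor system sits behind the same abstract interface, the same task suite and scoring logic can be reused across programs — the "broad applicability rather than single-program optimization" the solicitation calls for.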

To evaluate how AI systems and models perform in difficult conditions, officials seek a capability that simulates “operational stress and network degradation in a controlled, reproducible environment” to enable assessment of their resilience “in mission-critical denied, degraded, intermittent, or limited (DDIL) environments,” according to the solicitation.

The government also wants a solution that supports automated red-teaming, “including the execution of adversarial prompts and attack patterns.”

With an eye toward human-machine teaming, officials are keen on interfaces that enable reviews by subject matter experts to assess “human workload, usability, and mission performance across human-only, AI-only, and human-AI team scenarios.”

Responses to the solicitation are due by March 24.
