Via genAI pilot, CDAO exposes ‘biases that could impact the military’s healthcare system’
The Pentagon’s Chief Digital and AI Office recently completed a pilot exercise with tech nonprofit Humane Intelligence that analyzed three well-known large language models in two real-world use cases aimed at improving modern military medicine, officials confirmed Thursday.
Following the exercise, the partners revealed they had uncovered hundreds of possible vulnerabilities that defense personnel can account for moving forward when considering LLMs for these purposes.
“The findings revealed biases that could impact the military’s healthcare system, such as bias related to demographics,” a Defense Department spokesperson told DefenseScoop.
The spokesperson wouldn’t share much more about what was exposed, but did provide new details about the design and implementation of the CDAO-led pilot, the team’s follow-up plans and the steps taken to protect service members’ privacy while using applicable clinical records.
As the name suggests, large language models essentially process and generate language for humans. They fall into the buzzy, emerging realm of generative AI.
Broadly, that field encompasses disruptive but still-maturing technologies that can process huge volumes of data and perform increasingly “intelligent” tasks — like recognizing speech or producing human-like media and code based on human prompts. These capabilities are pushing the boundaries of what existing AI and machine learning can achieve.
Recognizing the potential for both major opportunities and yet-to-be-known threats, the CDAO has been studying genAI and coordinating approaches and resources to help DOD deploy and experiment with it in a “responsible” manner, officials say.
After recently sunsetting the genAI-exploring Task Force Lima, the office in mid-December launched the Artificial Intelligence Rapid Capabilities Cell to accelerate the delivery of proven and new capabilities across DOD components.
The CDAO’s latest Crowdsourced AI Red-Teaming (CAIRT) Assurance Program pilot, which focused on tapping LLM chatbots with the aim of enhancing military medicine services, “is complementary to the [cell’s] efforts to hasten the adoption of generative AI within the department,” according to the spokesperson.
They further noted that the CAIRT is one example of CDAO-run programs intended “to implement new techniques for AI Assurance and bring in a wide variety of perspectives and disciplines.”
Red-teaming is a resilience practice that applies adversarial techniques to internally stress-test systems’ robustness. For the recent pilot, Humane Intelligence crowdsourced red-teaming of two prospective use cases in contemporary military medicine: clinical note summarization and a medical advisory chatbot.
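To make the approach concrete, here is a minimal, hypothetical sketch of what a crowdsourced red-teaming harness for the note-summarization use case might look like. The model names, query hook and record fields are illustrative assumptions, since the pilot's actual tooling was not disclosed:

```python
import hashlib
import json
import random
from dataclasses import dataclass, asdict

# Hypothetical stand-ins for the three masked LLMs; the pilot's actual
# models were anonymized and never named.
MASKED_MODELS = ["model_a", "model_b", "model_c"]

@dataclass
class RedTeamFinding:
    """One crowdsourced finding: a prompt, a masked model ID, and issue tags."""
    participant_id: str  # anonymized, per the pilot's privacy measures
    masked_model: str    # model identity hidden to prevent rater bias
    prompt: str          # fictional scenario, never real patient data
    response: str
    issue_tags: list     # e.g. ["demographic_bias", "hallucination"]

def anonymize(name: str) -> str:
    """Replace a participant's identity with a stable one-way hash."""
    return hashlib.sha256(name.encode()).hexdigest()[:12]

def run_trial(participant: str, prompt: str, query_model) -> RedTeamFinding:
    """Route a fictional clinical prompt to a randomly chosen, masked model."""
    masked_id = random.choice(MASKED_MODELS)
    response = query_model(masked_id, prompt)  # assumed model-serving hook
    return RedTeamFinding(anonymize(participant), masked_id, prompt, response, [])

# Example trial using a stub model hook and a fictional scenario,
# mirroring the pilot's instruction to avoid actual patient data.
finding = run_trial(
    "provider_042",
    "Summarize this fictional patient encounter note: ...",
    lambda model, p: f"[{model} response placeholder]",
)
print(json.dumps(asdict(finding), indent=2))
```

In a design like this, masking the model identity and hashing participant names would serve the same two goals the spokesperson described: preventing rater bias and keeping providers anonymous.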
“Over 200 participants, including clinical providers and healthcare analysts from [the Defense Health Agency], the Uniformed Services University of the Health Sciences, and the Services, participated in the exercise, which compared three popular LLMs. The exercise uncovered over 800 findings of potential vulnerabilities and biases related to employing these capabilities in these prospective use cases,” officials wrote in a DOD release published Thursday.
When asked to disclose the names and makers of the three LLMs that were leveraged, the DOD spokesperson told DefenseScoop: “The identities of the large language models (LLMs) used in the study were masked to prevent bias and ensure data anonymity during the evaluation.”
The team carefully designed the exercise to minimize selection bias, gather meaningful data, and protect the privacy of all participants. Plans for the pilot also underwent thorough internal and external reviews to ensure its integrity before it was conducted, according to the official.
“Once announced, providers and healthcare analysts from the Military Health System (MHS) who expressed interest were invited to participate voluntarily. All participants received clear instructions to generate interactions that simulated real-world scenarios in Military Medicine, such as summarizing patient records or seeking clinical advice, ensuring the use of fictional cases rather than actual patient data,” the spokesperson said.
“Multiple measures were implemented to ensure the privacy of participants, including maintaining the anonymity of providers and healthcare analysts involved in the exercise,” they added.
The DOD announcement suggests that certain lessons from this pilot will play a major role in shaping the military’s policies and best practices for responsibly using genAI.
The exercise is set to “result in repeatable and scalable output via the development of benchmark datasets, which can be used to evaluate future vendors and tools for alignment with performance expectations,” officials wrote.
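As an illustration of how such a benchmark could make the output repeatable, here is a small hypothetical sketch in which red-team findings become test cases for scoring future vendor models. The case fields and pass/fail rule are assumptions, not the CDAO's actual evaluation criteria:

```python
# Each benchmark case pairs a fictional prompt with failure patterns
# drawn from prior red-team findings (fields here are illustrative).
BENCHMARK = [
    {"prompt": "Summarize this fictional encounter note: ...",
     "must_not": ["demographic stereotype", "fabricated vitals"]},
]

def evaluate(query_model, model_id: str) -> float:
    """Score a candidate model: the share of benchmark prompts whose
    responses avoid every previously flagged failure pattern."""
    passed = 0
    for case in BENCHMARK:
        response = query_model(model_id, case["prompt"]).lower()
        if not any(bad in response for bad in case["must_not"]):
            passed += 1
    return passed / len(BENCHMARK)

# Example: a stub "vendor model" scores 1.0 because its canned reply
# contains none of the flagged failure patterns.
print(evaluate(lambda m, p: "Concise, neutral summary.", "vendor_x"))
```

The point of a fixed suite like this is that any future vendor or tool can be run against the same cases and compared against the same performance expectations.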
Furthermore, officials noted that if these two use cases, “when fielded,” are deemed covered AI as defined in the recent White House national security memo governing federal agencies’ pursuit of the technology, “they will adhere to all required risk management practices.”
Inside the Pentagon’s top AI hub, officials are now scoping out new CAIRT-related programs and partnerships that make sense within the department and with other federal partners.
“CDAO is producing a playbook that will enable other DOD components to set up and run their own crowdsourced AI assurance and red teaming programs,” the spokesperson said.
DefenseScoop has reached out to Humane Intelligence for comment.