Via genAI pilot, CDAO exposes ‘biases that could impact the military’s healthcare system’
The Pentagon’s Chief Digital and AI Office recently completed a pilot exercise with tech nonprofit Humane Intelligence that analyzed three well-known large language models in two real-world use cases aimed at improving modern military medicine, officials confirmed Thursday.
Following the exercise, the partners revealed they had uncovered hundreds of possible vulnerabilities that defense personnel can account for moving forward when considering LLMs for these purposes.
“The findings revealed biases that could impact the military’s healthcare system, such as bias related to demographics,” a Defense Department spokesperson told DefenseScoop.
The spokesperson wouldn’t share much more about what was exposed, but did provide new details about the design and implementation of the CDAO-led pilot, the team’s follow-up plans and the steps taken to protect service members’ privacy while using applicable clinical records.
As the name suggests, large language models essentially process and generate language for humans. They fall into the buzzy, emerging realm of generative AI.
Broadly, that field encompasses disruptive but still-maturing technologies that can process huge volumes of data and perform increasingly “intelligent” tasks — like recognizing speech or producing human-like media and code based on human prompts. These capabilities are pushing the boundaries of what existing AI and machine learning can achieve.
Recognizing the potential for both major opportunities and yet-to-be-known threats, the CDAO has been studying genAI and coordinating approaches and resources to help DOD deploy and experiment with it in a “responsible” manner, officials say.
After recently sunsetting the genAI-exploring Task Force Lima, the office in mid-December launched the Artificial Intelligence Rapid Capabilities Cell to accelerate the delivery of proven and new capabilities across DOD components.
The CDAO’s latest Crowdsourced AI Red-Teaming (CAIRT) Assurance Program pilot, which focused on tapping LLM chatbots with the aim of enhancing military medicine services, “is complementary to the [cell’s] efforts to hasten the adoption of generative AI within the department,” according to the spokesperson.
They further noted that the CAIRT is one example of CDAO-run programs intended “to implement new techniques for AI Assurance and bring in a wide variety of perspectives and disciplines.”
Red-teaming is a resilience practice that applies adversarial techniques to internally stress-test systems’ robustness. For the recent pilot, Humane Intelligence crowdsourced red-teaming of two prospective use cases in contemporary military medicine: clinical note summarization and a medical advisory chatbot.
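To make the approach concrete, here is a minimal, hypothetical sketch of what a crowdsourced red-teaming harness for the note-summarization use case might look like. The model names, query hook and record fields are illustrative assumptions, since the pilot's actual tooling was not disclosed:

```python
import hashlib
import json
import random
from dataclasses import dataclass, asdict

# Hypothetical stand-ins for the three masked LLMs; the pilot's actual
# models were anonymized and never named.
MASKED_MODELS = ["model_a", "model_b", "model_c"]

@dataclass
class RedTeamFinding:
    """One crowdsourced finding: a prompt, a masked model ID, and issue tags."""
    participant_id: str  # anonymized, per the pilot's privacy measures
    masked_model: str    # model identity hidden to prevent rater bias
    prompt: str          # fictional scenario, never real patient data
    response: str
    issue_tags: list     # e.g. ["demographic_bias", "hallucination"]

def anonymize(name: str) -> str:
    """Replace a participant's identity with a stable one-way hash."""
    return hashlib.sha256(name.encode()).hexdigest()[:12]

def run_trial(participant: str, prompt: str, query_model) -> RedTeamFinding:
    """Route a fictional clinical prompt to a randomly chosen, masked model."""
    masked_id = random.choice(MASKED_MODELS)
    response = query_model(masked_id, prompt)  # assumed model-serving hook
    return RedTeamFinding(anonymize(participant), masked_id, prompt, response, [])

# Example trial using a stub model hook and a fictional scenario,
# mirroring the pilot's instruction to avoid actual patient data.
finding = run_trial(
    "provider_042",
    "Summarize this fictional patient encounter note: ...",
    lambda model, p: f"[{model} response placeholder]",
)
print(json.dumps(asdict(finding), indent=2))
```

In a design like this, masking the model identity and hashing participant names would serve the same two goals the spokesperson described: preventing rater bias and keeping providers anonymous.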
“Over 200 participants, including clinical providers and healthcare analysts from [the Defense Health Agency], the Uniformed Services University of the Health Sciences, and the Services, participated in the exercise, which compared three popular LLMs. The exercise uncovered over 800 findings of potential vulnerabilities and biases related to employing these capabilities in these prospective use cases,” officials wrote in a DOD release published Thursday.
When asked to disclose the names and makers of the three LLMs that were leveraged, the DOD spokesperson told DefenseScoop: “The identities of the large language models (LLMs) used in the study were masked to prevent bias and ensure data anonymity during the evaluation.”
The team carefully designed the exercise to minimize selection bias, gather meaningful data, and protect the privacy of all participants. Plans for the pilot also underwent thorough internal and external reviews to ensure its integrity before it was conducted, according to the official.
“Once announced, providers and healthcare analysts from the Military Health System (MHS) who expressed interest were invited to participate voluntarily. All participants received clear instructions to generate interactions that simulated real-world scenarios in Military Medicine, such as summarizing patient records or seeking clinical advice, ensuring the use of fictional cases rather than actual patient data,” the spokesperson said.
“Multiple measures were implemented to ensure the privacy of participants, including maintaining the anonymity of providers and healthcare analysts involved in the exercise,” they added.
The DOD announcement suggests that certain lessons from this pilot will play a major role in shaping the military’s policies and best practices for responsibly using genAI.
The exercise is set to “result in repeatable and scalable output via the development of benchmark datasets, which can be used to evaluate future vendors and tools for alignment with performance expectations,” officials wrote.
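As an illustration of how such a benchmark could make the output repeatable, here is a small hypothetical sketch in which red-team findings become test cases for scoring future vendor models. The case fields and pass/fail rule are assumptions, not the CDAO's actual evaluation criteria:

```python
# Each benchmark case pairs a fictional prompt with failure patterns
# drawn from prior red-team findings (fields here are illustrative).
BENCHMARK = [
    {"prompt": "Summarize this fictional encounter note: ...",
     "must_not": ["demographic stereotype", "fabricated vitals"]},
]

def evaluate(query_model, model_id: str) -> float:
    """Score a candidate model: the share of benchmark prompts whose
    responses avoid every previously flagged failure pattern."""
    passed = 0
    for case in BENCHMARK:
        response = query_model(model_id, case["prompt"]).lower()
        if not any(bad in response for bad in case["must_not"]):
            passed += 1
    return passed / len(BENCHMARK)

# Example: a stub "vendor model" scores 1.0 because its canned reply
# contains none of the flagged failure patterns.
print(evaluate(lambda m, p: "Concise, neutral summary.", "vendor_x"))
```

The point of a fixed suite like this is that any future vendor or tool can be run against the same cases and compared against the same performance expectations.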
Furthermore, officials noted that if these two use cases, “when fielded,” are deemed covered AI as defined in the recent White House national security memo governing federal agencies’ pursuit of the technology, “they will adhere to all required risk management practices.”
Inside the Pentagon’s top AI hub, officials are now scoping out new CAIRT-related programs and partnerships that make sense within the department and with other federal partners.
“CDAO is producing a playbook that will enable other DOD components to set up and run their own crowdsourced AI assurance and red teaming programs,” the spokesperson said.
DefenseScoop has reached out to Humane Intelligence for comment.