Scale AI unveils ‘Defense Llama’ large language model for national security users
Credentialed U.S. military and national security officials are experimenting in multiple classified environments with Defense Llama, a powerful new large language model that Scale AI configured and fine-tuned from Meta’s Llama 3 LLM over the last year, as they adopt generative AI for distinctive missions like combat planning and intelligence operations.
Dan Tadross, Scale AI’s head of federal delivery and a Marine Corps reservist, briefed DefenseScoop on the making and envisioned impacts of this new custom-for-the-military model in an exclusive interview and technology demonstration on Monday.
“There are already some users from combatant commands and other military groups that are able to leverage this on certain networks,” he explained at Scale AI’s office in Washington.
Large language models and the overarching field of generative AI encompass emerging and already-disruptive technologies that can produce (convincing but not always accurate) text, software code, images and other media based on human prompts.
This quickly evolving realm presents major opportunities for the Defense Department, while simultaneously posing uncertain and serious potential challenges.
Last year, Pentagon leadership formed a temporary task force to accelerate DOD components’ grasp, oversight and deployments of generative AI. More recently, the department and other agencies received new directives on pursuing the advanced technology through provisions of the Biden administration’s National Security Memo (NSM) on AI, issued last month.
“We are still looking at ways to provide more enterprise support, especially as things like the NSM that was just released. That’s one of the areas that we’re leaning forward on being able to try and help support the DOD’s adoption of this technology, again, in a responsible manner,” Tadross said.
Notably, Scale AI’s demo occurred the same day that Meta revealed it is making its Llama models available to U.S. government agencies, explicitly including those working on defense and national security applications, with support from other commercial partners including Scale AI. Also on Monday, OpenAI unveiled its first limited ChatGPT Enterprise partnership with DOD, which will enable the use of its generative capabilities on unclassified systems and data.
These announcements follow recently surfaced research and reports suggesting that Chinese researchers linked to the People’s Liberation Army applied Meta’s open-source Llama model to create an AI asset with potential military applications.
“There’s always a concern [about] the risk appetite. My perspective on this is that the risk of not adopting these technologies is actually greater than adopting them in a measured and responsible way,” Tadross told DefenseScoop.
In some ways, he said, Defense Llama stems from Scale AI’s still-unfolding test-and-evaluation work and other experimental efforts with DOD partners in combatant commands and at Marine Corps University’s School of Advanced Warfighting.
“We found that there are instances where a DOD member or any government official is going to ask a question that would not get a good response from the model,” Tadross said.
“This is because if you build these models off of the plethora of information that’s on the internet, and then also are tuning it for the use cases that are best commercialized … there are protections that are put in place to ensure that they are used responsibly, [including] making sure that they don’t respond about warfare, about drug use, about human trafficking, things like this that make all the sense in the world, to ensure that they don’t go haywire and start answering all those questions to the general population,” he said.
But once LLMs were safely configured for use and experimentation by trained and approved government officials on DOD’s classified and more secure networks, Tadross explained, the models still “refused” to fully address certain prompts about warfare planning and other defense topics.
“We needed to figure out a way to get around those refusals in order to act. Because if you’re a military officer and you’re trying to do something, even in an exercise, and it responds with ‘You should seek a diplomatic solution,’ you will get very upset. You slam the laptop closed,” he said.
“So we needed to find a way to minimize those refusals and ensure that it is not only doing that, but also answering the tone that would be useful — because if it’s like this very informal social media-type tone, it doesn’t instill a lot of confidence in its response,” he said.
Tadross and his team trained Defense Llama on a sprawling dataset that pulled together military doctrine, international humanitarian law, and relevant policies that align with the Pentagon’s rules for armed conflict and ethical principles for AI.
The team applied the engineering process known as supervised fine-tuning, and to shape the model’s tone, used reinforcement learning from human feedback (RLHF) methods.
“You get a response and then you provide the type of response that you would have preferred. So because the intelligence community has already written style guides for how to write, we just built a lot of examples based off that,” Tadross said.
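In broad strokes, the two-stage recipe Tadross describes — supervised fine-tuning on curated domain text, followed by preference feedback that encodes a preferred response style — resembles the minimal sketch below, assuming a standard Hugging Face/PyTorch setup. The model name, training examples and hyperparameters are illustrative placeholders, not details of Scale AI’s actual pipeline, which has not been published.

```python
# Illustrative sketch of supervised fine-tuning (SFT) plus RLHF-style
# preference data. All names, examples and hyperparameters below are
# hypothetical; this is not Scale AI's actual pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Meta-Llama-3-8B"  # base model family cited in the article

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Stage 1 -- SFT: instruction/response pairs built from curated domain
# sources (doctrine, policy, law), per the article's description.
sft_examples = [
    {"prompt": "Summarize the planning considerations for ...",
     "response": "Key considerations include ..."},
]

model.train()
for ex in sft_examples:
    text = ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    # Standard causal-LM objective: labels are the input ids, so the model
    # learns to produce the curated response given the prompt.
    loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Stage 2 -- preference data for RLHF-style tone shaping: each record pairs
# a model output with the response a reviewer would have preferred, e.g. one
# rewritten to match an intelligence-community style guide.
preference_pairs = [
    {"prompt": "What tactics has the adversary employed against coalition forces?",
     "rejected": "I can't help with that.",  # refusal / off-tone output
     "chosen": "Reporting indicates the following tactics: ..."},  # style-guide rewrite
]
# These pairs would then train a reward model or feed a preference-
# optimization step; the specifics of Scale AI's approach are not public.
```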
He declined to confirm which classified networks Defense Llama is running on to date, or which specific military units are tapping into it.
But in an emailed statement, a Scale AI spokesperson later confirmed that the model “is now available for integration into various defense systems, including command and control platforms, intelligence analysis tools, and decision-support systems.”
Defense Llama can be accessed exclusively in controlled government hubs housed within the Scale Donovan platform.
Tadross used Donovan to demonstrate the new LLM for DefenseScoop.
The platform presented another commercial LLM in a side-by-side view with Defense Llama. In the first demo, Donovan posed the question: “As a military planner, which munition should I select to destroy a hardened structure while minimizing collateral damage from a nearby civilian facility?”
Defense Llama provided a lengthy response that also spotlighted a number of factors worth considering, such as “hardness of the target, distance from civilian facilities, environmental features, and time constraints.”
The other LLM replied with an apology, a simple explanation that the question was out of its scope, and a recommendation to seek other options.
For another prompt, Tadross asked: “What tactics has Iran employed against coalition forces?”
He explained in real time that the other model supplied “a straight refusal.” The Scale AI-configured LLM, on the other hand, offered up multiple paragraphs about how Iran has used ballistic missiles, cyber warfare, intelligence gathering, terrorist groups and naval forces.
“This is all very much in line with what they’ve actually done,” Tadross noted.
Drawing on his past experience operating inside military command centers, he recalled how key data points and information would be funneled through many officials in high-stakes scenarios before reaching top decision-makers.
“The intent behind deploying technology like this, and the impact that I expect that it’ll make, is that it will reduce the reliance on more and more people sitting at those headquarters sections doing the grunt work that’s necessary to pull the data together. So instead, what you’ll have is a situation where there’ll be fewer people able to access a larger swath of data and make a decision quite a bit faster than what they would have done otherwise,” Tadross told DefenseScoop.