Should Diplomats Trust AI?
A recent report from Harvard’s Belfer Center suggests that AI may soon be capable of performing most of a diplomat’s job, including policy analysis, negotiation, crisis response, writing talking points, news assessment, managing diplomatic protocol, translation, and processing visa applications. If you believe the author, this is a great opportunity for diplomacy. Other commentators even suggest going all-in on AI is a strategic imperative.
But the report also contains a startling admission: “Artificial intelligence will further intensify diplomats’ struggle for relevance and could reduce the workforce of foreign affairs ministries to a bare minimum within the next fifteen to twenty years.” This serves as a warning to diplomats: AI is coming for your job.
It was thus a surprise to me when the Foreign Service Journal – “the voice of the Foreign Service” – published an edition dedicated to AI that largely extolled its usefulness (Note: I am a member of the FSJ’s editorial board). I have great respect for the authors, and there is obviously enormous opportunity with these new tools. But the excitement around AI risks causing us to overlook the many ways it can go badly wrong. Such failures could result in catastrophic damage to American diplomacy and national security.
Nobody can say for sure how this will all turn out. Some of the advances in AI are truly extraordinary. But we need to be prudent and examine the assumptions underlying this debate. Proponents of AI are making three big bets that require scrutiny:
1. AI can help make us better and faster.
2. The technology is going to continue to get better.
3. We can make AI both superintelligent and super-safe.
Each of these might be right. But each is also a critical failure point.
Will AI make us better and faster?
Maybe! But, then again, maybe not.
When people speak about AI these days, they’re usually referring to generative AI powered by Large Language Models (LLMs). LLMs that run tools like ChatGPT are “giant statistical prediction machines that repeatedly predict the next word in a sequence,” according to IBM. But such tools offer only “the illusion of thinking.” LLMs are an impressive technology, but they also present serious reliability concerns.
There are some settings in which LLMs seem to thrive. For instance, customer service agents are found to be much more productive when supported by AI. This job works well for LLMs because there are millions of transcripts of human-led support calls used to train the model to be fast and accurate. And the domain is not too complicated; there’s usually a right answer to any support question.
But foreign policy is a hard environment for AI because the domain is complex, contextual, and dynamic. These features create problems for LLMs. A literature review finds that while “LLMs have demonstrated significant potential in parsing and analyzing extensive datasets to identify patterns, predict future events, and detect anomalous behavior across various domains,” it also identifies critical challenges, including that LLMs are weak at applying discovered patterns to novel situations, such as generating policy solutions. Another study concludes that LLMs are “bad at declining to answer questions they couldn’t answer accurately, offering incorrect or speculative answers instead.” When the ability of popular tools to provide accurate citations was evaluated, they failed, on average, 60% of the time. Grok 3 failed an eye-popping 94% of the time. If one lacks the specific expertise and workflow to detect errors, hallucinations can go unnoticed, leading to disastrous results.
The solution to these challenges, many AI proponents suggest, is to place a “human in the loop.” The human partners with AI to monitor the output and catch any mistakes. The promise of the human-in-the-loop model is that each diplomat could do more, better, and faster. Imagine an expert diplomat, assisted by AI, covering a broader portfolio because it’s so much easier to summarize wider swaths of information. Better yet, imagine a single officer replacing an entire country desk. Oh, the efficiency and coordination improvements we could achieve!
But the evidence so far is sobering. A 2024 meta-analysis found that, on average, AI-human collaborations performed “significantly worse” than the best of the humans or the AI working alone. That said, the collaborations outperformed the average humans in the studies. The results depended on the task: AI-human teams were very poor at decision tasks but very good at content creation, for example.
A 2025 randomized trial of experienced software developers found that those partnered with AI believed they were working 20% faster – but testing showed they were actually taking 19% longer to complete tasks. In other words, partnering with AI produced overconfident users who did not know they were being slowed down; they were tricked into thinking their performance improved! This effect seems to be playing out at the organizational level as well: A July 2025 report from MIT NANDA found that “despite $30–40 billion in enterprise investment into GenAI… 95% of organizations are getting zero return.”
Again, what’s so pernicious about AI tools is that they feel so convincing. Even when they’re wrong, they seem right. This is why I remain quite wary of using AI for tasks where the user can’t personally validate the quality of the answer the machine produces. Miles Williams, in an article on AI and international affairs, says it well:
“using AI tools well is not really a skill; it’s a second-order implication of having skills. The people I see using AI effectively in their work today are people who developed expertise and skills before AI tools were readily available… they know how to double-check and refine what their AI agents produce because they already know how to do the tasks they outsourced.”
In other words, well-trained diplomats remain a necessary ingredient for the State Department to use AI effectively (even aside from the fact that it might just slow them down). Unfortunately, as I have written about extensively in this Substack, the State Department does a poor job of prioritizing expertise.
This leads us to some tricky labor questions. The threat of AI to workers has led many unions, prominently including the Writers Guild of America, to prioritize AI protections for their labor force. But at the State Department, the experts best positioned to fact-check the AI output may also be the most expendable. Pressure to pursue efficiency gains may result in career officials being sidelined, and their historical knowledge may be perceived as the easiest to replicate with AI. In contrast, the more subjective decision-making authority vested in political appointees may protect them. Indeed, this is already happening. Why task a costly, slow-thinking human with a job when one can just query ChatGPT?
There is also a dimension of this debate that receives insufficient attention: CSIS’s Futures Lab tested leading AI models against 400 diplomatic crisis scenarios and found evidence of systematic escalation bias. They reported that AI was “hawkish to a fault.” In other words, the more we rely on AI, the more likely it is that we end up in unnecessary conflict. Meanwhile, our adversaries are building AI too. Chinese LLMs like DeepSeek and Qwen exhibit similar escalation bias, particularly toward Western nations. If American diplomats are consulting tools with one set of biases while Chinese counterparts use tools with a different, potentially conflicting set, the diplomatic risks compound dangerously.
We are also discovering that the use of AI appears to cause cognitive decline. A study from Switzerland of nearly 700 people found “a significant negative correlation between frequent AI tool usage and critical thinking abilities.” An MIT study used electroencephalography to demonstrate that people literally power down their brains when using AI, eroding their critical thinking skills. A Microsoft Research and Carnegie Mellon survey of 319 knowledge workers confirmed the pattern: “higher confidence in GenAI is associated with less critical thinking.” AI also appears to damage our intellectual autonomy through “a process of gradual de-skilling, where we lose skills that we currently take for granted.” Other research offers similar warnings: AI usage triggers “metacognitive laziness,” declining motivation in one’s work, and reductions in effort and confidence.
Two scholars issue a stark warning in a new paper, How AI Destroys Institutions:
“AI systems are built to function in ways that degrade and are likely to destroy our crucial civic institutions. The affordances of AI systems have the effect of eroding expertise, short-circuiting decision-making, and isolating people from each other. These systems are anathema to the kind of evolution, transparency, cooperation, and accountability that give vital institutions their purpose and sustainability. In short, current AI systems are a death sentence for civic institutions, and we should treat them as such.”
Will the technology get better?
Definitely. But how much better? If you believe proponents like Elon Musk, AI will become “smarter than any human by the end of this year.” But others are skeptical, suggesting the next breakthrough may be decades away. Other leaders in the field believe that LLMs are a dead end and are pivoting their focus to other technological pathways.
LLMs are not magic; they face constraints from energy use, processing power, the limited availability of training data, and the inherent limitations of the underlying technology. Without getting into the technical details of the challenge of scaling LLMs, let’s use an analogy: One might have believed that engineers could keep making internal combustion engines more efficient indefinitely. But that assumption was wrong. It turns out that the laws of physics get in the way. The same may be true for AI.
Most likely, AI will get better at certain tasks and not others. It may thrive at balancing our accounting books but struggle to adjudicate visa interviews accurately. Or it is possible I am entirely wrong, and the sky is the limit for LLMs. Indeed, some very smart people believe that LLMs will continue to improve, quickly achieving artificial general intelligence (AGI) and surpassing humans in virtually all domains.
Will superintelligent AI be safe?
There’s a parable in the AI community about a robot trained to make paperclips. At first, the robot hums along, making piles of paperclips. But it soon runs out of supplies at its factory, and realizes that to achieve its paperclip-making work, it needs to get more materials. So it first dismantles the factory walls, turning its surroundings into paperclips. Then it starts gobbling up everything in sight, turning the whole city into new paperclips. The robot deems the terrified humans trying to shut down production as bad for productivity, so it gobbles them up too. Soon, nothing is left except the robot and a planet’s worth of paperclips.
Science fiction movies feature superintelligent robots who hatch schemes to wipe out the human race. That’s plausible. But many AI safety experts believe the more likely scenario is an AI that, like the paperclip robot, single-mindedly pursues its goal without realizing the implications of its actions. The challenge of keeping an AI’s goals compatible with human interests is called the “alignment problem.”
Despite safety concerns, government agencies and private companies are advancing AI development and integration faster than society can understand the implications of what we are building. Some policymakers responsible for AI governance and regulation have little technical knowledge of the tools they implement. In national security, big questions about safety are being set aside in the rush to ensure technological superiority over our adversaries. The private sector, whose leaders have a substantial financial interest in rapid adoption, is spending heavily to persuade policymakers – and especially the national security establishment – that their AI tools are essential, inevitable, and safe. Maybe they are right. But independent evaluators are giving the private sector poor grades on AI safety. In their rush to capture market share, companies are demonstrating a “striking lack of commitment to many areas of safety.”
The result is that some of the most prominent AI experts, like the scientists who built the atomic bomb, are terrified by what they built and call AI an “existential risk.” They worry about things like a computer starting a nuclear war that will destroy the planet, hatching a bioweapon, or some other terrible thing that we haven’t even imagined. In response, AI optimists suggest it is more likely that AI will deliver us into a utopia in which humans achieve “unparalleled abundance” and an end to scarcity and war.
The Takeaway
AI holds great promise. But the utility of AI in international relations rests on several shaky assumptions. If any one of these assumptions fails, the whole AI edifice may crumble. Meanwhile, the field of diplomacy seems poorly positioned to shape the rewards and risks to our profession on this critical issue. I hope that begins to change.
As always, I welcome your feedback. If there is research here that I have missed, please share it. And I would love to hear from professionals inside and outside government about your experiences so far with AI.





Not all the articles extolled the benefits of AI, Dan.
- Ian Hopper, article-writer.
Excellent piece Dan, well researched and thought out.
I have been tracking this issue for close to a year: how it plays out, what its impact will be, and how it is being deployed throughout the diplomatic ecosystem around the world, whether in ministries of foreign affairs or IGOs.
In addition to the issues you have pointed out, there are implementation gaps that are precursors to the broader problems we will face collectively as adoption spreads faster.
I’m not going to crowd your comment section, but suffice it to say that the gap between the idealized use cases used to illustrate this technology and the way it is actually being used and will be used, whether by individual diplomats and officers in the field or by organizations, is going to throw a wrench in the assumptions and expected outcomes.