
Why current AI evaluation frameworks are failing healthcare

Published on September 18, 2025 | 8 min read

Key takeaways

  • Patient-focused digital health evaluation models are often ill-suited for rapidly iterating technologies designed for healthcare professionals
  • The practically infinite response variations of generative AI models demand new, evidence-based taxonomies and evaluation approaches that go beyond traditional quality assurance methods
  • The healthcare sector urgently needs AI evaluation frameworks that are based on an evaluation consensus to overcome adoption barriers and prevent regulatory fragmentation

Healthcare systems worldwide are in crisis and face the unprecedented confluence of financial strain, a global shortage of healthcare workers, and the burden of noncommunicable diseases responsible for 74% of global mortality.1 Amidst these daunting challenges, digital health and artificial intelligence technologies (DHAITs) offer a glimmer of hope by promising more accessible, sustainable, efficient, and higher-quality care. However, DHAITs face significant barriers to adoption. To overcome these barriers, healthcare needs AI evaluation frameworks that are based on an evaluation consensus that remains flexible enough for variations in national implementation. In this way, innovations are assessed consistently without imposing a single global rulebook.

While much of the attention traditionally focuses on patient-facing tools, the digital solutions used by healthcare professionals (HCPs), including clinicians, nurses, managers, and administrators, are equally critical. These tools support risk analysis, screening, diagnosis, prognosis, treatment choices, and patient monitoring, and they have the potential to optimize workflows, reduce care variation, and improve provider efficiency.

However, despite these sky-high promises, healthcare providers and patients need significant training to use DHAITs, digital infrastructure constraints are widespread, and the rules and regulations governing the use of DHAITs are still emerging and evolving.2-4 Most critically, the sector lacks robust, context-sensitive evaluations that generate evidence demonstrating long-term value. Existing evaluation models, largely inherited from the pharmaceutical and medical technology sectors, are often ill-suited for iterative, adaptive, and fast-moving digital tools, especially those powered by AI.5 The result is a proliferation of short-lived solutions of unknown value that fail to scale and never truly realize their intended impact.

This is why an evaluation consensus for DHAITs, particularly those designed for healthcare professionals, is not merely advantageous but essential. Without it, we risk a splintered landscape within and across countries, hindering innovation, compromising safety, and ultimately failing to use these technologies to their full potential for the benefit of patients, HCPs, and health systems alike.

The unique landscape of HCP-facing digital health tools

One of the fundamental missteps in current evaluation paradigms is the general tendency to apply a single set of approaches to all digital health technologies, regardless of whether they are aimed at patients or professionals. Solutions for healthcare professionals require a distinctly different evaluation approach than patient-facing tools. When we develop tools for patients, the objective is typically to improve their health outcomes and ensure the highest standard of care. But when we develop something for HCPs, who are not there to be treated but to treat, our aims shift: we want to make their work easier and more efficient, and to improve their performance.

Current health technology assessment (HTA) protocols are predominantly patient-focused, making it challenging to evaluate and advance tools designed for professionals. Health systems now face a situation where more people live longer but do not necessarily remain healthy in old age. Healthcare professionals themselves have become the unfortunate bottleneck as patient volumes compound, necessitating the integration of non-human actors like AI into the healthcare system. However, appropriate AI evaluation frameworks and protocols are integral to understanding, developing, and deploying these new actors responsibly within the health system.

The nature of AI and the introduction of profound uncertainty

The introduction of AI further muddies the waters of evaluation. Unlike traditional digital health technologies or medical devices, AI components do not always behave consistently: the same input can produce different outputs. It can also be incredibly difficult to trace how an AI system arrives at a particular conclusion given particular inputs, introducing an undesirable amount of uncertainty in healthcare, where certainty and transparency are crucial.

Consider the stark difference between a radiological tool that screens MRIs for the discrete classification of “disease present” or “disease absent,” and a generative AI tool powered by a large language model (LLM) for psychiatric support. The imaging technology, despite its complex input data, is constrained to one of two outcomes, making quality assurance relatively straightforward. In contrast, an LLM can produce a practically infinite number of response variations. Ensuring the safety and quality of each of these variations becomes a nearly impossible task through traditional means.
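
To make this contrast concrete, here is a minimal sketch in Python, with all data and numbers invented for illustration. A two-outcome classifier can be audited exhaustively with standard metrics, while a generative model’s output space cannot even be enumerated.

```python
# Minimal sketch, not any specific product's method: why quality assurance is
# tractable for a two-outcome classifier but not for free-text generation.
import math

def binary_classifier_qa(predictions, labels):
    """Every output is one of two classes, so a labeled test set plus
    standard metrics summarizes performance completely."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    tn = sum((not p) and (not l) for p, l in zip(predictions, labels))
    fp = sum(p and (not l) for p, l in zip(predictions, labels))
    fn = sum((not p) and l for p, l in zip(predictions, labels))
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Hypothetical screening results: True = "disease present"
preds = [True, False, True, True, False]
truth = [True, False, False, True, False]
print(binary_classifier_qa(preds, truth))  # sensitivity 1.0, specificity ~0.67

# For an LLM the output space is unbounded: a 100-token response drawn from a
# ~50,000-token vocabulary admits roughly 10^470 possible sequences, so
# per-output verification is impossible and evaluation must rely on sampling.
print(f"~1e{100 * math.log10(50_000):.0f} possible 100-token responses")
```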

To address this complexity and bring clarity to the diverse world of HCP-facing DHAITs, we need an evidence-based taxonomy to disentangle the development process and highlight areas where risk and bias can be introduced. By systematically breaking down DHAITs along critical dimensions, we can develop a more thorough understanding of how technologies are built up, as well as develop targeted and appropriate evaluation requirements per dimension. AI evaluation frameworks based on a common foundation offer a consistent, nuanced, and deliberate set of requirements to understand the functionality, performance, and risks associated with AI and digital health technologies.
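
The article calls for such a taxonomy without fixing its dimensions, so the sketch below is purely hypothetical: every dimension, category, and requirement mapping is a placeholder meant to show how per-dimension evaluation requirements could be derived, not an actual proposed taxonomy.

```python
# Hypothetical illustration of a DHAIT taxonomy: dimensions and mappings are
# invented placeholders, not an agreed classification.
from dataclasses import dataclass
from enum import Enum

class TargetUser(Enum):
    PATIENT = "patient"
    HCP = "healthcare professional"

class Adaptivity(Enum):
    LOCKED = "locked algorithm"          # fixed after deployment
    ADAPTIVE = "continuously learning"   # changes post-deployment

class OutputSpace(Enum):
    DISCRETE = "discrete classification"  # e.g., disease present/absent
    GENERATIVE = "open-ended generation"  # e.g., LLM free text

@dataclass
class DHAITProfile:
    name: str
    target_user: TargetUser
    adaptivity: Adaptivity
    output_space: OutputSpace

    def evaluation_focus(self):
        """Map each taxonomy dimension to a targeted evaluation requirement."""
        reqs = []
        if self.target_user is TargetUser.HCP:
            reqs.append("workflow and performance endpoints rather than QALYs")
        if self.adaptivity is Adaptivity.ADAPTIVE:
            reqs.append("change-control plan for post-deployment updates")
        if self.output_space is OutputSpace.GENERATIVE:
            reqs.append("sampled, rubric-based safety review of outputs")
        return reqs

tool = DHAITProfile(
    name="triage-assistant",
    target_user=TargetUser.HCP,
    adaptivity=Adaptivity.ADAPTIVE,
    output_space=OutputSpace.GENERATIVE,
)
print(tool.evaluation_focus())
```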

A call for AI evaluation frameworks that share a common ground

Currently, regulatory authorities generally align with the International Medical Device Regulators Forum (IMDRF) framework for classifying medical devices, which covers DHAITs in the form of Software as a Medical Device (SaMD) or digital medical devices. However, variations persist across jurisdictions, with countries developing their own classification frameworks (e.g., NICE’s three-tier framework in the UK, HAS’s fourth tier for autonomous decision-making in France, and the FDA’s delineation of clinical decision support tools in the USA).

The problem is that this piecemeal approach leads to a lack of consensus on how to interpret the different risks associated with AI, which is fundamentally a normative question about the values by which we want to govern these technologies. Without a unified AI evaluation framework, jurisdictions develop regulatory solutions on their own, which risks fragmentation and large variability from country to country. This can hinder technology deployment, rollout, and comparability across regions, since regulatory regimes end up built on fundamentally different foundations.

An international classification framework is crucial because it can provide risk classifications and taxonomies that can be consistently implemented across countries. While national variations in benchmarks and evaluation thresholds will and should remain, reflecting the individuality of health systems in different countries and regions, there is significant benefit if the underlying classification system is based on a unified theory of knowledge. Failure to achieve this risks a splintered approach in which some countries fall significantly behind, losing their ability to compete at the global level. A unified framework is about striking a balance: realizing the public benefits of AI while ensuring market feasibility for developers, all while setting a consistent standard for risk and quality.

Predetermined Change Control Plans (PCCPs) are a good example of this balance. Pioneered in the US and mirrored in Canada and South Korea, PCCPs resemble proposals in the EU AI Act and allow pre-approved regulatory clearance of predefined algorithmic modifications within clear boundaries. PCCPs promise to reduce or avoid the time-consuming process of repeated regulatory submissions for updates that are predicted a priori and do not affect performance.6 Jurisdictions that do not adopt PCCPs risk longer re-clearance cycles for routine updates, eroding competitiveness relative to peers that enable pre-authorized modifications.6,7
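
As a thought experiment, the PCCP logic can be expressed as a pre-declared envelope of permitted changes plus an automated boundary check. Everything in the sketch below, including field names, thresholds, and change types, is a hypothetical illustration, not regulator-specified parameters.

```python
# Hypothetical sketch of the PCCP mechanism: an update is cleared without a
# new submission only if it stays inside a pre-authorized change envelope.
from dataclasses import dataclass

@dataclass
class ChangeEnvelope:
    """Boundaries agreed with the regulator up front (all values invented)."""
    allowed_change_types: set        # e.g., retraining on new site data
    min_sensitivity: float           # performance floor that must still hold
    min_specificity: float
    max_population_shift: float      # tolerated drift in intended population

@dataclass
class ProposedUpdate:
    change_type: str
    sensitivity: float
    specificity: float
    population_shift: float

def within_pccp(update: ProposedUpdate, envelope: ChangeEnvelope) -> bool:
    """True -> proceeds under the PCCP; False -> full regulatory review."""
    return (
        update.change_type in envelope.allowed_change_types
        and update.sensitivity >= envelope.min_sensitivity
        and update.specificity >= envelope.min_specificity
        and update.population_shift <= envelope.max_population_shift
    )

envelope = ChangeEnvelope({"retrain_same_architecture"}, 0.95, 0.90, 0.05)
update = ProposedUpdate("retrain_same_architecture", 0.96, 0.92, 0.02)
print(within_pccp(update, envelope))  # True: inside the pre-cleared envelope
```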

The evaluation of DHAITs demands a paradigm shift in how we generate and interpret evidence. Randomized controlled trials (RCTs) remain the gold standard, but their suitability for rapidly iterating digital technologies is limited.8 Similarly, for economic evidence, the traditional use of quality-adjusted life-years (QALYs) is often inappropriate for HCP-facing tools, as their benefits are often indirect or unrelated to the health outcomes of the patient.
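
A toy calculation makes the QALY mismatch visible; all numbers below are invented for illustration only.

```python
# Why QALYs fit patient-facing tools but miss HCP-facing value (toy numbers).

def qalys(utility: float, years: float) -> float:
    """QALYs = health-state utility (0-1) x years spent in that state."""
    return utility * years

# Patient-facing intervention: raises patient utility -> measurable QALY gain.
gain = qalys(0.85, 5) - qalys(0.70, 5)
print(f"Patient-facing tool: {gain:.2f} QALYs gained per patient")

# HCP-facing tool: frees clinician time, but patient utility is unchanged,
# so the QALY delta is zero even though real value (capacity) exists.
minutes_saved_per_day = 30  # hypothetical workflow benefit
print(f"HCP-facing tool: {qalys(0.70, 5) - qalys(0.70, 5):.2f} QALYs gained, "
      f"yet {minutes_saved_per_day} min/day of clinician time freed")
```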

It is also worth noting that no single international body has yet taken on the role of leading these efforts for AI, unlike the IMDRF for medical devices. Entities like the World Health Organization, the World Economic Forum, or the World Bank, or ideally a combination thereof, must step up to bring together regional initiatives. The message to regulators is clear: we need your help in unifying classification to avoid a splintered approach that could leave countries behind in global competitiveness. Ultimately, proper evaluation is a prerequisite to building trust in AI deployment within health systems and advancing the creation of AI-appropriate funding and reimbursement mechanisms.

Forging a path forward for AI evaluation frameworks

As we navigate the hype and reality of AI development, we must resist the urge to prematurely declare these technologies market-ready based on insufficient testing. The potential for DHAITs to transform healthcare is immense, but only if we approach their development, classification, and evaluation with the rigor, nuance, and global collaboration they demand. Our collective responsibility is to ensure that these powerful tools truly serve to make healthcare more accessible, sustainable, efficient, and of higher quality for all.



Contributors


Robin van Kessel, PhD

André Hoffmann Fellow, London School of Economics and Political Science

Robin van Kessel is the André Hoffmann Fellow on Health System Financing and Payment Models at LSE Health and the World Economic Forum. At the LSE, he co-founded and leads the digital health research unit with a particular focus on the regulation, implementation, and financing of digital health and AI technologies. He holds a PhD in Comparative Health Policy from Maastricht University. His main research portfolio focuses on the intersection of digital health and artificial intelligence, health systems and policy, and health inequalities. Dr van Kessel’s work is published in leading medical and health policy journals such as npj Digital Medicine, The Lancet Digital Health, The Lancet Regional Health - Europe, the Bulletin of the World Health Organization, and The BMJ.


References

  1. World Health Organization. (2025). Available from https://cdn.who.int/media/docs/default-source/documents/about-us/general-programme-of-work/global-health-strategy-2025-2028.pdf?sfvrsn=237faeeb_3 [Accessed September 2025]
  2. Almyranti M et al. (2024). OECD Artificial Intelligence Papers, 28. Available from https://www.oecd.org/en/publications/artificial-intelligence-and-the-health-workforce_9a31d8af-en.html [Accessed September 2025]
  3. Van Kessel R et al. (2022). PLOS Digital Health, 1(2), e0000013. Available from https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000013 [Accessed September 2025]
  4. Schmidt J et al. (2024). npj Digital Medicine, 7, 229. Available from https://doi.org/10.1038/s41746-024-01221-6 [Accessed September 2025]
  5. Gerke S et al. (2020). npj Digital Medicine, 3, 1. Available from https://pubmed.ncbi.nlm.nih.gov/32285013/ [Accessed September 2025]
  6. U.S. Food & Drug Administration. (2024). Available from https://www.fda.gov/regulatory-information/search-fda-guidance-documents/marketing-submission-recommendations-predetermined-change-control-plan-artificial-intelligence [Accessed September 2025]
  7. EUR-Lex. (2024). Available from https://eur-lex.europa.eu/eli/reg/2024/1689/oj [Accessed September 2025]
  8. Guo C et al. (2020). npj Digital Medicine, 3, 110. Available from https://www.nature.com/articles/s41746-020-00314-2 [Accessed September 2025]