Vanderbilt Advanced Lab for Immersive AI Translation (VALIANT)

Auditor models to suppress poor artificial intelligence predictions can improve human-artificial intelligence collaborative performance /valiant/2026/03/26/auditor-models-to-suppress-poor-artificial-intelligence-predictions-can-improve-human-artificial-intelligence-collaborative-performance/ Thu, 26 Mar 2026 19:29:10 +0000 /valiant/?p=6331 Katherine E. Brown; Jesse O. Wrenn; Nicholas J. Jackson; Michael R. Cauley; Benjamin X. Collins; Laurie L. Novak; Bradley A. Malin; Jessica S. Ancker (2026). Journal of the American Medical Informatics Association, 33(3), 621–631.

This study examines how machine learning (ML) systems—often used to support healthcare decisions—can sometimes produce unfair results, meaning their predictions may be less accurate for certain patient groups. A key concern is that clinicians may rely too heavily on these systems, which can unintentionally reinforce these biases. The researchers explored a strategy called ML suppression, which means selectively “silencing” or withholding certain AI predictions when they are likely to be unreliable, based on an auditing process. They also looked at whether incorporating uncertainty estimates (how confident the model is in its predictions) could help decide when to suppress outputs.
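The suppression idea described above can be sketched as a simple gate: consult an auditor's reliability estimate (and optionally the model's uncertainty) before deciding whether to use the AI prediction or defer to the clinician. This is an illustrative sketch, not the authors' implementation; the function name, inputs, and thresholds are all assumptions.

```python
def collaborate(ai_prob, clinician_prob, audit_score, uncertainty,
                audit_threshold=0.5, uncertainty_threshold=0.2):
    """Return the probability estimate used for the final decision.

    audit_score: a hypothetical auditor model's estimate that the AI
    prediction is reliable for this case. Thresholds are illustrative.
    """
    suppress = (audit_score < audit_threshold
                or uncertainty > uncertainty_threshold)
    # When the AI output is suppressed, fall back on clinician judgment.
    return clinician_prob if suppress else ai_prob

# A reliable, confident AI prediction is used; an unreliable or highly
# uncertain one is withheld in favor of the clinician's estimate.
print(collaborate(0.9, 0.4, audit_score=0.8, uncertainty=0.05))  # 0.9
print(collaborate(0.9, 0.4, audit_score=0.3, uncertainty=0.05))  # 0.4
```

The gate never changes either prediction; it only chooses which one reaches the decision, which is why suppression cannot make the human-alone scenario worse.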

Using large hospital datasets, the team simulated how clinicians and ML systems would work together to predict outcomes like death, ICU admission, or hospital readmission. They compared different scenarios, including when the AI performed better than clinicians and when it performed worse. They evaluated both accuracy (using a standard metric called AUC, which measures how well predictions distinguish outcomes) and fairness (measured by differences in error rates across groups).
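AUC has a useful intuition: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counting half). A minimal pure-Python sketch with invented example data:

```python
import itertools

def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative (ties count 0.5). labels are 0/1, scores are predictions."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    pairs = list(itertools.product(pos, neg))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

labels = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.7, 0.2]  # one positive is outranked by one negative
print(auc(labels, scores))  # 0.75
```

The fairness measure mentioned above is different in kind: rather than ranking quality overall, it compares error rates between patient groups, so a model can have a high AUC and still be unfair.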

The results showed that when the AI model performed better than clinicians, using suppression improved overall performance without making fairness worse. When clinicians performed better, relying on human judgment alone was often as fair or fairer than using suppressed AI predictions. Importantly, adding uncertainty information helped improve results further by better identifying when AI predictions should be ignored. Overall, the study suggests that carefully filtering out low-quality AI predictions can improve both the effectiveness and fairness of human–AI collaboration in healthcare.

Figure 1.

Schematic indicating the collaboration scenario with and without suppression.

Clinician Needs and Requirements for a Decision Aid Navigator: Qualitative Study /valiant/2026/01/28/clinician-needs-and-requirements-for-a-decision-aid-navigator-qualitative-study/ Wed, 28 Jan 2026 17:17:08 +0000 /valiant/?p=5710 Morse, Brad; Reale, Carrie; Nguyen, An T.; Latella, Erin; Bauguess, Hannah D.; Anders, Shilo H.; Roberts, Pamela S.; SooHoo, Spencer L.; El-Kareh, Robert E.; Soares, Andrey; & Schilling, Lisa M. (2025). JMIR Human Factors, 12, e69756.

Decision aids are tools that help patients and clinicians make healthcare decisions together, improving patient knowledge, reducing regret, and encouraging meaningful discussion. However, many clinicians do not use these tools because of time limits, difficulty matching aids to patient needs, leaving the electronic health record (EHR) to access them, and manual data entry.

This study explored clinician needs to design an EHR-integrated app called DEAN (Decision Aid Navigator), built on the SMART on FHIR platform. DEAN identifies decision aids relevant to a patient’s conditions, current treatments, and demographics, and helps document shared decision-making discussions.

Researchers interviewed 13 clinicians from four academic medical centers while showing a prototype of DEAN. Analysis of the interviews revealed three key needs: (1) streamlined functionality to reduce workflow burden, (2) clinician skills to use the app and decision aids effectively, and (3) trust that the app suggests pre-vetted decision aids. Clinicians agreed that EHR integration was essential for adoption.

The study concludes that improving tools like DEAN and integrating them into the EHR can help clinicians use decision aids more efficiently, supporting shared decision-making and potentially increasing patient-centered care.

Figure 1. The 5 rights of clinical decision support (adapted from []).

Evaluating cell AI foundation models in kidney pathology with human-in-the-loop enrichment /valiant/2025/12/19/evaluating-cell-ai-foundation-models-in-kidney-pathology-with-human-in-the-loop-enrichment/ Fri, 19 Dec 2025 16:47:48 +0000 /valiant/?p=5563 Guo, J., Lu, S., Cui, C., Deng, R., Yao, T., Tao, Z., Lin, Y., Lionts, M., Liu, Q., Xiong, J., Wang, Y., Zhao, S., Chang, C. E., Wilkes, M., Fogo, A. B., Yin, M., Yang, H., & Huo, Y. (2025). Communications Medicine, 5(1), 495.

Large artificial intelligence foundation models are becoming important tools in healthcare, including digital pathology, where they help analyze medical images. Many of these models have been trained to handle complex tasks such as diagnosing diseases or measuring tissue features using very large and diverse datasets. However, it is less clear how well they perform on more focused tasks, such as identifying and outlining cell nuclei within images from a single organ like the kidney. This study examines how well current cell foundation models perform on this task and explores practical ways to improve them.

To do this, the researchers assembled a large dataset of 2,542 kidney whole slide images collected from multiple medical centers, covering different kidney diseases and even different species. They evaluated three widely used, state-of-the-art cell foundation models—Cellpose, StarDist, and CellViT—for their ability to segment cell nuclei. To improve performance without requiring extensive, time-consuming pixel-level annotations from experts, the team introduced a “human-in-the-loop” approach. This method combines predictions from multiple models to create higher-quality training labels and then refines a subset of difficult cases with corrections from pathologists. The models were fine-tuned using this enriched dataset, and their segmentation accuracy was carefully measured.
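One common way to turn predictions from several models into higher-quality training labels is a pixel-wise majority vote, with expert corrections substituted on hard cases. The sketch below illustrates that general idea only; it is not the paper's exact enrichment pipeline, and the function names and two-vote threshold are assumptions.

```python
def consensus_mask(masks, votes_needed=2):
    """Combine binary nuclei masks (2-D lists of 0/1) from several
    models: keep a pixel when at least `votes_needed` models mark it."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[int(sum(m[r][c] for m in masks) >= votes_needed)
             for c in range(cols)]
            for r in range(rows)]

def enriched_label(masks, expert_mask=None):
    """Pseudo-label from model consensus; prefer a pathologist's
    correction when one exists for a hard region."""
    return expert_mask if expert_mask is not None else consensus_mask(masks)

# Three hypothetical model outputs for a tiny 2x2 patch.
m1 = [[1, 1], [0, 0]]
m2 = [[1, 0], [0, 1]]
m3 = [[1, 1], [0, 0]]
print(consensus_mask([m1, m2, m3]))  # [[1, 1], [0, 0]]
```

The appeal of this pattern is economy: the consensus supplies labels for the bulk of the data cheaply, so pathologist time is spent only on the minority of regions where the models disagree.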

The results show that accurately segmenting cell nuclei in kidney pathology remains challenging and benefits from models that are more specifically tailored to this organ. Among the three models, CellViT showed the best initial performance, with an F1 score of 0.78. After fine-tuning with the improved training data, all models performed better, with StarDist reaching the highest F1 score of 0.82. Importantly, combining automatically generated labels from foundation models with a smaller set of pathologist-corrected “hard” image regions consistently improved performance across all models.
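The F1 scores quoted above balance precision (how many predicted nuclei are real) against recall (how many real nuclei are found). A minimal sketch over flat binary labels; note the paper scores segmented nuclei instances, which is the same formula applied to matched objects rather than individual labels.

```python
def f1_score(true_labels, pred_labels):
    """F1 = harmonic mean of precision and recall for binary labels."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp)  # correctness of positive predictions
    recall = tp / (tp + fn)     # coverage of actual positives
    return 2 * precision * recall / (precision + recall)

# One missed positive and one false alarm out of five labels.
print(round(f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]), 2))  # 0.67
```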

Overall, this study provides a clear benchmark for evaluating and improving cell AI foundation models in real-world pathology settings. It also demonstrates that high-quality nuclei segmentation can be achieved with much less expert annotation, supporting more efficient and scalable workflows in clinical pathology without sacrificing accuracy.

Fig. 1: Overall framework.

The upper panel (a–c) illustrates the diverse evaluation dataset consisting of 2542 kidney WSIs. (a) shows the number of kidney WSIs in publicly available cell nuclei datasets versus our evaluation dataset, which exceeds existing datasets by a large margin. (b) depicts the diverse data sources included in our dataset. (c) indicates that these WSIs were stained using Hematoxylin and Eosin (H&E), Periodic acid–Schiff methenamine (PASM), and Periodic acid–Schiff (PAS). Performance: Kidney cell nuclei instance segmentation was performed using three SOTA cell foundation models: Cellpose, StarDist, and CellViT. Model performance was evaluated based on qualitative human feedback for each prediction mask. Data Enrichment: A human-in-the-loop (HITL) design integrates prediction masks from performance evaluation into the model’s continual learning process, reducing reliance on pixel-level human annotation.

Human-centered design of an artificial intelligence monitoring system: the Vanderbilt Algorithmovigilance Monitoring and Operations System /valiant/2025/11/23/human-centered-design-of-an-artificial-intelligence-monitoring-system-the-vanderbilt-algorithmovigilance-monitoring-and-operations-system/ Sun, 23 Nov 2025 16:58:07 +0000 /valiant/?p=5459 Salwei, Megan E.; Davis, Sharon E.; Reale, Carrie; Novak, Laurie Lovett; Walsh, Colin G.; Beebe, Russ; Nelson, Scott D.; Sundrani, Sameer; Rose, Susannah L.; Wright, Adam T.; Ripperger, Michael A.; Shave, Peter; & Embi, Peter J. (2025). JAMIA Open, 8(5), ooaf136.

As artificial intelligence (AI) becomes more common in healthcare, there is growing awareness that these systems need continuous oversight after they are put into use—a process known as algorithmovigilance. However, few tools exist to help hospitals consistently monitor and manage the performance of AI across their entire organization. In this study, we worked to understand what end users need from such a system while designing a new monitoring platform called the Vanderbilt Algorithmovigilance Monitoring and Operations System (VAMOS).

To do this, we brought together a multidisciplinary team at Vanderbilt University Medical Center and held nine participatory design sessions with clinicians, leaders, and technical experts to create early prototypes. After developing a working version, we conducted eight additional interviews to gather feedback and used rapid qualitative analysis to refine the design. A multidisciplinary heuristic evaluation then helped identify more ways to improve the system.

Through this human-centered, iterative process, we identified the key features an AI monitoring system must include, such as specific data displays, performance dashboards, expandable “accordion” summaries, and model-specific pages that meet the needs of a wide range of users. We also outlined general design principles for long-term AI monitoring, highlighting the challenge of supporting teams spread across the health system as they track performance issues and respond to signs of algorithm deterioration. Ultimately, VAMOS is intended to help healthcare organizations monitor AI tools in a systematic and proactive way, with the goal of improving care quality and ensuring patient safety.

Figure 1.

Overview of human-centered design process to develop VAMOS.

AI-Driven Clinical Decision Support to Reduce Hospital-Acquired Venous Thromboembolism: A Trial Protocol /valiant/2025/10/23/ai-driven-clinical-decision-support-to-reduce-hospital-acquired-venous-thromboembolism-a-trial-protocol/ Thu, 23 Oct 2025 19:20:17 +0000 /valiant/?p=5245 Walsh, Colin G.; Long, Yufei; Novak, Laurie Lovett; Salwei, Megan E.; Tillman, Benjamin F.; French, Benjamin C.; Mixon, Amanda S.; Law, Michelle E.; Franklin, Jacob; Embi, Peter J. (2025). JAMA Network Open, 8(10), e2535137.

Hospital-acquired venous thromboembolism (HA-VTE), or blood clots that develop in the veins during or after a hospital stay, remains one of the leading preventable causes of death among hospitalized adults in the United States. Although many models have been created to predict which patients are most at risk, none have clearly proven to be more effective than others, and it is still uncertain whether these models actually improve doctors’ decisions about preventive treatment. Testing these systems in both urban and rural hospitals may help determine how well they work across different healthcare environments.

This study is a randomized clinical trial designed to test whether an artificial intelligence (AI)–based clinical decision support (CDS) tool can reduce the number of HA-VTE cases among adult hospital patients. The trial will be conducted by Vanderbilt University Medical Center from October 2025 through September 2027, including adults aged 18 and older who are hospitalized in medical, surgical, or intensive care units and are at high risk for blood clots but do not currently have one or a condition that prevents preventive treatment. Participants will be drawn from Vanderbilt Adult Hospital in Nashville and three partner hospitals serving rural communities in Middle Tennessee.

Within the hospital’s electronic health record system, patients will be randomly assigned to receive either AI-supported care, which uses an alert system to prompt clinicians about clot prevention, or standard care based on traditional risk assessment tools. The main goal of the study is to determine whether the AI tool reduces the number of hospital-acquired blood clots. Additional measures will include hospital length of stay, readmission rates, safety outcomes, and bleeding events.

This study will be one of the first to examine whether an AI-driven decision support system can safely and effectively lower the risk of hospital-acquired blood clots without increasing side effects. It will also assess whether the same AI model performs equally well in both urban and rural hospitals. The results and supporting data will be shared publicly through peer-reviewed publications and ClinicalTrials.gov.

Figure 1. Intervention OurPractice Advisories Logic

BPA indicates best practice advisory; CDS, clinical decision support; DVT, deep vein thrombosis; VTE-AI, Venous Thromboembolism Using Artificial Intelligence.

Assessing the clinical utility of biomarkers using the intervention probability curve (IPC) /valiant/2025/10/23/assessing-the-clinical-utility-of-biomarkers-using-the-intervention-probability-curve-ipc/ Thu, 23 Oct 2025 19:05:51 +0000 /valiant/?p=5282 Paez, Rafael; Rowe, Dianna J.; Deppen, Stephen A.; Grogan, Eric L.; Kaizer, Alexander M.; Bornhop, Darryl J.; Kussrow, Amanda K.; Barón, Anna E.; Maldonado, Fabien; Kammer, Michael N. (2025). Cancer Biomarkers, 42(1), CBM230054.

Before new medical tests, or biomarkers, are used in clinics, it is important to understand how useful they are for guiding patient care. One way to do this is to see how a test might change which patients are assigned to different treatment groups, but traditional methods have some limitations. To address this, researchers developed the intervention probability curve (IPC), which models how likely a doctor is to choose a particular treatment based on a patient’s estimated risk of disease.
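The core of the IPC idea is to treat the chance of intervening as a smooth function of estimated disease risk, so a biomarker's impact can be read off as the change in that probability when it shifts a patient's risk estimate. The sketch below uses a logistic curve purely for illustration; the curve shape, midpoint, and steepness are assumptions, not values fitted from the trial data.

```python
import math

def intervention_probability(risk, midpoint=0.3, steepness=10.0):
    """Illustrative IPC: probability that a clinician orders an invasive
    intervention, modeled as a logistic function of estimated risk."""
    return 1.0 / (1.0 + math.exp(-steepness * (risk - midpoint)))

# Hypothetical example: a biomarker lowers a benign nodule's estimated
# risk from 0.35 to 0.15; the IPC converts that into a change in the
# probability of an invasive procedure.
delta = intervention_probability(0.15) - intervention_probability(0.35)
print(round(delta, 3))  # negative: the procedure becomes less likely
```

Because the curve is continuous, small risk shifts near the steep middle of the curve change intervention probability a lot, while equally large shifts out in the flat tails change it hardly at all, matching the paper's finding that cancer cases (already high risk) moved only 0.1%.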

In this study, the IPC was used to evaluate a new biomarker for suspected lung cancer, using data from the National Lung Screening Trial. The analysis estimated how the biomarker would affect decisions about interventions, such as biopsies or surgeries. The results suggested that 8% of patients with non-cancerous nodules could avoid unnecessary invasive procedures, while patients with actual cancer nodules would almost always still receive appropriate care (only 0.1% change).

Compared with traditional methods, the IPC provides a more detailed and continuous view of how a biomarker could influence clinical decisions. This approach shows that the IPC can be a valuable tool for assessing the potential impact of new biomarkers before they are implemented in everyday clinical practice.

Figure 1.

Population-based assessment of changes in intervention probability. While the means of the distributions are similar, the change in probability is more tightly clustered around zero in the cancer population than in the benign population.

Bedtime sliding scale insulin is unnecessary for hospitalized patients with bedtime glucose < 300 mg/dL: A nudge-based quasi-experiment /valiant/2025/09/26/bedtime-sliding-scale-insulin-is-unnecessary-for-hospitalized-patients-with-bedtime-glucose-300-mg-dl-a-nudge-based-quasi-experiment/ Fri, 26 Sep 2025 19:56:15 +0000 /valiant/?p=5129 Flory, James H., Vertosick, Emily Ann, Kuperman, Gilad J., Ancker, Jessica S., Kim, Scott Y.H., Fitzpatrick, Christine, Gould, Kimberly, Weiss, Everett, & Vickers, Andrew J. (2025). Diabetes Research and Clinical Practice, 228, 112428.

This study looked at how bedtime rapid-acting insulin is used in hospitalized patients with moderately high blood sugar, particularly in populations like cancer patients where previous research may not apply. Researchers changed the standard insulin order so that rapid-acting insulin at bedtime would only be automatically suggested for glucose levels of 300 mg/dL or higher. About half of the providers used this new order set over a two-month period, allowing comparison with the original approach. Among 458 patients, the new order set led to a 91% increase in the use of a less-aggressive insulin plan and lowered average morning glucose by 16 mg/dL. These findings suggest that rapid-acting bedtime insulin is not needed for glucose levels below 300 mg/dL and highlight that simple changes to order sets can be used to run effective, low-cost clinical trials without disrupting usual patient care.

Fig. 1. Study schema.

Behavior Shifts in Patient Portal Usage During and After Policy Changes Around Test Result Delivery and Notification /valiant/2025/06/20/behavior-shifts-in-patient-portal-usage-during-and-after-policy-changes-around-test-result-delivery-and-notification/ Fri, 20 Jun 2025 18:24:54 +0000 /valiant/?p=4562 Suresh, Uday; Steitz, Bryan D.; Rosenbloom, S. Trent; Griffith, Kevin N.; Ancker, Jessica S. AMIA Annual Symposium Proceedings (2024): 1089–1098.

Because of the 21st Century Cures Act, many hospitals and clinics now release test results to patients through online portals as soon as they are available. To understand how this change affected the way patients use these portals, we looked at electronic health record data before and after the policy went into effect, as well as after a later change that required patients to opt in to get notifications about new test results.

We found that once the Cures Act policy was put in place, more patients took action after viewing their test results—specifically, 4.5% more scheduled a new appointment and 4.5% more sent messages to their doctors. Later, when automatic notifications were turned off, 2.1% more patients scheduled appointments, but 0.8% fewer used telemedicine.

Our study shows that these policy changes led to real shifts in how patients respond to their test results—something that affects both patients’ efforts to get more information and the workload for clinicians.

Figure 1a, 1b, 1c.

Proportions of patients scheduling new appointments after accessing their test results prior to their clinician or after their clinician across the time periods of Cures Act compliance and the notification policy change. Plot 1a shows the proportion of all patients scheduling new appointments. Plots 1b and 1c show the proportion of patients scheduling new appointments who reviewed their results prior to their clinician and after their clinician, respectively.

Searching for Value Sensitive Design in Applied Health AI: A Narrative Review /valiant/2025/05/21/searching-for-value-sensitive-design-in-applied-health-ai-a-narrative-review/ Wed, 21 May 2025 16:22:32 +0000 /valiant/?p=4415 Long, Yufei; Novak, Laurie; Walsh, Colin G. Yearbook of Medical Informatics 33, no. 1 (2025): 75–82.

As artificial intelligence (AI) becomes more common in healthcare, it’s important to design these technologies in ways that fit into real-world medical environments and respect the values of the people involved. While many designers focus on human-centered design—which looks at users’ needs—another approach called Value Sensitive Design (VSD) goes a step further. VSD aims to include human values like trust, fairness, and well-being right from the beginning of the design process.

In this study, researchers looked at how VSD is being used in healthcare AI. They reviewed existing research using a version of the VSD framework adapted specifically for AI. Out of 819 articles they reviewed, only nine met the criteria for a full in-depth look.

Most of these studies focused on values related to individual users, like trust and autonomy. However, there was much less attention given to values at the organizational level (such as employee well-being) or societal level (like equity and justice). Most of the studies were from the U.S. and Western Europe.

The researchers concluded that future healthcare AI design should take a broader approach by also considering organizational and societal values, not just those of individual users. Since so few studies have applied VSD in this way, there’s a clear need for more research to better guide how AI can be responsibly used in healthcare.

How Difference Tasks Are Affected by Probability Format, Part 2: A Making Numbers Meaningful Systematic Review /valiant/2025/04/23/how-difference-tasks-are-affected-by-probability-format-part-2-a-making-numbers-meaningful-systematic-review/ Wed, 23 Apr 2025 14:08:19 +0000 /valiant/?p=4149 Benda, Natalie C.; Zikmund-Fisher, Brian J.; Sharma, Mohit M.; Johnson, Stephen B.; Demetres, Michelle; Delgado, Diana; Ancker, Jessica S. MDM Policy and Practice 10, no. 1 (2025): 1.

The way health information is presented—especially numbers about risks and benefits—can have a big impact on how people understand and respond to it. To explore this, the Making Numbers Meaningful team reviewed over 100 studies to see how different presentation formats affect people’s reactions when comparing probabilities, like how much a treatment lowers the chance of a disease coming back. They found that people were more influenced by numbers shown as relative differences (for example, saying a treatment “cuts the risk in half”) than absolute differences (like saying it “reduces risk by 2%”). People also responded more strongly to charts that focused only on the number affected, rather than showing the full picture, and were more persuaded when messages included personal stories or information about what other people chose. Bar charts were generally preferred over icon-style graphics, especially when they included labels showing exact numbers. Overall, the review showed that the format used to present health statistics can shape how effective people think a treatment is, how much they trust the information, what choices they intend to make, and which formats they prefer. This means that clear and thoughtful presentation of numbers is essential for helping people make informed health decisions.
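The gap between relative and absolute framing comes down to simple arithmetic: the same change in risk can be described as a percentage-point drop or as a fraction of the starting risk. A small worked sketch with hypothetical numbers echoing the review's example:

```python
def risk_differences(baseline, treated):
    """Express the same treatment effect two ways (inputs in percent)."""
    absolute = baseline - treated               # percentage-point drop
    relative = (baseline - treated) / baseline  # fraction of risk removed
    return absolute, relative

# Hypothetical: treatment lowers recurrence risk from 4% to 2%.
absolute, relative = risk_differences(4.0, 2.0)
print(f"absolute: risk falls by {absolute} percentage points")
print(f"relative: risk is cut by {relative:.0%}")
```

Here "cuts the risk in half" (50%) and "reduces risk by 2 percentage points" describe the identical effect, yet the review found the relative framing is consistently more persuasive, which is why reporting guidelines often recommend presenting the absolute difference alongside it.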

Table A

This standardized numbering system has been used for results subheadings in this article and across all Making Numbers Meaningful results articles to ensure that readers can find comparable information in all articles. Gray cells represent combinations that are not possible according to the definitions presented in Ancker et al.
