Chongyu Qu; Ritchie Zhao; Ye Yu; Bin Liu; Tianyuan Yao; Junchao Zhu; Bennett A. Landman; Yucheng Tang; Yuankai Huo (2026). Journal of Medical Imaging, 13(1), 014006.
This study focuses on making advanced deep learning models for medical imaging more efficient and practical to use, especially in settings with limited computing power. One common approach is quantization, which reduces the numerical precision (or bit-width) of a model's calculations—for example, using 8-bit integers instead of standard 32-bit floating-point numbers—to shrink model size and speed up processing. However, many previous methods only simulate lower precision ("fake quantization") and therefore deliver no real savings in memory or inference time. To address this gap, the researchers developed MedPTQ, an open-source pipeline that enables true 8-bit integer (INT8) quantization for complex 3D medical imaging models, such as U-Net and transformer-based architectures. Their method works in two stages: first, it uses NVIDIA's TensorRT toolkit to calibrate lower-precision computations on sample data, and then it compiles the model into a genuinely low-precision engine that executes on GPUs (graphics processing units), which are commonly used for high-performance computing.
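To make the core idea concrete, here is a minimal, hedged sketch of symmetric per-tensor INT8 quantization in NumPy. This is an illustration of the general technique, not the paper's actual pipeline (which uses TensorRT): a scale factor is chosen from representative calibration data, FP32 values are rounded into the range [-127, 127], and the INT8 codes occupy one quarter of the memory of the originals.

```python
import numpy as np

def int8_quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Map FP32 values to INT8 codes using a calibration-derived scale."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q.astype(np.int8)

def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from INT8 codes."""
    return q.astype(np.float32) * scale

# Calibration: derive the scale from sample data. Here we use simple
# max-absolute-value calibration; production calibrators (e.g. in
# TensorRT) typically use entropy- or percentile-based methods.
rng = np.random.default_rng(0)
calib = rng.standard_normal(1024).astype(np.float32)
scale = float(np.abs(calib).max()) / 127.0

q = int8_quantize(calib, scale)
x_hat = int8_dequantize(q, scale)

# INT8 storage is exactly 4x smaller than FP32,
# and the round-trip error is bounded by half the quantization step.
assert q.nbytes * 4 == calib.nbytes
assert np.abs(calib - x_hat).max() <= scale / 2 + 1e-7
```

Note that this "fake quantization" round-trip only models the numerical error; the real speedup the paper reports comes from executing the INT8 arithmetic natively on GPU hardware, which a NumPy sketch cannot show.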
The results show that MedPTQ can significantly reduce model size (by up to nearly four times) and speed up inference (by almost three times) while maintaining almost the same accuracy as full-precision models, as measured by the Dice similarity coefficient—a standard metric for evaluating how well predicted image segments match the true regions. Importantly, the approach was tested across multiple model families and datasets, including brain, abdominal, and whole-body scans from CT and MRI imaging, demonstrating strong flexibility and reliability. Overall, this work shows that real, not just simulated, low-precision AI models can be effectively deployed in medical imaging, making them more accessible and efficient without sacrificing performance.
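For readers unfamiliar with the Dice similarity coefficient mentioned above, here is a short worked example. The function below is a standard textbook formulation (Dice = 2|A∩B| / (|A| + |B|) over binary masks), not code from the paper; the toy masks are hypothetical.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity between two binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    # Convention: two empty masks count as a perfect match.
    return 2.0 * intersection / total if total > 0 else 1.0

# Toy 2D "segmentation": ground truth covers 4 pixels, the prediction
# covers 6 pixels, and they overlap on 4 pixels.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True          # 4 true pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True           # 6 predicted pixels

print(dice_coefficient(pred, truth))  # 2*4 / (4+6) = 0.8
```

A Dice score of 1.0 means the predicted and true regions coincide exactly, so "almost the same accuracy" in the paper means the INT8 model's Dice scores stay close to the FP32 model's.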

Fig. 1
We introduce MedPTQ, an open-source pipeline for real post-training quantization that converts FP32 PyTorch models into INT8 TensorRT engines. By leveraging TensorRT for real INT8 deployment, MedPTQ reduces model size and inference latency while preserving segmentation accuracy for efficient GPU deployment.