Unsupervised discovery of clinical disease signatures using probabilistic independence

May 21, 2025, 4:24 PM

Lasko, Thomas A.; Stead, William W.; Still, John M.; Li, Thomas Z.; Kammer, Michael; Barbero-Mota, Marco; Strobl, Eric V.; Landman, Bennett A.; Maldonado, Fabien. “Unsupervised discovery of clinical disease signatures using probabilistic independence.” Journal of Biomedical Informatics 166 (2025): 104837.

This study uses a method based on probabilistic independence to help uncover the hidden, patient-specific causes, or “sources,” of disease using data from electronic health records (EHRs). In this approach, each disease source is treated as an unobserved root cause in a network that influences various observed medical variables such as lab tests, medications, billing codes, and demographics. The effects of each source, its signature, are the patterns that the source leaves behind in the data.
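The idea of a signature can be illustrated with a toy forward model. The sketch below is not the study's method: it assumes a simple linear mixing of sources into observed variables, and every name and number in it (`VARIABLES`, `SIGNATURES`, `source_1`, and so on) is hypothetical. It only shows that, under that assumption, the signature of a source is the change in each observed variable when that source rises by one unit and the other sources are held fixed. The study works in the opposite direction, inferring sources and signatures from the records.

```python
import random

# Hypothetical observed variables (illustrative names only).
VARIABLES = ["lab_a", "medication_b", "billing_code_c", "age"]

# Assumed signatures: the effect of a unit change in each latent source
# on each observed variable. A zero means the source does not affect it.
SIGNATURES = {
    "source_1": [2.0, 0.5, 1.0, 0.0],
    "source_2": [0.0, 1.5, -0.5, 0.0],
}


def observe(sources):
    """Assumed linear mixing: each variable sums the effects of all sources."""
    return [
        sum(sources[name] * SIGNATURES[name][i] for name in SIGNATURES)
        for i in range(len(VARIABLES))
    ]


def unit_effect(source_name, baseline):
    """Raise one source by one unit, hold the rest fixed, and report the change."""
    shifted = dict(baseline)
    shifted[source_name] += 1.0
    before = observe(baseline)
    after = observe(shifted)
    return [a - b for a, b in zip(after, before)]


rng = random.Random(0)
# Independent root causes: each source value is drawn separately.
patient = {name: rng.gauss(0.0, 1.0) for name in SIGNATURES}
recovered = unit_effect("source_1", patient)

for variable, effect in zip(VARIABLES, recovered):
    print(f"{variable}: {effect:+.2f}")
```

In this toy setup the recovered effects match the assumed `source_1` signature up to floating-point rounding, and the zero entry for `age` plays the role of a variable that the source cannot affect.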

By analyzing a large dataset of over 269,000 patient records and 9,195 variables, the model was able to infer 2,000 potential disease sources and their unique signatures. To test the method, the researchers used it to explore the causes of benign vs. malignant pulmonary nodules (small spots in the lungs) in more than 13,000 cases. The model successfully identified 92% of known malignant causes and 30% of benign ones listed in an external reference. It also uncovered several likely causes not included in the reference list but supported by other medical literature.

In many cases, the model could decompose a general diagnosis into more specific patterns related to disease progression or treatment. For example, a common malignant cause could be broken down into five or more detailed sub-patterns. Interestingly, the model also flagged many patients who may have had undiagnosed cancer, based on their data patterns.

These findings show that even from noisy, incomplete, and irregular health records, it’s possible to extract meaningful, patient-specific causes of disease. This could eventually help clinicians better understand complex cases and make more precise treatment decisions tailored to individual patients.

Fig. 1. A hypothetical causal graph and structures derived from it. a) The causal graph inferred from observing the measured variables (solid circles) over many records. The latent sources (dotted circles) are inferred. Colors of the observed nodes indicate the degree to which a unit change in a given source affects them. They are arbitrary here for illustration, except for a node that the source cannot affect. b) Causal effects of that source collected into a bar-graph signature. c) Causal model of the outcome using latent sources as inputs. d) Statistical model of the outcome using observations as inputs. Color intensity of the inputs represents their hypothetical importance values for the prediction in a single instance. For the causal model, the inputs are mutually independent root nodes and can therefore be interpreted as the causal sources of the outcome. This may suggest treatment approaches that address the specific causes for this patient, and the sources may be manipulated to investigate different counterfactual scenarios. For the statistical model, the importance values remain entangled and cannot be interpreted this way.
