Genomic Analysis | VALIANT /valiant Vanderbilt Advanced Lab for Immersive AI Translation (VALIANT) Thu, 21 Nov 2024 17:33:27 +0000
Temporal recording of mammalian development and precancer /valiant/2024/11/21/temporal-recording-of-mammalian-development-and-precancer/ Thu, 21 Nov 2024 17:32:54 +0000 /valiant/?p=3320 Islam, M.; Yang, Y.; Simmons, A.J.; Shah, V.M.; Musale, K.P.; Xu, Y.; Tasneem, N.; Chen, Z.; Trinh, L.T.; Molina, P.; Ramirez-Solano, M.A.; Sadien, I.D.; Dou, J.; Rolong, A.; Chen, K.; Magnuson, M.A.; Rathmell, J.C.; Macara, I.G.; Winton, D.J.; Liu, Q.; Zafar, H.; Kalhor, R.; Church, G.M.; Shrubsole, M.J.; Coffey, R.J.; Lau, K.S. “Temporal recording of mammalian development and precancer.” Nature, Volume 634, Issue 8036, 2024, pp. 1187-1195, Article 8.

Understanding when and how cells change over time is crucial for studying biology. Traditionally, this involves continuous observation, but another method uses permanent genetic changes, like mutations, as “timestamps” to track events after they happen. Researchers developed a “molecular clock” using CRISPR technology to record the timing of cell changes along with information about cell type and lineage (family tree). This approach revealed the timing of specific cell growth during mouse development, unexpected connections between different cell types, and new states of epithelial cells based on their genetic history.

The method was also applied to study the early stages of colon cancer in mice and humans. By analyzing 418 human precancerous polyps, researchers discovered that 15–30% originated from multiple normal cells rather than a single cell. This innovative framework combines genetic “timestamps” with single-cell analysis, providing new insights into the timing and origins of development and diseases like cancer.

Fig. 1: Optimization of a multipurpose, single-cell capture platform.

a, gRNA capture schematic for the NSC–seq platform. The target site of gRNA scaffold anneals to NSC–seq capture sequence (CS) with a cellular barcode (blue) and unique molecular identifier (green). An additional sequence (grey) is added to the 3′-end of the complementary DNA via template switching during reverse transcription to enable downstream library amplification. This gRNA capture approach is compatible with any type of gRNA (single-guide RNA (sgRNA), hgRNA and self-targeting guide RNA) that contains the target site sequence in the scaffold (Extended Data Fig.). b, Cas9-induced mutation recovery by direct hgRNA capture as compared with mutations detected in DNA of the same samples. c, gRNA capture efficiency by NSC–seq assessed in an experiment in which all cells from a drug-selected cell line should contain sgRNAs. d, Comparative transcriptome capture efficiency between standard inDrops and NSC–seq experiments. e, NSC–seq experiments performed on developmentally barcoded whole embryos in which Cas9 is constitutively expressed (top). Accumulative mutations on homing barcode regions increase over time (bottom). f, Average mutation density over embryonic time points (Extended Data Fig.). Black dots represent geometric mean for each time point, and P values are derived from unpaired two-tailed t-tests. g, Somatic mtVar calling from mitochondrial RNA (mtRNA) (top). Approach to filtering informative mtVars for lineage tracking using hgRNA mutations as ground truth (bottom) (Extended Data Fig.). h, Number of somatic mtVars per cell over embryonic time points. Black dot represents geometric mean for each time point, and P values were derived from unpaired two-tailed t-tests. i, Pearson correlation coefficient heat map of variant proportions combining hgRNAs and mtVars for selected tissue types, presented as pseudobulk from an E9.5 embryo (Extended Data Fig.). j, Multimodal application of the NSC–seq platform. a,e,g,j, Schematics created using BioRender.
a.u., arbitrary units; AUC, area under the curve; rep., replicate; prog., progenitor; bp, base pairs.

]]>
Consensus tissue domain detection in spatial omics data using multiplex image labeling with regional morphology (MILWRM) /valiant/2024/11/21/consensus-tissue-domain-detection-in-spatial-omics-data-using-multiplex-image-labeling-with-regional-morphology-milwrm/ Thu, 21 Nov 2024 16:46:39 +0000 /valiant/?p=3298 Kaur, H.; Heiser, C.N.; McKinley, E.T.; Ventura-Antunes, L.; Harris, C.R.; Roland, J.T.; Farrow, M.A.; Selden, H.J.; Pingry, E.L.; Moore, J.F.; Ehrlich, L.I.R.; Shrubsole, M.J.; Spraggins, J.M.; Coffey, R.J.; Lau, K.S.; Vandekar, S.N. “Consensus tissue domain detection in spatial omics data using multiplex image labeling with regional morphology (MILWRM).” Communications Biology, Volume 7, Issue 1, 2024, Article 1295.

New molecular imaging methods can capture detailed genetic and protein information directly from tissues, allowing scientists to study diseases while keeping the original structure of the tissue intact. By combining this molecular data with traditional tissue images, researchers can learn more about how different parts of tissues are affected by diseases. However, making sense of all this complex data, especially when comparing many samples, is challenging.

To help with this, we created MILWRM, a Python tool that can quickly find and label different areas within tissue samples. MILWRM analyzes images and groups similar parts of the tissue together, making it easier to identify specific regions.

We tested MILWRM on various tissue samples, including human colon polyps, lymph nodes, mouse kidneys, and mouse brain slices. The tool was able to distinguish different types of polyps and identify unique areas in the brain based on their molecular characteristics. MILWRM helps researchers understand the structure and molecular features of tissues, making it a valuable tool for studying diseases.

Fig. 1: The workflow of the MILWRM pipeline.

MILWRM begins by constructing a tissue labeler object from all sample slides. The slides undergo data preprocessing, serialization, and subsampling to create a randomly subsampled dataset for k-means model construction. The subsampled data are used to select the optimal number of tissue domains (TDs), k, with the adjusted inertia method. Finally, a k-means model is constructed, and each pixel is assigned a TD. Each TD has a distinct domain profile describing its molecular features. MILWRM also provides quality control metrics such as confidence scores (created with BioRender.com).
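
The k-selection idea in this caption can be illustrated with a small, dependency-free sketch. It is not MILWRM's code: the function names, the farthest-point seeding, the penalty weight `alpha`, and the scoring formula (inertia scaled by the single-cluster inertia, plus `alpha * k`) are all assumptions, chosen only to show how penalizing inertia keeps a larger k from always winning.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def init_centroids(points, k):
    """Deterministic farthest-point seeding (an illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    centroids = init_centroids(points, k)
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(
                tuple(sum(dim) / len(members) for dim in zip(*members))
                if members else centroids[j]
            )
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


def select_k(points, k_values, alpha=0.05):
    """Pick the k that minimizes inertia_k / inertia_1 + alpha * k (assumed form)."""
    base = kmeans(points, 1)[2] or 1.0  # guard against all-identical points
    scores = {k: kmeans(points, k)[2] / base + alpha * k for k in k_values}
    return min(scores, key=scores.get), scores


# Hypothetical "pixels" with two features each, forming three separated groups.
pixels = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
best_k, scores = select_k(pixels, range(1, 7))
domain_labels, _, _ = kmeans(pixels, best_k)
```

Each entry of `domain_labels` plays the role of a pixel's tissue domain in this toy setting. With real image data, a library k-means implementation would replace the hand-written loop.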

]]>
Identification and multimodal characterization of a specialized epithelial cell type associated with Crohn’s disease /valiant/2024/09/22/identification-and-multimodal-characterization-of-a-specialized-epithelial-cell-type-associated-with-crohns-disease/ Sun, 22 Sep 2024 15:41:06 +0000 /valiant/?p=3032
Li, Jia, Simmons, Alan J., Hawkins, Caroline V., Chiron, Sophie, Ramirez-Solano, Marisol A., Tasneem, Naila, Kaur, Harsimran, Xu, Yanwen, Revetta, Frank, Vega, Paige N., Bao, Shunxing, Cui, Can, Tyree, Regina N., Raber, Larry W., Conner, Anna N., Pilat, Jennifer M., Jacobse, Justin, McNamara, Kara M., Allaman, Margaret M., Raffa, Gabriella A., Gobert, Alain P., Asim, Mohammad, Goettel, Jeremy A., Choksi, Yash A., Beaulieu, Dawn B., Dalal, Robin L., Horst, Sara N., Pabla, Baldeep S., Huo, Yuankai, Landman, Bennett A., Roland, Joseph T., Scoville, Elizabeth A., Schwartz, David A., Washington, M. Kay, Shyr, Yu, Wilson, Keith T., Coburn, Lori A., Lau, Ken S., & Liu, Qi. (2024). Identification and multimodal characterization of a specialized epithelial cell type associated with Crohn’s disease. Nature Communications, 15(1), 7204.
This study investigates Crohn’s disease (CD), a chronic inflammatory condition affecting both the gastrointestinal system and other parts of the body due to immune system dysregulation. By analyzing over 202,000 cells from 170 tissue samples across 83 patients, the researchers identified a specific epithelial cell type, termed ‘LND,’ present in both the terminal ileum and ascending colon. These LND cells, which show high expression of genes related to antimicrobial response and immune regulation (such as LCN2, NOS2, and DUOX2), were found to be rare in individuals without inflammatory bowel disease (IBD) but significantly expanded in patients with active CD.

Further in-situ RNA and protein imaging confirmed the presence of LND cells, which interact closely with immune cells and express genes linked to CD susceptibility, suggesting their involvement in the disease’s immune dysfunction. Additionally, the study identified early and late subpopulations of LND cells, each with distinct developmental trajectories. Interestingly, patients with a higher ratio of late-to-early LND cells were more likely to respond positively to anti-TNF treatment, a common therapy for CD. These findings highlight a potentially pathogenic role for LND cells in CD and provide new insights into disease mechanisms and treatment responses.

Single-cell landscape in Crohn’s disease and non-IBD controls. A Schematic for processing endoscopic and surgical samples from TI and AC for non-IBD controls, inactive and active CD patients. B Summary of the number of samples in each group. C UMAP of 155,093 cells from endoscopy samples colored by cell clusters. D Dotplot showing markers for each cell type. E UMAP of 155,093 cells colored by tissue origin, TI (brown) or AC (blue). F Proportion of each cell cluster in TI (brown) and AC samples (blue). G UMAP of 155,093 cells colored by disease status, controls (tan), inactive (green) or active CD (purple). H MDS plot of cell compositional differences across all endoscopy specimens.
]]>
Integration of estimated regional gene expression with neuroimaging and clinical phenotypes at biobank scale /valiant/2024/09/22/integration-of-estimated-regional-gene-expression-with-neuroimaging-and-clinical-phenotypes-at-biobank-scale/ Sun, 22 Sep 2024 15:32:32 +0000 /valiant/?p=3027 Hoang, Nhung, Sardaripour, Neda, Ramey, Grace D., Schilling, Kurt, Liao, Emily, Chen, Yiting, Park, Jee Hyun, Bledsoe, Xavier, Landman, Bennett A., Gamazon, Eric R., Benton, Mary Lauren, Capra, John A., & Rubinov, Mikail. (2024). Integration of estimated regional gene expression with neuroimaging and clinical phenotypes at biobank scale. PLoS Biology, 22(9), e3002782.

This study aims to deepen our understanding of human brain individuality by integrating various large-scale data sets, including genomic, transcriptomic, neuroimaging, and electronic health records. The researchers used computational genomics methods to estimate genetically regulated gene expression (gr-expression) for 18,647 genes across 10 brain regions in over 45,000 people from the UK Biobank. Their analysis revealed that gr-expression patterns align with known genetic ancestry relationships, brain region identities, and gene expression correlations across different regions.

Through transcriptome-wide association studies (TWAS), they discovered 1,065 associations between gr-expression and individual differences in gray matter volumes across people and brain regions. These findings were compared to genome-wide association studies (GWAS) in the same sample, revealing hundreds of novel associations. The study also linked gr-expression to clinical phenotypes by integrating results from the Vanderbilt Biobank.

Further analysis involved the Human Connectome Project (HCP), where they identified associations between polygenic gr-expression and MRI-based structural and functional brain phenotypes. The results were highly replicable, strengthening the reliability of their findings. Overall, this work offers a valuable new resource for connecting genetically regulated gene expression to brain organization and diseases, advancing our understanding of brain individuality and its clinical relevance.

Estimation of genetically regulated gene expression from genetic data.
(A) Pipeline for estimation of gr-expression with Joint-Tissue Imputation. Left: Joint-Tissue Imputation models are trained on genetic sequences and directly assayed gene expression from postmortem brain samples in the GTEx and PsychEncode projects. Center: The models are trained to estimate gr-expression as a weighted sum of SNPs that are close to the gene of interest along the linear genome. The estimation includes elastic-net regularization because the number of these SNPs typically exceeds the number of samples in the training data. Right: The trained models were used to estimate gr-expression from genetic sequences of neuroimaging-genomic samples in the UK Biobank and the HCP. (B) An illustration of the 10 cortical and subcortical regions with available models of gr-expression. Numbers in parentheses refer to all models that passed baseline performance thresholds for the prediction of observed gene expression on held-out data (r2 > 0.01 and pFDR < 0.05). (C, D) Predictive performance of gr-expression models on held-out data from the GTEx data set. (C) Histograms of r2, the variance of directly assayed gene expression explained by estimated gr-expression. (D) Histograms of p-values (−log10 pFDR) on these r2 values. Regions are colored as in panel B. FDR, false discovery rate; GTEx, Genotype-Tissue Expression Project; HCP, Human Connectome Project; SNP, single-nucleotide polymorphism.
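
The "weighted sum of SNPs" with elastic-net regularization described in panel A can be written out in a few lines. This is a minimal sketch under stated assumptions, not the Joint-Tissue Imputation implementation: the SNP identifiers, weights, and dosages are made up, and the penalty uses the common glmnet-style parameterization with hypothetical parameters `lam` and `l1_ratio`.

```python
def predict_gr_expression(dosages, weights, intercept=0.0):
    """Estimate expression as intercept + sum of weight * allele dosage.

    `dosages` maps SNP id -> allele count (0, 1, or 2); SNPs missing from
    `dosages` are treated as dosage 0 here (a simplifying assumption).
    """
    return intercept + sum(w * dosages.get(snp, 0.0) for snp, w in weights.items())


def elastic_net_penalty(weights, lam=0.1, l1_ratio=0.5):
    """Penalty added to the training loss: a mix of L1 and squared-L2 terms."""
    l1 = sum(abs(w) for w in weights.values())
    l2 = sum(w * w for w in weights.values())
    return lam * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


# Hypothetical trained weights for SNPs near one gene in one brain region.
weights = {"snp_a": 0.5, "snp_b": -0.25, "snp_c": 0.0}
# Hypothetical genotype of one individual.
dosages = {"snp_a": 2, "snp_b": 1}

estimate = predict_gr_expression(dosages, weights)
penalty = elastic_net_penalty(weights)
```

The L1 part of the penalty pushes many weights to exactly zero (like `snp_c` above), which is what makes fitting feasible when candidate SNPs outnumber training samples.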
]]>
Benchmarking clustering, alignment, and integration methods for spatial transcriptomics /valiant/2024/08/22/benchmarking-clustering-alignment-and-integration-methods-for-spatial-transcriptomics/ Thu, 22 Aug 2024 16:36:44 +0000 /valiant/?p=2879 Hu, Yunfei; Xie, Manfei; Li, Yikang; Rao, Mingxing; Shen, Wenjun; Luo, Can; Qin, Haoran; Baek, Jihoon; Zhou, Xin Maizie. Genome Biology, volume 25, Article number: 212 (2024). Published: 09 August 2024.

Understanding the complexities of tissues and organisms is no small feat. However, scientists are making great strides with a cutting-edge technique called spatial transcriptomics (ST). This method allows us to study tissues at a microscopic level, revealing valuable information about their structure and function. But here’s the catch: analyzing and integrating data from multiple tissue slices and finding meaningful patterns within a single slice can be quite challenging. To overcome this hurdle, researchers have developed several algorithms specifically tailored for ST data analysis. These algorithms help identify distinct spatial regions within a tissue slice and align data from different sources for further analysis.

To guide researchers in choosing the right methods and paving the way for future advancements, a team of scientists conducted a comprehensive benchmarking study. They evaluated various state-of-the-art algorithms by analyzing real and simulated datasets with different sizes, technologies, species, and complexities. The researchers assessed each algorithm using a range of quantitative and qualitative metrics. These metrics included measures of clustering accuracy, visualization techniques to understand spatial relationships, alignment accuracy, and even 3D reconstruction. By considering both method performance and data quality, they provided a holistic evaluation to aid researchers in selecting the best tools for their specific needs.

The team has made all their evaluation code available on GitHub, along with online notebooks and documentation. This ensures transparency and reproducibility, allowing other researchers to validate the benchmarking results and explore new methods using different datasets. In conclusion, this groundbreaking study provides comprehensive recommendations to researchers, offering guidance in choosing optimal tools and inspiring future developments. With these advanced techniques, we are unlocking new possibilities and gaining deeper insights into the fascinating world of complex tissues.

Benchmarking framework for clustering, alignment, and integration methods on different real and simulated datasets. Top, illustration of the set of methods benchmarked, which includes 16 clustering methods, five alignment methods, and five integration methods. Bottom, overview of the benchmarking analysis, in terms of different metrics (1–7). Different experimental metrics and analyses, Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Adjusted Mutual Information (AMI), Homogeneity (HOM), Average Silhouette Width (ASW), CHAOS, Percentage of Abnormal Spots (PAS), Spatial Coherence Score (SCS), uniform manifold approximation and projection (UMAP) visualization, layer-wise and spot-to-spot alignment accuracy, 3D reconstruction, and runtime, are designed to quantitatively and qualitatively assess method performance as well as data quality. Additional details are provided in the “Results” section
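
Of the clustering metrics listed in this caption, the Adjusted Rand Index (ARI) is easy to compute by hand, which helps when interpreting benchmark tables. The sketch below uses the standard pair-counting definition with only the Python standard library; it is an illustration and is not taken from the study's evaluation code. The spot labels are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Pair-counting ARI between two labelings of the same spots."""
    if len(truth) != len(pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # zero or one spot: the labelings trivially agree
    # Pairs of spots that share a cluster in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs that share a cluster within each labeling separately.
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = together_truth * together_pred / total_pairs
    maximum = 0.5 * (together_truth + together_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical annotated layers versus a method's predicted domains.
annotated = ["L1", "L1", "L2", "L2"]
relabeled = ["b", "b", "a", "a"]   # same partition, different names
crossed = ["a", "b", "a", "b"]     # splits every true layer in half

ari_same = adjusted_rand_index(annotated, relabeled)
ari_crossed = adjusted_rand_index(annotated, crossed)
```

ARI ignores cluster names and only compares partitions, so `ari_same` is 1.0 even though the labels differ, while a partition that cuts across the annotation scores below zero.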
]]>
Parallel signatures of cognitive maturation in primate antisaccade performance and prefrontal activity /valiant/2024/08/22/parallel-signatures-of-cognitive-maturation-in-primate-antisaccade-performance-and-prefrontal-activity/ Thu, 22 Aug 2024 16:35:28 +0000 /valiant/?p=2876 Zhu, Junda; Zhou, Xin Maizie; Constantinidis, Christos; Salinas, Emilio; Stanford, Terrence R. iScience, Volume 27, Issue 8, 16 August 2024, 110488.

In this study, researchers examined how our ability to control and redirect our attention develops as we grow older. They conducted their investigation by studying monkeys that were trained to disregard distractions and redirect their attention to a specific target (referred to as the antisaccade task). Through behavioral evaluations and monitoring neural activity in the prefrontal cortex, they compared the monkeys’ performance before and after puberty.

The findings revealed that adult monkeys showed significant improvements in processing the stimulus quickly, resisting the involuntary urge to look at it, and consistently following the task rules compared to when they were younger. The researchers also observed changes in the prefrontal cortex, which played a crucial role in the monkeys’ improved performance. The specific neurons in this brain region showed enhanced activity and provided neural markers of the behavioral changes, suggesting a shift from stimulus-driven to goal-driven control during each trial.

These results not only shed light on the cognitive development of monkeys but also offer important insights into how our own attentional abilities mature as we transition from adolescence to adulthood. It appears that effective allocation of attention plays a key role in achieving better response control. This study deepens our understanding of how our ability to focus and resist distractions improves over time, providing valuable knowledge about the development of cognitive skills.

Antisaccade performance for individual subjects (A) Tachometric curves showing the proportion of correct responses at each rPT bin (bin width = 40 ms). Each curve combines trials from all gap conditions either in the young (blue) or adult (red) stage. For each curve, the light shaded ribbon denotes the mean proportion correct ±1 SE from the experimental data, and the dark trace is a continuous function fitted to those data (Methods).
(B–D) Three quantities derived from the tachometric curve and used to characterize antisaccade performance for each monkey and for their combined data (All). The rPT at criterion (B) is the processing time at which performance reaches 75% correct. The probability that a saccade is captured by the cue (C) is based on how much the tachometric curve dips below chance. And the asymptote (D) is the performance level attained at long rPTs. For all quantities, bars show values for the young (blue) and adult (red) stages, and gray shades and error bars indicate 68% and 95% CIs, respectively, obtained by bootstrapping (Methods).
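
The tachometric curve and the rPT at criterion described above can be approximated directly from trial data. The sketch below is a simplification under stated assumptions: it uses non-overlapping 40 ms bins and reads the criterion off the binned proportions, whereas the caption says the reported values come from a continuous function fitted to the data. The trial values are invented for illustration.

```python
from collections import defaultdict


def tachometric_curve(trials, bin_width=40):
    """Proportion correct per rPT bin.

    `trials` is an iterable of (rpt_ms, correct) pairs. Returns a list of
    (bin_center_ms, proportion_correct) sorted by bin.
    """
    counts = defaultdict(lambda: [0, 0])  # bin start -> [n_correct, n_trials]
    for rpt, correct in trials:
        start = (rpt // bin_width) * bin_width
        counts[start][0] += int(correct)
        counts[start][1] += 1
    return [(start + bin_width / 2, hits / n)
            for start, (hits, n) in sorted(counts.items())]


def rpt_at_criterion(curve, criterion=0.75):
    """Center of the first bin whose proportion correct reaches the criterion."""
    for center, proportion in curve:
        if proportion >= criterion:
            return center
    return None  # criterion never reached


# Hypothetical trials: (raw processing time in ms, whether the saccade was correct).
trials = [(90, True), (100, False),    # near chance at short rPT
          (130, False), (150, False),  # captured by the cue
          (170, True), (190, True),
          (210, True), (230, True)]

curve = tachometric_curve(trials)
criterion_rpt = rpt_at_criterion(curve)
```

In this toy data the curve starts at chance, dips to zero in the 120–160 ms bin (the capture effect in panel C), and then rises to its asymptote.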
]]>
SCCNAInfer: a robust and accurate tool to infer the absolute copy number on scDNA-seq data /valiant/2024/08/22/sccnainfer-a-robust-and-accurate-tool-to-infer-the-absolute-copy-number-on-scdna-seq-data/ Thu, 22 Aug 2024 16:31:43 +0000 /valiant/?p=2870 Zhang, Liting; Zhou, Xin Maizie; Mallory, Xian. “SCCNAInfer: a robust and accurate tool to infer the absolute copy number on scDNA-seq data.” Bioinformatics, Volume 40, Issue 7, July 2024, btae454.

In diseases like cancer, changes in our cells called copy number alterations (CNAs) are important to understand. These changes can tell us a lot about how diseases progress. Single-cell DNA sequencing (scDNA-seq) helps researchers detect CNAs in individual cells, but current tools can make mistakes across the entire genome due to wrong estimates of cell chromosome numbers, or “ploidy.”

SCCNAInfer is a new tool designed to improve this process. It uses information from inside tumor cells to more accurately estimate each cell’s ploidy and CNAs. SCCNAInfer works alongside existing CNA detection methods by grouping cells, calculating ploidy for each group, refining the data, and accurately identifying CNAs for each cell.

Tests show that SCCNAInfer does a better job compared to other tools like Aneufinder, Ginkgo, SCOPE, and SeCNV. This new tool can help researchers get clearer insights into cell changes, aiding in the study of cancer and other diseases.

SCCNAInfer is freely available at .

Overview of SCCNAInfer. The raw read count and, optionally, the segmentation of each cell from an existing tool are the inputs to SCCNAInfer. If the segmentation result is not provided, SCCNAInfer allows users to select a state-of-the-art method to produce it. Step 1 identifies normal cells, if any, and normalizes the raw read count. Step 2 calculates the pairwise distance between each pair of cells based on the normalized read count and the segmentation result from an existing tool. Given the pairwise distances, Step 3 clusters the cells with a hierarchical clustering approach that automatically selects the optimal cluster number. Here, K refers to the number of clusters, and E refers to the cost function. The K that minimizes E is selected. Step 4 searches for the optimal subclonal ploidy (P) for each cluster. For each cluster, the P that minimizes a cost function F is selected. Step 5 refines the read count by clustering the bins inside each cell cluster. Finally, based on the corrected read count from Step 5 and the optimal subclonal ploidy from Step 4, the absolute copy number for each cell is calculated as the output of SCCNAInfer.
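
The ploidy search in Step 4 and the final copy-number calculation can be illustrated with a toy example. The caption does not define the cost function F, so this sketch substitutes a simple stand-in: the total distance of the scaled read counts from their nearest integers. The function names, the candidate grid, and the tie-breaking rule (prefer the smallest ploidy) are likewise assumptions and should not be read as SCCNAInfer's actual procedure.

```python
def ploidy_cost(profile, ploidy):
    """Stand-in for F: how far the scaled bin values are from whole copy numbers.

    `profile` holds normalized read counts per bin (genome-wide mean of 1.0).
    """
    return sum(abs(x * ploidy - round(x * ploidy)) for x in profile)


def best_ploidy(profile, candidates):
    """Candidate with the lowest cost; ties go to the earliest (smallest) one."""
    return min(candidates, key=lambda p: ploidy_cost(profile, p))


def absolute_copy_numbers(profile, ploidy):
    """Scale normalized counts by the ploidy and round to integer copy numbers."""
    return [round(x * ploidy) for x in profile]


# Hypothetical cluster-level profile: true copy numbers 2,2,3,1,2,2 divided by
# their mean (2.0), as they would appear after read-count normalization.
cluster_profile = [1.0, 1.0, 1.5, 0.5, 1.0, 1.0]
candidates = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

ploidy = best_ploidy(cluster_profile, candidates)
copy_numbers = absolute_copy_numbers(cluster_profile, ploidy)
```

Note that ploidy 4.0 fits this profile exactly as well as 2.0, since doubling every copy number leaves the normalized counts unchanged. This ambiguity is one reason a wrong ploidy estimate causes genome-wide copy-number errors.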
]]>
Tradeoffs in alignment and assembly-based methods for structural variant detection with long-read sequencing data /valiant/2024/04/16/tradeoffs-in-alignment-and-assembly-based-methods-for-structural-variant-detection-with-long-read-sequencing-data/ Tue, 16 Apr 2024 02:58:39 +0000 /valiant/?p=2038 Liu YH, Luo C, Golding SG, Ioffe JB, Zhou XM. Nat Commun. 2024 Mar 19;15(1):2447. doi: 10.1038/s41467-024-46614-z. PMID: 38503752; PMCID: PMC10951360.

Researchers have systematically evaluated a range of tools designed to detect structural variants (SVs) in genomes using long-read sequencing, a method that provides more comprehensive genomic insights. The study compares 14 alignment-based methods, including advanced deep learning options, with four assembly-based methods, revealing that while assembly-based tools more effectively detect larger SVs and are robust against various testing conditions, alignment-based methods offer greater accuracy at lower sequencing coverages and excel at identifying complex SVs. This benchmarking effort helps users select the most suitable tools for different research scenarios and lays a foundation for future improvements in genomic analysis tools.

Complex SV detection in simulated and real cancer datasets. a Heatmap shows overall and genotyping (gt) F1 scores of translocation (TRA), inversion (INV), and duplication (DUP) detection for 10 SV calling methods on 9 simulated PacBio HiFi, CLR, and ONT datasets. b Heatmap shows recall and precision scores of somatic deletion (DEL), insertion (INS), translocation (TRA), inversion (INV), and duplication (DUP) detection for 9 SV calling methods on two publicly available sets of Tumor-Normal paired PacBio CLR and ONT libraries. Empty cells represent analyses that could not be performed (or finished within 14 days of runtime) for the tool in the corresponding row. Source data are provided as a Source Data file.
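
The precision, recall, and F1 scores shown in these heatmaps follow the standard definitions, sketched below for reference. The counts are invented, and how a called SV is matched to a truth-set SV (breakpoint tolerance, size similarity) is outside this sketch and is not specified here.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard precision, recall, and F1 from call counts.

    true_pos: called SVs that match the truth set
    false_pos: called SVs with no match in the truth set
    false_neg: truth-set SVs the caller missed
    """
    called = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / called if called else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one caller on one SV type.
precision, recall, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
```

F1 is the harmonic mean of precision and recall, so a caller scores well only when it both finds most true SVs and makes few false calls.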
]]>