Conference Program

(Each talk is 30 minutes long, followed by a 10-minute question period.)


August 15

9:20 am Light Breakfast
Breakfast provided for everyone. Located in the conference room.
9:45 am Predicting the onset of breast cancer using mammogram imaging data with irregular boundary
Jiguo Cao (Simon Fraser University - Department of Statistics and Actuarial Science)
With mammography being the primary breast cancer screening strategy, it is essential to make full use of mammogram imaging data to better identify women who are at higher and lower than average risk. Our primary goal in this study is to extract mammogram-based features that augment the well-established breast cancer risk factors to improve prediction accuracy. In this article, we propose a supervised functional principal component analysis (sFPCA) over triangulations method for extracting features that are ordered by the magnitude of association with the failure time outcome. The proposed method accommodates the irregular boundary issue posed by the breast area within the mammogram imaging data with flexible bivariate splines over triangulations. We also provide an eigenvalue decomposition algorithm that is computationally efficient. Compared to the conventional unsupervised FPCA method, the proposed method results in a lower Brier Score and higher area under the ROC curve (AUC) in simulation studies. We apply our method to data from the Joanne Knight Breast Health Cohort at Siteman Cancer Center. Our approach not only obtains the best prediction performance compared to unsupervised FPCA and benchmark models but also reveals important risk patterns within the mammogram images. This demonstrates the importance of utilizing additional supervised image-based features to clarify breast cancer risk.
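For orientation, the unsupervised FPCA baseline that this talk compares against can be sketched in a few lines of Python via an SVD of the centered image matrix. The image grid, sample size, and number of components below are hypothetical stand-ins, and the supervised, triangulation-based method itself is substantially more involved.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1024))          # 200 mammograms flattened to 1024 pixels (toy data)

    # Unsupervised FPCA baseline: eigendecomposition of the sample covariance,
    # obtained from the SVD of the centered data matrix.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    k = 5
    eigenimages = Vt[:k]                      # leading principal component images
    scores = Xc @ Vt[:k].T                    # per-image features for a survival model
    explained = s[:k] ** 2 / np.sum(s ** 2)   # proportion of variance explained

The supervised variant orders components by the strength of their association with the failure time rather than by variance explained, and the triangulated splines handle the irregular breast boundary that a rectangular pixel grid ignores.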
10:25 am Online Bayesian phylogenetic inference via sequential Monte Carlo
Liangliang Wang (Simon Fraser University - Department of Statistics and Actuarial Science)
Bayesian phylogenetics aims to approximate a posterior distribution of phylogenetic trees based on biological sequence data such as DNA and RNA. Modern technologies can generate a large amount of sequence data every day. Most existing methods in Bayesian phylogenetics are costly for streaming data because they must restart the computation from scratch each time a new sequence becomes available. In this work, we propose an efficient online Bayesian phylogenetic method that can update an existing posterior with new sequences. The proposed method is based on a sequential Monte Carlo method with a novel guided proposal distribution.
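The guided proposal for trees is specific to the paper, but the generic sequential Monte Carlo update it builds on, reweighting an existing particle approximation when a new observation arrives and resampling when the weights degenerate, can be shown on a toy one-parameter model. The Gaussian likelihood and ESS threshold below are illustrative assumptions, not the authors' algorithm.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    N = 1000
    particles = rng.normal(0.0, 3.0, size=N)    # sample from the current posterior
    weights = np.full(N, 1.0 / N)

    def smc_update(particles, weights, new_obs, sigma=1.0):
        """Online update: reweight the existing posterior sample by the
        likelihood of a newly arrived observation; resample if the
        effective sample size (ESS) collapses."""
        weights = weights * norm.pdf(new_obs, loc=particles, scale=sigma)
        weights = weights / weights.sum()
        ess = 1.0 / np.sum(weights ** 2)
        if ess < len(particles) / 2:             # weight degeneracy: resample
            idx = rng.choice(len(particles), size=len(particles), p=weights)
            particles = particles[idx]
            weights = np.full(len(particles), 1.0 / len(particles))
        return particles, weights

    for y in [1.2, 0.8, 1.1]:                    # data streaming in one at a time
        particles, weights = smc_update(particles, weights, y)

    print(np.sum(weights * particles))           # updated posterior mean

In the phylogenetic setting the particles are trees, and the talk's contribution lies in the guided proposal that extends them efficiently as new sequences arrive.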
11:05 am Coffee Break
Coffee break provided for everyone.
11:20 am Optimization in Pharmacogenomics
Ibrahim Numanagic (University of Victoria - Department of Engineering and Computer Science)
High-throughput sequencing provides the means to determine allelic decomposition, the exact sequence content of all gene copies present in a sample (also known as a haplotype), for any gene of interest. When applied to pharmaceutical genes, such decomposition can be used to inform drug treatment and dosage decisions. However, many clinically and functionally important genes are highly polymorphic and have undergone structural alterations, and as such present a significant challenge for existing genotyping methods. Here we present a combinatorial optimization framework based on integer linear programming that is able to efficiently solve this problem for various pharmacogenes, including those with structural alterations. We also show how to adapt these linear programs for the emerging long-read sequencing datasets.
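A toy version of the integer linear program conveys the idea: choose integer copy numbers for candidate star-alleles so that the implied variant coverage matches the observed read counts as closely as possible. The allele definitions, read counts, and per-copy depth below are hypothetical, and the solver is PuLP's default CBC.

    from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpInteger

    # Toy instance: three candidate star-alleles over four variant sites;
    # alleles[a][v] = 1 if allele a carries variant v (hypothetical definitions).
    alleles = {"*1": [0, 0, 0, 0], "*2": [1, 1, 0, 0], "*4": [1, 0, 1, 1]}
    observed = [30, 15, 14, 16]          # variant-supporting read counts
    depth = 15                           # expected reads per gene copy (hypothetical)
    sites = range(4)

    prob = LpProblem("allelic_decomposition", LpMinimize)
    copies = {a: LpVariable(f"c{i}", lowBound=0, upBound=4, cat=LpInteger)
              for i, a in enumerate(alleles)}
    dev = [LpVariable(f"d{v}", lowBound=0) for v in sites]

    prob += lpSum(dev)                   # minimize total absolute deviation
    prob += lpSum(copies.values()) == 2  # diploid; relaxed when copy number varies
    for v in sites:
        expected = lpSum(depth * alleles[a][v] * copies[a] for a in alleles)
        prob += observed[v] - expected <= dev[v]
        prob += expected - observed[v] <= dev[v]

    prob.solve()
    print({a: int(copies[a].value()) for a in alleles})

Structural alterations enter this same skeleton as additional copy-number and fusion variables, which is where the real problem becomes challenging.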
12:00 - 2:00 pm Lunch Break
Lunch provided for speakers and volunteers. Located in David Turpin Building first floor hallway.
2:00 pm EPPS: a Novel Ensemble Test to Improve the Power of Genomic Studies
Xuekui Zhang (University of Victoria - Department of Mathematics and Statistics)
The traditional SNP-wise test for Genome-Wide Association Studies (GWASs) often involves testing associations between a disease outcome and millions of single nucleotide polymorphisms (SNPs), and applies multiple testing adjustments afterwards to control the false positive rate. Due to the curse of dimensionality and limited sample sizes, many GWASs lack power under such an approach. We propose EPPS, a novel ensemble test procedure to increase the power of GWASs. EPPS is an ensemble two-step test procedure based on multiple random data splits, inspired by ensemble machine learning and the traditional pilot study design in genomic experiments. EPPS provides a single p-value for each SNP by integrating the results of all data splits, enabling FDR control with any standard multiple testing adjustment. Furthermore, EPPS automatically selects the values of its parameters to optimize study power via a pre-hoc power analysis.
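The abstract leaves the integration rule unspecified; one standard way to pool p-values that are dependent, as p-values from overlapping random splits are, is the Cauchy combination test, sketched below. The per-split p-values here are simulated placeholders, and this is an illustration of the general idea rather than EPPS itself.

    import numpy as np

    def cauchy_combine(pvals):
        """Pool possibly dependent p-values (e.g., from overlapping random
        data splits) into one p-value via the Cauchy combination test."""
        pvals = np.clip(np.asarray(pvals, dtype=float), 1e-15, 1 - 1e-15)
        t = np.mean(np.tan((0.5 - pvals) * np.pi))
        return 0.5 - np.arctan(t) / np.pi

    rng = np.random.default_rng(2)
    split_pvals = rng.uniform(size=50)   # placeholder p-values from 50 random splits
    print(cauchy_combine(split_pvals))   # one p-value per SNP, ready for FDR control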
2:40 pm Subtype Analysis with Cancer Somatic Mutations
Chad (Qianchuan) He (Fred Hutch - Public Health Sciences Division)
Understanding the association between cancer subtypes and genetic variations is fundamental to the development of targeted therapies for patients. Somatic mutations play important roles in tumor development and have emerged as a new type of genetic variation for studying associations with cancer subtypes. We propose an approach, SASOM, for the association analysis of cancer subtypes with somatic mutations. Our approach tests the association between a set of somatic mutations (from a genetic pathway) and subtypes, while incorporating functional information about the mutations into the analysis. In a real data application, we examine the somatic mutations from a cutaneous melanoma dataset and identify a genetic pathway that is associated with immune-related subtypes.
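SASOM's actual test statistic is not reproduced here; as a baseline illustration of set-based association with functional weights, the sketch below forms a weighted mutation burden per sample over a pathway and tests it across subtypes with a Kruskal-Wallis test. The mutation matrix, impact weights, and subtype labels are all simulated.

    import numpy as np
    from scipy.stats import kruskal

    rng = np.random.default_rng(3)
    n, m = 300, 12                               # samples, mutations in the pathway
    mut = rng.binomial(1, 0.05, size=(n, m))     # 0/1 somatic mutation matrix
    weights = rng.uniform(0.2, 1.0, size=m)      # functional-impact scores (placeholders)
    subtype = rng.integers(0, 3, size=n)         # three subtypes

    burden = mut @ weights                       # weighted pathway burden per sample
    groups = [burden[subtype == s] for s in range(3)]
    stat, p = kruskal(*groups)
    print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.3f}")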
3:20 pm Coffee Break
Coffee break provided for everyone.
3:50 pm cSurvival: a web resource for biomarker interactions in cancer outcomes and in cell lines
Xuanjin Cheng (Canada's Michael Smith Genome Sciences Centre [GSC] at BC Cancer)
Survival analysis is a technique for identifying prognostic biomarkers and genetic vulnerabilities in cancer studies. Large-scale consortium-based projects have profiled >11,000 adult and >4,000 pediatric tumor cases with clinical outcomes and multiomics approaches. This provides a resource for investigating molecular-level cancer etiologies using clinical correlations. Although cancers often arise from multiple genetic vulnerabilities and have deregulated gene sets (GSs), existing survival analysis protocols can report only on individual genes. Additionally, there is no systematic method to connect clinical outcomes with experimental (cell line) data. To address these gaps, we developed cSurvival (https://tau.cmmt.ubc.ca/cSurvival). cSurvival provides a user-adjustable analytical pipeline with a curated, integrated database and offers three main advances: (i) joint analysis with two genomic predictors to identify interacting biomarkers, including new algorithms to identify optimal cutoffs for two continuous predictors; (ii) survival analysis not only at the gene level but also at the GS level; and (iii) integration of clinical and experimental cell line studies to generate synergistic biological insights. To demonstrate these advances, we report three case studies. We confirmed findings of autophagy-dependent survival in colorectal cancers and of synergistic negative effects between high expression of SLC7A11 and SLC2A1 on outcomes in several cancers. We further used cSurvival to identify high expression of the Nrf2-antioxidant response element pathway as a main indicator of lung cancer prognosis and of cellular resistance to oxidative stress-inducing drugs. Altogether, these analyses demonstrate cSurvival’s ability to support biomarker prognosis and interaction analysis via gene- and GS-level approaches and to integrate clinical and experimental biomedical studies.
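cSurvival's two-predictor cutoff search goes beyond a short example, but the single-predictor version of the idea, scanning candidate cutoffs for a continuous biomarker and keeping the split with the strongest log-rank separation, is easy to sketch with the lifelines package on simulated data. (A minimal-p-value cutoff needs multiple-testing correction in real analyses.)

    import numpy as np
    from lifelines.statistics import logrank_test

    rng = np.random.default_rng(4)
    n = 400
    expr = rng.normal(size=n)                           # biomarker expression
    time = rng.exponential(scale=np.exp(-0.5 * expr))   # higher expression, shorter survival
    event = rng.binomial(1, 0.7, size=n).astype(bool)   # True = outcome observed

    best_cut, best_p = None, 1.0
    for q in np.linspace(0.2, 0.8, 25):                 # interior quantiles only,
        cut = np.quantile(expr, q)                      # so both groups stay sizable
        hi = expr > cut
        res = logrank_test(time[hi], time[~hi], event[hi], event[~hi])
        if res.p_value < best_p:
            best_cut, best_p = cut, res.p_value

    print(f"optimal cutoff = {best_cut:.3f}, log-rank p = {best_p:.2e}")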
4:30 pm Overview of Decipher/Veracyte Efforts in Building Genomic Classifiers via Semi-supervised Machine Learning Approaches
Yang Seagle Liu (Veracyte, Inc.)
Modern genomic tests in cancer treatment (products like Decipher for prostate and bladder cancer, and Afirma for thyroid cancer) have led to the collection of genomic profiles from a large number of patients (N > 100K). These data have become a rich resource for improving existing genomic classifiers, but it is often impractical to obtain outcome data on such a large number of patients. To leverage this large unlabelled dataset, we explore semi-supervised machine learning approaches that use both labelled and unlabelled data to train a machine learning model. This presentation gives an overview of these data as well as the approaches we explored.
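As one concrete instance of the semi-supervised setting, scikit-learn's self-training wrapper turns any probabilistic classifier into one that also learns from unlabelled profiles; rows without outcomes are flagged with the label -1. The data below are synthetic stand-ins for genomic profiles, not Decipher or Afirma data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    rng = np.random.default_rng(5)
    X = rng.normal(size=(2000, 50))               # 2000 genomic profiles, 50 features
    y_true = (X[:, 0] + X[:, 1] > 0).astype(int)  # hidden outcome

    y = np.full(2000, -1)                         # -1 marks unlabelled samples
    labelled = rng.choice(2000, size=100, replace=False)
    y[labelled] = y_true[labelled]                # outcomes known for only 100 patients

    # Self-training: fit on labelled rows, pseudo-label confident unlabelled
    # rows, and refit iteratively.
    model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
    model.fit(X, y)
    print((model.predict(X) == y_true).mean())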
5:30 - 7:30 pm Dinner
BBQ provided for everyone. Located in David Turpin Building first floor hallway.
6:00 - 9:00 pm Poster Session
Located in David Turpin Building first floor hallway.

August 16

9:20 am Light Breakfast
Breakfast provided for everyone. Located in the conference room.
9:45 am Robust Joint Modelling of Left-Censored Longitudinal Data and Survival Data, with Application to HIV Vaccine Studies
Lang Wu (University of British Columbia - Department of Statistics)
In jointly modelling longitudinal and survival data, the longitudinal data may be complex in the sense that they may contain outliers and may be left-censored. Motivated by an HIV vaccine study, we propose a robust method for joint models of longitudinal and survival data, where outliers in the longitudinal data are addressed using a multivariate t-distribution for b-outliers and an M-estimator for e-outliers. We also propose a computationally efficient method for approximate likelihood inference. The proposed method is evaluated by simulation studies. Based on the proposed models and method, we analyze the HIV vaccine data and find a strong association between longitudinal biomarkers and the risk of HIV infection.
10:25 am Establishing LOD and LOQ for samples with low copy number eDNA
Mary Lesperance (University of Victoria - Department of Mathematics and Statistics)
Quantitative real-time polymerase chain reaction (qPCR) is a popular, highly sensitive means to detect environmental DNA (eDNA) in a variety of sample matrices. Currently, there is a drive to use qPCR data to infer species biomass or abundance by quantifying the copy number or concentration of a given target gene fragment in a sample, which is often very dilute. Cycle thresholds (Ct/Cq) on multiple technical replicates have been used to quantify eDNA amounts. However, quantification of DNA copy number has been challenging when DNA is not detected in all technical replicates. Herein, we provide a statistically robust Binomial-Poisson model to create a standard curve that relates the number of qPCR-detected technical replicates to copy number, for application to low copy number samples. Limits of detection (LOD) and quantification (LOQ) and their confidence intervals are derived using a well-accepted statistical approach, thus providing a more broadly applicable and robust method for reporting eDNA abundance in the low copy number range. To date, we have applied this approach to 30+ eDNA assays from multiple labs. In this presentation, we provide a practical example of how to derive LOD and LOQ with confidence intervals and estimate copy numbers with standard errors using a standardized format and synthetic DNA to characterize an eDNA assay.
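The heart of the Binomial-Poisson model is that a technical replicate containing a Poisson(lambda) number of target copies amplifies whenever at least one copy is present, so the detection probability is 1 - exp(-lambda). Under that assumption, a simplified single-dilution version of the talk's standard-curve approach, copy number and LOD follow directly; the replicate counts below are hypothetical.

    import numpy as np

    # Detection model: a replicate is positive iff it received >= 1 copy,
    # copies ~ Poisson(lam)  =>  P(detect) = 1 - exp(-lam).
    def copies_from_detections(k, n):
        """MLE of mean copies per reaction from k positives in n replicates."""
        return -np.log(1.0 - k / n)

    k, n = 7, 8                                # 7 of 8 replicates amplified (toy numbers)
    lam = copies_from_detections(k, n)
    se_p = np.sqrt((k / n) * (1 - k / n) / n)
    se_lam = se_p / (1.0 - k / n)              # delta method: d(-log(1-p))/dp = 1/(1-p)
    print(f"estimated copies/reaction = {lam:.2f} +/- {se_lam:.2f}")

    # LOD at 95% detection: solve 1 - exp(-lam) = 0.95 for lam.
    print(f"LOD = {-np.log(0.05):.2f} copies/reaction")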
11:05 am Coffee Break
Coffee break provided for everyone.
11:20 am Residual diagnostics for censored regression via randomized survival probabilities
Longhai Li (University of Saskatchewan - Department of Mathematics and Statistics)
Residuals in normal regression are used to assess a model's goodness-of-fit (GOF) and discover directions for improving the model. However, there is a lack of residuals with a characterized reference distribution for censored regression. In this article, we propose to diagnose censored regression with normalized randomized survival probabilities (RSPs). The key idea of RSP is to replace the survival probability (SP) of a censored failure time with a uniform random number between 0 and the SP of the censored time. We prove that RSPs always have the uniform distribution on (0, 1) under the true model with the true generating parameters. Therefore, we can transform RSPs into normally distributed residuals with the normal quantile function. We call such residuals normalized RSP (NRSP) residuals. We conduct simulation studies to investigate the sizes and powers of statistical tests based on NRSP residuals in detecting an incorrect choice of distribution family and nonlinear covariate effects. Our simulation studies show that, although the GOF tests with NRSP residuals are not as powerful as a traditional GOF test method, a nonlinearity test based on NRSP residuals has significantly higher power in detecting nonlinearity. We also compared these model diagnostic methods on a breast-cancer recurrence-free time dataset. The results show that the NRSP residual diagnostics successfully capture a subtle nonlinear relationship in the dataset that is not detected by graphical diagnostics with Cox-Snell (CS) residuals or by existing GOF tests.
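The RSP construction is short enough to state in code. Assuming, for illustration, a correctly specified exponential model with a known survival function, the sketch below computes NRSP residuals: an uncensored time contributes its survival probability directly, a censored time contributes a uniform draw between 0 and the SP at the censoring time, and both are mapped to the normal scale.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)

    def nrsp_residuals(t, censored, surv):
        """Normalized randomized survival probability residuals.
        t: observed times; censored: True if right-censored;
        surv: the model's survival function S(t)."""
        sp = surv(t)
        # Censored: replace SP with Uniform(0, S(censoring time)); else keep SP.
        u = np.where(censored, rng.uniform(0.0, sp), sp)
        return norm.ppf(u)   # Uniform(0,1) under the true model -> N(0,1) residuals

    # Toy check under the true model: exponential failure times with rate 1.
    n = 500
    t_true = rng.exponential(1.0, size=n)
    c = rng.exponential(2.0, size=n)           # censoring times
    t = np.minimum(t_true, c)
    censored = c < t_true
    r = nrsp_residuals(t, censored, surv=lambda x: np.exp(-x))
    print(r.mean(), r.std())                   # approximately 0 and 1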
12:00 - 2:00 pm Lunch Break
Lunch provided for speakers and volunteers. Located in Elliot Building room 165.
2:00 pm A novel machine learning approach for gene module identification and prediction via a co-expression network of single-cell sequencing data
Li Xing (University of Saskatchewan - Department of Mathematics and Statistics)
Gene co-expression network (GCN) analysis is widely used with microarray and RNA sequencing data to group genes with correlated expression levels, which suggests similarity in function or co-regulation within a pathway. In the literature, such approaches are mainly unsupervised, which may introduce instability and variation across different datasets. Inspired by modern machine learning, we propose a novel approach that integrates supervised and unsupervised learning and simultaneously targets two tasks in the analysis of RNA sequencing data: (1) gene module identification; and (2) phenotype prediction.
The gene modules identified by this approach can enable other researchers to conduct follow-up studies to pinpoint causal genes, and the approach also improves the accuracy of phenotype prediction. Its algorithm incorporates parallel computation and other strategies to ensure that researchers can handle large-scale single-cell data on personal computers. We showcase its use for single-cell auto-annotation.
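A minimal rendering of the two tasks, module identification from a co-expression network and phenotype prediction from module summaries, might look as follows; the clustering rule and the eigengene summaries are generic stand-ins for the talk's method, and the data are simulated.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)
    expr = rng.normal(size=(500, 60))            # 500 cells x 60 genes (toy data)
    pheno = rng.integers(0, 2, size=500)         # cell label to predict

    # Task 1: gene modules from the co-expression network.
    corr = np.corrcoef(expr.T)
    dist = 1.0 - np.abs(corr)                    # co-expression distance
    Z = linkage(dist[np.triu_indices(60, k=1)], method="average")
    modules = fcluster(Z, t=5, criterion="maxclust")

    # Summarize each module by its first principal component ("eigengene").
    eigengenes = []
    for m in np.unique(modules):
        sub = expr[:, modules == m]
        sub = sub - sub.mean(axis=0)
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        eigengenes.append(sub @ vt[0])
    E = np.column_stack(eigengenes)

    # Task 2: phenotype prediction from the module summaries.
    clf = LogisticRegression(max_iter=1000).fit(E, pheno)
    print(clf.score(E, pheno))

The supervised ingredient in the talk goes further by letting the phenotype inform the module identification itself rather than only the downstream classifier.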
2:40 pm Performance of Cross-Fit and Doubly Robust Estimators for Residual Confounding Control in Pharmacoepidemiologic Studies Utilizing Administrative Healthcare Databases
Ehsan Karim (University of British Columbia - School of Population and Public Health)
Retrospective health care claims datasets are not primarily collected for research purposes, and hence studies conducted using such databases are commonly criticized for lacking complete information on potential confounders. The massive number of diagnosis, procedure, and medication codes that are regularly captured in claims databases usually goes unused in a typical epidemiological study. In effect estimation studies, the high-dimensional propensity score (hdPS) algorithm is a framework that enables us to utilize such information as proxies for unobserved information, and it has been shown to reduce bias. Some machine learning methods that can select or rank variables have been shown to be suitable alternatives to the hdPS framework. In this talk, using a health administrative cohort study as a motivating example, the performance of cross-fit and doubly robust estimators will be compared with both hdPS and machine learning methods, and practical recommendations will be discussed.
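For readers unfamiliar with the estimators being compared, a compact cross-fit AIPW (doubly robust) sketch follows: nuisance models for the outcome and the propensity score are fit on one fold and evaluated on the other, and their predictions are combined into the augmented inverse-probability-weighted score. The models and data are illustrative placeholders, not the cohort analysis from the talk.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(8)
    n = 2000
    X = rng.normal(size=(n, 20))                       # confounder proxies (e.g., claims codes)
    a = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))    # treatment assignment
    y = 1.0 * a + X[:, 0] + rng.normal(size=n)         # outcome; true effect = 1

    psi = np.zeros(n)
    for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
        # Cross-fitting: nuisance models never score their own training fold.
        ps = GradientBoostingClassifier().fit(X[train], a[train])
        out1 = GradientBoostingRegressor().fit(X[train][a[train] == 1], y[train][a[train] == 1])
        out0 = GradientBoostingRegressor().fit(X[train][a[train] == 0], y[train][a[train] == 0])

        e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)
        m1, m0 = out1.predict(X[test]), out0.predict(X[test])
        at, yt = a[test], y[test]
        # AIPW score: doubly robust combination of the two nuisance models.
        psi[test] = m1 - m0 + at * (yt - m1) / e - (1 - at) * (yt - m0) / (1 - e)

    print(f"cross-fit AIPW estimate: {psi.mean():.2f} (SE {psi.std(ddof=1) / np.sqrt(n):.2f})")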
3:20 pm Coffee Break
Coffee break provided for everyone.
3:50 pm Neural Network Classifiers for Features Extraction in Neuroimaging Genetics
Farouk Nathoo (University of Victoria - Department of Mathematics and Statistics)
A major issue in associating genes with imaging phenotypes is the high dimension of both the genetic data and the imaging data. In this article, we tackle the latter problem. Our proposed solution is supported by a vast literature and uses the great predictive power of neural networks to extract features highly related to Alzheimer's Disease (AD) diagnosis. A key concept of our work is a neuroimaging genetics pipeline that separates the neuroimaging genetic association into three distinct steps: image processing, neuroimaging feature extraction, and the genetic association study. In this article, we discuss the use of a neural network classifier to accomplish the second step: extracting neuroimaging features that are related to AD. We compare the predictive power of those features to that of expert-selected features before running a genetic association study and take a closer look at the genes identified with these new neuroimaging features.
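The second pipeline step, using a classifier's internal representation as the imaging phenotype, can be illustrated with scikit-learn: train a multilayer perceptron to predict diagnosis, then read off the last hidden layer's activations as the extracted features. The sizes and data below are hypothetical.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(9)
    X = rng.normal(size=(600, 200))          # 600 subjects x 200 imaging measures (toy)
    diagnosis = rng.integers(0, 2, size=600) # AD vs control (placeholder labels)

    clf = MLPClassifier(hidden_layer_sizes=(64, 16), max_iter=500).fit(X, diagnosis)

    # Manual forward pass up to the last hidden layer: these 16 activations are
    # the learned neuroimaging features handed to the genetic association step.
    h = X
    for W, b in zip(clf.coefs_[:-1], clf.intercepts_[:-1]):
        h = np.maximum(h @ W + b, 0.0)       # ReLU, MLPClassifier's default activation
    print(h.shape)                           # (600, 16)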
4:30 pm Using CVX to construct optimal designs for biomedical studies with multiple objectives
Julie Zhou (University of Victoria - Department of Mathematics and Statistics)
Model-based optimal designs for regression problems with multiple objectives are common in practice. The traditional approach is to construct an optimal design for the most important objective and hope that the design performs well for the other objectives. Analytical approaches are challenging because the objectives are often competing and their relative importance has to be incorporated at the outset of the design construction. There are also no general and efficient algorithms for finding such designs for user-specified nonlinear models and criteria. We propose a new and effective approach for finding multiple-objective optimal designs via the CVX software and demonstrate that it can efficiently find different types of multiple-objective optimal designs once the optimization problems are carefully formulated as convex optimization problems appropriate for CVX. We provide biomedical applications and show that our MATLAB code works well. This is joint work with Professor Weng Kee Wong.
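CVX itself is a MATLAB package; as a Python stand-in built on the same convex-programming idea, the sketch below computes an approximate D-optimal design (a single-objective special case) with cvxpy by maximizing the log-determinant of the information matrix over the design weights. The quadratic model and design grid are toy choices.

    import numpy as np
    import cvxpy as cp

    # Candidate design points for quadratic regression E[y] = b0 + b1*x + b2*x^2.
    xs = np.linspace(-1, 1, 21)
    F = np.column_stack([np.ones_like(xs), xs, xs ** 2])   # regression vectors f(x)

    w = cp.Variable(len(xs), nonneg=True)                  # design weights on the grid
    M = F.T @ cp.diag(w) @ F                               # information matrix M(w)
    problem = cp.Problem(cp.Maximize(cp.log_det(M)), [cp.sum(w) == 1])
    problem.solve()

    keep = w.value > 1e-4
    print("support points:", np.round(xs[keep], 3))        # expect -1, 0, 1
    print("weights:", np.round(w.value[keep], 3))          # expect roughly 1/3 each

A second objective enters the same program as a convex constraint, for example a lower bound on the design's efficiency under another criterion, applied to the same weight variable.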
6:30 - 9:30 pm Dinner
Dinner provided for speakers and volunteers. Located at 1550's Pub Style Restaurant.

August 17

9:20 am Light Breakfast
Breakfast provided for everyone. Located in the conference room.
9:45 am A Bayesian adaptive multilayer basis model for task fMRI data
Michelle Miranda (University of Victoria - Department of Mathematics and Statistics)
Task-evoked functional magnetic resonance imaging (fMRI) studies are a powerful tool to understand human sensory, cognitive, and emotional processes. We introduce a new Bayesian approach for analyzing task fMRI data that simultaneously detects activation signatures and background connectivity. Our modeling involves a new tensor spatial-temporal basis strategy that enables scalable computing yet captures spatial correlation from nearby voxels and distant ROIs, as well as long-memory temporal correlation. The spatial basis involves a composite hybrid transform with two levels: the first accounts for within-ROI correlation and the second for distant between-ROI correlation. We demonstrate in simulations how our basis-space regression modeling strategy increases sensitivity for identifying activation signatures, partly driven by the induced background connectivity, which itself can be summarized to reveal biological insights. This strategy leads to computationally scalable, fully Bayesian inference at the voxel or ROI level that adjusts for multiple testing.
10:25 am Bayesian Adaptive Dose-Finding Design in Phase I/II trial
Haolun Shi (Simon Fraser University - Department of Statistics and Actuarial Science)
Molecularly targeted agents and immunotherapy have revolutionized modern cancer treatment. Unlike chemotherapy, the maximum tolerated dose of a targeted therapy may not confer a significant clinical benefit over lower doses. By simultaneously considering binary toxicity and efficacy endpoints, phase I/II trials can identify a better dose for subsequent phase II trials than traditional phase I trials in terms of the efficacy-toxicity tradeoff. Existing phase I/II dose-finding methods are model-based or need to pre-specify many design parameters, which makes them difficult to implement in practice. To strengthen and simplify the current practice of phase I/II trials, we propose a utility-based toxicity probability interval (uTPI) design for finding the optimal biological dose (OBD) where binary toxicity and efficacy endpoints are observed. The uTPI design is model-assisted in nature, simply modeling the utility outcomes observed at the current dose level based on a quasibinomial likelihood. Toxicity probability intervals are used to screen out overly toxic dose levels, and dose escalation/de-escalation decisions are then made adaptively by comparing the posterior utility distributions of the doses adjacent to the current dose. The uTPI design is flexible in accommodating various utility functions while requiring only a minimal number of design parameters. A prominent feature of the uTPI design is its simple decision structure: a concise dose-assignment decision table can be calculated before the start of the trial and used throughout, which greatly simplifies practical implementation. Extensive simulation studies demonstrate that the proposed uTPI design yields desirable and robust performance under various scenarios.
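A stripped-down illustration of the uTPI decision logic: utilities observed at each dose are treated as quasibinomial, giving a Beta posterior for the mean utility; doses that are likely overly toxic are screened out; and the next dose is the admissible neighbour with the largest posterior utility. The priors, thresholds, and counts below are illustrative choices, not the design's calibrated values.

    import numpy as np
    from scipy.stats import beta

    def utpi_next_dose(d, n, util_sum, n_tox, tox_target=0.3, cutoff=0.95):
        """Pick the next dose among {d-1, d, d+1} (0-indexed).
        n[j]: patients at dose j; util_sum[j]: summed utilities in [0, 1];
        n_tox[j]: toxicities at dose j."""
        candidates = [j for j in (d - 1, d, d + 1) if 0 <= j < len(n)]
        # Toxicity screen: drop doses with Pr(p_tox > target) >= cutoff
        # under a Beta(1 + tox, 1 + n - tox) posterior.
        admissible = [j for j in candidates
                      if beta.sf(tox_target, 1 + n_tox[j], 1 + n[j] - n_tox[j]) < cutoff]
        if not admissible:
            return max(0, d - 1)                      # de-escalate if all look too toxic
        # Quasibinomial utility: Beta(1 + sum u, 1 + n - sum u); compare posterior means.
        post_mean = [(1 + util_sum[j]) / (2 + n[j]) for j in admissible]
        return admissible[int(np.argmax(post_mean))]

    # Example: five dose levels, currently at dose index 2.
    n        = np.array([3, 6, 9, 0, 0])
    util_sum = np.array([0.9, 3.1, 5.6, 0.0, 0.0])
    n_tox    = np.array([0, 1, 2, 0, 0])
    print("next dose index:", utpi_next_dose(2, n, util_sum, n_tox))

Because every decision depends only on the counts at the current dose and its neighbours, the whole rule can be tabulated in advance, which is the decision-table property the abstract highlights.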
11:05 am Coffee Break
Coffee break provided for everyone.