Conference in Statistics and Data Science with Applications in Biology, Genetics, Public Health, and Finance

August 23 - 24, 2023

Thompson Rivers University, Kamloops, BC

Conference Program

Each talk will be 30 minutes long, followed by 10 minutes of Q&A.

August 23, 2023

8:30 am:

Light Breakfast

Breakfast will be provided to everyone.

9:00 am:

Welcome Event

9:40 am:

Information-Rich Environments for Population Health Research: Challenges and Opportunities

Lisa Lix, Professor and Canada Research Chair, Max Rady College of Medicine, Community Health Sciences, University of Manitoba

Canada's health information environment is expanding rapidly to integrate administrative data, electronic medical records, and clinical registries, and to incorporate methods and tools to mine these data. Despite extraordinary opportunities for research aimed at improving the health of populations, there are also significant barriers, including less-than-optimal data quality, data silos, and limited analytic training opportunities. As a consequence, the value of these data is not fully realized. This talk will highlight initiatives to improve data fitness for use, expand data collaborations and networks, and build capacity in machine learning and advanced data visualization methods.

10:20 am:

Evidence changes beliefs and measuring change in beliefs measures evidence

Michael Evans, Professor, Department of Statistical Sciences, University of Toronto

The concept of statistical evidence is central to the field of statistics. In spite of this, it is reasonable to conclude, based on current debates, that the field has not settled on exactly how to define and measure statistical evidence. There is, however, a fairly simple characterization whenever a prior is incorporated into the analysis. In that context the Bayes factor seems to provide this, but there are still concerns that need to be addressed. These issues are discussed together with their resolution. Overall, this necessitates an approach to statistical problems that combines Bayesian and frequentist attributes to provide a logically sound methodology. We achieve not only a relatively simple characterization of statistical evidence, but also a resolution between what can be considered very different views of statistics and its role in science.

11:00 am:

Coffee Break

Coffee will be provided to everyone.

11:20 am:

An exploration of the relationship between wastewater viral signals and COVID-19 hospitalizations in Ottawa, Canada

Charmaine Dean, Vice-President Research, Professor, Department of Statistics and Actuarial Science, University of Waterloo

Monitoring of viral signal in wastewater is considered a useful tool for tracking the burden of COVID-19, especially during times of limited testing availability. Studies have shown that COVID-19 hospitalizations are highly correlated with wastewater viral signals and that increases in wastewater viral signals can provide an early warning of increasing hospital admissions. However, the association between wastewater viral signals and COVID-19 hospitalizations may not be linear or consistent over time. A clear understanding of the time-varying and nonlinear association between the two is therefore necessary. This project uses a distributed lag nonlinear model (DLNM) (Gasparrini et al., 2010) to study the delayed, nonlinear exposure-response association between SARS-CoV-2 wastewater viral signals and COVID-19 hospitalizations, using data from Ottawa, Ontario, Canada. We consider time lags of up to 15 days from the average of two SARS-CoV-2 gene concentrations and their contribution to COVID-19 hospitalizations. We also include an adjustment for the expected reduction in hospitalizations from vaccination efforts. A correlation analysis of the data verifies that COVID-19 hospitalizations are highly correlated with wastewater viral signals, with a time-varying relationship. Our analysis with the DLNM suggests that the model yields a reasonable estimate of COVID-19 hospitalizations and enhances our understanding of the association between wastewater viral signals and COVID-19 hospitalizations. This work quantifies the relationship between SARS-CoV-2 gene concentrations and COVID-19 hospitalizations at the population level.
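For readers unfamiliar with the DLNM framework, a minimal sketch of the model form (our notation, not taken verbatim from the talk) is:

```latex
% Quasi-Poisson DLNM sketch (illustrative notation): y_t = hospitalizations on
% day t, x_t = averaged wastewater viral signal, L = 15 = maximum lag in days,
% v_t = adjustment for the expected effect of vaccination.
\[
  y_t \sim \text{quasi-Poisson}(\mu_t), \qquad
  \log \mu_t = \alpha + \sum_{\ell=0}^{L} f(x_{t-\ell}, \ell) + \gamma\, v_t ,
\]
where $f(\cdot,\cdot)$ is a bivariate cross-basis (e.g., splines in both the exposure
and lag dimensions) that captures the nonlinear, delayed exposure-response surface.
```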

12:00 noon:

Lunch Break

Lunch will be provided to the invited speakers and volunteers.

1:30 pm:

Pretest and shrinkage estimators in generalized partially linear models with application to real data

Shakhawat Hossain, Professor, Department of Mathematics and Statistics, University of Winnipeg

Semiparametric models hold promise to address many challenges to statistical inference that arise from real-world applications, but their novelty and theoretical complexity create challenges for estimation. Taking advantage of the broad applicability of semiparametric models, we propose some novel and improved methods to estimate the regression coefficients of generalized partially linear models (GPLM). This model extends the generalized linear model by adding a nonparametric component. As in parametric models, variable selection is important in the GPLM to single out the covariates that are inactive for the response. Instead of deleting inactive covariates, our approach uses them as auxiliary information in the estimation procedure. We define two models, one that includes all the covariates and another that includes only the active covariates, and combine their estimators optimally to form the pretest and shrinkage estimators. Asymptotic properties are studied to derive the asymptotic biases and risks of the proposed estimators. We show that if the shrinkage dimension exceeds two, the asymptotic risks of the shrinkage estimators are strictly less than those of the full-model estimator. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed estimation methods. We then apply the proposed methods to two real data sets. Our simulation and real-data results show that the proposed estimators estimate the regression parameters of the GPLM with higher accuracy and lower variability than competing estimation methods.
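As a generic illustration of how the two model estimators are combined (Stein-type notation of our own; the talk's exact estimators may differ in detail):

```latex
% Pretest and Stein-type shrinkage estimators (illustrative notation):
% beta_FM = full-model estimator, beta_SM = submodel estimator,
% T_n = test statistic for the submodel restriction, k = shrinkage dimension.
\[
  \hat{\beta}^{\mathrm{PT}}
    = \hat{\beta}_{\mathrm{FM}}
      - \bigl(\hat{\beta}_{\mathrm{FM}} - \hat{\beta}_{\mathrm{SM}}\bigr)\, I(T_n \le c_{\alpha}),
  \qquad
  \hat{\beta}^{\mathrm{S}}
    = \hat{\beta}_{\mathrm{SM}}
      + \Bigl(1 - \frac{k-2}{T_n}\Bigr)\bigl(\hat{\beta}_{\mathrm{FM}} - \hat{\beta}_{\mathrm{SM}}\bigr),
\]
which is consistent with the requirement that the shrinkage dimension exceed two.
```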

2:10 pm:

Feedback mechanisms in epidemic models: Is your population alarmed?

Rob Deardon, Professor, Faculty of Veterinary Medicine and Department of Mathematics and Statistics, University of Calgary

The COVID-19 pandemic has illustrated both the utility and the limitations of using epidemic models for understanding and forecasting disease spread. One of the many difficulties in modelling epidemic spread is that caused by behavioural change in the underlying population. This can be a major issue in public health since, as we have seen during the COVID-19 pandemic, behaviour in the population can change drastically as infection levels vary, both due to government mandates and personal decisions. Such changes in the underlying population result in major changes in the transmission dynamics of the disease, making modelling challenging. These issues also arise in agriculture, as changes in farming practice are often observed as disease prevalence changes. We propose a model formulation in which time-varying transmission is captured by the level of alarm in the population, specified as a function of the past epidemic trajectory. The model is set in a data-augmented Bayesian framework, since epidemic data are often only partially observed, and we can utilize prior information to help with parameter identifiability. We investigate the estimability of the population alarm across a wide range of scenarios, using both parametric functions and nonparametric Gaussian processes and splines. The benefit and utility of the proposed approach are illustrated through an application to COVID-19 data from New York City.
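As a schematic of the feedback mechanism (our notation; the talk considers both parametric and nonparametric alarm functions):

```latex
% Alarm-modulated transmission rate (illustrative notation): beta_0 is the
% baseline transmissibility, I_{t-1},...,I_{t-m} are recent infection counts,
% and a(.) in [0,1] is the population alarm function.
\[
  \beta_t = \beta_0 \bigl( 1 - a(I_{t-1}, \ldots, I_{t-m}) \bigr),
  \qquad 0 \le a(\cdot) \le 1,
\]
so transmission is suppressed as the alarm rises in response to the past epidemic trajectory.
```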

2:50 pm:

Coffee Break

Coffee will be provided to everyone.

3:10 pm:

Reducing Residual Confounding Bias in Health Research: Evaluating High-Dimensional Propensity Score Algorithm and Machine Learning Extensions

M. Ehsan Karim, Assistant Professor, School of Population and Public Health, University of British Columbia

Health research that utilizes administrative health databases may lack complete information on confounding variables. This limitation can lead to bias in treatment effect estimation. The high-dimensional propensity score (hdPS) algorithm has been proposed as a solution to reduce this bias by using proxies for unmeasured and mismeasured covariates. However, the hdPS framework involves a large amount of data, and machine learning variable selection methods have been proposed as alternatives. Despite this, accurately estimating variance remains a challenge, even for doubly robust or targeted maximum likelihood estimators. In this study, we evaluate the performance of the vanilla hdPS, its machine learning alternatives, and doubly robust versions in terms of bias, model-based and empirical variances, and coverage, based on updated methodological recommendations. We also present a nationally representative analysis as a motivating example and provide practical recommendations for practitioners.
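The hdPS algorithm itself generates and prioritizes thousands of proxy covariates from administrative codes; as a much-simplified sketch of only the downstream propensity-score step (function and variable names are hypothetical), one could estimate the score and apply inverse-probability-of-treatment weighting as follows:

```python
# Minimal sketch of propensity-score weighting (not the full hdPS algorithm):
# estimate P(treatment | covariates) and form inverse-probability weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

def iptw_effect(X, treated, outcome):
    """Weighted difference in mean outcomes between treated and untreated.

    X        : (n, p) array of measured covariates and selected proxies
    treated  : (n,) binary treatment indicator
    outcome  : (n,) observed outcome
    """
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    w = np.where(treated == 1, 1.0 / ps, 1.0 / (1.0 - ps))  # IPT weights
    mu1 = np.average(outcome[treated == 1], weights=w[treated == 1])
    mu0 = np.average(outcome[treated == 0], weights=w[treated == 0])
    return mu1 - mu0  # crude weighted mean difference
```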

3:50 pm:

A short journey through remote sensing, high-dimensional data, and environmental sciences

Max Turgeon, Data Scientist, Tesera Systems, and Adjunct Professor, Department of Statistics, University of Manitoba

Recent technological advances have led to a plethora of new complex, high-dimensional datasets. These complex datasets open the door to new applications, but they also come with statistical and computational challenges. In this talk, we will focus on how remote sensing is used in the environmental sciences. Specifically, we will look at how satellite-based hyperspectral imagery, LiDAR, and radar can be used to manage forested areas, to track the impact of drought and wildfires, and to monitor ecosystems. Our main case study is the use of functional data analysis and temporal hyperspectral imagery to identify tree species in Northwestern Ontario. Throughout the talk, we will also discuss the availability of open-access remote sensing data, so that members of the audience can start exploring this exciting area of application.

4:30 pm:

Joint modeling of longitudinal and time-to-event data

Shahedul Khan, Associate Professor, Department of Mathematics and Statistics, University of Saskatchewan

In follow-up studies, both fixed and time-dependent covariates are often available along with an observation of the time to an event of interest. A time-dependent covariate can be either internal (generated by the unit under study and measured longitudinally over time) or external (changing its values due to external characteristics). For example, in HIV clinical trials, datasets typically contain not only the time to progression to AIDS or death but also information on several covariates, such as treatment assignment, demographic information, and physiological characteristics recorded at baseline, and CD4 cell counts taken at subsequent clinic visits (longitudinal measurements). The study is typically designed not only to explore the effects of the fixed covariates on the time-to-event but also to (i) understand within-subject patterns of change in the internal covariate (longitudinal response), and/or (ii) characterize the relationship between features of the internal covariate and the time to the occurrence of the event. The complications posed by the realities of the observed data, and the potential for biased inferences for both (i) and (ii) if ordinary techniques are applied, have led to the development of a new approach, called "joint models" for longitudinal and time-to-event data. The joint modeling approach works in two steps: (a) modeling the internal covariate, taking into account measurement error, to construct its complete history, and (b) quantifying the effects of the covariates on the time-to-event, taking into account the history constructed in (a). Combining the two processes enables us to borrow information mutually from each process and gain efficiency in statistical inference. In this talk, the rationale and background of joint modeling will be discussed, and our recent contributions will be presented.
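A standard shared-parameter formulation of such a joint model (a generic sketch in our notation, not necessarily the speaker's exact specification) links the two submodels through the subject-specific trajectory:

```latex
% Shared-parameter joint model (generic notation): y_i(t) is the observed
% longitudinal measurement for subject i, m_i(t) its error-free trajectory,
% b_i the random effects, and h_i(t) the hazard of the event.
\[
  y_i(t) = m_i(t) + \varepsilon_i(t), \qquad
  m_i(t) = x_i^{\top}(t)\beta + z_i^{\top}(t) b_i, \qquad
  h_i(t) = h_0(t) \exp\bigl\{ w_i^{\top}\gamma + \alpha\, m_i(t) \bigr\},
\]
where the association parameter $\alpha$ quantifies how the current value of the
internal covariate affects the hazard of the event.
```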

5:15 pm:

Poster Session

By registered students

August 24, 2023

8:30 am:

Light Breakfast

Breakfast will be provided to everyone.

9:00 am:

Gaussian Mixture Reduction based on Composite Transportation Divergence

Jiahua Chen, Professor, Department of Statistics, University of British Columbia

Gaussian mixtures can closely approximate almost any smooth density function, and they are used to simplify downstream inference tasks. As such, they are widely used in applications such as density estimation, belief propagation, and Bayesian filtering. In these applications, a finite Gaussian mixture provides an initial approximation to density functions that are updated recursively. A challenge in these recursions is that the order of the Gaussian mixture increases exponentially, and the inference quickly becomes intractable. To overcome this difficulty, Gaussian mixture reduction, which approximates a high-order Gaussian mixture by one of lower order, can be used. Existing methods, such as clustering-based approaches, are renowned for their satisfactory performance and computational efficiency, but their convergence properties and the targets they optimize are not well understood. Directly searching for a lower-order Gaussian mixture that minimizes some divergence from the original mixture usually involves a challenging optimization problem. We propose a novel optimization-based Gaussian mixture reduction method. We find that a composite transportation divergence is particularly suited to the reduction problem: it facilitates an easy-to-implement and effective majorization-minimization algorithm for its numerical solution. We further establish theoretical convergence under general conditions. We show that many existing clustering-based methods are special cases of ours, thus bridging the gap between optimization-based and clustering-based methods. The unified framework allows users to choose the most suitable cost function to achieve superior performance in their specific application. We validate the efficiency and effectiveness of the proposed method through extensive empirical experiments. (Based on the thesis of Qiong Zhang.)
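For orientation, one way to write a composite transportation divergence between the original order-N mixture and a reduced order-M mixture (our notation; the talk's formulation may relax or modify the marginal constraints) is:

```latex
% Composite transportation divergence (illustrative notation): the original
% mixture has components phi_1..phi_N with weights w, the reduced mixture has
% components tilde-phi_1..tilde-phi_M with weights tilde-w, and c(.,.) is a
% user-chosen cost (divergence) between Gaussian components.
\[
  \mathcal{T}_c\Bigl(\textstyle\sum_i w_i \phi_i,\; \sum_j \tilde{w}_j \tilde{\phi}_j\Bigr)
  = \min_{\pi \ge 0} \sum_{i=1}^{N}\sum_{j=1}^{M} \pi_{ij}\, c(\phi_i, \tilde{\phi}_j)
  \quad \text{s.t.} \quad
  \sum_{j} \pi_{ij} = w_i, \qquad \sum_{i} \pi_{ij} = \tilde{w}_j .
\]
Minimizing this over the reduced mixture leads to majorization-minimization updates
whose assignment and merge steps mirror the familiar clustering-based reduction algorithms.
```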

9:40 am:

Fast, Distributed Bayesian Inference for Everyone

Alexandre Bouchard-Côté, Professor, Department of Statistics, University of British Columbia

Bayesian statistics has the potential to be the data scientist's Swiss Army knife. In areas where the data types and the questions posed are highly varied, Bayes estimators adapt to the problem at hand. This contrasts with more rigid statistical methodologies where the problem is adapted to the statistical tools. I will describe current work motivated by this vision. One question my group investigates is how to scale Bayesian inference using distributed architectures. I will describe novel perspectives on this old problem coming from the nascent field of non-reversible Monte Carlo methods. In particular, I will present an adaptive, non-reversible parallel tempering scheme allowing MCMC exploration of challenging problems such as single-cell phylogenetic trees. My group is working on making these advanced Monte Carlo methods easy to use: we have developed Blang, a Bayesian modelling language to perform inference over arbitrary data types using non-reversible, highly parallel algorithms, and Pigeons, a package allowing the user to leverage clusters of thousands of nodes to speed up difficult Monte Carlo problems without requiring knowledge of distributed algorithms.
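As background on the tempering idea (a textbook sketch in our notation; the talk focuses on adaptive, non-reversible variants of this scheme):

```latex
% Annealing path and swap acceptance in basic parallel tempering
% (background sketch; pi_0 = reference, e.g. the prior, pi_1 = posterior).
\[
  \pi_{\beta}(x) \propto \pi_0(x)^{1-\beta}\, \pi_1(x)^{\beta},
  \qquad 0 = \beta_1 < \cdots < \beta_K = 1,
\]
\[
  \alpha_{\mathrm{swap}} = \min\!\left\{ 1,\;
    \frac{\pi_{\beta_k}(x_{k+1})\, \pi_{\beta_{k+1}}(x_k)}
         {\pi_{\beta_k}(x_k)\, \pi_{\beta_{k+1}}(x_{k+1})} \right\},
\]
where chain $k$ targets $\pi_{\beta_k}$ and neighbouring chains propose to swap states;
non-reversible schemes organize these swaps deterministically to improve round-trip rates.
```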

10:20 am:

Post-selection estimation and prediction strategies in linear mixed models for high-dimensional data application

S. Ejaz Ahmed, Professor, Department of Mathematics and Statistics, Brock University

In high-dimensional settings, where the number of predictors exceeds the number of observations, many penalized methods have been introduced for simultaneous variable selection and parameter estimation when the model is sparse. However, a model may have sparse signals as well as a number of predictors with weak signals. In this scenario, variable selection methods may not distinguish predictors with weak signals from those with sparse signals, and prediction based on a selected submodel may not be preferable. For this reason, we propose a high-dimensional shrinkage strategy to improve the prediction performance of a submodel in linear mixed effects models. Such a high-dimensional shrinkage estimator (HDSE) is constructed by shrinking a ridge estimator in the direction of a candidate submodel. We demonstrate that the proposed HDSE performs uniformly better than the ridge estimator. Interestingly, it improves the prediction performance of a given submodel generated from most existing variable selection methods. The relative performance of the proposed HDSE strategy is appraised through both simulation studies and real data analysis. The methodology is demonstrated on a longitudinal resting-state functional magnetic resonance imaging (rs-fMRI) effective brain connectivity network and genetic data. Some open research problems will be discussed as well.
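A conceptual sketch of the shrinkage construction (hypothetical function names and a fixed illustrative shrinkage weight; the actual HDSE uses a data-driven shrinkage factor):

```python
# Conceptual sketch of a high-dimensional shrinkage estimator: shrink a ridge
# fit on all predictors toward a candidate submodel fit.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

def shrinkage_coefficients(X, y, alpha_ridge=1.0, alpha_lasso=0.1, shrink=0.5):
    n, p = X.shape
    beta_ridge = Ridge(alpha=alpha_ridge).fit(X, y).coef_

    # Candidate submodel: predictors selected by lasso, refit with ridge
    # restricted to the selected set (zeros elsewhere).
    selected = np.flatnonzero(Lasso(alpha=alpha_lasso).fit(X, y).coef_)
    beta_sub = np.zeros(p)
    if selected.size > 0:
        beta_sub[selected] = Ridge(alpha=alpha_ridge).fit(X[:, selected], y).coef_

    # Shrink the ridge estimator in the direction of the submodel estimator.
    return beta_sub + shrink * (beta_ridge - beta_sub)
```

With `shrink` between 0 and 1, the result interpolates between the submodel fit and the full ridge fit; the talk's estimator replaces this fixed weight with one estimated from the data.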

11:00 am:

Coffee Break

Coffee will be provided to everyone.

11:20 am:

Graphical proportional hazards models with measurement error

Grace Y. Yi, Professor and Canada Research Chair in Data Science (Tier 1), Department of Statistical and Actuarial Sciences, University of Western Ontario

In survival data analysis, the Cox proportional hazards (PH) model is perhaps the most widely used model for describing the dependence of survival times on covariates. While many inference methods have been developed under this model or its variants, those models are not adequate for handling data with complex structured covariates. High-dimensional survival data often entail several features: (1) many covariates are inactive in explaining the survival information, (2) active covariates are associated in a network structure, and (3) some covariates are error-contaminated. To handle such survival data, we propose graphical PH measurement error models and develop inferential procedures for the parameters of interest. Our proposed models significantly enlarge the scope of the usual Cox PH model and have great flexibility in characterizing survival data. Theoretical results are established to justify the proposed methods. Numerical studies are conducted to assess the performance of the proposed methods.
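For context, the two basic ingredients being combined, written in generic notation, are the Cox PH hazard and a classical additive measurement error model for the error-prone covariates:

```latex
% Cox PH hazard with error-contaminated covariates (generic notation):
% X = true covariates, W = observed surrogates, U = measurement error.
\[
  h(t \mid X) = h_0(t)\, \exp\bigl(\beta^{\top} X\bigr), \qquad
  W = X + U, \quad U \perp X,
\]
with the network structure among the active covariates encoded by a graphical model
(e.g., a sparse precision matrix) on $X$.
```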

12:00 noon:

Lunch Break

Lunch will be provided to the invited speakers and volunteers.

1:30 pm:

Rank-based support vector machines for highly imbalanced data using nominated samples

Mohammad Jafari Jozani, Professor, Department of Statistics, University of Manitoba

We propose a novel approach to address the issue of highly imbalanced binary classification problems using rank information. Our approach utilizes a maxima nominated sampling technique that biases the training sample towards the minority class by using the observations with the highest chance of being from the minority class in small samples of randomly selected units from the underlying population. This sampling technique is based on expert opinion, which has received minimal attention in the machine learning community so far. To incorporate the extra rank information of maxima nominated samples (MaxNS) into the learning process, we introduce novel rank-based hinge and logistic loss functions that account for the extra rank information in MaxNS training data sets. We develop MaxNS support vector machines and provide efficient algorithms for solving the proposed learning problems. Numerical studies are performed to validate the efficacy of the proposed methods.
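As a rough illustration of the sampling step only (hypothetical names; the actual MaxNS design relies on expert ranking rather than a computed score), each training unit is obtained by keeping only the top-ranked unit from a small randomly selected set:

```python
# Illustrative sketch of maxima nominated sampling (MaxNS): from each set of
# randomly drawn units, keep only the unit ranked highest by an inexpensive
# expert-style score, biasing the sample toward the (rare) minority class.
import numpy as np

rng = np.random.default_rng(0)

def maxima_nominated_sample(population, rank_score, n_sets, set_size):
    """population : (N, p) array; rank_score : cheap score, higher = more
    likely minority; returns indices of the nominated units."""
    chosen = []
    for _ in range(n_sets):
        candidates = rng.choice(len(population), size=set_size, replace=False)
        scores = rank_score(population[candidates])
        chosen.append(candidates[np.argmax(scores)])  # nominate the maximum
    return np.array(chosen)
```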

2:10 pm:

A novel machine learning approach for gene module identification and prediction via a co-expression network of single-cell sequencing data

Li Xing, Assistant Professor, Department of Mathematics and Statistics, University of Saskatchewan

Gene co-expression network analysis is widely used in microarray and RNA sequencing data analysis. It groups genes based on their co-expression network, and genes within a group are inferred to share similar function or co-regulation in a pathway. In the literature, the approaches used to group genes are mainly unsupervised, which may introduce instability and variation across different datasets. Inspired by ensemble learning, we propose a novel approach that ensembles supervised and unsupervised learning techniques and works simultaneously on two tasks, gene module identification and phenotype prediction, during the data analysis process. The gene modules identified by this approach can suggest additional candidate genes for the original pathway, and those genes are potential biomarkers for pathway-related diseases. In addition, the novel approach also improves prediction accuracy for phenotypes. The algorithm can be used as a general prediction algorithm, and, as it is specially designed to handle large samples, it is suitable for single-cell data with many cells. We showcase the use of the algorithm in single-cell cell-type auto-annotation.

2:50 pm:

Coffee Break

Coffee will be provided to everyone.

3:10 pm:

The scalable birth-death MCMC algorithm for mixed graphical model learning with application to genomic data integrations

Nanwei Wang, Assistant Professor, Department of Mathematics and Statistics, University of New Brunswick

Recent advances in biological research have seen the emergence of high-throughput technologies with numerous applications. In cancer research, the challenge is now to perform integrative analyses of high-dimensional multi-omic data with the goal of better understanding the genomic processes that correlate with cancer outcomes. We propose a novel mixed graphical model approach to analyze multi-omic data of different types (continuous, discrete, and count) and perform model selection by extending the birth-death MCMC (BDMCMC) algorithm. We compare the performance of our method to the LASSO and the standard BDMCMC methods using simulations and find that our method is superior in terms of both computational efficiency and the accuracy of the model selection results. Finally, an application to the TCGA breast cancer data shows that integrating genomic information at different levels (mutation and expression data) leads to better subtyping of breast cancers.

3:50 pm:

Nonparametric high-dimensional multi-sample tests based on graph theory

Xiaoping Shi, Assistant Professor, Department of Computer Science, Mathematics, Physics and Statistics, University of British Columbia Okanagan

High-dimensional data pose unique challenges for data processing in an era of ever-increasing data availability. Graph theory can provide structure for high-dimensional data. We introduce two key properties desirable for graphs used in testing homogeneity. Roughly speaking, these properties may be described as unboundedness of edge counts under the same distribution and boundedness of edge counts under different distributions. It turns out that the minimum spanning tree violates these properties, but the shortest Hamiltonian path possesses them. Based on the shortest Hamiltonian path, we propose two combinations of edge counts in multiple samples to test homogeneity. We give the permutation null distributions of the proposed statistics as sample sizes go to infinity. The power is analyzed by assuming that both sample sizes and dimensionality tend to infinity. Simulations show that our new tests behave very well overall in comparison with various competitors. Real data analyses of tumors and images further confirm the value of our proposed tests. Software implementing the tests is available in the R package GRelevance.
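As a simplified illustration of the graph-based testing idea (a generic between-sample edge count on an approximate shortest Hamiltonian path with a permutation p-value; not the exact statistics in the talk or in the GRelevance package):

```python
# Sketch of a graph-based two-sample test: build an approximate shortest
# Hamiltonian path on the pooled sample (greedy nearest neighbour), count the
# edges joining observations from different samples, and compare with the
# permutation distribution of that count. Few between-sample edges suggest
# the samples differ. Simplified illustration only.
import numpy as np
from scipy.spatial.distance import cdist

def greedy_hamiltonian_path(D):
    n = D.shape[0]
    path, visited = [0], {0}
    while len(path) < n:
        last = path[-1]
        nxt = min((j for j in range(n) if j not in visited), key=lambda j: D[last, j])
        path.append(nxt)
        visited.add(nxt)
    return path

def between_edge_count(labels, path):
    return sum(labels[a] != labels[b] for a, b in zip(path[:-1], path[1:]))

def permutation_pvalue(X, labels, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    path = greedy_hamiltonian_path(cdist(X, X))   # graph from pooled data only
    observed = between_edge_count(labels, path)
    perm_counts = [between_edge_count(rng.permutation(labels), path)
                   for _ in range(n_perm)]
    # One-sided: small between-sample counts indicate heterogeneity.
    return (1 + sum(c <= observed for c in perm_counts)) / (n_perm + 1)
```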

4:30 pm:

Cell annotation using single-cell genomics data

Xuekui Zhang, Associate Professor, Department of Mathematics and Statistics, University of Victoria

Single-cell RNA-sequencing (scRNA-seq) technology enables researchers to investigate a genome at the cellular level with unprecedented resolution. An organism consists of a heterogeneous collection of cell types, each of which plays a distinct role in various biological processes. Hence, the first step of scRNA-seq data analysis is often to distinguish cell types so they can be investigated separately. This talk introduces our recent work on cell-type annotation using scRNA-seq data.