Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeLabel Shift Adapter for Test-Time Adaptation under Covariate and Label Shifts
Test-time adaptation (TTA) aims to adapt a pre-trained model to the target domain in a batch-by-batch manner during inference. While label distributions often exhibit imbalances in real-world scenarios, most previous TTA approaches typically assume that both source and target domain datasets have balanced label distribution. Due to the fact that certain classes appear more frequently in certain domains (e.g., buildings in cities, trees in forests), it is natural that the label distribution shifts as the domain changes. However, we discover that the majority of existing TTA methods fail to address the coexistence of covariate and label shifts. To tackle this challenge, we propose a novel label shift adapter that can be incorporated into existing TTA approaches to deal with label shifts during the TTA process effectively. Specifically, we estimate the label distribution of the target domain to feed it into the label shift adapter. Subsequently, the label shift adapter produces optimal parameters for the target label distribution. By predicting only the parameters for a part of the pre-trained source model, our approach is computationally efficient and can be easily applied, regardless of the model architectures. Through extensive experiments, we demonstrate that integrating our strategy with TTA approaches leads to substantial performance improvements under the joint presence of label and covariate shifts.
Optimal Representations for Covariate Shift
Machine learning systems often experience a distribution shift between training and testing. In this paper, we introduce a simple variational objective whose optima are exactly the set of all representations on which risk minimizers are guaranteed to be robust to any distribution shift that preserves the Bayes predictor, e.g., covariate shifts. Our objective has two components. First, a representation must remain discriminative for the task, i.e., some predictor must be able to simultaneously minimize the source and target risk. Second, the representation's marginal support needs to be the same across source and target. We make this practical by designing self-supervised objectives that only use unlabelled data and augmentations to train robust representations. Our objectives give insights into the robustness of CLIP, and further improve CLIP's representations to achieve SOTA results on DomainBed.
Pooling Image Datasets With Multiple Covariate Shift and Imbalance
Small sample sizes are common in many disciplines, which necessitates pooling roughly similar datasets across multiple institutions to study weak but relevant associations between images and disease outcomes. Such data often manifest shift/imbalance in covariates (i.e., secondary non-imaging data). Controlling for such nuisance variables is common within standard statistical analysis, but the ideas do not directly apply to overparameterized models. Consequently, recent work has shown how strategies from invariant representation learning provides a meaningful starting point, but the current repertoire of methods is limited to accounting for shifts/imbalances in just a couple of covariates at a time. In this paper, we show how viewing this problem from the perspective of Category theory provides a simple and effective solution that completely avoids elaborate multi-stage training pipelines that would otherwise be needed. We show the effectiveness of this approach via extensive experiments on real datasets. Further, we discuss how this style of formulation offers a unified perspective on at least 5+ distinct problem settings, from self-supervised learning to matching problems in 3D reconstruction.
Estimating Model Performance Under Covariate Shift Without Labels
Machine learning models often experience performance degradation post-deployment due to shifts in data distribution. It is challenging to assess model's performance accurately when labels are missing or delayed. Existing proxy methods, such as drift detection, fail to measure the effects of these shifts adequately. To address this, we introduce a new method, Probabilistic Adaptive Performance Estimation (PAPE), for evaluating classification models on unlabeled data that accurately quantifies the impact of covariate shift on model performance. It is model and data-type agnostic and works for various performance metrics. Crucially, PAPE operates independently of the original model, relying only on its predictions and probability estimates, and does not need any assumptions about the nature of the covariate shift, learning directly from data instead. We tested PAPE on tabular data using over 900 dataset-model combinations created from US census data, assessing its performance against multiple benchmarks. Overall, PAPE provided more accurate performance estimates than other evaluated methodologies.
Generalize or Detect? Towards Robust Semantic Segmentation Under Multiple Distribution Shifts
In open-world scenarios, where both novel classes and domains may exist, an ideal segmentation model should detect anomaly classes for safety and generalize to new domains. However, existing methods often struggle to distinguish between domain-level and semantic-level distribution shifts, leading to poor out-of-distribution (OOD) detection or domain generalization performance. In this work, we aim to equip the model to generalize effectively to covariate-shift regions while precisely identifying semantic-shift regions. To achieve this, we design a novel generative augmentation method to produce coherent images that incorporate both anomaly (or novel) objects and various covariate shifts at both image and object levels. Furthermore, we introduce a training strategy that recalibrates uncertainty specifically for semantic shifts and enhances the feature extractor to align features associated with domain shifts. We validate the effectiveness of our method across benchmarks featuring both semantic and domain shifts. Our method achieves state-of-the-art performance across all benchmarks for both OOD detection and domain generalization. Code is available at https://github.com/gaozhitong/MultiShiftSeg.
AdverX-Ray: Ensuring X-Ray Integrity Through Frequency-Sensitive Adversarial VAEs
Ensuring the quality and integrity of medical images is crucial for maintaining diagnostic accuracy in deep learning-based Computer-Aided Diagnosis and Computer-Aided Detection (CAD) systems. Covariate shifts are subtle variations in the data distribution caused by different imaging devices or settings and can severely degrade model performance, similar to the effects of adversarial attacks. Therefore, it is vital to have a lightweight and fast method to assess the quality of these images prior to using CAD models. AdverX-Ray addresses this need by serving as an image-quality assessment layer, designed to detect covariate shifts effectively. This Adversarial Variational Autoencoder prioritizes the discriminator's role, using the suboptimal outputs of the generator as negative samples to fine-tune the discriminator's ability to identify high-frequency artifacts. Images generated by adversarial networks often exhibit severe high-frequency artifacts, guiding the discriminator to focus excessively on these components. This makes the discriminator ideal for this approach. Trained on patches from X-ray images of specific machine models, AdverX-Ray can evaluate whether a scan matches the training distribution, or if a scan from the same machine is captured under different settings. Extensive comparisons with various OOD detection methods show that AdverX-Ray significantly outperforms existing techniques, achieving a 96.2% average AUROC using only 64 random patches from an X-ray. Its lightweight and fast architecture makes it suitable for real-time applications, enhancing the reliability of medical imaging systems. The code and pretrained models are publicly available.
Improving Robustness and Reliability in Medical Image Classification with Latent-Guided Diffusion and Nested-Ensembles
Once deployed, medical image analysis methods are often faced with unexpected image corruptions and noise perturbations. These unknown covariate shifts present significant challenges to deep learning based methods trained on "clean" images. This often results in unreliable predictions and poorly calibrated confidence, hence hindering clinical applicability. While recent methods have been developed to address specific issues such as confidence calibration or adversarial robustness, no single framework effectively tackles all these challenges simultaneously. To bridge this gap, we propose LaDiNE, a novel ensemble learning method combining the robustness of Vision Transformers with diffusion-based generative models for improved reliability in medical image classification. Specifically, transformer encoder blocks are used as hierarchical feature extractors that learn invariant features from images for each ensemble member, resulting in features that are robust to input perturbations. In addition, diffusion models are used as flexible density estimators to estimate member densities conditioned on the invariant features, leading to improved modeling of complex data distributions while retaining properly calibrated confidence. Extensive experiments on tuberculosis chest X-rays and melanoma skin cancer datasets demonstrate that LaDiNE achieves superior performance compared to a wide range of state-of-the-art methods by simultaneously improving prediction accuracy and confidence calibration under unseen noise, adversarial perturbations, and resolution degradation.
DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection
Out-of-distribution (OOD) detection holds significant importance across many applications. While semantic and domain-shift OOD problems are well-studied, this work focuses on covariate shifts - subtle variations in the data distribution that can degrade machine learning performance. We hypothesize that detecting these subtle shifts can improve our understanding of in-distribution boundaries, ultimately improving OOD detection. In adversarial discriminators trained with Batch Normalization (BN), real and adversarial samples form distinct domains with unique batch statistics - a property we exploit for OOD detection. We introduce DisCoPatch, an unsupervised Adversarial Variational Autoencoder (VAE) framework that harnesses this mechanism. During inference, batches consist of patches from the same image, ensuring a consistent data distribution that allows the model to rely on batch statistics. DisCoPatch uses the VAE's suboptimal outputs (generated and reconstructed) as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this boundary, DisCoPatch achieves state-of-the-art results in public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 95.5% AUROC on ImageNet-1K(-C) but also outperforms all prior methods on public Near-OOD (95.0%) benchmarks. With a compact model size of 25MB, it achieves high OOD detection performance at notably lower latency than existing methods, making it an efficient and practical solution for real-world OOD detection applications. The code is publicly available.
Statistical Learning under Heterogenous Distribution Shift
This paper studies the prediction of a target z from a pair of random variables (x,y), where the ground-truth predictor is additive E[z mid x,y] = f_star(x) +g_{star}(y). We study the performance of empirical risk minimization (ERM) over functions f+g, f in F and g in G, fit on a given training distribution, but evaluated on a test distribution which exhibits covariate shift. We show that, when the class F is "simpler" than G (measured, e.g., in terms of its metric entropy), our predictor is more resilient to heterogenous covariate shifts in which the shift in x is much greater than that in y. These results rely on a novel H\"older style inequality for the Dudley integral which may be of independent interest. Moreover, we corroborate our theoretical findings with experiments demonstrating improved resilience to shifts in "simpler" features across numerous domains.
Feed Two Birds with One Scone: Exploiting Wild Data for Both Out-of-Distribution Generalization and Detection
Modern machine learning models deployed in the wild can encounter both covariate and semantic shifts, giving rise to the problems of out-of-distribution (OOD) generalization and OOD detection respectively. While both problems have received significant research attention lately, they have been pursued independently. This may not be surprising, since the two tasks have seemingly conflicting goals. This paper provides a new unified approach that is capable of simultaneously generalizing to covariate shifts while robustly detecting semantic shifts. We propose a margin-based learning framework that exploits freely available unlabeled data in the wild that captures the environmental test-time OOD distributions under both covariate and semantic shifts. We show both empirically and theoretically that the proposed margin constraint is the key to achieving both OOD generalization and detection. Extensive experiments show the superiority of our framework, outperforming competitive baselines that specialize in either OOD generalization or OOD detection. Code is publicly available at https://github.com/deeplearning-wisc/scone.
Reward Gaming in Conditional Text Generation
To align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (RL) with reward functions learned from human annotations. Under this framework, we identify three common cases where high rewards are incorrectly assigned to undesirable patterns: noise-induced spurious correlation, naturally occurring spurious correlation, and covariate shift. We show that even though learned metrics achieve high performance on the distribution of the data used to train the reward function, the undesirable patterns may be amplified during RL training of the text generation model. While there has been discussion about reward gaming in the RL or safety community, in this discussion piece, we would like to highlight reward gaming in the natural language generation (NLG) community using concrete conditional text generation examples and discuss potential fixes and areas for future work.
Trained Transformers Learn Linear Models In-Context
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts.
A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models
Out-of-distribution (OOD) detection is a task that detects OOD samples during inference to ensure the safety of deployed models. However, conventional benchmarks have reached performance saturation, making it difficult to compare recent OOD detection methods. To address this challenge, we introduce three novel OOD detection benchmarks that enable a deeper understanding of method characteristics and reflect real-world conditions. First, we present ImageNet-X, designed to evaluate performance under challenging semantic shifts. Second, we propose ImageNet-FS-X for full-spectrum OOD detection, assessing robustness to covariate shifts (feature distribution shifts). Finally, we propose Wilds-FS-X, which extends these evaluations to real-world datasets, offering a more comprehensive testbed. Our experiments reveal that recent CLIP-based OOD detection methods struggle to varying degrees across the three proposed benchmarks, and none of them consistently outperforms the others. We hope the community goes beyond specific benchmarks and includes more challenging conditions reflecting real-world scenarios. The code is https://github.com/hoshi23/OOD-X-Benchmarks.
RegMixup: Mixup as a Regularizer Can Surprisingly Improve Accuracy and Out Distribution Robustness
We show that the effectiveness of the well celebrated Mixup [Zhang et al., 2018] can be further improved if instead of using it as the sole learning objective, it is utilized as an additional regularizer to the standard cross-entropy loss. This simple change not only provides much improved accuracy but also significantly improves the quality of the predictive uncertainty estimation of Mixup in most cases under various forms of covariate shifts and out-of-distribution detection experiments. In fact, we observe that Mixup yields much degraded performance on detecting out-of-distribution samples possibly, as we show empirically, because of its tendency to learn models that exhibit high-entropy throughout; making it difficult to differentiate in-distribution samples from out-distribution ones. To show the efficacy of our approach (RegMixup), we provide thorough analyses and experiments on vision datasets (ImageNet & CIFAR-10/100) and compare it with a suite of recent approaches for reliable uncertainty estimation.
VeLU: Variance-enhanced Learning Unit for Deep Neural Networks
Activation functions are fundamental in deep neural networks and directly impact gradient flow, optimization stability, and generalization. Although ReLU remains standard because of its simplicity, it suffers from vanishing gradients and lacks adaptability. Alternatives like Swish and GELU introduce smooth transitions, but fail to dynamically adjust to input statistics. We propose VeLU, a Variance-enhanced Learning Unit as an activation function that dynamically scales based on input variance by integrating ArcTan-Sin transformations and Wasserstein-2 regularization, effectively mitigating covariate shifts and stabilizing optimization. Extensive experiments on ViT_B16, VGG19, ResNet50, DenseNet121, MobileNetV2, and EfficientNetB3 confirm VeLU's superiority over ReLU, ReLU6, Swish, and GELU on six vision benchmarks. The codes of VeLU are publicly available on GitHub.
LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models
Out-of-distribution (OOD) detection is crucial for model reliability, as it identifies samples from unknown classes and reduces errors due to unexpected inputs. Vision-Language Models (VLMs) such as CLIP are emerging as powerful tools for OOD detection by integrating multi-modal information. However, the practical application of such systems is challenged by manual prompt engineering, which demands domain expertise and is sensitive to linguistic nuances. In this paper, we introduce Label-driven Automated Prompt Tuning (LAPT), a novel approach to OOD detection that reduces the need for manual prompt engineering. We develop distribution-aware prompts with in-distribution (ID) class names and negative labels mined automatically. Training samples linked to these class labels are collected autonomously via image synthesis and retrieval methods, allowing for prompt learning without manual effort. We utilize a simple cross-entropy loss for prompt optimization, with cross-modal and cross-distribution mixing strategies to reduce image noise and explore the intermediate space between distributions, respectively. The LAPT framework operates autonomously, requiring only ID class names as input and eliminating the need for manual intervention. With extensive experiments, LAPT consistently outperforms manually crafted prompts, setting a new standard for OOD detection. Moreover, LAPT not only enhances the distinction between ID and OOD samples, but also improves the ID classification accuracy and strengthens the generalization robustness to covariate shifts, resulting in outstanding performance in challenging full-spectrum OOD detection tasks. Codes are available at https://github.com/YBZh/LAPT.
Double-Weighting for Covariate Shift Adaptation
Supervised learning is often affected by a covariate shift in which the marginal distributions of instances (covariates x) of training and testing samples p_tr(x) and p_te(x) are different but the label conditionals coincide. Existing approaches address such covariate shift by either using the ratio p_te(x)/p_tr(x) to weight training samples (reweighted methods) or using the ratio p_tr(x)/p_te(x) to weight testing samples (robust methods). However, the performance of such approaches can be poor under support mismatch or when the above ratios take large values. We propose a minimax risk classification (MRC) approach for covariate shift adaptation that avoids such limitations by weighting both training and testing samples. In addition, we develop effective techniques that obtain both sets of weights and generalize the conventional kernel mean matching method. We provide novel generalization bounds for our method that show a significant increase in the effective sample size compared with reweighted methods. The proposed method also achieves enhanced classification performance in both synthetic and empirical experiments.
"Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts
Machine learning models frequently experience performance drops under distribution shifts. The underlying cause of such shifts may be multiple simultaneous factors such as changes in data quality, differences in specific covariate distributions, or changes in the relationship between label and features. When a model does fail during deployment, attributing performance change to these factors is critical for the model developer to identify the root cause and take mitigating actions. In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game where the players are distributions. We define the value of a set of distributions to be the change in model performance when only this set of distributions has changed between environments, and derive an importance weighting method for computing the value of an arbitrary set of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on synthetic, semi-synthetic, and real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.
Accurate and Scalable Estimation of Epistemic Uncertainty for Graph Neural Networks
Safe deployment of graph neural networks (GNNs) under distribution shift requires models to provide accurate confidence indicators (CI). However, while it is well-known in computer vision that CI quality diminishes under distribution shift, this behavior remains understudied for GNNs. Hence, we begin with a case study on CI calibration under controlled structural and feature distribution shifts and demonstrate that increased expressivity or model size do not always lead to improved CI performance. Consequently, we instead advocate for the use of epistemic uncertainty quantification (UQ) methods to modulate CIs. To this end, we propose G-DeltaUQ, a new single model UQ method that extends the recently proposed stochastic centering framework to support structured data and partial stochasticity. Evaluated across covariate, concept, and graph size shifts, G-DeltaUQ not only outperforms several popular UQ methods in obtaining calibrated CIs, but also outperforms alternatives when CIs are used for generalization gap prediction or OOD detection. Overall, our work not only introduces a new, flexible GNN UQ method, but also provides novel insights into GNN CIs on safety-critical tasks.
Search for dark matter subhalos among unassociated Fermi-LAT sources in presence of dataset shift
We search for dark matter (DM) annihilating subhalos of the Milky Way halo among the Fermi Large Area Telescope (LAT) unassociated sources. We construct, for the first time, a statistical model of the unassociated sources at latitudes above 10 degrees. The latter is built as a combination of both DM annihilation subhalos as well as Galactic and extragalactic astrophysical components. The astrophysical components are constructed based on distributions of associated sources, while the distribution of DM subhalos is derived from Monte Carlo simulations. In this model we take into account the differences in the distributions of associated and unassociated sources including both covariate and prior probability shifts (both being forms of ``dataset shifts''). Previous searches of DM subhalos were based on classify-and-count strategies, while the approach adopted in this work is based on quantification learning, which allows one to determine a well-defined statistical interpretation of the contribution of a population of DM subhalos to the unassociated Fermi-LAT sources. In the bb annihilation channel and for a range of DM masses from 10 GeV to 1 TeV, we don't find a significant contribution from DM subhalos and derive a statistical 95% confidence upper limit on the DM annihilation cross section in this channel. While the derived limits are consistent with previous classify-and-count approaches, our generative statistical model opens new avenues for population studies of Fermi-LAT sources and, more generally, for searches of anomalies on top of backgrounds in presence of statistical and systematic uncertainties.
ImageNet-OOD: Deciphering Modern Out-of-Distribution Detection Algorithms
The task of out-of-distribution (OOD) detection is notoriously ill-defined. Earlier works focused on new-class detection, aiming to identify label-altering data distribution shifts, also known as "semantic shift." However, recent works argue for a focus on failure detection, expanding the OOD evaluation framework to account for label-preserving data distribution shifts, also known as "covariate shift." Intriguingly, under this new framework, complex OOD detectors that were previously considered state-of-the-art now perform similarly to, or even worse than the simple maximum softmax probability baseline. This raises the question: what are the latest OOD detectors actually detecting? Deciphering the behavior of OOD detection algorithms requires evaluation datasets that decouples semantic shift and covariate shift. To aid our investigations, we present ImageNet-OOD, a clean semantic shift dataset that minimizes the interference of covariate shift. Through comprehensive experiments, we show that OOD detectors are more sensitive to covariate shift than to semantic shift, and the benefits of recent OOD detection algorithms on semantic shift detection is minimal. Our dataset and analyses provide important insights for guiding the design of future OOD detectors.
Exploring Transformer Backbones for Heterogeneous Treatment Effect Estimation
Previous works on Treatment Effect Estimation (TEE) are not in widespread use because they are predominantly theoretical, where strong parametric assumptions are made but untractable for practical application. Recent work uses multilayer perceptron (MLP) for modeling casual relationships, however, MLPs lag far behind recent advances in ML methodology, which limits their applicability and generalizability. To extend beyond the single domain formulation and towards more realistic learning scenarios, we explore model design spaces beyond MLPs, i.e., transformer backbones, which provide flexibility where attention layers govern interactions among treatments and covariates to exploit structural similarities of potential outcomes for confounding control. Through careful model design, Transformers as Treatment Effect Estimators (TransTEE) is proposed. We show empirically that TransTEE can: (1) serve as a general purpose treatment effect estimator that significantly outperforms competitive baselines in a variety of challenging TEE problems (e.g., discrete, continuous, structured, or dosage-associated treatments) and is applicable to both when covariates are tabular and when they consist of structural data (e.g., texts, graphs); (2) yield multiple advantages: compatibility with propensity score modeling, parameter efficiency, robustness to continuous treatment value distribution shifts, explainable in covariate adjustment, and real-world utility in auditing pre-trained language models
ChronosX: Adapting Pretrained Time Series Models with Exogenous Variables
Covariates provide valuable information on external factors that influence time series and are critical in many real-world time series forecasting tasks. For example, in retail, covariates may indicate promotions or peak dates such as holiday seasons that heavily influence demand forecasts. Recent advances in pretraining large language model architectures for time series forecasting have led to highly accurate forecasters. However, the majority of these models do not readily use covariates as they are often specific to a certain task or domain. This paper introduces a new method to incorporate covariates into pretrained time series forecasting models. Our proposed approach incorporates covariate information into pretrained forecasting models through modular blocks that inject past and future covariate information, without necessarily modifying the pretrained model in consideration. In order to evaluate our approach, we introduce a benchmark composed of 32 different synthetic datasets with varying dynamics to evaluate the effectivity of forecasting models with covariates. Extensive evaluations on both synthetic and real datasets show that our approach effectively incorporates covariate information into pretrained models, outperforming existing baselines.
Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation
Large foundation models trained on extensive datasets demonstrate strong zero-shot capabilities in various domains. To replicate their success when data and model size are constrained, knowledge distillation has become an established tool for transferring knowledge from foundation models to small student networks. However, the effectiveness of distillation is critically limited by the available training data. This work addresses the common practical issue of covariate shift in knowledge distillation, where spurious features appear during training but not at test time. We ask the question: when these spurious features are unknown, yet a robust teacher is available, is it possible for a student to also become robust to them? We address this problem by introducing a novel diffusion-based data augmentation strategy that generates images by maximizing the disagreement between the teacher and the student, effectively creating challenging samples that the student struggles with. Experiments demonstrate that our approach significantly improves worst group and mean group accuracy on CelebA and SpuCo Birds as well as the spurious mAUC on spurious ImageNet under covariate shift, outperforming state-of-the-art diffusion-based data augmentation baselines
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
Test-time Batch Statistics Calibration for Covariate Shift
Deep neural networks have a clear degradation when applying to the unseen environment due to the covariate shift. Conventional approaches like domain adaptation requires the pre-collected target data for iterative training, which is impractical in real-world applications. In this paper, we propose to adapt the deep models to the novel environment during inference. An previous solution is test time normalization, which substitutes the source statistics in BN layers with the target batch statistics. However, we show that test time normalization may potentially deteriorate the discriminative structures due to the mismatch between target batch statistics and source parameters. To this end, we present a general formulation alpha-BN to calibrate the batch statistics by mixing up the source and target statistics for both alleviating the domain shift and preserving the discriminative structures. Based on alpha-BN, we further present a novel loss function to form a unified test time adaptation framework Core, which performs the pairwise class correlation online optimization. Extensive experiments show that our approaches achieve the state-of-the-art performance on total twelve datasets from three topics, including model robustness to corruptions, domain generalization on image classification and semantic segmentation. Particularly, our alpha-BN improves 28.4\% to 43.9\% on GTA5 rightarrow Cityscapes without any training, even outperforms the latest source-free domain adaptation method.
Maximum Likelihood Estimation is All You Need for Well-Specified Covariate Shift
A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization -- generalizing to target data whose distribution differs from that of source data. Despite its significant importance, the fundamental question of ``what are the most effective algorithms for OOD generalization'' remains open even under the standard setting of covariate shift. This paper addresses this fundamental question by proving that, surprisingly, classical Maximum Likelihood Estimation (MLE) purely using source data (without any modification) achieves the minimax optimality for covariate shift under the well-specified setting. That is, no algorithm performs better than MLE in this setting (up to a constant factor), justifying MLE is all you need. Our result holds for a very rich class of parametric models, and does not require any boundedness condition on the density ratio. We illustrate the wide applicability of our framework by instantiating it to three concrete examples -- linear regression, logistic regression, and phase retrieval. This paper further complement the study by proving that, under the misspecified setting, MLE is no longer the optimal choice, whereas Maximum Weighted Likelihood Estimator (MWLE) emerges as minimax optimal in certain scenarios.
Personalized Federated Learning under Mixture of Distributions
The recent trend towards Personalized Federated Learning (PFL) has garnered significant attention as it allows for the training of models that are tailored to each client while maintaining data privacy. However, current PFL techniques primarily focus on modeling the conditional distribution heterogeneity (i.e. concept shift), which can result in suboptimal performance when the distribution of input data across clients diverges (i.e. covariate shift). Additionally, these techniques often lack the ability to adapt to unseen data, further limiting their effectiveness in real-world scenarios. To address these limitations, we propose a novel approach, FedGMM, which utilizes Gaussian mixture models (GMM) to effectively fit the input data distributions across diverse clients. The model parameters are estimated by maximum likelihood estimation utilizing a federated Expectation-Maximization algorithm, which is solved in closed form and does not assume gradient similarity. Furthermore, FedGMM possesses an additional advantage of adapting to new clients with minimal overhead, and it also enables uncertainty quantification. Empirical evaluations on synthetic and benchmark datasets demonstrate the superior performance of our method in both PFL classification and novel sample detection.
Understanding Disparities in Post Hoc Machine Learning Explanation
Previous work has highlighted that existing post-hoc explanation methods exhibit disparities in explanation fidelity (across 'race' and 'gender' as sensitive attributes), and while a large body of work focuses on mitigating these issues at the explanation metric level, the role of the data generating process and black box model in relation to explanation disparities remains largely unexplored. Accordingly, through both simulations as well as experiments on a real-world dataset, we specifically assess challenges to explanation disparities that originate from properties of the data: limited sample size, covariate shift, concept shift, omitted variable bias, and challenges based on model properties: inclusion of the sensitive attribute and appropriate functional form. Through controlled simulation analyses, our study demonstrates that increased covariate shift, concept shift, and omission of covariates increase explanation disparities, with the effect pronounced higher for neural network models that are better able to capture the underlying functional form in comparison to linear models. We also observe consistent findings regarding the effect of concept shift and omitted variable bias on explanation disparities in the Adult income dataset. Overall, results indicate that disparities in model explanations can also depend on data and model properties. Based on this systematic investigation, we provide recommendations for the design of explanation methods that mitigate undesirable disparities.
ReTaSA: A Nonparametric Functional Estimation Approach for Addressing Continuous Target Shift
The presence of distribution shifts poses a significant challenge for deploying modern machine learning models in real-world applications. This work focuses on the target shift problem in a regression setting (Zhang et al., 2013; Nguyen et al., 2016). More specifically, the target variable y (also known as the response variable), which is continuous, has different marginal distributions in the training source and testing domain, while the conditional distribution of features x given y remains the same. While most literature focuses on classification tasks with finite target space, the regression problem has an infinite dimensional target space, which makes many of the existing methods inapplicable. In this work, we show that the continuous target shift problem can be addressed by estimating the importance weight function from an ill-posed integral equation. We propose a nonparametric regularized approach named ReTaSA to solve the ill-posed integral equation and provide theoretical justification for the estimated importance weight function. The effectiveness of the proposed method has been demonstrated with extensive numerical studies on synthetic and real-world datasets.
Conditional Support Alignment for Domain Adaptation with Label Shift
Unsupervised domain adaptation (UDA) refers to a domain adaptation framework in which a learning model is trained based on the labeled samples on the source domain and unlabelled ones in the target domain. The dominant existing methods in the field that rely on the classical covariate shift assumption to learn domain-invariant feature representation have yielded suboptimal performance under the label distribution shift between source and target domains. In this paper, we propose a novel conditional adversarial support alignment (CASA) whose aim is to minimize the conditional symmetric support divergence between the source's and target domain's feature representation distributions, aiming at a more helpful representation for the classification task. We also introduce a novel theoretical target risk bound, which justifies the merits of aligning the supports of conditional feature distributions compared to the existing marginal support alignment approach in the UDA settings. We then provide a complete training process for learning in which the objective optimization functions are precisely based on the proposed target risk bound. Our empirical results demonstrate that CASA outperforms other state-of-the-art methods on different UDA benchmark tasks under label shift conditions.
Extending Mixture of Experts Model to Investigate Heterogeneity of Trajectories: When, Where and How to Add Which Covariates
Researchers are usually interested in examining the impact of covariates when separating heterogeneous samples into latent classes that are more homogeneous. The majority of theoretical and empirical studies with such aims have focused on identifying covariates as predictors of class membership in the structural equation modeling framework. In other words, the covariates only indirectly affect the sample heterogeneity. However, the covariates' influence on between-individual differences can also be direct. This article presents a mixture model that investigates covariates to explain within-cluster and between-cluster heterogeneity simultaneously, known as a mixture-of-experts (MoE) model. This study aims to extend the MoE framework to investigate heterogeneity in nonlinear trajectories: to identify latent classes, covariates as predictors to clusters, and covariates that explain within-cluster differences in change patterns over time. Our simulation studies demonstrate that the proposed model generally estimates the parameters unbiasedly, precisely and exhibits appropriate empirical coverage for a nominal 95% confidence interval. This study also proposes implementing structural equation model forests to shrink the covariate space of the proposed mixture model. We illustrate how to select covariates and construct the proposed model with longitudinal mathematics achievement data. Additionally, we demonstrate that the proposed mixture model can be further extended in the structural equation modeling framework by allowing the covariates that have direct effects to be time-varying.
Accounting For Informative Sampling When Learning to Forecast Treatment Outcomes Over Time
Machine learning (ML) holds great potential for accurately forecasting treatment outcomes over time, which could ultimately enable the adoption of more individualized treatment strategies in many practical applications. However, a significant challenge that has been largely overlooked by the ML literature on this topic is the presence of informative sampling in observational data. When instances are observed irregularly over time, sampling times are typically not random, but rather informative -- depending on the instance's characteristics, past outcomes, and administered treatments. In this work, we formalize informative sampling as a covariate shift problem and show that it can prohibit accurate estimation of treatment outcomes if not properly accounted for. To overcome this challenge, we present a general framework for learning treatment outcomes in the presence of informative sampling using inverse intensity-weighting, and propose a novel method, TESAR-CDE, that instantiates this framework using Neural CDEs. Using a simulation environment based on a clinical use case, we demonstrate the effectiveness of our approach in learning under informative sampling.
MALTS: Matching After Learning to Stretch
We introduce a flexible framework that produces high-quality almost-exact matches for causal inference. Most prior work in matching uses ad-hoc distance metrics, often leading to poor quality matches, particularly when there are irrelevant covariates. In this work, we learn an interpretable distance metric for matching, which leads to substantially higher quality matches. The learned distance metric stretches the covariate space according to each covariate's contribution to outcome prediction: this stretching means that mismatches on important covariates carry a larger penalty than mismatches on irrelevant covariates. Our ability to learn flexible distance metrics leads to matches that are interpretable and useful for the estimation of conditional average treatment effects.
Locally Private Nonparametric Contextual Multi-armed Bandits
Motivated by privacy concerns in sequential decision-making on sensitive data, we address the challenge of nonparametric contextual multi-armed bandits (MAB) under local differential privacy (LDP). We develop a uniform-confidence-bound-type estimator, showing its minimax optimality supported by a matching minimax lower bound. We further consider the case where auxiliary datasets are available, subject also to (possibly heterogeneous) LDP constraints. Under the widely-used covariate shift framework, we propose a jump-start scheme to effectively utilize the auxiliary data, the minimax optimality of which is further established by a matching lower bound. Comprehensive experiments on both synthetic and real-world datasets validate our theoretical results and underscore the effectiveness of the proposed methods.
Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF
Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate Q-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at https://github.com/ZhaolinGao/REFUEL/, and models trained by REFUEL can be found at https://huggingface.co/Cornell-AGI.
RIFT: Closed-Loop RL Fine-Tuning for Realistic and Controllable Traffic Simulation
Achieving both realism and controllability in interactive closed-loop traffic simulation remains a key challenge in autonomous driving. Data-driven simulation methods reproduce realistic trajectories but suffer from covariate shift in closed-loop deployment, compounded by simplified dynamics models that further reduce reliability. Conversely, physics-based simulation methods enhance reliable and controllable closed-loop interactions but often lack expert demonstrations, compromising realism. To address these challenges, we introduce a dual-stage AV-centered simulation framework that conducts open-loop imitation learning pre-training in a data-driven simulator to capture trajectory-level realism and multimodality, followed by closed-loop reinforcement learning fine-tuning in a physics-based simulator to enhance controllability and mitigate covariate shift. In the fine-tuning stage, we propose RIFT, a simple yet effective closed-loop RL fine-tuning strategy that preserves the trajectory-level multimodality through a GRPO-style group-relative advantage formulation, while enhancing controllability and training stability by replacing KL regularization with the dual-clip mechanism. Extensive experiments demonstrate that RIFT significantly improves the realism and controllability of generated traffic scenarios, providing a robust platform for evaluating autonomous vehicle performance in diverse and interactive scenarios.
Unified Out-Of-Distribution Detection: A Model-Specific Perspective
Out-of-distribution (OOD) detection aims to identify test examples that do not belong to the training distribution and are thus unlikely to be predicted reliably. Despite a plethora of existing works, most of them focused only on the scenario where OOD examples come from semantic shift (e.g., unseen categories), ignoring other possible causes (e.g., covariate shift). In this paper, we present a novel, unifying framework to study OOD detection in a broader scope. Instead of detecting OOD examples from a particular cause, we propose to detect examples that a deployed machine learning model (e.g., an image classifier) is unable to predict correctly. That is, whether a test example should be detected and rejected or not is ``model-specific''. We show that this framework unifies the detection of OOD examples caused by semantic shift and covariate shift, and closely addresses the concern of applying a machine learning model to uncontrolled environments. We provide an extensive analysis that involves a variety of models (e.g., different architectures and training strategies), sources of OOD examples, and OOD detection approaches, and reveal several insights into improving and understanding OOD detection in uncontrolled environments.
DistilDoc: Knowledge Distillation for Visually-Rich Document Applications
This work explores knowledge distillation (KD) for visually-rich document (VRD) applications such as document layout analysis (DLA) and document image classification (DIC). While VRD research is dependent on increasingly sophisticated and cumbersome models, the field has neglected to study efficiency via model compression. Here, we design a KD experimentation methodology for more lean, performant models on document understanding (DU) tasks that are integral within larger task pipelines. We carefully selected KD strategies (response-based, feature-based) for distilling knowledge to and from backbones with different architectures (ResNet, ViT, DiT) and capacities (base, small, tiny). We study what affects the teacher-student knowledge gap and find that some methods (tuned vanilla KD, MSE, SimKD with an apt projector) can consistently outperform supervised student training. Furthermore, we design downstream task setups to evaluate covariate shift and the robustness of distilled DLA models on zero-shot layout-aware document visual question answering (DocVQA). DLA-KD experiments result in a large mAP knowledge gap, which unpredictably translates to downstream robustness, accentuating the need to further explore how to efficiently obtain more semantic document layout awareness.
ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation
The Vision-and-Language Navigation (VLN) task requires an agent to follow natural language instructions and navigate through complex environments. Existing MLLM-based VLN methods primarily rely on imitation learning (IL) and often use DAgger for post-training to mitigate covariate shift. While effective, these approaches incur substantial data collection and training costs. Reinforcement learning (RL) offers a promising alternative. However, prior VLN RL methods lack dynamic interaction with the environment and depend on expert trajectories for reward shaping, rather than engaging in open-ended active exploration. This restricts the agent's ability to discover diverse and plausible navigation routes. To address these limitations, we propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL. In the first stage, a small fraction of expert trajectories is used for IL to bootstrap the agent. In the second stage, the agent iteratively predicts and executes actions, automatically collects diverse trajectories, and optimizes multiple rollouts via the GRPO objective. To further improve RL efficiency, we introduce a dynamic early-stopping strategy to prune long-tail or likely failed trajectories, along with additional engineering optimizations. Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods, while reaching competitive performance with state-of-the-art approaches despite using a smaller model. Code and data will be released soon.
FEAMOE: Fair, Explainable and Adaptive Mixture of Experts
Three key properties that are desired of trustworthy machine learning models deployed in high-stakes environments are fairness, explainability, and an ability to account for various kinds of "drift". While drifts in model accuracy, for example due to covariate shift, have been widely investigated, drifts in fairness metrics over time remain largely unexplored. In this paper, we propose FEAMOE, a novel "mixture-of-experts" inspired framework aimed at learning fairer, more explainable/interpretable models that can also rapidly adjust to drifts in both the accuracy and the fairness of a classifier. We illustrate our framework for three popular fairness measures and demonstrate how drift can be handled with respect to these fairness constraints. Experiments on multiple datasets show that our framework as applied to a mixture of linear experts is able to perform comparably to neural networks in terms of accuracy while producing fairer models. We then use the large-scale HMDA dataset and show that while various models trained on HMDA demonstrate drift with respect to both accuracy and fairness, FEAMOE can ably handle these drifts with respect to all the considered fairness measures and maintain model accuracy as well. We also prove that the proposed framework allows for producing fast Shapley value explanations, which makes computationally efficient feature attribution based explanations of model decisions readily available via FEAMOE.
Locality Sensitive Sparse Encoding for Learning World Models Online
Acquiring an accurate world model online for model-based reinforcement learning (MBRL) is challenging due to data nonstationarity, which typically causes catastrophic forgetting for neural networks (NNs). From the online learning perspective, a Follow-The-Leader (FTL) world model is desirable, which optimally fits all previous experiences at each round. Unfortunately, NN-based models need re-training on all accumulated data at every interaction step to achieve FTL, which is computationally expensive for lifelong agents. In this paper, we revisit models that can achieve FTL with incremental updates. Specifically, our world model is a linear regression model supported by nonlinear random features. The linear part ensures efficient FTL update while the nonlinear random feature empowers the fitting of complex environments. To best trade off model capacity and computation efficiency, we introduce a locality sensitive sparse encoding, which allows us to conduct efficient sparse updates even with very high dimensional nonlinear features. We validate the representation power of our encoding and verify that it allows efficient online learning under data covariate shift. We also show, in the Dyna MBRL setting, that our world models learned online using a single pass of trajectory data either surpass or match the performance of deep world models trained with replay and other continual learning methods.
NICO++: Towards Better Benchmarking for Domain Generalization
Despite the remarkable performance that modern deep neural networks have achieved on independent and identically distributed (I.I.D.) data, they can crash under distribution shifts. Most current evaluation methods for domain generalization (DG) adopt the leave-one-out strategy as a compromise on the limited number of domains. We propose a large-scale benchmark with extensive labeled domains named NICO++ along with more rational evaluation methods for comprehensively evaluating DG algorithms. To evaluate DG datasets, we propose two metrics to quantify covariate shift and concept shift, respectively. Two novel generalization bounds from the perspective of data construction are proposed to prove that limited concept shift and significant covariate shift favor the evaluation capability for generalization. Through extensive experiments, NICO++ shows its superior evaluation capability compared with current DG datasets and its contribution in alleviating unfairness caused by the leak of oracle knowledge in model selection.
Model Transferability With Responsive Decision Subjects
Given an algorithmic predictor that is accurate on some source population consisting of strategic human decision subjects, will it remain accurate if the population respond to it? In our setting, an agent or a user corresponds to a sample (X,Y) drawn from a distribution D and will face a model h and its classification result h(X). Agents can modify X to adapt to h, which will incur a distribution shift on (X,Y). Our formulation is motivated by applications where the deployed machine learning models are subjected to human agents, and will ultimately face responsive and interactive data distributions. We formalize the discussions of the transferability of a model by studying how the performance of the model trained on the available source distribution (data) would translate to the performance on its induced domain. We provide both upper bounds for the performance gap due to the induced domain shift, as well as lower bounds for the trade-offs that a classifier has to suffer on either the source training distribution or the induced target distribution. We provide further instantiated analysis for two popular domain adaptation settings, including covariate shift and target shift.
OASIS: Open-world Adaptive Self-supervised and Imbalanced-aware System
The expansion of machine learning into dynamic environments presents challenges in handling open-world problems where label shift, covariate shift, and unknown classes emerge. Post-training methods have been explored to address these challenges, adapting models to newly emerging data. However, these methods struggle when the initial pre-training is performed on class-imbalanced datasets, limiting generalization to minority classes. To address this, we propose a method that effectively handles open-world problems even when pre-training is conducted on imbalanced data. Our contrastive-based pre-training approach enhances classification performance, particularly for underrepresented classes. Our post-training mechanism generates reliable pseudo-labels, improving model robustness against open-world problems. We also introduce selective activation criteria to optimize the post-training process, reducing unnecessary computation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art adaptation techniques in both accuracy and efficiency across diverse open-world scenarios.
COOkeD: Ensemble-based OOD detection in the era of zero-shot CLIP
Out-of-distribution (OOD) detection is an important building block in trustworthy image recognition systems as unknown classes may arise at test-time. OOD detection methods typically revolve around a single classifier, leading to a split in the research field between the classical supervised setting (e.g. ResNet18 classifier trained on CIFAR100) vs. the zero-shot setting (class names fed as prompts to CLIP). In both cases, an overarching challenge is that the OOD detection performance is implicitly constrained by the classifier's capabilities on in-distribution (ID) data. In this work, we show that given a little open-mindedness from both ends, remarkable OOD detection can be achieved by instead creating a heterogeneous ensemble - COOkeD combines the predictions of a closed-world classifier trained end-to-end on a specific dataset, a zero-shot CLIP classifier, and a linear probe classifier trained on CLIP image features. While bulky at first sight, this approach is modular, post-hoc and leverages the availability of pre-trained VLMs, thus introduces little overhead compared to training a single standard classifier. We evaluate COOkeD on popular CIFAR100 and ImageNet benchmarks, but also consider more challenging, realistic settings ranging from training-time label noise, to test-time covariate shift, to zero-shot shift which has been previously overlooked. Despite its simplicity, COOkeD achieves state-of-the-art performance and greater robustness compared to both classical and CLIP-based OOD detection methods. Code is available at https://github.com/glhr/COOkeD
Similarity-Distance-Magnitude Activations
We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to co-variate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.
LUMOS: Language-Conditioned Imitation Learning with World Models
We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model and transfers these skills zero-shot to a real robot. By learning on-policy in the latent space of the learned world model, our algorithm mitigates policy-induced distribution shift which most offline imitation learning methods suffer from. LUMOS learns from unstructured play data with fewer than 1% hindsight language annotations but is steerable with language commands at test time. We achieve this coherent long-horizon performance by combining latent planning with both image- and language-based hindsight goal relabeling during training, and by optimizing an intrinsic reward defined in the latent space of the world model over multiple time steps, effectively reducing covariate shift. In experiments on the difficult long-horizon CALVIN benchmark, LUMOS outperforms prior learning-based methods with comparable approaches on chained multi-task evaluations. To the best of our knowledge, we are the first to learn a language-conditioned continuous visuomotor control for a real-world robot within an offline world model. Videos, dataset and code are available at http://lumos.cs.uni-freiburg.de.
COMPASS: Cross-embodiment Mobility Policy via Residual RL and Skill Synthesis
As robots are increasingly deployed in diverse application domains, generalizable cross-embodiment mobility policies are increasingly essential. While classical mobility stacks have proven effective on specific robot platforms, they pose significant challenges when scaling to new embodiments. Learning-based methods, such as imitation learning (IL) and reinforcement learning (RL), offer alternative solutions but suffer from covariate shift, sparse sampling in large environments, and embodiment-specific constraints. This paper introduces COMPASS, a novel workflow for developing cross-embodiment mobility policies by integrating IL, residual RL, and policy distillation. We begin with IL on a mobile robot, leveraging easily accessible teacher policies to train a foundational model that combines a world model with a mobility policy. Building on this base, we employ residual RL to fine-tune embodiment-specific policies, exploiting pre-trained representations to improve sampling efficiency in handling various physical constraints and sensor modalities. Finally, policy distillation merges these embodiment-specialist policies into a single robust cross-embodiment policy. We empirically demonstrate that COMPASS scales effectively across diverse robot platforms while maintaining adaptability to various environment configurations, achieving a generalist policy with a success rate approximately 5X higher than the pre-trained IL policy. The resulting framework offers an efficient, scalable solution for cross-embodiment mobility, enabling robots with different designs to navigate safely and efficiently in complex scenarios.
Input-gradient space particle inference for neural network ensembles
Deep Ensembles (DEs) demonstrate improved accuracy, calibration and robustness to perturbations over single neural networks partly due to their functional diversity. Particle-based variational inference (ParVI) methods enhance diversity by formalizing a repulsion term based on a network similarity kernel. However, weight-space repulsion is inefficient due to over-parameterization, while direct function-space repulsion has been found to produce little improvement over DEs. To sidestep these difficulties, we propose First-order Repulsive Deep Ensemble (FoRDE), an ensemble learning method based on ParVI, which performs repulsion in the space of first-order input gradients. As input gradients uniquely characterize a function up to translation and are much smaller in dimension than the weights, this method guarantees that ensemble members are functionally different. Intuitively, diversifying the input gradients encourages each network to learn different features, which is expected to improve the robustness of an ensemble. Experiments on image classification datasets and transfer learning tasks show that FoRDE significantly outperforms the gold-standard DEs and other ensemble methods in accuracy and calibration under covariate shift due to input perturbations.
Treatment Effects Estimation by Uniform Transformer
In observational studies, balancing covariates in different treatment groups is essential to estimate treatment effects. One of the most commonly used methods for such purposes is weighting. The performance of this class of methods usually depends on strong regularity conditions for the underlying model, which might not hold in practice. In this paper, we investigate weighting methods from a functional estimation perspective and argue that the weights needed for covariate balancing could differ from those needed for treatment effects estimation under low regularity conditions. Motivated by this observation, we introduce a new framework of weighting that directly targets the treatment effects estimation. Unlike existing methods, the resulting estimator for a treatment effect under this new framework is a simple kernel-based U-statistic after applying a data-driven transformation to the observed covariates. We characterize the theoretical properties of the new estimators of treatment effects under a nonparametric setting and show that they are able to work robustly under low regularity conditions. The new framework is also applied to several numerical examples to demonstrate its practical merits.
Causal Discovery from Heterogeneous/Nonstationary Data with Independent Changes
It is commonplace to encounter heterogeneous or nonstationary data, of which the underlying generating process changes across domains or over time. Such a distribution shift feature presents both challenges and opportunities for causal discovery. In this paper, we develop a framework for causal discovery from such data, called Constraint-based causal Discovery from heterogeneous/NOnstationary Data (CD-NOD), to find causal skeleton and directions and estimate the properties of mechanism changes. First, we propose an enhanced constraint-based procedure to detect variables whose local mechanisms change and recover the skeleton of the causal structure over observed variables. Second, we present a method to determine causal orientations by making use of independent changes in the data distribution implied by the underlying causal model, benefiting from information carried by changing distributions. After learning the causal structure, next, we investigate how to efficiently estimate the "driving force" of the nonstationarity of a causal mechanism. That is, we aim to extract from data a low-dimensional representation of changes. The proposed methods are nonparametric, with no hard restrictions on data distributions and causal mechanisms, and do not rely on window segmentation. Furthermore, we find that data heterogeneity benefits causal structure identification even with particular types of confounders. Finally, we show the connection between heterogeneity/nonstationarity and soft intervention in causal discovery. Experimental results on various synthetic and real-world data sets (task-fMRI and stock market data) are presented to demonstrate the efficacy of the proposed methods.
A Test for Jumps in Metric-Space Conditional Means
Standard methods for detecting discontinuities in conditional means are not applicable to outcomes that are complex, non-Euclidean objects like distributions, networks, or covariance matrices. This article develops a nonparametric test for jumps in conditional means when outcomes lie in a non-Euclidean metric space. Using local Fr\'echet regressionx2014which generalizes standard regression to metric-space valued datax2014the method estimates a mean path on either side of a candidate cutoff, extending existing k-sample tests to a flexible regression setting. Key theoretical contributions include a central limit theorem for the local estimator of the conditional Fr\'echet variance and the asymptotic validity and consistency of the proposed test. Simulations confirm nominal size control and robust power in finite samples. Two applications demonstrate the method's value by revealing effects invisible to scalar-based tests. First, I detect a sharp change in work-from-home compositions at Washington State's income threshold for non-compete enforceability during COVID-19, highlighting remote work's role as a bargaining margin. Second, I find that countries restructure their input-output networks after losing preferential US trade access. These findings underscore that analyzing regression functions within their native metric spaces can reveal structural discontinuities that scalar summaries would miss.
Feature Shift Detection: Localizing Which Features Have Shifted via Conditional Distribution Tests
While previous distribution shift detection approaches can identify if a shift has occurred, these approaches cannot localize which specific features have caused a distribution shift -- a critical step in diagnosing or fixing any underlying issue. For example, in military sensor networks, users will want to detect when one or more of the sensors has been compromised, and critically, they will want to know which specific sensors might be compromised. Thus, we first define a formalization of this problem as multiple conditional distribution hypothesis tests and propose both non-parametric and parametric statistical tests. For both efficiency and flexibility, we then propose to use a test statistic based on the density model score function (i.e. gradient with respect to the input) -- which can easily compute test statistics for all dimensions in a single forward and backward pass. Any density model could be used for computing the necessary statistics including deep density models such as normalizing flows or autoregressive models. We additionally develop methods for identifying when and where a shift occurs in multivariate time-series data and show results for multiple scenarios using realistic attack models on both simulated and real world data.
Partial Correlations in Compositional Data Analysis
Partial correlations quantify linear association between two variables adjusting for the influence of the remaining variables. They form the backbone for graphical models and are readily obtained from the inverse of the covariance matrix. For compositional data, the covariance structure is specified from log ratios of variables, so unless we try to "open" the data via a normalization, this implies changes in the definition and interpretation of partial correlations. In the present work, we elucidate how results derived by Aitchison (1986) lead to a natural definition of partial correlation that has a number of advantages over current measures of association. For this, we show that the residuals of log-ratios between a variable with a reference, when adjusting for all remaining variables including the reference, are reference-independent. Since the reference itself can be controlled for, correlations between residuals are defined for the variables directly without the necessity to recur to ratios except when specifying which variables are partialled out. Thus, perhaps surprisingly, partial correlations do not have the problems commonly found with measures of pairwise association on compositional data. They are well-defined between two variables, are properly scaled, and allow for negative association. By design, they are subcompositionally incoherent, but they share this property with conventional partial correlations (where results change when adjusting for the influence of fewer variables). We discuss the equivalence with normalization-based approaches whenever the normalizing variables are controlled for. We also discuss the partial variances and correlations we obtain from a previously studied data set of Roman glass cups.
Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation
Ensuring equitable treatment (fairness) across protected attributes (such as gender or ethnicity) is a critical issue in machine learning. Most existing literature focuses on binary classification, but achieving fairness in regression tasks-such as insurance pricing or hiring score assessments-is equally important. Moreover, anti-discrimination laws also apply to continuous attributes, such as age, for which many existing methods are not applicable. In practice, multiple protected attributes can exist simultaneously; however, methods targeting fairness across several attributes often overlook so-called "fairness gerrymandering", thereby ignoring disparities among intersectional subgroups (e.g., African-American women or Hispanic men). In this paper, we propose a distance covariance regularisation framework that mitigates the association between model predictions and protected attributes, in line with the fairness definition of demographic parity, and that captures both linear and nonlinear dependencies. To enhance applicability in the presence of multiple protected attributes, we extend our framework by incorporating two multivariate dependence measures based on distance covariance: the previously proposed joint distance covariance (JdCov) and our novel concatenated distance covariance (CCdCov), which effectively address fairness gerrymandering in both regression and classification tasks involving protected attributes of various types. We discuss and illustrate how to calibrate regularisation strength, including a method based on Jensen-Shannon divergence, which quantifies dissimilarities in prediction distributions across groups. We apply our framework to the COMPAS recidivism dataset and a large motor insurance claims dataset.
Covariate balancing using the integral probability metric for causal inference
Weighting methods in causal inference have been widely used to achieve a desirable level of covariate balancing. However, the existing weighting methods have desirable theoretical properties only when a certain model, either the propensity score or outcome regression model, is correctly specified. In addition, the corresponding estimators do not behave well for finite samples due to large variance even when the model is correctly specified. In this paper, we consider to use the integral probability metric (IPM), which is a metric between two probability measures, for covariate balancing. Optimal weights are determined so that weighted empirical distributions for the treated and control groups have the smallest IPM value for a given set of discriminators. We prove that the corresponding estimator can be consistent without correctly specifying any model (neither the propensity score nor the outcome regression model). In addition, we empirically show that our proposed method outperforms existing weighting methods with large margins for finite samples.
Chronos-2: From Univariate to Universal Forecasting
Pretrained time series models have enabled inference-only forecasting systems that produce accurate predictions without task-specific training. However, existing approaches largely focus on univariate forecasting, limiting their applicability in real-world scenarios where multivariate data and covariates play a crucial role. We present Chronos-2, a pretrained model capable of handling univariate, multivariate, and covariate-informed forecasting tasks in a zero-shot manner. Chronos-2 employs a group attention mechanism that facilitates in-context learning (ICL) through efficient information sharing across multiple time series within a group, which may represent sets of related series, variates of a multivariate series, or targets and covariates in a forecasting task. These general capabilities are achieved through training on synthetic datasets that impose diverse multivariate structures on univariate series. Chronos-2 delivers state-of-the-art performance across three comprehensive benchmarks: fev-bench, GIFT-Eval, and Chronos Benchmark II. On fev-bench, which emphasizes multivariate and covariate-informed forecasting, Chronos-2's universal ICL capabilities lead to substantial improvements over existing models. On tasks involving covariates, it consistently outperforms baselines by a wide margin. Case studies in the energy and retail domains further highlight its practical advantages. The in-context learning capabilities of Chronos-2 establish it as a general-purpose forecasting model that can be used "as is" in real-world forecasting pipelines.
Bounds on the conditional and average treatment effect with unobserved confounding factors
For observational studies, we study the sensitivity of causal inference when treatment assignments may depend on unobserved confounders. We develop a loss minimization approach for estimating bounds on the conditional average treatment effect (CATE) when unobserved confounders have a bounded effect on the odds ratio of treatment selection. Our approach is scalable and allows flexible use of model classes in estimation, including nonparametric and black-box machine learning methods. Based on these bounds for the CATE, we propose a sensitivity analysis for the average treatment effect (ATE). Our semi-parametric estimator extends/bounds the augmented inverse propensity weighted (AIPW) estimator for the ATE under bounded unobserved confounding. By constructing a Neyman orthogonal score, our estimator of the bound for the ATE is a regular root-n estimator so long as the nuisance parameters are estimated at the o_p(n^{-1/4}) rate. We complement our methodology with optimality results showing that our proposed bounds are tight in certain cases. We demonstrate our method on simulated and real data examples, and show accurate coverage of our confidence intervals in practical finite sample regimes with rich covariate information.
Proximal Causal Learning of Conditional Average Treatment Effects
Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R- and DR-learner, a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption, and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented by off-the-shelf loss-minimizing machine learning methods, which in the case of kernel regression satisfies an oracle bound on the estimated error as long as the nuisance components are estimated reasonably well.
Causal Strategic Classification: A Tale of Two Shifts
When users can benefit from certain predictive outcomes, they may be prone to act to achieve those outcome, e.g., by strategically modifying their features. The goal in strategic classification is therefore to train predictive models that are robust to such behavior. However, the conventional framework assumes that changing features does not change actual outcomes, which depicts users as "gaming" the system. Here we remove this assumption, and study learning in a causal strategic setting where true outcomes do change. Focusing on accuracy as our primary objective, we show how strategic behavior and causal effects underlie two complementing forms of distribution shift. We characterize these shifts, and propose a learning algorithm that balances between these two forces and over time, and permits end-to-end training. Experiments on synthetic and semi-synthetic data demonstrate the utility of our approach.
Dimension Reduction for Characterizing Sexual Dimorphism in Biomechanics of the Temporomandibular Joint
Sexual dimorphism is a critical factor in many biological and medical research fields. In biomechanics and bioengineering, understanding sex differences is crucial for studying musculoskeletal conditions such as temporomandibular disorder (TMD). This paper focuses on the association between the craniofacial skeletal morphology and temporomandibular joint (TMJ) related masticatory muscle attachments to discern sex differences. Data were collected from 10 male and 11 female cadaver heads to investigate sex-specific relationships between the skull and muscles. We propose a conditional cross-covariance reduction (CCR) model, designed to examine the dynamic association between two sets of random variables conditioned on a third binary variable (e.g., sex), highlighting the most distinctive sex-related relationships between skull and muscle attachments in the human cadaver data. Under the CCR model, we employ a sparse singular value decomposition algorithm and introduce a sequential permutation for selecting sparsity (SPSS) method to select important variables and to determine the optimal number of selected variables.
A Systematic Paradigm for Detecting, Surfacing, and Characterizing Heterogeneous Treatment Effects (HTE)
To effectively optimize and personalize treatments, it is necessary to investigate the heterogeneity of treatment effects. With the wide range of users being treated over many online controlled experiments, the typical approach of manually investigating each dimension of heterogeneity becomes overly cumbersome and prone to subjective human biases. We need an efficient way to search through thousands of experiments with hundreds of target covariates and hundreds of breakdown dimensions. In this paper, we propose a systematic paradigm for detecting, surfacing and characterizing heterogeneous treatment effects. First, we detect if treatment effect variation is present in an experiment, prior to specifying any breakdowns. Second, we surface the most relevant dimensions for heterogeneity. Finally, we characterize the heterogeneity beyond just the conditional average treatment effects (CATE) by studying the conditional distributions of the estimated individual treatment effects. We show the effectiveness of our methods using simulated data and empirical studies.
Time-Varying Propensity Score to Bridge the Gap between the Past and Present
Real-world deployment of machine learning models is challenging because data evolves over time. While no model can work when data evolves in an arbitrary fashion, if there is some pattern to these changes, we might be able to design methods to address it. This paper addresses situations when data evolves gradually. We introduce a time-varying propensity score that can detect gradual shifts in the distribution of data which allows us to selectively sample past data to update the model -- not just similar data from the past like that of a standard propensity score but also data that evolved in a similar fashion in the past. The time-varying propensity score is quite general: we demonstrate different ways of implementing it and evaluate it on a variety of problems ranging from supervised learning (e.g., image classification problems) where data undergoes a sequence of gradual shifts, to reinforcement learning tasks (e.g., robotic manipulation and continuous control) where data shifts as the policy or the task changes.
The Connection Between R-Learning and Inverse-Variance Weighting for Estimation of Heterogeneous Treatment Effects
Our motivation is to shed light the performance of the widely popular "R-Learner." Like many other methods for estimating conditional average treatment effects (CATEs), R-Learning can be expressed as a weighted pseudo-outcome regression (POR). Previous comparisons of POR techniques have paid careful attention to the choice of pseudo-outcome transformation. However, we argue that the dominant driver of performance is actually the choice of weights. Specifically, we argue that R-Learning implicitly performs an inverse-variance weighted form of POR. These weights stabilize the regression and allow for convenient simplifications of bias terms.
A Flexible Parametric Modelling Framework for Survival Analysis
We introduce a general, flexible, parametric survival modelling framework which encompasses key shapes of hazard function (constant, increasing, decreasing, up-then-down, down-then-up), various common survival distributions (log-logistic, Burr type XII, Weibull, Gompertz), and includes defective distributions (i.e., cure models). This generality is achieved using four basic distributional parameters: two scale-type parameters and two shape parameters. Generalising to covariate dependence, the scale-type regression components correspond to accelerated failure time (AFT) and proportional hazards (PH) models. Therefore, this general formulation unifies the most popular survival models which allows us to consider the practical value of possible modelling choices for survival data. Furthermore, in line with our proposed flexible baseline distribution, we advocate the use of multi-parameter regression in which more than one distributional parameter depends on covariates - rather than the usual convention of having a single covariate-dependent (scale) parameter. While many choices are available, we suggest introducing covariates through just one or other of the two scale parameters, which covers AFT and PH models, in combination with a `power' shape parameter, which allows for more complex non-AFT/non-PH effects, while the other shape parameter remains covariate-independent, and handles automatic selection of the baseline distribution. We explore inferential issues in simulations, both with and without a covariate, with particular focus on evidence concerning the need, or otherwise, to include both AFT and PH parameters. We illustrate the efficacy of our modelling framework by investigating differences between treatment groups using data from a lung cancer study and a melanoma study. Censoring is accommodated throughout.
Exploring HOD-dependent systematics for the DESI 2024 Full-Shape galaxy clustering analysis
We analyse the robustness of the DESI 2024 cosmological inference from fits to the full shape of the galaxy power spectrum to uncertainties in the Halo Occupation Distribution (HOD) model of the galaxy-halo connection and the choice of priors on nuisance parameters. We assess variations in the recovered cosmological parameters across a range of mocks populated with different HOD models and find that shifts are often greater than 20% of the expected statistical uncertainties from the DESI data. We encapsulate the effect of such shifts in terms of a systematic covariance term, C_{rm HOD}, and an additional diagonal contribution quantifying the impact of our choice of nuisance parameter priors on the ability of the effective field theory (EFT) model to correctly recover the cosmological parameters of the simulations. These two covariance contributions are designed to be added to the usual covariance term, C_{rm stat}, describing the statistical uncertainty in the power spectrum measurement, in order to fairly represent these sources of systematic uncertainty. This approach is more general and robust to choices of model free parameters or additional external datasets used in cosmological fits than the alternative approach of adding systematic uncertainties at the level of the recovered marginalised parameter posteriors. We compare the approaches within the context of a fixed LambdaCDM model and demonstrate that our method gives conservative estimates of the systematic uncertainty that nevertheless have little impact on the final posteriors obtained from DESI data.
Deconfounded Representation Similarity for Comparison of Neural Networks
Similarity metrics such as representational similarity analysis (RSA) and centered kernel alignment (CKA) have been used to compare layer-wise representations between neural networks. However, these metrics are confounded by the population structure of data items in the input space, leading to spuriously high similarity for even completely random neural networks and inconsistent domain relations in transfer learning. We introduce a simple and generally applicable fix to adjust for the confounder with covariate adjustment regression, which retains the intuitive invariance properties of the original similarity measures. We show that deconfounding the similarity metrics increases the resolution of detecting semantically similar neural networks. Moreover, in real-world applications, deconfounding improves the consistency of representation similarities with domain similarities in transfer learning, and increases correlation with out-of-distribution accuracy.
Inference Stage Denoising for Undersampled MRI Reconstruction
Reconstruction of magnetic resonance imaging (MRI) data has been positively affected by deep learning. A key challenge remains: to improve generalisation to distribution shifts between the training and testing data. Most approaches aim to address this via inductive design or data augmentation. However, they can be affected by misleading data, e.g. random noise, and cases where the inference stage data do not match assumptions in the modelled shifts. In this work, by employing a conditional hyperparameter network, we eliminate the need of augmentation, yet maintain robust performance under various levels of Gaussian noise. We demonstrate that our model withstands various input noise levels while producing high-definition reconstructions during the test stage. Moreover, we present a hyperparameter sampling strategy that accelerates the convergence of training. Our proposed method achieves the highest accuracy and image quality in all settings compared to baseline methods.
Causal Inference by String Diagram Surgery
Extracting causal relationships from observed correlations is a growing area in probabilistic reasoning, originating with the seminal work of Pearl and others from the early 1990s. This paper develops a new, categorically oriented view based on a clear distinction between syntax (string diagrams) and semantics (stochastic matrices), connected via interpretations as structure-preserving functors. A key notion in the identification of causal effects is that of an intervention, whereby a variable is forcefully set to a particular value independent of any prior propensities. We represent the effect of such an intervention as an endofunctor which performs `string diagram surgery' within the syntactic category of string diagrams. This diagram surgery in turn yields a new, interventional distribution via the interpretation functor. While in general there is no way to compute interventional distributions purely from observed data, we show that this is possible in certain special cases using a calculational tool called comb disintegration. We demonstrate the use of this technique on a well-known toy example, where we predict the causal effect of smoking on cancer in the presence of a confounding common cause. After developing this specific example, we show this technique provides simple sufficient conditions for computing interventions which apply to a wide variety of situations considered in the causal inference literature.
Classification of BCI-EEG based on augmented covariance matrix
Objective: Electroencephalography signals are recorded as a multidimensional dataset. We propose a new framework based on the augmented covariance extracted from an autoregressive model to improve motor imagery classification. Methods: From the autoregressive model can be derived the Yule-Walker equations, which show the emergence of a symmetric positive definite matrix: the augmented covariance matrix. The state-of the art for classifying covariance matrices is based on Riemannian Geometry. A fairly natural idea is therefore to extend the standard approach using these augmented covariance matrices. The methodology for creating the augmented covariance matrix shows a natural connection with the delay embedding theorem proposed by Takens for dynamical systems. Such an embedding method is based on the knowledge of two parameters: the delay and the embedding dimension, respectively related to the lag and the order of the autoregressive model. This approach provides new methods to compute the hyper-parameters in addition to standard grid search. Results: The augmented covariance matrix performed noticeably better than any state-of-the-art methods. We will test our approach on several datasets and several subjects using the MOABB framework, using both within-session and cross-session evaluation. Conclusion: The improvement in results is due to the fact that the augmented covariance matrix incorporates not only spatial but also temporal information, incorporating nonlinear components of the signal through an embedding procedure, which allows the leveraging of dynamical systems algorithms. Significance: These results extend the concepts and the results of the Riemannian distance based classification algorithm.
SpaCE: The Spatial Confounding Environment
Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and evaluating machine learning and causal inference models. The SpaCE project provides several dozens of datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.
Integrating Earth Observation Data into Causal Inference: Challenges and Opportunities
Observational studies require adjustment for confounding factors that are correlated with both the treatment and outcome. In the setting where the observed variables are tabular quantities such as average income in a neighborhood, tools have been developed for addressing such confounding. However, in many parts of the developing world, features about local communities may be scarce. In this context, satellite imagery can play an important role, serving as a proxy for the confounding variables otherwise unobserved. In this paper, we study confounder adjustment in this non-tabular setting, where patterns or objects found in satellite images contribute to the confounder bias. Using the evaluation of anti-poverty aid programs in Africa as our running example, we formalize the challenge of performing causal adjustment with such unstructured data -- what conditions are sufficient to identify causal effects, how to perform estimation, and how to quantify the ways in which certain aspects of the unstructured image object are most predictive of the treatment decision. Via simulation, we also explore the sensitivity of satellite image-based observational inference to image resolution and to misspecification of the image-associated confounder. Finally, we apply these tools in estimating the effect of anti-poverty interventions in African communities from satellite imagery.
Contamination Bias in Linear Regressions
We study regressions with multiple treatments and a set of controls that is flexible enough to purge omitted variable bias. We show that these regressions generally fail to estimate convex averages of heterogeneous treatment effects -- instead, estimates of each treatment's effect are contaminated by non-convex averages of the effects of other treatments. We discuss three estimation approaches that avoid such contamination bias, including the targeting of easiest-to-estimate weighted average effects. A re-analysis of nine empirical applications finds economically and statistically meaningful contamination bias in observational studies; contamination bias in experimental studies is more limited due to smaller variability in propensity scores.
Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns
Sharp, multidimensional changepoints-abrupt shifts in a regression surface whose locations and magnitudes are unknown-arise in settings as varied as gene-expression profiling, financial covariance breaks, climate-regime detection, and urban socioeconomic mapping. Despite their prevalence, there are no current approaches that jointly estimate the location and size of the discontinuity set in a one-shot approach with statistical guarantees. We therefore introduce Free Discontinuity Regression (FDR), a fully nonparametric estimator that simultaneously (i) smooths a regression surface, (ii) segments it into contiguous regions, and (iii) provably recovers the precise locations and sizes of its jumps. By extending a convex relaxation of the Mumford-Shah functional to random spatial sampling and correlated noise, FDR overcomes the fixed-grid and i.i.d. noise assumptions of classical image-segmentation approaches, thus enabling its application to real-world data of any dimension. This yields the first identification and uniform consistency results for multivariate jump surfaces: under mild SBV regularity, the estimated function, its discontinuity set, and all jump sizes converge to their true population counterparts. Hyperparameters are selected automatically from the data using Stein's Unbiased Risk Estimate, and large-scale simulations up to three dimensions validate the theoretical results and demonstrate good finite-sample performance. Applying FDR to an internet shutdown in India reveals a 25-35% reduction in economic activity around the estimated shutdown boundaries-much larger than previous estimates. By unifying smoothing, segmentation, and effect-size recovery in a general statistical setting, FDR turns free-discontinuity ideas into a practical tool with formal guarantees for modern multivariate data.
Effect Heterogeneity with Earth Observation in Randomized Controlled Trials: Exploring the Role of Data, Model, and Evaluation Metric Choice
Many social and environmental phenomena are associated with macroscopic changes in the built environment, captured by satellite imagery on a global scale and with daily temporal resolution. While widely used for prediction, these images and especially image sequences remain underutilized for causal inference, especially in the context of randomized controlled trials (RCTs), where causal identification is established by design. In this paper, we develop and compare a set of general tools for analyzing Conditional Average Treatment Effects (CATEs) from temporal satellite data that can be applied to any RCT where geographical identifiers are available. Through a simulation study, we analyze different modeling strategies for estimating CATE in sequences of satellite images. We find that image sequence representation models with more parameters generally yield a greater ability to detect heterogeneity. To explore the role of model and data choice in practice, we apply the approaches to two influential RCTs -- Banerjee et al. (2015), a poverty study in Cusco, Peru, and Bolsen et al. (2014), a water conservation experiment in Georgia, USA. We benchmark our image sequence models against image-only, tabular-only, and combined image-tabular data sources, summarizing practical implications for investigators in a multivariate analysis. Land cover classifications over satellite images facilitate interpretation of what image features drive heterogeneity. We also show robustness to data and model choice of satellite-based generalization of the RCT results to larger geographical areas outside the original. Overall, this paper shows how satellite sequence data can be incorporated into the analysis of RCTs, and provides evidence about the implications of data, model, and evaluation metric choice for causal analysis.
Change is Hard: A Closer Look at Subpopulation Shift
Machine learning models often perform poorly on subgroups that are underrepresented in the training data. Yet, little is understood on the variation in mechanisms that cause subpopulation shifts, and how algorithms generalize across such diverse shifts at scale. In this work, we provide a fine-grained analysis of subpopulation shift. We first propose a unified framework that dissects and explains common shifts in subgroups. We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we reveal intriguing observations for future progress in this space. First, existing algorithms only improve subgroup robustness over certain types of shifts but not others. Moreover, while current algorithms rely on group-annotated validation data for model selection, we find that a simple selection criterion based on worst-class accuracy is surprisingly effective even without any group information. Finally, unlike existing works that solely aim to improve worst-group accuracy (WGA), we demonstrate the fundamental tradeoff between WGA and other important metrics, highlighting the need to carefully choose testing metrics. Code and data are available at: https://github.com/YyzHarry/SubpopBench.
On Heterogeneous Treatment Effects in Heterogeneous Causal Graphs
Heterogeneity and comorbidity are two interwoven challenges associated with various healthcare problems that greatly hampered research on developing effective treatment and understanding of the underlying neurobiological mechanism. Very few studies have been conducted to investigate heterogeneous causal effects (HCEs) in graphical contexts due to the lack of statistical methods. To characterize this heterogeneity, we first conceptualize heterogeneous causal graphs (HCGs) by generalizing the causal graphical model with confounder-based interactions and multiple mediators. Such confounders with an interaction with the treatment are known as moderators. This allows us to flexibly produce HCGs given different moderators and explicitly characterize HCEs from the treatment or potential mediators on the outcome. We establish the theoretical forms of HCEs and derive their properties at the individual level in both linear and nonlinear models. An interactive structural learning is developed to estimate the complex HCGs and HCEs with confidence intervals provided. Our method is empirically justified by extensive simulations and its practical usefulness is illustrated by exploring causality among psychiatric disorders for trauma survivors.
Contributions to Robust and Efficient Methods for Analysis of High Dimensional Data
A ubiquitous feature of data of our era is their extra-large sizes and dimensions. Analyzing such high-dimensional data poses significant challenges, since the feature dimension is often much larger than the sample size. This thesis introduces robust and computationally efficient methods to address several common challenges associated with high-dimensional data. In my first manuscript, I propose a coherent approach to variable screening that accommodates nonlinear associations. I develop a novel variable screening method that transcends traditional linear assumptions by leveraging mutual information, with an intended application in neuroimaging data. This approach allows for accurate identification of important variables by capturing nonlinear as well as linear relationships between the outcome and covariates. Building on this foundation, I develop new optimization methods for sparse estimation using nonconvex penalties in my second manuscript. These methods address notable challenges in current statistical computing practices, facilitating computationally efficient and robust analyses of complex datasets. The proposed method can be applied to a general class of optimization problems. In my third manuscript, I contribute to robust modeling of high-dimensional correlated observations by developing a mixed-effects model based on Tsallis power-law entropy maximization and discussed the theoretical properties of such distribution. This model surpasses the constraints of conventional Gaussian models by accommodating a broader class of distributions with enhanced robustness to outliers. Additionally, I develop a proximal nonlinear conjugate gradient algorithm that accelerates convergence while maintaining numerical stability, along with rigorous statistical properties for the proposed framework.
The Data Addition Dilemma
In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources. But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings? We identify this situation as the Data Addition Dilemma, demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced worst-subgroup performance. We find that this possibly arises from an empirically observed trade-off between model performance improvements due to data scaling and model deterioration from distribution shift. We thus establish baseline strategies for navigating this dilemma, introducing distribution shift heuristics to guide decision-making on which data sources to add in data scaling, in order to yield the expected model performance improvements. We conclude with a discussion of the required considerations for data collection and suggestions for studying data composition and scale in the age of increasingly larger models.
Project and Probe: Sample-Efficient Domain Adaptation by Interpolating Orthogonal Features
Transfer learning with a small amount of target data is an effective and common approach to adapting a pre-trained model to distribution shifts. In some situations, target data labels may be expensive to obtain, so we may only have access to a limited number of target data points. To make the most of a very small target dataset, we propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features. Our approach, Project and Probe (Pro^2), first learns a linear projection that maps a pre-trained embedding onto orthogonal directions while being predictive of labels in the source dataset. The goal of this step is to learn a variety of predictive features, so that at least some of them remain useful after distribution shift. Pro^2 then learns a linear classifier on top of these projected features using a small target dataset. Theoretically, we find that Pro^2 results in more sample-efficient generalization by inducing a favorable bias-variance tradeoff. Our experiments on four datasets, with multiple distribution shift settings for each, show that Pro^2 improves performance by 5-15% when given limited target data compared to prior methods such as standard linear probing.
A Neural Framework for Generalized Causal Sensitivity Analysis
Unobserved confounding is common in many applications, making causal inference from observational data challenging. As a remedy, causal sensitivity analysis is an important tool to draw causal conclusions under unobserved confounding with mathematical guarantees. In this paper, we propose NeuralCSA, a neural framework for generalized causal sensitivity analysis. Unlike previous work, our framework is compatible with (i) a large class of sensitivity models, including the marginal sensitivity model, f-sensitivity models, and Rosenbaum's sensitivity model; (ii) different treatment types (i.e., binary and continuous); and (iii) different causal queries, including (conditional) average treatment effects and simultaneous effects on multiple outcomes. The generality of \frameworkname is achieved by learning a latent distribution shift that corresponds to a treatment intervention using two conditional normalizing flows. We provide theoretical guarantees that NeuralCSA is able to infer valid bounds on the causal query of interest and also demonstrate this empirically using both simulated and real-world data.
Bounds on Representation-Induced Confounding Bias for Treatment Effect Estimation
State-of-the-art methods for conditional average treatment effect (CATE) estimation make widespread use of representation learning. Here, the idea is to reduce the variance of the low-sample CATE estimation by a (potentially constrained) low-dimensional representation. However, low-dimensional representations can lose information about the observed confounders and thus lead to bias, because of which the validity of representation learning for CATE estimation is typically violated. In this paper, we propose a new, representation-agnostic framework for estimating bounds on the representation-induced confounding bias that comes from dimensionality reduction (or other constraints on the representations) in CATE estimation. First, we establish theoretically under which conditions CATEs are non-identifiable given low-dimensional (constrained) representations. Second, as our remedy, we propose to perform partial identification of CATEs or, equivalently, aim at estimating of lower and upper bounds of the representation-induced confounding bias. We demonstrate the effectiveness of our bounds in a series of experiments. In sum, our framework is of direct relevance in practice where the validity of CATE estimation is of importance.
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation
In machine learning, generalization against distribution shifts -- where deployment conditions diverge from the training scenarios -- is crucial, particularly in fields like climate modeling, biomedicine, and autonomous driving. The emergence of foundation models, distinguished by their extensive pretraining and task versatility, has led to an increased interest in their adaptability to distribution shifts. GPT-4V(ision) acts as the most advanced publicly accessible multimodal foundation model, with extensive applications across various domains, including anomaly detection, video understanding, image generation, and medical diagnosis. However, its robustness against data distributions remains largely underexplored. Addressing this gap, this study rigorously evaluates GPT-4V's adaptability and generalization capabilities in dynamic environments, benchmarking against prominent models like CLIP and LLaVA. We delve into GPT-4V's zero-shot generalization across 13 diverse datasets spanning natural, medical, and molecular domains. We further investigate its adaptability to controlled data perturbations and examine the efficacy of in-context learning as a tool to enhance its adaptation. Our findings delineate GPT-4V's capability boundaries in distribution shifts, shedding light on its strengths and limitations across various scenarios. Importantly, this investigation contributes to our understanding of how AI foundation models generalize to distribution shifts, offering pivotal insights into their adaptability and robustness. Code is publicly available at https://github.com/jameszhou-gl/gpt-4v-distribution-shift.
Structural Re-weighting Improves Graph Domain Adaptation
In many real-world applications, graph-structured data used for training and testing have differences in distribution, such as in high energy physics (HEP) where simulation data used for training may not match real experiments. Graph domain adaptation (GDA) is a method used to address these differences. However, current GDA primarily works by aligning the distributions of node representations output by a single graph neural network encoder shared across the training and testing domains, which may often yield sub-optimal solutions. This work examines different impacts of distribution shifts caused by either graph structure or node attributes and identifies a new type of shift, named conditional structure shift (CSS), which current GDA approaches are provably sub-optimal to deal with. A novel approach, called structural reweighting (StruRW), is proposed to address this issue and is tested on synthetic graphs, four benchmark datasets, and a new application in HEP. StruRW has shown significant performance improvement over the baselines in the settings with large graph structure shifts, and reasonable performance improvement when node attribute shift dominates.
Improving Fair Training under Correlation Shifts
Model fairness is an essential element for Trustworthy AI. While many techniques for model fairness have been proposed, most of them assume that the training and deployment data distributions are identical, which is often not true in practice. In particular, when the bias between labels and sensitive groups changes, the fairness of the trained model is directly influenced and can worsen. We make two contributions for solving this problem. First, we analytically show that existing in-processing fair algorithms have fundamental limits in accuracy and group fairness. We introduce the notion of correlation shifts, which can explicitly capture the change of the above bias. Second, we propose a novel pre-processing step that samples the input data to reduce correlation shifts and thus enables the in-processing approaches to overcome their limitations. We formulate an optimization problem for adjusting the data ratio among labels and sensitive groups to reflect the shifted correlation. A key benefit of our approach lies in decoupling the roles of pre- and in-processing approaches: correlation adjustment via pre-processing and unfairness mitigation on the processed data via in-processing. Experiments show that our framework effectively improves existing in-processing fair algorithms w.r.t. accuracy and fairness, both on synthetic and real datasets.
Continuous Risk Factor Models: Analyzing Asset Correlations through Energy Distance
This paper introduces a novel approach to financial risk analysis that does not rely on traditional price and market data, instead using market news to model assets as distributions over a metric space of risk factors. By representing asset returns as integrals over the scalar field of these risk factors, we derive the covariance structure between asset returns. Utilizing encoder-only language models to embed this news data, we explore the relationships between asset return distributions through the concept of Energy Distance, establishing connections between distributional differences and excess returns co-movements. This data-agnostic approach provides new insights into portfolio diversification, risk management, and the construction of hedging strategies. Our findings have significant implications for both theoretical finance and practical risk management, offering a more robust framework for modelling complex financial systems without depending on conventional market data.
Neuroevolutionary Feature Representations for Causal Inference
Within the field of causal inference, we consider the problem of estimating heterogeneous treatment effects from data. We propose and validate a novel approach for learning feature representations to aid the estimation of the conditional average treatment effect or CATE. Our method focuses on an intermediate layer in a neural network trained to predict the outcome from the features. In contrast to previous approaches that encourage the distribution of representations to be treatment-invariant, we leverage a genetic algorithm that optimizes over representations useful for predicting the outcome to select those less useful for predicting the treatment. This allows us to retain information within the features useful for predicting outcome even if that information may be related to treatment assignment. We validate our method on synthetic examples and illustrate its use on a real life dataset.
Debiasing Machine Learning Predictions for Causal Inference Without Additional Ground Truth Data: "One Map, Many Trials" in Satellite-Driven Poverty Analysis
Machine learning models trained on Earth observation data, such as satellite imagery, have demonstrated significant promise in predicting household-level wealth indices, enabling the creation of high-resolution wealth maps that can be leveraged across multiple causal trials. However, because standard training objectives prioritize overall predictive accuracy, these predictions inherently suffer from shrinkage toward the mean, leading to attenuated estimates of causal treatment effects and limiting their utility in policy. Existing debiasing methods, such as Prediction-Powered Inference, can handle this attenuation bias but require additional fresh ground-truth data at the downstream stage of causal inference, which restricts their applicability in data-scarce environments. Here, we introduce and evaluate two correction methods -- linear calibration correction and Tweedie's correction -- that substantially reduce prediction bias without relying on newly collected labeled data. Linear calibration corrects bias through a straightforward linear transformation derived from held-out calibration data, whereas Tweedie's correction leverages empirical Bayes principles to directly address shrinkage-induced biases by exploiting score functions derived from the model's learning patterns. Through analytical exercises and experiments using Demographic and Health Survey data, we demonstrate that the proposed methods meet or outperform existing approaches that either require (a) adjustments to training pipelines or (b) additional labeled data. These approaches may represent a promising avenue for improving the reliability of causal inference when direct outcome measures are limited or unavailable, enabling a "one map, many trials" paradigm where a single upstream data creation team produces predictions usable by many downstream teams across diverse ML pipelines.
PAC Generalization via Invariant Representations
One method for obtaining generalizable solutions to machine learning tasks when presented with diverse training environments is to find invariant representations of the data. These are representations of the covariates such that the best model on top of the representation is invariant across training environments. In the context of linear Structural Equation Models (SEMs), invariant representations might allow us to learn models with out-of-distribution guarantees, i.e., models that are robust to interventions in the SEM. To address the invariant representation problem in a {\em finite sample} setting, we consider the notion of epsilon-approximate invariance. We study the following question: If a representation is approximately invariant with respect to a given number of training interventions, will it continue to be approximately invariant on a larger collection of unseen SEMs? This larger collection of SEMs is generated through a parameterized family of interventions. Inspired by PAC learning, we obtain finite-sample out-of-distribution generalization guarantees for approximate invariance that holds probabilistically over a family of linear SEMs without faithfulness assumptions. Our results show bounds that do not scale in ambient dimension when intervention sites are restricted to lie in a constant size subset of in-degree bounded nodes. We also show how to extend our results to a linear indirect observation model that incorporates latent variables.
Detecting Errors in a Numerical Response via any Regression Model
Noise plagues many numerical datasets, where the recorded values in the data may fail to match the true underlying values due to reasons including: erroneous sensors, data entry/processing mistakes, or imperfect human estimates. We consider general regression settings with covariates and a potentially corrupted response whose observed values may contain errors. By accounting for various uncertainties, we introduced veracity scores that distinguish between genuine errors and natural data fluctuations, conditioned on the available covariate information in the dataset. We propose a simple yet efficient filtering procedure for eliminating potential errors, and establish theoretical guarantees for our method. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.
HyDA: Hypernetworks for Test Time Domain Adaptation in Medical Imaging Analysis
Medical imaging datasets often vary due to differences in acquisition protocols, patient demographics, and imaging devices. These variations in data distribution, known as domain shift, present a significant challenge in adapting imaging analysis models for practical healthcare applications. Most current domain adaptation (DA) approaches aim either to align the distributions between the source and target domains or to learn an invariant feature space that generalizes well across all domains. However, both strategies require access to a sufficient number of examples, though not necessarily annotated, from the test domain during training. This limitation hinders the widespread deployment of models in clinical settings, where target domain data may only be accessible in real time. In this work, we introduce HyDA, a novel hypernetwork framework that leverages domain characteristics rather than suppressing them, enabling dynamic adaptation at inference time. Specifically, HyDA learns implicit domain representations and uses them to adjust model parameters on-the-fly, effectively interpolating to unseen domains. We validate HyDA on two clinically relevant applications - MRI brain age prediction and chest X-ray pathology classification - demonstrating its ability to generalize across tasks and modalities. Our code is available at TBD.
Regression with Label Permutation in Generalized Linear Model
The assumption that response and predictor belong to the same statistical unit may be violated in practice. Unbiased estimation and recovery of true label ordering based on unlabeled data are challenging tasks and have attracted increasing attentions in the recent literature. In this paper, we present a relatively complete analysis of label permutation problem for the generalized linear model with multivariate responses. The theory is established under different scenarios, with knowledge of true parameters, with partial knowledge of underlying label permutation matrix and without any knowledge. Our results remove the stringent conditions required by the current literature and are further extended to the missing observation setting which has never been considered in the field of label permutation problem. On computational side, we propose two methods, "maximum likelihood estimation" algorithm and "two-step estimation" algorithm, to accommodate for different settings. When the proportion of permuted labels is moderate, both methods work effectively. Multiple numerical experiments are provided and corroborate our theoretical findings.
Rethinking Channel Dependence for Multivariate Time Series Forecasting: Learning from Leading Indicators
Recently, channel-independent methods have achieved state-of-the-art performance in multivariate time series (MTS) forecasting. Despite reducing overfitting risks, these methods miss potential opportunities in utilizing channel dependence for accurate predictions. We argue that there exist locally stationary lead-lag relationships between variates, i.e., some lagged variates may follow the leading indicators within a short time period. Exploiting such channel dependence is beneficial since leading indicators offer advance information that can be used to reduce the forecasting difficulty of the lagged variates. In this paper, we propose a new method named LIFT that first efficiently estimates leading indicators and their leading steps at each time step and then judiciously allows the lagged variates to utilize the advance information from leading indicators. LIFT plays as a plugin that can be seamlessly collaborated with arbitrary time series forecasting methods. Extensive experiments on six real-world datasets demonstrate that LIFT improves the state-of-the-art methods by 5.5% in average forecasting performance. Our code is available at https://github.com/SJTU-Quant/LIFT.
WILDS: A Benchmark of in-the-Wild Distribution Shifts
Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity in the real-world deployments, these distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts that naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training yields substantially lower out-of-distribution than in-distribution performance. This gap remains even with models trained by existing methods for tackling distribution shifts, underscoring the need for new methods for training models that are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at https://wilds.stanford.edu.
Accuracy on the Curve: On the Nonlinear Correlation of ML Performance Between Data Subpopulations
Understanding the performance of machine learning (ML) models across diverse data distributions is critically important for reliable applications. Despite recent empirical studies positing a near-perfect linear correlation between in-distribution (ID) and out-of-distribution (OOD) accuracies, we empirically demonstrate that this correlation is more nuanced under subpopulation shifts. Through rigorous experimentation and analysis across a variety of datasets, models, and training epochs, we demonstrate that OOD performance often has a nonlinear correlation with ID performance in subpopulation shifts. Our findings, which contrast previous studies that have posited a linear correlation in model performance during distribution shifts, reveal a "moon shape" correlation (parabolic uptrend curve) between the test performance on the majority subpopulation and the minority subpopulation. This non-trivial nonlinear correlation holds across model architectures, hyperparameters, training durations, and the imbalance between subpopulations. Furthermore, we found that the nonlinearity of this "moon shape" is causally influenced by the degree of spurious correlations in the training data. Our controlled experiments show that stronger spurious correlation in the training data creates more nonlinear performance correlation. We provide complementary experimental and theoretical analyses for this phenomenon, and discuss its implications for ML reliability and fairness. Our work highlights the importance of understanding the nonlinear effects of model improvement on performance in different subpopulations, and has the potential to inform the development of more equitable and responsible machine learning models.
Causal Inference with Conditional Front-Door Adjustment and Identifiable Variational Autoencoder
An essential and challenging problem in causal inference is causal effect estimation from observational data. The problem becomes more difficult with the presence of unobserved confounding variables. The front-door adjustment is a practical approach for dealing with unobserved confounding variables. However, the restriction for the standard front-door adjustment is difficult to satisfy in practice. In this paper, we relax some of the restrictions by proposing the concept of conditional front-door (CFD) adjustment and develop the theorem that guarantees the causal effect identifiability of CFD adjustment. Furthermore, as it is often impossible for a CFD variable to be given in practice, it is desirable to learn it from data. By leveraging the ability of deep generative models, we propose CFDiVAE to learn the representation of the CFD adjustment variable directly from data with the identifiable Variational AutoEncoder and formally prove the model identifiability. Extensive experiments on synthetic datasets validate the effectiveness of CFDiVAE and its superiority over existing methods. The experiments also show that the performance of CFDiVAE is less sensitive to the causal strength of unobserved confounding variables. We further apply CFDiVAE to a real-world dataset to demonstrate its potential application.
Explore and Exploit the Diverse Knowledge in Model Zoo for Domain Generalization
The proliferation of pretrained models, as a result of advancements in pretraining techniques, has led to the emergence of a vast zoo of publicly available models. Effectively utilizing these resources to obtain models with robust out-of-distribution generalization capabilities for downstream tasks has become a crucial area of research. Previous research has primarily focused on identifying the most powerful models within the model zoo, neglecting to fully leverage the diverse inductive biases contained within. This paper argues that the knowledge contained in weaker models is valuable and presents a method for leveraging the diversity within the model zoo to improve out-of-distribution generalization capabilities. Specifically, we investigate the behaviors of various pretrained models across different domains of downstream tasks by characterizing the variations in their encoded representations in terms of two dimensions: diversity shift and correlation shift. This characterization enables us to propose a new algorithm for integrating diverse pretrained models, not limited to the strongest models, in order to achieve enhanced out-of-distribution generalization performance. Our proposed method demonstrates state-of-the-art empirical results on a variety of datasets, thus validating the benefits of utilizing diverse knowledge.
When to Accept Automated Predictions and When to Defer to Human Judgment?
Ensuring the reliability and safety of automated decision-making is crucial. It is well-known that data distribution shifts in machine learning can produce unreliable outcomes. This paper proposes a new approach for measuring the reliability of predictions under distribution shifts. We analyze how the outputs of a trained neural network change using clustering to measure distances between outputs and class centroids. We propose this distance as a metric to evaluate the confidence of predictions under distribution shifts. We assign each prediction to a cluster with centroid representing the mean softmax output for all correct predictions of a given class. We then define a safety threshold for a class as the smallest distance from an incorrect prediction to the given class centroid. We evaluate the approach on the MNIST and CIFAR-10 datasets using a Convolutional Neural Network and a Vision Transformer, respectively. The results show that our approach is consistent across these data sets and network models, and indicate that the proposed metric can offer an efficient way of determining when automated predictions are acceptable and when they should be deferred to human operators given a distribution shift.
Optimizing Brain Tumor Segmentation with MedNeXt: BraTS 2024 SSA and Pediatrics
Identifying key pathological features in brain MRIs is crucial for the long-term survival of glioma patients. However, manual segmentation is time-consuming, requiring expert intervention and is susceptible to human error. Therefore, significant research has been devoted to developing machine learning methods that can accurately segment tumors in 3D multimodal brain MRI scans. Despite their progress, state-of-the-art models are often limited by the data they are trained on, raising concerns about their reliability when applied to diverse populations that may introduce distribution shifts. Such shifts can stem from lower quality MRI technology (e.g., in sub-Saharan Africa) or variations in patient demographics (e.g., children). The BraTS-2024 challenge provides a platform to address these issues. This study presents our methodology for segmenting tumors in the BraTS-2024 SSA and Pediatric Tumors tasks using MedNeXt, comprehensive model ensembling, and thorough postprocessing. Our approach demonstrated strong performance on the unseen validation set, achieving an average Dice Similarity Coefficient (DSC) of 0.896 on the BraTS-2024 SSA dataset and an average DSC of 0.830 on the BraTS Pediatric Tumor dataset. Additionally, our method achieved an average Hausdorff Distance (HD95) of 14.682 on the BraTS-2024 SSA dataset and an average HD95 of 37.508 on the BraTS Pediatric dataset. Our GitHub repository can be accessed here: Project Repository : https://github.com/python-arch/BioMbz-Optimizing-Brain-Tumor-Segmentation-with-MedNeXt-BraTS-2024-SSA-and-Pediatrics
Graph-based Virtual Sensing from Sparse and Partial Multivariate Observations
Virtual sensing techniques allow for inferring signals at new unmonitored locations by exploiting spatio-temporal measurements coming from physical sensors at different locations. However, as the sensor coverage becomes sparse due to costs or other constraints, physical proximity cannot be used to support interpolation. In this paper, we overcome this challenge by leveraging dependencies between the target variable and a set of correlated variables (covariates) that can frequently be associated with each location of interest. From this viewpoint, covariates provide partial observability, and the problem consists of inferring values for unobserved channels by exploiting observations at other locations to learn how such variables can correlate. We introduce a novel graph-based methodology to exploit such relationships and design a graph deep learning architecture, named GgNet, implementing the framework. The proposed approach relies on propagating information over a nested graph structure that is used to learn dependencies between variables as well as locations. GgNet is extensively evaluated under different virtual sensing scenarios, demonstrating higher reconstruction accuracy compared to the state-of-the-art.
Conditional Instrumental Variable Regression with Representation Learning for Causal Inference
This paper studies the challenging problem of estimating causal effects from observational data, in the presence of unobserved confounders. The two-stage least square (TSLS) method and its variants with a standard instrumental variable (IV) are commonly used to eliminate confounding bias, including the bias caused by unobserved confounders, but they rely on the linearity assumption. Besides, the strict condition of unconfounded instruments posed on a standard IV is too strong to be practical. To address these challenging and practical problems of the standard IV method (linearity assumption and the strict condition), in this paper, we use a conditional IV (CIV) to relax the unconfounded instrument condition of standard IV and propose a non-linear CIV regression with Confounding Balancing Representation Learning, CBRL.CIV, for jointly eliminating the confounding bias from unobserved confounders and balancing the observed confounders, without the linearity assumption. We theoretically demonstrate the soundness of CBRL.CIV. Extensive experiments on synthetic and two real-world datasets show the competitive performance of CBRL.CIV against state-of-the-art IV-based estimators and superiority in dealing with the non-linear situation.
An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation
The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we present a unifying framework with topics in causal inference to make a case for the use of DA beyond just the i.i.d. setting, but for generalization across interventions as well. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce bias in causal effect estimation arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs) -- sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of IV-like (IVL) regression for mitigating confounding bias and improving predictive performance across interventions even when certain IV properties are relaxed. Finally, we cast parameterized DA as an IVL regression problem and show that when used in composition can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.
Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation
We consider the task of evaluating policies of algorithmic resource allocation through randomized controlled trials (RCTs). Such policies are tasked with optimizing the utilization of limited intervention resources, with the goal of maximizing the benefits derived. Evaluation of such allocation policies through RCTs proves difficult, notwithstanding the scale of the trial, because the individuals' outcomes are inextricably interlinked through resource constraints controlling the policy decisions. Our key contribution is to present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT. We identify conditions under which such reassignments are permissible and can be leveraged to construct counterfactual trials, whose outcomes can be accurately ascertained, for free. We prove theoretically that such an estimator is more accurate than common estimators based on sample means -- we show that it returns an unbiased estimate and simultaneously reduces variance. We demonstrate the value of our approach through empirical experiments on synthetic, semi-synthetic as well as real case study data and show improved estimation accuracy across the board.
Chinese vs. World Bank Development Projects: Insights from Earth Observation and Computer Vision on Wealth Gains in Africa, 2002-2013
Debates about whether development projects improve living conditions persist, partly because observational estimates can be biased by incomplete adjustment and because reliable outcome data are scarce at the neighborhood level. We address both issues in a continent-scale, sector-specific evaluation of Chinese and World Bank projects across 9,899 neighborhoods in 36 African countries (2002 to 2013), representative of 88% of the population. First, we use a recent dataset that measures living conditions with a machine-learned wealth index derived from contemporaneous satellite imagery, yielding a consistent panel of 6.7 km square mosaics. Second, to strengthen identification, we proxy officials' map-based placement criteria using pre-treatment daytime satellite images and fuse these with rich tabular covariates to estimate funder- and sector-specific ATEs via inverse-probability weighting. Incorporating imagery systematically shrinks effects relative to tabular-only models, indicating prior work likely overstated benefits. On average, both donors raise wealth, with larger gains for China; sector extremes in our sample include Trade and Tourism for the World Bank (+6.27 IWI points), and Emergency Response for China (+14.32). Assignment-mechanism analyses show World Bank placement is generally more predictable from imagery alone, as well as from tabular covariates. This suggests that Chinese project placements are more driven by non-visible, political, or event-driven factors than World Bank placements. To probe residual concerns about selection on observables, we also estimate within-neighborhood (unit) fixed-effects models at a spatial resolution about 450 times finer than prior fixed effects analyses, leveraging the computer-vision-imputed IWI panels; these deliver smaller but directionally consistent effects.
Towards Explaining Distribution Shifts
A distribution shift can have fundamental consequences such as signaling a change in the operating environment or significantly reducing the accuracy of downstream models. Thus, understanding distribution shifts is critical for examining and hopefully mitigating the effect of such a shift. Most prior work focuses on merely detecting if a shift has occurred and assumes any detected shift can be understood and handled appropriately by a human operator. We hope to aid in these manual mitigation tasks by explaining the distribution shift using interpretable transportation maps from the original distribution to the shifted one. We derive our interpretable mappings from a relaxation of optimal transport, where the candidate mappings are restricted to a set of interpretable mappings. We then inspect multiple quintessential use-cases of distribution shift in real-world tabular, text, and image datasets to showcase how our explanatory mappings provide a better balance between detail and interpretability than baseline explanations by both visual inspection and our PercentExplained metric.
On the limits of cross-domain generalization in automated X-ray prediction
This large scale study focuses on quantifying what X-rays diagnostic prediction tasks generalize well across multiple different datasets. We present evidence that the issue of generalization is not due to a shift in the images but instead a shift in the labels. We study the cross-domain performance, agreement between models, and model representations. We find interesting discrepancies between performance and agreement where models which both achieve good performance disagree in their predictions as well as models which agree yet achieve poor performance. We also test for concept similarity by regularizing a network to group tasks across multiple datasets together and observe variation across the tasks. All code is made available online and data is publicly available: https://github.com/mlmed/torchxrayvision
Inference in Non-stationary High-Dimensional VARs
In this paper we construct an inferential procedure for Granger causality in high-dimensional non-stationary vector autoregressive (VAR) models. Our method does not require knowledge of the order of integration of the time series under consideration. We augment the VAR with at least as many lags as the suspected maximum order of integration, an approach which has been proven to be robust against the presence of unit roots in low dimensions. We prove that we can restrict the augmentation to only the variables of interest for the testing, thereby making the approach suitable for high dimensions. We combine this lag augmentation with a post-double-selection procedure in which a set of initial penalized regressions is performed to select the relevant variables for both the Granger causing and caused variables. We then establish uniform asymptotic normality of a second-stage regression involving only the selected variables. Finite sample simulations show good performance, an application to investigate the (predictive) causes and effects of economic uncertainty illustrates the need to allow for unknown orders of integration.
Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift
This paper first answers the question "why do the two most powerful techniques Dropout and Batch Normalization (BN) often lead to a worse performance when they are combined together?" in both theoretical and statistical aspects. Theoretically, we find that Dropout would shift the variance of a specific neural unit when we transfer the state of that network from train to test. However, BN would maintain its statistical variance, which is accumulated from the entire learning procedure, in the test phase. The inconsistency of that variance (we name this scheme as "variance shift") causes the unstable numerical behavior in inference that leads to more erroneous predictions finally, when applying Dropout before BN. Thorough experiments on DenseNet, ResNet, ResNeXt and Wide ResNet confirm our findings. According to the uncovered mechanism, we next explore several strategies that modifies Dropout and try to overcome the limitations of their combination by avoiding the variance shift risks.
How can the use of different modes of survey data collection introduce bias? A simple introduction to mode effects using directed acyclic graphs (DAGs)
Survey data are self-reported data collected directly from respondents by a questionnaire or an interview and are commonly used in epidemiology. Such data are traditionally collected via a single mode (e.g. face-to-face interview alone), but use of mixed-mode designs (e.g. offering face-to-face interview or online survey) has become more common. This introduces two key challenges. First, individuals may respond differently to the same question depending on the mode; these differences due to measurement are known as 'mode effects'. Second, different individuals may participate via different modes; these differences in sample composition between modes are known as 'mode selection'. Where recognised, mode effects are often handled by straightforward approaches such as conditioning on survey mode. However, while reducing mode effects, this and other equivalent approaches may introduce collider bias in the presence of mode selection. The existence of mode effects and the consequences of na\"ive conditioning may be underappreciated in epidemiology. This paper offers a simple introduction to these challenges using directed acyclic graphs by exploring a range of possible data structures. We discuss the potential implications of using conditioning- or imputation-based approaches and outline the advantages of quantitative bias analyses for dealing with mode effects.
Twitter conversations predict the daily confirmed COVID-19 cases
As of writing this paper, COVID-19 (Coronavirus disease 2019) has spread to more than 220 countries and territories. Following the outbreak, the pandemic's seriousness has made people more active on social media, especially on the microblogging platforms such as Twitter and Weibo. The pandemic-specific discourse has remained on-trend on these platforms for months now. Previous studies have confirmed the contributions of such socially generated conversations towards situational awareness of crisis events. The early forecasts of cases are essential to authorities to estimate the requirements of resources needed to cope with the outgrowths of the virus. Therefore, this study attempts to incorporate the public discourse in the design of forecasting models particularly targeted for the steep-hill region of an ongoing wave. We propose a sentiment-involved topic-based latent variables search methodology for designing forecasting models from publicly available Twitter conversations. As a use case, we implement the proposed methodology on Australian COVID-19 daily cases and Twitter conversations generated within the country. Experimental results: (i) show the presence of latent social media variables that Granger-cause the daily COVID-19 confirmed cases, and (ii) confirm that those variables offer additional prediction capability to forecasting models. Further, the results show that the inclusion of social media variables introduces 48.83--51.38% improvements on RMSE over the baseline models. We also release the large-scale COVID-19 specific geotagged global tweets dataset, MegaGeoCOV, to the public anticipating that the geotagged data of this scale would aid in understanding the conversational dynamics of the pandemic through other spatial and temporal contexts.
Coping with Information Loss and the Use of Auxiliary Sources of Data: A Report from the NISS Ingram Olkin Forum Series on Unplanned Clinical Trial Disruptions
Clinical trials disruption has always represented a non negligible part of the ending of interventional studies. While the SARS-CoV-2 (COVID-19) pandemic has led to an impressive and unprecedented initiation of clinical research, it has also led to considerable disruption of clinical trials in other disease areas, with around 80% of non-COVID-19 trials stopped or interrupted during the pandemic. In many cases the disrupted trials will not have the planned statistical power necessary to yield interpretable results. This paper describes methods to compensate for the information loss arising from trial disruptions by incorporating additional information available from auxiliary data sources. The methods described include the use of auxiliary data on baseline and early outcome data available from the trial itself and frequentist and Bayesian approaches for the incorporation of information from external data sources. The methods are illustrated by application to the analysis of artificial data based on the Primary care pediatrics Learning Activity Nutrition (PLAN) study, a clinical trial assessing a diet and exercise intervention for overweight children, that was affected by the COVID-19 pandemic. We show how all of the methods proposed lead to an increase in precision relative to use of complete case data only.
Model-Twin Randomization (MoTR): A Monte Carlo Method for Estimating the Within-Individual Average Treatment Effect Using Wearable Sensors
Temporally dense single-person "small data" have become widely available thanks to mobile apps and wearable sensors. Many caregivers and self-trackers want to use these data to help a specific person change their behavior to achieve desired health outcomes. Ideally, this involves discerning possible causes from correlations using that person's own observational time series data. In this paper, we estimate within-individual average treatment effects of physical activity on sleep duration, and vice-versa. We introduce the model twin randomization (MoTR; "motor") method for analyzing an individual's intensive longitudinal data. Formally, MoTR is an application of the g-formula (i.e., standardization, back-door adjustment) under serial interference. It estimates stable recurring effects, as is done in n-of-1 trials and single case experimental designs. We compare our approach to standard methods (with possible confounding) to show how to use causal inference to make better personalized recommendations for health behavior change, and analyze 222 days of Fitbit sleep and steps data for one of the authors.
One-connection rule for structural equation models
Linear structural equation models are multivariate statistical models encoded by mixed graphs. In particular, the set of covariance matrices for distributions belonging to a linear structural equation model for a fixed mixed graph G=(V, D,B) is parameterized by a rational function with parameters for each vertex and edge in G. This rational parametrization naturally allows for the study of these models from an algebraic and combinatorial point of view. Indeed, this point of view has led to a collection of results in the literature, mainly focusing on questions related to identifiability and determining relationships between covariances (i.e., finding polynomials in the Gaussian vanishing ideal). So far, a large proportion of these results has focused on the case when D, the directed part of the mixed graph G, is acyclic. This is due to the fact that in the acyclic case, the parametrization becomes polynomial and there is a description of the entries of the covariance matrices in terms of a finite sum. We move beyond the acyclic case and give a closed form expression for the entries of the covariance matrices in terms of the one-connections in a graph obtained from D through some small operations. This closed form expression then allows us to show that if G is simple, then the parametrization map is generically finite-to-one. Finally, having a closed form expression for the covariance matrices allows for the development of an algorithm for systematically exploring possible polynomials in the Gaussian vanishing ideal.
Degrees of Randomness in Rerandomization Procedures
Randomized controlled trials are susceptible to imbalance on covariates predictive of the outcome. Rerandomization and deterministic treatment assignment are two proposed solutions. This paper explores the relationship between rerandomization and deterministic assignment, showing how deterministic assignment is an extreme case of rerandomization. The paper argues that in small experiments, both fully randomized and fully deterministic assignment have limitations. Instead, the researcher should consider setting the rerandomization acceptance probability based on an analysis of covariates and assumptions about the data structure to achieve an optimal alignment between randomness and balance. This allows for the calculation of minimum p-values along with valid permutation tests and fiducial intervals. The paper also introduces tools, including a new, open-source R package named fastrerandomize, to implement rerandomization and explore options for optimal rerandomization acceptance thresholds.
Double Machine Learning meets Panel Data -- Promises, Pitfalls, and Potential Solutions
Estimating causal effect using machine learning (ML) algorithms can help to relax functional form assumptions if used within appropriate frameworks. However, most of these frameworks assume settings with cross-sectional data, whereas researchers often have access to panel data, which in traditional methods helps to deal with unobserved heterogeneity between units. In this paper, we explore how we can adapt double/debiased machine learning (DML) (Chernozhukov et al., 2018) for panel data in the presence of unobserved heterogeneity. This adaptation is challenging because DML's cross-fitting procedure assumes independent data and the unobserved heterogeneity is not necessarily additively separable in settings with nonlinear observed confounding. We assess the performance of several intuitively appealing estimators in a variety of simulations. While we find violations of the cross-fitting assumptions to be largely inconsequential for the accuracy of the effect estimates, many of the considered methods fail to adequately account for the presence of unobserved heterogeneity. However, we find that using predictive models based on the correlated random effects approach (Mundlak, 1978) within DML leads to accurate coefficient estimates across settings, given a sample size that is large relative to the number of observed confounders. We also show that the influence of the unobserved heterogeneity on the observed confounders plays a significant role for the performance of most alternative methods.
VLUCI: Variational Learning of Unobserved Confounders for Counterfactual Inference
Causal inference plays a vital role in diverse domains like epidemiology, healthcare, and economics. De-confounding and counterfactual prediction in observational data has emerged as a prominent concern in causal inference research. While existing models tackle observed confounders, the presence of unobserved confounders remains a significant challenge, distorting causal inference and impacting counterfactual outcome accuracy. To address this, we propose a novel variational learning model of unobserved confounders for counterfactual inference (VLUCI), which generates the posterior distribution of unobserved confounders. VLUCI relaxes the unconfoundedness assumption often overlooked by most causal inference methods. By disentangling observed and unobserved confounders, VLUCI constructs a doubly variational inference model to approximate the distribution of unobserved confounders, which are used for inferring more accurate counterfactual outcomes. Extensive experiments on synthetic and semi-synthetic datasets demonstrate VLUCI's superior performance in inferring unobserved confounders. It is compatible with state-of-the-art counterfactual inference models, significantly improving inference accuracy at both group and individual levels. Additionally, VLUCI provides confidence intervals for counterfactual outcomes, aiding decision-making in risk-sensitive domains. We further clarify the considerations when applying VLUCI to cases where unobserved confounders don't strictly conform to our model assumptions using the public IHDP dataset as an example, highlighting the practical advantages of VLUCI.
RadEdit: stress-testing biomedical vision models via diffusion image editing
Biomedical imaging datasets are often small and biased, meaning that real-world performance of predictive models can be substantially lower than expected from internal testing. This work proposes using generative image editing to simulate dataset shifts and diagnose failure modes of biomedical vision models; this can be used in advance of deployment to assess readiness, potentially reducing cost and patient harm. Existing editing methods can produce undesirable changes, with spurious correlations learned due to the co-occurrence of disease and treatment interventions, limiting practical applicability. To address this, we train a text-to-image diffusion model on multiple chest X-ray datasets and introduce a new editing method RadEdit that uses multiple masks, if present, to constrain changes and ensure consistency in the edited images. We consider three types of dataset shifts: acquisition shift, manifestation shift, and population shift, and demonstrate that our approach can diagnose failures and quantify model robustness without additional data collection, complementing more qualitative tools for explainable AI.
Predicting Rare Events by Shrinking Towards Proportional Odds
Training classifiers is difficult with severe class imbalance, but many rare events are the culmination of a sequence with much more common intermediate outcomes. For example, in online marketing a user first sees an ad, then may click on it, and finally may make a purchase; estimating the probability of purchases is difficult because of their rarity. We show both theoretically and through data experiments that the more abundant data in earlier steps may be leveraged to improve estimation of probabilities of rare events. We present PRESTO, a relaxation of the proportional odds model for ordinal regression. Instead of estimating weights for one separating hyperplane that is shifted by separate intercepts for each of the estimated Bayes decision boundaries between adjacent pairs of categorical responses, we estimate separate weights for each of these transitions. We impose an L1 penalty on the differences between weights for the same feature in adjacent weight vectors in order to shrink towards the proportional odds model. We prove that PRESTO consistently estimates the decision boundary weights under a sparsity assumption. Synthetic and real data experiments show that our method can estimate rare probabilities in this setting better than both logistic regression on the rare category, which fails to borrow strength from more abundant categories, and the proportional odds model, which is too inflexible.
Group-robust Sample Reweighting for Subpopulation Shifts via Influence Functions
Machine learning models often have uneven performance among subpopulations (a.k.a., groups) in the data distributions. This poses a significant challenge for the models to generalize when the proportions of the groups shift during deployment. To improve robustness to such shifts, existing approaches have developed strategies that train models or perform hyperparameter tuning using the group-labeled data to minimize the worst-case loss over groups. However, a non-trivial amount of high-quality labels is often required to obtain noticeable improvements. Given the costliness of the labels, we propose to adopt a different paradigm to enhance group label efficiency: utilizing the group-labeled data as a target set to optimize the weights of other group-unlabeled data. We introduce Group-robust Sample Reweighting (GSR), a two-stage approach that first learns the representations from group-unlabeled data, and then tinkers the model by iteratively retraining its last layer on the reweighted data using influence functions. Our GSR is theoretically sound, practically lightweight, and effective in improving the robustness to subpopulation shifts. In particular, GSR outperforms the previous state-of-the-art approaches that require the same amount or even more group labels.
Comparison of meta-learners for estimating multi-valued treatment heterogeneous effects
Conditional Average Treatment Effects (CATE) estimation is one of the main challenges in causal inference with observational data. In addition to Machine Learning based-models, nonparametric estimators called meta-learners have been developed to estimate the CATE with the main advantage of not restraining the estimation to a specific supervised learning method. This task becomes, however, more complicated when the treatment is not binary as some limitations of the naive extensions emerge. This paper looks into meta-learners for estimating the heterogeneous effects of multi-valued treatments. We consider different meta-learners, and we carry out a theoretical analysis of their error upper bounds as functions of important parameters such as the number of treatment levels, showing that the naive extensions do not always provide satisfactory results. We introduce and discuss meta-learners that perform well as the number of treatments increases. We empirically confirm the strengths and weaknesses of those methods with synthetic and semi-synthetic datasets.
Phase Transitions in the Detection of Correlated Databases
We study the problem of detecting the correlation between two Gaussian databases XinR^{ntimes d} and Y^{ntimes d}, each composed of n users with d features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, these two databases are statistically independent. Under the alternative, however, there exists an unknown permutation sigma over the set of n users (or, row permutation), such that X is rho-correlated with Y^sigma, a permuted version of Y. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of n and d. Specifically, we prove that if rho^2dto0, as dtoinfty, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespectively of the value of n. This compliments the performance of a simple test that thresholds the sum all entries of X^TY. Furthermore, when d is fixed, we prove that strong detection (vanishing error probability) is impossible for any rho<rho^star, where rho^star is an explicit function of d, while weak detection is again impossible as long as rho^2dto0. These results close significant gaps in current recent related studies.
Attenuation Bias with Latent Predictors
Many political science theories relate to latent variables, but such quantities cannot be observed directly and must instead be estimated from data with inherent uncertainty. In regression models, when a variable is measured with error, its slope coefficient is known to be biased toward zero. We show how measurement error interacts with unique aspects of latent variable estimation, identification restrictions in particular, and demonstrate how common error adjustment strategies can worsen bias. We introduce a method for adjusting coefficients on latent predictors, which reduces bias and typically increases the magnitude of estimated coefficients, often dramatically. We illustrate these dynamics using several different estimation strategies for the latent predictors. Corrected estimates using our proposed method show stronger relationships -- sometimes up to 50% larger -- than those from naive regression. Our findings highlight the importance of considering measurement error in latent predictors and the inadequacy of many commonly used approaches for dealing with this issue.
Feature Contamination: Neural Networks Learn Uncorrelated Features and Fail to Generalize
Learning representations that generalize under distribution shifts is critical for building robust machine learning models. However, despite significant efforts in recent years, algorithmic advances in this direction have been limited. In this work, we seek to understand the fundamental difficulty of out-of-distribution generalization with deep neural networks. We first empirically show that perhaps surprisingly, even allowing a neural network to explicitly fit the representations obtained from a teacher network that can generalize out-of-distribution is insufficient for the generalization of the student network. Then, by a theoretical study of two-layer ReLU networks optimized by stochastic gradient descent (SGD) under a structured feature model, we identify a fundamental yet unexplored feature learning proclivity of neural networks, feature contamination: neural networks can learn uncorrelated features together with predictive features, resulting in generalization failure under distribution shifts. Notably, this mechanism essentially differs from the prevailing narrative in the literature that attributes the generalization failure to spurious correlations. Overall, our results offer new insights into the non-linear feature learning dynamics of neural networks and highlight the necessity of considering inductive biases in out-of-distribution generalization.
Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data
Learning causal structure from observational data often assumes that we observe independent and identically distributed (i.\,i.\,d) data. The traditional approach aims to find a graphical representation that encodes the same set of conditional independence relationships as those present in the observed distribution. It is known that under i.\,i.\,d assumption, even with infinite data, there is a limit to how fine-grained a causal structure we can identify. To overcome this limitation, recent work has explored using data originating from different, related environments to learn richer causal structure. These approaches implicitly rely on the independent causal mechanisms (ICM) principle, which postulates that the mechanism giving rise to an effect given its causes and the mechanism which generates the causes do not inform or influence each other. Thus, components of the causal model can independently change from environment to environment. Despite its wide application in machine learning and causal inference, there is a lack of statistical formalization of the ICM principle and how it enables identification of richer causal structures from grouped data. Here we present new causal de Finetti theorems which offer a first statistical formalization of ICM principle and show how causal structure identification is possible from exchangeable data. Our work provides theoretical justification for a broad range of techniques leveraging multi-environment data to learn causal structure.
Calibrated Chaos: Variance Between Runs of Neural Network Training is Harmless and Inevitable
Typical neural network trainings have substantial variance in test-set performance between repeated runs, impeding hyperparameter comparison and training reproducibility. We present the following results towards understanding this variation. (1) Despite having significant variance on their test-sets, we demonstrate that standard CIFAR-10 and ImageNet trainings have very little variance in their performance on the test-distributions from which those test-sets are sampled, suggesting that variance is less of a practical issue than previously thought. (2) We present a simplifying statistical assumption which closely approximates the structure of the test-set accuracy distribution. (3) We argue that test-set variance is inevitable in the following two senses. First, we show that variance is largely caused by high sensitivity of the training process to initial conditions, rather than by specific sources of randomness like the data order and augmentations. Second, we prove that variance is unavoidable given the observation that ensembles of trained networks are well-calibrated. (4) We conduct preliminary studies of distribution-shift, fine-tuning, data augmentation and learning rate through the lens of variance between runs.
Bayesian Evidence Synthesis for Modeling SARS-CoV-2 Transmission
The acute phase of the Covid-19 pandemic has made apparent the need for decision support based upon accurate epidemic modeling. This process is substantially hampered by under-reporting of cases and related data incompleteness issues. In this article we adopt the Bayesian paradigm and synthesize publicly available data via a discrete-time stochastic epidemic modeling framework. The models allow for estimating the total number of infections while accounting for the endemic phase of the pandemic. We assess the prediction of the infection rate utilizing mobility information, notably the principal components of the mobility data. We evaluate variational Bayes in this context and find that Hamiltonian Monte Carlo offers a robust inference alternative for such models. We elaborate upon vector analysis of the epidemic dynamics, thus enriching the traditional tools used for decision making. In particular, we show how certain 2-dimensional plots on the phase plane may yield intuitive information regarding the speed and the type of transmission dynamics. We investigate the potential of a two-stage analysis as a consequence of cutting feedback, for inference on certain functionals of the model parameters. Finally, we show that a point mass on critical parameters is overly restrictive and investigate informative priors as a suitable alternative.
Tackling Interference Induced by Data Training Loops in A/B Tests: A Weighted Training Approach
In modern recommendation systems, the standard pipeline involves training machine learning models on historical data to predict user behaviors and improve recommendations continuously. However, these data training loops can introduce interference in A/B tests, where data generated by control and treatment algorithms, potentially with different distributions, are combined. To address these challenges, we introduce a novel approach called weighted training. This approach entails training a model to predict the probability of each data point appearing in either the treatment or control data and subsequently applying weighted losses during model training. We demonstrate that this approach achieves the least variance among all estimators that do not cause shifts in the training distributions. Through simulation studies, we demonstrate the lower bias and variance of our approach compared to other methods.
Sensitivity Analysis On Loss Landscape
Gradients can be employed for sensitivity analysis. Here, we leverage the advantages of the Loss Landscape to comprehend which independent variables impact the dependent variable. We seek to grasp the loss landscape by utilizing first, second, and third derivatives through automatic differentiation. we know that Spearman's rank correlation coefficient can detect the monotonic relationship between two variables. However, I have found that second-order gradients, with certain configurations and parameters, provide information that can be visualized similarly to Spearman results, In this approach, we incorporate a loss function with an activation function, resulting in a non-linear pattern. Each exploration of the loss landscape through retraining yields new valuable information. Furthermore, the first and third derivatives are also beneficial, as they indicate the extent to which independent variables influence the dependent variable.
Learning the Dynamics of Sparsely Observed Interacting Systems
We address the problem of learning the dynamics of an unknown non-parametric system linking a target and a feature time series. The feature time series is measured on a sparse and irregular grid, while we have access to only a few points of the target time series. Once learned, we can use these dynamics to predict values of the target from the previous values of the feature time series. We frame this task as learning the solution map of a controlled differential equation (CDE). By leveraging the rich theory of signatures, we are able to cast this non-linear problem as a high-dimensional linear regression. We provide an oracle bound on the prediction error which exhibits explicit dependencies on the individual-specific sampling schemes. Our theoretical results are illustrated by simulations which show that our method outperforms existing algorithms for recovering the full time series while being computationally cheap. We conclude by demonstrating its potential on real-world epidemiological data.
Proximity Ascertainment Bias in Early Covid Case Locations
A comparison of the distances to the Huanan Seafood Market of early Covid cases with known links to the market versus cases without known links shows results apparently incompatible with a location model lacking proximity ascertainment bias. The sign of the difference instead agrees with a model in which such ascertainment bias is large. In the presence of such bias inferences based on the clustering of case locations become unreliable.
A Versatile Causal Discovery Framework to Allow Causally-Related Hidden Variables
Most existing causal discovery methods rely on the assumption of no latent confounders, limiting their applicability in solving real-life problems. In this paper, we introduce a novel, versatile framework for causal discovery that accommodates the presence of causally-related hidden variables almost everywhere in the causal network (for instance, they can be effects of observed variables), based on rank information of covariance matrix over observed variables. We start by investigating the efficacy of rank in comparison to conditional independence and, theoretically, establish necessary and sufficient conditions for the identifiability of certain latent structural patterns. Furthermore, we develop a Rank-based Latent Causal Discovery algorithm, RLCD, that can efficiently locate hidden variables, determine their cardinalities, and discover the entire causal structure over both measured and hidden ones. We also show that, under certain graphical conditions, RLCD correctly identifies the Markov Equivalence Class of the whole latent causal graph asymptotically. Experimental results on both synthetic and real-world personality data sets demonstrate the efficacy of the proposed approach in finite-sample cases.
Sequential Underspecified Instrument Selection for Cause-Effect Estimation
Instrumental variable (IV) methods are used to estimate causal effects in settings with unobserved confounding, where we cannot directly experiment on the treatment variable. Instruments are variables which only affect the outcome indirectly via the treatment variable(s). Most IV applications focus on low-dimensional treatments and crucially require at least as many instruments as treatments. This assumption is restrictive: in the natural sciences we often seek to infer causal effects of high-dimensional treatments (e.g., the effect of gene expressions or microbiota on health and disease), but can only run few experiments with a limited number of instruments (e.g., drugs or antibiotics). In such underspecified problems, the full treatment effect is not identifiable in a single experiment even in the linear case. We show that one can still reliably recover the projection of the treatment effect onto the instrumented subspace and develop techniques to consistently combine such partial estimates from different sets of instruments. We then leverage our combined estimators in an algorithm that iteratively proposes the most informative instruments at each round of experimentation to maximize the overall information about the full causal effect.
A Bayesian approach to the g-formula
Epidemiologists often wish to estimate quantities that are easy to communicate and correspond to the results of realistic public health scenarios. Methods from causal inference can answer these questions. We adopt the language of potential outcomes under Rubin's original Bayesian framework and show that the parametric g-formula is easily amenable to a Bayesian approach. We show that the frequentist properties of the Bayesian g-formula suggest it improves the accuracy of estimates of causal effects in small samples or when data may be sparse. We demonstrate our approach to estimate the effect of environmental tobacco smoke on body mass index z-scores among children aged 4-9 years who were enrolled in a longitudinal birth cohort in New York, USA. We give a general algorithm and supply SAS and Stan code that can be adopted to implement our computational approach in both time-fixed and longitudinal data.
Demystifying Disagreement-on-the-Line in High Dimensions
Evaluating the performance of machine learning models under distribution shift is challenging, especially when we only have unlabeled data from the shifted (target) domain, along with labeled data from the original (source) domain. Recent work suggests that the notion of disagreement, the degree to which two models trained with different randomness differ on the same input, is a key to tackle this problem. Experimentally, disagreement and prediction error have been shown to be strongly connected, which has been used to estimate model performance. Experiments have led to the discovery of the disagreement-on-the-line phenomenon, whereby the classification error under the target domain is often a linear function of the classification error under the source domain; and whenever this property holds, disagreement under the source and target domain follow the same linear relation. In this work, we develop a theoretical foundation for analyzing disagreement in high-dimensional random features regression; and study under what conditions the disagreement-on-the-line phenomenon occurs in our setting. Experiments on CIFAR-10-C, Tiny ImageNet-C, and Camelyon17 are consistent with our theory and support the universality of the theoretical findings.
DAG-aware Transformer for Causal Effect Estimation
Causal inference is a critical task across fields such as healthcare, economics, and the social sciences. While recent advances in machine learning, especially those based on the deep-learning architectures, have shown potential in estimating causal effects, existing approaches often fall short in handling complex causal structures and lack adaptability across various causal scenarios. In this paper, we present a novel transformer-based method for causal inference that overcomes these challenges. The core innovation of our model lies in its integration of causal Directed Acyclic Graphs (DAGs) directly into the attention mechanism, enabling it to accurately model the underlying causal structure. This allows for flexible estimation of both average treatment effects (ATE) and conditional average treatment effects (CATE). Extensive experiments on both synthetic and real-world datasets demonstrate that our approach surpasses existing methods in estimating causal effects across a wide range of scenarios. The flexibility and robustness of our model make it a valuable tool for researchers and practitioners tackling complex causal inference problems.
Chasing Your Long Tails: Differentially Private Prediction in Health Care Settings
Machine learning models in health care are often deployed in settings where it is important to protect patient privacy. In such settings, methods for differentially private (DP) learning provide a general-purpose approach to learn models with privacy guarantees. Modern methods for DP learning ensure privacy through mechanisms that censor information judged as too unique. The resulting privacy-preserving models, therefore, neglect information from the tails of a data distribution, resulting in a loss of accuracy that can disproportionately affect small groups. In this paper, we study the effects of DP learning in health care. We use state-of-the-art methods for DP learning to train privacy-preserving models in clinical prediction tasks, including x-ray classification of images and mortality prediction in time series data. We use these models to perform a comprehensive empirical investigation of the tradeoffs between privacy, utility, robustness to dataset shift, and fairness. Our results highlight lesser-known limitations of methods for DP learning in health care, models that exhibit steep tradeoffs between privacy and utility, and models whose predictions are disproportionately influenced by large demographic groups in the training data. We discuss the costs and benefits of differentially private learning in health care.
Vector-Valued Control Variates
Control variates are variance reduction tools for Monte Carlo estimators. They can provide significant variance reduction, but usually require a large number of samples, which can be prohibitive when sampling or evaluating the integrand is computationally expensive. Furthermore, there are many scenarios where we need to compute multiple related integrals simultaneously or sequentially, which can further exacerbate computational costs. In this paper, we propose vector-valued control variates, an extension of control variates which can be used to reduce the variance of multiple Monte Carlo estimators jointly. This allows for the transfer of information across integration tasks, and hence reduces the need for a large number of samples. We focus on control variates based on kernel interpolants and our novel construction is obtained through a generalised Stein identity and the development of novel matrix-valued Stein reproducing kernels. We demonstrate our methodology on a range of problems including multifidelity modelling, Bayesian inference for dynamical systems, and model evidence computation through thermodynamic integration.
Test-Time Style Shifting: Handling Arbitrary Styles in Domain Generalization
In domain generalization (DG), the target domain is unknown when the model is being trained, and the trained model should successfully work on an arbitrary (and possibly unseen) target domain during inference. This is a difficult problem, and despite active studies in recent years, it remains a great challenge. In this paper, we take a simple yet effective approach to tackle this issue. We propose test-time style shifting, which shifts the style of the test sample (that has a large style gap with the source domains) to the nearest source domain that the model is already familiar with, before making the prediction. This strategy enables the model to handle any target domains with arbitrary style statistics, without additional model update at test-time. Additionally, we propose style balancing, which provides a great platform for maximizing the advantage of test-time style shifting by handling the DG-specific imbalance issues. The proposed ideas are easy to implement and successfully work in conjunction with various other DG schemes. Experimental results on different datasets show the effectiveness of our methods.
Identifiable Latent Polynomial Causal Models Through the Lens of Change
Causal representation learning aims to unveil latent high-level causal representations from observed low-level data. One of its primary tasks is to provide reliable assurance of identifying these latent causal models, known as identifiability. A recent breakthrough explores identifiability by leveraging the change of causal influences among latent causal variables across multiple environments liu2022identifying. However, this progress rests on the assumption that the causal relationships among latent causal variables adhere strictly to linear Gaussian models. In this paper, we extend the scope of latent causal models to involve nonlinear causal relationships, represented by polynomial models, and general noise distributions conforming to the exponential family. Additionally, we investigate the necessity of imposing changes on all causal parameters and present partial identifiability results when part of them remains unchanged. Further, we propose a novel empirical estimation method, grounded in our theoretical finding, that enables learning consistent latent causal representations. Our experimental results, obtained from both synthetic and real-world data, validate our theoretical contributions concerning identifiability and consistency.
Conformal Prediction with Missing Values
Conformal prediction is a theoretically grounded framework for constructing predictive intervals. We study conformal prediction with missing values in the covariates -- a setting that brings new challenges to uncertainty quantification. We first show that the marginal coverage guarantee of conformal prediction holds on imputed data for any missingness distribution and almost all imputation functions. However, we emphasize that the average coverage varies depending on the pattern of missing values: conformal methods tend to construct prediction intervals that under-cover the response conditionally to some missing patterns. This motivates our novel generalized conformalized quantile regression framework, missing data augmentation, which yields prediction intervals that are valid conditionally to the patterns of missing values, despite their exponential number. We then show that a universally consistent quantile regression algorithm trained on the imputed data is Bayes optimal for the pinball risk, thus achieving valid coverage conditionally to any given data point. Moreover, we examine the case of a linear model, which demonstrates the importance of our proposal in overcoming the heteroskedasticity induced by missing values. Using synthetic and data from critical care, we corroborate our theory and report improved performance of our methods.
Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes
The quality of many modern machine learning models improves as model complexity increases, an effect that has been quantified, for predictive performance, with the non-monotonic double descent learning curve. Here, we address the overarching question: is there an analogous theory of double descent for models which estimate uncertainty? We provide a partially affirmative and partially negative answer in the setting of Gaussian processes (GP). Under standard assumptions, we prove that higher model quality for optimally-tuned GPs (including uncertainty prediction) under marginal likelihood is realized for larger input dimensions, and therefore exhibits a monotone error curve. After showing that marginal likelihood does not naturally exhibit double descent in the input dimension, we highlight related forms of posterior predictive loss that do exhibit non-monotonicity. Finally, we verify empirically that our results hold for real data, beyond our considered assumptions, and we explore consequences involving synthetic covariates.
Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation
We study the problem of model selection in causal inference, specifically for the case of conditional average treatment effect (CATE) estimation under binary treatments. Unlike model selection in machine learning, there is no perfect analogue of cross-validation as we do not observe the counterfactual potential outcome for any data point. Towards this, there have been a variety of proxy metrics proposed in the literature, that depend on auxiliary nuisance models estimated from the observed data (propensity score model, outcome regression model). However, the effectiveness of these metrics has only been studied on synthetic datasets as we can access the counterfactual data for them. We conduct an extensive empirical analysis to judge the performance of these metrics introduced in the literature, and novel ones introduced in this work, where we utilize the latest advances in generative modeling to incorporate multiple realistic datasets. Our analysis suggests novel model selection strategies based on careful hyperparameter tuning of CATE estimators and causal ensembling.
Pay Attention to Evolution: Time Series Forecasting with Deep Graph-Evolution Learning
Time-series forecasting is one of the most active research topics in artificial intelligence. Applications in real-world time series should consider two factors for achieving reliable predictions: modeling dynamic dependencies among multiple variables and adjusting the model's intrinsic hyperparameters. A still open gap in that literature is that statistical and ensemble learning approaches systematically present lower predictive performance than deep learning methods. They generally disregard the data sequence aspect entangled with multivariate data represented in more than one time series. Conversely, this work presents a novel neural network architecture for time-series forecasting that combines the power of graph evolution with deep recurrent learning on distinct data distributions; we named our method Recurrent Graph Evolution Neural Network (ReGENN). The idea is to infer multiple multivariate relationships between co-occurring time-series by assuming that the temporal data depends not only on inner variables and intra-temporal relationships (i.e., observations from itself) but also on outer variables and inter-temporal relationships (i.e., observations from other-selves). An extensive set of experiments was conducted comparing ReGENN with dozens of ensemble methods and classical statistical ones, showing sound improvement of up to 64.87% over the competing algorithms. Furthermore, we present an analysis of the intermediate weights arising from ReGENN, showing that by looking at inter and intra-temporal relationships simultaneously, time-series forecasting is majorly improved if paying attention to how multiple multivariate data synchronously evolve.
WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion Models
Generating temporal data under conditions is crucial for forecasting, imputation, and generative tasks. Such data often has metadata and partially observed signals that jointly influence the generated values. However, existing methods face three key limitations: (1) they condition on either the metadata or observed values, but rarely both together; (2) they adopt either training-time approaches that fail to generalize to unseen scenarios, or inference-time approaches that ignore metadata; and (3) they suffer from trade-offs between generation speed and temporal coherence across time windows--choosing either slow but coherent autoregressive methods or fast but incoherent parallel ones. We propose WaveStitch, a novel diffusion-based method to overcome these hurdles through: (1) dual-sourced conditioning on both metadata and partially observed signals; (2) a hybrid training-inference architecture, incorporating metadata during training and observations at inference via gradient-based guidance; and (3) a novel pipeline-style paradigm that generates time windows in parallel while preserving coherence through an inference-time conditional loss and a stitching mechanism. Across diverse datasets, WaveStitch demonstrates adaptability to arbitrary patterns of observed signals, achieving 1.81x lower mean-squared-error compared to the state-of-the-art, and generates data up to 166.48x faster than autoregressive methods while maintaining coherence. Our code is available at: https://github.com/adis98/WaveStitch
PCM Selector: Penalized Covariate-Mediator Selection Operator for Evaluating Linear Causal Effects
For a data-generating process for random variables that can be described with a linear structural equation model, we consider a situation in which (i) a set of covariates satisfying the back-door criterion cannot be observed or (ii) such a set can be observed, but standard statistical estimation methods cannot be applied to estimate causal effects because of multicollinearity/high-dimensional data problems. We propose a novel two-stage penalized regression approach, the penalized covariate-mediator selection operator (PCM Selector), to estimate the causal effects in such scenarios. Unlike existing penalized regression analyses, when a set of intermediate variables is available, PCM Selector provides a consistent or less biased estimator of the causal effect. In addition, PCM Selector provides a variable selection procedure for intermediate variables to obtain better estimation accuracy of the causal effects than does the back-door criterion.
