Recalibration of Aleatoric and Epistemic Regression Uncertainty in Medical Imaging

Max-Heinrich Laves1,2, Sontje Ihler2, Jacob F. Fast2,3, Lüder A. Kahrs4,5, Tobias Ortmaier2
1: Institute of Medical Technology and Intelligent Systems, Hamburg University of Technology, 2: Institute of Mechatronic Systems, Leibniz Universität Hannover, 3: Hannover Medical School, 4: Centre for Image Guided Innovation and Therapeutic Intervention, The Hospital for Sick Children, Toronto, 5: Department of Mathematical and Computational Sciences, University of Toronto Mississauga
MIDL 2020 special issue
Publication date: 2021/04/28

Abstract

The consideration of predictive uncertainty in medical imaging with deep learning is of utmost importance. We apply estimation of both aleatoric and epistemic uncertainty by variational Bayesian inference with Monte Carlo dropout to regression tasks and show that predictive uncertainty is systematically underestimated. We apply sigma scaling with a single scalar value; a simple, yet effective calibration method for both types of uncertainty. The performance of our approach is evaluated on a variety of common medical regression data sets using different state-of-the-art convolutional network architectures. In our experiments, sigma scaling is able to reliably recalibrate predictive uncertainty. It is easy to implement and maintains the accuracy. Well-calibrated uncertainty in regression allows robust rejection of unreliable predictions or detection of out-of-distribution samples. Our source code is available at: https://github.com/mlaves/well-calibrated-regression-uncertainty

Keywords

Bayesian approximation · variational inference



1 Introduction

Predictive uncertainty should be considered in any medical imaging task that is approached with deep learning. Well-calibrated uncertainty is of great importance for decision-making and is anticipated to increase patient safety. It allows robust rejection of unreliable predictions or out-of-distribution samples. In this paper, we address the problem of miscalibration of regression uncertainty with application to medical image analysis.

For the task of regression, we aim to estimate a continuous target value $\bm{y} \in \mathbb{R}^d$ given an input image $\bm{x}$. Regression in medical imaging with deep learning has been applied to forensic age estimation from hand CT/MRI (Halabi et al., 2019; Štern et al., 2016), anatomical landmark localization (Payer et al., 2019), cell detection in histology (Xie et al., 2018), and instrument pose estimation (Gessert et al., 2018). By predicting the coordinates of object boundaries, segmentation can also be performed as a regression task. This has been done for segmentation of pulmonary nodules in CT (Messay et al., 2015), kidneys in ultrasound (Yin et al., 2020), and left ventricles in MRI (Tan et al., 2017). In registration of medical images, a continuous displacement field is predicted for each coordinate of $\bm{x}$, which has also recently been addressed with CNNs for regression (Dalca et al., 2019).

In medical imaging, it is crucial to consider the predictive uncertainty of deep learning models. Bayesian neural networks (BNN) and their approximations provide mathematical tools for reasoning about uncertainty (Bishop, 2006; Kingma and Welling, 2014). In general, predictive uncertainty can be split into two types: aleatoric and epistemic uncertainty (Tanno et al., 2017; Kendall and Gal, 2017). This distinction was first made in the context of risk management (Hora, 1996). Aleatoric uncertainty arises from the data directly, e.g., sensor noise or motion artifacts. Epistemic uncertainty is caused by uncertainty in the model parameters due to a limited amount of training data (Bishop, 2006). A well-accepted approach to quantifying epistemic uncertainty is variational inference with Monte Carlo (MC) dropout, where dropout is used at test time to sample from the approximate posterior (Gal and Ghahramani, 2016).

Figure 1: Calibration plots and uncertainty calibration error (UCE) for EfficientNet-B4 on the BreastPathQ test set. Uncalibrated uncertainty is underestimated and does not correspond well with the model error (left). Uncertainty can be calibrated most effectively with $\sigma$ scaling (right). Solid lines show the mean and shaded areas show the standard deviation from 5 repeated runs. Dashed lines denote perfect calibration.

Uncertainty quantification in regression problems in medical imaging has been addressed by prior work. Medical image enhancement with image quality transfer (IQT) has been extended to a Bayesian approach to obtain pixel-wise uncertainty (Tanno et al., 2016). Additionally, CNN-based IQT was used to estimate both aleatoric and epistemic uncertainty in MRI super-resolution (Tanno et al., 2017). Dalca et al. (2019) estimated uncertainty for a deformation field in medical image registration using a probabilistic CNN. Registration uncertainty has also been addressed outside the deep learning community (Luo et al., 2019). Schlemper et al. (2018) used sub-network ensembles to obtain uncertainty estimates in cardiac MRI reconstruction. Aleatoric and epistemic uncertainty were also used in multitask learning for MRI-based radiotherapy planning (Bragman et al., 2018).

Uncertainty obtained by deep BNNs tends to be miscalibrated, i.e., it does not correlate well with the model error (Laves et al., 2019). Fig. 1 shows calibration plots (observed uncertainty vs. expected uncertainty) for uncalibrated and calibrated uncertainty estimates. The predicted uncertainty (taking into account both epistemic and aleatoric uncertainty) is underestimated and does not allow robust detection of uncertain predictions at test time.

Calibration of uncertainty in regression has been addressed in prior work outside medical imaging. In (Kuleshov et al., 2018), inaccurate uncertainties from Bayesian regression models are recalibrated using a technique inspired by Platt scaling. Given a pre-trained, miscalibrated model $\bm{H}$, an auxiliary model $\bm{R}: [0,1]^d \rightarrow [0,1]^d$ is trained that yields a calibrated regressor $\bm{R} \circ \bm{H}$. In (Phan et al., 2018), this method was applied to bounding box regression. However, an auxiliary model with enough capacity will always be able to recalibrate, even if the predicted uncertainty is completely uncorrelated with the real uncertainty. Furthermore, Kuleshov et al. (2018) state that calibration via $\bm{R}$ is possible if enough independent and identically distributed (i.i.d.) data is available. In medical imaging, large data sets are usually hard to obtain, which can cause $\bm{R}$ to overfit the calibration set. This downside was addressed in (Levi et al., 2019), which is most related to our work. They proposed to scale the standard deviation of a Gaussian model to recalibrate aleatoric uncertainty. In contrast to our work, they do not take into account epistemic uncertainty, which is an important source of uncertainty, especially when dealing with small data sets in medical imaging.

This paper extends a preliminary version of this work presented at the Medical Imaging with Deep Learning (MIDL) 2020 conference (Laves et al., 2020). We continue this work by providing a new derivation of our definition of perfect calibration, new experimental results, analysis and discussion. Additionally, prediction intervals are computed to further assess the quality of the estimated uncertainty. We find that the estimated prediction intervals are too narrow and that recalibration can mitigate this problem.

To the best of our knowledge, calibration of predictive uncertainty for regression tasks in medical imaging has not been addressed before. Our main contributions are: (1) We suggest using $\sigma$ scaling in a separate calibration phase to tackle underestimation of aleatoric and epistemic uncertainty, (2) we propose to use the uncertainty calibration error and prediction intervals to assess the quality of the estimated uncertainty, and (3) we perform extensive experiments on four different data sets to show the effectiveness of the proposed method.

2 Methods

In this section, we discuss estimation of aleatoric and epistemic uncertainty for regression and show why uncertainty is systematically miscalibrated. We propose to use $\sigma$ scaling to jointly calibrate aleatoric and epistemic uncertainty.

2.1 Conditional Log-Likelihood for Regression

We revisit regression under the maximum a posteriori (MAP) framework to derive direct estimation of heteroscedastic aleatoric uncertainty. That is, the aleatoric uncertainty varies with the input and is not assumed to be constant. The goal of our regression model is to predict a target value $\bm{y}$ given some new input $\bm{x}$ and a training set $\mathcal{D}$ of $m$ inputs $\bm{X} = \{\bm{x}_1, \ldots, \bm{x}_m\}$ and their corresponding (observed) target values $\bm{Y} = \{\bm{y}_1, \ldots, \bm{y}_m\}$. We assume that $\bm{y}$ has a Gaussian distribution $\mathcal{N}\big(\bm{y}; \hat{\bm{y}}(\bm{x}), \hat{\sigma}^2(\bm{x})\big)$ with mean $\hat{\bm{y}}(\bm{x})$ and variance $\hat{\sigma}^2(\bm{x})$. A neural network with parameters $\bm{\theta}$

$$ \bm{f}_{\bm{\theta}}(\bm{x}) = \big[\hat{\bm{y}}(\bm{x}), \hat{\sigma}^2(\bm{x})\big], \quad \hat{\bm{y}} \in \mathbb{R}^d, \ \hat{\sigma}^2 \in \mathbb{R}, \ \hat{\sigma}^2 \geq 0 \tag{1} $$

outputs these values for a given input (Nix and Weigend, 1994). By assuming a Gaussian prior over the parameters $\bm{\theta} \sim \mathcal{N}(\bm{\theta}; \bm{0}, \lambda^{-1}\bm{I})$, MAP estimation becomes maximum-likelihood estimation with added weight decay (Bishop, 2006). With $m$ i.i.d. random samples, the conditional log-likelihood $\log p(\bm{Y} \,|\, \bm{X}, \bm{\theta})$ is given by

$$ \log p(\bm{Y} \,|\, \bm{X}, \bm{\theta}) = \sum_{i=1}^{m} \log\left( \frac{1}{\sqrt{2\pi}\,\hat{\sigma}^{(i)}_{\bm{\theta}}} \exp\left\{ -\frac{\big\|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big\|^2}{2\big(\hat{\sigma}^{(i)}_{\bm{\theta}}\big)^2} \right\} \right) \tag{2} $$
$$ = -\frac{m}{2}\log(2\pi) - \sum_{i=1}^{m} \left[ \log\big(\hat{\sigma}^{(i)}_{\bm{\theta}}\big) + \frac{1}{2\big(\hat{\sigma}^{(i)}_{\bm{\theta}}\big)^2} \big\|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big\|^2 \right] . \tag{3} $$

The dependence on $\bm{x}$ has been omitted to simplify the notation. Maximizing the log-likelihood in Eq. (3) w.r.t. $\bm{\theta}$ is equivalent to minimizing the negative log-likelihood (NLL), which leads to the following optimization criterion (with weight decay)

$$ \mathcal{L}_{\mathrm{G}}(\bm{\theta}) = \sum_{i=1}^{m} \left[ \big(\hat{\sigma}^{(i)}_{\bm{\theta}}\big)^{-2} \big\|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big\|^2 + \log\Big(\big(\hat{\sigma}^{(i)}_{\bm{\theta}}\big)^2\Big) \right] . \tag{4} $$

Here, $\hat{\bm{y}}_{\bm{\theta}}$ and $\hat{\sigma}_{\bm{\theta}}$ are estimated jointly by finding the $\bm{\theta}$ that minimizes Eq. (4). This can be achieved with gradient descent in a standard training procedure. In this case, $\hat{\sigma}_{\bm{\theta}}$ captures the uncertainty that is inherent in the data (aleatoric uncertainty). To avoid numerical instability due to potential division by zero, we directly estimate $\log \hat{\sigma}^2(\bm{x})$ and implement Eq. (4) following the practice of Kendall and Gal (2017).
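The criterion in Eq. (4) amounts to only a few lines of code. The following is a minimal PyTorch sketch (function and variable names are ours, not taken from the authors' repository), assuming the network predicts $\log \hat{\sigma}^2$ directly as described above:

```python
import torch

def gaussian_nll(y, y_hat, log_var):
    """Gaussian NLL criterion of Eq. (4) with log(sigma^2) parametrization.

    y, y_hat: (batch, d) target values and predicted means
    log_var:  (batch, 1) predicted log variance log(sigma^2)
    """
    log_var = log_var.squeeze(-1)
    squared_error = torch.sum((y - y_hat) ** 2, dim=-1)
    # exp(-log_var) = 1 / sigma^2; the loss is averaged over the batch
    return torch.mean(torch.exp(-log_var) * squared_error + log_var)
```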

2.2 Biased Estimation of $\sigma$

Figure 2: Biased estimation of aleatoric uncertainty $\sigma^2$. The deep model overfits estimation of $\bm{y}$ on the training set. On unseen test data, the MSE of the predictive mean is higher and $\sigma^2$ is underestimated. Early stopping (e.g., at epoch 50) would result in an unbiased estimator, but this would not be optimal in terms of test MSE.

Ignoring their dependence through $\bm{\theta}$, the solution to Eq. (4) decouples estimation of $\hat{\bm{y}}$ and $\hat{\sigma}$. In the case of a Gaussian likelihood, minimizing Eq. (4) w.r.t. $\hat{\bm{y}}^{(i)}$ yields

$$ \hat{\bm{y}}^{(i)} = \operatorname*{arg\,min}_{\hat{\bm{y}}^{(i)}} \mathcal{L}_{\mathrm{G}} = \bm{y}^{(i)} \quad \forall i . \tag{5} $$

Minimizing Eq. (4) w.r.t. $(\hat{\sigma}^{(i)})^2$ yields

$$ \big(\hat{\sigma}^{(i)}\big)^2 = \operatorname*{arg\,min}_{(\hat{\sigma}^{(i)})^2} \mathcal{L}_{\mathrm{G}} = \big\|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}\big\|^2 \quad \forall i . \tag{6} $$

That is, the estimate of $\sigma^2$ should perfectly reflect the squared error. However, in Eq. (6), $\sigma^2$ is estimated relative to the estimated mean $\hat{\bm{y}}$ and is therefore biased. In fact, the maximum-likelihood solution systematically underestimates $\sigma^2$, which is a phenomenon of overfitting the training set (Bishop, 2006). The squared error $\|\bm{y} - \hat{\bm{y}}\|^2$ will be lower on the training set, and $\hat{\sigma}^2$ on new samples will be systematically too low (see Fig. 2). This is a problem especially in deep learning, where large models have millions of parameters and tend to overfit. To solve this issue, we introduce a simple learnable scalar parameter $s$ to rescale the biased estimation of $\sigma^2$.

2.3 $\sigma$ Scaling for Aleatoric Uncertainty

We first derive $\sigma$ scaling for aleatoric uncertainty. Using a Gaussian model, we scale the standard deviation $\sigma$ with a scalar value $s$ to recalibrate the probability density function

$$ p\big(\bm{y} \,|\, \bm{x}; \hat{\bm{y}}(\bm{x}), \hat{\sigma}^2(\bm{x})\big) = \mathcal{N}\big(\bm{y}; \hat{\bm{y}}(\bm{x}), (s \cdot \hat{\sigma}(\bm{x}))^2\big) . \tag{7} $$

This results in the following minimization objective:

$$ \mathcal{L}_{\mathrm{G}}(s) = m \log(s) + \tfrac{1}{2} s^{-2} \sum_{i=1}^{m} \big(\hat{\sigma}^{(i)}_{\bm{\theta}}\big)^{-2} \big\|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big\|^2 . \tag{8} $$

Eq. (8) is optimized w.r.t. $s$ with fixed $\bm{\theta}$ using gradient descent in a separate calibration phase after training, which calibrates the aleatoric uncertainty measured by $\hat{\sigma}^2_{\bm{\theta}}$. In the case of a single scalar, the solution to Eq. (8) can also be written in closed form as

$$ s = \pm\sqrt{\frac{1}{m} \sum_{i=1}^{m} \big(\hat{\sigma}^{(i)}_{\bm{\theta}}\big)^{-2} \big\|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big\|^2} . \tag{9} $$

We apply $\sigma$ scaling to jointly calibrate aleatoric and epistemic uncertainty in the next section.
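Since only the positive root of Eq. (9) is meaningful as a scaling factor, the closed-form solution is straightforward to compute on a held-out calibration set. A minimal sketch (names are illustrative, not from the authors' code):

```python
import torch

@torch.no_grad()
def closed_form_s(y, y_hat, sigma2):
    """Positive root of Eq. (9), computed on a calibration set.

    y, y_hat: (m, d) target values and predicted means
    sigma2:   (m,) predicted aleatoric variances sigma^2
    """
    squared_error = torch.sum((y - y_hat) ** 2, dim=-1)
    return torch.sqrt(torch.mean(squared_error / sigma2))

# recalibrated variance per sample: (s * sigma)^2 = s**2 * sigma2
```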

2.4 Well-Calibrated Estimation of Predictive Uncertainty

So far, we have assumed a MAP point estimate for $\bm{\theta}$, which does not consider uncertainty in the parameters. To quantify both aleatoric and epistemic uncertainty, we extend $\bm{f}_{\bm{\theta}}$ into a fully Bayesian model under the variational inference framework with Monte Carlo dropout (Gal and Ghahramani, 2016). In MC dropout, the model $\bm{f}_{\tilde{\bm{\theta}}}$ is trained with dropout (Srivastava et al., 2014), and dropout is applied at test time by performing $N$ stochastic forward passes to sample from the approximate Bayesian posterior $\tilde{\bm{\theta}} \sim q(\bm{\theta})$. Following (Kendall and Gal, 2017), we use MC integration to approximate the predictive variance

$$ \hat{\Sigma}^2 = \underbrace{\frac{1}{N} \sum_{n=1}^{N} \bigg( \hat{\bm{y}}_n - \frac{1}{N} \sum_{n=1}^{N} \hat{\bm{y}}_n \bigg)^{\!2}}_{\mathrm{epistemic}} + \underbrace{\frac{1}{N} \sum_{n=1}^{N} \hat{\sigma}^2_n}_{\mathrm{aleatoric}} \tag{10} $$

and use $\hat{\Sigma}^2$ as a measure of predictive uncertainty. If the neural network has multiple outputs ($d > 1$), the predictive variance is calculated per output, and the mean across the $d$ outputs forms the final uncertainty value. Eq. (10) is an unbiased estimator of the approximate predictive variance (see proof in Appendix C). From Eq. (33) of our proof it follows that $\hat{\Sigma}^2$ is expected to equal the true variance $\Sigma = \mathbb{E}[(\hat{\bm{y}} - \bm{y})^2]$. Thus, we define perfect calibration of regression uncertainty as

$$ \mathbb{E}_{\bm{x},\bm{y}}\Big[ \mathbb{E}\big[(\hat{\bm{y}} - \bm{y})^2\big] \,\big|\, \hat{\Sigma}^2 = \alpha^2 \Big] = \alpha^2 \quad \forall \big\{ \alpha^2 \in \mathbb{R} \,\big|\, \alpha^2 \geq 0 \big\} , \tag{11} $$

which extends the definition of (Levi et al., 2019) to both aleatoric and epistemic uncertainty. We expect that additionally accounting for epistemic uncertainty is particularly beneficial for smaller data sets. However, even in deep learning with Bayesian principles, the approximate posterior predictive distribution can overfit on small data sets. In practice, this leads to underestimation of the predictive uncertainty.

One could regularize against overfitting with early stopping, preventing large differences between training and test loss and thus circumventing underestimation of $\sigma^2$. However, our experiments show that early stopping is not optimal with regard to accuracy, i.e., the squared error of $\hat{\bm{y}}$ on both training and test data (see Fig. 2). In contrast, the model with the lowest mean error on the validation set underestimates predictive uncertainty considerably. Therefore, we apply $\sigma$ scaling to recalibrate the predictive uncertainty $\hat{\Sigma}^2$. This allows a lower squared error while reducing underestimation of uncertainty, as shown experimentally in the following section.
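For illustration, the MC dropout estimate of Eq. (10) reduces to a few tensor operations. The following PyTorch sketch assumes a model that returns $(\hat{\bm{y}}, \log\hat{\sigma}^2)$ as in § 2.1; the helper name and sampling loop are ours:

```python
import torch

@torch.no_grad()
def mc_predict(model, x, n_samples=25):
    """Predictive mean and variance via Eq. (10) with MC dropout."""
    model.eval()
    for mod in model.modules():
        if isinstance(mod, torch.nn.Dropout):
            mod.train()  # keep only dropout stochastic at test time
    samples = [model(x) for _ in range(n_samples)]
    y_hat = torch.stack([s[0] for s in samples])              # (N, batch, d)
    sigma2 = torch.exp(torch.stack([s[1] for s in samples]))  # (N, batch, 1)
    mean = y_hat.mean(dim=0)
    epistemic = y_hat.var(dim=0, unbiased=False)  # variance of the N means
    aleatoric = sigma2.mean(dim=0)                # mean of the N variances
    return mean, epistemic + aleatoric            # predictive variance
```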

2.5 Expected Uncertainty Calibration Error for Regression

We extend the definition of miscalibrated uncertainty for classification (Laves et al., 2019) to quantify miscalibration of uncertainty in regression

$$ \mathbb{E}_{\hat{\Sigma}^2}\Big[ \big| \big( \mathbb{E}\big[(\hat{\bm{y}} - \bm{y})^2\big] \,\big|\, \hat{\Sigma}^2 = \alpha^2 \big) - \alpha^2 \big| \Big] \quad \forall \big\{ \alpha^2 \in \mathbb{R} \,\big|\, \alpha^2 \geq 0 \big\} , \tag{12} $$

using the second moment of the error. On finite data sets, this can be approximated with the expected uncertainty calibration error (UCE) for regression. Following (Guo et al., 2017), the uncertainty output $\hat{\Sigma}^2$ of a deep model is partitioned into $K$ bins of equal width. A weighted average of the difference between the variance and the predictive uncertainty is used:

$$ \mathrm{UCE} := \sum_{k=1}^{K} \frac{|B_k|}{m} \big| \mathrm{var}(B_k) - \mathrm{uncert}(B_k) \big| , \tag{13} $$

with number of inputs $m$ and the set of indices $B_k$ of inputs whose uncertainty falls into bin $k$. The variance per bin is defined as

$$ \mathrm{var}(B_k) := \frac{1}{|B_k|} \sum_{i \in B_k} \frac{1}{N} \sum_{n=1}^{N} \big( \hat{\bm{y}}_{i,n} - \bm{y}_i \big)^2 , \tag{14} $$

with $N$ stochastic forward passes, and the uncertainty per bin is defined as

$$ \mathrm{uncert}(B_k) := \frac{1}{|B_k|} \sum_{i \in B_k} \hat{\Sigma}^2_i . \tag{15} $$

Note that computing the second moment in Eq. (12) also incorporates MC samples, which can introduce some bias in the evaluation. The UCE considers both aleatoric and epistemic uncertainty and is given in % throughout this work. Additionally, we plot $\mathrm{var}(B_k)$ vs. $\mathrm{uncert}(B_k)$ to create calibration diagrams.
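A possible NumPy sketch of Eqs. (13)-(15), assuming per-sample squared errors and uncertainties have already been computed via MC dropout (binning over the observed uncertainty range is our choice; the paper does not fix the bin range):

```python
import numpy as np

def uce(squared_errors, uncertainties, n_bins=10):
    """Uncertainty calibration error of Eq. (13).

    squared_errors: (m,) MC estimates of the per-sample variance, Eq. (14)
    uncertainties:  (m,) per-sample predictive uncertainties Sigma^2
    """
    edges = np.linspace(uncertainties.min(), uncertainties.max(), n_bins + 1)
    bin_idx = np.digitize(uncertainties, edges[1:-1])  # bin index 0..n_bins-1
    m, err = len(squared_errors), 0.0
    for k in range(n_bins):
        in_bin = bin_idx == k
        if in_bin.any():
            err += (in_bin.sum() / m) * abs(
                squared_errors[in_bin].mean() - uncertainties[in_bin].mean())
    return err
```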

3 Experiments

We use four data sets and three common deep network architectures to evaluate recalibration with $\sigma$ scaling. The data sets were selected to represent various regression tasks in medical imaging with different dimension $d$ of the target value $\bm{y} \in \mathbb{R}^d$:

(1) Estimation of tumor cellularity in histology whole slides of cancerous breast tissue from the BreastPathQ SPIE challenge data set ($d = 1$) (Martel et al., 2019). The public data set consists of 2579 images, from which 1379/600/600 are used for training/validation/testing. The ground truth label is a single scalar $y \in [0,1]$ denoting the ratio of tumor cells to non-tumor cells.

(2) Hand CT age regression from the RSNA pediatric bone age data set ($d = 1$) (Halabi et al., 2019). The task is to infer a person's age in months from CT scans of the hand. This data set is the largest used in this paper and has 12,811 images, from which we use 6811/2000/4000 images for training/validation/testing.

(3) Surgical instrument tracking on endoscopic images from the EndoVis endoscopic vision challenge 2015 data set ($d = 2$; endovissub-instrument.grand-challenge.org). This data set contains 8,984 video frames from 6 different robot-assisted laparoscopic interventions showing surgical instruments with ground truth pixel coordinates of the instrument's center point $\bm{y} \in \mathbb{R}^2$. We use 4483/2248/2253 frames for training/validation/testing. As the public data set is only sparsely annotated, we created our own ground truth labels, which can be found in our code repository.

(4) 6DoF needle pose estimation on optical coherence tomography (OCT) scans from our own data set, which is publicly available at github.com/mlaves/3doct-pose-dataset. This data set contains 5,000 3D-OCT scans with the accompanying needle pose $\bm{y} \in \mathbb{R}^6$, from which we use 3300/850/850 for training/validation/testing. Additional details on this data set can be found in Appendix E.

All outputs are normalized such that $\bm{y} \in [0,1]^d$. The employed network architectures are ResNet-101, DenseNet-201 and EfficientNet-B4 (He et al., 2016; Huang et al., 2017; Tan and Le, 2019). The last linear layer of all networks is replaced by two linear layers predicting $\hat{\bm{y}}$ and $\hat{\sigma}^2$ as described in § 2.1. For MC dropout, we use dropout before the last linear layers. Dropout is further added after each of the four layers of stacked residual blocks in ResNet. In DenseNet and EfficientNet, we use the default configuration of dropout during training and testing. The networks are trained until no further decrease in mean squared error (MSE) on the validation set can be observed. More details on the training procedure can be found in Appendix D.
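The two-head modification can be sketched as follows in PyTorch (a simplified sketch: it only places dropout before the heads and omits the additional dropout after ResNet's residual stages; class and argument names are ours):

```python
import torch.nn as nn
import torchvision

class MeanVarianceHead(nn.Module):
    """Backbone with two linear heads predicting y_hat and log(sigma^2)."""

    def __init__(self, backbone, n_features, d, p_drop=0.2):
        super().__init__()
        self.backbone = backbone
        self.dropout = nn.Dropout(p_drop)  # sampled at test time for MC dropout
        self.mean = nn.Linear(n_features, d)
        self.log_var = nn.Linear(n_features, 1)

    def forward(self, x):
        z = self.dropout(self.backbone(x))
        return self.mean(z), self.log_var(z)

# e.g., ResNet-101 with its classifier stripped:
resnet = torchvision.models.resnet101()
resnet.fc = nn.Identity()
model = MeanVarianceHead(resnet, n_features=2048, d=1)
```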

Calibration is performed after training in a separate calibration phase using the validation data set. We plug the predictive uncertainty $\hat{\Sigma}^2$ into Eq. (8) and minimize w.r.t. $s$. Additionally, we compare $\sigma$ scaling to a more powerful auxiliary recalibration model $\bm{R}$, consisting of a two-layer fully-connected network with 16 hidden units and ReLU activations (inspired by (Kuleshov et al., 2018), see § 1).
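The calibration phase itself is a small optimization over the single scalar $s$. A sketch of minimizing Eq. (8) with the predictive uncertainty plugged in (hyperparameters and the log-parametrization of $s$ are our choices):

```python
import torch

def calibrate_s(y, y_hat, Sigma2, steps=500, lr=1e-2):
    """Minimize Eq. (8) w.r.t. s with fixed network parameters theta."""
    rho = torch.zeros(1, requires_grad=True)  # s = exp(rho) keeps s > 0
    optimizer = torch.optim.Adam([rho], lr=lr)
    squared_error = torch.sum((y - y_hat) ** 2, dim=-1)
    m = squared_error.shape[0]
    weighted = torch.sum(squared_error / Sigma2)  # constant w.r.t. rho
    for _ in range(steps):
        optimizer.zero_grad()
        # m*log(s) + 0.5 * s^-2 * sum(...), with s = exp(rho)
        loss = m * rho + 0.5 * torch.exp(-2.0 * rho) * weighted
        loss.backward()
        optimizer.step()
    return torch.exp(rho).item()
```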

Figure 3: Calibration plots for ResNet-101 on BreastPathQ (top row) and EfficientNet-B4 on EndoVis (bottom row). Aux scaling tends to overfit the calibration set, which results in higher UCE compared to simple $\sigma$ scaling. Dashed lines denote perfect calibration.

4 Results

To quantify miscalibration, we use the proposed expected uncertainty calibration error for regression. We visualize (mis-)calibration in Fig. 1 and Fig. 3 using calibration diagrams, which show expected uncertainty vs. observed uncertainty. Deviation from the identity function reveals miscalibration. The calibration diagrams clearly show the underestimation of uncertainty for the uncalibrated models. After calibration with both aux and $\sigma$ scaling, the estimated uncertainty better reflects the actual uncertainty. Figures for all configurations are listed in Appendix G.

Tab. 1 reports UCE values of all data set/model combinations on the respective test sets. The negative log-likelihood also measures miscalibration; the values on the test set can be found in Tab. 2 in the appendix. In general, recalibration considerably reduces miscalibration. On the BoneAge, EndoVis and OCT data sets, both scaling methods perform similarly well. However, on the BreastPathQ data set, $\sigma$ scaling clearly outperforms aux scaling in terms of UCE. BreastPathQ is the smallest data set and thus has the smallest calibration set size. We hypothesize that the more powerful auxiliary model $\bm{R}$ overfits the calibration set (see BreastPathQ/DenseNet-201 in Tab. 1), which leads to an increase of UCE on the test set. An ablation study on BreastPathQ for the auxiliary model can be found in Appendix F.

We also compare our approach in Tab. 1 to Levi et al. (2019), who only consider aleatoric uncertainty. The aleatoric uncertainty is well-calibrated if it reflects the bias $(\mathbb{E}[\hat{\bm{y}}_n] - \bm{y})^2$, i.e., the squared error between the expectation of the stochastic predictions $\hat{\bm{y}}_n$ and the ground truth. Therefore, the UCE for the aleatoric-only case is computed as $\mathrm{UCE} = \sum_{k=1}^{K} \frac{|B_k|}{m} \big| \mathrm{err}(B_k) - \mathrm{uncert}(B_k) \big|$, where $\mathrm{err}(\cdot)$ is the mean squared error and $\mathrm{uncert}(\cdot)$ is the mean aleatoric uncertainty per bin. Consideration of epistemic uncertainty is especially beneficial on smaller data sets (BreastPathQ), where our approach outperforms Levi et al. (2019). On larger data sets, the benefit diminishes and both approaches are equally well calibrated.

Additionally, we report UCE values from a DenseNet ensemble for comparison. In contrast to what is reported by Lakshminarayanan et al. (2017), the deep ensemble tends to be calibrated worse. Only on BoneAge is the ensemble better calibrated than the other methods prior to their recalibration. After recalibration, both approaches outperform the deep ensemble.

Fig. 4 shows the result of intra-training calibration of aleatoric uncertainty. It indicates that the gap between training and test loss is successfully closed.

Figure 4: (Left) Intra-training calibration of aleatoric uncertainty with $\sigma$ scaling. The deep model no longer underestimates $\hat{\sigma}^2$ on unseen test data. (Right) The MSE of the predictive mean is higher and $\sigma^2$ is underestimated. Note: Calibration is only applied at test time.
Figure 5: Example result from the EndoVis test set. The task is to predict the pixel coordinates of the forceps shaft center. Before calibration, the uncertainty is underestimated and the true instrument position $\bm{y}$ does not fall into the prediction region $\hat{\bm{y}} \pm \hat{\Sigma}$. After calibration with $\sigma$ scaling, the uncertainty better reflects the predictive error.
Table 1: Uncertainty calibration error test set results for different data sets and model architectures (averaged over 5 runs). High UCE values indicate miscalibration. In addition, the resulting $s$ for $\sigma$ scaling is given. We also report UCE values for an ensemble of DenseNets.

                                        Levi et al.               ours
Data Set     Model            MSE     none  aux   σ     s      none  aux   σ     s      ensemble
BreastPathQ  ResNet-101       6.4e-3  0.51  0.35  0.28  2.91   0.49  0.31  0.20  2.37
             DenseNet-201     7.0e-3  0.21  0.38  0.15  1.62   0.11  0.36  0.15  1.33   0.51
             EfficientNet-B4  6.4e-3  0.49  0.65  0.10  2.30   0.46  0.47  0.17  1.77
BoneAge      ResNet-101       5.3e-3  0.28  0.07  0.06  1.46   0.28  0.02  0.06  1.40
             DenseNet-201     3.5e-3  0.31  0.05  0.05  2.98   0.31  0.05  0.05  2.54   0.09
             EfficientNet-B4  3.5e-3  0.30  0.05  0.10  4.83   0.30  0.03  0.12  3.98
EndoVis      ResNet-101       4.0e-4  0.04  0.10  0.09  6.07   0.04  0.04  0.04  3.50
             DenseNet-201     1.1e-3  0.09  0.05  0.05  3.24   0.04  0.04  0.04  2.57   0.08
             EfficientNet-B4  8.9e-4  0.06  0.05  0.06  2.25   0.06  0.04  0.04  1.79
OCT          ResNet-101       2.0e-3  0.17  0.02  0.02  2.74   0.17  0.01  0.02  2.14
             DenseNet-201     1.3e-3  0.08  0.01  0.02  1.60   0.04  0.03  0.02  1.26   0.67
             EfficientNet-B4  1.4e-3  0.12  0.01  0.01  2.65   0.12  0.01  0.01  1.94

4.1 Posterior Prediction Intervals

In addition to the calibration diagrams, we compute prediction intervals from the uncalibrated and calibrated posterior predictive distributions. Well-calibrated prediction intervals provide a reliable measure of the precision of the estimated target value. In Bayesian inference, prediction intervals define an interval within which the true target value $\bm{y}^\ast$ of a new, unobserved input $\bm{x}^\ast$ is expected to fall with a specific probability (Heskes, 1997; Held and Sabanés Bové, 2014). This is also referred to as the credible interval of the posterior predictive distribution. For $\gamma \in (0,1)$, a $\gamma \cdot 100\,\%$ prediction interval is defined through $z_l$ and $z_u$ such that

$$ \int_{z_l}^{z_u} p(\bm{y}^\ast \,|\, \bm{x}^\ast, \mathcal{D}) \, \mathrm{d}\bm{y}^\ast = \gamma , \tag{16} $$

with posterior predictive distribution $p(\bm{y}^\ast \,|\, \bm{x}^\ast, \mathcal{D})$. We compute the 50 %, 90 %, 95 %, and 99 % prediction intervals using the root of the predictive variance from Eq. (10); that is, the $\hat{\bm{y}} \pm z\hat{\Sigma}$ intervals with $z \in \{\Phi(0.5), \Phi(0.9), \Phi(0.95), \Phi(0.99)\}$ (estimated interval), where $\Phi(p) = \sqrt{2}\,\mathrm{erf}^{-1}(p)$ is the probit function and $\mathrm{erf}(p)$ the Gaussian error function. This assumes that the posterior predictive distribution is Gaussian, which is not generally the case. To assess the calibration of the posterior prediction intervals, we compute the percentage of ground truth values of the test set that actually fall within the respective intervals (observed interval). In Fig. 6, selected plots of observed vs. estimated prediction intervals are shown. A complete list of prediction intervals can be found in Appendix G.1.
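Observed coverage can be computed directly from the MC dropout outputs. A minimal sketch for scalar targets ($d = 1$; the function name is ours):

```python
import numpy as np
from scipy.special import erfinv

def observed_interval(y, y_hat, Sigma2, gamma):
    """Fraction of test targets inside the gamma*100 % prediction interval
    y_hat +/- z*Sigma, with probit z = Phi(gamma) = sqrt(2) * erfinv(gamma)."""
    z = np.sqrt(2.0) * erfinv(gamma)
    inside = np.abs(y - y_hat) <= z * np.sqrt(Sigma2)
    return inside.mean()

# e.g., observed_interval(y, y_hat, Sigma2, 0.9) should be ~0.9 if calibrated
```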

In general, the uncalibrated prediction intervals are estimated too narrow, which is a direct consequence of the underestimated predictive variance. For example, the uncalibrated 90 % interval for DenseNet-201/BoneAge actually contains only approx. 50 % of the ground truth values. On this data set, the prediction intervals are considerably improved after recalibration (Fig. 6 left). If a network is already well-calibrated, recalibration can lead to overestimation of the lower prediction intervals (Fig. 6 right). However, in all cases, the 99 % prediction interval contains approx. 99 % of the ground truth test set values after recalibration. This is not the case without the proposed calibration methods. Fig. 5 shows a practical example of the $\hat{\bm{y}} \pm \hat{\Sigma}$ prediction region from the EndoVis test set. Even though the posterior predictive distribution is not necessarily Gaussian, the calibrated results fit the prediction intervals well. This is especially the case for BoneAge, which is the largest data set used in this paper.

Figure 6: Observed vs. estimated posterior prediction intervals on the test sets. (Left & center) The uncalibrated prediction intervals are too narrow due to underestimation of uncertainty. (Right) Calibration can lead to overestimation of prediction intervals if the network is already well-calibrated. Dashed lines denote the 1:1 mapping.

4.2 Detection of Out-of-Distribution Data and Unreliable Predictions

Deep neural networks only yield reliable predictions for data that follow the same distribution as the training data. A shift in distribution could occur, for example, when a model trained on CT data from a specific CT device is applied to data from another manufacturer's device. This could potentially lead to wrong predictions with low uncertainty, which we tackle with recalibration. To create a moderate distribution shift, we preprocess images from the BoneAge data set using Contrast Limited Adaptive Histogram Equalization (CLAHE) (Pizer et al., 1987) with a clip limit of 0.03 and report histograms of the uncertainties (see Fig. 7). Additionally, a severe distribution shift is created by presenting images from the BreastPathQ data set to the models trained on BoneAge. Lakshminarayanan et al. (2017) state that deep ensembles provide better-calibrated uncertainty than Bayesian neural networks with MC dropout variational inference. We therefore train an ensemble of 5 randomly initialized DenseNet-201 models and compare Bayesian uncertainty with $\sigma$ scaling to ensemble uncertainty under distribution shift. The results with $\sigma$ scaling are comparable to those of a deep ensemble for a moderate shift, but without the need to train multiple models on the same data set. A severe shift leads to only slightly increased uncertainties from the calibrated MC dropout model, while the deep ensemble is more sensitive.

Additionally, we apply the well-calibrated models to detect and reject uncertain predictions, as crucial decisions in medical practice should only be made on the basis of reliable predictions. An uncertainty threshold $\Sigma^2_{\mathrm{max}}$ is defined, and all predictions from the test set with $\hat{\Sigma}^2 > \Sigma^2_{\mathrm{max}}$ are rejected (see Fig. 8). From this, a decrease in overall MSE is expected. We additionally compare rejection on the basis of $\sigma$-scaled uncertainty to uncertainty from the aforementioned ensemble. In the case of $\sigma$ scaling, the test set MSE decreases monotonically as a function of the uncertainty threshold, whereas the ensemble initially shows an increasing MSE (see Fig. 8).
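The rejection rule is a simple threshold on the predictive uncertainty; sweeping the threshold traces a curve like the one in Fig. 8. A sketch (names are ours):

```python
import numpy as np

def mse_after_rejection(y, y_hat, Sigma2, Sigma2_max):
    """Test MSE over predictions kept after rejecting all samples
    with predictive uncertainty above the threshold Sigma2_max."""
    keep = Sigma2 <= Sigma2_max
    mse = np.mean((y[keep] - y_hat[keep]) ** 2)
    return mse, 1.0 - keep.mean()  # MSE and fraction rejected
```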

Figure 7: Histograms of the uncertainties for out-of-distribution detection with DenseNet-201 on the BoneAge test set. (Left) Uncertainties from a non-Bayesian ensemble of five DenseNets and (right) Bayesian uncertainties calibrated with $\sigma$ scaling. The distribution shifts have been created by pre-processing with CLAHE (moderate) and by images from a different domain (severe).
Figure 8: Rejection of uncertain predictions with DenseNet-201 on the BoneAge test set with $\hat{\Sigma}^2 > \Sigma^2_{\mathrm{max}}$. The width of the shaded area visualizes the percentage of rejected samples. The dashed line visualizes a linear relationship.

5 Discussion & Conclusion

In this paper, we discussed well-calibrated predictive uncertainty in medical imaging obtained by variational inference with deep Bayesian models. Both the aux and $\sigma$ scaling calibration methods considerably reduce miscalibration of predictive uncertainty in terms of UCE. If the deep model is already well-calibrated, $\sigma$ scaling does not negatively affect the calibration, which results in $s \rightarrow 1$. More complex calibration methods such as aux scaling have to be used with caution, as they can overfit the data set used for calibration. If the calibration set is sufficiently large, they can outperform simple scaling. However, models trained on large data sets are generally better calibrated, and the benefit diminishes. Compared to the work of Levi et al. (2019), accounting for epistemic uncertainty is particularly beneficial for smaller data sets, which is helpful in medical practice, where access to large labeled data sets is less common and associated with great costs.

Posterior prediction intervals provide another insight into the calibration of deep models. After recalibration, the 99 % posterior prediction intervals correctly contain approx. 99 % of the ground truth test set values. In some cases, lower prediction intervals are estimated to be too wide after calibration. This is especially the case for smaller data sets, and we conjecture that small calibration sets may not contain enough i.i.d. data for calibrating lower prediction intervals, and that the assumption of a Gaussian predictive distribution is too strong in this case. On the smallest data set, BreastPathQ, aux scaling seems to perform better in terms of prediction intervals, but not in terms of UCE.

Well-calibrated uncertainties from MC dropout are able to detect a moderate shift in the data distribution. However, deep ensembles perform better under a severe distribution shift. BNNs with uncertainty calibrated by $\sigma$ scaling outperform ensemble uncertainty in the rejection task, which we attribute to the generally poorer calibration of ensembles on in-distribution data.

$\sigma$ scaling is simple to implement, does not change the predictive mean $\hat{\bm{y}}$, and is therefore guaranteed to preserve the model's accuracy. It is preferable to regularization (e.g., early stopping) or more complex recalibration methods for calibrated uncertainty estimation with Bayesian deep learning. The gap between training and test NLL can successfully be closed, which creates highly accurate models with reliable uncertainty estimates. However, there are many factors (e.g., network capacity, weight decay, dropout configuration) influencing the uncertainty that have not been discussed here and will be addressed in future work.


Acknowledgments

We thank Vincent Modes and Mark Wielitzka for their insightful comments. This research has received funding from the European Union as part of the ERDF OPhonLas project.

References

  • Bishop (2006) Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. ISBN 978-0-387-31073-2.
  • Bragman et al. (2018) Felix JS Bragman, Ryutaro Tanno, Zach Eaton-Rosen, Wenqi Li, David J Hawkes, Sebastien Ourselin, Daniel C Alexander, Jamie R McClelland, and M Jorge Cardoso. Uncertainty in multitask learning: joint representations for probabilistic MR-only radiotherapy planning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 3–11, 2018.
  • Dalca et al. (2019) Adrian V. Dalca, Guha Balakrishnan, John Guttag, and Mert R. Sabuncu. Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Med Image Anal, 57:226–236, 2019. doi: https://doi.org/10.1016/j.media.2019.07.006.
  • Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML, pages 1050–1059, 2016.
  • Gessert et al. (2018) Nils Gessert, Matthias Schlüter, and Alexander Schlaefer. A deep learning approach for pose estimation from volumetric OCT data. Med Image Anal, 46:162–179, 2018. doi: 10.1016/j.media.2018.03.002.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.
  • Halabi et al. (2019) Safwan S. Halabi, Luciano M. Prevedello, Jayashree Kalpathy-Cramer, Artem B. Mamonov, Alexander Bilbily, Mark Cicero, Ian Pan, Lucas Araújo Pereira, Rafael Teixeira Sousa, Nitamar Abdala, Felipe Campos Kitamura, Hans H. Thodberg, Leon Chen, George Shih, Katherine Andriole, Marc D. Kohli, Bradley J. Erickson, and Adam E. Flanders. The RSNA pediatric bone age machine learning challenge. Radiol, 290(2):498–503, 2019. doi: 10.1148/radiol.2018180736.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • Held and Sabanés Bové (2014) Leonhard Held and Daniel Sabanés Bové. Applied Statistical Inference. Springer, 1 edition, 2014. ISBN 978-3-642-37887-4. doi: 10.1007/978-3-642-37887-4.
  • Heskes (1997) Tom Heskes. Practical confidence and prediction intervals. In NeurIPS, pages 176–182, 1997.
  • Hora (1996) Stephen C Hora. Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management. Reliability Engineering & System Safety, 54(2-3):217–223, 1996.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
  • Kendall and Gal (2017) Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS, pages 5574–5584, 2017.
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
  • Kuleshov et al. (2018) Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In ICML, volume 80, pages 2796–2804, 2018.
  • Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, pages 6402–6413, 2017.
  • Laves et al. (2017) Max-Heinrich Laves, Andreas Schoob, Lüder A. Kahrs, Tom Pfeiffer, Robert Huber, and Tobias Ortmaier. Feature tracking for automated volume of interest stabilization on 4D-OCT images. In SPIE Medical Imaging, volume 10135, pages 256–262, 2017. doi: 10.1117/12.2255090.
  • Laves et al. (2019) Max-Heinrich Laves, Sontje Ihler, Karl-Philipp Kortmann, and Tobias Ortmaier. Well-calibrated model uncertainty with temperature scaling for dropout variational inference. In Bayesian Deep Learning Workshop (NeurIPS), 2019. arXiv:1909.13550.
  • Laves et al. (2020) Max-Heinrich Laves, Sontje Ihler, Jacob F Fast, Lüder A Kahrs, and Tobias Ortmaier. Well-calibrated regression uncertainty in medical imaging with deep learning. In Medical Imaging with Deep Learning, 2020.
  • Levi et al. (2019) Dan Levi, Liran Gispan, Niv Giladi, and Ethan Fetaya. Evaluating and calibrating uncertainty prediction in regression tasks. arXiv preprint arXiv:1905.11659, 2019.
  • Luo et al. (2019) Jie Luo, Alireza Sedghi, Karteek Popuri, Dana Cobzas, Miaomiao Zhang, Frank Preiswerk, Matthew Toews, Alexandra Golby, Masashi Sugiyama, William M Wells, et al. On the applicability of registration uncertainty. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 410–419, 2019.
  • Martel et al. (2019) A. L. Martel, S. Nofech-Mozes, S. Salama, S. Akbar, and M. Peikari. Assessment of residual breast cancer cellularity after neoadjuvant chemotherapy using digital pathology [data set]. The Cancer Imaging Archive, 2019. doi: 10.7937/TCIA.2019.4YIBTJNO.
  • Messay et al. (2015) Temesguen Messay, Russell C. Hardie, and Timothy R. Tuinstra. Segmentation of pulmonary nodules in computed tomography using a regression neural network approach and its application to the lung image database consortium and image database resource initiative dataset. Med Image Anal, 22(1):48–62, 2015. doi: 10.1016/j.media.2015.02.002.
  • Nix and Weigend (1994) David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of IEEE International Conference on Neural Networks, volume 1, pages 55–60, 1994.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035, 2019.
  • Payer et al. (2019) Christian Payer, Darko Štern, Horst Bischof, and Martin Urschler. Integrating spatial configuration into heatmap regression based CNNs for landmark localization. Med Image Anal, 54:207–219, 2019. doi: 10.1016/j.media.2019.03.007.
  • Phan et al. (2018) Buu Phan, Rick Salay, Krzysztof Czarnecki, Vahdat Abdelzad, Taylor Denouden, and Sachin Vernekar. Calibrating uncertainties in object localization task. In Bayesian Deep Learning Workshop (NeurIPS), 2018. arXiv:1811.11210.
  • Pizer et al. (1987) Stephen M Pizer, E Philip Amburn, John D Austin, Robert Cromartie, Ari Geselowitz, Trey Greer, Bart ter Haar Romeny, John B Zimmerman, and Karel Zuiderveld. Adaptive histogram equalization and its variations. Computer vision, graphics, and image processing, 39(3):355–368, 1987.
  • Schlemper et al. (2018) Jo Schlemper, Guang Yang, Pedro Ferreira, Andrew Scott, Laura-Ann McGill, Zohya Khalique, Margarita Gorodezky, Malte Roehl, Jennifer Keegan, Dudley Pennell, et al. Stochastic deep compressive sensing for the reconstruction of diffusion tensor cardiac MRI. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 295–303, 2018.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
  • Štern et al. (2016) Darko Štern, Christian Payer, Vincent Lepetit, and Martin Urschler. Automated age estimation from hand MRI volumes using deep learning. In MICCAI, pages 194–202, 2016.
  • Tan et al. (2017) Li Kuo Tan, Yih Miin Liew, Einly Lim, and Robert A. McLaughlin. Convolutional neural network regression for short-axis left ventricle segmentation in cardiac cine MR sequences. Med Image Anal, 39:78–86, 2017. doi: 10.1016/j.media.2017.04.002.
  • Tan and Le (2019) Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
  • Tanno et al. (2016) Ryutaro Tanno, Aurobrata Ghosh, Francesco Grussu, Enrico Kaden, Antonio Criminisi, and Daniel C Alexander. Bayesian image quality transfer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 265–273, 2016.
  • Tanno et al. (2017) Ryutaro Tanno, Daniel E Worrall, Aurobrata Ghosh, Enrico Kaden, Stamatios N Sotiropoulos, Antonio Criminisi, and Daniel C Alexander. Bayesian image quality transfer with CNNs: exploring uncertainty in dMRI super-resolution. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 611–619, 2017.
  • Xie et al. (2018) Yuanpu Xie, Fuyong Xing, Xiaoshuang Shi, Xiangfei Kong, Hai Su, and Lin Yang. Efficient and robust cell detection: A structured regression approach. Med Image Anal, 44:245–254, 2018. doi: 10.1016/j.media.2017.07.003.
  • Yin et al. (2020) Shi Yin, Qinmu Peng, Hongming Li, Zhengqiang Zhang, Xinge You, Katherine Fischer, Susan L. Furth, Gregory E. Tasian, and Yong Fan. Automatic kidney segmentation in ultrasound images using subsequent boundary distance regression and pixelwise classification networks. Med Image Anal, 60:101602, 2020. doi: 10.1016/j.media.2019.101602.

A Laplacian Model

Using $\mathsf{Laplace}(\hat{\bm{y}}(\bm{x}), \hat{\sigma}(\bm{x}))$ as the model, the conditional log-likelihood is given by

$$\log p(\bm{Y} \,|\, \bm{X}, \bm{\theta}) = \sum_{i=1}^{m} \log\left( \frac{1}{2\hat{\sigma}^{(i)}_{\bm{\theta}}} \exp\left\{ -\frac{\big|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big|}{\hat{\sigma}^{(i)}_{\bm{\theta}}} \right\} \right), \tag{17}$$

which results in the following minimization criterion:

$$\mathcal{L}_{\mathrm{L}}(\bm{\theta}) = \sum_{i=1}^{m} \frac{1}{\hat{\sigma}^{(i)}_{\bm{\theta}}} \big|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big| + \log\big(\hat{\sigma}^{(i)}_{\bm{\theta}}\big). \tag{18}$$

Using $\mathcal{L}_{\mathrm{L}}(\bm{\theta})$ instead of $\mathcal{L}_{\mathrm{G}}(\bm{\theta})$ amounts to applying an L1 metric to the predictive mean. In some cases, this led to better results; however, we have not conducted extensive experiments with it and leave this to future work.
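As a sketch, Eq. (18) translates directly into a PyTorch criterion; predicting the scale in log space is our assumption here, made for numerical stability, and is not required by the derivation.

```python
import torch

def laplace_nll_loss(y_true, y_pred, log_sigma):
    """Minimization criterion of Eq. (18), with the Laplacian scale
    predicted in log space so that sigma stays strictly positive."""
    return (torch.abs(y_true - y_pred) / torch.exp(log_sigma) + log_sigma).sum()
```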

B Derivation of σ𝜎\sigmaitalic_σ Scaling

See § 2.3. Using a Gaussian model, we scale the standard deviation $\sigma$ with a scalar value $s$ to calibrate the probability density function

$$p\big(\bm{y} \,|\, \bm{x}; \hat{\bm{y}}(\bm{x}), \hat{\sigma}^{2}(\bm{x})\big) = \mathcal{N}\big(\bm{y};\, \hat{\bm{y}}(\bm{x}),\, (s \cdot \hat{\sigma}(\bm{x}))^{2}\big). \tag{19}$$

The conditional log-likelihood is given by

$$\begin{aligned}
\log p(\bm{Y} \,|\, \bm{X}, \bm{\theta}) &= \sum_{i=1}^{m} \log\left( \frac{1}{\sqrt{2\pi}\, s \hat{\sigma}^{(i)}_{\bm{\theta}}} \exp\left( -\frac{\big\|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big\|^{2}}{2 \big(s \hat{\sigma}^{(i)}_{\bm{\theta}}\big)^{2}} \right) \right) && (20) \\
&= -\frac{m}{2}\log(2\pi) - \sum_{i=1}^{m} \left[ \log\big(s \hat{\sigma}^{(i)}_{\bm{\theta}}\big) + \frac{1}{2} \big(s \hat{\sigma}^{(i)}_{\bm{\theta}}\big)^{-2} \big\|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big\|^{2} \right]. && (21)
\end{aligned}$$

This results in the following optimization objective (ignoring constants):

$$\mathcal{L}_{\mathrm{G}}(s) = m \log(s) + \frac{1}{2} s^{-2} \sum_{i=1}^{m} \big(\hat{\sigma}^{(i)}_{\bm{\theta}}\big)^{-2} \big\|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big\|^{2}. \tag{22}$$

Using a Laplacian model, the optimization criterion follows as

$$\mathcal{L}_{\mathrm{L}}(s) = m \log(s) + s^{-1} \sum_{i=1}^{m} \frac{1}{\hat{\sigma}^{(i)}_{\bm{\theta}}} \big|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big|. \tag{23}$$

Eqs. (22) and (23) are optimized w.r.t. $s$ with fixed $\bm{\theta}$ using gradient descent in a separate calibration phase after training. The solution to Eq. (22) can also be written in closed form as

$$s_{\mathrm{G}} = \pm\sqrt{\frac{1}{m} \sum_{i=1}^{m} \big(\hat{\sigma}^{(i)}_{\bm{\theta}}\big)^{-2} \big\|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big\|^{2}} \tag{24}$$

and the solution to Eq. (23) follows as

$$s_{\mathrm{L}} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{\hat{\sigma}^{(i)}_{\bm{\theta}}} \big|\bm{y}^{(i)} - \hat{\bm{y}}^{(i)}_{\bm{\theta}}\big|, \tag{25}$$

respectively. We apply $\sigma$ scaling to jointly calibrate aleatoric and epistemic uncertainty as described in § 2.4.
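Because Eqs. (24) and (25) are closed-form, the calibration phase does not strictly require gradient descent; a minimal sketch with hypothetical tensor names:

```python
import torch

def sigma_scaling_closed_form(y_true, y_pred, sigma_pred):
    """Closed-form scaling factors: positive root of Eq. (24)
    for a Gaussian model and Eq. (25) for a Laplacian model."""
    s_gauss = torch.sqrt(((y_true - y_pred) ** 2 / sigma_pred ** 2).mean())
    s_laplace = (torch.abs(y_true - y_pred) / sigma_pred).mean()
    return s_gauss, s_laplace
```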

C Unbiased Estimator of the Approximate Predictive Variance

We show that the expectation of the predictive sample variance from MC dropout, as given in (Kendall and Gal, 2017), equals the true variance of the approximate posterior predictive distribution.

Proposition 1

Given $N$ MC dropout samples $\bm{f}_{\bm{\theta}_n} = \big[\hat{\bm{y}}_n, \hat{\sigma}^{2}_{n}\big]$ from our approximate predictive distribution $p(\bm{y}^{\ast} \,|\, \bm{x}^{\ast}, \mathcal{D}) = \mathcal{N}(\bm{y}^{\ast}; \bm{y}, \Sigma^{2})$, the predictive sample variance

$$\hat{\Sigma}^{2} = \frac{1}{N} \sum_{n=1}^{N} \left( \hat{\bm{y}}_n - \frac{1}{N} \sum_{n=1}^{N} \hat{\bm{y}}_n \right)^{\!2} + \frac{1}{N} \sum_{n=1}^{N} \hat{\sigma}^{2}_{n} \tag{26}$$

is an unbiased estimator of the approximate predictive variance.

Proof 

$$\begin{aligned}
\mathbb{E}\big[\hat{\Sigma}^{2}\big] &= \mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} \left( \hat{\bm{y}}_n - \frac{1}{N} \sum_{n=1}^{N} \hat{\bm{y}}_n \right)^{\!2} + \frac{1}{N} \sum_{n=1}^{N} \hat{\sigma}^{2}_{n} \right] && (27) \\
&= \mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} \left( \hat{\bm{y}}_n - \frac{1}{N} \sum_{n=1}^{N} \hat{\bm{y}}_n \right)^{\!2} \right] + \mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} \hat{\sigma}^{2}_{n} \right], && (28) \\
&\text{and with } \frac{1}{N} \sum_{n=1}^{N} \hat{\bm{y}}_n = \bar{\bm{y}} \text{ it follows} && (29) \\
&= \mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} \big( \hat{\bm{y}}_n - \bar{\bm{y}} \big)^{2} \right] + \hat{\sigma}^{2} && (30) \\
&= \mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} \big( \hat{\bm{y}}_n - \bar{\bm{y}} \big)^{2} + \bar{\bm{y}}^{2} - \bar{\bm{y}}^{2} + \bm{y}^{2} - \bm{y}^{2} + 2\bar{\bm{y}}\bm{y} - 2\bar{\bm{y}}\bm{y} \right] + \hat{\sigma}^{2} && (31) \\
&= \mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} \big( \hat{\bm{y}}_n - \bm{y} \big)^{2} - \big( \bar{\bm{y}} - \bm{y} \big)^{2} \right] + \hat{\sigma}^{2} && (32) \\
&= \mathbb{E}\big[ \big( \hat{\bm{y}} - \bm{y} \big)^{2} \big] - \mathbb{E}\big[ \big( \bar{\bm{y}} - \bm{y} \big)^{2} \big] + \hat{\sigma}^{2} = \Sigma^{2} - \hat{\sigma}^{2} + \hat{\sigma}^{2} && (33) \\
\mathbb{E}\big[\hat{\Sigma}^{2}\big] &= \Sigma^{2} && (34)
\end{aligned}$$

Note that the predicted heteroscedastic aleatoric uncertainty $\hat{\sigma}^{2}$ equals the bias $\mathbb{E}\big[(\bar{\bm{y}} - \bm{y})^{2}\big]$ in Eq. (33) when the aleatoric uncertainty is perfectly calibrated, i.e., $\mathbb{E}\big[(\bar{\bm{y}} - \bm{y})^{2}\big] = \hat{\sigma}^{2}$.
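In code, the estimator of Eq. (26) is the biased ($1/N$) sample variance of the MC means plus the average predicted aleatoric variance. A sketch, assuming the $N$ stochastic forward passes are stacked along the first dimension:

```python
import torch

def predictive_variance(y_samples, var_samples):
    """Estimator of Eq. (26) from tensors of shape (N, ...)."""
    epistemic = y_samples.var(dim=0, unbiased=False)  # 1/N sum (y_n - y_bar)^2
    aleatoric = var_samples.mean(dim=0)               # 1/N sum sigma_n^2
    return epistemic + aleatoric
```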

D Training Procedure

We use the model implementations from PyTorch 1.3 (Paszke et al., 2019) and train them with the following settings:

  • training for 500 epochs with a batch size of 16

  • Adam optimizer with initial learning rate of $3 \cdot 10^{-4}$ and weight decay with $\lambda = 10^{-7}$

  • reduce-on-plateau learning rate scheduler (patience of 20 epochs) with factor of 0.1

  • for MC dropout, $N = 25$ forward passes are performed; dropout with $p = 0.5$ is used for ResNet (as described in (Gal and Ghahramani, 2016)), while DenseNet ($p = 0.2$) and EfficientNet ($p = 0.4$) use the standard dropout rate $p$ of the architecture (a sketch of the sampling loop follows this list)

  • Additional validation and test sets are used if provided by the data sets; otherwise, a train/validation/test split of approx. 50% / 25% / 25% is used

  • Source code for all experiments is available at github.com/mlaves/well-calibrated-regression-uncertainty
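The MC dropout sampling loop referenced above can be sketched as follows; re-enabling only the dropout modules keeps batch normalization in evaluation mode. The assumption that the model returns a (mean, variance) pair reflects the two-headed regression networks used here, but the exact module layout is illustrative.

```python
import torch

def mc_dropout_samples(model, x, n_samples=25):
    """Draw N stochastic forward passes with dropout active at test time."""
    model.eval()
    for module in model.modules():
        if module.__class__.__name__.startswith("Dropout"):
            module.train()  # re-enable dropout only; batch norm stays in eval
    with torch.no_grad():
        outputs = [model(x) for _ in range(n_samples)]
    y = torch.stack([out[0] for out in outputs])    # (N, ...) predictive means
    var = torch.stack([out[1] for out in outputs])  # (N, ...) aleatoric variances
    return y, var
```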

E 3D OCT Needle Pose Data Set


Figure 9: Example image from the OCT data set showing $\operatorname{arg\,max}$ projections of a surgical needle tip acquired by optical coherence tomography.

Our data set was created by attaching a surgical needle to a high-precision six-axis hexapod robot (H-826, Physik Instrumente GmbH & Co. KG, Germany) and observing the needle tip with 3D optical coherence tomography (OCS1300SS, Thorlabs Inc., USA). The data set consists of 5,000 OCT acquisitions with $(64 \times 64 \times 512)$ voxels, covering a volume of approx. $(3 \times 3 \times 3)\,\mathrm{mm}^{3}$. Each acquisition is taken at a different robot configuration and labeled with the corresponding 6DoF pose $\bm{y} \in \mathbb{R}^{6}$. To process the volumetric data with CNNs for planar images, we calculate 3 planar projections along the spatial dimensions using the $\operatorname{arg\,max}$ operator, scale them to equal size, and stack them as a three-channel image (see Fig. 9). A similar approach was presented in (Laves et al., 2017) and (Gessert et al., 2018). The data are characterized by a high amount of speckle noise, which is a typical phenomenon in optical coherence tomography. The data set is publicly available at github.com/mlaves/3doct-pose-dataset.
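A sketch of the projection step is given below; normalizing the depth indices to $[0, 1]$ and using skimage.transform.resize for the rescaling are our assumptions, chosen only to make the sketch self-contained.

```python
import numpy as np
from skimage.transform import resize

def argmax_projections(volume, size=(64, 64)):
    """Stack the three planar argmax projections of a 3D OCT volume
    into a single three-channel image."""
    channels = []
    for axis in range(3):
        proj = np.argmax(volume, axis=axis).astype(np.float32)
        proj /= volume.shape[axis] - 1       # map depth indices to [0, 1]
        channels.append(resize(proj, size))  # rescale to a common size
    return np.stack(channels, axis=-1)       # (H, W, 3)
```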

F Ablation Study on Auxiliary Model Scaling

Figure 10: Calibration diagrams for aux scaling with different numbers of hidden units $h$ on BreastPathQ/DenseNet-201.

Here, we investigate the overfitting behavior of aux scaling by reducing the number of hidden units $h$ of the two-layer auxiliary model with ReLU activations. Aux scaling is more powerful than $\sigma$ scaling, which can lead to overfitting of the calibration set. Fig. 10 shows calibration diagrams for the auxiliary model ablations. Reducing $h$ leads to a minor calibration improvement, but at $h = 2$, the model outputs a constant uncertainty that is close to the overall mean of the observed uncertainty. A single-layer, single-unit model without bias would be equivalent to $\sigma$ scaling.
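For reference, a two-layer auxiliary model with $h$ hidden units can be sketched as follows; taking the uncalibrated standard deviation as the single input and output feature is our assumption about the parameterization, and a positivity constraint on the output is omitted for brevity.

```python
import torch.nn as nn

def aux_model(h=16):
    """Two-layer auxiliary recalibration network with h hidden units,
    mapping the uncalibrated sigma to a recalibrated sigma."""
    return nn.Sequential(nn.Linear(1, h), nn.ReLU(), nn.Linear(h, 1))
```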

G Additional Results and Calibration Diagrams

All test set runs have been repeated 5 times. Solid lines denote the mean and shaded areas the standard deviation calculated from the repeated runs.

Table 2: Negative log-likelihood (NLL) test set results for different data sets and model architectures (averaged over 5 runs). High NLL values indicate miscalibration; lower values are better. We also report NLL values for an ensemble of DenseNets.

                                            Levi et al.              ours
Data Set      Model             MSE      none    aux      σ      none    aux      σ    ensemble
BreastPathQ   ResNet-101        6.4e-3  -0.78  -5.06  -5.06    -2.89  -5.17  -5.16
              DenseNet-201      7.0e-3  -5.16  -5.84  -5.70    -5.67  -6.03  -5.78      0.11
              EfficientNet-B4   6.4e-3  -3.11  -5.99  -5.53    -4.73  -6.16  -5.62
BoneAge       ResNet-101        5.3e-3  -3.90  -4.34  -4.34    -3.99  -4.34  -4.34
              DenseNet-201      3.5e-3   1.74  -4.70  -4.69    -0.75  -4.70  -4.69      0.07
              EfficientNet-B4   3.5e-3  13.61  -4.74  -4.67     6.40  -4.75  -4.64
EndoVis       ResNet-101        4.0e-4  -0.53  -6.32  -6.33    -3.85  -6.76  -6.72
              DenseNet-201      1.1e-3  -0.72  -6.10  -5.99    -4.94  -6.05  -6.04      0.04
              EfficientNet-B4   8.9e-4  -5.10  -6.06  -6.07    -5.94  -6.17  -6.17
OCT           ResNet-101        2.0e-3  -1.08  -5.24  -5.24    -3.38  -5.24  -5.24
              DenseNet-201      1.3e-3  -5.05  -5.61  -5.61    -5.51  -5.62  -5.61      0.10
              EfficientNet-B4   1.4e-3  -1.72  -5.58  -5.57    -4.25  -5.58  -5.57

G.1 Additional Prediction Intervals

Figure 11: Observed vs. estimated posterior prediction intervals for all networks/data sets.