Uncertainty quantification in non-rigid image registration via stochastic gradient Markov chain Monte Carlo

Daniel Grzech1, Mohammad Farid Azampour1,2,3, Huaqi Qiu1, Ben Glocker1, Bernhard Kainz1,4, Loïc Le Folgoc1
1: Imperial College London, UK, 2: Technische Universität München, Germany, 3: Sharif University of Technology, Tehran, Iran, 4: Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
UNSURE2020 special issue
Publication date: 2021/10/27

Abstract

We develop a new Bayesian model for non-rigid registration of three-dimensional medical images, with a focus on uncertainty quantification. Probabilistic registration of large images with calibrated uncertainty estimates is difficult for both computational and modelling reasons. To address the computational issues, we explore connections between the Markov chain Monte Carlo by backpropagation and the variational inference by backpropagation frameworks, in order to efficiently draw samples from the posterior distribution of transformation parameters. To address the modelling issues, we formulate a Bayesian model for image registration that overcomes the existing barriers when using a dense, high-dimensional, and diffeomorphic transformation parametrisation. This results in improved calibration of uncertainty estimates. We compare the model in terms of both image registration accuracy and uncertainty quantification to VoxelMorph, a state-of-the-art image registration model based on deep learning.

Keywords

deformable image registration · uncertainty quantification · SG-MCMC · SGLD


1 Introduction

Image registration is the problem of aligning images into a common coordinate system such that the discrete pixel locations have the same semantic information. It is a common pre-processing step for many applications, e.g. the statistical analysis of imaging data and computer-aided diagnosis through comparison with an atlas. Image registration methods based on deep learning tend to incorporate task-specific knowledge from large datasets, whereas traditional methods are more general purpose. Many established models are based on the iterative optimisation of an energy function consisting of task-specific similarity and regularisation terms, which has to be done independently for every pair of images in order to calculate the deformation field (Schnabel et al., 2001; Klein et al., 2009; Avants et al., 2014).

DLIR (de Vos et al., 2019) and VoxelMorph (Balakrishnan et al., 2018, 2019; Dalca et al., 2018, 2019) changed this paradigm by learning a function that maps a pair of input images to a deformation field. This gives a speed-up of several orders of magnitude at inference time and maintains an accuracy comparable to traditional methods. An overview of state-of-the-art models for image registration based on deep learning can be found in Lee et al. (2019).

Due to the perceived conceptual difficulty and computational overhead, Bayesian methods tend to be shunned when designing medical image analysis algorithms. However, in order to fully explore the parameter space and lessen the impact of ad-hoc hyperparameter choices, it is desirable to use Bayesian models. In addition, with the help of open-source libraries with automatic differentiation, such as PyTorch, the implementation of even complex Bayesian models for image registration is very similar to that of non-probabilistic models.

In this paper, we make use of the stochastic gradient Markov chain Monte Carlo (SG-MCMC) algorithm to design an efficient posterior sampling algorithm for 3D non-rigid image registration. SG-MCMC is based on the idea of stochastic gradient descent interpreted as a stochastic process with a stationary distribution centred on the optimum and whose covariance can be used to approximate the posterior distribution (Chen et al., 2016; Mandt et al., 2017). SG-MCMC methods have been useful for training generative models on the very large datasets ubiquitous in computer vision, e.g. Du and Mordatch (2019); Nijkamp et al. (2019); Zhang et al. (2020). We show that they are also applicable to image registration.

This work is an extended version of Grzech et al. (2020), where we first proposed the use of the SG-MCMC algorithm for non-rigid image registration. The code to reproduce the results is available in a public repository: https://github.com/dgrzech/ir-sgmcmc. The following is a summary of the main contributions of the previous work:

  1. We proposed a computationally efficient SG-MCMC algorithm for three-dimensional diffeomorphic non-rigid image registration;

  2. We introduced a new regularisation loss, which allows inference of the regularisation strength when using a transformation parametrisation with a large number of degrees of freedom;

  3. We evaluated the model both qualitatively and quantitatively by analysing the output transformations, image registration accuracy, and uncertainty estimates on inter-subject brain magnetic resonance imaging (MRI) data from the UK Biobank dataset.

In this version, we extend the previous work:

  • We provide more details on the Bayesian formulation, including a comprehensive analysis of the learnable regularisation loss, as well as a more in-depth analysis of the model hyperparameters and hyperpriors;

  • We conduct additional experiments in order to compare the uncertainty estimates output by variational inference (VI), SG-MCMC, and VoxelMorph qualitatively, as well as quantitatively by analysing the Pearson correlation coefficient between the displacement and label uncertainties;

  • We analyse the differences between uncertainty estimates when the SG-MCMC algorithm is initialised to different transformations and when using different parametrisations of the transformation, including non-parametric stationary velocity fields (SVFs) and SVFs based on B-splines; and

  • We include a detailed evaluation of the computational speed and of the output transformation smoothness.

2 Related work

The problem of uncertainty quantification in non-rigid image registration is controversial because of ambiguity regarding the definition of uncertainty as well as the accuracy of uncertainty estimates (Luo et al., 2019). Uncertainty quantification in probabilistic image registration relies either on variational Bayesian methods (Simpson et al., 2012, 2013; Wassermann et al., 2014), which are fast and approximate and popular within models based on deep learning (Dalca et al., 2018; Liu et al., 2019; Schultz et al., 2019), or on Markov chain Monte Carlo (MCMC) methods, which are slower but enable asymptotically exact sampling from the posterior distribution of the transformation parameters. The latter include e.g. Metropolis-Hastings used for intra-subject registration of brain MRI scans (Risholm et al., 2010, 2013) and for estimating delivery dose in radiotherapy (Risholm et al., 2011), reversible-jump MCMC used for cardiac MRI (Le Folgoc et al., 2017), and Hamiltonian Monte Carlo used for atlas building (Zhang et al., 2013).

Uncertainty quantification for image registration has also been done via kernel regression (Zöllei et al., 2007; Janoos et al., 2012) and deep learning (Dalca et al., 2018; Krebs et al., 2019; Heinrich, 2019; Sedghi et al., 2019). More generally, Bayesian frameworks have been used e.g. to characterize image intensities (Hachama et al., 2012) and anatomic variability (Zhang and Fletcher, 2014).

One of the main obstacles to a more widespread use of MCMC methods for uncertainty quantification is the computational cost. This was recently tackled by embedding MCMC in a multilevel framework (Schultz et al., 2018). SG-MCMC was previously used for rigid image registration (Karabulut et al., 2017). It has also been employed in the context of unsupervised non-rigid image registration based on deep learning, where it allowed sampling from the posterior distribution of the network weights, rather than directly from that of the transformation parameters (Khawaled and Freiman, 2020).

Previous work on data-driven regularisation focuses on transformation parametrisations with a relatively low number of degrees of freedom, e.g. B-splines (Simpson et al., 2012) and a sparse parametrisation based on Gaussian radial basis functions (RBFs) (Le Folgoc et al., 2017). Limited work also exists on spatially-varying regularisation, again with B-splines (Simpson et al., 2015). Deep learning has been used for spatially-varying regularisation learnt from more than one image pair (Niethammer et al., 2019). Shen et al. (2019) introduced a related model which could be used to learn the regularisation strength based on a single image pair, but it suffered from non-diffeomorphic output transformations and slow speed.

3 Registration model

We denote an image pair by $\mathcal{D}=(F,M)$, where $F\colon\Omega_{F}\to\mathbb{R}$ and $M\colon\Omega_{M}\to\mathbb{R}$ are a fixed and a moving image respectively. The goal of image registration is to align the underlying domains $\Omega_{F}$ and $\Omega_{M}$ with a transformation $\varphi\left(w\right)\colon\Omega_{F}\to\Omega_{M}$, i.e. to calculate parameters $w$ such that $F\simeq M(w)\coloneqq M\circ\varphi^{-1}(w)$. The transformation is often expected to possess desirable properties, e.g. diffeomorphic transformations are smooth and invertible, with a smooth inverse.

We parametrise the transformation using the SVF formulation (Arsigny et al., 2006; Ashburner, 2007), which we briefly review below. The ordinary differential equation (ODE) that defines the evolution of the transformation is given by:

$$\frac{\partial\varphi^{(t)}}{\partial t} = w\left(\varphi^{(t)}\right) \qquad (1)$$

where $\varphi^{(0)}$ is the identity transformation and $t\in[0,1]$. If the velocity field $w$ is spatially smooth, then the solution to Equation 1 is a diffeomorphic transformation. Numerical integration is done by scaling and squaring, which uses the following recurrence relation with $2^{T}$ steps (Arsigny et al., 2006):

$$\varphi^{(1/2^{t-1})} = \varphi^{(1/2^{t})} \circ \varphi^{(1/2^{t})} \qquad (2)$$

The Bayesian image registration framework that we present is not limited to SVFs. Moreover, there is a very limited amount of research on the impact of the transformation parametrisation on uncertainty quantification. Previous work on uncertainty quantification in image registration characterised uncertainty using a single transformation parametrisation, e.g. a small deformation model using B-splines in Simpson et al. (2012), the finite element (FE) method in Risholm et al. (2013), multi-scale Gaussian RBFs in Le Folgoc et al. (2017), or a large deformation diffeomorphic metric mapping (LDDMM) in Wassermann et al. (2014).

To help understand the potential impact of the transformation parametrisation on uncertainty quantification, we also implement SVFs based on cubic B-splines (Modat et al., 2012). In this case, the SVF consists of a grid of B-spline control points with regular spacing $\delta \geq 1$ voxels. The dense SVF at each point is a weighted combination of cubic B-spline basis functions (Rueckert et al., 1999). To calculate the transformation based on the dense velocity field, we again use the scaling and squaring algorithm in Equation 2.
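As an illustration of the scaling-and-squaring integration in Equation 2, the sketch below composes a displacement field with itself $T$ times in PyTorch. It assumes a dense velocity field expressed in the normalised $[-1,1]$ coordinates used by grid_sample, with displacement channels ordered (x, y, z); it is a minimal sketch rather than the implementation in the accompanying repository.

```python
# A minimal sketch of SVF integration by scaling and squaring (Equation 2), not the exact
# implementation from the repository. Assumptions: the velocity field "v" has shape
# (1, 3, D, H, W), is expressed in the normalised [-1, 1] coordinates used by grid_sample,
# and its channels are ordered (x, y, z).
import torch
import torch.nn.functional as F

def identity_grid(shape, device):
    # Normalised identity sampling grid of shape (1, D, H, W, 3) in (x, y, z) order.
    vectors = [torch.linspace(-1.0, 1.0, s, device=device) for s in shape]
    grids = torch.meshgrid(*vectors, indexing="ij")
    return torch.stack(grids[::-1], dim=-1).unsqueeze(0)

def scaling_and_squaring(v, T=12):
    """Integrate a stationary velocity field into a displacement field with 2**T steps."""
    grid = identity_grid(v.shape[2:], v.device)
    disp = v / (2 ** T)  # scaling step: start from a small displacement
    for _ in range(T):
        # squaring step: phi <- phi o phi, i.e. u(x) <- u(x) + u(x + u(x))
        warped = F.grid_sample(disp, grid + disp.permute(0, 2, 3, 4, 1),
                               align_corners=True, padding_mode="border")
        disp = disp + warped
    return disp
```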

3.1 Likelihood model

The likelihood $p\left(\mathcal{D}\mid w\right)$ specifies the relationship between the data and the transformation parameters by means of a similarity metric. In probabilistic image registration, it usually takes the form of a Boltzmann distribution (Ashburner, 2007):

$$\log p\left(\mathcal{D}\mid w;\mathcal{H}\right) \propto -\mathcal{E}_{\text{data}}\left(\mathcal{D},w;\mathcal{H}\right) \qquad (3)$$

where $\mathcal{E}_{\text{data}}$ is the similarity metric and $\mathcal{H}$ an optional set of hyperparameters.

Local cross-correlation (LCC), which is invariant to linear intensity scaling, is a popular similarity metric but not meaningful in a probabilistic context. For this reason, instead of the sum of the voxel-wise product of intensities, like in standard LCC, we opt for the sum of voxel-wise squared differences of images standardised to zero mean and unit variance inside a local neighbourhood of five voxels. This way, we can benefit from robustness under linear intensity transformations, as well as the desirable properties of a Gaussian mixture model (GMM) of intensity residuals, i.e. robustness to outlier values caused by acquisition artefacts and misalignment over the course of registration (Le Folgoc et al., 2017).

Let $\overline{F}$ and $\overline{M(w)}$ be respectively the fixed and the warped moving image with intensities standardised to zero mean and unit variance inside a neighbourhood of five voxels. For each voxel, the intensity residual $r_{i}=\overline{F}-\overline{M(w)}$, $i\in\{1,\dots,N^{3}\}$, is assigned to the $l$-th component of the mixture, $1\leq l\leq L$, if the categorical variable $c_{i}\in\{1,\dots,L\}$ is equal to $l$, in which case it follows a normal distribution $\mathcal{N}\left(0,\beta_{l}^{-1}\right)$ (to reduce notation clutter, we omit the voxel index on the fixed and moving images). The component assignment $c_{i}$ follows a categorical distribution and takes value $l$ with probability $\varrho_{l}$. We use the same GMM of intensity residuals on a global basis rather than per neighbourhood. In all experiments it has $L=4$ components, which we determine to be sufficient for a good model fit.

We also use a scalar virtual decimation factor $\alpha$ to account for the fact that voxel-wise residuals are not independent. This prevents over-emphasis on the data term and allows us to better calibrate the uncertainty estimates (Groves et al., 2011; Simpson et al., 2012). The full expression of the image similarity term is given by:

$$\mathcal{E}_{\text{data}}\left(\mathcal{D},w;\beta,\varrho\right) = -\alpha \times \sum_{i=1}^{N^{3}} \log \sum_{l=1}^{L} \varrho_{l} \sqrt{\frac{\beta_{l}}{2\pi}} \exp\left(-\frac{\beta_{l}}{2} r_{i}^{2}\right) \qquad (4)$$
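The data term in Equation 4 can be sketched directly in PyTorch. The snippet below assumes the five-voxel neighbourhood is a 5×5×5 box implemented with average pooling, and that the mixture parameters are stored as unconstrained tensors log_beta and log_rho (illustrative names); it is not the exact implementation used in the experiments.

```python
# A minimal sketch of the GMM-based data term in Equation 4, assuming the five-voxel
# neighbourhood is a 5x5x5 box implemented with average pooling, and that the mixture
# parameters are stored as unconstrained tensors log_beta and log_rho (illustrative names).
import math
import torch
import torch.nn.functional as F

def local_standardise(im, kernel_size=5, eps=1e-6):
    """Standardise an image (1, 1, D, H, W) to zero mean, unit variance per local neighbourhood."""
    pad = kernel_size // 2
    mean = F.avg_pool3d(im, kernel_size, stride=1, padding=pad, count_include_pad=False)
    var = F.avg_pool3d(im ** 2, kernel_size, stride=1, padding=pad, count_include_pad=False) - mean ** 2
    return (im - mean) / var.clamp_min(eps).sqrt()

def data_term(fixed, warped_moving, log_beta, log_rho, alpha):
    """E_data of Equation 4: negative GMM log-likelihood of the residuals, scaled by alpha."""
    r2 = (local_standardise(fixed) - local_standardise(warped_moving)).flatten() ** 2  # (V,)
    beta = log_beta.exp()                  # component precisions, shape (L,)
    rho = torch.softmax(log_rho, dim=0)    # mixture proportions, shape (L,)
    log_comp = (rho.log() + 0.5 * (beta.log() - math.log(2.0 * math.pi)))[None, :] \
               - 0.5 * beta[None, :] * r2[:, None]
    return -alpha * torch.logsumexp(log_comp, dim=1).sum()
```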

3.2 Transformation priors

In Bayesian models, the transformation parameters are typically regularised with use of a multivariate normal prior that ensures smoothness:

$$\log p\left(w;\lambda_{\text{reg}}\right) \propto -\frac{1}{2}\lambda_{\text{reg}}\left(\mathrm{L}w\right)^{\intercal}\mathrm{L}w \qquad (5)$$

where $\lambda_{\text{reg}}$ is a scalar parameter that controls the regularisation strength, and $\mathrm{L}$ is the matrix of a differential operator. Here we assume that $\mathrm{L}$ represents the gradient operator, which penalises the magnitude of the first derivative of the velocity field. Note that $\left(\mathrm{L}w\right)^{\intercal}\mathrm{L}w = \|\mathrm{L}w\|^{2} \coloneqq \chi^{2}$.

The regularisation weight $\lambda_{\text{reg}}$ can either be fixed or estimated from data. The latter has been done successfully only for transformation parametrisations with a relatively low number of degrees of freedom, e.g. B-splines (Simpson et al., 2012) and a sparse parametrisation (Le Folgoc et al., 2017). In case of an SVF, where the number of degrees of freedom is orders of magnitude higher, the problem is more difficult. However, a reliable method to adjust regularisation strength based on data is crucial, as both the output transformation and registration uncertainty are highly sensitive to regularisation. In order to infer the regularisation strength, we specify a log-normal prior on the scalar regularisation energy, $\chi^{2}\sim\text{Lognormal}\left(\mu_{\chi^{2}},\sigma_{\chi^{2}}^{2}\right)$, and derive a prior on the underlying SVF:

$$\log p\left(\chi^{2}\right) \propto -\log\chi^{2} - \log\sigma_{\chi^{2}} - \frac{\left(\log\chi^{2}-\mu_{\chi^{2}}\right)^{2}}{2\sigma_{\chi^{2}}^{2}} \qquad (6)$$
$$\log p\left(w\right) \propto -\left(\frac{\nu}{2}-1\right)\log\chi^{2} + \log p\left(\chi^{2}\right) \qquad (7)$$

where $\nu=3N^{3}$ is the number of degrees of freedom, i.e. the count of transformation parameters in all three directions. Given (semi-)informative hyperpriors on $\mu_{\chi^{2}}$ and $\sigma^{2}_{\chi^{2}}$, which we discuss in the next section, we can adjust the regularisation strength to the input images. The full expression of the regularisation term is given by:

$$\mathcal{E}_{\text{reg}}\left(w\right) = \frac{\nu}{2}\log\chi^{2} + \log\sigma_{\chi^{2}} + \frac{\left(\log\chi^{2}-\mu_{\chi^{2}}\right)^{2}}{2\sigma_{\chi^{2}}^{2}} \qquad (8)$$

It is worth noting that the traditional $L_{2}$ regularisation with a fixed regularisation weight in Equation 5 actually belongs to this family of regularisation losses. If we specify a gamma prior instead of a log-normal prior on the scalar regularisation energy, $\chi^{2}\sim\Gamma\left(\nu/2,\lambda_{\text{reg}}/2\right)$, we get:

$$\log p\left(\chi^{2}\right) \propto \left(\frac{\nu}{2}-1\right)\log\chi^{2} - \frac{1}{2}\lambda_{\text{reg}}\,\chi^{2} \qquad (9)$$
$$\log p\left(w\right) \propto -\left(\frac{\nu}{2}-1\right)\log\chi^{2} + \left(\frac{\nu}{2}-1\right)\log\chi^{2} - \frac{1}{2}\lambda_{\text{reg}}\,\chi^{2} \qquad (10)$$
$$\propto -\frac{1}{2}\lambda_{\text{reg}}\left(\mathrm{L}w\right)^{\intercal}\mathrm{L}w \qquad (11)$$
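A minimal sketch of the regularisation loss in Equation 8 is given below, with $\chi^{2}$ computed by forward finite differences as a stand-in for the gradient operator $\mathrm{L}$; the variable names mu_chi2 and log_sigma_chi2 are illustrative and the boundary handling is simplified.

```python
# A minimal sketch of the learnable regularisation loss in Equation 8, using forward finite
# differences as a stand-in for the gradient operator L; mu_chi2 and log_sigma_chi2 are
# illustrative names for the learnable location and (log-)scale hyperparameters.
import torch

def gradient_energy(v):
    """chi^2 = ||L w||^2 for a velocity field v of shape (1, 3, D, H, W)."""
    dz = v[:, :, 1:, :, :] - v[:, :, :-1, :, :]
    dy = v[:, :, :, 1:, :] - v[:, :, :, :-1, :]
    dx = v[:, :, :, :, 1:] - v[:, :, :, :, :-1]
    return (dz ** 2).sum() + (dy ** 2).sum() + (dx ** 2).sum()

def reg_term(v, mu_chi2, log_sigma_chi2):
    """E_reg of Equation 8 with nu = 3 N^3 degrees of freedom."""
    nu = v.numel()
    log_chi2 = gradient_energy(v).log()
    sigma_chi2 = log_sigma_chi2.exp()
    return 0.5 * nu * log_chi2 + log_sigma_chi2 + (log_chi2 - mu_chi2) ** 2 / (2.0 * sigma_chi2 ** 2)
```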

3.3 Hyperpriors

We set the likelihood GMM hyperpriors similarly to Le Folgoc et al. (2017), with the mixture precision parameters $\beta=\left(\beta_{1},\dots,\beta_{L}\right)$ assigned independent log-normal priors $\beta_{l}\sim\text{Lognormal}\left(\mu_{\beta_{l}},\sigma^{2}_{\beta_{l}}\right)$ and the mixture proportions $\varrho=\left(\varrho_{1},\dots,\varrho_{L}\right)$ assigned an uninformative Dirichlet prior $\varrho\sim\text{Dir}\left(\kappa\right)$, where $\kappa=\left(\kappa_{1},\dots,\kappa_{L}\right)$.

Regularisation parameters require informative priors due to the difficulty of learning the regularisation strength based on a single image pair. Because of the gamma prior on the regularisation energy, $\exp\left(\mu_{\chi^{2}}\right)\sim\Gamma\left(\nu/2,\lambda_{\text{init}}/2\right)$, we can rely on the familiar regularisation weight $\lambda_{\text{init}}$ to initialise the logarithm of the regularisation energy $\mu_{\chi^{2}}$ to the expected value of the logarithm of the gamma distribution, i.e. $\mathbb{E}\left[\mu_{\chi^{2}}\right]=\psi\left(\nu/2\right)-\log\left(\lambda_{\text{init}}/2\right)$, where $\psi$ is the digamma function. This distribution is sharply peaked when the number of degrees of freedom $\nu$ is large, which yields a very informative prior on $\mu_{\chi^{2}}$. More details on how to calculate the expected value of the logarithm of a gamma distribution can be found in Section A.

The choice of a hyperprior on the scale parameter $\sigma_{\chi^{2}}$, which controls the amount of deviation of $\log\chi^{2}$ from the location parameter $\mu_{\chi^{2}}$, is more intuitive. Here we use a log-normal prior $\sigma_{\chi^{2}}^{2}\sim\text{Lognormal}\left(\eta,\varsigma^{2}\right)$.
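The initialisation of $\mu_{\chi^{2}}$ described above reduces to a one-line computation with the digamma function. The helper below is a hedged sketch with illustrative names; the value $\lambda_{\text{init}}=2.0$ is taken from the later experiments and $N=128$ from Section 6.1.

```python
# A minimal sketch of initialising the location hyperparameter mu_chi2 from a familiar
# regularisation weight lambda_init via E[log chi^2] = psi(nu/2) - log(lambda_init/2);
# lambda_init = 2.0 and N = 128 are example values taken from the experiments.
import torch

def init_mu_chi2(n_voxels, lambda_init):
    nu = torch.tensor(3.0 * n_voxels)  # degrees of freedom, nu = 3 N^3
    return torch.digamma(nu / 2.0) - torch.log(torch.tensor(lambda_init / 2.0))

mu_chi2 = torch.nn.Parameter(init_mu_chi2(n_voxels=128 ** 3, lambda_init=2.0))
```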

4 Variational inference

Image registration methods often rely on VI for uncertainty quantification. We only use VI to initialise the SG-MCMC algorithm, which also lets us compare the uncertainty estimates output by approximate and asymptotically exact methods for sampling from the posterior distribution of transformation parameters.

We assume that the posterior distribution $p\left(w\mid\mathcal{D}\right)$ is a multivariate normal distribution $q\left(w\right)\sim\mathcal{N}\left(\mu_{w},\Sigma_{w}\right)$. To find the parameters $\mu_{w}$ and $\Sigma_{w}$, we maximise the evidence lower bound (ELBO) (Jordan et al., 1999):

$$\mathcal{L}(q) = \mathbb{E}_{q}\left[\log p\left(\mathcal{D}\mid w\right)\right] - \text{D}_{\text{KL}}\left(q\,\|\,p\right) = -\big\langle \mathcal{E}_{\text{data}} + \mathcal{E}_{\text{reg}} \big\rangle_{q} + H(q) \qquad (12)$$

where $\text{D}_{\text{KL}}\left(q\,\|\,p\right)$ is the Kullback-Leibler divergence (KL-divergence) between the approximate posterior $q$ and the prior $p$. Like in traditional image registration, the energy function consists of the sum of similarity and regularisation terms, with an additional term for the entropy of the posterior distribution $H(q)$. We show how to calculate this term in Section B.

It is not possible to calculate every element of the covariance matrix $\Sigma_{w}$ due to the high dimensionality of the problem. Instead, we approximate the covariance matrix as a sum of diagonal and low-rank parts, i.e. $\Sigma_{w}\approx\text{diag}\left(\sigma_{w}^{2}\right)+u_{w}u_{w}^{\intercal}$, with $\sigma_{w}\in\mathbb{R}^{3N^{3}\times 1}$ and $u_{w}\in\mathbb{R}^{3N^{3}\times R}$, where $R$ is a hyperparameter which determines the parametrisation rank. Using a multivariate normal distribution as the approximate posterior distribution of transformation parameters is common in probabilistic image registration. The key difference in our work is the diagonal + low-rank parametrisation of the covariance matrix. Most recent image registration models with an SVF transformation parametrisation are based on deep learning and make the assumption of a diagonal covariance matrix (Dalca et al., 2018; Krebs et al., 2019).

We use the reparametrisation trick with two samples per update to backpropagate with respect to the variational posterior parameters:

$$w = \mu_{w} \pm \left(\text{diag}\left(\sigma_{w}\right)\cdot\epsilon + u_{w}\cdot x\right), \qquad \epsilon\sim\mathcal{N}\left(0, I_{3N^{3}}\right),\ x\sim\mathcal{N}\left(0, I_{R}\right) \qquad (13)$$
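A minimal sketch of the antithetic, reparametrised sampling in Equation 13 is given below, assuming the variational parameters are stored as flat tensors of size $3N^{3}$ (and $3N^{3}\times R$ for the low-rank factor); the names are illustrative.

```python
# A minimal sketch of the antithetic, reparametrised sampling in Equation 13, assuming
# mu_w and log_var_w are flat tensors of size 3 N^3 and u_w has shape (3 N^3, R).
import torch

def sample_posterior(mu_w, log_var_w, u_w):
    """Draw two antithetic samples w = mu_w +/- (diag(sigma_w) eps + u_w x)."""
    sigma_w = (0.5 * log_var_w).exp()
    eps = torch.randn_like(mu_w)
    x = torch.randn(u_w.shape[1], device=mu_w.device)
    perturbation = sigma_w * eps + u_w @ x
    return mu_w + perturbation, mu_w - perturbation
```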

In order to make the optimisation less susceptible to undesired local maxima of the ELBO, we take advantage of Sobolev gradients (Neuberger et al., 1997). Samples from the posterior are convolved with a Sobolev kernel. We approximate the 3D kernel by three separable 1D kernels to lower the computational overhead. Using the notation in Slavcheva et al. (2018), we set the kernel width to $s_{H^{1}}=7$ and the smoothing parameter to $\lambda_{H^{1}}=0.5$.
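Computationally, the separable approximation amounts to three 1-D convolutions instead of a single 3-D one. The sketch below illustrates this for a field with C channels; kernel1d is a placeholder for the actual 1-D Sobolev kernel of Slavcheva et al. (2018), which is not reproduced here.

```python
# A minimal sketch of separable filtering: three 1-D convolutions instead of one 3-D
# convolution. "kernel1d" is a placeholder for the 1-D Sobolev kernel of Slavcheva et al.
# (2018) (e.g. of width s_H1 = 7), which is not reproduced here.
import torch
import torch.nn.functional as F

def separable_filter(field, kernel1d):
    """Convolve a field of shape (1, C, D, H, W) with kernel1d along each spatial axis."""
    c = field.shape[1]
    pad = kernel1d.numel() // 2
    for axis, view in enumerate([(-1, 1, 1), (1, -1, 1), (1, 1, -1)]):
        weight = kernel1d.view(1, 1, *view).repeat(c, 1, 1, 1, 1)  # depthwise 1-D kernel
        padding = [0, 0, 0]
        padding[axis] = pad
        field = F.conv3d(field, weight, padding=tuple(padding), groups=c)
    return field
```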

The GMM and regularisation hyperparameters are fit using the stochastic approximation expectation maximisation (SAEM) algorithm (Richard et al., 2009; Zhang et al., 2013). The mixture precision hyperparameters $\beta$ and proportion hyperparameters $\varrho$ are updated by solving the optimisation problem:

$$\beta^{(k)},\varrho^{(k)} = \operatorname*{arg\,max}_{\beta,\varrho}\ \mathbb{E}_{q}\left[\log p\left(\mathcal{D},w;\beta,\varrho\right)\mid\beta^{(k-1)},\varrho^{(k-1)}\right] + \log p\left(\beta\right) + \log p\left(\varrho\right) \qquad (14)$$

This is done at each step of the iterative optimisation algorithm. We update the regularisation hyperparameters in an analogous way. Even though the hybrid VI and SAEM approach is computationally efficient and requires minimal implementation effort, it disregards the uncertainty caused by hyperparameter variability.

5 Stochastic gradient Markov chain Monte Carlo

Image registration algorithms based on VI restrict the parametrisation of the posterior distribution to a specific family of probability distributions, which may not include the true posterior. To avoid this problem and sample the transformation parameters in an efficient way, we use stochastic gradient Langevin dynamics (SGLD) (Besag, 1993; Welling and Teh, 2011). The update equation is given by:

$$w_{k+1} \leftarrow w_{k} + \tau A\,\nabla\log\pi\left(w_{k}\right) + \sqrt{2\tau A}\,\xi_{k} \qquad (15)$$

where $\tau$ is the step size, $A$ is an optional preconditioning matrix, $\nabla\log\pi\left(w_{k}\right)$ is the gradient of the logarithm of the posterior probability density function (PDF), and $\xi_{k}\sim\mathcal{N}\left(0,I_{3N^{3}}\right)$. SGLD does not require a particular initialisation, so we study several different possibilities, including a sample $w_{0}\sim\mathcal{N}\left(\mu_{w},\Sigma_{w}\right)$ from the approximate variational posterior, in which case we set $A=\text{diag}\left(\sigma^{2}_{w}\right)$. The preconditioning helps with the MCMC mixing time in case the target distribution is strongly anisotropic.
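A minimal sketch of one preconditioned SGLD update per Equation 15 is shown below; energy_fn stands for the negative log-posterior built from the data and regularisation terms, precond_diag for the diagonal of $A=\text{diag}\left(\sigma^{2}_{w}\right)$, and the hyperparameter updates are omitted. The names are illustrative.

```python
# A minimal sketch of one preconditioned SGLD update (Equation 15). "energy_fn" stands for
# the negative log-posterior built from the data and regularisation terms, "precond_diag"
# for the diagonal of A = diag(sigma_w^2); hyperparameter updates are omitted.
import torch

def sgld_step(w, energy_fn, tau, precond_diag):
    w = w.detach().requires_grad_(True)
    energy = energy_fn(w)                       # -log pi(w) up to an additive constant
    grad, = torch.autograd.grad(energy, w)      # so that grad log pi(w) = -grad
    with torch.no_grad():
        noise = torch.randn_like(w)
        w_next = w - tau * precond_diag * grad + (2.0 * tau * precond_diag).sqrt() * noise
    return w_next
```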

It is worth noting that, except for the preconditioning matrix $A$ and the noise term $\xi_{k}$, Equation 15 is equivalent to a gradient descent update when minimising the maximum a posteriori (MAP) objective function $-\log p\left(w\mid\mathcal{D}\right)=-\log p\left(\mathcal{D}\mid w\right)-\log p\left(w\right)$. When drawing samples from SGLD, we continue to update the GMM as well as the regularisation hyperparameters as in Equation 14, except that the expected value is calculated with respect to the new posterior $\pi\left(w\right)$.

In the limit as $k\to\infty$ and $\tau\to 0$, SGLD can be used to draw exact samples from the posterior of the transformation parameters without Metropolis-Hastings accept-reject tests, which are computationally expensive. Indeed, these costs prevent the use of other MCMC algorithms for the registration of large 3D images. In practice, the step size needs to be adjusted to avoid high autocorrelation between samples yet remain smaller than the width of the most constrained direction in the local energy landscape (Neal, 2011). The step size can also be used to control the trade-off between accuracy and computation time. We can quantify uncertainty either quickly in a coarse manner or slowly, with more detail.

Even though the term $\nabla\log\pi\left(w\right)$ allows the energy landscape to be traversed efficiently, SGLD suffers from high autocorrelation and slow mixing between modes (Hill, 2020). However, the simplicity of the formulation makes it better suited than other MCMC methods for high-dimensional problems like three-dimensional image registration.

6 Experiments

6.1 Setup

The model is implemented in PyTorch. For all experiments we use three-dimensional T2-FLAIR MRI brain scans and subcortical structure segmentations from the UK Biobank dataset (Sudlow et al., 2015). Input images are pre-registered with the affine component of drop2 (Glocker et al., 2008; https://github.com/biomedia-mira/drop2) and resampled to $N=128$ isotropic voxels of length 1.82 mm along every dimension.

We use $2^{12}$ steps to integrate the SVFs. In order to start the optimisation with small displacements, $\mu_{w}$ is initialised to zero, which corresponds to the identity transformation, $\sigma_{w}$ to half a voxel length in each direction, and $u_{w}$ to a tenth of a voxel length in each direction. We are mainly interested in the approximate variational posterior as an initialisation of the SG-MCMC algorithm, so the rank parameter is set to $R=1$. We use the Adam optimiser with a step size of $1\times10^{-2}$ for the approximate variational posterior parameters $\mu_{w}$, $\log\sigma_{w}^{2}$, and $u_{w}$. Training is run until the ELBO value stops increasing, which requires approximately 1,024 iterations.

In the likelihood model, we use $\kappa=0.5$ for an uninformative Jeffreys prior on the mixture proportions, while the mixture precision hyperparameters are set to $\mu_{\beta_{l}}=0.0$ and $\sigma_{\beta_{l}}=2.3$. The model is much less sensitive to the value of the likelihood hyperparameters than to the regularisation hyperparameters, which are calibrated to guarantee diffeomorphic transformations sampled from the approximate variational posterior $q\left(w\right)\sim\mathcal{N}\left(\mu_{w},\Sigma_{w}\right)$. The local transformation is diffeomorphic only in locations where the Jacobian determinant is positive (Ashburner, 2007), so we aim to keep the number of voxels with a non-positive Jacobian determinant, $|\det J_{\varphi^{-1}}|\leq 0$, close to zero. We calibrate the location hyperparameter $\lambda_{\text{init}}$ in every experiment, while the scale hyperparameters are set to $\eta=2.8$ and $\varsigma=5.0$.
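The diffeomorphism check used for this calibration can be sketched as counting voxels with a non-positive Jacobian determinant. The snippet below assumes the sampled transformation is given as a displacement field in voxel units, with channel c holding the displacement along spatial dimension c, and uses central differences; it is a sketch, not the exact implementation.

```python
# A minimal sketch of the diffeomorphism check: count voxels with a non-positive Jacobian
# determinant. Assumption: "disp" is a displacement field of shape (3, D, H, W) in voxel
# units, with channel c holding the displacement along spatial dimension c.
import torch

def nonpositive_jacobian_voxels(disp):
    # Spatial derivatives by central differences: grads[j, i] = d u_i / d x_j.
    grads = torch.stack(torch.gradient(disp, dim=(1, 2, 3)), dim=0)            # (3, 3, D, H, W)
    jac = grads.permute(2, 3, 4, 1, 0) + torch.eye(3, device=disp.device)      # J = I + grad u
    return int((torch.linalg.det(jac) <= 0).sum())
```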

SAEM convergence is known to be conditional on decreasing step sizes (Delyon et al., 1999). For this reason, we use a small step size decay of $1\times10^{-3}$ and the Adam optimiser with a step size of $2\times10^{-1}$ for the GMM hyperparameters $\log\beta^{-0.5}$ and $\log\varrho$, and $1\times10^{-2}$ for the regularisation hyperparameters $\mu_{\chi^{2}}$ and $\log\sigma_{\chi^{2}}$. In case of parameters whose value is constrained to be positive, we state the step size used on their logarithms. In practice, we did not observe the result to be dependent on these step sizes.

6.2 Regularisation strength

First we evaluate the proposed regularisation. We compare it to a gamma prior on $\lambda$, i.e. $\lambda\sim\Gamma\left(s,r\right)$, where $s$ and $r$ are the shape and the rate parameters respectively, set to uninformative values $s=r=\nu/2$ (Simpson et al., 2012).

We compare the output of VI when using fixed regularisation weights $\lambda_{\text{reg}}\in\{0.1, 1.2\}$, the baseline method for learnable regularisation strength, and our regularisation loss. The result on a sample pair of input images is shown in Figure 1. For the baseline method, the learnt regularisation strength is too high, which effectively prevents the alignment of images. This indicates that previous schemes for inference of regularisation strength from data are inadequate when the transformation parametrisation involves a very large number of degrees of freedom. In case of $\lambda_{\text{reg}}=0.1$, the resulting transformation is not diffeomorphic. The output when using our regularisation loss with $\lambda_{\text{init}}=1.2$ strikes a balance between the baseline and $\lambda_{\text{reg}}=0.1$, where there is an overemphasis on the data term.

[Figure 1: middle axial slices of the fixed image $F$ with the mean displacement, and of the warped moving image $M\left(\mu_{w}\right)$ for panels (a) $\lambda_{\text{reg}}=0.1$, (b) $\lambda_{\text{reg}}=1.2$, (c) the baseline, and (d) the proposed loss.]

Figure 1: Output when using two fixed regularisation weights $\lambda_{\text{reg}}\in\{0.1, 1.2\}$, the baseline method for learnable regularisation strength, and the proposed learnable regularisation loss with $\lambda_{\text{init}}=1.2$. For the fixed regularisation weight $\lambda_{\text{reg}}=0.1$, the sampled transformations are not diffeomorphic. In case of the baseline method, the learnt regularisation strength is too high, which effectively prevents the alignment of images. When using the proposed learnable regularisation loss, we strike a balance between the baseline method and the fixed regularisation weight $\lambda_{\text{reg}}=0.1$, where the regularisation strength is too low. The figure shows the middle axial slice of the 3D images.

In Figure 2, we show the output of VI for two pairs of images which require different regularisation strengths for accurate alignment. We choose a fixed image and two moving images $M_{1}$ and $M_{2}$, with one visibly different from and the other similar to the fixed image. We also analyse the result when using fixed regularisation weights $\lambda_{\text{reg}}=0.2$, which leads to non-diffeomorphic transformations, and $\lambda_{\text{reg}}=2.0$, which produces smooth transformations but, in case of $M_{1}$, at the expense of accuracy. The proposed regularisation, initialised with $\lambda_{\text{init}}=2.0$, helps to prevent oversmoothing.

[Figure 2: middle axial slices of the fixed image $F$ with the mean displacement, a zoom on the lateral ventricle area, and the warped moving images $M_{1}\left(\mu_{w}\right)$ and $M_{2}\left(\mu_{w}\right)$ for panels (a), (d) $\lambda_{\text{reg}}=0.2$, (b), (e) $\lambda_{\text{reg}}=2.0$, and (c), (f) the proposed loss.]

Figure 2: Output of VI for two image pairs which require different regularisation strengths. $M_{1}$ is visibly different from the fixed image and $M_{2}$ is similar to it. The proposed regularisation is initialised with $\lambda_{\text{init}}=2.0$. For both image pairs, the alignment is best in case of the fixed regularisation strength $\lambda_{\text{reg}}=0.2$, but the output transformation is not diffeomorphic. For $M_{2}$, the proposed learnable regularisation loss is almost identical to $\lambda_{\text{reg}}=2.0$, which ensures good accuracy and smoothness of the transformation. For $M_{1}$, the proposed loss yields somewhat higher accuracy than $\lambda_{\text{reg}}=2.0$. The figure shows the middle axial slice of the 3D images.

6.3 Uncertainty quantification

We run a number of experiments to evaluate the uncertainty estimates and to better understand the differences between the uncertainty estimates output by various non-rigid registration methods in practice:

  1. We compare the uncertainty estimates output by VI, SG-MCMC, and VoxelMorph on inter-subject brain MRI data from the UK Biobank qualitatively and quantitatively, by calculating the Pearson correlation coefficient between the displacement and label uncertainties;

  2. We compare the uncertainty estimates when the SG-MCMC algorithm is initialised to different transformations;

  3. We compare the result when using non-parametric SVFs and SVFs based on B-splines to parametrise the transformation.

In order to make sampling from SG-MCMC efficient, we determine the largest step size that guarantees diffeomorphic transformations as defined in Section 6.1 and set it to $\tau=4\times10^{-1}$. A single Markov chain requires only 4 GB of memory, so two Markov chains are run in parallel, each initialised to a different sample from the approximate variational posterior. We discard the first 100,000 samples from each chain to allow the Markov chains to reach the stationary distribution.

6.3.1 Comparison of uncertainty estimates output by different models

VoxelMorph. To fill the knowledge gaps on differences between uncertainty estimates produced by non-rigid image registration algorithms, we train the probabilistic VoxelMorph (Dalca et al., 2018) in atlas mode on a random 80/20 split of 13,401 brain MRI scans in the UK Biobank (the official VoxelMorph implementation used in the experiments is available on GitHub: https://github.com/voxelmorph/voxelmorph). We use the same fixed image as in the experiments and exclude the moving images from the training data. To enable a fair comparison, the chosen similarity metric is the sum of squared differences (SSD). We also study the differences between the uncertainty estimates output by VI and SG-MCMC, based on 500 samples output by each model. In order to reduce auto-correlation, samples output by SG-MCMC are selected at regular intervals from the one million samples drawn from each chain.

Like in VI, which we use to initialise the SG-MCMC algorithm, the approximate variational posterior of transformation parameters output by VoxelMorph is assumed to be a multivariate normal distribution $q_{\text{VXM}}(w) \sim \mathcal{N}(\mu_{\text{VXM}}, \Sigma_{\text{VXM}})$. The only difference is the covariance matrix $\Sigma_{\text{VXM}}$, which is assumed to be diagonal rather than diagonal + low-rank. In order to set the model hyperparameters, we analyse the average surface distances (ASDs) and Dice scores (DSCs) on subcortical structure segmentations and the Jacobian determinants of sample transformations, with the aim of striking a balance between accuracy and smoothness. The most important hyperparameters are the fixed regularisation strength parameter, set to $\lambda_{\text{VXM}} = 10.0$, and the initial value of the diagonal covariance matrix, set to $\Sigma_{\text{VXM}} = \text{diag}(0.01)$.
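To make the difference between the two posterior parametrisations concrete, the sketch below draws samples from a Gaussian with a diagonal covariance (the VoxelMorph assumption) and from one with a diagonal + rank-one covariance via the reparametrisation trick; the dimensionality and parameter values are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                                        # stand-in for the number of transformation parameters
mu = rng.standard_normal(d)                  # posterior mean
sigma = 0.1 + 0.1 * rng.random(d)            # diagonal standard deviations
u = 0.3 * rng.standard_normal(d)             # rank-one factor, Sigma = diag(sigma^2) + u u^T

def sample_diag(n):
    # Diagonal covariance: w = mu + sigma * eps.
    return mu + sigma * rng.standard_normal((n, d))

def sample_diag_plus_rank_one(n):
    # Diagonal + rank-one covariance: w = mu + sigma * eps_1 + u * eps_2, with a scalar eps_2 per sample.
    return mu + sigma * rng.standard_normal((n, d)) + rng.standard_normal((n, 1)) * u

target = np.diag(sigma**2) + np.outer(u, u)
emp = np.cov(sample_diag_plus_rank_one(200_000), rowvar=False)
print(np.abs(emp - target).max())            # small; the off-diagonal structure comes from the u u^T term
print(np.abs(np.cov(sample_diag(200_000), rowvar=False) - np.diag(sigma**2)).max())
```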

Qualitative comparison of uncertainty estimates. The output of VI, SG-MCMC, and VoxelMorph on a sample image pair is shown in Figure 3. It should be noted that, because direct correspondence between regions is hard to determine, the problem of registering inter-subject brain MRI scans is more challenging than problems where non-rigid registration uncertainty had been studied previously, e.g. intra-subject brain MRI (Risholm et al., 2010; Simpson et al., 2012) and intra-subject cardiac MRI (Le Folgoc et al., 2017). Nonetheless, the uncertainty estimates output by VI and SG-MCMC are consistent with previous findings, e.g. higher uncertainty in homogeneous regions (Simpson et al., 2012; Dalca et al., 2018).

Unlike in Le Folgoc et al. (2017), where Gaussian RBFs were used to parametrise the transformation on intra-subject cardiac MRI scans, VI outputs higher uncertainty than MCMC. SGLD is known to overestimate the posterior covariance due to non-vanishing learning rates at long times (Mandt et al., 2017), which further suggests that the uncertainty output by VI might be overestimated. The uncertainty estimates produced by VI and SG-MCMC are consistent but different in magnitude, while those output by VoxelMorph are noticeably different, with much smaller values. This indicates the need for further research into the calibration of uncertainty estimates for image registration methods based on deep learning.

[Figure 3 panels: uncertainty maps output by VI, SG-MCMC, and VoxelMorph.]
Figure 3: Uncertainty output by the models on the input image pair shown in Figure 1. The standard deviation of the displacement field magnitude is calculated using 500 samples. In case of SG-MCMC, samples are selected at regular intervals from the one million samples output by each of the two Markov chains, which is needed to prevent autocorrelation between samples. SG-MCMC outputs lower uncertainty than VI. The uncertainty estimates output by VI and VoxelMorph are very dissimilar, even though the two models assume a similar approximate variational posterior. In case of SGLD, visualising the standard deviation of the displacement field magnitude is valid under the assumption that the posterior is approximately Gaussian and mono-modal. The standard deviations of displacement field magnitudes sampled from VI and SG-MCMC are very different, so the visualisation in case of SG-MCMC should not be treated as an accurate description of the true uncertainty, but rather as evidence that the true posterior distribution of transformation parameters is likely not Gaussian with a diagonal + low-rank covariance matrix.
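For reference, the uncertainty map visualised above can be computed from transformation samples as in the short sketch below, where a random array stands in for the sampled displacement fields and the array shapes are illustrative.

```python
import numpy as np

# Samples of 3D displacement fields, shape (n_samples, 3, D, H, W); random values stand in for model output.
rng = np.random.default_rng(0)
disp_samples = rng.normal(size=(500, 3, 8, 8, 8))

# Voxel-wise magnitude of each sampled displacement, then its standard deviation over samples.
magnitude = np.linalg.norm(disp_samples, axis=1)      # (n_samples, D, H, W)
uncertainty_map = magnitude.std(axis=0)               # (D, H, W), in voxels or mm
print(uncertainty_map.shape, uncertainty_map.mean())
```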

Image registration accuracy. To evaluate image registration accuracy, we calculate ASDs and DSCs on the subcortical structure segmentations, using the fixed segmentation and the moving segmentation warped with transformations sampled from the models. The metric comparison between our model and VoxelMorph on the image pair in Figure 3 is shown in Table 1 for segmentations grouped by volume and in Figure 4 for individual segmentations.

Table 1: Mean and standard deviation of ASD and DSC on small, medium, and large subcortical structures, calculated using 500 output samples. Large structures include the brain stem and thalamus; medium structures include the caudate, hippocampus, and putamen; small structures include the accumbens, amygdala, and pallidum. The best value in each column is marked with an asterisk.
                       average surface distance (mm)                          Dice score
model          small struct.   medium struct.   large struct.    small struct.   medium struct.   large struct.
VI             1.12 (0.16)     0.97 (0.08)      1.04 (0.26)*     0.68 (0.07)     0.79 (0.03)      0.88 (0.02)*
SG-MCMC        1.10 (0.15)     0.95 (0.07)*     1.05 (0.24)      0.68 (0.07)     0.79 (0.03)      0.88 (0.02)*
VoxelMorph     1.06 (0.14)*    0.96 (0.09)      1.05 (0.11)      0.70 (0.05)*    0.80 (0.02)*     0.87 (0.01)
Figure 4: ASD and DSC on each subcortical structure. Dashed lines show the metric values prior to non-rigid registration. The label uncertainty of VI and SG-MCMC is comparable, despite different transformation uncertainty.

The metrics show significant improvement over affine registration on most subcortical structures. The differences between label uncertainties are less pronounced than those between transformation uncertainties. ASDs and DSCs are generally marginally better when using samples from SG-MCMC than from the approximate variational posterior. Better accuracy in case of SGLD is expected, given the restrictive assumptions of the approximate variational model. VI, SG-MCMC, and VoxelMorph produce similar accuracy, despite different uncertainty estimates.

Quantitative comparison of uncertainty estimates. It is difficult to define ground truth uncertainty with regard to image registration. Luo et al. (2019) suggested that well-calibrated image registration uncertainty estimates need to be informative of anatomical features, which are important in neurosurgery. To compare the uncertainty output by different models quantitatively and evaluate whether the uncertainty estimates are clinically useful, we adopt a method similar to that used by Luo et al. (2019) and calculate the Pearson correlation coefficient $r_{u_d u_l}$ between the displacement uncertainties $u_d$ and label uncertainties $u_l$. Here we define the label uncertainty $u_l$ to be the standard deviation of the DSC for each subcortical structure, and the displacement uncertainty $u_d$ to be the voxel-wise mean of the displacement field standard deviation magnitude within the region that corresponds to a given subcortical structure.
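A minimal sketch of this computation is given below, with random stand-ins for the per-structure uncertainties; the number of structures and the values are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# One entry per subcortical structure; random values stand in for the quantities defined in the text.
n_structures = 14
u_l = rng.random(n_structures)                   # label uncertainty: std of DSC per structure over samples
u_d = 0.5 * u_l + rng.random(n_structures)       # displacement uncertainty: mean std-magnitude in the structure

r, p_value = pearsonr(u_d, u_l)
print(f"r_ud_ul = {r:.2f} (p = {p_value:.2f})")
```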

Table 2: Pearson correlation coefficient $r_{u_d u_l}$ between the displacement uncertainties $u_d$ and label uncertainties $u_l$. We define the label uncertainty $u_l$ to be the standard deviation of the DSC for each subcortical structure, and the displacement uncertainty $u_d$ to be the voxel-wise mean of the displacement field standard deviation magnitude within the region which corresponds to a given subcortical structure. The value of the correlation coefficient is highest in case of VoxelMorph, but qualitative evidence suggests that the uncertainty estimates output by the model are not well calibrated. The correlation coefficient is also higher for SG-MCMC than for VI.
model          r_{u_d u_l}
VI             0.07
SG-MCMC        0.10
VoxelMorph     0.12

The correlation coefficient is highest in case of VoxelMorph, but qualitative evidence suggests that the image registration uncertainty output by the model is not well calibrated. The displacement field uncertainty is more informative of label uncertainty in case of SG-MCMC than VI, so the uncertainty estimates output by SG-MCMC are likely to be more useful for clinical purposes than those output by VI or VoxelMorph.

Transformation smoothness. Finally, in order to evaluate the quality of the model output, in Table 3 we report the number of voxels where the sampled transformations are not diffeomorphic (a sketch of this computation is given after the table). Each model produces transformations where the number of non-positive Jacobian determinants is nearly zero. SG-MCMC slightly reduces the number of non-positive Jacobian determinants compared to VI. The transformations output by VoxelMorph appear smoother, which is directly related to the lower transformation uncertainty shown in Figure 3.

Table 3: Mean and standard deviation of the number and percentage of voxels where the sampled transformation is not diffeomorphic as defined in Section 6.1. The values are based on 500 samples.
model          |det J_φ⁻¹| ≤ 0     % (× 10⁻⁶)
VI             0.00 (0.04)         0.0 (0.2)
SG-MCMC        0.00 (0.00)         0.0 (0.0)
VoxelMorph     0.00 (0.00)         0.0 (0.0)
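The counts above can be obtained as in the following sketch, which assumes a displacement field representation of the sampled transformation and unit voxel spacing; a random field stands in for an actual sample.

```python
import numpy as np

def jacobian_determinant(disp, spacing=1.0):
    """Finite-difference Jacobian determinant of phi(x) = x + u(x) for a displacement field
    u of shape (3, D, H, W). Returns a (D, H, W) map."""
    grads = np.stack([np.stack(np.gradient(disp[i], spacing), axis=0) for i in range(3)], axis=0)
    # grads[i, j] = d u_i / d x_j; the Jacobian of phi is I + grad u.
    J = grads + np.eye(3)[:, :, None, None, None]
    return np.linalg.det(np.moveaxis(J, (0, 1), (-2, -1)))

rng = np.random.default_rng(0)
disp = 0.2 * rng.standard_normal((3, 16, 16, 16))     # stand-in for a sampled displacement field
det = jacobian_determinant(disp)
n_folded = int((det <= 0).sum())
print(n_folded, 100.0 * n_folded / det.size)          # count and percentage of non-diffeomorphic voxels
```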
[Figure 5 panels: Jacobian determinant maps for VI, SG-MCMC, and VoxelMorph.]
Figure 5: Jacobian determinant of a sample transformation from each model. Even though the output transformations are diffeomorphic, they are not convincingly smooth, due to the parametrisation of the approximate variational posterior of transformation parameters, whose covariance matrix is diagonal + low-rank in case of VI and diagonal in case of VoxelMorph. Middle slice of a 3D image in the axial plane.

6.3.2 Comparison of the output of SG-MCMC for different initialisations

To study the potential impact of initialisation on the output uncertainty estimates, we analyse the transformations sampled from SGLD run with different initial velocity fields $w_0$. We experiment with a sample $w_0 \sim \mathcal{N}(\mu_w, \Sigma_w)$ from VI, a zero velocity field, which corresponds to an identity transformation, and a random velocity field $w_0 \sim \mathcal{N}(0, I_{3N^3})$ sampled from a standard multivariate normal distribution. The first 100,000 samples from each chain are discarded to allow MCMC to reach the stationary distribution, and the two Markov chains are run for one million transitions each, from which we extract 500 samples in total at regular intervals.

We observed no strong dependence of the result on the initialisation. The uncertainty values are visually identical to those in Figure 3. The mean voxel-wise discrepancy between the magnitudes of the uncertainty estimates is approximately 0.1 mm, while the maximum is approximately 0.2 mm. This suggests that the Markov chains mix well.
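The discrepancy figures above can be reproduced from the per-initialisation uncertainty maps with a comparison of the following kind; the maps in this sketch are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (8, 8, 8)   # stand-in for the image grid

# Synthetic stand-ins for the uncertainty maps (std of displacement magnitude, in mm) obtained from
# chains initialised with a VI sample, a zero velocity field, and a random velocity field.
u_vi = 2.0 + 0.05 * rng.standard_normal(shape)
u_zero = u_vi + 0.05 * rng.standard_normal(shape)
u_rand = u_vi + 0.05 * rng.standard_normal(shape)

for name, u in [("zero-field init", u_zero), ("random init", u_rand)]:
    diff = np.abs(u - u_vi)
    print(f"{name}: mean discrepancy {diff.mean():.3f} mm, max {diff.max():.3f} mm")
```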

6.3.3 Comparison of the output of SG-MCMC for non-parametric SVFs and SVFs based on B-splines

One of the common features of recent state-of-the-art models for non-rigid registration that enable uncertainty quantification, e.g. Dalca et al. (2018) and Krebs et al. (2019), is an SVF transformation parametrisation combined with a diagonal covariance matrix of the approximate variational posterior of transformation parameters, which ignores spatial correlations. We showed in Section 6.3.1 that this assumption can lead to diffeomorphic transformations that are not smooth, even in case of image registration that is not based on deep learning. Furthermore, previous work on uncertainty quantification in image registration made the assumption of independence between control points but not between directions, and used transformation parametrisations that guarantee smoothness in an implicit manner, e.g. B-splines (Simpson et al., 2012) or Gaussian RBFs (Le Folgoc et al., 2017).

To better understand the impact of the transformation parametrisation on uncertainty quantification, we analyse the output transformations and uncertainty estimates when using sparse SVFs based on cubic B-splines. The control point spacing is set to two or four voxels along each dimension, which corresponds to 3.64 mm or 7.28 mm. The SGLD step size is set to $5 \times 10^{-2}$, using the heuristic in Section 6.3.
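As a rough illustration of the sparse parametrisation, the sketch below interpolates a grid of control-point velocities to a dense SVF; cubic spline interpolation via scipy.ndimage.zoom is used here as a crude stand-in for the cubic B-spline FFD kernel, and the grid size is arbitrary.

```python
import numpy as np
from scipy.ndimage import zoom

rng = np.random.default_rng(0)
delta = 2                                     # control point spacing in voxels
control = rng.standard_normal((3, 8, 8, 8))   # stand-in for B-spline control point velocities

# Interpolate each velocity component to a dense SVF; order=3 gives cubic spline interpolation,
# a crude approximation of the B-spline free-form deformation kernel used in the paper.
svf = np.stack([zoom(control[i], delta, order=3) for i in range(3)])
print(control.shape, "->", svf.shape)          # (3, 8, 8, 8) -> (3, 16, 16, 16)
```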

In Figure 6, we show the output uncertainties and sample Jacobian determinants. Using SVFs based on B-splines results in smoother transformations. In fact, the implicit smoothness of B-splines helps both with the crude assumption of a diagonal covariance matrix in VI, which translates to a diagonal pre-conditioning matrix in SGLD, and with computational and memory efficiency. We also observed a positive impact on the image registration accuracy. For both control point spacings, SVFs based on B-splines are more accurate than non-parametric SVFs on the majority of structures. Despite comparable accuracy, the uncertainty estimates differ.

[Figure 6 panels: uncertainty and Jacobian determinant maps for control point spacings (a) $\delta = 2$ and (b) $\delta = 4$.]
Figure 6: Uncertainty and the Jacobian determinant of transformations sampled from the models using SVFs based on cubic B-splines. The transformations are visibly smoother than in case of non-parametric SVFs. The uncertainty estimates are also shown to be highly dependent on the transformation parametrisation. The figure shows the middle axial slice of 3D images.

7 Discussion

7.1 Modelling assumptions

Drawing samples from the true posterior of transformation parameters remains a difficult problem even with a large number of simplifying model assumptions. If the true posterior were approximately Gaussian, VI would provide a good approximation of it, and the lower uncertainty output by SG-MCMC would indicate that the output samples are autocorrelated even with subsampling. However, if the true posterior is not Gaussian, then the posterior output by VI is ill-fitting, and the samples output by SG-MCMC cover multiple modes near the mode that corresponds to VI, which makes a good case for the use of SGLD.

In practice, the quality of the uncertainty estimates is also sensitive to the validity of the model assumptions. These include image intensities that coincide up to the expected spatial noise offsets, and the absence of spatial correlations between residuals. The first assumption is valid in case of mono-modal registration, but the model can be easily adapted to other settings through the use of a different data loss. We manage the second assumption by the use of virtual decimation (Groves et al., 2011; Simpson et al., 2012), which calculates the effective degrees of freedom of the residual map with stationary covariance by proxy of the correlation between neighbouring voxels in each direction.
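The per-direction neighbour correlations that enter this computation can be estimated as in the sketch below; the smoothed random field is a hypothetical stand-in for the residual map, and the virtual decimation factor itself follows Groves et al. (2011) and is not reproduced here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
# Spatially smoothed noise as a stand-in for the residual map F - M(phi).
residual = gaussian_filter(rng.standard_normal((32, 32, 32)), sigma=1.5)

def neighbour_correlation(r, axis):
    # Correlation between each voxel and its neighbour along the given direction.
    a = np.moveaxis(r, axis, 0)
    return np.corrcoef(a[:-1].ravel(), a[1:].ravel())[0, 1]

rho = [neighbour_correlation(residual, ax) for ax in range(3)]
print([f"{c:.2f}" for c in rho])   # per-direction correlations used to compute the virtual decimation factor
```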

We showed how good modelling choices, such as a careful choice of priors and transformation parametrisation, can mitigate some of the issues caused by approximations necessary in practical applications. The main problem in inter-subject brain MRI registration remains accuracy, but the trade-off between the quality of the transformation and registration accuracy can be managed effectively.

7.2 Runtime

The experiments were run on a system with an Intel i9-10920X CPU and a GeForce RTX 3090 GPU. Table 4 shows the runtime of the models. VI takes approximately 3 min to register a pair of images and produces 140 samples per second, while SG-MCMC produces 25 autocorrelated samples per second. The registration time for SG-MCMC is absent because it is difficult to pin down the mixing time, so a fair comparison of the relative efficiency of VI and SG-MCMC can be made only on the basis of the number of samples per second.

Due to the lack of publicly available information, we cannot directly compare the efficiency of our model to other Bayesian image registration methods. The speed of our model is an order of magnitude better than that reported by Le Folgoc et al. (2017), while our model is also three- rather than two-dimensional. Thus, the proposed method based on SG-MCMC is very efficient given the Bayesian constraint. It is not as efficient as feed-forward neural networks, such as VoxelMorph. However, the proposed model is fully Bayesian and enables asymptotically exact sampling from the posterior of transformation parameters.

Table 4: Comparison of VI, SG-MCMC, and VoxelMorph in terms of computational efficiency. In contrast to VI and VoxelMorph, SG-MCMC requires burn-in and produces samples which need to be further subsampled in order to avoid autocorrelation. For this reason, in case of SG-MCMC we report the sampling speed based on the time needed to draw 4,000 samples. Note that the ratio used to subsample the output of SG-MCMC may vary depending on the application (cf. Section 5).
model          training time    registration time    samples/sec
VI             –                3 min                1.4 × 10²
SG-MCMC        –                –                    < 1.0
VoxelMorph     38 h             55 ms                2.6 × 10¹

8 Conclusion

In this paper we presented a new Bayesian model for three-dimensional medical image registration. The proposed regularisation loss allows the regularisation strength to be adapted to the data when using an SVF transformation parametrisation, which involves a very large number of degrees of freedom. Sampling from the posterior distribution via SG-MCMC makes it possible to quantify registration uncertainty even for large images. The computational efficiency and theoretical guarantees regarding samples output by SG-MCMC make our model an attractive alternative for uncertainty quantification in non-rigid image registration compared to methods based on VI.


Acknowledgments

This research used UK Biobank resources under the application number 12579. Daniel Grzech is funded by the EPSRC CDT for Medical Imaging EP/L015226/1 and Loïc Le Folgoc by EP/P023509/1. We are very grateful to Prof. Wells and the Reviewers for their feedback during the review process.


Ethical Standards

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.


Conflicts of Interest

We declare we do not have conflicts of interest.

References

  • Arsigny et al. (2006) Vincent Arsigny, Olivier Commowick, Xavier Pennec, and Nicholas Ayache. A Log-Euclidean Framework for Statistics on Diffeomorphisms. In MICCAI, 2006.
  • Ashburner (2007) John Ashburner. A fast diffeomorphic image registration algorithm. NeuroImage, 2007.
  • Avants et al. (2014) Brian B. Avants, Nicholas J. Tustison, Michael Stauffer, Gang Song, Baohua Wu, and James C. Gee. The Insight ToolKit image registration framework. Frontiers in Neuroinformatics, 8, 2014.
  • Balakrishnan et al. (2018) Guha Balakrishnan, Amy Zhao, Mert R. Sabuncu, John Guttag, and Adrian V. Dalca. An Unsupervised Learning Model for Deformable Medical Image Registration. In CVPR, 2018.
  • Balakrishnan et al. (2019) Guha Balakrishnan, Amy Zhao, Mert R. Sabuncu, John Guttag, and Adrian V. Dalca. VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Transactions on Medical Imaging, 2019.
  • Besag (1993) Julian Besag. Comments on "Representations of knowledge in complex systems" by U. Grenander and M.I. Miller. Journal of the Royal Statistical Society, 56:591–592, 1993.
  • Chen et al. (2016) Changyou Chen, David Carlson, Zhe Gan, Chunyuan Li, and Lawrence Carin. Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization. In AISTATS, 2016.
  • Dalca et al. (2018) Adrian V. Dalca, Guha Balakrishnan, John Guttag, and Mert R. Sabuncu. Unsupervised Learning for Fast Probabilistic Diffeomorphic Registration. In MICCAI, 2018.
  • Dalca et al. (2019) Adrian V. Dalca, Guha Balakrishnan, John Guttag, and Mert R. Sabuncu. Unsupervised Learning of Probabilistic Diffeomorphic Registration for Images and Surfaces. MedIA, 57:226–236, 2019.
  • de Vos et al. (2019) Bob D. de Vos, Floris F. Berendsen, Max A. Viergever, Hessam Sokooti, Marius Staring, and Ivana Išgum. A deep learning framework for unsupervised affine and deformable image registration. MedIA, 52:128–143, 2019.
  • Delyon et al. (1999) Bernard Delyon, Marc Lavielle, and Eric Moulines. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 27(1):94–128, 1999.
  • Du and Mordatch (2019) Yilun Du and Igor Mordatch. Implicit Generation and Generalization in Energy-Based Models. In NeurIPS, 2019.
  • Glocker et al. (2008) Ben Glocker, Nikos Komodakis, Georgios Tziritas, Nassir Navab, and Nikos Paragios. Dense image registration through MRFs and efficient linear programming. MedIA, 12(6):731–741, 2008.
  • Groves et al. (2011) Adrian R. Groves, Christian F. Beckmann, Steve M. Smith, and Mark W. Woolrich. Linked independent component analysis for multimodal data fusion. NeuroImage, 54(3):2198–2217, 2011.
  • Grzech et al. (2020) Daniel Grzech, Bernhard Kainz, Ben Glocker, and Loïc Le Folgoc. Image registration via stochastic gradient Markov chain Monte Carlo. In UNSURE MICCAI, 2020.
  • Hachama et al. (2012) Mohamed Hachama, Agnès Desolneux, and Frédéric J.P. Richard. Bayesian technique for image classifying registration. IEEE Transactions on Image Processing, 21(9):4080–4091, 2012.
  • Heinrich (2019) Mattias P. Heinrich. Closing the Gap Between Deep and Conventional Image Registration Using Probabilistic Dense Displacement Networks. In MICCAI, 2019.
  • Hill (2020) Mitchell Krupiarz Hill. Learning and Mapping Energy Functions of High-Dimensional Image Data. PhD thesis, UCLA, 2020.
  • Janoos et al. (2012) Firdaus Janoos, Petter Risholm, and William Wells. Bayesian characterization of uncertainty in multi-modal image registration. In WBIR, 2012.
  • Jordan et al. (1999) Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. Introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
  • Karabulut et al. (2017) Navdar Karabulut, Ertunç Erdil, and Çetin Müjdat. A Markov Chain Monte Carlo based Rigid Image Registration Method. In Signal Processing and Communications Applications Conference, 2017.
  • Khawaled and Freiman (2020) Samah Khawaled and Moti Freiman. Unsupervised deep-learning based deformable image registration: A Bayesian framework. In Medical Imaging Meets NeurIPS, 2020.
  • Klein et al. (2009) Stefan Klein, Marius Staring, Keelin Murphy, Max A. Viergever, and Josien P.W. Pluim. Elastix: a toolbox for intensity-based medical image registration. IEEE Transactions on Medical Imaging, 29(1):196–205, 2009.
  • Krebs et al. (2019) Julian Krebs, Hervé Delingette, Boris Mailhé, Nicholas Ayache, and Tommaso Mansi. Learning a Probabilistic Model for Diffeomorphic Registration. IEEE Transactions on Medical Imaging, 38(9), 2019.
  • Le Folgoc et al. (2017) Loïc Le Folgoc, Hervé Delingette, Antonio Criminisi, and Nicholas Ayache. Quantifying Registration Uncertainty With Sparse Bayesian Modelling. IEEE Transactions on Medical Imaging, 36(2):607–617, 2017.
  • Lee et al. (2019) Matthew C. H. Lee, Ozan Oktay, Andreas Schuh, Michiel Schaap, and Ben Glocker. Image-and-Spatial Transformer Networks for Structure-Guided Image Registration. In MICCAI, 2019.
  • Liu et al. (2019) Lihao Liu, Xiaowei Hu, Lei Zhu, and Pheng Ann Heng. Probabilistic Multilayer Regularization Network for Unsupervised 3D Brain Image Registration. In MICCAI, 2019.
  • Luo et al. (2019) Jie Luo, Alireza Sedghi, Karteek Popuri, Dana Cobzas, Miaomiao Zhang, Frank Preiswerk, Matthew Toews, Alexandra Golby, Masashi Sugiyama, William M. Wells III, and Sarah Frisken. On the Applicability of Registration Uncertainty. In MICCAI, 2019.
  • Mandt et al. (2017) Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic Gradient Descent as Approximate Bayesian Inference. Journal of Machine Learning Research, 18:1–35, 2017.
  • Modat et al. (2012) Marc Modat, Pankaj Daga, M. Jorge Cardoso, Sebastien Ourselin, Gerard R. Ridgway, and John Ashburner. Parametric non-rigid registration using a stationary velocity field. In Proceedings of the Workshop on Mathematical Methods in Biomedical Image Analysis, 2012.
  • Neal (2011) Radford M. Neal. MCMC Using Hamiltonian Dynamics, chapter 5. CRC Press, 2011.
  • Neuberger et al. (1997) J. W. Neuberger, A. Dold, and F. Takens. Sobolev Gradients and Differential Equations. Lecture Notes in Computer Science. Springer, 1997.
  • Niethammer et al. (2019) Marc Niethammer, Roland Kwitt, and François-Xavier Vialard. Metric Learning for Image Registration. In CVPR, 2019.
  • Nijkamp et al. (2019) Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model. In NeurIPS, 2019.
  • Richard et al. (2009) Frédéric J.P. Richard, Adeline M.M. Samson, and Charles A. Cuénod. A SAEM algorithm for the estimation of template and deformation parameters in medical image sequences. Statistics and Computing, 19(4):465–478, 2009.
  • Risholm et al. (2010) Petter Risholm, Steve Pieper, Eigil Samset, and William M. Wells. Summarizing and visualizing uncertainty in non-rigid registration. In MICCAI, 2010.
  • Risholm et al. (2011) Petter Risholm, James Balter, and William M. Wells III. Estimation of Delivered Dose in Radiotherapy: The Influence of Registration Uncertainty. In MICCAI, 2011.
  • Risholm et al. (2013) Petter Risholm, Firdaus Janoos, Isaiah Norton, Alex J. Golby, and William M. Wells. Bayesian characterization of uncertainty in intra-subject non-rigid registration. MedIA, 2013.
  • Rueckert et al. (1999) Daniel Rueckert, L. I. Sonoda, C. Hayes, D. L. G. Hill, M. O. Leach, and David J. Hawkes. Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Transactions on Medical Imaging, 18(8):712–721, 1999.
  • Schnabel et al. (2001) Julia A. Schnabel, Daniel Rueckert, Marcel Quist, Jane M. Blackall, Andy D. Castellano-Smith, Thomas Hartkens, Graeme P. Penney, Walter A. Hall, Haiying Liu, Charles L. Truwit, Frans A. Gerritsen, Derek L. G. Hill, and David J. Hawkes. A Generic Framework for Non-rigid Registration Based on Non-uniform Multi-level Free-Form Deformations. In MICCAI, 2001.
  • Schultz et al. (2018) Sandra Schultz, Heinz Handels, and Jan Ehrhardt. A multilevel Markov Chain Monte Carlo approach for uncertainty quantification in deformable registration. In SPIE Medical Imaging, 2018.
  • Schultz et al. (2019) Sandra Schultz, Julia Krüger, Heinz Handels, and Jan Ehrhardt. Bayesian inference for uncertainty quantification in point-based deformable image registration. In SPIE Medical Imaging, 2019.
  • Sedghi et al. (2019) Alireza Sedghi, Tina Kapur, Jie Luo, Parvin Mousavi, and William M. Wells. Probabilistic Image Registration via Deep Multi-class Classification: Characterizing Uncertainty. In UNSURE MICCAI, 2019.
  • Shen et al. (2019) Zhengyang Shen, François Xavier Vialard, and Marc Niethammer. Region-specific diffeomorphic metric mapping. In NeurIPS, 2019.
  • Sherman and Morrison (1950) Jack Sherman and Winifred J. Morrison. Adjustment of an Inverse Matrix Corresponding to Changes in the Elements of a Given Column or a Given Row of the Original Matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.
  • Simpson et al. (2015) I. J.A. Simpson, M. J. Cardoso, M. Modat, D. M. Cash, M. W. Woolrich, J. L.R. Andersson, J. A. Schnabel, and S. Ourselin. Probabilistic non-linear registration with spatially adaptive regularisation. MedIA, 26(1):203–216, 2015.
  • Simpson et al. (2012) Ivor J.A. Simpson, Julia A. Schnabel, Adrian R. Groves, Jesper L.R. Andersson, and Mark W. Woolrich. Probabilistic inference of regularisation in non-rigid registration. NeuroImage, 59(3):2438–2451, 2012.
  • Simpson et al. (2013) Ivor J.A. Simpson, Mark W. Woolrich, Jesperl R. Andersson, Adrian R. Groves, and Julia Schnabel. Ensemble learning incorporating uncertain registration. IEEE Transactions on Medical Imaging, 32(4):748–756, 2013.
  • Slavcheva et al. (2018) Miroslava Slavcheva, Maximilian Baust, and Slobodan Ilic. SobolevFusion: 3D Reconstruction of Scenes Undergoing Free Non-rigid Motion. In CVPR, 2018.
  • Sudlow et al. (2015) Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Medicine, 12(3), 2015.
  • Wassermann et al. (2014) Demian Wassermann, Matt Toews, Marc Niethammer, and William Wells III. Probabilistic Diffeomorphic Registration: Representing Uncertainty. In WBIR, 2014.
  • Welling and Teh (2011) Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In ICML, 2011.
  • whuber (2018) whuber. What is the expected value of the logarithm of gamma distribution?, Oct 2018. URL https://stats.stackexchange.com/questions/370880/what-is-the-expected-value-of-the-logarithm-of-gamma-distribution.
  • Zhang and Fletcher (2014) Miaomiao Zhang and P. Thomas Fletcher. Bayesian principal geodesic analysis in diffeomorphic image registration. In MICCAI, 2014.
  • Zhang et al. (2013) Miaomiao Zhang, Nikhil Singh, and P. Thomas Fletcher. Bayesian Estimation of Regularization and Atlas Building in Diffeomorphic Image Registration. In IPMI, 2013.
  • Zhang et al. (2020) Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical stochastic gradient MCMC for bayesian deep learning. In ICLR, 2020.
  • Zöllei et al. (2007) Lilla Zöllei, Mark Jenkinson, Samson Timoner, and William Wells. A marginalized MAP approach and EM optimization for pair-wise registration. In IPMI, 2007.

A

In this appendix we show how to calculate the mean of the logarithm of the gamma distribution (whuber, 2018).

Let $X$ be a random variable which follows the gamma distribution with shape and rate parameters $\alpha$ and $\beta$, i.e. $X \sim \Gamma(\alpha, \beta)$. We are interested in the expected value of $Y = \log X$. If we assume that $\beta = 1$, then the PDF of $X$ is given by:

$$f_X(x)\,\mathrm{d}x = \frac{1}{\Gamma(\alpha)}\, x^{\alpha} e^{-x}\, \frac{\mathrm{d}x}{x} \qquad (16)$$

Note that $\Gamma(\alpha)$ is a constant and the integral of $f_X$ must equal $1$, so we have:

$$\Gamma(\alpha) = \int_{\mathbb{R}_+} x^{\alpha} e^{-x}\, \frac{\mathrm{d}x}{x} \qquad (17)$$

Let $x = \exp y$. This means that $\frac{\mathrm{d}x}{x} = \mathrm{d}y$, so the PDF of $Y$ is given by:

$$f_Y(y)\,\mathrm{d}y = \frac{1}{\Gamma(\alpha)}\, e^{\alpha y - e^{y}}\, \mathrm{d}y \qquad (18)$$

Again, because $\Gamma(\alpha)$ is a constant and the integral of the PDF of $Y$ must equal $1$, we have:

$$\Gamma(\alpha) = \int_{\mathbb{R}} e^{\alpha y - e^{y}}\, \mathrm{d}y \qquad (19)$$

Now, using Feynman’s trick of differentiating under the integral sign, we see that:

$$\mathbb{E}(Y) = \int_{\mathbb{R}} y\, f_Y(y)\,\mathrm{d}y = \frac{1}{\Gamma(\alpha)} \int_{\mathbb{R}} \Gamma(\alpha)\, y\, f_Y(y)\,\mathrm{d}y \qquad (20)$$
$$= \frac{1}{\Gamma(\alpha)} \int_{\mathbb{R}} \frac{\mathrm{d}}{\mathrm{d}\alpha}\, e^{\alpha y - e^{y}}\,\mathrm{d}y = \frac{1}{\Gamma(\alpha)}\, \frac{\mathrm{d}}{\mathrm{d}\alpha} \int_{\mathbb{R}} e^{\alpha y - e^{y}}\,\mathrm{d}y \qquad (21)$$
$$= \frac{1}{\Gamma(\alpha)}\, \frac{\mathrm{d}}{\mathrm{d}\alpha}\, \Gamma(\alpha) = \frac{\mathrm{d}}{\mathrm{d}\alpha}\, \log \Gamma(\alpha) \qquad (22)$$
$$= \psi(\alpha) \qquad (23)$$

where $\psi$ is the digamma function. Finally, the rate parameter $\beta$ shifts the logarithm by $-\log\beta$. Therefore, the expected value of $\log X$ is given by:

$$\mathbb{E}\left[\log X\right] = \psi(\alpha) - \log\beta \qquad (24)$$
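The identity in Equation 24 is easy to check numerically; the following sketch compares a Monte Carlo estimate of $\mathbb{E}[\log X]$ with $\psi(\alpha) - \log\beta$ for arbitrary parameter values (note that numpy parametrises the gamma distribution by shape and scale $= 1/\beta$).

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha, beta = 2.5, 1.7                                          # shape and rate
x = rng.gamma(shape=alpha, scale=1.0 / beta, size=1_000_000)    # numpy uses the scale parametrisation

print(np.log(x).mean())                # Monte Carlo estimate of E[log X]
print(digamma(alpha) - np.log(beta))   # psi(alpha) - log(beta)
```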

B

In this appendix we show how to calculate the KL-divergence term between the approximate variational posterior $q(w) \sim \mathcal{N}(\mu_w, \Sigma_w)$ and the prior $p(w)$, as well as the entropy $H(q)$ of the approximate variational posterior, both of which are needed to maximise the ELBO in Equation 12.

The KL-divergence can be evaluated in terms of the entropy $H(q)$ of the approximate variational posterior and the regularisation energy $\mathcal{E}_{\text{reg}}$:

$$\text{D}_{\text{KL}}(q\;||\;p) = \int_{w} q(w) \log q(w)\,\mathrm{d}w - \int_{w} q(w) \log p(w)\,\mathrm{d}w = -H(q) + \big\langle \mathcal{E}_{\text{reg}} \big\rangle_{q} \qquad (25)$$

where $\langle \cdot \rangle$ denotes the expected value.

The entropy $H(q)$ of the approximate variational posterior is calculated as follows:

$$H(q) = -\int_{w} q(w) \log q(w)\,\mathrm{d}w = \frac{1}{2} \log\det\left(\Sigma_w\right) + \frac{1}{2} \big\langle (w - \mu_w)^{\intercal}\, \Sigma_w^{-1}\, (w - \mu_w) \big\rangle_{q} + \text{const.} \qquad (26)$$

The first term on the RHS is calculated using the matrix determinant lemma:

$$\det\left(\Sigma_w\right) = \det\!\left(\text{diag}(\sigma_w^2) + u_w u_w^{\intercal}\right) = \left(1 + u_w^{\intercal}\,\text{diag}(\sigma_w^{-2})\, u_w\right) \times \det\!\left(\text{diag}(\sigma_w^2)\right) \qquad (27)$$

To evaluate the other term, and in particular the precision matrix $\Sigma_w^{-1}$, we can use the Sherman-Morrison formula (Sherman and Morrison, 1950), which states that:

$$\Sigma_w^{-1} = \left(\text{diag}(\sigma_w^2) + u_w u_w^{\intercal}\right)^{-1} = \text{diag}(\sigma_w^{-2}) - \frac{\text{diag}(\sigma_w^{-2})\, u_w u_w^{\intercal}\, \text{diag}(\sigma_w^{-2})}{1 + u_w^{\intercal}\,\text{diag}(\sigma_w^{-2})\, u_w} \qquad (28)$$
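Both identities are straightforward to verify numerically, e.g. with the following sketch for an arbitrary small dimensionality and random $\sigma_w$ and $u_w$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
sigma2 = 0.5 + rng.random(d)                 # diagonal variances sigma_w^2
u = rng.standard_normal(d)                   # rank-one factor u_w
Sigma = np.diag(sigma2) + np.outer(u, u)

# Matrix determinant lemma, Equation 27.
det_lemma = (1.0 + u @ (u / sigma2)) * np.prod(sigma2)
print(np.isclose(np.linalg.det(Sigma), det_lemma))

# Sherman-Morrison formula, Equation 28.
D_inv = np.diag(1.0 / sigma2)
Sigma_inv = D_inv - (D_inv @ np.outer(u, u) @ D_inv) / (1.0 + u @ (u / sigma2))
print(np.allclose(np.linalg.inv(Sigma), Sigma_inv))
```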