Node-level auxiliary variables for improved graphical model inference (navigm)

navigm is an R package designed to integrate node-level auxiliary variables into Gaussian graphical spike-and-slab models. Unlike traditional Gaussian graphical models, it features the option to incorporate and select node-level variables that could influence node centrality. This package implements a new variational Bayes expectation conditional maximisation (VBECM) algorithm for model inference. Additionally, it provides further options on simpler models and the expectation conditional maximisation (ECM) algorithm.

Set-up

First, load the package using the command

library(navigm)

The primary function navigm operates on a data matrix \(\mathbf{Y}\) with dimensions \(N \times P\), where \(N\) represents the number of samples and \(P\) corresponds to the number of nodes within the graph. In addition, it can also accommodate an extra input matrix \(\mathbf{V}\) with dimensions \(P \times Q\). This matrix comprises node-level auxiliary variables that may potentially influence node degrees and the dependence structure, where \(Q\) indicates the number of such auxiliary variables. In this vignette, let us assume \(N = 200\) samples, \(P = 50\) nodes, and \(Q = 10\) auxiliary variables, with one variable impacting node degrees and others remaining “inactive”.

Example data simulation

We create the matrix of node-level variables \(\mathbf{V}\) from a right-skewed beta distribution in an independent fashion. The effect size of the sole (randomly drawn) active auxiliary variable is generated using a log-normal distribution centered at 0.7. For the remaining variables, their effect sizes are all set to 0. These effect sizes are represented by the vector \(\boldsymbol{\beta}\). Both components contribute to the hub propensity via \(\mathbf{V}\boldsymbol{\beta}\).

Set \(\zeta = -1.2\). The probability of including edge \((i,j)\) is determined by: \[ \Phi (\zeta + \sum_{q=1}^Q V_{iq} \beta_q + \sum_{q=1}^Q V_{jq} \beta_q), \] where \(\Phi(\cdot)\) represents the normal cumulative density function. If the resulting probability is greater than or equal to 0.5, the edge \((i,j)\) exists; otherwise, it does not, thus defining the adjacency matrix. Subsequently, we generate \(N=100\) samples from a multivariate normal distribution with mean \(\mathbf{0}\) and a precision matrix that adheres to the specified adjacency structure.

set.seed(123)

# problem set-up
#
N <- 200
P <- 50
Q <- 10
Q0 <- 1 # Q0 <= Q
zeta <-  - 1.2


# auxiliary variables
#
V <- simulate_auxiliary_matrix(P, Q)


# effect sizes
#
beta0 <- 0.7
sig2_beta0 <- 0.1
beta_true <- rlnorm(Q0, log(beta0), sig2_beta0)
beta_true_gmss <- rep(0, Q)
beta_true_gmss[sample(1:Q, Q0)] <- beta_true


# adjacency matrix
#
theta <- V %*% matrix(beta_true_gmss, ncol = 1)
pe <- matrix(theta, nrow = P, ncol = P)
pe <- pe + t(pe) + zeta
pe <- pnorm(pe)
A <- 0 + (pe >= 0.5)
diag(A) <- 0

# 
net <- simulate_data_from_adjacency_matrix(N = N, A = A)

The sparsity of the simulated adjacency matrix is approximately 0.82%. The illustration below showcases the simulated data matrix \(\mathbf{Y}\) (left) and the node-level variable matrix \(\mathbf{V}\) (right). The vertical y-axes correspond to nodes, while the horizontal x-axes correspond to samples (on the left) and auxiliary variables (on the right).

Methodology

Model

The standard approach estimates the conditional independence structure between nodes using \(\mathbf{Y}\). This package novelly integrates an independent matrix \(\mathbf{V}\) for a dual purpose: assessing the significance of auxiliary variables and enhancing the accuracy of graph estimates. The default model within this package is thus a Gaussian graphical spike-and-slab model with a spike-and-slab prior for the coefficients of node-level auxiliary variables (GMSS). The formulation of a Gaussian graphical spike-and-slab model is as follows:

\[ \begin{aligned} \mathbf{y}_1, ..., \mathbf{y}_n &\overset{\text{iid}}{\sim} \text{Normal}(\mathbf{0}, \boldsymbol{\Omega}^{-1}), \quad n = 1, ..., N, \quad \boldsymbol{\Omega} \in \mathcal{M}^+, \\ \omega_{ij}\mid \delta_{ij}, \tau &\sim \delta_{ij}\text{Normal}\left(0, \nu_1^2\tau^{-1}\right) + (1-\delta_{ij}) \text{Normal}\left(0, \nu_0^2\tau^{-1}\right),\quad \nu_0 \ll \nu_1,\quad 1 \leq i<j \leq P,\\ \omega_{ii} &\sim \mathrm{Exp}\left({\lambda/2}\right), \quad i = 1, \ldots, P,\\ \delta_{ij} \mid \rho &\sim \text{Bernoulli}(\rho_{ij}),\\ \tau &\sim \text{Gamma}(a_\tau, b_\tau), \end{aligned} \] where \(\mathcal{M}^+\) denotes the set of symmetric positive definite matrices, \(\nu_0\) and \(\nu_1\) represent small and large constants respectively, and \(a_\tau, b_\tau\) are hyperparameters.

Subsequently, we establish the impact of node-level auxiliary variables on the edge inclusion by linking \(\mathbf{V}\) with \(\rho_{ij}\). The incorporation of a spike-and-slab prior for the coefficients of node-level auxiliary variables aids in the selection of influential variables, \[ \begin{aligned} \rho_{ij} &= \Phi\left(\zeta + \sum_{q=1}^Q V_{iq}\beta_q + \sum_{q=1}^Q V_{jq} \beta_q\right), \\ \beta_q \mid \gamma_q, \sigma^2 &\sim \gamma_{q}\text{Normal}(0,\sigma^2) + (1-\gamma_{q}) \delta(\beta_q) , \quad q=1,\ldots, Q, \quad \gamma_q\mid o \sim \text{Bernoulli}(o), \quad \\ \zeta &\sim \text{Normal}(n_0, t_0^2), \quad \sigma^{-2} \sim \text{Gamma}(a_\sigma, b_\sigma), \quad o \sim \text{Beta}(a_o, b_o), \end{aligned} \] where \(\Phi(\cdot)\) is the standard normal cumulative distribution function, \(n_0, t_0^2\) are hyperparameters to induce graph sparsity, \(a_o, b_o\) are hyperparameters to induce auxiliary variable sparsity, and \(a_\sigma, b_\sigma\) are hyperparameters on slab variances. While the package provides default choices for these hyperparameters, users retain the flexibility to modify them. For more comprehensive insights and explanations on hyperparameters, please consult our paper.

Inference

To infer model parameters, we have developed a VBECM algorithm that can be implemented using the following command:

gmss_vbecm <- navigm(Y = net$Y, V = V, numCore = 2)
#> == Parallel exploration of a  grid of spike standard deviations 
#>  v0 =  1e-04 0.06676 0.13342 0.20008 0.26674 0.3334 0.40006 0.46672 0.53338 0.60004 0.6667 0.73336 0.80002 0.86668 0.93334 1 
#>   on  2  cores ... 
#> 
#> ... done. == 
#> 
#> == Select from a grid of spike variance using AIC  ... 
#> 
#> ... done. == 
#> 
#> Select the index 7  i.e., v0 =  0.40006 , the best AIC  =  9146.972 .
#> 
#> Total runtime:  4.369285 secs .

Please note that in this vignette, the setting numCore = 2 is limited due to an issue discussed in this link. However, for real-world applications, users may use the default value numCore = NULL, allowing the automatic detection of available CPU cores on their devices.

Visualisation

The quantities of most interest in the model are the posterior probabilities of including edge \((i,j)\) within the graph and including the node-level auxiliary variable \(q\) as an influential factor on the graph. The package provides visualisations for both quantities.

Edge inclusion

An illustration involves plotting the simulated graph alongside the graph estimated via our VBECM algorithm (subject to thresholding). In this example, we employ a threshold derived from Bayesian false discovery rate.

# threshold
#
threshold_edge <- get_fdr_threshold(gmss_vbecm$estimates$m_delta[upper.tri(net$A)])

# plot
#
par(mfrow = c(1,2), mar = c(0,1,0,1))
plot_network(net$A, node_names = 1:P)
plot_network(gmss_vbecm$estimates$m_delta >= threshold_edge, node_names = 1:P)

The graph is accurately reconstructed.

Variable inclusion

The next plot displays the simulated regression coefficients and the estimated posterior inclusion probabilities of auxiliary variables.

# threshold
#
threshold_var <- get_fdr_threshold(gmss_vbecm$estimates$m_gamma)

# plot
#
par(mfrow = c(1,2))
plot_ppi(beta_true_gmss, condition = (beta_true_gmss!=0))
plot_ppi(gmss_vbecm$estimates$m_gamma, ylab = "posterior inclusion probability", 
         condition = (gmss_vbecm$estimates$m_gamma >= threshold_var))

Though all the posterior inclusion probabilities are small, the active variable is precisely identified using a threshold based on Bayesian false discovery rate.

Summary of Performance Based on Thresholds

We also provide performance summaries based on a threshold, including the true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), false negative rate (FNR), recall, precision, and F1 score.

# Median probability model with threshold 0.5
perf <- compute_perf(gmss_vbecm$estimates$m_delta[upper.tri(net$A)], net$A[upper.tri(net$A)])
paste(names(perf), sapply(perf, function(x)round(x,2)), sep = ":", collapse = ",")
#> [1] "tpr:1,fpr:0,tnr:1,fnr:0,recall:1,precision:0.91,f1:0.95"

perf <- compute_perf(gmss_vbecm$estimates$m_gamma, beta_true_gmss!=0)
paste(names(perf), sapply(perf, function(x)round(x,2)), sep = ":", collapse = ",")
#> [1] "tpr:0,fpr:0,tnr:1,fnr:1,recall:0,precision:NaN,f1:NaN"

# Bayesian false discovery rates based threshold
perf <- compute_perf(gmss_vbecm$estimates$m_delta[upper.tri(net$A)], net$A[upper.tri(net$A)],
                   threshold = threshold_edge)
paste(names(perf), sapply(perf, function(x)round(x,2)), sep = ":", collapse = ",")
#> [1] "tpr:1,fpr:0,tnr:1,fnr:0,recall:1,precision:0.91,f1:0.95"

perf <- compute_perf(gmss_vbecm$estimates$m_gamma, beta_true_gmss!=0,
                   threshold = threshold_var)
paste(names(perf), sapply(perf, function(x)round(x,2)), sep = ":", collapse = ",")
#> [1] "tpr:1,fpr:0,tnr:1,fnr:0,recall:1,precision:1,f1:1"

Threshold-free Performance

Next, we present a partial receiver operating characteristic (pROC) curve that summarises performance across thresholds.

par(mfrow = c(1,2))
plot_roc(gmss_vbecm$estimates$m_delta[upper.tri(net$A)], net$A[upper.tri(net$A)],
         main = 'pROC curve (edge selection)', fpr_stop = 0.2)
plot_roc(gmss_vbecm$estimates$m_gamma, beta_true_gmss!=0,
         main = 'pROC curve (variable selection)', fpr_stop = 0.2)

Alternatively, we can evaluate a standardised partial area under the ROC curve (pAUC).

compute_pauc(gmss_vbecm$estimates$m_delta[upper.tri(net$A)], net$A[upper.tri(net$A)], 
             fpr_stop = 0.2, standardise = T)
#> [1] 0.9988569
compute_pauc(gmss_vbecm$estimates$m_gamma, beta_true_gmss!=0,
             fpr_stop = 0.2, standardise = T)
#> [1] 1

Additional options

Simpler models

The package supports the following simpler models:

A model that includes node-level variables without performing selection can be employed by setting method = 'GMN'. This option proves useful when incorporating a limited number of node-level auxiliary variables.
You can opt for a standard graphical spike-and-slab model by setting method = 'GM'. Its inference through the ECM algorithm is akin to that of EMGS, except we enable learning the scales of continuous spike-and-slab variances \(\tau\).

A graphical spike-and-slab model with normal priors on auxiliary variable coefficients

A graphical spike-and-slab model with normal priors on auxiliary variable coefficients (GMN) imposes normal prior, instead of spike-and-slab priors in GMSS, on regression coefficients,

\[ \begin{aligned} \beta_q \mid \sigma^2 &\sim \text{Normal}(0,\sigma^2) , \quad q=1,\ldots, Q,\\ \sigma^{-2} & \sim \text{Gamma}(a_\sigma, b_\sigma), \end{aligned} \]

Parameters in GMN can be estimated by our VBECM algorithm using the following command:

gmn_vbecm <- navigm(Y = net$Y, V = V, method = 'GMN', numCore = 2)
#> == Parallel exploration of a  grid of spike standard deviations 
#>  v0 =  1e-04 0.06676 0.13342 0.20008 0.26674 0.3334 0.40006 0.46672 0.53338 0.60004 0.6667 0.73336 0.80002 0.86668 0.93334 1 
#>   on  2  cores ... 
#> 
#> ... done. == 
#> 
#> == Select from a grid of spike variance using AIC  ... 
#> 
#> ... done. == 
#> 
#> Select the index 7  i.e., v0 =  0.40006 , the best AIC  =  9147.011 .
#> 
#> Total runtime:  15.20818 secs .

Please be aware that GMSS is more appropriate for this specific data example, as a majority of the auxiliary variables do not significantly impact the graph structure.

A graphical spike-and-slab model without auxiliary variables

A graphical spike-and-slab model without auxiliary variables (GM) eliminates the dependence of edge inclusion probability on auxiliary variables, i.e., \[\delta_{ij} \mid \rho \sim \text{Bernoulli}(\rho). \] Two prior choices are available on \(\rho\),

version = 1: \[\rho \sim \text{Beta}(a_\rho, b_\rho), \] i.e., the same as EMGS.
version = 2: \[\Phi(\rho) \sim \text{Normal}(n_0, t_0^2), \] which is more comparable with our extensions.

The following estimates parameters in GM with a beta prior on edge inclusion by VBECM:

gm_vbecm <- navigm(Y = net$Y, method = 'GM', version = 1, numCore = 2)
#> == Parallel exploration of a  grid of spike standard deviations 
#>  v0 =  1e-04 0.06676 0.13342 0.20008 0.26674 0.3334 0.40006 0.46672 0.53338 0.60004 0.6667 0.73336 0.80002 0.86668 0.93334 1 
#>   on  2  cores ... 
#> 
#> ... done. == 
#> 
#> == Select from a grid of spike variance using AIC  ... 
#> 
#> ... done. == 
#> 
#> Select the index 7  i.e., v0 =  0.40006 , the best AIC  =  9146.949 .
#> 
#> Total runtime:  2.250996 secs .

Moreover, we have implemented an ECM algorithm as in the literature. However, this algorithm has a longer runtime and yields less accurate estimations, making it not recommended in practice.