navigm
is an R package designed to integrate node-level
auxiliary variables into Gaussian graphical spike-and-slab models.
Unlike traditional Gaussian graphical models, it features the option to
incorporate and select node-level variables that could influence node
centrality. This package implements a new variational Bayes expectation
conditional maximisation (VBECM) algorithm for model inference.
Additionally, it provides further options on simpler models and the
expectation conditional maximisation (ECM) algorithm.
First, load the package using the command
The primary function navigm
operates on a data matrix
\(\mathbf{Y}\) with dimensions \(N \times P\), where \(N\) represents the number of samples and
\(P\) corresponds to the number of
nodes within the graph. In addition, it can also accommodate an extra
input matrix \(\mathbf{V}\) with
dimensions \(P \times Q\). This matrix
comprises node-level auxiliary variables that may potentially influence
node degrees and the dependence structure, where \(Q\) indicates the number of such auxiliary
variables. In this vignette, let us assume \(N
= 200\) samples, \(P = 50\)
nodes, and \(Q = 10\) auxiliary
variables, with one variable impacting node degrees and others remaining
“inactive”.
We create the matrix of node-level variables \(\mathbf{V}\) from a right-skewed beta distribution in an independent fashion. The effect size of the sole (randomly drawn) active auxiliary variable is generated using a log-normal distribution centered at 0.7. For the remaining variables, their effect sizes are all set to 0. These effect sizes are represented by the vector \(\boldsymbol{\beta}\). Both components contribute to the hub propensity via \(\mathbf{V}\boldsymbol{\beta}\).
Set \(\zeta = -1.2\). The probability of including edge \((i,j)\) is determined by: \[ \Phi (\zeta + \sum_{q=1}^Q V_{iq} \beta_q + \sum_{q=1}^Q V_{jq} \beta_q), \] where \(\Phi(\cdot)\) represents the normal cumulative density function. If the resulting probability is greater than or equal to 0.5, the edge \((i,j)\) exists; otherwise, it does not, thus defining the adjacency matrix. Subsequently, we generate \(N=100\) samples from a multivariate normal distribution with mean \(\mathbf{0}\) and a precision matrix that adheres to the specified adjacency structure.
set.seed(123)
# problem set-up
#
N <- 200
P <- 50
Q <- 10
Q0 <- 1 # Q0 <= Q
zeta <- - 1.2
# auxiliary variables
#
V <- simulate_auxiliary_matrix(P, Q)
# effect sizes
#
beta0 <- 0.7
sig2_beta0 <- 0.1
beta_true <- rlnorm(Q0, log(beta0), sig2_beta0)
beta_true_gmss <- rep(0, Q)
beta_true_gmss[sample(1:Q, Q0)] <- beta_true
# adjacency matrix
#
theta <- V %*% matrix(beta_true_gmss, ncol = 1)
pe <- matrix(theta, nrow = P, ncol = P)
pe <- pe + t(pe) + zeta
pe <- pnorm(pe)
A <- 0 + (pe >= 0.5)
diag(A) <- 0
#
net <- simulate_data_from_adjacency_matrix(N = N, A = A)
The sparsity of the simulated adjacency matrix is approximately 0.82%. The illustration below showcases the simulated data matrix \(\mathbf{Y}\) (left) and the node-level variable matrix \(\mathbf{V}\) (right). The vertical y-axes correspond to nodes, while the horizontal x-axes correspond to samples (on the left) and auxiliary variables (on the right).
The standard approach estimates the conditional independence structure between nodes using \(\mathbf{Y}\). This package novelly integrates an independent matrix \(\mathbf{V}\) for a dual purpose: assessing the significance of auxiliary variables and enhancing the accuracy of graph estimates. The default model within this package is thus a Gaussian graphical spike-and-slab model with a spike-and-slab prior for the coefficients of node-level auxiliary variables (GMSS). The formulation of a Gaussian graphical spike-and-slab model is as follows:
\[ \begin{aligned} \mathbf{y}_1, ..., \mathbf{y}_n &\overset{\text{iid}}{\sim} \text{Normal}(\mathbf{0}, \boldsymbol{\Omega}^{-1}), \quad n = 1, ..., N, \quad \boldsymbol{\Omega} \in \mathcal{M}^+, \\ \omega_{ij}\mid \delta_{ij}, \tau &\sim \delta_{ij}\text{Normal}\left(0, \nu_1^2\tau^{-1}\right) + (1-\delta_{ij}) \text{Normal}\left(0, \nu_0^2\tau^{-1}\right),\quad \nu_0 \ll \nu_1,\quad 1 \leq i<j \leq P,\\ \omega_{ii} &\sim \mathrm{Exp}\left({\lambda/2}\right), \quad i = 1, \ldots, P,\\ \delta_{ij} \mid \rho &\sim \text{Bernoulli}(\rho_{ij}),\\ \tau &\sim \text{Gamma}(a_\tau, b_\tau), \end{aligned} \] where \(\mathcal{M}^+\) denotes the set of symmetric positive definite matrices, \(\nu_0\) and \(\nu_1\) represent small and large constants respectively, and \(a_\tau, b_\tau\) are hyperparameters.
Subsequently, we establish the impact of node-level auxiliary variables on the edge inclusion by linking \(\mathbf{V}\) with \(\rho_{ij}\). The incorporation of a spike-and-slab prior for the coefficients of node-level auxiliary variables aids in the selection of influential variables, \[ \begin{aligned} \rho_{ij} &= \Phi\left(\zeta + \sum_{q=1}^Q V_{iq}\beta_q + \sum_{q=1}^Q V_{jq} \beta_q\right), \\ \beta_q \mid \gamma_q, \sigma^2 &\sim \gamma_{q}\text{Normal}(0,\sigma^2) + (1-\gamma_{q}) \delta(\beta_q) , \quad q=1,\ldots, Q, \quad \gamma_q\mid o \sim \text{Bernoulli}(o), \quad \\ \zeta &\sim \text{Normal}(n_0, t_0^2), \quad \sigma^{-2} \sim \text{Gamma}(a_\sigma, b_\sigma), \quad o \sim \text{Beta}(a_o, b_o), \end{aligned} \] where \(\Phi(\cdot)\) is the standard normal cumulative distribution function, \(n_0, t_0^2\) are hyperparameters to induce graph sparsity, \(a_o, b_o\) are hyperparameters to induce auxiliary variable sparsity, and \(a_\sigma, b_\sigma\) are hyperparameters on slab variances. While the package provides default choices for these hyperparameters, users retain the flexibility to modify them. For more comprehensive insights and explanations on hyperparameters, please consult our paper.
To infer model parameters, we have developed a VBECM algorithm that can be implemented using the following command:
gmss_vbecm <- navigm(Y = net$Y, V = V, numCore = 2)
#> == Parallel exploration of a grid of spike standard deviations
#> v0 = 1e-04 0.06676 0.13342 0.20008 0.26674 0.3334 0.40006 0.46672 0.53338 0.60004 0.6667 0.73336 0.80002 0.86668 0.93334 1
#> on 2 cores ...
#>
#> ... done. ==
#>
#> == Select from a grid of spike variance using AIC ...
#>
#> ... done. ==
#>
#> Select the index 7 i.e., v0 = 0.40006 , the best AIC = 9146.972 .
#>
#> Total runtime: 4.369285 secs .
Please note that in this vignette, the setting
numCore = 2
is limited due to an issue discussed in this
link. However, for real-world applications, users may use the
default value numCore = NULL
, allowing the automatic
detection of available CPU cores on their devices.
The quantities of most interest in the model are the posterior probabilities of including edge \((i,j)\) within the graph and including the node-level auxiliary variable \(q\) as an influential factor on the graph. The package provides visualisations for both quantities.
An illustration involves plotting the simulated graph alongside the graph estimated via our VBECM algorithm (subject to thresholding). In this example, we employ a threshold derived from Bayesian false discovery rate.
# threshold
#
threshold_edge <- get_fdr_threshold(gmss_vbecm$estimates$m_delta[upper.tri(net$A)])
# plot
#
par(mfrow = c(1,2), mar = c(0,1,0,1))
plot_network(net$A, node_names = 1:P)
plot_network(gmss_vbecm$estimates$m_delta >= threshold_edge, node_names = 1:P)
The graph is accurately reconstructed.
The next plot displays the simulated regression coefficients and the estimated posterior inclusion probabilities of auxiliary variables.
# threshold
#
threshold_var <- get_fdr_threshold(gmss_vbecm$estimates$m_gamma)
# plot
#
par(mfrow = c(1,2))
plot_ppi(beta_true_gmss, condition = (beta_true_gmss!=0))
plot_ppi(gmss_vbecm$estimates$m_gamma, ylab = "posterior inclusion probability",
condition = (gmss_vbecm$estimates$m_gamma >= threshold_var))
Though all the posterior inclusion probabilities are small, the active variable is precisely identified using a threshold based on Bayesian false discovery rate.
We also provide performance summaries based on a threshold, including the true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), false negative rate (FNR), recall, precision, and F1 score.
# Median probability model with threshold 0.5
perf <- compute_perf(gmss_vbecm$estimates$m_delta[upper.tri(net$A)], net$A[upper.tri(net$A)])
paste(names(perf), sapply(perf, function(x)round(x,2)), sep = ":", collapse = ",")
#> [1] "tpr:1,fpr:0,tnr:1,fnr:0,recall:1,precision:0.91,f1:0.95"
perf <- compute_perf(gmss_vbecm$estimates$m_gamma, beta_true_gmss!=0)
paste(names(perf), sapply(perf, function(x)round(x,2)), sep = ":", collapse = ",")
#> [1] "tpr:0,fpr:0,tnr:1,fnr:1,recall:0,precision:NaN,f1:NaN"
# Bayesian false discovery rates based threshold
perf <- compute_perf(gmss_vbecm$estimates$m_delta[upper.tri(net$A)], net$A[upper.tri(net$A)],
threshold = threshold_edge)
paste(names(perf), sapply(perf, function(x)round(x,2)), sep = ":", collapse = ",")
#> [1] "tpr:1,fpr:0,tnr:1,fnr:0,recall:1,precision:0.91,f1:0.95"
perf <- compute_perf(gmss_vbecm$estimates$m_gamma, beta_true_gmss!=0,
threshold = threshold_var)
paste(names(perf), sapply(perf, function(x)round(x,2)), sep = ":", collapse = ",")
#> [1] "tpr:1,fpr:0,tnr:1,fnr:0,recall:1,precision:1,f1:1"
Next, we present a partial receiver operating characteristic (pROC) curve that summarises performance across thresholds.
par(mfrow = c(1,2))
plot_roc(gmss_vbecm$estimates$m_delta[upper.tri(net$A)], net$A[upper.tri(net$A)],
main = 'pROC curve (edge selection)', fpr_stop = 0.2)
plot_roc(gmss_vbecm$estimates$m_gamma, beta_true_gmss!=0,
main = 'pROC curve (variable selection)', fpr_stop = 0.2)
Alternatively, we can evaluate a standardised partial area under the ROC curve (pAUC).
The package supports the following simpler models:
A model that includes node-level variables without performing
selection can be employed by setting method = 'GMN'
. This
option proves useful when incorporating a limited number of node-level
auxiliary variables.
You can opt for a standard graphical spike-and-slab model by
setting method = 'GM'
. Its inference through the ECM
algorithm is akin to that of EMGS,
except we enable learning the scales of continuous spike-and-slab
variances \(\tau\).
A graphical spike-and-slab model with normal priors on auxiliary variable coefficients (GMN) imposes normal prior, instead of spike-and-slab priors in GMSS, on regression coefficients,
\[ \begin{aligned} \beta_q \mid \sigma^2 &\sim \text{Normal}(0,\sigma^2) , \quad q=1,\ldots, Q,\\ \sigma^{-2} & \sim \text{Gamma}(a_\sigma, b_\sigma), \end{aligned} \]
Parameters in GMN can be estimated by our VBECM algorithm using the following command:
gmn_vbecm <- navigm(Y = net$Y, V = V, method = 'GMN', numCore = 2)
#> == Parallel exploration of a grid of spike standard deviations
#> v0 = 1e-04 0.06676 0.13342 0.20008 0.26674 0.3334 0.40006 0.46672 0.53338 0.60004 0.6667 0.73336 0.80002 0.86668 0.93334 1
#> on 2 cores ...
#>
#> ... done. ==
#>
#> == Select from a grid of spike variance using AIC ...
#>
#> ... done. ==
#>
#> Select the index 7 i.e., v0 = 0.40006 , the best AIC = 9147.011 .
#>
#> Total runtime: 15.20818 secs .
Please be aware that GMSS is more appropriate for this specific data example, as a majority of the auxiliary variables do not significantly impact the graph structure.
A graphical spike-and-slab model without auxiliary variables (GM) eliminates the dependence of edge inclusion probability on auxiliary variables, i.e., \[\delta_{ij} \mid \rho \sim \text{Bernoulli}(\rho). \] Two prior choices are available on \(\rho\),
version = 1
: \[\rho \sim
\text{Beta}(a_\rho, b_\rho), \] i.e., the same as EMGS.
version = 2
: \[\Phi(\rho)
\sim \text{Normal}(n_0, t_0^2), \] which is more comparable with
our extensions.
The following estimates parameters in GM with a beta prior on edge inclusion by VBECM:
gm_vbecm <- navigm(Y = net$Y, method = 'GM', version = 1, numCore = 2)
#> == Parallel exploration of a grid of spike standard deviations
#> v0 = 1e-04 0.06676 0.13342 0.20008 0.26674 0.3334 0.40006 0.46672 0.53338 0.60004 0.6667 0.73336 0.80002 0.86668 0.93334 1
#> on 2 cores ...
#>
#> ... done. ==
#>
#> == Select from a grid of spike variance using AIC ...
#>
#> ... done. ==
#>
#> Select the index 7 i.e., v0 = 0.40006 , the best AIC = 9146.949 .
#>
#> Total runtime: 2.250996 secs .
Moreover, we have implemented an ECM algorithm as in the literature. However, this algorithm has a longer runtime and yields less accurate estimations, making it not recommended in practice.