## The D2C algorithm

The D2C (Dependence to Causality) algorithm aims to infer the existence of causal dependencies from static (i.e. non temporal) observational and multivariate data.

The D2C algorithm predicts the existence of a direct causal link between two variables in a multivariate setting by (i) creating a set of of features of the relationship based on asymmetric descriptors of the multivariate dependency and (ii) using a classifier (in this case a Random Forest) to learn a mapping between the features and the presence of a causal link. The approach relies on the asymmetry of some conditional (in)dependence relations between the members of the Markov blankets of two variables causally connected.

This vignette shows how to create a training set for D2C, how to create a D2C object and how to assess it with a number of simulated datasets. A comparison with two inference algorithms implemented in the bnearn package is provided too.

## Creation of training data

Let us start by creating a set of Directed Acyclic Graphs (DAGs) and the associated simulated datasets.

The object simulatedDAG can be initialized by setting the number NDAG of DAGs, the range noNodes of the number of nodes, the range N of observed samples, the type functionType of dependency and the range sdn of standard deviation of the additive noise.

rm(list=ls())
library(D2C)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
require(RBGL)
## Loading required package: RBGL
## Loading required package: graph
require(gRbase)
## Loading required package: gRbase
noNodes<-c(10,20)
## range of number of nodes

N<-c(50,200)
## range of number of samples

sd.noise<-c(0.2,1)
## range of values for standard deviation of additive noise

NDAG=100
## number of DAGs to be created and simulated

trainDAG<-new("simulatedDAG",NDAG=NDAG, N=N, noNodes=noNodes,
seed=0,sdn=sd.noise,quantize=c(TRUE,FALSE),verbose=FALSE)

We can now check the number of DAGs

print(trainDAG@NDAG)
## [1] 100

If it happens that the number is smaller than NDAG, this is due to the fact that DAGs with no edges are removed.

Let us visualize also the first simulated DAG and the dimension of the related datasets:

print(trainDAG@list.DAGs[[1]])
## A graphNEL graph with directed edges
## Number of Nodes = 16
## Number of Edges = 27
print(dim(trainDAG@list.observationsDAGs[[1]]))
## [1] 90 16

## Learning of a D2C object

Once we have a training set of DAGs, we can create a D2C object by storing for (a subset of) the edges of each DAG a vector of descriptors and the related binary label (0 or 1) about the existance of a causal link.

The parameters of the D2C descriptors are contained in a D2C.descriptor object

descr.example<-new("D2C.descriptor",bivariate=FALSE,ns=3,acc=TRUE,lin=TRUE)

trainD2C<-new("D2C",sDAG=trainDAG,
descr=descr.example,ratioEdges=1,max.features=30,verbose=FALSE)

Let us print some properties of the trainD2C object:

print(dim(trainD2C@X))
## [1] 3136   47
print(table(trainD2C@Y))
##
##    0    1
## 1434 1702

Let us print out the Random forest object stored within trainD2C

print(trainD2C@mod)
##
## Call:
##  randomForest(x = X[, rank], y = factor(Y))
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
##
##         OOB estimate of  error rate: 24.43%
## Confusion matrix:
##      0    1 class.error
## 0 1131  303      0.2113
## 1  463 1239      0.2720

## Creation of test data

Let us now create an independent test set (e.g. by creating
a second simulatedDAG object with a different seed of the random generator) and let us store it in the testDAG object:

NDAG.test=50
noNodes<-c(10,20)
## range of number of nodes

N<-c(50,100)

testDAG<-new("simulatedDAG",NDAG=NDAG.test, N=N, noNodes=noNodes,
functionType = c("linear","quadratic","sigmoid"), quantize=c(TRUE,FALSE),
seed=101,sdn=c(0.2,0.5),verbose=FALSE)

## Assessment of accuracy

In this section we assess the accuracy of D2C and we compare it with two algorithms implemented in the package bnlearn. Accuracy is measured by the Balanced Error Rate (BER). For each dataset associated to the simulated DAG contained in testDAG, we infer the directionality of the edges and we compare it with the ground truth.

require(foreach)
if (!require(bnlearn)){
install.packages("bnlearn", repos="http://cran.rstudio.com/")
library(bnlearn)
}
FF<-foreach (r=1:testDAG@NDAG) %do%{
set.seed(r)
observedData<-testDAG@list.observationsDAGs[[r]]
trueDAG<-testDAG@list.DAGs[[r]]

## inference of networks with bnlearn package
Ahat.GS<-amat(gs(data.frame(observedData),alpha=0.01))
Ahat.IAMB<-(amat(iamb(data.frame(observedData),alpha=0.01)))

## selection of a balanced subset of edges for the assessment
Nodes=nodes(trueDAG)
max.edges<-min(30,length(edgeList(trueDAG)))
subset.edges = matrix(unlist(sample(edgeList(trueDAG),
size = max.edges,replace = F)),ncol=2,byrow = TRUE)
subset.edges = rbind(subset.edges,t(replicate(n =max.edges,
sample(Nodes,size=2,replace = FALSE))))

Yhat.D2C<-NULL
Yhat.IAMB<-NULL
Yhat.GS<-NULL
Ytrue<-NULL
for(jj in 1:NROW(subset.edges)){
i =as(subset.edges[jj,1],"numeric");
j =as(subset.edges[jj,2],"numeric") ;
pred.D2C = predict(trainD2C,i,j, observedData)

Yhat.D2C<-c(Yhat.D2C,as.numeric(pred.D2C$response) -1) Yhat.IAMB<-c(Yhat.IAMB,Ahat.IAMB[i,j]) Yhat.GS<-c(Yhat.GS,Ahat.GS[i,j]) Ytrue<-c(Ytrue,graphTRUE[subset.edges[jj,1],subset.edges[jj,2]]) } list(Yhat.D2C=Yhat.D2C,Yhat.GS=Yhat.GS, Yhat.IAMB=Yhat.IAMB,Ytrue=Ytrue) } Yhat.D2C<-unlist(lapply(FF,"[[",1) ) Yhat.GS<-unlist(lapply(FF,"[[",2)) Yhat.IAMB<-unlist(lapply(FF,"[[",3)) Ytrue<-unlist(lapply(FF,"[[",4)) ## computation of Balanced Error Rate BER.D2C<-BER(Ytrue,Yhat.D2C) BER.GS<-BER(Ytrue,Yhat.GS) The average accuracy of the three algorithms (the smaller the BER the better) is shown by typing: cat("\n BER.D2C=",BER.D2C, "BER.IAMB=",BER.IAMB,"BER.GS=",BER.GS,"\n") We invite the user to assess the impact of the number of training samples (e.g. by increasing the value NDAG and retraining D2C) on the accuracy of the network construction. ## Assessment of accuracy with the alarm dataset Let us upload the alarm dataset and let us focus on a highly connected subnetwork of 100 nodes data(alarm) graphTRUE<-true.net set.seed(0) observedData<-dataset w.const<-which(apply(observedData,2,sd)<0.1) if (length(w.const)>0){ observedData<-observedData[,-w.const] graphTRUE<-graphTRUE[-w.const,-w.const] } indn<-sort(apply(graphTRUE,1,sum)+apply(graphTRUE,2,sum),decr=TRUE,index=TRUE)$ix[1:100]
observedData<-observedData[,indn]
graphTRUE<-graphTRUE[indn,indn]

Let us first make the inference with two bnlearn algorithms,

n<-NCOL(observedData)

Ahat.GS<-amat(gs(data.frame(observedData)))
Ahat.IAMB<-amat(iamb(data.frame(observedData),alpha=0.05))

and then with D2C…

Yhat.D2C<-NULL
Yhat.GS<-NULL
Yhat.IAMB<-NULL
Ytrue<-NULL

for (i in 1:n){
## creation of a balanced test set
ind1<-which(graphTRUE[i,]==1)
ind0<-setdiff(setdiff(1:n,i),ind1)
ind<-c(ind1,ind0[1:length(ind1)])

FF<-foreach(j=ind) %do%{
list(Yhat=as.numeric(predict(trainD2C,i,j,observedData)\$response)  -1,
Yhat2=Ahat.GS[i,j],Yhat3=Ahat.IAMB[i,j],Ytrue=graphTRUE[i,j])
}
Yhat.D2C<-c(Yhat.D2C,unlist(lapply(FF,"[[",1) ))
Yhat.GS<-c(Yhat.GS,unlist(lapply(FF,"[[",2)))
Yhat.IAMB<-c(Yhat.IAMB,unlist(lapply(FF,"[[",3)))
Ytrue<-c(Ytrue,unlist(lapply(FF,"[[",4)))
}

You can now print out the balanced error rates (BER) of the three techniques by using:

cat("\n BER.D2C=",BER(Ytrue,Yhat.D2C), "BER.GS=",BER(Ytrue,Yhat.GS),
"BER.IAMB=",BER(Ytrue,Yhat.IAMB),
"\n")