Open Chemical Information and Research

Tuesday, April 7, 2015

Link Prediction Using Network Regression Model

This time on link prediction, rather than using a regular Random walk o predict links i will be using a new approach to predict links like regression . We know that the main approach of predicting links uses measures of network proximity adapted from graph theory to determine which unconnected unconnected nodes are “close together” in the topology of the network.We can model the link prediction problem as a supervised classification task, where each data point corresponds to a pair of vertices in the social network graph.We assume u, v ∈ V are two vertices in the graph G(V, E) and the label of the data point (u,v) is $\tiny y^{u,v}$ . Also we assume that the interactions between u and v are symmetric, so the pair (u,v) and (v,u) represent the same data point,

$y^{u,v} = \begin{cases} & 1 , { if } (u,v) \in E \\ & 0 , { if } (v,u) \notin E \end{cases}$

Using the above labeling for a set of training data points, we build a classification model that can predict the unknown labels of a pair of vertices (u,v) . This is a typical binary classification task and any of the popular supervised classification tools, such as naive Bayes, neural networks and support vector machines (SVM) can be used. But, the major challenge in this approach is to choose a set of features for the classification task.

In selecting feature sets for link prediction, each data point corresponds (u,v) to a pair of vertices with the label denoting their link status, so the chosen features should represent some form of proximity between the pair of vertices . Features like the attributes of vertices and edges can be very good features for many application domains along with some graph topology features. Some of the feature include like described below :

Common Neighbors : For two nodes, x and y, the size of their common neighbors is defined as $\tiny \left | \Gamma (x) \cap \Gamma (y) \right |$ The idea of using the size of common neighbors is just an attestation to the network transitivity property.
Jaccard Coefficient : The common neighbors metric is not normalized, so one can use the Jaccard Coefficient, which normalizes the size of common neighbors as below: $\small Jaccard Coefficient = \frac{ \left | \Gamma (x) \cap \Gamma (y) \right |}{ \left | \Gamma (x) \cup \Gamma (y) \right |}$
Adamic/Adar : Adamic and Adar proposed a score as a metric of similarity between two web pages. For a set of features z, it is defined as

$\large \sum_{z: feature shared by u and v} \frac{1}{log(frequency(z))}$

However there are other types of paths like Shortest Path Distance, Katz , Hitting time, Rooted Page Rank , Preferential attachment and more. I am writing a R package to compute all of these features and apart from that includes algorithms for link prediction in bipartite networks.

However the challenge of this type of modeling is extreme class skewness. The number of possible links is quadratic in the number of vertices in a social network, however the number of actual links (the edges in the graph) added to the graph is only a tiny fraction of this number. Large class skewness, causing training and inference to become difficult tasks. We can think of poor performance in the learning algorithm due to class imbalances. To tackle class skewness, existing research suggests different approaches, such as altering the training sample by up-sampling or down-sampling or cost-sensitive learning and also more generally by treating the classifier score with different thresholds.

Another important challenge in link prediction is model calibration (Model calibration is the process to find the function that transforms the output score value of the model to a label)which is more crucial than finding the right algorithm to build the classification model. In this post i am neither going to discuss about all this but to give you the code to do link prediction . Other challenges for model performance is up to the reader , one can modify the code and work on model calibration and and skewness issues.

The code below uses Drug Target binary adjacency matrix to compute the jaccard coefficient , Dice and Adamic adar score and use chemical similarity to predict chemical - chemical links. From the adjacency matrix I projected it into chemical-chemical binary matrix where two drugs are connected if they share a single protein among themselves. I used the chemical chemical binary pairs as true labels From the original graph i sampled some nodes and remove some nodes from the graph in order to predict the those links from the original graph. I didn't do much on statistics part like ROC curve AUC may be on my next post. Let me know if you have any questions in understanding the R code. For the data i used Enzyme.rda data

Wednesday, February 25, 2015

Link Prediction using Network based Inference - A quick matrix based implementation

I explored a paper proposed by Zhou etal used Network based Inference(NBI) method to predict missing links in bipartite network and was thinking a lot how to implement using some simple matrix ways. I have taken the pic below from Zhou paper above to explain the idea .Given the bipartite graph , a two phase resource transfer Information from X(x,y,z) set of nodes gets distributed to Y set of nodes and then again goes back to resource X . This process allows us to define a technique for the calculation of the weight matrix W. In 2010 a modified version of this approach is proposed in Solving the apparent diversity-accuracy dilemma of recommender systems which used a modified Hybrid algorithm in which the functions defined in NBI and HeatS are combined in connection with a parameter called λ.

In this post i am going to implement the algorithm how does this work using simple matrix method in R. Interested readers must see those publications for the mathematical equations explained. Before going a bit further , if we are given a weight matrix W( which is calculated using the algorithms above) and the adjacency matrix A of the bipartite network, it is possible to compute the recommendation matrix R using the equation below, where W is n x n matrix and A is n x m matrix .

R = W.A (1)

The R list is then sorted in a descending order with respect to the score.

We use this kind of calculations in chemo-genomics predictions and also other bipartite type data. When doing Drug target prediction we can use W is as the sequence similarity matrix and A as the Drug target adjacency matrix to obtain recommendation of targets based on sequence similarity . Similarity W can be a compound similarity matrix and A the bipartite compound target matrix. Now we can use equation (1) above to get recommendations of compounds given a sequence of interest. This trick of using matrix just blowed my mind off !! Isn't it cool ?

Now for the functions here it goes below. If you are using the codes do let me know the results how does it work. My next post would be integrating similarity matrices information along with the degree information into W.

Tuesday, February 10, 2015

KEGG Data Errors. I am pissed off !!

I am onto my Phd thesis working day and night but after running my calculations I found the results were not as much promising as I was expecting. In order to build a predictive model you got to have a train set and a test set , so as I made. I was checking some pathway and disease associations from my model. I used a dataset from CPDB which is a collection of all pathways from different databases and a nice resource to do enrichment studies and network analysis.

As my results were not much promising I went to check the dataset whether they are ok or not. I am predicting association of an OMIM disease Glaucoma with the pathways . In the test set I had various Reactome pathways but not the pathway from KEGG hsa03008, this is unexpected. I went into the pathway page and saw that they mention about the disease Glaucoma . Now I was specifically interested on the OPTN gene because its one of the primary genes for the disease. If you go to the KEGG disease page of Glaucoma you can see the OPTN gene name exists. Moving onto the KEGG page for OPTN I didn't found any pathway associations mentioned. I went onto the pathway page hsa03008 where I didn't notice the name of the gene mentioned. Also other essential genes which are mentioned on the disease page MYOC, CYP1B1, NTF4 were not linked to the pathways except for WDR36.
People analyzing on these unaccounted data is missing a lot of information and even Data analyst's interpreting it wrong.

KEGG guys needs to map the data in right way and provide the right information . This was just an small example there are others also for which i am pissed off a lot !!

Sunday, January 11, 2015

Tuberculosis and the Global Clinical Study Map using Clinical Trials data from Web .

People around the world through blogs , newsletters and other forms of media showing how deadly diseases like TB are spreading and killing people. Between last 2-3 years lot of people have come up with ideas like open source development in drugs , lot of consortium have been formed to move with the research. With a bit of data mining on the clinical trial data we can easily see whats going on with the drug discovery arena . Using some r libraries we can easily access the clinical trial data from clinicaltrail.gov website and parse the xml files . I searched for "tuberculosis" with the start date and 1/1/2013 and end date 12/31/2014 and the trials with a "closed" status including all the age groups. After looking at this map I would ask the Indian Health ministry, Is India waiting for the west to develop its drugs or it does want the people to die of this lethal disease ? Only 3 clinical trails are being done but none of them are Phase I-IV trails for new drugs. Is the scientific community just abandon their hands on this ? In USA two medical centers(Lincoln and cleveland) already seem to be working on the new drugs, one is bedaquiline (NCT02216331) and other SQ109(NCT01874314). I didn't look on European centers and Africa. The code for this available @ git . I used rMaps library with Leaflet.js library. One can play with any disease and with different start and end date and with trial status. The Map below (2013-14) is iframe embedded on html which is done using the command using rMaps package -

map$show('iframesrc',cdn=TRUE)

I found some issues with with the geocode function from the ggmap package takes a lot of time to access the google api for latitude and longitude . If anyone has a faster way to get those let me know.

The first maps shows the clinical trials from 2009-2014 which designates a lot of Trials being done on TB in India . But from 2013-14 no further interest was made on TB .

Map from 2013 -14