Wednesday, February 25, 2015

Link Prediction using Network based Inference - A quick matrix based implementation

I explored a paper by Zhou et al. that uses the Network-Based Inference (NBI) method to predict missing links in a bipartite network, and I spent a lot of time thinking about how to implement it with simple matrix operations. I took the figure below from the Zhou et al. paper to explain the idea. Given the bipartite graph, there is a two-phase resource transfer: resource from the X set of nodes (x, y, z) gets distributed to the Y set of nodes, and then flows back to the X nodes. This process lets us define a technique for calculating the weight matrix W. In 2010 a modified version of this approach was proposed in "Solving the apparent diversity-accuracy dilemma of recommender systems", which used a hybrid algorithm in which the functions defined in NBI and HeatS are combined through a parameter called λ.
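To make the two-phase transfer concrete, here is a minimal sketch of how W could be computed. It is written in plain Python rather than R and is not the function set from this post. It assumes the weight definition as I understand it from the two papers, W[i][j] = (1 / (k_i^(1-λ) · k_j^λ)) · Σ_l A[i][l]·A[j][l] / k_l, where k_i and k_j are degrees of X nodes and k_l is the degree of Y node l, so that λ = 1 gives plain NBI and λ = 0 gives HeatS. The function name `nbi_weights` and the parameter name `lam` are made up for illustration.

```python
def nbi_weights(A, lam=1.0):
    """Sketch of the NBI / hybrid weight matrix W (n x n).

    A   : n x m 0/1 adjacency matrix as a list of lists
          (rows = X nodes, columns = Y nodes).
    lam : hybrid parameter; assumed convention is
          lam = 1.0 -> NBI, lam = 0.0 -> HeatS.
    """
    n, m = len(A), len(A[0])
    kx = [sum(row) for row in A]                              # degrees of X nodes
    ky = [sum(A[i][l] for i in range(n)) for l in range(m)]   # degrees of Y nodes

    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if kx[i] == 0 or kx[j] == 0:
                continue  # isolated X nodes send and receive nothing
            # Resource passing from X_j to Y and back to X_i,
            # split equally among the neighbours of each Y node.
            shared = sum(A[i][l] * A[j][l] / ky[l]
                         for l in range(m) if ky[l] > 0)
            W[i][j] = shared / (kx[i] ** (1 - lam) * kx[j] ** lam)
    return W
```

For example, `nbi_weights([[1, 0], [1, 1]])` returns a 2 x 2 matrix whose columns each sum to 1, which is what you expect when every X node hands out all of its resource. The nested loops keep the sketch readable; for real data sizes the same thing is better expressed as matrix products.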

In this post I am going to implement the algorithm and show how it works using a simple matrix method in R. Interested readers should see those publications for the mathematical details. Before going further: if we are given a weight matrix W (calculated using the algorithms above) and the adjacency matrix A of the bipartite network, we can compute the recommendation matrix R using the equation below, where W is an n x n matrix and A is an n x m matrix, so R is also n x m.

    R = W · A      (1)

The scores in R are then sorted in descending order to give the ranked list of recommendations.
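Equation (1) and the sorting step can be sketched in a few lines. Again this is plain Python for illustration, not the R code of this post. The helper names `score_matrix` and `ranked_candidates` are hypothetical, and skipping links that already exist in A is my assumption about how the ranked list would normally be used for link prediction.

```python
def score_matrix(W, A):
    """R = W . A, with W (n x n) and A (n x m) given as lists of lists."""
    n, m = len(A), len(A[0])
    return [[sum(W[i][k] * A[k][j] for k in range(n)) for j in range(m)]
            for i in range(n)]


def ranked_candidates(W, A, j):
    """Rank X nodes for Y node j by score, highest first.

    X nodes already linked to j in A are skipped (an assumption:
    we only want to predict links that are not known yet).
    """
    R = score_matrix(W, A)
    candidates = [(i, R[i][j]) for i in range(len(A)) if A[i][j] == 0]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

With a toy W (hypothetical values) such as `[[0.5, 0.125, 0.0], [0.5, 0.5, 0.5], [0.0, 0.375, 0.5]]` and `A = [[1, 0], [0, 1], [0, 0]]`, calling `ranked_candidates(W, A, 1)` lists X node 2 ahead of X node 0 as the more likely missing link for Y node 1.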

We use this kind of calculation in chemogenomics predictions and on other bipartite data. For drug-target prediction we can use W as the sequence similarity matrix and A as the drug-target adjacency matrix to obtain recommendations of targets based on sequence similarity. Similarly, W can be a compound similarity matrix and A the bipartite compound-target matrix. We can then use equation (1) above to get recommendations of compounds given a sequence of interest. This matrix trick just blew my mind! Isn't it cool?

Now for the functions, which are given below. If you use the code, do let me know how it works for you. My next post will be about integrating similarity matrix information, along with the degree information, into W.

Tuesday, February 10, 2015

KEGG Data Errors. I am pissed off !!

I am working on my PhD thesis day and night, but after running my calculations I found the results were not as promising as I was expecting. To build a predictive model you need a training set and a test set, so I made both. I was checking some pathway and disease associations from my model. I used a dataset from CPDB, which is a collection of pathways from different databases and a nice resource for enrichment studies and network analysis.

As my results were not very promising, I went to check whether the dataset was OK. I am predicting the association of an OMIM disease, Glaucoma, with pathways. In the test set I had various Reactome pathways but not the KEGG pathway hsa03008, which was unexpected. I went to the pathway page and saw that it mentions the disease Glaucoma. I was specifically interested in the OPTN gene because it is one of the primary genes for the disease. If you go to the KEGG disease page for Glaucoma you can see that the OPTN gene is listed. Moving on to the KEGG page for OPTN, I didn't find any pathway associations mentioned. I then went to the pathway page for hsa03008, where I didn't see the gene mentioned either. Other essential genes listed on the disease page (MYOC, CYP1B1, NTF4) were also not linked to the pathways, except for WDR36.
People analyzing such incomplete data are missing a lot of information, and even data analysts end up interpreting it wrong.

The KEGG guys need to map the data the right way and provide the right information. This was just a small example; there are others too, which is why I am so pissed off!!