This time on link prediction, rather than using a regular Random walk o predict links i will be using a new approach to predict links like regression . We know that the main approach of predicting links uses measures of network proximity adapted from graph theory to determine which unconnected unconnected nodes are “close together” in the topology of the network.We can model the link prediction problem as a supervised classification task, where each data point corresponds to a pair of vertices in the social network graph.We assume u, v ∈ V are two vertices in the graph G(V, E) and the label of the data point (u,v) is . Also we assume that the interactions between u and v are symmetric, so the pair (u,v) and (v,u) represent the same data point,
Using the above labeling for a set of training data points, we build a classification model that can predict the unknown labels of a pair of vertices (u,v) . This is a typical binary classification task and any of the popular supervised classification tools, such as naive Bayes, neural networks and support vector machines (SVM) can be used. But, the major challenge in this approach is to choose a set of features for the classification task.
In selecting feature sets for link prediction, each data point corresponds (u,v) to a pair of vertices with the label denoting their link status, so the chosen features should represent some form of proximity between the pair of vertices . Features like the attributes of vertices and edges can be very good features for many application domains along with some graph topology features. Some of the feature include like described below :
- Common Neighbors : For two nodes, x and y, the size of their common neighbors is defined as The idea of using the size of common neighbors is just an attestation to the network transitivity property.
- Jaccard Coefficient : The common neighbors metric is not normalized, so one can use the Jaccard Coefficient, which normalizes the size of common neighbors as below:
- Adamic/Adar : Adamic and Adar proposed a score as a metric of similarity between two web pages. For a set of features z, it is defined as
However there are other types of paths like Shortest Path Distance, Katz , Hitting time, Rooted Page Rank , Preferential attachment and more. I am writing a R package to compute all of these features and apart from that includes algorithms for link prediction in bipartite networks.
However the challenge of this type of modeling is extreme class skewness. The number of possible links is quadratic in the number of vertices in a social network, however the number of actual links (the edges in the graph) added to the graph is only a tiny fraction of this number. Large class skewness, causing training and inference to become difficult tasks. We can think of poor performance in the learning algorithm due to class imbalances. To tackle class skewness, existing research suggests different approaches, such as altering the training sample by up-sampling or down-sampling or cost-sensitive learning and also more generally by treating the classifier score with different thresholds.
Another important challenge in link prediction is model calibration (Model calibration is the process to find the function that transforms the output score value of the model to a label)which is more crucial than finding the right algorithm to build the classification model. In this post i am neither going to discuss about all this but to give you the code to do link prediction . Other challenges for model performance is up to the reader , one can modify the code and work on model calibration and and skewness issues.
The code below uses Drug Target binary adjacency matrix to compute the jaccard coefficient , Dice and Adamic adar score and use chemical similarity to predict chemical - chemical links. From the adjacency matrix I projected it into chemical-chemical binary matrix where two drugs are connected if they share a single protein among themselves. I used the chemical chemical binary pairs as true labels From the original graph i sampled some nodes and remove some nodes from the graph in order to predict the those links from the original graph. I didn't do much on statistics part like ROC curve AUC may be on my next post. Let me know if you have any questions in understanding the R code. For the data i used Enzyme.rda data
However the challenge of this type of modeling is extreme class skewness. The number of possible links is quadratic in the number of vertices in a social network, however the number of actual links (the edges in the graph) added to the graph is only a tiny fraction of this number. Large class skewness, causing training and inference to become difficult tasks. We can think of poor performance in the learning algorithm due to class imbalances. To tackle class skewness, existing research suggests different approaches, such as altering the training sample by up-sampling or down-sampling or cost-sensitive learning and also more generally by treating the classifier score with different thresholds.
Another important challenge in link prediction is model calibration (Model calibration is the process to find the function that transforms the output score value of the model to a label)which is more crucial than finding the right algorithm to build the classification model. In this post i am neither going to discuss about all this but to give you the code to do link prediction . Other challenges for model performance is up to the reader , one can modify the code and work on model calibration and and skewness issues.
The code below uses Drug Target binary adjacency matrix to compute the jaccard coefficient , Dice and Adamic adar score and use chemical similarity to predict chemical - chemical links. From the adjacency matrix I projected it into chemical-chemical binary matrix where two drugs are connected if they share a single protein among themselves. I used the chemical chemical binary pairs as true labels From the original graph i sampled some nodes and remove some nodes from the graph in order to predict the those links from the original graph. I didn't do much on statistics part like ROC curve AUC may be on my next post. Let me know if you have any questions in understanding the R code. For the data i used Enzyme.rda data