I am writing a blog post after a long time. The material is quite interesting to me: the results indicate that some fingerprints are related to a compound's side effect profile, as well as to its therapeutic area and to the target similarity between two compounds.
So let's see what the study tells us.
I mapped around 746 compounds between DrugBank and SIDER, first by CID and then by compound name, manually checked the matches, and made a final dataset for my study. The recent SIDER release provides 4500 side effect profiles. These side effect profiles are used to create a compound-side effect binary matrix.
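As a rough illustration of the binary matrix described above, here is a minimal, dependency-free sketch. The compound and side effect names are made-up placeholders, not SIDER data, and `build_binary_matrix` is a hypothetical helper, not the code used in the study.

```python
# Hypothetical (compound, side effect) pairs; real pairs would come from SIDER.
pairs = [
    ("drug_a", "se_2"),
    ("drug_a", "se_3"),
    ("drug_b", "se_1"),
    ("drug_b", "se_2"),
    ("drug_b", "se_3"),
    ("drug_c", "se_3"),
]


def build_binary_matrix(pairs):
    """Return (compounds, side_effects, matrix) where matrix[i][j] is 1
    if compound i is annotated with side effect j, else 0."""
    compounds = sorted({c for c, _ in pairs})
    side_effects = sorted({s for _, s in pairs})
    row = {c: i for i, c in enumerate(compounds)}
    col = {s: j for j, s in enumerate(side_effects)}
    matrix = [[0] * len(side_effects) for _ in compounds]
    for c, s in pairs:
        matrix[row[c]][col[s]] = 1
    return compounds, side_effects, matrix


compounds, side_effects, matrix = build_binary_matrix(pairs)
for name, bits in zip(compounds, matrix):
    print(name, bits)
```

Rows are compounds and columns are side effects. For the real dataset a sparse matrix would probably be a better fit, since most entries are zero.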
The side effects are used to calculate compound similarity using matrix multiplication followed by normalization. The algorithm I used is discussed in the Metapath paper. It provides a fast and efficient way to calculate the Dice coefficient, which I used as the similarity measure; there is more on this at Rajarshi Guha's blog.
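The multiply-then-normalize idea can be sketched as follows, assuming the usual PathSim form: with a binary matrix A, the product M = A·Aᵀ counts shared side effects, and 2·M[i][j] / (M[i][i] + M[j][j]) is then exactly the Dice coefficient of the two side effect sets. The toy matrix and the function name `pathsim` are my own illustration, not the original code.

```python
def pathsim(binary_matrix):
    """Pairwise similarity from a binary compound x feature matrix.

    M = A * A^T counts shared features; normalizing by the diagonal gives
    2 * M[i][j] / (M[i][i] + M[j][j]), the Dice coefficient for binary rows.
    """
    n = len(binary_matrix)
    # Commuting matrix M = A * A^T, written out with plain lists.
    m = [
        [sum(a * b for a, b in zip(binary_matrix[i], binary_matrix[j]))
         for j in range(n)]
        for i in range(n)
    ]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = m[i][i] + m[j][j]
            # Two compounds with no annotations at all get similarity 0.
            sim[i][j] = 2.0 * m[i][j] / denom if denom else 0.0
    return sim


# Toy matrix: 3 hypothetical compounds x 3 hypothetical side effects.
A = [
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 1],
]
S = pathsim(A)
print(S[0][1])  # 2 shared of (2 + 3) total annotations
```

On real data the same computation is a single sparse matrix product, which is what makes the approach fast.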
From the dataset of 746 compounds I removed metals and compounds with a molecular weight greater than 1000, which brought it down to 728 compounds.
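A filter of this kind might look like the sketch below. Everything in it is an assumption for illustration: the record layout, the short `METALS` set (not exhaustive), and the reading of "metals" as "compounds containing a metal atom". A real pipeline would more likely read elements and weights from structures with a cheminformatics toolkit.

```python
# Illustrative, non-exhaustive set of metal element symbols (assumption).
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Zn", "Cu", "Pt", "Au", "Al", "Gd"}
MAX_MOL_WEIGHT = 1000.0  # compounds heavier than this are dropped

# Hypothetical records; names, elements and weights are placeholders.
records = [
    {"name": "compound_a", "elements": {"C", "H", "O"}, "mol_weight": 180.2},
    {"name": "compound_b", "elements": {"C", "H", "N", "Pt"}, "mol_weight": 300.0},
    {"name": "compound_c", "elements": {"C", "H", "N", "O"}, "mol_weight": 1202.6},
]


def keep(record):
    """Keep a compound only if it has no metal atom and weighs at most 1000."""
    has_metal = bool(record["elements"] & METALS)
    return not has_metal and record["mol_weight"] <= MAX_MOL_WEIGHT


filtered = [r for r in records if keep(r)]
print([r["name"] for r in filtered])
```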
I also created a compound-target matrix and a compound-ATC code matrix (data collected from DrugBank). From the three matrices I calculated the similarity of compounds by side effect, ATC code and protein target respectively, using the metapath algorithm.
Earlier work by Yamanishi on predicting side effects from substructure profiles, and on drug target prediction using learning models, produced very good papers. One important paper, from Campillos, was on drug target identification using side effect similarity; it focused on similar side effect profiles, which are somewhat similar to off-target profiles. Another paper (SLAP), from our lab at Indiana University, used semantics to identify targets and was a good improvement in integrating multiple heterogeneous networks to predict targets.
For my work I used PubChem fingerprints, MACCS keys (166), ECFP4, ECFP6, FCFP4 and FCFP6 to find how strongly these fingerprints relate to side effects, ATC codes and proteins.
I used Tanimoto similarity for the fingerprint-based similarity of compounds, and the PathSim similarity for side effects, ATC codes and proteins. After a simple correlation study I found that, among the fingerprints, PubChem fingerprints have the highest correlation with side effects. The correlation is only about 0.16, but it is statistically significant (p-value < 2.2e-16), and the same holds for MACCS keys and the extended connectivity fingerprints. Below I provide the correlation matrix of the different fingerprint, side effect, ATC code and protein similarities.
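To make the comparison concrete, here is a small sketch of how such a correlation could be computed. It assumes a Pearson correlation taken over all unordered compound pairs, which is my reading of "simple correlation study" and not something stated above. The fingerprints and side effect similarities are invented toy values, and `tanimoto` and `pearson` are illustrative helpers.

```python
from itertools import combinations
from math import sqrt


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical fingerprints as sets of on-bit positions.
fingerprints = {
    "drug_a": {1, 2, 3, 4},
    "drug_b": {2, 3, 4, 5},
    "drug_c": {7, 8},
    "drug_d": {1, 2, 8},
}

# Hypothetical side effect similarities for each unordered pair.
side_effect_sim = {
    ("drug_a", "drug_b"): 0.8,
    ("drug_a", "drug_c"): 0.1,
    ("drug_a", "drug_d"): 0.3,
    ("drug_b", "drug_c"): 0.2,
    ("drug_b", "drug_d"): 0.4,
    ("drug_c", "drug_d"): 0.1,
}

# One value per unordered compound pair, in the same order for both measures.
pair_list = list(combinations(sorted(fingerprints), 2))
structure = [tanimoto(fingerprints[a], fingerprints[b]) for a, b in pair_list]
phenotype = [side_effect_sim[p] for p in pair_list]

r = pearson(structure, phenotype)
print(f"correlation between fingerprint and side effect similarity: {r:.3f}")
```

Repeating this for every pair of similarity measures fills in a matrix like the one in this post. Note that pairwise similarities are not independent observations, so a p-value from a standard test should be read with some caution.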
Figure: PubChem fingerprint similarity distribution

Figure: Side effect distribution
| | sideffect | pubchem | maccs | ECFP4 | ECFP6 | FCFP4 | FCFP6 | atc | protein |
|---|---|---|---|---|---|---|---|---|---|
| sideffect | 1 | 0.160535 | 0.156551 | 0.142548 | 0.140111 | 0.125838 | 0.12807 | 0.055374 | 0.097784 |
| pubchem | 0.160535 | 1 | 0.585017 | 0.604037 | 0.566609 | 0.617526 | 0.576539 | 0.018368 | 0.132186 |
| maccs | 0.156551 | 0.585017 | 1 | 0.523484 | 0.491311 | 0.477387 | 0.434868 | 0.014732 | 0.169165 |
| ECFP4 | 0.142548 | 0.604037 | 0.523484 | 1 | 0.990811 | 0.757584 | 0.777071 | 0.096187 | 0.289458 |
| ECFP6 | 0.140111 | 0.566609 | 0.491311 | 0.990811 | 1 | 0.747782 | 0.787673 | 0.102716 | 0.295364 |
| FCFP4 | 0.125838 | 0.617526 | 0.477387 | 0.757584 | 0.747782 | 1 | 0.981399 | 0.094613 | 0.263978 |
| FCFP6 | 0.12807 | 0.576539 | 0.434868 | 0.777071 | 0.787673 | 0.981399 | 1 | 0.106107 | 0.283874 |
| atc | 0.055374 | 0.018368 | 0.014732 | 0.096187 | 0.102716 | 0.094613 | 0.106107 | 1 | 0.166476 |
| protein | 0.097784 | 0.132186 | 0.169165 | 0.289458 | 0.295364 | 0.263978 | 0.283874 | 0.166476 | 1 |
Among the fingerprints, ECFP and FCFP show the best correlation with protein similarity (about 0.26 to 0.30) and with ATC code similarity (about 0.09 to 0.11). On the other hand, the PubChem fingerprint shows the best correlation with side effects (0.16).
This raises a question: if the correlation is this low, how well can the approach in Yamanishi's paper predict the true relations? Can different substructure methods correlate with side effects?
Well, I am still thinking about what is behind the side effects and the impact of the right substructures on them.
 
