Tuesday, October 22, 2013

Fast Tanimoto Similarity Calculation using rcdk

Many of you who use R for cheminformatics probably know rcdk. In my experience its Tanimoto similarity calculation takes a long time. The rcdk code looks neat, but the similarity calculation can be performed much faster using inner products. Below is simple code to do that; in my test it ran more than 10 times faster than the rcdk code (timings below), quite an impressive performance boost. I have made a pull request to Rajarshi's code, so it should be available soon in the main package.

## Let m be the binary 0/1 fingerprint matrix (one row per molecule), e.g. the one fp.to.matrix returns for the fingerprints you would pass to fp.sim.matrix ##
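
A minimal sketch of the inner-product idea, not necessarily the exact code from the pull request. For binary rows, the inner product of two fingerprints counts the bits they share, so all pairwise Tanimoto values follow from one matrix product. The helper name `tanimoto_matrix` and the toy matrix are made up for illustration, and scoring two empty fingerprints as 0 is an assumption.

```r
# Tanimoto similarity for all pairs of rows of a binary fingerprint matrix.
# `tanimoto_matrix` is a hypothetical helper name, not an rcdk function.
tanimoto_matrix <- function(m) {
  m <- as.matrix(m)
  storage.mode(m) <- "double"                # avoid integer overflow in the products
  common <- tcrossprod(m)                    # common[i, j]: bits set in both i and j
  bits   <- rowSums(m)                       # bits set in each fingerprint
  either <- outer(bits, bits, "+") - common  # bits set in i or in j
  sim <- common / either
  sim[either == 0] <- 0                      # assumption: two empty fingerprints score 0
  sim
}

# Toy example: three 4-bit fingerprints (made-up data)
m <- rbind(a = c(1, 1, 0, 0),
           b = c(1, 0, 1, 0),
           c = c(1, 1, 0, 0))
sim <- tanimoto_matrix(m)
print(sim)
```

Wrapping the call in `system.time(tanimoto_matrix(m))` gives timings in the same user/system/elapsed format as those shown below.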

Time taken for the new method

   user  system elapsed
  2.962   0.012   2.971

Time taken for the normal method in rcdk

   user  system elapsed
 43.644   0.064  43.707

Dynamic plots with RStudio

A few days back I came to know that RStudio provides dynamic plots: you can plot histograms and move sliders, tick check boxes, and select from a drop-down list which items of your dataset to display. Some examples are below. Quite cool work from the RStudio group.

if (require(manipulate) && require(lattice)) {  # lattice provides histogram()
  manipulate(
    histogram(~ eruptions, data = faithful, n = n),  # slider controls n
    n = slider(5, 40))
}


if (require(manipulate) && require(mosaic)) {  # mosaic (assumed) supplies HELP and the density option
  manipulate(
    histogram(~ age, data = HELP, n = n, density = density),
    n = slider(5, 40),
    density = checkbox())
}

Check box with density plot


if (require(manipulate) && require(mosaic)) {  # mosaic (assumed) supplies HELP and the fit option
  manipulate(
    histogram(~ age, data = HELP, n = n, fit = distribution, dlwd = 4),
    n = slider(5, 40),
    distribution = picker('normal', 'gamma', 'exponential', 'lognormal'))
}

# Dropdown and density

          beside = TRUE, main = factor,density=density),
  factor = picker("mpg", "disp", "hp","drat","wt"),density = checkbox())

This way you can make your R plots, ggplots included, dynamic with the manipulate package.

Monday, July 22, 2013

In January, I started an interesting project using random walks to predict drug-target links. Many papers have recently been published in this domain, one by Chen and another by Xing Chen. It looks like interesting work, and there are several more papers on the topic that you can find with a quick Google search.

The point I want to raise is whether the molecular descriptors they used are adequate. I also made a descriptor-based study, in which some pharmacophore descriptors gave me very good results.
Validation is still an important question: once you have a model, you need to test it on your data. These papers did not show much cross-target-class validation, which led me to research the method further. I am still working on it, but today I will post some of the interesting results I got while working with the random walk with restart algorithm (closely related to personalized PageRank and to other graph proximity measures such as shortest paths).

A random walk on an undirected graph is a finite Markov chain that is time-reversible. In fact, there is not much difference between the theory of random walks on graphs and the theory of finite Markov chains; every Markov chain can be viewed as a random walk on a directed graph, if we allow weighted edges. Similarly, time-reversible Markov chains can be viewed as random walks on undirected graphs, and symmetric Markov chains as random walks on regular symmetric graphs.
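
To make the Markov chain view concrete, here is a minimal sketch of a random walk with restart on a small weighted graph. It is an illustration under stated assumptions, not the implementation behind the results in this post: the function name `rwr`, its parameters, the restart probability of 0.3, and the toy path graph are all made up, and every node is assumed to have at least one edge.

```r
# Random walk with restart (RWR) on a weighted, undirected graph.
# `rwr`, `restart`, `tol` and `max_iter` are hypothetical names for this sketch.
rwr <- function(A, seed, restart = 0.3, tol = 1e-10, max_iter = 1000) {
  W <- sweep(A, 2, colSums(A), "/")      # column-normalise: W[i, j] = P(j -> i)
  p0 <- numeric(nrow(A))
  p0[seed] <- 1 / length(seed)           # the walker restarts at the seed node(s)
  p <- p0
  for (i in seq_len(max_iter)) {
    # one step: follow an edge with prob. 1 - restart, jump back to the seeds otherwise
    p_next <- (1 - restart) * as.vector(W %*% p) + restart * p0
    if (sum(abs(p_next - p)) < tol) break
    p <- p_next
  }
  p_next
}

# Toy path graph 1 - 2 - 3 - 4 (made-up data), walker seeded at node 1
A <- rbind(c(0, 1, 0, 0),
           c(1, 0, 1, 0),
           c(0, 1, 0, 1),
           c(0, 0, 1, 0))
p <- rwr(A, seed = 1)
print(round(p, 4))
```

The resulting vector is a probability distribution over the nodes; ranking nodes by it gives a proximity score relative to the seed, which is how a list of top-scoring candidates can be produced.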

A random walk on a graph G starts at a node x and iteratively moves to a neighbor of x chosen uniformly at random from the neighbor set Γ(x). The hitting time H(x,y) from x to y is the expected number of steps required for a random walk starting at x to reach y. Because the hitting time is not in general symmetric, it is also natural to consider the commute time C(x,y) := H(x,y) + H(y,x). Both of these measures serve as natural proximity measures and hence (negated) can be used as score(x, y).
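
The hitting and commute times can be computed exactly on a small graph by solving a linear system: H(y,y) = 0 and, for x other than y, H(x,y) = 1 + the sum over neighbors z of P(x,z) H(z,y). The sketch below assumes a connected graph given as an adjacency matrix; the helper name `hitting_times` and the three-node path graph are made up for illustration.

```r
# Expected hitting times H(x, y) to a fixed target y, for every start node x.
# `hitting_times` is a hypothetical helper name for this sketch.
hitting_times <- function(A, y) {
  P <- A / rowSums(A)                    # row-normalise: P[x, z] = P(x -> z)
  n <- nrow(A)
  h <- numeric(n)                        # h[y] stays 0
  keep <- setdiff(seq_len(n), y)
  # Solve (I - P restricted to non-target nodes) h = 1
  h[keep] <- solve(diag(n - 1) - P[keep, keep], rep(1, n - 1))
  h
}

# Toy path graph 1 - 2 - 3 (made-up data)
A <- rbind(c(0, 1, 0),
           c(1, 0, 1),
           c(0, 1, 0))
H12 <- hitting_times(A, 2)[1]   # from the end node to the middle node
H21 <- hitting_times(A, 1)[2]   # from the middle node to the end node
C12 <- H12 + H21                # commute time
print(c(H12 = H12, H21 = H21, C12 = C12))
```

Here H(1,2) and H(2,1) differ, which illustrates why the symmetric commute time is often preferred; negating either value gives a score where larger means closer.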

Now, some results.

The figure shows the statins network; the top-scoring 10 predicted genes are listed for the compounds along with the true links. True links are drawn as solid blue lines and predicted links as dashed red lines. It has been reported that both lovastatin and simvastatin have the side effects of alopecia and hair loss, and post-marketing reports show a variety of skin problems related to these drugs, such as nodules, discoloration, and dryness of skin/mucous membranes.

We have found an association of simvastatin and lovastatin with the KRA53 and KRA52 genes. KRA53 (keratin-associated protein 5 type 3) is essential for the formation of a rigid and resistant hair shaft through its extensive disulfide bond cross-linking with abundant cysteine residues of hair keratins. The matrix proteins include the high-sulfur and high-glycine-tyrosine keratins. The majority of keratinizing disorders affect the epidermis and/or its adnexal structures such as hair and nail, or sweat and sebaceous glands, although a number of these diseases affect other epithelia such as mucosal or corneal epithelia. We hypothesize that the hair-loss side effect of lovastatin and simvastatin might be associated with KRA53 or KRA52.

The method also shows some very good results, listed in the table below. Overall, the combination of sequence and descriptor similarity performs well. But is it good enough for cross-target-class prediction? We still have to look further and validate more.

Dopamine D1 Receptor, Dopamine D2 Receptor
Sodium channel protein type III alpha subunit,
adenosine A2A receptor, adenosine A2B receptor
adrenoceptor alpha 1B
Phospholipase A2, PPARG, phosphoglycerate kinase 1
Alkaline Phosphatase, PPARD, Multi drug resistance protein
Liver carboxylesterase 1, Acetylcholine receptor subunit alpha
Synaptosomal-associated protein 25,
Beta-2 adrenergic receptor