Saturday, December 27, 2014

FDA Adverse effect mining part 2

In the last 1st part of the blog we have shown how to use FDA data to visualize adverse events for a particular drug and how to use brandnames and ATC group class and visualize it using rCharts pyramid plots.  This section of the blog discusses about how to calculate the some useful statistics of the data and compare the side effects of two same class or off class drugs.  In the last section we didn't consider the outcome tables which indicates patient outcomes like hospitalization, death etc. The codes in the outcomes table are given as two letter codes and information about it can be found on the documentation of the data.  The codes are given in git. For this part we will use run.R , riskratio.R and function.R codes .
function.R has functions to get all the brandnames of a query drug and collects all the adverse events for particular drug. It also has get.Vis() function to plot pyramid plots by two letter outcome. If outcomes are not specified then it used all the data and plots it. 
run.R is used to execute those functions with the data which is given below.  The plots are also shown below.

I am interested in comparing two standard techniques, namely Reporting odds ratio (ROR) and proportional reporting ratio (PRR) on two different datasets. Both methods use the values according to a 2x2  contingency table,

Drug of Interest Other drug
Event of Interest a b
other event c d

Reporting odds ratio equation is given below. The value (a/b) is the number of patients who had the event of interest and have taken the drug of interest (a) divided by the number of patients who had the event when taking any other drug (b). The value (c/d) is the number of patients who have taken the drug of interest but did not have the event of interest (c), divided by the number of people who had any other events, given that they took any other drug (d).


 The PRR on the other hand can be calculated as follows: 


 where a/(a + c)  can be thought of as the probability of an event of interest occurring, given the drug of interest was taken and an event occurred whereas , b/( b+ d) can be thought of as the probability that the event of interest occurred, given any other drug was taken and an event occurred.
# Comparing the Risk adverse effect for two types of drugs
# Get the reactions of Crestor which matches Lipitor
d<-d2se[which(d1se$reaction %in% d2se$reaction ),]

# Aggregat all the data without considering the outcomes
crestor<-aggregate(count ~ reaction ,data=n, FUN=sum)
#head(dt[order(dt$count, decreasing = T), ])

# Get the reactions of Lipitor which matches Crestor from the above dataframe
t<-d1se[which(d$reaction %in% d1se$reaction),]
lipitor<-aggregate(count ~ reaction ,data=t, FUN=sum)

# compare crestor against Lipitor 

reaction ror
158 Renal failure 12.50
21 Blood cholesterol increased 7.32
25 Blood triglycerides increased 6.06
70 Fibromyalgia 5.78
75 Gastrooesophageal reflux disease 5.78
39 Chronic obstructive pulmonary disease 5.37
37 Chest discomfort 4.95
117 Migraine 4.95
147 Pancreatitis acute 4.95
170 Type 2 diabetes mellitus 4.54

reaction ror
158 Renal failure 12.50
21 Blood cholesterol increased 7.32
25 Blood triglycerides increased 6.06
70 Fibromyalgia 5.78
75 Gastrooesophageal reflux disease 5.78
39 Chronic obstructive pulmonary disease 5.37
37 Chest discomfort 4.95
117 Migraine 4.95
147 Pancreatitis acute 4.95
170 Type 2 diabetes mellitus 4.54

When crestor is compared to lipitor the prr and ror indicates that chances of Renal Failure , Increased Blood cholesterol, Fibromyaglia is more than lipitor, whereas when one takes lipitor there are more chances of abdominal discomfort, Rash, Muscle atrophy and so on.

Table below shows the prr of lipitor against crestor and its adverse reactions .

reaction prr
2 Abdominal discomfort 17.02
156 Rash 9.12
121 Muscle atrophy 6.68
63 Dyspepsia 6.38
19 Balance disorder 6.28
3 Abdominal pain 6.08
27 Bradycardia 4.86
80 Haematoma 4.86
114 Lung disorder 4.86
79 Gout 4.56

Thursday, December 18, 2014

Adverse Effect mining of FDA data Part 1

US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS, formerly AERS) is a database that contains information on adverse event and medication error reports submitted to the FDA. The database is designed to support the FDA's post-marketing safety surveillance program for drug and therapeutic biologic products. In this issue of my blog i have used 2012 quarter 4 data to visualize adverse reactions and also compare the adverse reactions for a group of drugs (group means by ATC code). There are some very important point to note the data is submitted from countries which mean a drug can have different names (brandnames) . So you can search as "Aspirin" on the database you need to search by all the alternate brandnames of Aspirin on the database. For that I have created a brandname based dataset of all the drugbank data which i will use for this example.

For this datasets used :
  • ATC drug group class data
  • Drug Brandnames data
  • 2012 FDA quarter 4 dataset.
R packages used
  • foreach
  • doParallel
  • rCharts(for visualization)
I used foreach with parallel because it takes a lot of time to query the data frames on a single processor making it parallel reduces a lot of timing. I am using rCharts NVD3 library for the interactive pyramid plots developed by Ramnath Vaidyanathan .  The codes for the work is avaialable at here. To simplify things I made a simple query "Aspirin" on the drug dataframe and used grepl with ignore case and drug_seq=1 ( which indicates primary medication) on lines 51-63(Initial.R script)  since aspirin is categorized  under 3 different classes such as Blood and blood forming organs ,Alimentary tract and metabolism and Nervous system , so i extracted all the drugs from each category and extracted the data from drug dataframe for each drugs and merged the results and visualized it as Interactive pyramid plots below.  I have some further ideas on this calculating some statistics based on class and also further categorizing using Mesh Class creating a Shiny app.

If you want the brandnames data with the ATC code classes leave a comment below with your email id or you can shoot an email.

Class Blood and blood forming organs ( Access the full charts at

Class Nervous System (

Class Alimentary tract and metabolism (

Friday, November 21, 2014

Drug Sales by Year googleVis motion charts giving data a new dimension.

I really like videos made by Hans Rosling on Gap minder on which he presents all the the data with the motion charts  Now using his idea and a small assignment which was given to the students of data science course at IU for which they need to analyze year wise sales data I made the data to talk using the googleVis package creating a similar type of motion charts and it just cool way to view longitudinal / Time series data. One can easily interact with the plots with mouse. The googleVis package provides an interface between R and the Google Charts API. It uses internal R HTTP server to display the output locally into a browser. If you want to create stories bring data to life i would recommend such tools for you. The graph below shows the yearly sales data (2003-2013) collected from by scraping the internet page. All codes are done in R. I am not releasing the R codes here but if you want them mail me with caption "drug sales code".

Friday, October 24, 2014

Lean Markov Chain Clustering in R

Markov Chain Clustering (MCL) is fast scalable supervised clustering algorithm based on information flow in graphs. The algorithm finds cluster in graphs by random walks. It uses two important operators one is the inflation and other the expansion. "Expansion takes the power of a stochastic matrix using the normal matrix product. Inflation takes the Hadamard power of a matrix (taking powers entrywise), followed by a scaling step, such that the resulting matrix is stochastic again, i.e. the matrix elements (on each column) correspond to probability values." More information can be found here and here . Below the R code describes how to perform it step by step. Also a nice explanation is presented here

Wednesday, October 1, 2014

Link Prediction using Bipartite Networks .

Missing link prediction of networks is  of practical significance in modern science like in Social Networks , Biological Networks and Food networks and lots others. Adamic-Adar  index refines the simple counting of common neighbors by assigning the lower connected neighbors more weights which is given by the equation below. More on the other indexes are ,


The code takes a bipartite graph as input (stored as a text file in an adjacency list) and computes the Adamic/Adar similarity of each non-neighboring node pair. The similarity is computed using the degree of the intermediate nodes. The output file is written as a text file containing three fields per row score , Proteins and Drugs. However this can be applied to other bipartite networks also.

After calculation the predicted links are stored in an output file and the highest predicted links can be obtained by sorting the first column. The Bipartite data (inputdata.txt) and code is avialable at Git.

Sunday, September 14, 2014

Some R codes and Examples to perform Random Walks on a network.

My research is mainly focused on Drug Target Prediction and Drug to Target to Adverse events prediction. I use Random walk on different heterogeneous networks for this. There are several papers which came up like .

1) Kohler 2008 (Walking the Interactome for Prioritization of Candidate Disease Genes)
2) The power of protein interaction networks for associating genes with diseases
3) Chen Etal ( Drug Target Prediction with Random walk with restart)
4) Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network

Here i will post some codes in R on how to perform a Normal Random walk on a small network using two method one proposed by Kohler (2008) and another proposed by Vanunu (2010).
The method iteratively simulates random transitions of a walker from a node to a randomly selected neighbour node and at any time step the walk can be restarted depending on a predefined probability. Random walk with restarts is slightly different than PageRank with priors in the way that it normalizes the link weights. The convergence is decided when a probability difference is less than 10e-10 between two consecutive time steps which is calculated by L1 / Frobenius norm. The loop breaks when the difference is less than value. The function assumes two types of method based 

1 Kohler -> which is just a simple random walk
2 Vanunu - > Which is a modification of random walk with restarts such that the link weight is normalized not only by number of outgoing edges but also by number of incoming edges. They mentioned "We chose to normalize the weight of an edge by the degrees of its end-points, since the latter relate to the probability of observing an edge between the same end-points in a random network with the same node degrees".

There are no such parameters applied here and also similarity matrices . This is the heart of the code of Random walk and one can modify on top of this as much as they can.

Hope this code makes life simple for some people when they try to understand it :)

Thursday, September 11, 2014

Some function for VS: AUC , BEDROC and RIE in R

Below are some R functions to compute Area under the curve , Robust Initial Enhancement Metric and Boltzmann-Enhanced Discrimination of ROC which is implemented in

Truchon et al. Evaluating Virtual Screening Methods: Good and Bad Metrics for the "Early Recognition" J. Chem. Inf. Model. (2007) 47, 488-508.

These metrics use for the early recognition problem in virtual screening.




Monday, July 21, 2014

Converting InChi to Mol using PL/PYTHON and RDKit

I am at EBI this summer and working in the Unichem database virtualization. One of the part of the project is to perform a search of over 50 million compounds and generate the images of those compounds. It can be done on the fly but people here suggested me to generate all the mol files for those compounds. The data has only InChi's available so you need to convert it into Mol object and write it into mol file and then use database to dump the files. Another very fast and efficient method is to use PL/PYTHON which is very fast and you can integrate all the python code on postgres and generate the database. Quite fancy postgres and python. Certainly I choose that option for conversion. I used rdkit for reading the molecules and conversion to mol files and also the erroneous molecules are written as an error log file. I have given the PL/PYTHON code below just paste it and enter and then you create function at screen. Before using the script above you need to set the plpython as a language in your database which is done by


Once you're done with the script then  executing the following sql statement below will generate the mol files for you in the ctab column.

select uci,stdinchi,inchi_mol(stdinchi) as ctab into ndb_mol from db_mol ;

Thats it . It takes almost 44-48 hours to generate all the mol files for 65 million compounds. I used a loop in a python script to extract 1 million set of compounds and compute the mol files.

Thursday, July 17, 2014

RDKit , R and PostgreSQL : Predictive Modeling / QSAR with ChEMBL data

This post is based on doing Predictive Modeling with R and RDKit postgres cartridge . If you are a rdkit user then i think you do all use the rdkit postgres cartridge if not then start using it today it Free and very useful. Here is a nice documentation of installation of rdkit .

There are many ipython notebooks upon using python and rdkit for predictive modeling and qsar. You can also search it on google and some tutorials are given here.

But this post is for those who are R lovers and like to use R for their regular modeling purposes.
The tools I have used :
RDkit postgres cartridge
R libraries : RPostgreSQL, BMS, ggplot2 , elasticnet.

If you already have a working version of the ChEMBL postgres database then should be great otherwise please download and load it. The cartridge can be used to generate several types of fingerprints however to save the space it gets generated in hexadecimal format. The default setting is 512 for morgan fingerprint with path radius of 2. You can change the setting in the postgres cartridge for generation of fingerprints using the options below which Greg Landrum has suggested to me.

The options available are:
rdkit.dice_threshold           rdkit.layered_fp_size
rdkit.do_chiral_sss            rdkit.morgan_fp_size
rdkit.featmorgan_fp_size       rdkit.rdkit_fp_size
rdkit.hashed_atompair_fp_size  rdkit.ss_fp_size
rdkit.hashed_torsion_fp_size   rdkit.tanimoto_threshold

Note that a change to a configuration variable as done here only affects the current session. If you want to make it the default for the database as a whole you need to change the database configuration:

contrib_regression=# alter database contrib_regression set rdkit.morgan_fp_size=1024;

Then disconnect (close psql) and reconnect to pick up the new setting.

I used the default 512 bits. The sql query below makes a subset table based on chembl compound id, all human chembl targets, its standard and published activity values which are less than 50 uM. The sql code below  shows it how to perform it. It also generates fingerprints into rdkfps_1 table .

Once you have done this then your database is ready for modeling. I am using postgresapp and R version 3.0.3 (2014-03-06) . The following code shows you how to connect to postgres data and the query to run and how to convert hexadecimal to binary fingerprint (hex2bin()) and run a ridge regression model on the Serotonin 2a (5-HT2a) receptor dataset. I have written two function r2se() and plotsar() . r2se computes the r^2 and root mean square error and plotsar() plots the data . 


[1] 0.3224558

[1] 0.9095131

Let me know if you have any questions about the code and modeling and generating this plot. I am writing some more R codes to perform analytics with the ChEMBL data stay tuned .

Friday, February 14, 2014

Converting Similarity Matrix into a into a pairwise p-value

We can calculate the probability that a given random pairwise similarity score X is bigger than a value x as p(X > x) using the fitted Gaussian function, we can transform a Tanimoto similarity matrix into a p-value p(X > x) as follows:

were t(xi,xj) is the tanimoto similarity matrix and h is the smoothing factor which you need to estimate. 

Hope now you all can very easily understand how you can calculate your pvalue from a large distribution.