Pages

Thursday, July 17, 2014

RDKit , R and PostgreSQL : Predictive Modeling / QSAR with ChEMBL data

This post is based on doing Predictive Modeling with R and RDKit postgres cartridge . If you are a rdkit user then i think you do all use the rdkit postgres cartridge if not then start using it today it Free and very useful. Here is a nice documentation of installation of rdkit .

There are many ipython notebooks upon using python and rdkit for predictive modeling and qsar. You can also search it on google and some tutorials are given here.

But this post is for those who are R lovers and like to use R for their regular modeling purposes.
The tools I have used :
RDkit postgres cartridge
R
R libraries : RPostgreSQL, BMS, ggplot2 , elasticnet.

If you already have a working version of the ChEMBL postgres database then should be great otherwise please download and load it. The cartridge can be used to generate several types of fingerprints however to save the space it gets generated in hexadecimal format. The default setting is 512 for morgan fingerprint with path radius of 2. You can change the setting in the postgres cartridge for generation of fingerprints using the options below which Greg Landrum has suggested to me.

The options available are:
rdkit.dice_threshold           rdkit.layered_fp_size
rdkit.do_chiral_sss            rdkit.morgan_fp_size
rdkit.featmorgan_fp_size       rdkit.rdkit_fp_size
rdkit.hashed_atompair_fp_size  rdkit.ss_fp_size
rdkit.hashed_torsion_fp_size   rdkit.tanimoto_threshold


Note that a change to a configuration variable as done here only affects the current session. If you want to make it the default for the database as a whole you need to change the database configuration:

contrib_regression=# alter database contrib_regression set rdkit.morgan_fp_size=1024;
ALTER DATABASE

Then disconnect (close psql) and reconnect to pick up the new setting.

I used the default 512 bits. The sql query below makes a subset table based on chembl compound id, all human chembl targets, its standard and published activity values which are less than 50 uM. The sql code below  shows it how to perform it. It also generates fingerprints into rdkfps_1 table .


Once you have done this then your database is ready for modeling. I am using postgresapp and R version 3.0.3 (2014-03-06) . The following code shows you how to connect to postgres data and the query to run and how to convert hexadecimal to binary fingerprint (hex2bin()) and run a ridge regression model on the Serotonin 2a (5-HT2a) receptor dataset. I have written two function r2se() and plotsar() . r2se computes the r^2 and root mean square error and plotsar() plots the data . 



$RMSE

[1] 0.3224558

$RSquared
[1] 0.9095131



Let me know if you have any questions about the code and modeling and generating this plot. I am writing some more R codes to perform analytics with the ChEMBL data stay tuned .



No comments: