
Monday, July 21, 2014

Converting InChI to Mol using PL/Python and RDKit

I am at the EBI this summer, working on UniChem database virtualization. One part of the project is to search over 50 million compounds and generate images for those compounds. This can be done on the fly, but people here suggested that I pre-generate the mol files for all of the compounds instead. The data only contains InChIs, so each one has to be converted to a Mol object, written out as a mol block, and dumped back into the database. A very fast and efficient way to do this is PL/Python, which lets you run Python code directly inside PostgreSQL and populate the database from there. Quite a fancy combination of Postgres and Python, so that is the option I chose. I used RDKit to read the molecules and convert them to mol blocks, and the erroneous molecules are written to an error log file. The PL/Python code is given below; paste it into psql and hit enter, and the function will be created. Before using the script, you need to enable plpython as a language in your database, which is done with:

mydb=# CREATE PROCEDURAL LANGUAGE plpython2u;

-- Create the function
create or replace function inchi_mol(x text)
returns text
AS $$
from rdkit import Chem
from rdkit.Chem import AllChem

# molecules that fail are appended to this error log
logfile = open("/home/chembl/logerror2.txt", "a")
try:
    m = Chem.MolFromInchi(x)
    if m is not None:
        try:
            AllChem.Compute2DCoords(m)
        except (RuntimeError, ValueError, TypeError, Exception):
            logfile.write("2DCoords Error:")
            logfile.write(x)
            logfile.write("\n")
        try:
            return Chem.MolToMolBlock(m)
        except (RuntimeError, ValueError, TypeError, Exception):
            logfile.write("Molblock Error:")
            logfile.write(x)
            logfile.write("\n")
    else:
        logfile.write("Inchi to Mol Error:")
        logfile.write(x)
        logfile.write("\n")
except Exception:
    logfile.write("Mol Error:")
    logfile.write("\n")
finally:
    logfile.close()
$$ LANGUAGE plpython2u SECURITY DEFINER ;
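
As a quick sanity check before running it over the whole table, you can call the function on a single InChI. The snippet below is a hypothetical example using psycopg2 (any PostgreSQL client works the same way; the connection details are placeholders):

# Hypothetical sanity check: call inchi_mol() on the InChI for methane
import psycopg2

conn = psycopg2.connect(dbname="mydb")   # placeholder connection details
cur = conn.cursor()
cur.execute("SELECT inchi_mol(%s)", ("InChI=1S/CH4/h1H4",))
print(cur.fetchone()[0])                 # prints the generated mol block
cur.close()
conn.close()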


Once the function is created, executing the SQL statement below will generate the mol blocks for you in the ctab column.

select uci,stdinchi,inchi_mol(stdinchi) as ctab into ndb_mol from db_mol ;

That's it. It takes roughly 44-48 hours to generate all the mol files for 65 million compounds. I used a loop in a Python script to process the compounds in sets of 1 million and compute the mol files, along the lines of the sketch below.
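
Here is a minimal sketch of what that batching loop could look like, assuming psycopg2 and the table and column names used above (db_mol, ndb_mol, uci, stdinchi). It is not the exact script I ran, so adjust the batch size and the connection details to your own setup.

# Sketch of a batching loop: process the compounds 1 million at a time
import psycopg2

BATCH = 1000000   # compounds per batch

conn = psycopg2.connect(dbname="mydb")   # placeholder connection details
conn.autocommit = True
cur = conn.cursor()

# create an empty target table with the right column types
cur.execute("""
    CREATE TABLE ndb_mol AS
    SELECT uci, stdinchi, inchi_mol(stdinchi) AS ctab
    FROM db_mol WHERE false
""")

cur.execute("SELECT count(*) FROM db_mol")
total = cur.fetchone()[0]

for offset in range(0, total, BATCH):
    # inchi_mol() is the PL/Python function defined above
    cur.execute("""
        INSERT INTO ndb_mol (uci, stdinchi, ctab)
        SELECT uci, stdinchi, inchi_mol(stdinchi)
        FROM db_mol
        ORDER BY uci
        LIMIT %s OFFSET %s
    """, (BATCH, offset))

cur.close()
conn.close()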





Thursday, July 17, 2014

RDKit, R and PostgreSQL: Predictive Modeling / QSAR with ChEMBL data

This post is about doing predictive modeling with R and the RDKit PostgreSQL cartridge. If you are an RDKit user, I think you already use the cartridge; if not, start using it today: it is free and very useful. The RDKit documentation has a nice guide to installing it.

There are many IPython notebooks on using Python and RDKit for predictive modeling and QSAR; you can find them and several other tutorials with a quick Google search.

But this post is for R lovers who like to use R for their regular modeling work.
The tools I have used:
RDKit PostgreSQL cartridge
R
R libraries: RPostgreSQL, BMS, ggplot2, elasticnet

If you already have a working copy of the ChEMBL PostgreSQL database, great; otherwise please download and load it. The cartridge can generate several types of fingerprints, and to save space they are stored in hexadecimal format. The default Morgan fingerprint setting is 512 bits with a radius of 2. You can change the fingerprint-generation settings in the cartridge using the options below, which Greg Landrum suggested to me.

The options available are:
rdkit.dice_threshold           rdkit.layered_fp_size
rdkit.do_chiral_sss            rdkit.morgan_fp_size
rdkit.featmorgan_fp_size       rdkit.rdkit_fp_size
rdkit.hashed_atompair_fp_size  rdkit.ss_fp_size
rdkit.hashed_torsion_fp_size   rdkit.tanimoto_threshold


Note that a configuration variable changed with SET only affects the current session. If you want to make it the default for the database as a whole, you need to change the database configuration:

contrib_regression=# alter database contrib_regression set rdkit.morgan_fp_size=1024;
ALTER DATABASE

Then disconnect (close psql) and reconnect to pick up the new setting.
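
For a session-level change, a plain SET statement from any client is enough. The snippet below is a hypothetical example doing it from Python with psycopg2 (SET and SHOW are standard PostgreSQL commands; the database name is a placeholder):

# Hypothetical example: change the Morgan fingerprint size for the current session only
import psycopg2

conn = psycopg2.connect(dbname="chembl_18")      # placeholder connection details
cur = conn.cursor()
cur.execute("SET rdkit.morgan_fp_size = 1024")   # reverts when the session ends
cur.execute("SHOW rdkit.morgan_fp_size")
print(cur.fetchone()[0])                         # prints 1024
cur.close()
conn.close()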

I used the default 512 bits. The SQL below builds a subset table, keyed on the ChEMBL compound ID, of all human ChEMBL targets with their standard and published activity values, keeping standard values below 50 nM. It also generates fingerprints into the rdkfps_1 table.

create table tmp_data as
SELECT DISTINCT molecule_dictionary.chembl_id,compound_structures.canonical_smiles,ASSAYS.TID,target_dictionary.chembl_id AS target_chembl_id,
TARGET_DICTIONARY.ORGANISM,ACTIVITIES.*,assays.chembl_id AS assay_chembl_id
from molecule_dictionary,compound_structures,assays,activities,target_dictionary
where compound_structures.molregno = MOLECULE_DICTIONARY.MOLREGNO
and MOLECULE_DICTIONARY.MOLREGNO = ACTIVITIES.MOLREGNO
and ASSAYS.TID = TARGET_DICTIONARY.TID AND ASSAYS.ASSAY_ID = ACTIVITIES.ASSAY_ID
and ACTIVITIES.STANDARD_UNITS = 'nM' AND ACTIVITIES.STANDARD_VALUE < 50
and TARGET_DICTIONARY.ORGANISM = 'Homo sapiens'
order by activities.standard_value ;
select chembl_id, target_chembl_id, standard_value, published_type, data_validity_comment,
morganbv_fp(canonical_smiles::mol, 4) as ecfp4
into rdkfps_1 from tmp_data where is_valid_smiles(canonical_smiles::cstring) ;
-- count the number of targets
select count(distinct target_chembl_id) as n_targets from rdkfps_1 ;
-- count the number of molecules
select count(distinct chembl_id) as n_molecules from rdkfps_1 ;
-- count the number of molecules for a specific target
select count(*) from rdkfps_1 where target_chembl_id = 'CHEMBL224' ;

Once you have done this, your database is ready for modeling. I am using Postgres.app and R version 3.0.3 (2014-03-06). The code below shows how to connect to the postgres database, run the query, convert the hexadecimal fingerprints to binary with hex2bin(), and fit a ridge regression model on the serotonin 2A (5-HT2A) receptor dataset (CHEMBL224). I have written two functions, r2se() and plotsar(): r2se() computes R^2 and the root mean square error, and plotsar() plots the results.

install.packages('RPostgreSQL', type='source')
#initiate connection
library(RPostgreSQL)
library(BMS)       # provides hex2bin()
library(ggplot2)
library(grid)      # provides unit(), used in plotsar() below
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="chembl_18")
#query the database
rs <- dbSendQuery(con,"select chembl_id,standard_value,ecfp4 from rdkfps_1 where target_chembl_id ='CHEMBL224' and published_type ='IC50'")
#fetch all the results
d<-fetch(rs ,n =-1)
#close resultSet rs
dbClearResult(rs)
#remove duplicates ids
d<-d[!duplicated(d[,1]), ]
# strip the leading \x from the hexadecimal string and convert hex to a binary fingerprint
ad <- data.frame()
for (i in 1:dim(d)[1]){
  a <- hex2bin(gsub("[\\x]", "", d[i, 3]))
  ad <- rbind(ad, a)   # append so rows stay in the same order as d
}
# remove the third column (the raw hex string)
d[,3] <- NULL
#append the descriptor set
d<-cbind(d,ad)
# Drop columns containing NAs and near-constant columns
descs <- d[, !apply(d, 2, function(x) any(is.na(x)) )]
descs <- descs[, !apply( descs, 2, function(x) length(unique(x)) == 1 )]
# Drop one of each pair of fingerprint bits with squared correlation > 0.25
# (the first two columns are chembl_id and standard_value)
r2 <- which(cor(descs[, 3:ncol(descs)])^2 > .25, arr.ind=TRUE)
r2 <- r2[ r2[,1] > r2[,2] , ]
data <- descs[, -(unique(r2[,2]) + 2)]   # shift indices past the two id columns
#Check Dimensions of the dataset
dim(data)
#-------------------------------------------------------------
#train and test set
names(data)[2] <- "Activity" # rename the standard_value column
# Convert IC50 (nM) to pIC50
data[,2] <- -log10(data[,2]*10^-9)
# split train and test set (80/20)
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.8,0.2))
trainsetX <- data[ind==1, 3:dim(data)[2]]   # fingerprint bits only
trainY <- data[ind==1, "Activity"]
testsetX <- data[ind==2, 3:dim(data)[2]]
testY <- data[ind==2, "Activity"]
#-------------------------------------------------------------
# Function to compute RMSE and R2
r2se <- function(obs, pred){
  rmse <- (mean((obs - pred)^2))^0.5
  ssr <- sum((obs - pred)^2)
  sst <- sum((obs - mean(obs))^2)
  R2 <- 1 - (ssr/sst)
  output <- list(RMSE = rmse, RSquared = R2)
  return(output)
}
#-------------------------------------------------------------
# Function to plot qsar results with ggplot2
plotsar <- function(results){
  myplot <- ggplot(results, aes(x = observed, y = predicted, color = factor(set), shape = factor(set))) +
    geom_point() + scale_colour_manual(values = c("blue", "yellow")) +
    theme(axis.line = element_line(colour = "white")) +
    theme(panel.grid.major = element_line(colour = "white", linetype = "dotted")) +
    theme(axis.ticks.y = element_line(size = 1), axis.text.y = element_text(size = 16), axis.ticks.length = unit(0.25, "cm"),
          panel.background = element_rect(fill = "black", colour = NA)) +
    theme(legend.key = element_rect(fill = "black")) +
    theme(axis.title = element_text(size = 12, face = "bold", colour = 'white'),
          plot.background = element_rect(colour = 'black', fill = 'black')) +
    theme(legend.title = element_text(size = 12, face = "bold", hjust = 0, colour = 'white'),
          legend.background = element_rect(color = "black", fill = "black"),
          legend.text = element_text(size = 12, color = "white", face = "bold"),
          axis.title.x = element_text(size = 12, colour = 'white', vjust = 1),
          axis.title.y = element_text(size = 12, colour = 'white')) +
    ggtitle("Predicted v/s test set QSAR results") +
    theme(plot.title = element_text(lineheight = .8, face = "bold", color = "white", size = 14)) +
    stat_smooth(span = 0.9)
  return(myplot)
}
#-------------------------------------------------------------
# Ridge Regression Model
library(elasticnet)
ridgeModel <- enet(x = as.matrix(trainsetX), y = trainY, lambda = 0.01)
ridgetest <- predict(ridgeModel, newx = as.matrix(testsetX),
                     s = 1, mode = "fraction", type = "fit")
ridgetrain <- predict(ridgeModel, newx = as.matrix(trainsetX),
                      s = 1, mode = "fraction", type = "fit")
ridgeval1<-data.frame(obs = trainY, pred = ridgetrain$fit)
ridgeval2<-data.frame(obs = testY, pred = ridgetest$fit)
r2se(ridgeval1$obs,ridgeval1$pred)
r2se(ridgeval2$obs,ridgeval2$pred)
train.results<-data.frame(observed=trainY,predicted=ridgetrain$fit,set="train")
test.results <- data.frame(observed=testY,predicted=ridgetest$fit,set="test")
results<-rbind(train.results,test.results)
plotsar(results)
## close connection con
dbDisconnect(con)


$RMSE

[1] 0.3224558

$RSquared
[1] 0.9095131



Let me know if you have any questions about the code, the modeling, or generating the plot. I am writing some more R code to perform analytics with the ChEMBL data, so stay tuned.