Monday, July 21, 2014

Converting InChI to Mol using PL/Python and RDKit

I am at EBI this summer, working on the UniChem database virtualization. One part of the project is to search over 50 million compounds and generate images of them. The images can be generated on the fly, but people here suggested that I generate mol files for all of the compounds up front. The data only has InChIs available, so each one has to be converted into a Mol object, written out as a mol file, and then dumped into the database. A much faster and more efficient method is PL/Python, which lets you run your Python code inside Postgres and build the table directly in the database. Quite fancy: Postgres and Python together. Naturally I chose that option for the conversion. I used RDKit to read the molecules and convert them to mol files, and the molecules that fail to convert are written to an error log file. I have given the PL/Python code below; just paste it into the prompt and press enter, and you should see CREATE FUNCTION on the screen. Before creating the function you need to set up plpython as a language in your database, which is done with:

mydb=# CREATE PROCEDURAL LANGUAGE plpython2u;
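As a rough illustration of the conversion step, here is a minimal sketch of the Python logic such a function could contain. It assumes RDKit was built with InChI support; the helper name `inchi_to_molblock` and the log file path are assumptions made for this example, not the exact script that was run. In PL/Python the same statements would sit inside the body of a function like `inchi_mol(inchi text) RETURNS text`, ending with `return` of the mol block.

```python
def inchi_to_molblock(inchi, error_log_path="inchi_errors.log"):
    """Return a mol block for an InChI string, or None if conversion fails.

    Hypothetical helper: the name and the log path are illustrative only.
    """
    # Imported inside the function, as it would be in a PL/Python body.
    from rdkit import Chem

    if not inchi:
        # Covers SQL NULL (None) and empty strings.
        return None

    try:
        # MolFromInchi returns None when the InChI cannot be parsed.
        mol = Chem.MolFromInchi(inchi)
    except Exception:
        mol = None

    if mol is None:
        # Record the failing InChI so it can be inspected later.
        with open(error_log_path, "a") as log:
            log.write(inchi + "\n")
        return None

    # MolToMolBlock gives the mol file contents as a single string.
    return Chem.MolToMolBlock(mol)
```

Returning None for a failed conversion means the ctab column simply ends up NULL for that row, so the bad compounds are easy to count afterwards. The code avoids anything specific to Python 3, so it should also run under the Python 2 interpreter that plpython2u uses.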



Once the function is created, executing the following SQL statement will generate the mol files for you in the ctab column.

select uci,stdinchi,inchi_mol(stdinchi) as ctab into ndb_mol from db_mol ;

That's it. It takes about 44-48 hours to generate all the mol files for 65 million compounds. I used a loop in a Python script to pull out sets of 1 million compounds at a time and compute their mol files.
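The batching loop can be sketched roughly as follows. This is only an illustration under stated assumptions: that `uci` is an integer key running from 1 up to some maximum, that the target table `ndb_mol` already exists with the three columns, and that the hypothetical helper `batch_statements` is used to build one statement per batch.

```python
def batch_statements(max_uci, batch_size=1000000):
    """Yield one INSERT ... SELECT statement per batch of uci values.

    Hypothetical helper: assumes uci is an integer key in 1..max_uci
    and that the ndb_mol table has already been created.
    """
    template = (
        "INSERT INTO ndb_mol (uci, stdinchi, ctab) "
        "SELECT uci, stdinchi, inchi_mol(stdinchi) FROM db_mol "
        "WHERE uci > {lo} AND uci <= {hi};"
    )
    for lo in range(0, max_uci, batch_size):
        # The last batch may be smaller than batch_size.
        hi = min(lo + batch_size, max_uci)
        yield template.format(lo=lo, hi=hi)


if __name__ == "__main__":
    # Print the statements for a small example instead of running them.
    for statement in batch_statements(2500000):
        print(statement)
```

Each statement would then be executed and committed in turn through whatever database driver you use, so a failure partway through only costs one batch rather than the whole run.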





2 comments:

jpo said...

Cool stuff.

What fraction had errors?

What about now taking these mol files and then recalculating the InChIs and seeing what fraction make the round trip?

This may be a useful filter to store a core of the reliably converted set....

Abhik Seal said...

The fraction of errors was 0.0011% going from InChI to Mol.
Yeah, I will go for the "round trip" too.