
Few-shot learning with LLMs for prediction of binary molecular properties

Abhik Seal
7 min read · Feb 21, 2025


Large language models (LLMs) have attracted significant interest across many domains, and new uses keep appearing in everyday work. Cheminformatics is no exception: researchers have used LLMs to generate molecules, automate routine tasks, and plan syntheses, as in ChemCrow. Doing chemistry with LLMs remains challenging, but new possibilities keep opening up as the models improve.

Despite the broad popularity of descriptor-based QSAR modeling, where features such as molecular weight, hydrogen-bond donor counts, Morgan fingerprints, and topological indices are computed with RDKit, the approach explored here deliberately avoids all classical descriptors. Instead, it relies on a raw LLM to interpret SMILES strings through a series of prompts, treating them purely as textual input rather than as structured chemical information.

This post presents an in-depth walkthrough of code that demonstrates how LLMs (here, GPT-based models) perform classification predictions on three datasets: blood-brain barrier penetration (TDC), hERG (TDC), and BACE (β-site amyloid precursor protein cleaving enzyme, provided in the GitHub repo). I used two strategies for selecting the few-shot examples, random sampling and scaffold-based sampling; a sketch of both appears below. We will discuss how they compare, and why LLM-based predictions may still lag behind RDKit-based approaches, while hinting at the exciting possibility of future improvements. It is very much possible that one day these models will make such predictions with high accuracy, and while performing 5–7 experiments I am…
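To make the setup concrete, here is a minimal sketch of the two few-shot sampling strategies and the prompting loop. It assumes an OpenAI-style chat API and a CSV file with `smiles` and `label` columns; the file name `bbbp.csv`, the model name, and the helper functions (`random_shots`, `scaffold_shots`, `build_prompt`, `predict`) are illustrative assumptions, not the exact code from the repo.

```python
# Minimal sketch, not the author's exact code: few-shot SMILES classification
# with random vs. scaffold-based example selection.
import random
from collections import defaultdict

import pandas as pd
from openai import OpenAI
from rdkit.Chem.Scaffolds import MurckoScaffold

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def random_shots(df, k=8, seed=42):
    """Random sampling: draw k labeled molecules uniformly."""
    return df.sample(n=k, random_state=seed)[["smiles", "label"]].values.tolist()

def scaffold_shots(df, k=8, seed=42):
    """Scaffold-based sampling: group molecules by Bemis-Murcko scaffold and
    draw one example per scaffold, so the shots cover diverse chemotypes."""
    groups = defaultdict(list)
    for smi, y in df[["smiles", "label"]].itertuples(index=False):
        scaf = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaf].append((smi, y))
    rng = random.Random(seed)
    scaffolds = rng.sample(list(groups), min(k, len(groups)))
    return [rng.choice(groups[s]) for s in scaffolds]

def build_prompt(shots, query_smiles):
    """Few-shot prompt: SMILES are presented as plain text, no descriptors."""
    lines = ["Classify each molecule as 1 (active) or 0 (inactive).", ""]
    for smi, y in shots:
        lines.append(f"SMILES: {smi}\nLabel: {y}\n")
    lines.append(f"SMILES: {query_smiles}\nLabel:")
    return "\n".join(lines)

def predict(shots, query_smiles, model="gpt-4o"):
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": build_prompt(shots, query_smiles)}],
    )
    return resp.choices[0].message.content.strip()

df = pd.read_csv("bbbp.csv")          # hypothetical file: "smiles", "label"
shots = scaffold_shots(df, k=8)       # or random_shots(df, k=8)
print(predict(shots, "CC(=O)Oc1ccccc1C(=O)O"))  # aspirin as a query
```

The design choice worth noting is in `scaffold_shots`: by sampling one molecule per scaffold rather than uniformly, the prompt is less likely to be dominated by a single chemical series, which is the same intuition behind scaffold splits in QSAR benchmarking.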



Written by Abhik Seal

Data scientist / cheminformatician, ex-AbbVie. I try to make complicated things look easier and more understandable. www.linkedin.com/in/abseal/
