Using a Chemical Language Transformer Model for Molecular Property Prediction (Regression) Tasks and Visualizing Attention Weights: Part 1
I came across ChemBERTa-77M-MTR on Hugging Face; it is pre-trained on 77M molecules. ChemBERTa is a large-scale pre-trained molecular transformer model based on the BERT architecture, designed for tasks in chemistry, drug discovery, and materials science. The model can be fine-tuned for specific tasks such as property prediction or molecular generation. After fine-tuning for some time, varying max_length, batch size, and the number of epochs, I obtained good scores; these models hold up well against traditional descriptor-based models and often perform better. Below are results on some datasets I used from TDC, which suggest that these fine-tuned models could be leaders in the TDC benchmark tables. I am quite impressed with the results I have got, and the approach certainly has good potential as one of the methods for SA prediction. I merged the train and validation sets together, and these are the test set results.
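To make the setup concrete, here is a minimal sketch of loading the pre-trained checkpoint from Hugging Face with a single-output regression head. This is an illustration of one common way to do it, not necessarily the exact configuration I used; the model name matches the ChemBERTa-77M-MTR checkpoint hosted under the DeepChem organization.

```python
# Minimal sketch: load ChemBERTa-77M-MTR for single-target regression fine-tuning.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "DeepChem/ChemBERTa-77M-MTR"  # checkpoint pre-trained on ~77M SMILES
tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels=1 with problem_type="regression" attaches a fresh regression head
# (trained during fine-tuning) on top of the pre-trained encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,
    problem_type="regression",
)
```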
The Input (Dataset) class tokenizes the SMILES strings and prepares them as inputs for the model: each SMILES string is tokenized, padded and truncated to the maximum length, and the class returns the input IDs, attention mask, and label.
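A minimal sketch of such a Dataset class is shown below; the class and field names are my own assumptions for illustration, not the author's original code.

```python
# Sketch of a PyTorch Dataset that tokenizes SMILES strings for the model.
import torch
from torch.utils.data import Dataset

class SmilesDataset(Dataset):
    def __init__(self, smiles_list, labels, tokenizer, max_length=128):
        self.smiles_list = smiles_list
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.smiles_list)

    def __getitem__(self, idx):
        # Tokenize one SMILES string with padding/truncation to max_length.
        encoding = self.tokenizer(
            self.smiles_list[idx],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.float),
        }
```

Wrapping this in a standard DataLoader then yields batches of input IDs, attention masks, and labels ready for fine-tuning.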