Retrieval-Augmented Generation for Clinical Trials: An Introductory Practical Guide (Part 1)

Abhik Seal
7 min read · Jun 1, 2024

During a collaboration with Lynx Analytics for a US-based biosciences company, I tackled a problem similar to one I faced during my time at AbbVie: retrieving documents based on their content. Back then, the challenge was capturing the semantics of full paragraphs, which wasn’t easily feasible. We had to rely on word matching and n-gram scoring to retrieve and rank documents. Fast forward to today, and technologies like ChatGPT and indexed vector stores have changed how we approach this problem. In this blog post, I’ll guide you through a practical implementation of a retrieval-augmented generation (RAG) system for searching clinical trial data using Weaviate and LlamaIndex.

Introduction

The traditional approach used term frequency-inverse document frequency (TF-IDF) to represent documents as vectors in a high-dimensional space. We would then use cosine similarity to measure how close the query vector was to each document vector. However, this approach matches on shared terms, so it struggles to capture semantic meaning such as synonyms and paraphrases.
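To make the TF-IDF and cosine similarity idea concrete, here is a minimal sketch in plain Python. It is not the pipeline I used back then: the three documents are made-up toy examples, the function names (`tfidf_vectors`, `vectorize_query`, `cosine`, `rank`) are hypothetical, and the smoothed IDF formula is just one common variant that I am assuming for illustration.

```python
import math
from collections import Counter

# Toy corpus: invented examples, not real clinical trial records.
DOCS = [
    "phase 3 trial of adalimumab in rheumatoid arthritis",
    "randomized study of metformin in type 2 diabetes",
    "observational study of heart failure outcomes",
]


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def tfidf_vectors(token_lists):
    """Build sparse TF-IDF vectors (dicts) and the IDF table for a corpus."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors, idf


def vectorize_query(tokens, idf):
    """Project a query into the corpus vocabulary; unknown terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, doc_index) pairs sorted from best to worst match."""
    doc_vectors, idf = tfidf_vectors([tokenize(d) for d in documents])
    query_vector = vectorize_query(tokenize(query), idf)
    scored = [(cosine(query_vector, dv), i) for i, dv in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    for query in ("metformin diabetes trial", "cardiac insufficiency"):
        print(query, rank(query, DOCS))
```

The second query shows the limitation: "cardiac insufficiency" shares no words with the heart failure document, so every score is zero even though the meaning is essentially the same.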

With vector embeddings and neural network-based models, we can now represent documents and queries in a continuous vector space where semantically similar texts lie close to each other. This has been made possible by models like BERT, GPT-3.5, GPT-4, Gemini, and their successors. Retrieval-Augmented Generation (RAG) builds on this idea: it combines the strengths of retrieval systems and generative models to…


Written by Abhik Seal

Data Science / Cheminformatician, ex-AbbVie. I try to make complicated things look easier and more understandable. www.linkedin.com/in/abseal/
