Uncover Hidden Insights in Clinical Trials database with Full Natural Language based Chat Engine

Abhik Seal
7 min readJul 5, 2024

--

Introduction

My last blog post about RAG based clinical trial search where we implemented a Retrieval-Augmented Generation (RAG) system for clinical trial data, focusing on content searching like criteria , enrollment, title and other fields to improve the retrieval and relevance of clinical trial documents. Clinical trials generate vast amounts of data, often stored in complex databases. Extracting meaningful insights from this data can be a daunting task, especially for researchers and healthcare professionals who may not be familiar with writing SQL queries. Here at Lynx Analytics we are experimenting a Clinical Chat Query Engine — an innovative solution leveraging OpenAI’s GPT-4 to simplify and automate the process of querying clinical trial databases. In this blog post, we will dive into the technical aspects of this powerful tool, showcasing how it can transform the way we explore clinical trial data from ClinicalTrials.gov using just natural language .

Overview of the Tool

For decades, traditional SQL queries have been the backbone of database management, providing structured and efficient data retrieval through well-defined schemas and precise syntax. However, this method often demands significant expertise in SQL and database design. The emergence of advanced technologies like GPT-4 has revolutionized this process, making data querying more accessible and intuitive. These AI models can understand natural language inputs, generate complex queries, and offer insights without needing deep technical knowledge. This democratizes data analysis, allowing more users to efficiently extract valuable information from databases. With better prompts design , function calling we can fully automated the process of querying databases without needing a database engineer for this task.

Working on the Database Chat Query Engine, designed to interface with clinical trial databases like ctgov, will enable users to ask questions in natural language and receive precise SQL-generated answers. Leveraging GPT-4’s capabilities, this engine simplifies uncovering hidden insights within clinical trial data, streamlining research and decision-making processes rapidly.

Moreover, this tool can communicate not only with clinical databases but with any database. It can connect and interact in real time with PostgreSQL, Oracle, MySQL, and almost all table-based databases, ensuring efficient and secure data retrieval. The tool includes mechanisms to handle API errors gracefully, with multiple retries to ensure reliability. This feature enhances the tool’s robustness, making it suitable for critical research applications. By processing and interpreting data using natural language, the tool makes complex data queries more accessible and intuitive. This not only democratizes data analysis across the organization but also enables us to derive deeper insights more quickly.

Detailed Logging and Feedback: Users receive detailed feedback on query execution, including the time taken and the results retrieved. This transparency helps users understand the data and the tool’s functionality better.

Without much ado lets see what i am trying to tell you with some use cases . All of these questions are available using a chat query interface.

For the experiment we used AACT clinical trial database latest PostgreSQL download or you can connect with the server hosted at their server . Use the login credentials to connect using pg-admin.

Imagine you want to use this natural language “identify the top 5 organ systems affected by serious adverse events in completed clinical trials.” Here’s how the database Chat Query Engine can help:

initial_question = "Find the top 5 organ systems affected by serious adverse events in completed studies."
response_content = ask_dbchat(initial_question, config)
organ_system  adverse_event_count
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -
Infections and infestations 476210
Gastrointestinal disorders 315092
Nervous system disorders 224658
Cardiac disorders 213832
Injury, poisoning and procedural complications 212676

Identify the top 5 industry sponsors with the highest number of recruiting Phase 1 clinical trials that started in 2024. Include the count of such trials for each sponsor in your results.

1. **Boehringer Ingelheim** with 18 trials
2. **Eli Lilly and Company** with 14 trials
3. **Novo Nordisk A/S** with 12 trials
4. **Merck Sharp & Dohme LLC** with 9 trials
5. **AstraZeneca** with 8 trials

Identify top 10 facility contacts and facilities which does the maximum number of trials ? identify names not ‘Site Public Contact’ Use ‘ilike’ for word matching. We use the ilike logic as the query engine thinks the data is very structured but there are full text columns and also text with Capital letters to avoid such things sql have a nice ilike keywords which can match capital and small letters easily. We are telling agent to retrieve those queries as well.

The top 10 facility contacts and facilities with the maximum number of trials, 
excluding those named 'Site Public Contact', are as follows:

1. US GSK Clinical Trials Call Center at GSK Investigational Site — 4260 trials
2. EU GSK Clinical Trials Call Centre at GSK Investigational Site — 4259 trials
3. For more information at the NIH Clinical Center contact Office of Patient Recruitment (OPR) at National Institutes of Health Clinical Center — 343 trials
4. Study Manager at ShORe Investigational Site — 148 trials
5. Clinical Trials Referral Office at Mayo Clinic in Rochester — 132 trials
6. For more information at the NIH Clinical Center contact National Cancer Institute Referral Office at National Institutes of Health Clinical Center — 109 trials
7. CYTE Global at Anthos Investigate Site — 107 trials
8. Study Coordinator at 89bio Clinical Study Site — 101 trials
9. Patrick LACARIN at CHU Clermont-Ferrand — 101 trials
10. Study Coordinator at Nicox Clinical Site — 88 trials.

Identify 5 drugs that were stopped or terminated in trials for safety concerns in 2023 ?

drug_name       trial_count
------------ -------------
Placebo 34
Gantenerumab 5
Paclitaxel 4
Carboplatin 3
Cisplatin 3

Removing placebo’s : just query remove the placebos and show the results

drug_name        trial_count
------------- -------------
Gantenerumab 5
Paclitaxel 4
Carboplatin 3
Dexamethasone 3
Cisplatin 3

Who are the top 10 principal investigators and its affiliations which does the maximum number of trials for cardiac related problems use synonyms for cardiac disorders ? Use ‘ilike’ for word matching.

principal_investigator                      affiliation                                                                                 trial_count
------------------------------------------ ---------------------------------------------------------------------------------------- -------------
David Kandzari, MD Piedmont Heart Institute 38
Joo Myung Lee, MD, MPH, PhD Samsung Medical Center 25
Florim Cuculi, MD Luzerner Kantonsspital 18
Bonaventure Yiu Ming IP, MB ChB Chinese University of Hong Kong 18
Robert J Lederman, M.D. National Heart, Lung, and Blood Institute (NHLBI) 15
Silvia Ulrich, Prof. University of Zurich 15
John L Sievenpiper, MD,PhD,FRCPC University of Toronto 15
Bon-Kwon Koo, MD, PhD Seoul National University Hospital 14
Ada Tang, PhD McMaster University 14
Christopher Chen, BMBCh, MRCP, FAMS, FRCPE Departments of Pharmacology and Psychological Medicine, National University of Singapore 14

The query which is automatically generated with cardiac semantics is :

SELECT
o.name AS principal_investigator,
o.affiliation AS affiliation,
COUNT(s.nct_id) AS trial_count
FROM
conditions c
INNER JOIN
studies s ON c.nct_id = s.nct_id
INNER JOIN
overall_officials o ON s.nct_id = o.nct_id AND o.role ILIKE '%PRINCIPAL_INVESTIGATOR%'
WHERE
(c.name ILIKE '%stroke%' OR c.name ILIKE '%cardiac%' OR c.name ILIKE '%heart%' OR c.name ILIKE '%myocardial%' OR
c.name ILIKE '%coronary%' OR c.name ILIKE '%arrhythmia%' OR c.name ILIKE '%angina%' OR c.name ILIKE '%arterial%' OR
c.name ILIKE '%aorta%' OR c.name ILIKE '%valve%' OR c.name ILIKE '%ischemic%' OR c.name ILIKE '%vascular%' OR
c.name ILIKE '%hypertension%' OR c.name ILIKE '%heart attack%' OR c.name ILIKE '%congestive heart failure%' OR
c.name ILIKE '%atrial fibrillation%' OR c.name ILIKE '%cardiovascular%') AND
s.study_first_submitted_date >= '2015-01-01' AND
s.study_first_submitted_date <= '2024-12-31' AND
o.name IS NOT NULL AND
o.name NOT ILIKE '%Site Public Contact%'
GROUP BY
o.name, o.affiliation
ORDER BY
trial_count DESC
LIMIT 10

What kind of conditions Ki Hong Choi, MD works on at the Samsung Medical Center ?

condition
------------------------------
Angina Pectoris
Coronary Artery Disease
Coronary Microvascular Disease
Ischemic Heart Disease

And something like this as well , Where does Novartis does clinical trials , give me top 10 PRINCIPAL INVESTIGATOR , facility name and how many trials they have done for Novartis ?

principal_investigator    facility_name                                            trial_count
------------------------ ----------------------------------------------------- -------------
Irina Murakhovskaya Montefiore Medical Center . 4
Jacob A. Aelion West Tennessee Research Institute 4
Neetu Talreja Treasure Valley Medical Research 4
Guillermo Valenzuela Integral Rheumatology and Immunology Specialists IRIS 3
Eric Boren Kern Research 3
Felton Anderson Anderson Medical Research Main Center 3
Daniel Soteres Asthma and Allergy Associates P C 3
Angel Carrasco Homestead Associates in Research Inc 3
Charles Bane STAT Research Inc Premier Clin Res LLC STAT Res 3
Andrew Smith Allergy Associates of Utah 3

Find the top 3 names as role principal investigators who have conducted the most industry-sponsored, interventional drug trials from 2013 to 2024.

principal_investigator      trial_count
------------------------ -------------
Study Director 11
Jin Li 10
Yuankai Shi 9

This query requires tuning as the query when examined uses all_intervention_types table with name = ‘Drug’ which is incorrect , better to use intervention table , again it can be suggested on chat as to use th intervention table which increased the count to

principal_investigator trial_count
- - - - - - - - - - - - - - - - - - -
Study Director 23
Milko Radicioni, MD 17
Li Zhang, MD 14

Here a little different case where you have Study Director as principal investigator name in output, actually inside AACT there exist a name Study Director so it gets it . One can avoid it . One can ask “don’t include Study Director in the output” , then

principal_investigator trial_count
- - - - - - - - - - - - - - - - - - -
Milko Radicioni, MD 17
Li Zhang, MD 14
Yuankai Shi 14

Give me 10 names of intervention drugs and industry sponsors for which trials are stopped or terminated or for the cause of death ?

intervention_drug           industry_sponsor
-------------------------- ----------------------------------
177Lu-PSMA-617 Endocyte
5-Azacitidine GlaxoSmithKline
5-fluorouracil Hoffmann-La Roche
5-Fluorouracil GlaxoSmithKline
Acetylsalicylic Acid (ASA) Bayer
Ad26.Mos4.HIV Janssen Vaccines & Prevention B.V.
Adhansia XR Purdue Pharma LP
adriamycin Janssen-Cilag International NV
Aducanumab (BIIB037) Biogen
Afatinib Boehringer Ingelheim

If you enjoyed reading this, and/or want to be kept in the loop about the next blog, follow me on Medium.

Feel free to connect with me on LinkedIn and if you feel this can be revolutionary in your industry research/job implementation.

if you feel to leave some tips send it here .

--

--

Abhik Seal
Abhik Seal

Written by Abhik Seal

Data Science / Cheminformatician x-AbbVie , I try to make complicated things looks easier and understandable www.linkedin.com/in/abseal/

No responses yet