CSBJ Journal Publication

Extraction without Hallucination.

General-purpose LLMs hallucinate on biomedical content and struggle to integrate rapidly evolving domain knowledge. MedDiscover introduces a two-tier RAG benchmark (expert-curated Gold and synthetic Silver), instantiated for metabolomics, that provides transparent, reproducible evaluation of retrieval augmentation.

Interactive Interface

Try MedDiscover Live

Experience our domain-specific retrieval architecture powered by MedCPT embeddings. Upload a metabolomics document or test our pre-indexed metabolic-disorder corpus right from your browser via ZeroGPU.

huggingface.co/spaces/VatsalPatel18/MedDisover-space

Methodology

Pipeline Architecture

The Challenge in Metabolomics

General-purpose Large Language Models (LLMs) like GPT-4 face substantial limitations in specialized biomedical subfields: they are prone to hallucination and struggle with dense, fast-evolving terminology. MedDiscover addresses this by introducing a domain-specific RAG evaluation benchmark tailored to metabolomics and metabolic-disorder literature (ICD-10 E70–E88). We focus on reproducible assessment of retrieval augmentation, embedding choices, and decoder sensitivity.

Gold Dataset

Ten high-impact papers on Gaucher disease, NAFLD, and Type 1 diabetes, with 30 expert-curated QA pairs and human-traceable reference answers to rigorously test clinical correctness.

Human Validated

Silver Benchmark

A scaled stress-test spanning 100 papers with ~600 synthetic, retrievability-filtered QA pairs (300 Ada, 300 MedCPT). Used to evaluate retrieval robustness at scale.
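The retrievability filter described above can be sketched as follows. This is a minimal illustration, not the paper's code: `retrievability_filter` and the retriever callable are hypothetical names, and the assumed criterion is that a synthetic QA pair is kept only if the chunk it was generated from appears in the top-k retrieved for its own question.

```python
def retrievability_filter(qa_pairs, retrieve_top_k, k=5):
    """Keep only QA pairs whose source chunk is retrievable from its question.

    qa_pairs: list of dicts with 'question' and 'source_chunk_id' keys.
    retrieve_top_k: callable(question, k) -> list of chunk ids (hypothetical).
    """
    kept = []
    for qa in qa_pairs:
        top_ids = retrieve_top_k(qa["question"], k)
        if qa["source_chunk_id"] in top_ids:
            kept.append(qa)  # question is answerable via retrieval
    return kept
```

Pairs whose question cannot surface their own source chunk are dropped, so every retained Silver query is answerable in principle by the retriever under test.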

Automated At Scale

MedDiscover Semantic Retrieval Flow

Domain Indexing

Medical texts (ICD-10 E70–E88) are chunked into 500-token windows and embedded with MedCPT.
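A sliding-window chunker for this step might look like the sketch below. Whitespace tokenization and the 50-token overlap are stand-in assumptions for illustration; the real pipeline would count tokens with the MedCPT tokenizer.

```python
def chunk_tokens(text, chunk_size=500, overlap=50):
    """Split a document into overlapping ~chunk_size-token windows.

    Whitespace tokenization is a simplification; a production pipeline
    would use the embedding model's own tokenizer for token counts.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks
```

The overlap keeps sentences that straddle a window boundary intact in at least one chunk, which matters for retrieval recall.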

FAISS Vector Search

Fast nearest-neighbor retrieval isolates the Top-K most semantically relevant biomedical contexts.
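The search step is exact inner-product retrieval over normalized embeddings. The NumPy sketch below is mathematically equivalent to a FAISS `IndexFlatIP` over L2-normalized vectors (i.e., cosine similarity); the function name is illustrative.

```python
import numpy as np

def top_k_contexts(query_vec, chunk_matrix, k=5):
    """Exact cosine-similarity search over chunk embeddings.

    Equivalent to faiss.IndexFlatIP on L2-normalized vectors:
        index = faiss.IndexFlatIP(d); index.add(M); index.search(q, k)
    """
    q = query_vec / np.linalg.norm(query_vec)
    M = chunk_matrix / np.linalg.norm(chunk_matrix, axis=1, keepdims=True)
    scores = M @ q                      # cosine similarity per chunk
    idx = np.argsort(-scores)[:k]       # indices of the Top-K chunks
    return idx, scores[idx]
```

FAISS replaces the brute-force matrix product with optimized (and optionally approximate) index structures, but returns the same Top-K for a flat inner-product index.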

Grounded Generation

GPT-4o (or the hosted HF Space model) synthesizes the answer, strictly bound to the retrieved contexts to prevent hallucination.
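Grounding is enforced at the prompt level: the decoder is instructed to answer only from the retrieved passages. The exact wording below is illustrative, not the paper's prompt.

```python
def build_grounded_prompt(question, contexts):
    """Assemble a context-bound prompt for the decoder.

    The instruction to answer ONLY from the supplied contexts (and to
    abstain otherwise) is what discourages hallucinated answers.
    """
    ctx_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using ONLY the contexts below. "
        "If the contexts do not contain the answer, say so.\n\n"
        f"Contexts:\n{ctx_block}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the contexts (`[1]`, `[2]`, …) additionally lets the model cite which passage supports each claim, which helps downstream faithfulness scoring.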

Performance Metrics

Benchmark Results

Results from the Silver benchmark comparing OpenAI's general-purpose Ada-002 embeddings against the domain-specific MedCPT encoders, with a GPT-4o decoder.

0.980
MedCPT Correctness
0.968
Ada-002 Correctness
5.72e-11
P-Value (Correctness)
0.303
Cliff's δ Effect Size
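Cliff's δ is a non-parametric effect size: the probability that a MedCPT score exceeds an Ada-002 score, minus the reverse, over all pairs. A direct sketch (the O(n·m) definition, not an optimized implementation):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y).

    Ranges from -1 to +1; 0 means the two samples overlap completely.
    By a common convention, |delta| ~ 0.33 is a medium effect.
    """
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

The reported δ = 0.303 thus indicates a small-to-medium, consistent per-query advantage for MedCPT on correctness, complementing the p-value with a magnitude.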

Silver Benchmark Statistics (RAGAS Metrics)

Means with bootstrap 95% confidence intervals across ~600 synthetic queries.

Statistically Significant
Metric               Ada-002 [95% CI]        MedCPT [95% CI]         P-Value
Answer Correctness   0.968 [0.960, 0.976]    0.980 [0.974, 0.985]    5.72×10⁻¹¹
Faithfulness         0.703 [0.667, 0.740]    0.687 [0.650, 0.725]    0.85
Context Recall       0.743 [0.702, 0.779]    0.722 [0.682, 0.762]    0.50
Context Precision    0.923 [0.890, 0.953]    0.908 [0.875, 0.937]    0.49
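The bracketed intervals above are percentile-bootstrap 95% CIs over per-query scores. A minimal sketch of that procedure (function name and resample count are illustrative assumptions):

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query metric scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]           # 2.5th percentile
    hi = means[int((1 - alpha / 2) * n_boot) - 1]   # 97.5th percentile
    return statistics.fmean(scores), (lo, hi)
```

Because RAGAS scores are bounded and skewed, a percentile bootstrap gives more honest intervals than a normal approximation on ~300 queries per arm.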

Contributors

Research Team

Researchers driving the intersection of Artificial Intelligence and Computational Structural Biotechnology.

Vatsal Patel

Lead Author

Principal Research Engineer in healthcare AI, building agentic clinical workflows, biomedical RAG systems, and multi-omics modeling for translational research.


Elena Jolkver

Co-Author

Professor of Applied AI and Senior Data Scientist contributing life-sciences and metabolomics expertise across biostatistics, machine learning, and MLOps.


Anne Schwerk

Corresponding Author

Professor of Artificial Intelligence focused on responsible medical AI, combining NLP, healthcare data science, and trustworthy multimodal analytics.


IU Internationale Hochschule GmbH, Germany
Computational and Structural Biotechnology Journal (CSBJ)