CSBJ Journal Publication
Extraction without Hallucination.
General-purpose LLMs struggle with biomedical hallucination and dynamic knowledge integration. MedDiscover introduces a two-tier RAG benchmark (Gold expert-curated & Silver synthetic) specifically instantiated for metabolomics, providing transparent, reproducible evaluation of retrieval augmentation.
Interactive Interface
Try MedDiscover Live
Experience our domain-specific retrieval architecture powered by MedCPT embeddings. Upload a metabolomics document or test our pre-indexed metabolic-disorder corpus right from your browser via ZeroGPU.
Methodology
Pipeline Architecture
The Challenge in Metabolomics
General-purpose Large Language Models (LLMs) like GPT-4 face substantial limitations in specialized biomedical subfields: they are prone to hallucination and struggle with complex domain terminology. MedDiscover addresses this by introducing a domain-specific RAG evaluation benchmark tailored to metabolomics and metabolic-disorder literature (ICD-10 E70–E88). We focus on reproducible assessment of retrieval augmentation, embedding choices, and decoder sensitivity.
Gold Dataset
10 high-impact papers focusing on Gaucher disease, NAFLD, and Type 1 diabetes. Contains 30 expert-curated QA pairs with human-traceable reference answers to rigorously test clinical correctness.
Human Validated

Silver Benchmark
A scaled stress-test spanning 100 papers with ~600 synthetic, retrievability-filtered QA pairs (300 Ada, 300 MedCPT). Used to evaluate retrieval robustness at scale.
Automated At Scale

MedDiscover Semantic Retrieval Flow
Domain Indexing
Medical texts (ICD-10 E70–E88) are split into 500-token chunks and embedded using MedCPT.
FAISS Vector Search
Fast nearest-neighbor retrieval isolates the Top-K most semantically relevant biomedical contexts.
Grounded Generation
GPT-4o or a Hugging Face Space-hosted model synthesizes the answer, strictly grounded in the retrieved contexts to prevent hallucination.
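The three-step flow above can be sketched as a minimal retrieval loop. This is an illustrative sketch, not the MedDiscover implementation: random unit vectors stand in for MedCPT embeddings, a brute-force NumPy inner-product search stands in for FAISS, and the corpus, chunk size, and top-k are placeholders.

```python
import numpy as np

def chunk(text, max_tokens=500):
    """Split a document into ~max_tokens-word chunks (stand-in for token-aware chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def embed(texts, dim=768, seed=0):
    """Placeholder embedder: random unit vectors stand in for MedCPT."""
    rng = np.random.default_rng(seed)
    vecs = rng.normal(size=(len(texts), dim)).astype("float32")
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(index, chunks, query_vec, k=3):
    """Top-k inner-product search over normalized vectors (cosine similarity),
    playing the role of a FAISS flat index."""
    scores = index @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

# 1. Domain indexing: chunk and embed the corpus (placeholder documents).
corpus = ["Gaucher disease is a lysosomal storage disorder ...",
          "NAFLD is characterized by hepatic fat accumulation ..."]
chunks = [c for doc in corpus for c in chunk(doc)]
index = embed(chunks)

# 2. Vector search: isolate the top-k most similar chunks for the query.
query_vec = embed(["What causes Gaucher disease?"])[0]
hits = retrieve(index, chunks, query_vec, k=2)

# 3. Grounded generation: bind the decoder strictly to the retrieved contexts.
prompt = "Answer ONLY from the contexts below.\n" + "\n".join(c for c, _ in hits)
```

In the real pipeline the placeholder embedder would be replaced by MedCPT query/article encoders and the NumPy search by a FAISS index; the grounding prompt pattern is the part that constrains the decoder to the retrieved evidence.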
Performance Metrics
Benchmark Results
Results from the Silver benchmark comparing OpenAI's general-purpose Ada-002 embeddings against the domain-specific MedCPT models, utilizing a GPT-4o decoder.
Silver Benchmark Statistics (RAGAS Metrics)
Means with bootstrap 95% confidence intervals across ~600 synthetic queries.
| Metric | Ada-002 [95% CI] | MedCPT [95% CI] | P-Value |
|---|---|---|---|
| Answer Correctness | 0.968 [0.960, 0.976] | 0.980 [0.974, 0.985] | 5.72×10⁻¹¹ |
| Faithfulness | 0.703 [0.667, 0.740] | 0.687 [0.650, 0.725] | 0.85 |
| Context Recall | 0.743 [0.702, 0.779] | 0.722 [0.682, 0.762] | 0.50 |
| Context Precision | 0.923 [0.890, 0.953] | 0.908 [0.875, 0.937] | 0.49 |
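Bootstrap confidence intervals and significance tests like those in the table can be computed with a standard percentile bootstrap and a permutation test; the paper's exact statistical procedure may differ. A minimal sketch on synthetic per-query scores (placeholder data, not the benchmark's):

```python
import numpy as np

rng = np.random.default_rng(42)
# Placeholder per-query answer-correctness scores (NOT the benchmark data).
ada = rng.beta(30, 1, size=300)
medcpt = rng.beta(40, 1, size=300)

def bootstrap_ci(x, n_boot=5000, alpha=0.05, seed=0):
    """Mean with a percentile-bootstrap (1 - alpha) confidence interval."""
    r = np.random.default_rng(seed)
    means = r.choice(x, size=(n_boot, len(x)), replace=True).mean(axis=1)
    return x.mean(), np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

def perm_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in means."""
    r = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        r.shuffle(pooled)
        if abs(pooled[:len(a)].mean() - pooled[len(a):].mean()) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

mean, lo, hi = bootstrap_ci(medcpt)
p = perm_test(ada, medcpt)
```

With ~300 queries per embedding model, a large correctness gap yields a tiny p-value while overlapping CIs on faithfulness and context metrics yield p-values near 1, matching the pattern in the table.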
Contributors
Research Team
Researchers working at the intersection of Artificial Intelligence and Computational Structural Biotechnology.
Vatsal Patel
Lead Author
Principal Research Engineer in healthcare AI, building agentic clinical workflows, biomedical RAG systems, and multi-omics modeling for translational research.
Elena Jolkver
Co-Author
Professor of Applied AI and Senior Data Scientist contributing life-sciences and metabolomics expertise across biostatistics, machine learning, and MLOps.
Anne Schwerk
Corresponding Author
Professor of Artificial Intelligence focused on responsible medical AI, combining NLP, healthcare data science, and trustworthy multimodal analytics.
IU Internationale Hochschule GmbH, Germany
Computational and Structural Biotechnology Journal (CSBJ)