Computational and Structural Biotechnology Journal

Extraction without Hallucination.

General-purpose LLMs struggle with biomedical hallucination and dynamic knowledge integration. MedDiscover introduces a two-tier RAG benchmark (Gold expert-curated & Silver synthetic) specifically instantiated for metabolomics, providing transparent, reproducible evaluation of retrieval augmentation.

Interactive Interface

Try MedDiscover Live

Experience our domain-specific retrieval architecture powered by MedCPT embeddings. Upload a metabolomics document or test our pre-indexed metabolic-disorder corpus right from your browser via ZeroGPU.

huggingface.co/spaces/VatsalPatel18/MedDisover-space

Methodology

Pipeline Architecture

The Challenge in Metabolomics

General-purpose large language models (LLMs) such as GPT-4 face substantial limitations in specialized biomedical subfields: they hallucinate and mishandle domain-specific terminology. MedDiscover addresses this by introducing a domain-specific RAG evaluation benchmark tailored to metabolomics and metabolic-disorder literature (ICD-10 E70–E88). We focus on reproducible assessment of retrieval augmentation, embedding choices, and decoder sensitivity.

Gold Dataset

Ten high-impact papers on Gaucher disease, NAFLD, and Type 1 diabetes, with 30 expert-curated QA pairs whose reference answers can be traced back to the source text. Used to test clinical correctness.

Human Validated

Silver Benchmark

A scaled stress-test spanning 100 papers with ~600 synthetic, retrievability-filtered QA pairs (300 Ada, 300 MedCPT). Used to evaluate retrieval robustness at scale.

Automated At Scale

MedDiscover Semantic Retrieval Flow

Domain Indexing

Medical texts covering ICD-10 E70–E88 are split into 500-token chunks and embedded using MedCPT.
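The chunking step can be illustrated with a minimal sketch. It assumes whitespace-separated words as "tokens" and no overlap between chunks; the tokenizer MedDiscover actually uses is not stated here, and `chunk_tokens` is a hypothetical helper name.

```python
from typing import List


def chunk_tokens(text: str, chunk_size: int = 500) -> List[str]:
    """Split text into consecutive chunks of at most `chunk_size` tokens.

    Assumption: a token is a whitespace-separated word. A model tokenizer
    would place chunk boundaries differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


if __name__ == "__main__":
    document = " ".join(f"word{i}" for i in range(1200))
    for chunk in chunk_tokens(document):
        print(len(chunk.split()))
```

Each chunk would then be passed to the embedding model and stored alongside its source-document identifier.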

FAISS Vector Search

Fast nearest-neighbor retrieval isolates the Top-K most semantically relevant biomedical contexts.
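To make the Top-K step concrete, the sketch below is a pure-Python, brute-force stand-in for the exact search a flat vector index performs. It is not the FAISS API. Cosine similarity as the scoring function, the toy 2-dimensional vectors, and the names `normalize` and `top_k` are all assumptions for illustration.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def top_k(query: Sequence[float],
          corpus: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (chunk_index, cosine_similarity) for the k best-matching chunks."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(doc))), idx)
        for idx, doc in enumerate(corpus)
    )
    # nlargest keeps only the k highest-scoring (score, index) pairs.
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings
    print(top_k([1.0, 0.1], chunk_vectors, k=2))
```

A vector library replaces the linear scan with an optimized index, but the ranking it returns for an exact index follows the same logic.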

Grounded Generation

GPT-4o, or the model hosted in the HF Space, synthesizes the answer from the retrieved contexts only, which constrains the output and reduces hallucination.
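One common way to bind a decoder to retrieved contexts is through the prompt itself. The sketch below shows that idea under stated assumptions: the instruction wording, the fallback sentence, and the function name `build_grounded_prompt` are hypothetical and are not taken from MedDiscover.

```python
from typing import List


def build_grounded_prompt(question: str, contexts: List[str]) -> str:
    """Assemble a prompt that restricts the answer to the given contexts.

    The wording is illustrative only; the real system prompt is not known.
    """
    numbered = "\n\n".join(
        f"[{i}] {context.strip()}" for i, context in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the numbered contexts below. "
        "If the contexts do not contain the answer, reply "
        '"Not found in the provided contexts."\n\n'
        f"Contexts:\n{numbered}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_grounded_prompt(
        "Which enzyme is deficient in Gaucher disease?",
        ["Gaucher disease results from deficient glucocerebrosidase activity."],
    ))
```

The resulting string would be sent to the decoder. Prompting alone does not guarantee grounded output, which is why faithfulness is measured separately in the benchmark.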

Performance Metrics

Benchmark Results

Results from the Silver benchmark comparing OpenAI's general-purpose Ada-002 embeddings against the domain-specific MedCPT models, utilizing a GPT-4o decoder.

MedCPT Correctness: 0.980
Ada-002 Correctness: 0.968
P-Value (Correctness): 5.72e-11
Cliff's δ Effect Size: 0.303
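For readers unfamiliar with Cliff's δ, it is the probability that a score from one group exceeds a score from the other, minus the probability of the reverse. The sketch below implements that standard definition on made-up scores. How the reported value was computed (for example, which pairs were compared) is not stated here, so treat this as an illustration rather than the authors' procedure.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.

    Ranges from -1 (every x below every y) to +1 (every x above every y).
    """
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


if __name__ == "__main__":
    # Toy per-question correctness scores, not benchmark data.
    group_a = [0.9, 1.0, 1.0]
    group_b = [0.8, 0.9, 1.0]
    print(round(cliffs_delta(group_a, group_b), 3))
```

Because it counts pairwise wins and losses instead of comparing means, δ can be moderate even when the mean scores of the two groups are close.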

Silver Benchmark Statistics (RAGAS Metrics)

Means with bootstrap 95% confidence intervals across ~600 synthetic queries.

Statistically significant difference: Answer Correctness only.
Metric | Ada-002 [95% CI] | MedCPT [95% CI] | P-Value
Answer Correctness | 0.968 [0.960, 0.976] | 0.980 [0.974, 0.985] | 5.72×10⁻¹¹
Faithfulness | 0.703 [0.667, 0.740] | 0.687 [0.650, 0.725] | 0.85
Context Recall | 0.743 [0.702, 0.779] | 0.722 [0.682, 0.762] | 0.50
Context Precision | 0.923 [0.890, 0.953] | 0.908 [0.875, 0.937] | 0.49
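The bracketed intervals are bootstrap 95% confidence intervals around each mean. A minimal sketch of a percentile bootstrap follows. The number of resamples, the fixed seed, the percentile method itself, and the name `bootstrap_ci` are assumptions, and the scores are toy values.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(values: Sequence[float],
                 n_resamples: int = 2000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    resampled_means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(values) / n, lower, upper


if __name__ == "__main__":
    scores = [0.9, 1.0, 0.95, 1.0, 0.85, 1.0, 0.97, 0.99] * 10  # toy data
    mean, low, high = bootstrap_ci(scores)
    print(f"{mean:.3f} [{low:.3f}, {high:.3f}]")
```

Applied to the per-question metric scores for one embedding model, this yields an interval in the same "mean [lower, upper]" form as the table.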

Contributors

Research Affiliation


IU Internationale Hochschule GmbH, Germany

Computational and Structural Biotechnology Journal (CSBJ)