Nanyang Technological University Introduces EHRStruct Benchmark for LLM Electronic Health Record Processing

Dr. Aurora Chen

Researchers at Nanyang Technological University have developed EHRStruct, a benchmark designed to evaluate how large language models (LLMs) process structured electronic health records (EHRs). The benchmark comprises 11 core tasks with 2,200 samples, organized by clinical scenario, cognitive level, and functional type.

The study's findings indicate that general-purpose LLMs tend to outperform models trained specifically for medical applications. Performance was stronger on data-driven tasks, and both input format and fine-tuning method significantly influenced outcomes. Building on these insights, the team proposed the EHRMaster framework, which, when paired with Google's Gemini models, surpassed existing models. The research has been accepted as an Oral paper for the AAAI 2026 Main Technical Track.

Electronic Health Records are central to medical systems, providing comprehensive clinical information for patient diagnosis, testing, medication, vital sign monitoring, and disease management. As LLMs are increasingly applied in healthcare, their ability to effectively understand and process these structured records to assist clinical decision-making has become a critical area of development in medical artificial intelligence.

The EHRStruct benchmark, co-developed by computer scientists and medical experts, offers a comprehensive framework for evaluating LLMs. It is organized hierarchically by clinical scenario, cognitive level, and functional category, covering 11 tasks with 2,200 standardized samples. This framework aims to provide a unified, rigorous, and interpretable method for assessing the controllability, reliability, and clinical applicability of medical LLMs.

The research team conducted extensive evaluations on 20 mainstream LLMs and 11 advanced enhancement methods using EHRStruct. The EHRMaster framework, in combination with Gemini, demonstrated superior performance in processing structured EHRs.

A related initiative, the EHRStruct 2026 - LLM Structured EHR Challenge, has also been launched. The challenge provides a standardized platform for researchers to evaluate LLMs' capabilities in processing structured EHRs and serves as a common reference point for reporting experimental results. The leaderboard is available on Codabench.

Task Definition and Key Findings

EHRStruct categorizes its 11 tasks by context type (data-driven vs. knowledge-driven) and cognitive level (understanding vs. reasoning). These are further grouped into six functional categories: information retrieval, data aggregation, arithmetic calculation, clinical identification, diagnostic evaluation, and treatment planning.
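To make the taxonomy concrete, the sketch below represents tasks as small descriptors indexed by the three axes. The category names follow the article; the individual task names and the field layout are illustrative assumptions rather than the benchmark's actual task list.

```python
from dataclasses import dataclass

# Hypothetical representation of the EHRStruct taxonomy described above.
# Category names follow the article; individual task names are illustrative.

@dataclass(frozen=True)
class TaskSpec:
    name: str
    context: str          # "data-driven" or "knowledge-driven"
    cognitive_level: str  # "understanding" or "reasoning"
    function: str         # one of the six functional categories

FUNCTIONAL_CATEGORIES = [
    "information retrieval",
    "data aggregation",
    "arithmetic calculation",
    "clinical identification",
    "diagnostic evaluation",
    "treatment planning",
]

# Illustrative task entries (the benchmark's real task list may differ).
TASKS = [
    TaskSpec("lab value lookup", "data-driven", "understanding", "information retrieval"),
    TaskSpec("medication count", "data-driven", "reasoning", "arithmetic calculation"),
    TaskSpec("diagnosis assessment", "knowledge-driven", "reasoning", "diagnostic evaluation"),
]

def tasks_by_axis(context: str, level: str) -> list[TaskSpec]:
    """Filter tasks along the clinical-scenario and cognitive-level axes."""
    return [t for t in TASKS if t.context == context and t.cognitive_level == level]

if __name__ == "__main__":
    for task in tasks_by_axis("data-driven", "reasoning"):
        print(task.name, "->", task.function)
```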

Key findings from the systematic evaluation of various LLMs include:

  • General LLMs vs. Medical-Specific Models: General-purpose LLMs performed better on structured EHR tasks than medical domain models, with closed-source commercial models, particularly the Gemini series, achieving the best results.

  • Performance in Data-Driven Tasks: LLMs showed more stable and superior performance in data-driven tasks compared to those requiring extensive medical knowledge.

  • Impact of Input Format: Natural language descriptions were more effective for data-driven reasoning tasks, while graph-structured representations suited data-driven understanding tasks. For knowledge-driven tasks, no single input format consistently improved performance.

  • Few-Shot Learning: Few-shot examples generally improved LLM performance, with 1-shot and 3-shot settings often outperforming 5-shot (a minimal sketch of how such prompts are assembled follows this list).

  • Multi-Task vs. Single-Task Fine-Tuning: While both approaches improved model capabilities, multi-task fine-tuning yielded larger performance gains.

  • Context-Dependent Enhancement Methods: Non-medical enhancement methods performed poorly on knowledge-driven tasks, and medical-specific methods had limitations in data-driven tasks.
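As a rough illustration of the few-shot finding above, the following sketch assembles a k-shot prompt by prepending worked exemplars to a test question. The exemplar pool, record serialization, and prompt wording are assumptions for illustration only, not the prompts used in EHRStruct.

```python
# Minimal sketch of assembling a k-shot prompt for a structured-EHR question.
# The exemplars and their formatting are illustrative assumptions.

EXEMPLARS = [
    ("Record: glucose=180 mg/dL | hba1c=8.1 %\nQuestion: Is the glucose above 140 mg/dL?",
     "Yes"),
    ("Record: heart_rate=64 bpm | spo2=98 %\nQuestion: What is the heart rate?",
     "64 bpm"),
    ("Record: creatinine=2.4 mg/dL | egfr=28\nQuestion: Is the eGFR below 60?",
     "Yes"),
]

def build_k_shot_prompt(query: str, k: int) -> str:
    """Prepend k worked exemplars to the test query (k = 0 gives zero-shot)."""
    blocks = [f"{q}\nAnswer: {a}" for q, a in EXEMPLARS[:k]]
    blocks.append(f"{query}\nAnswer:")
    return "\n\n".join(blocks)

if __name__ == "__main__":
    test_query = ("Record: sodium=129 mmol/L | potassium=4.1 mmol/L\n"
                  "Question: Is the sodium below 135 mmol/L?")
    print(build_k_shot_prompt(test_query, k=1))
```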

Overall Design and Construction of EHRStruct

The construction of EHRStruct involved four stages: task synthesis, task system construction, task sample extraction, and evaluation process setup. This collaborative effort between medical experts and computer researchers aimed to create a structured EHR evaluation system that addresses clinical needs across various scenarios and cognitive complexities.

Task Synthesis: Initial task settings were refined by computer researchers and validated by medical experts for clinical relevance. Tasks such as clinical identification and treatment planning, while less explored in structured EHRs, were included due to their practical significance in unstructured EHR work. Other tasks like information retrieval, data aggregation, arithmetic calculation, and diagnostic evaluation represent common LLM reasoning patterns in structured EHRs.

Task System Construction: The tasks are organized along three axes: clinical scenario (data-driven vs. knowledge-driven), cognitive level (understanding vs. reasoning), and functional category (six types). This classification system reflects both clinical intent and reasoning complexity.

Task Sample Extraction: Evaluation samples were created using two data sources: Synthea, which provides synthetic, privacy-free medical records suitable for controlled scenarios, and the eICU Collaborative Research Database, which contains real structured data from ICU environments. A total of 2,200 annotated samples were generated, with GPT-4o creating question-answer pairs based on task definitions and data structures.
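The following is a minimal sketch of how a question-answer pair might be generated from a structured record with GPT-4o via the OpenAI Python SDK. The prompt wording, record schema, and helper function are assumptions for illustration, not the authors' released pipeline.

```python
# Sketch: generate one QA pair grounded in a structured record using GPT-4o.
# Prompt text, record fields, and function names are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_qa_pair(record: dict, task_definition: str) -> dict:
    """Ask the model for one QA pair answerable directly from the record."""
    prompt = (
        "You are building evaluation samples for a structured-EHR benchmark.\n"
        f"Task definition: {task_definition}\n"
        f"Structured record (JSON): {json.dumps(record)}\n"
        "Return a JSON object with keys 'question' and 'answer', where the "
        "answer is verifiable directly from the record."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example call with a toy Synthea-style record (fields are illustrative).
sample = generate_qa_pair(
    {"patient_id": "p001", "labs": [{"name": "glucose", "value": 112, "unit": "mg/dL"}]},
    "arithmetic calculation over laboratory values",
)
print(sample)
```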

Evaluation Process: EHRStruct established a unified experimental process for systematic evaluation. It covers 20 LLMs, including general and medical domain models. For each task, 200 question-answer samples are used. All samples are converted into four input formats: flattened text, special character-separated representation, graph-structured representation, and natural language description. Evaluations use single-turn generation and uniform hyperparameters to ensure fair comparisons. The benchmark also supports in-depth experiments on specific models, including few-shot prompting and fine-tuning. It reproduces and compares 11 structured data reasoning methods and introduces the EHRMaster method.
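To illustrate the four input formats, the sketch below serializes one toy record four ways. The exact conventions EHRStruct uses (delimiters, triple notation, phrasing) are not reproduced here; these serializers are assumptions for illustration only.

```python
# Hypothetical serializers for the four input formats named above: flattened
# text, special character-separated, graph-structured, and natural language.

record = {
    "patient": "p001",
    "heart_rate": 88,
    "glucose_mg_dl": 112,
    "medication": "metformin",
}

def flattened_text(rec: dict) -> str:
    # Key-value pairs joined into a single line.
    return " ".join(f"{k}={v}" for k, v in rec.items())

def char_separated(rec: dict) -> str:
    # Fields separated by a special delimiter character.
    return " | ".join(f"{k}: {v}" for k, v in rec.items())

def graph_structured(rec: dict) -> str:
    # Record rendered as (subject, relation, value) triples.
    subject = rec["patient"]
    return "\n".join(f"({subject}, {k}, {v})" for k, v in rec.items() if k != "patient")

def natural_language(rec: dict) -> str:
    # A short prose description of the same fields.
    return (
        f"Patient {rec['patient']} has a heart rate of {rec['heart_rate']} bpm, "
        f"a glucose level of {rec['glucose_mg_dl']} mg/dL, and is taking {rec['medication']}."
    )

for fmt in (flattened_text, char_separated, graph_structured, natural_language):
    print(fmt.__name__, "->", fmt(record))
```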

Experimental Results

Zero-Shot Performance: Researchers tested the zero-shot performance of various LLMs on the Synthea dataset. General LLMs significantly outperformed medical-specific models in most tasks, particularly knowledge-driven ones, where medical models often failed to produce effective output. Closed-source commercial models, such as the Gemini series, ranked highest overall. Data-driven tasks generally showed better performance, while knowledge-driven tasks, especially diagnostic evaluation and treatment planning, remained challenging.

Relative Gain Comparison of 11 SOTA Methods: An evaluation of 11 representative state-of-the-art methods revealed a performance gap. General methods excelled at data-driven logical and numerical reasoning but performed poorly on clinical knowledge tasks. Conversely, medical methods, while proficient in knowledge-driven tasks like disease prediction, struggled with general data scenarios. This suggests a need for a unified solution that balances structured logical reasoning and clinical knowledge integration.

Benchmark Performance of EHRMaster: The EHRMaster framework, when paired with various Gemini models, demonstrated strong performance. It improved data-driven tasks, achieving 100% accuracy in some arithmetic reasoning scenarios, and showed performance improvements for challenging knowledge-driven tasks, highlighting its effectiveness in structured EHR reasoning.

The paper's first author is Yang Xiao, a Ph.D. student at Nanyang Technological University's School of Computer Science and Engineering. The corresponding author, Dr. Zhao Xuejiao, conducted this work as a Wallenberg-NTU Presidential Postdoctoral Fellow at the LILY Research Centre and is currently a Research Scientist at the Alibaba-NTU Global e-Sustainability CorpLab (ANGEL). The third author is Shen Zhiqi, a Senior Lecturer and Senior Research Fellow at the School of Computer Science and Engineering, Nanyang Technological University.