
Experimental Setup for Evaluating LLM Performance on Biomedical Syllogistic Tasks

by Large Models, December 13th, 2024

Too Long; Didn't Read

The experimental setup includes five modules for processing syllogistic tasks and evaluating LLM performance. The tests used 11,200 instances across 28 syllogistic schemes, with parameters such as the number of distractors and the entity type varied for a comprehensive analysis. The experiments were run on a system with an AMD EPYC CPU and NVIDIA A100 GPUs.
  1. Abstract and Introduction
  2. SylloBio-NLI
  3. Empirical Evaluation
  4. Related Work
  5. Conclusions
  6. Limitations and References


A. Formalization of the SylloBio-NLI Resource Generation Process

B. Formalization of Tasks 1 and 2

C. Dictionary of gene and pathway membership

D. Domain-specific pipeline for creating NL instances and E. Accessing LLMs

F. Experimental Details

G. Evaluation Metrics

H. Prompting LLMs - Zero-shot prompts

I. Prompting LLMs - Few-shot prompts

J. Results: Misaligned Instruction-Response

K. Results: Ambiguous Impact of Distractors on Reasoning

L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge

M. Supplementary Figures and N. Supplementary Tables

F. Experimental Details

The entire experimental setup was implemented as a Python package consisting of five modules:


• pathways: defines the ontological relations and operations for biological pathways, as well as the logic to retrieve and transform data from the domain ontology.


• logic2nl: translates logic formulas into NL statements (premises, conclusions), using parameterised scheme templates.


• llm: provides access to LLMs, and facilitates logging.


• experiments: defines all test logic, parameterisation and metrics for the experiments.


• main: orchestrates all the tests and aggregates results.


The LLMs were loaded and run using the HuggingFace transformers library (v4.43.3). The prompts were passed directly to each model's generate method, with sampling disabled.
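As an illustration, deterministic generation with transformers can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: the helper name run_batch, the token budget, the left padding and the pad-token fallback are all hypothetical choices; the text only states that generate was called without sampling.

```python
from typing import List

# do_sample=False disables sampling. The token budget is an assumed value,
# not one reported in the text.
GENERATION_KWARGS = {"do_sample": False, "max_new_tokens": 64}


def run_batch(prompts: List[str], model_name: str) -> List[str]:
    """Generate one completion per prompt with sampling disabled."""
    # Imported lazily so this module can be loaded without the heavy deps.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Left padding keeps each prompt adjacent to its generated tokens.
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    if tokenizer.pad_token is None:
        # Many causal LMs ship without a pad token; reuse EOS for batching.
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        output_ids = model.generate(**inputs, **GENERATION_KWARGS)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```

With sampling off, repeated runs on the same prompt should yield the same completion, which makes responses easier to compare across settings.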


Parameters were set as follows:


• Number of premises = 2: the number of valid premises per instance (factual argumentative text).


• Max distractors = 5: maximum number of distractors to be added per instance.


• Subset size = 200: maximum number of positive and negative instances for each scheme.


• Batch size = 20: number of instances evaluated simultaneously.


• ICL: whether the in-context learning (ICL) prompt is used.


• Model: the LLM to be evaluated.
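The parameter list above could be captured in a small configuration object. The sketch below is hypothetical: the class name, field names and the batches helper are illustrative and do not come from the authors' package; only the default values are taken from the text.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class ExperimentConfig:
    model: str                 # identifier of the LLM under evaluation
    icl: bool = False          # True: ICL prompt, False: zero-shot prompt
    num_premises: int = 2      # valid premises per instance
    max_distractors: int = 5   # upper bound on distractors per instance
    subset_size: int = 200     # cap on positive and on negative instances per scheme
    batch_size: int = 20       # instances evaluated simultaneously


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# "example-model" is a placeholder, not one of the evaluated models.
config = ExperimentConfig(model="example-model", icl=True)
chunks = list(batches([f"prompt {i}" for i in range(45)], config.batch_size))
print([len(c) for c in chunks])  # [20, 20, 5]
```

Freezing the dataclass keeps a run's settings immutable, so the same object can be logged alongside the results it produced.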


For both Task 1 and Task 2, each model was analyzed across all 28 syllogistic schemes, with responses evaluated from a total of 11,200 instances (28 schemes × (200 positive + 200 negative)). The exceptions were: Generalized Modus Ponens - complex predicates (18 instances), Hypothetical Syllogism 1 - base (202 instances) and Generalized Contraposition - negation (202 instances). In these cases, the ground-truth data limits the number of distinct, factually possible instances. For both tasks, this amounted to a total of 211,840 instances. Performance was compared between the zero-shot (ZS) and in-context learning (ICL) settings, leading to a total of 423,680 prompts.


When examining the impact of distractors, responses were analyzed across five variants of each argument scheme, reflecting different numbers of irrelevant premises (from 1 to 5 distractors). For each scheme and model, this resulted in 1,000 responses (200 prompts per scheme × 5 variants), with the exception of the aforementioned cases. Across all schemes, this totaled 26,480 responses for one model, and 211,840 responses were analyzed for all models for each number of distractors, totaling 1,059,200 responses. Again, comparisons were made between ZS and ICL settings, doubling the previous number.
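One way to build the five distractor variants of an instance is sketched below. The text does not specify how distractors are selected or where they are placed among the premises, so the random sampling, the shuffling, the fixed seed and the function name are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence


def distractor_variants(
    premises: Sequence[str],
    distractor_pool: Sequence[str],
    max_distractors: int = 5,
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Return one premise list per distractor count k = 1..max_distractors."""
    if len(distractor_pool) < max_distractors:
        raise ValueError("distractor pool is smaller than max_distractors")
    rng = random.Random(seed)
    variants: Dict[int, List[str]] = {}
    for k in range(1, max_distractors + 1):
        # Assumed: k distinct irrelevant premises are drawn from the pool.
        mixed = list(premises) + rng.sample(list(distractor_pool), k)
        rng.shuffle(mixed)  # assumed: distractors are interleaved at random
        variants[k] = mixed
    return variants


# Placeholder statements, not taken from the dataset.
valid = ["valid premise A", "valid premise B"]
pool = [f"irrelevant premise {i}" for i in range(10)]
variants = distractor_variants(valid, pool)
print({k: len(v) for k, v in variants.items()})  # {1: 3, 2: 4, 3: 5, 4: 6, 5: 7}
```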


For both Task 1 and Task 2, performance was compared using actual gene entity names versus synthetic names across two selected schemes. Each scheme was tested with 400 instances (200 positive, 200 negative), resulting in 12,800 instances per task (400 instances × 2 schemes × 2 entity types × 8 models) for comparing model performance on factual and synthetic gene names.
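The counts quoted in this section can be sanity-checked with a few lines of arithmetic. The sketch below checks only relations that follow directly from the quoted figures; it does not try to re-derive the totals affected by the three schemes with reduced instance counts. The constant names are illustrative.

```python
SCHEMES = 28
PER_SCHEME = 200 + 200     # positive + negative instances per scheme
MODELS = 8                 # as stated in the entity-name comparison
DISTRACTOR_LEVELS = 5      # 1 to 5 distractors

# Nominal instances per model and task, before the three exceptions.
nominal_per_model = SCHEMES * PER_SCHEME
assert nominal_per_model == 11_200

# Running both the ZS and ICL settings doubles the quoted instance total.
assert 211_840 * 2 == 423_680

# Distractor study: per-scheme, all-model and overall figures.
assert 200 * DISTRACTOR_LEVELS == 1_000
assert 26_480 * MODELS == 211_840
assert 211_840 * DISTRACTOR_LEVELS == 1_059_200

# Entity-name comparison: 400 instances x 2 schemes x 2 entity types x 8 models.
entity_total = PER_SCHEME * 2 * 2 * MODELS
assert entity_total == 12_800

print(nominal_per_model, entity_total)  # 11200 12800
```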


The experiments were run on a computer with an AMD EPYC 7413 24-Core CPU, 128 GB of available RAM and 2 × NVIDIA A100-SXM4-80GB GPUs.


Authors:

(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;

(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;

(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;

(4) Marco Valentino, Idiap Research Institute, Switzerland;

(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.


This paper is available on arXiv under a CC BY-NC-SA 4.0 license.