Tomas Bata University in Zlin - AI Research Lab

Internship: Benchmarking LLM-Driven Algorithm Design

A reproducible benchmark comparing LLaMEA, LLM4AD, and frontEASE under the same model endpoint, tasks, timeouts, and hardware conditions.

Project Overview

During my internship, I worked on a practical AI research project focused on Automated Algorithm Discovery. The goal was to understand how different LLM-driven frameworks behave when they are evaluated through the same benchmark pipeline.

I built a Python benchmark harness with adapters for LLaMEA, LLM4AD, and frontEASE. The harness stores raw JSON results, resumes interrupted experiments, validates generated algorithms in isolated subprocesses, and converts the final results into figures through a separate analysis notebook.
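As a rough illustration of the resume behaviour, the sketch below writes one JSON file per run and skips any run whose file already exists. The function name `run_with_resume`, the one-file-per-run layout, and the `run_one` callback are assumptions made for this example, not the harness's actual code.

```python
import json
from pathlib import Path


def run_with_resume(run_ids, run_one, results_dir):
    """Execute each run once, skipping runs that already have a result file.

    `run_one` is a hypothetical callback that returns a JSON-serialisable dict.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for run_id in run_ids:
        out_path = results_dir / f"{run_id}.json"
        if out_path.exists():
            continue  # already finished in an earlier session
        result = run_one(run_id)
        # Write to a temporary file first, then rename, so an interruption
        # never leaves a half-written result that looks complete.
        tmp_path = results_dir / f"{run_id}.json.tmp"
        tmp_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
        tmp_path.replace(out_path)
        executed.append(run_id)
    return executed
```

Restarting the same call after an interruption only executes the runs that are still missing, which is what makes long experiments safe to stop and continue.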

Key Features

Unified Benchmark Setup

All tested frameworks run through one shared benchmark setup: the same model endpoint, tasks, timeouts, and hardware conditions.

Sandboxed Validation

Generated algorithms are validated in isolated subprocesses, and invalid runs are marked clearly for analysis.

Framework Adapters

Adapters for LLaMEA, LLM4AD, and frontEASE.

Result Analysis

Reusable result analysis with notebook-generated figures.
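One way to picture the adapter layer is a small common interface that every framework wrapper implements, so the harness can loop over frameworks without knowing their internals. The names `FrameworkAdapter`, `RunResult`, `EchoAdapter`, and `run_all` are hypothetical and chosen for this sketch only; the real adapters wrap LLaMEA, LLM4AD, and frontEASE and may look quite different.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """One benchmark run in a framework-independent shape (hypothetical)."""
    framework: str
    task: str
    seed: int
    valid: bool
    score: Optional[float] = None
    error: Optional[str] = None


class FrameworkAdapter(ABC):
    """Common interface the harness calls for every framework (hypothetical)."""
    name: str

    @abstractmethod
    def run(self, task: str, seed: int, timeout_s: float) -> RunResult:
        """Run one task with one seed and return a normalised result."""


class EchoAdapter(FrameworkAdapter):
    """Stand-in adapter that only exercises the interface; it calls no framework."""
    name = "echo"

    def run(self, task, seed, timeout_s):
        return RunResult(self.name, task, seed, valid=True, score=0.0)


def run_all(adapters, tasks, seeds, timeout_s):
    """Give every adapter the same tasks, seeds, and timeout."""
    return [a.run(t, s, timeout_s) for a in adapters for t in tasks for s in seeds]
```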

Development Process

1. Planning and Research

Defined the benchmark goals, tasks, risks, framework adapters, and repeatable evaluation setup.

2. Implementation

Built the Python harness, created framework adapters, stored raw results, and added resume support for long-running experiments.

3. Benchmarking and Validation

Ran controlled experiments, checked generated-code validity, tracked failures, and compared reliability across the frameworks.

4. Analysis and Reporting

Turned the final dataset into figures and wrote the realization document, planning document, and reflection.
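The reliability comparison from the benchmarking step can be sketched as a per-framework validity rate computed over the stored records. The record fields `framework` and `valid` are assumed for this example, and the function name `validity_rates` is hypothetical; the actual analysis lives in the notebook.

```python
from collections import defaultdict


def validity_rates(records):
    """Return {framework: fraction of runs marked valid}.

    Each record is assumed to be a dict with a "framework" key and a
    boolean "valid" key; records without "valid" count as invalid.
    """
    totals = defaultdict(int)
    valid = defaultdict(int)
    for rec in records:
        totals[rec["framework"]] += 1
        if rec.get("valid"):
            valid[rec["framework"]] += 1
    return {fw: valid[fw] / totals[fw] for fw in totals}
```

Counting records with a missing `valid` field as invalid keeps a crashed or truncated run from silently inflating a framework's rate.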

Project Details

Internship Focus

LLM-driven algorithm design

My Role

AI Developer and Research Intern

Organization

Tomas Bata University in Zlin

Output

Benchmark harness, result data, figures, and internship documents

Technologies Used

Python, Ollama, llama3.1, Jupyter, pandas, SciPy, matplotlib, JSON, Docker

Challenges & Solutions

Fair Comparison

Making the framework comparison fair and repeatable.

Solution:

I used one shared pipeline with the same model endpoint, tasks, seeds, hardware conditions, and timeout rules.
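A minimal sketch of this idea is a single immutable configuration object from which every framework's run list is derived, so no framework can end up with different tasks, seeds, or timeouts. The class `BenchmarkConfig`, its field names, and `plan_runs` are assumptions for illustration, and the endpoint and task values in the comment are placeholders.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchmarkConfig:
    """Settings shared by all frameworks (hypothetical field names)."""
    model_endpoint: str
    model_name: str
    tasks: tuple
    seeds: tuple
    timeout_s: float


def plan_runs(config, frameworks):
    """Return (framework, task, seed) for the full grid.

    Every framework receives exactly the same task and seed combinations.
    """
    return list(product(frameworks, config.tasks, config.seeds))


# Placeholder values, not the settings used in the project:
# cfg = BenchmarkConfig("http://localhost:11434", "llama3.1",
#                       ("task_a",), (0, 1), 60.0)
# plan_runs(cfg, ["LLaMEA", "LLM4AD", "frontEASE"])
```

Freezing the dataclass means an adapter cannot quietly change a shared setting in the middle of an experiment.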

Generated-Code Reliability

Generated algorithms can fail, hang, or report results incorrectly.

Solution:

The harness validates code in isolated subprocesses, stores raw output, and marks invalid runs clearly for analysis.
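The validation step might look roughly like the sketch below: the generated source is written to a temporary file, run in a separate Python process with a timeout, and classified as `ok`, `error`, or `timeout`. The function name `validate_candidate` and the status labels are assumptions for this example. A separate process only contains crashes and hangs; it is not a security boundary on its own.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def validate_candidate(source, timeout_s=5.0):
    """Run generated source in a child Python process and classify the outcome."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I runs the interpreter in isolated mode (no user site
                # packages, no PYTHON* environment variables).
                [sys.executable, "-I", str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed when this expires
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "ok" if proc.returncode == 0 else "error"
    # Keep the raw output so invalid runs can be inspected later.
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
```

Returning a status label together with the raw output lets the analysis separate hung candidates from crashing ones without rerunning them.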