Tomas Bata University in Zlin - AI Research Lab
A reproducible benchmark comparing LLaMEA, LLM4AD, and frontEASE under the same model endpoint, tasks, timeouts, and hardware conditions.
During my internship, I worked on a practical AI research project focused on Automated Algorithm Discovery. The goal was to understand how different LLM-driven frameworks behave when they are evaluated through the same benchmark pipeline.
I built a Python benchmark harness with adapters for LLaMEA, LLM4AD, and frontEASE. The harness stores raw JSON results, resumes interrupted experiments, validates generated algorithms in isolated subprocesses, and converts the final results into figures through a separate analysis notebook.
A shared benchmark setup for all tested frameworks.
Sandboxed subprocess validation for generated algorithms.
Adapters for LLaMEA, LLM4AD, and frontEASE.
Reusable result analysis with notebook-generated figures.
Defined the benchmark goals, tasks, risks, framework adapters, and repeatable evaluation setup.
Built the Python harness, created framework adapters, stored raw results, and added resume support for long-running experiments.
Ran controlled experiments, checked generated-code validity, tracked failures, and compared reliability across the frameworks.
Turned the final dataset into figures and wrote the realization document, planning document, and reflection.
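The resume support mentioned above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: it assumes results are kept as one JSON object per line and that each run is identified by a (framework, task, seed) key. The names `load_completed`, `append_result`, and `run_pending`, and the record fields, are hypothetical.

```python
import json
from pathlib import Path


def load_completed(results_path: Path) -> set[tuple[str, str, int]]:
    """Collect the (framework, task, seed) keys already stored in the results file."""
    done: set[tuple[str, str, int]] = set()
    if not results_path.exists():
        return done
    with results_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Skip lines that fail to parse, for example a truncated write.
                continue
            done.add((record["framework"], record["task"], record["seed"]))
    return done


def append_result(results_path: Path, record: dict) -> None:
    """Append one raw result record as a single JSON line."""
    with results_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def run_pending(results_path: Path, runs, run_one) -> int:
    """Execute only the runs that are not yet in the results file.

    `runs` is an iterable of (framework, task, seed) tuples and `run_one`
    is a callable that returns the result record for one run.
    Returns the number of runs executed in this call.
    """
    done = load_completed(results_path)
    executed = 0
    for framework, task, seed in runs:
        if (framework, task, seed) in done:
            continue  # finished in an earlier session
        record = run_one(framework, task, seed)
        append_result(results_path, record)
        executed += 1
    return executed
```

Writing each record as soon as its run finishes means an interrupted experiment loses at most the run that was in progress, and restarting with the same run list continues where it stopped.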
LLM-driven algorithm design
AI Developer and Research Intern
Tomas Bata University in Zlin
Benchmark harness, result data, figures, and internship documents
Making the framework comparison fair and repeatable.
I used one shared pipeline with the same model endpoint, tasks, seeds, hardware conditions, and timeout rules.
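One way to keep those conditions identical is to derive every run from a single immutable configuration. The sketch below is illustrative only: `SharedConfig`, `build_run_plan`, and the field names are hypothetical and not taken from the actual harness.

```python
from dataclasses import dataclass
from itertools import product

# The three frameworks compared in the benchmark.
FRAMEWORKS = ("LLaMEA", "LLM4AD", "frontEASE")


@dataclass(frozen=True)
class SharedConfig:
    """Settings that every framework run must share (hypothetical fields)."""
    model_endpoint: str
    tasks: tuple[str, ...]
    seeds: tuple[int, ...]
    timeout_s: float


def build_run_plan(config: SharedConfig) -> list[dict]:
    """Expand the shared config into one run description per
    (framework, task, seed) combination, so that every framework
    receives exactly the same tasks, seeds, endpoint, and timeout."""
    return [
        {
            "framework": framework,
            "task": task,
            "seed": seed,
            "model_endpoint": config.model_endpoint,
            "timeout_s": config.timeout_s,
        }
        for framework, task, seed in product(FRAMEWORKS, config.tasks, config.seeds)
    ]
```

Because the dataclass is frozen, no adapter can quietly change a timeout or seed for its own framework after the plan has been built.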
Generated algorithms can fail, hang, or report results incorrectly.
The harness validates code in isolated subprocesses, stores raw output, and marks invalid runs clearly for analysis.
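A minimal sketch of this kind of validation, using only the Python standard library, could look like the following. The function name `validate_candidate` and the status labels are assumptions made for illustration. The sketch shows process isolation and a timeout only; it is not a complete security sandbox, and the real harness may apply further restrictions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def validate_candidate(source: str, timeout_s: float = 10.0) -> dict:
    """Run generated code in a separate Python process and classify the outcome.

    Returns a dict with a status of "ok", "error", or "timeout" together
    with the raw captured output, so invalid runs stay visible in analysis.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I starts the interpreter in isolated mode (ignores
                # PYTHON* environment variables and the user site directory).
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process when the timeout expires.
            return {"status": "timeout", "returncode": None, "stdout": "", "stderr": ""}

    status = "ok" if proc.returncode == 0 else "error"
    return {
        "status": status,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

For example, `validate_candidate("while True: pass", timeout_s=1.0)` is reported with the status "timeout" instead of blocking the whole benchmark, and a script that raises an exception is reported as "error" with its traceback preserved in `stderr`.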