
A year ago, I ran a similar benchmark to see how well LLMs could solve algorithmic problems. Since then, OpenAI, Anthropic, and Google have released enhanced versions of their models, and new players such as DeepSeek and xAI have appeared. Many models are now marketed as capable of writing code, which was not the case before. I intend to benchmark these state-of-the-art LLMs to find out whether their ability to solve novel algorithmic problems has improved.
There are existing benchmarks that assess the coding abilities of LLMs, but I wanted a direct, up-to-date comparison of current models on problems they are unlikely to have seen before. This led to the creation of a new benchmark. And, after all, why not do it just for fun?
The idea is to emulate what a human does when solving algorithmic problems, but use an LLM to write the code:

1. Take the problem statement and the starter code snippet.
2. Ask the LLM to write a solution.
3. Clean up the model's reply so that only source code remains.
4. Submit the code to the online judge and record whether it is accepted.

These steps are performed for each problem in the testing set and for each LLM. For the sake of simplicity, an LLM gets only one attempt per problem, with no feedback loop to iterate on and improve the solution. All problems are treated as independent; no context is shared between them.
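Expressed as code, the loop looks roughly like this (a sketch only; the helper functions and collection names are hypothetical placeholders, not the actual code from my tool):

```python
# Sketch of the benchmark loop; all helper names are hypothetical placeholders.
results = {}

for model in MODELS:                    # every LLM under test
    for problem in PROBLEMS:            # every problem in the dataset
        statement, snippet = fetch_problem(problem)   # statement + starter snippet
        prompt = build_prompt(statement, snippet)     # fill the prompt template
        reply = ask_llm(model, prompt)                # a single attempt, no feedback
        code = clean_output(reply)                    # keep only the source code
        results[(model, problem)] = submit_solution(problem, code)  # judge verdict
```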
Leetcode proved to be a good choice for this benchmark for several reasons: it offers a large catalog of problems with an automated online judge, per-problem acceptance statistics from human users, editorial difficulty labels, curated problem lists, and a steady stream of newly published problems.
If you have never dealt with competitive programming or solving algorithmic challenges, here is a brief explanation. Check out this sample problem description:
Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order.
A contestant needs to write a piece of code that solves this problem. The starting point is a "snippet", an incomplete function with its name and arguments given, but an empty body:
class Solution:
    def twoSum(self, nums: List[int], target: int) -> List[int]:
        # your code here
Usually, several sample inputs and outputs (test cases) are provided in the description:
Input: nums = [2,7,11,15], target = 9
Output: [0,1]
A problem may have dozens of additional test cases that are visible only to the online judge. A problem is solved (or a solution is accepted) if and only if the code produces the expected output for every test case within reasonable time and memory limits. The solution can be written in any programming language supported by the judge.
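For illustration only (this is not part of the benchmark), a typical accepted solution to the Two Sum problem above could use the standard one-pass hash map approach:

```python
from typing import List

class Solution:
    def twoSum(self, nums: List[int], target: int) -> List[int]:
        # Map each value seen so far to its index.
        seen = {}
        for i, value in enumerate(nums):
            complement = target - value
            if complement in seen:
                # The complement appeared earlier: return both indices.
                return [seen[complement], i]
            seen[value] = i
        return []  # unreachable if exactly one solution is guaranteed
```

It runs in O(n) time and O(n) extra memory, comfortably within the judge's limits for this problem.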
Each problem has an "acceptance rate": the ratio of accepted submissions to all submissions made by Leetcode users. Note that a single user can submit their code an unlimited number of times, and every attempt counts toward the acceptance rate.
These rules were not invented by Leetcode; they have been commonly used in computer science contests for quite a long time.
As in the previous benchmark, I wanted to run the LLMs on two sets of problems: "well-known" problems that have been around for years and are almost certainly present in training data, and "unseen" problems that were published only recently.
While most problems have plain-text descriptions and only require expanding a given function with code, others are different. Some require implementing an interface, i.e., completing several functions within one problem. Others contain external links or images, which could pose difficulties for LLMs, as few models support image inputs or browsing the internet. I decided to exclude problems with images or links, as well as those requiring the implementation of multiple functions.
Leetcode offers three curated problem lists: "Top Interview 150", "Leetcode 75", and "Top 100 Liked". My dataset of "well-known" problems combines these lists, totaling 133 problems after the exclusions mentioned above.
For "unseen" problems, I selected 99 of the most recent published problems: 33 easy, 33 medium, and 33 hard. Recency was determined based on problem IDs, which are incremental. Although Leetcode does not display the publication date of problems, it can be approximated from comments and discussions. The earliest of these "unseen" problems were most likely published around November 2024.
Difficulty levels are subjective and assigned at the discretion of Leetcode's editors. I did not try to match the number of problems per difficulty level between the two datasets.
| Problem set | Well-Known | Unseen |
|---|---|---|
| Total | 133 | 99 |
| Easy | 41 | 33 |
| Medium | 78 | 33 |
| Hard | 14 | 33 |
| Acceptance rate of Leetcode users | 53.44% | 37.05% |
Problem statements and code snippets were obtained using my benchmarking tool, which is published on GitHub:
The benchmark was designed as follows: an LLM makes a single attempt to generate code, without any prior information about the problem (or any other problem) and without knowing its test cases, except those included in the description itself. There is no mechanism to provide feedback or to improve the code after it is generated.
I used the same prompt for every LLM and every problem:
Hi, this is a coding interview. You will be given:
* A problem statement (with sample test cases if available).
* A starter code snippet (with fixed function signatures).
Please write your solution in the {language} programming language. Your code must:
* Solve the problem fully and correctly.
* Pass all provided sample test cases.
* Run within acceptable time and memory limits (assume large inputs if none are specified).
* Follow good coding practices (clear logic, readable structure, appropriate use of language features).
Here is the problem statement: {question}
Here is the code snippet, which you should expand with your solution: {snippet}
Important Requirements:
* Do not change any provided function signatures, class names, or method names.
* Output only valid source code that can be executed as-is, without any further improvements or bugfixes.
* Do not include docstrings, markdown, or commentary in your final code.
Good luck!
The prompt was "polished" with ChatGPT-4 from my first draft, but without applying any dedicated prompt-engineering techniques.
The problem description was stripped of HTML tags before using it in the prompt.
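As a rough illustration of that step (a simplified sketch, not the actual implementation from my tool), tags can be stripped with the Python standard library:

```python
import html
import re

def strip_html(description: str) -> str:
    # Replace tags with spaces, decode entities such as &lt; and &amp;,
    # and collapse the resulting whitespace.
    text = re.sub(r"<[^>]+>", " ", description)
    text = html.unescape(text)
    return re.sub(r"[ \t]+", " ", text).strip()
```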
For the programming language, I chose Python (version 3).
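With the language fixed, filling the prompt template per problem is straightforward. A minimal sketch, assuming the template is stored as a Python format string (the variable names are illustrative, not the exact code from my tool):

```python
# The template shown above, abbreviated; placeholders are {language}, {question}, {snippet}.
PROMPT_TEMPLATE = (
    "Hi, this is a coding interview. ...\n"
    "Please write your solution in the {language} programming language. ...\n"
    "Here is the problem statement: {question}\n"
    "Here is the code snippet, which you should expand with your solution: {snippet}\n"
    "... Good luck!"
)

prompt = PROMPT_TEMPLATE.format(
    language="Python",                     # the language chosen for the benchmark
    question=strip_html(raw_description),  # statement with HTML tags removed (see sketch above)
    snippet=starter_snippet,               # starter code provided by Leetcode
)
```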
LLMs were asked to output only the working code, without any preceding text, but in many cases they did not comply. A basic cleanup step was implemented: everything besides the actual code was removed before submission.
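What such a cleanup might look like (a simplified sketch; the actual tool may handle more cases): if the reply wraps the code in a markdown fence, keep only its contents.

```python
import re

# Matches a markdown code fence, optionally labeled "python", and captures its body.
CODE_FENCE = re.compile(r"`{3}(?:python)?\s*\n(.*?)`{3}", re.DOTALL)

def extract_code(reply: str) -> str:
    # Keep only the fenced part if present; otherwise assume the reply is already code.
    match = CODE_FENCE.search(reply)
    return match.group(1).strip() if match else reply.strip()
```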
The models used in the benchmark are listed in the table below, with all non-default parameters specified. Knowledge cutoff dates were taken from the vendors' official documentation and are provided for reference where known.
| Vendor | Model | Knowledge cutoff date | "Reasoning" | Parameters |
|---|---|---|---|---|
| Anthropic | claude-3-7-sonnet-20250219 | Nov 2024 | No | temperature = 0.0, max_tokens = 4096 |
| Anthropic | claude-3-7-sonnet-20250219 (with thinking enabled) | Nov 2024 | Yes | temperature = 0.0, max_tokens = 16384, budget_tokens = 8192 |
| DeepSeek | deepseek-chat (DeepSeek-V3) | unknown | No | temperature = 0.0 |
| DeepSeek | deepseek-reasoner (DeepSeek-R1) | unknown | Yes | temperature = 0.0 |
| Google | gemini-2.0-flash-001 | unknown | No | temperature = 0.0 |
| Google | gemini-2.0-pro-exp-02-05 | unknown | No | temperature = 0.0 |
| Google | gemini-2.5-pro-exp-03-25 | unknown | Yes | temperature = 0.0 |
| xAI | grok-2-1212 | July 17, 2024 | No | seed = 42 |
| OpenAI | o1-2024-12-17 | Oct 01, 2023 | Yes | seed = 42 |
| OpenAI | o3-mini-2025-01-31 | Oct 01, 2023 | Yes | seed = 42 |
The benchmark aimed to be as deterministic and reproducible as possible; therefore, parameters such as "temperature" or "seed" were used. However, none of the models tested guarantee fully deterministic output. This factor should be considered when re-running these tests.
All known knowledge cutoff dates are earlier than the oldest problem in the "unseen" dataset (November 2024). However, I could not find cutoff dates for the Gemini and DeepSeek model families.
Some models support "reasoning" or "thinking" mode by default, while for Claude 3.7 Sonnet it can be enabled by passing extra parameters; use of this feature is indicated in the table. Other model features (or "tools"), such as web search, were not enabled even where supported.
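For reference, this is roughly how the thinking mode is enabled through the Anthropic Python SDK, using the parameters from the table above (a sketch, not the exact benchmarking code; `prompt` stands for the prompt described earlier, and the SDK may impose additional constraints, for example on temperature, when thinking is enabled):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16384,
    thinking={"type": "enabled", "budget_tokens": 8192},  # enables extended thinking
    messages=[{"role": "user", "content": prompt}],
)
```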
All competitors showed very high acceptance rates on the well-known problems, as in the previous benchmark. I did not run the superior models and modifications (namely Claude 3.7 Sonnet with reasoning enabled, DeepSeek R1, Gemini 2.5 Pro, and o1) on this set, to save time and credits, as the results were highly predictable.
Results are strikingly different for "unseen" problems in two aspects:
The significantly higher acceptance rates on well-known problems can be explained by the possibility that those problems and their solutions were part of the training sets, so the models only had to reproduce a known correct solution. However, the users' acceptance rate for new medium and hard problems is also lower than for problems in the "well-known" set. This effect is difficult to quantify, and it does not necessarily mean that the new problems are "harder": difficulty ratings, as mentioned before, are highly subjective. And, just like the LLMs, human users may simply submit known correct solutions to well-known problems.
All models with "reasoning" mode enabled significantly outperformed their basic counterparts. Most importantly, some of them were able to solve a sizeable share of the medium and hard problems, an achievement that was unfeasible just a year ago. The o3-mini shows the best results among all "reasoning" models; it performed even better than the much more expensive and slower o1.
It cannot be guaranteed that the "unseen" problem sets were not included in the models' training data. To address this, we may consider generating new, unique problems specifically designed for future benchmarks — of course, using LLMs.
Another strategy is to use less common programming languages, which might force LLMs to devise a solution rather than "copy-paste" known correct code written in Python.
These ideas need further research, and I hope that I, or someone else, will dive into them.
Cover image created by DALL·E.