Testing LLMs on Solving Leetcode Problems in 2025

by Alex Svetkin, April 8th, 2025

Too Long; Didn't Read

State-of-the-art "reasoning" LLMs can solve a sizeable share of Leetcode's hard algorithmic problems.

A year ago, my benchmark showed that Large Language Models (LLMs) could solve algorithmic coding challenges on Leetcode. However, their ability was largely limited to a subset of well-known, "popular" problems. New problems, which were never part of their training datasets, proved difficult: the models solved most of the easy ones, but the hardest remained out of reach.

Since then, OpenAI, Anthropic, and Google have released improved versions of their models, and new players such as DeepSeek and xAI have appeared. Many models are now marketed as capable of writing code, which was not the case before. I set out to benchmark these state-of-the-art LLMs to find out whether their ability to solve novel algorithmic problems has improved.

Motivation

There are existing benchmarks that assess LLMs' coding abilities.

SWE-bench focuses on solving real-life software problems: it is based on GitHub issues from existing open-source projects. It is a brilliant idea, but it covers far more than the pure algorithmic problem solving I was focusing on.

Codeforces benchmarks better reflect LLMs' algorithmic problem-solving skills. OpenAI tested its o1 and o3 models on Codeforces problems and shared detailed results (1, 2), but its competitors did not, which made a direct comparison impossible.

This led me to create a new benchmark that allows a direct comparison of LLMs. And, after all, why not do it just for fun?

The benchmark design

The idea is to emulate what a human does when solving an algorithmic problem, but use an LLM to write the code:

  1. Download the problem description.
  2. Create a prompt from the description.
  3. Generate code with the LLM.
  4. Submit the code for validation.
  5. Await results.


Sketched steps of the benchmark


These steps are performed for each problem in the testing set and for each LLM. For simplicity, each LLM gets a single attempt at each problem, with no feedback loop to iterate on the solution. All problems are treated as independent; no context is shared between them.
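
For illustration, here is a minimal sketch of that loop in Python. The helper functions are hypothetical placeholders for what my tool does, not its actual API:

def fetch_problem(problem_id):
    """Placeholder: download the problem statement and starter snippet (step 1)."""
    raise NotImplementedError

def build_prompt(description, snippet, language):
    """Placeholder: fill the prompt template with the problem details (step 2)."""
    raise NotImplementedError

def generate_code(model, prompt):
    """Placeholder: one-shot code generation, no feedback loop (step 3)."""
    raise NotImplementedError

def submit_solution(problem_id, code):
    """Placeholder: submit to the online judge and wait for the verdict (steps 4-5)."""
    raise NotImplementedError

def run_benchmark(problem_ids, models):
    # Each (model, problem) pair is handled independently: one attempt,
    # no shared context, no retries.
    results = {}
    for model in models:
        for problem_id in problem_ids:
            description, snippet = fetch_problem(problem_id)
            prompt = build_prompt(description, snippet, "Python3")
            code = generate_code(model, prompt)
            results[(model, problem_id)] = submit_solution(problem_id, code)
    return results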


Why Leetcode?

Leetcode proved to be a good choice for benchmarking for several reasons:

  • Leetcode problems are used in actual interviews for software engineer positions.
  • Computer science students learn to solve similar problems during their education.
  • It has an online judge that can check if the solution is correct in seconds.
  • Many popular programming languages are supported.
  • Human users' performance on each problem is also available.


How Leetcode works

If you have never dealt with competitive programming or solving algorithmic challenges, here is a brief explanation. Check out this sample problem description:

Given an array of integers numbers and an integer target, return indices of
the two numbers such that they add up to the target. You may assume that each
input would have exactly one solution, and you may not use the same element
twice.  You can return the answer in any order.

A contestant needs to write a piece of code that solves the problem. To start with, there is a "snippet": an incomplete function with its name and arguments given, but an empty body:

class Solution:
    def twoSum(self, nums: List[int], target: int) -> List[int]:
        # your code here

Usually, several samples of input and output (test cases) are provided in the description:

Input:  nums = [2,7,11,15], target = 9 
Output: [0,1]  
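
For reference, here is one accepted-style solution to this sample problem: the standard hash-map approach, written by hand for illustration rather than produced by any of the tested models:

from typing import List

class Solution:
    def twoSum(self, nums: List[int], target: int) -> List[int]:
        # Map each value to its index; for every number, check whether its
        # complement (target - number) has already been seen.
        seen = {}
        for i, num in enumerate(nums):
            complement = target - num
            if complement in seen:
                return [seen[complement], i]
            seen[num] = i
        return []  # unreachable: the problem guarantees exactly one solution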

A problem may have dozens of test cases that are available only to the online judge. A problem is solved (or a solution is considered accepted) if and only if the code produces the expected output for every test case within reasonable time and memory limits. The solution can be written in any programming language supported by the judge.

Each problem has an "acceptance rate": the ratio of accepted submissions to all submissions by Leetcode users. Note that a single user can submit their code an unlimited number of times, and each attempt counts toward the acceptance rate.
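
In other words (a trivial illustration, not Leetcode's actual code):

def acceptance_rate(accepted_submissions: int, total_submissions: int) -> float:
    # Every submission counts, including repeated attempts by the same user.
    return accepted_submissions / total_submissions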

These rules were not invented by Leetcode; they have been commonly used in computer science contests for quite a long time.


Datasets

As in the previous benchmark, I wanted to run LLMs on two sets of problems:

  • "well-known" problems are not only published long ago but are also most often used for software interviews — therefore, solutions are widely available.
  • "Unseen" problems which were published recently, and their solutions were not likely observed by the tested LLMs.

While most problems have plain-text descriptions and only require expanding a given function with code, others differ. Some require implementing an interface, i.e., expanding several functions in one problem. Others contain external links and images, which could pose difficulties for LLMs, as few models support image inputs or browsing the internet. I decided to exclude problems with images or links, as well as those requiring the implementation of multiple functions.

Leetcode offers three problem lists: "Top Interview 150", "Leetcode 75", and "Top 100 Liked". My dataset of "well-known" problems combines these lists, totaling 133 problems after the exclusions mentioned above.

For "unseen" problems, I selected 99 of the most recent published problems: 33 easy, 33 medium, and 33 hard. Recency was determined based on problem IDs, which are incremental. Although Leetcode does not display the publication date of problems, it can be approximated from comments and discussions. The earliest of these "unseen" problems were most likely published around November 2024.

Difficulty levels are purely subjective and assigned at the Leetcode editors' discretion. I did not try to match the number of problems across difficulties or datasets.



Problem set                          Well-Known    Unseen (23 Mar 2025)
Total                                133           99
Easy                                 41            33
Medium                               78            33
Hard                                 14            33
Acceptance rate of Leetcode users    53.44%        37.05%

Problem statements and code snippets were obtained using my benchmarking tool, which is published on GitHub: https://github.com/whisk/leetgptsolver


Prompt, language choice and code generation

The benchmark was designed as follows: the LLM makes a single attempt to generate code, without any prior information about the problem (or any other problems) and without knowing its test cases, except those included in the description itself. There is no mechanism to provide feedback or improve the code after it is generated.

I used the same prompt for every LLM and every problem:

Hi, this is a coding interview. You will be given:
* A problem statement (with sample test cases if available).
* A starter code snippet (with fixed function signatures). 

Please write your solution in the {language} programming language. Your code must:
* Solve the problem fully and correctly.
* Pass all provided sample test cases.
* Run within acceptable time and memory limits (assume large inputs if none are specified).
* Follow good coding practices (clear logic, readable structure, appropriate use of language features). 

Here is the problem statement: {question} 

Here is the code snippet, which you should expand with your solution: {snippet} 

Important Requirements:

* Do not change any provided function signatures, class names, or method names.
* Output only valid source code that can be executed as-is, without any further improvements or bugfixes.
* Do not include docstrings, markdown, or commentary in your final code. 

Good luck!

The prompt was "polished" with ChatGPT-4 from my first draft, but without applying any specific "prompt engineering" techniques.

The problem description was stripped of HTML tags before using it in the prompt.

For the programming language I chose Python (version 3).
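
As an illustration, here is a minimal sketch of how the description could be cleaned and the prompt assembled, using only the Python standard library. This is not the exact code from leetgptsolver, and the template is abbreviated:

import html
import re

# Abbreviated version of the prompt shown above; the real template contains
# the full instruction list.
PROMPT_TEMPLATE = (
    "Hi, this is a coding interview.\n"
    "Please write your solution in the {language} programming language.\n"
    "Here is the problem statement: {question}\n"
    "Here is the code snippet, which you should expand with your solution: {snippet}\n"
)

def strip_html(raw_description: str) -> str:
    # Naive approach: drop tags and decode HTML entities; a proper HTML
    # parser could be used instead.
    text = re.sub(r"<[^>]+>", "", raw_description)
    return html.unescape(text).strip()

def build_prompt(raw_description: str, snippet: str, language: str = "Python3") -> str:
    return PROMPT_TEMPLATE.format(
        language=language,
        question=strip_html(raw_description),
        snippet=snippet,
    )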

LLMs were asked to output only working code without any preceding text, but in many cases the responses still contained extra text. A basic cleanup step was implemented: everything besides the actual code was removed before submission.
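
The cleanup was roughly of this kind, a simplified sketch assuming the extra text mostly takes the form of markdown code fences and surrounding commentary:

import re

def extract_code(llm_output: str) -> str:
    # If the response contains a fenced code block, keep only its contents;
    # otherwise assume the whole response is code.
    match = re.search(r"```(?:\w+)?\s*\n(.*?)```", llm_output, re.DOTALL)
    if match:
        return match.group(1).strip()
    return llm_output.strip()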


Models and parameters

The models used in the benchmark are listed in the table below, with all non-default parameters specified. Knowledge cutoff dates are taken from the vendors' official documentation and provided for reference where known.

Vendor      Model                                                Knowledge cutoff date   "Reasoning"   Parameters
Anthropic   claude-3-7-sonnet-20250219                           Nov 2024                No            temperature = 0.0, max_tokens = 4096
Anthropic   claude-3-7-sonnet-20250219 (with thinking enabled)   Nov 2024                Yes           temperature = 0.0, max_tokens = 16384, budget_tokens = 8192
DeepSeek    deepseek-chat (DeepSeek-V3)                          unknown                 No            temperature = 0.0
DeepSeek    deepseek-reasoner (DeepSeek-R1)                      unknown                 Yes           temperature = 0.0
Google      gemini-2.0-flash-001                                 unknown                 No            temperature = 0.0
Google      gemini-2.0-pro-exp-02-05                             unknown                 No            temperature = 0.0
Google      gemini-2.5-pro-exp-03-25                             unknown                 Yes           temperature = 0.0
xAI         grok-2-1212                                          July 17, 2024           No            seed = 42
OpenAI      o1-2024-12-17                                        Oct 01, 2023            Yes           seed = 42
OpenAI      o3-mini-2025-01-31                                   Oct 01, 2023            Yes           seed = 42
The benchmark aimed to be as deterministic and reproducible as possible; therefore, parameters such as "temperature" and "seed" were fixed. However, none of the tested models guarantee fully deterministic output, which should be kept in mind when re-running these tests.

All known knowledge cutoff dates are no later than the publication of the oldest problem in the "unseen" dataset (around November 2024). However, I could not find cutoff dates for the Gemini and DeepSeek model families.

Some models have "reasoning" or "thinking" enabled by default, while for Claude 3.7 Sonnet it must be enabled explicitly via request parameters. Use of this feature is indicated in the table. Other model features (or "tools"), such as web search, were not enabled even where supported.
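
For reference, here is a minimal sketch of enabling extended thinking for Claude 3.7 Sonnet, assuming the official Anthropic Python SDK; the values mirror the table above, and this is not the exact benchmark code:

import anthropic

client = anthropic.Anthropic()

# The full prompt described above, with the problem statement and snippet filled in.
prompt = "..."

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16384,
    thinking={"type": "enabled", "budget_tokens": 8192},
    messages=[{"role": "user", "content": prompt}],
)

# With thinking enabled, the response contains "thinking" blocks followed by
# a final "text" block with the generated code.
generated_code = next(block.text for block in response.content if block.type == "text")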

Results

Results on the "well-known" problem set


All competitors showed very high acceptance rates on well-known problems, as in the previous benchmark. To save time and credits, I did not run the top-tier models and modes (namely Claude 3.7 Sonnet with reasoning enabled, DeepSeek R1, Gemini 2.5 Pro, and o1) on this set, as their results were highly predictable.

Results on the "unseen" problem set

Results are strikingly different for "unseen" problems in two aspects:

  1. For all models, the acceptance rate is lower on "unseen" problems. This is especially notable for medium and hard problems.
  2. Models with "reasoning" or "thinking" mode enabled produced better results across problems of all difficulties, though exact numbers varied from model to model.

The significantly higher acceptance rates on well-known problems can be explained by those problems and their solutions having been part of the training sets, so the models only had to reproduce a known correct solution. However, users' acceptance rate for new medium and hard problems is also lower than for problems in the "well-known" set. This effect is difficult to quantify, and it does not necessarily mean that the new problems are "harder": difficulty ratings, as mentioned before, are highly subjective. And, as with the LLMs themselves, human users may simply submit known correct solutions to well-known problems.

All models with "reasoning" mode enabled significantly outperformed their basic counterparts. Most importantly, some of them were able to solve a sizeable share of medium and hard problems, something that was out of reach just a year ago. o3-mini shows the best results among the "reasoning" models; it performed even better than the much more expensive and slower o1. It should be noted that o3-mini was specifically trained to solve competitive programming problems. Still, I refrain from concluding which model is best for solving algorithmic problems, because that depends heavily on the token budget, and latency and cost also have to be considered.

Future considerations

It cannot be guaranteed that the "unseen" problems were not included in the models' training data. To address this, one could generate new, unique problems specifically for future benchmarks (using LLMs, of course).

Another strategy is to use less common programming languages. This may force LLMs to devise a solution rather than "copy-paste" known correct code written in Python.

These ideas need further research, and I hope that I, or someone else, will get to dive into them.


Cover image created by DALL·E.