Laurent Laborde


Benchmarking LFM2.5-Thinking on GSM8k (early result)

#ai

I have a secret passion for LFM2.5-Thinking. It's tiny (1.2B parameters), it's fast, it's a reasoning model, and it's good. Really good.

My tests are still in progress, so all I can do is share some early results. I use the public GSM8k dataset, but with my own benchmarking scripts.

What is the GSM8k benchmark?

Grade School Math 8K, a dataset of 8.5K high-quality linguistically diverse grade school math word problems requiring multi-step reasoning and elementary arithmetic operations.
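For context, each GSM8K record carries its ground truth at the end of the answer field, after a `####` marker. A minimal sketch of pulling it out (the helper name is mine, not from my actual scripts):

```python
import re

def gsm8k_ground_truth(answer_field: str) -> str:
    """GSM8K answer fields end with a line like '#### 72';
    the number after '####' is the ground truth."""
    m = re.search(r"####\s*([\-0-9.,]+)", answer_field)
    if m is None:
        raise ValueError("no '#### <answer>' marker found")
    # Strip thousands separators so '1,000' compares equal to '1000',
    # and drop a trailing period if the line ends in one.
    return m.group(1).replace(",", "").rstrip(".")

example = "Natalia sold 48/2 = 24 clips in May.\n48+24 = 72\n#### 72"
print(gsm8k_ground_truth(example))  # -> 72
```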

The top 10 leaderboard in 2026 goes up to 97%. Take note of the massive context sizes.

Top 10 Leaderboard

And this is what "State of the Art" results looked like in 2021: barely 35%.

SOTA GSM8k in 2021

Some early results

Questions: 1319 (test)
Context sizes to test: [1000, 2000, 3000, 4000, 5000, 6000, 7000]
Endpoint: http://192.168.1.110:8000 / lfm2.5-thinking

=== max_tokens=1000 ===
  [200/1319] acc=135/200 (67.5%) rate=3.9q/s
  [400/1319] acc=251/400 (62.8%) rate=4.6q/s
  [600/1319] acc=387/600 (64.5%) rate=5.0q/s
  [800/1319] acc=512/800 (64.0%) rate=5.1q/s
  [1000/1319] acc=640/1000 (64.0%) rate=5.3q/s
  [1200/1319] acc=771/1200 (64.2%) rate=5.3q/s
  Result: 851/1319 (64.5%) @ 5.48q/s

=== max_tokens=2000 ===
  [200/1319] acc=163/200 (81.5%) rate=2.1q/s
  [400/1319] acc=321/400 (80.2%) rate=2.3q/s
  [600/1319] acc=479/600 (79.8%) rate=2.5q/s
  [800/1319] acc=636/800 (79.5%) rate=2.5q/s
  [1000/1319] acc=791/1000 (79.1%) rate=2.5q/s
  [1200/1319] acc=956/1200 (79.7%) rate=2.6q/s
  Result: 1055/1319 (80.0%) @ 2.63q/s

=== max_tokens=3000 ===
  [200/1319] acc=171/200 (85.5%) rate=1.5q/s
  [400/1319] acc=341/400 (85.2%) rate=1.5q/s
  [600/1319] acc=505/600 (84.2%) rate=1.5q/s
  [800/1319] acc=674/800 (84.2%) rate=1.5q/s
  [1000/1319] acc=836/1000 (83.6%) rate=1.5q/s
  [1200/1319] acc=1008/1200 (84.0%) rate=1.5q/s
  Result: 1113/1319 (84.4%) @ 1.57q/s

=== max_tokens=4000 ===
  [200/1319] acc=175/200 (87.5%) rate=1.1q/s
  [400/1319] acc=348/400 (87.0%) rate=1.1q/s
  [600/1319] acc=517/600 (86.2%) rate=1.1q/s
  [800/1319] acc=683/800 (85.4%) rate=1.1q/s
  [1000/1319] acc=852/1000 (85.2%) rate=1.1q/s
  [1200/1319] acc=1033/1200 (86.1%) rate=1.1q/s
  Result: 1139/1319 (86.4%) @ 1.17q/s

=== max_tokens=5000 ===
  [200/1319] acc=176/200 (88.0%) rate=0.8q/s
  [400/1319] acc=350/400 (87.5%) rate=0.9q/s
  [600/1319] acc=523/600 (87.2%) rate=0.9q/s
  [800/1319] acc=687/800 (85.9%) rate=0.9q/s
  [1000/1319] acc=850/1000 (85.0%) rate=0.9q/s
  [1200/1319] acc=1025/1200 (85.4%) rate=0.9q/s
  Result: 1129/1319 (85.6%) @ 0.93q/s

=== max_tokens=6000 ===
  [200/1319] acc=181/200 (90.5%) rate=0.7q/s
  [400/1319] acc=351/400 (87.8%) rate=0.7q/s
  [600/1319] acc=523/600 (87.2%) rate=0.7q/s
  [800/1319] acc=696/800 (87.0%) rate=0.7q/s
  [1000/1319] acc=863/1000 (86.3%) rate=0.7q/s
  [1200/1319] acc=1048/1200 (87.3%) rate=0.7q/s
  Result: 1153/1319 (87.4%) @ 0.73q/s

=== max_tokens=7000 ===
  [200/1319] acc=172/200 (86.0%) rate=0.5q/s
  [400/1319] acc=346/400 (86.5%) rate=0.6q/s
  [600/1319] acc=520/600 (86.7%) rate=0.6q/s
  [800/1319] acc=683/800 (85.4%) rate=0.6q/s
  [1000/1319] acc=853/1000 (85.3%) rate=0.6q/s
  [1200/1319] acc=1034/1200 (86.2%) rate=0.6q/s
  Result: 1137/1319 (86.2%) @ 0.61q/s

=== Summary ===
max_tokens  accuracy   correct   total    rate
      1000     64.5%       851    1319    5.5
      2000     80.0%      1055    1319    2.6
      3000     84.4%      1113    1319    1.6
      4000     86.4%      1139    1319    1.2
      5000     85.6%      1129    1319    0.9
      6000     87.4%      1153    1319    0.7
      7000     86.2%      1137    1319    0.6
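The sweep itself is just the same eval re-run once per token budget. A minimal sketch of that outer loop (names are mine; `ask` stands in for the call to the vLLM endpoint, shown here with a toy stub instead of a real model):

```python
from typing import Callable, Iterable

def sweep_max_tokens(
    questions: Iterable,          # (question, ground_truth) pairs
    ask: Callable,                # ask(question, max_tokens) -> extracted answer
    budgets: list,
) -> dict:
    """Re-run the whole eval once per max_tokens budget and
    return {budget: accuracy}."""
    questions = list(questions)
    results = {}
    for budget in budgets:
        correct = sum(ask(q, budget) == truth for q, truth in questions)
        results[budget] = correct / len(questions)
    return results

# Toy stand-in model: answers correctly only when given enough budget,
# mimicking the accuracy jump from 1000 to 2000 tokens above.
def toy_ask(question: str, max_tokens: int) -> str:
    return "72" if max_tokens >= 2000 else "?"

print(sweep_max_tokens([("q1", "72"), ("q2", "72")], toy_ask, [1000, 2000]))
# -> {1000: 0.0, 2000: 1.0}
```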

LFM2.5 result

About boxed & fallback. The model is asked to put the result in a "boxed" format, but it sometimes fails to do so. I have some "fallback" parsing to try to extract the answer anyway.
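A sketch of what such a cascade can look like (the patterns and their priority order are my guesses, loosely matching the extraction-method names that show up in the evaluation report further down):

```python
import re

def extract_answer(completion: str):
    """Try extraction strategies in priority order and report which
    one fired: boxed first, then looser fallbacks."""
    patterns = [
        ("boxed", r"\\boxed\{([^}]+)\}"),
        ("the_answer_is", r"[Tt]he answer is\s*\$?([\-0-9.,]+)"),
        ("equals_eol", r"=\s*\$?([\-0-9.,]+)\s*$"),
        # Last resort: the final number anywhere in the completion.
        ("fallback", r"([\-0-9.,]+)(?!.*[\-0-9.,])"),
    ]
    for name, pat in patterns:
        m = re.search(pat, completion, re.DOTALL)
        if m:
            return name, m.group(1).replace(",", "")
    return None

print(extract_answer(r"so the total is \boxed{72}."))  # -> ('boxed', '72')
print(extract_answer("48 + 24 = 72"))                  # -> ('equals_eol', '72')
```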

More graph (edited)

It doesn't benefit from more context. The variance is within the margin of error. Multi-turn / retry would probably benefit from it, though.

I'm not running my full suite, so there might be some false negatives here (improper parsing flagging a correct result as incorrect).

You'll hear me say it a lot on dev.to: I don't have enough compute power. Still, it's a good rough estimate.

Multi-turn (edit2)

I tried to bench it on multi-turn: if an answer is wrong, tell the AI that the answer is wrong so that it can try again. TL;DR: not worth it. The model might be too small. It works, but at the cost of a long context and much longer computation time. It might be worth it for some use cases, I suppose.
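The retry loop above is simple to sketch. Everything here is a stand-in (names are mine): `ask` is whatever queries the endpoint with a chat history, and `check` is the comparison against the ground truth.

```python
def retry_until_correct(ask, check, question, max_turns=3):
    """Multi-turn retry: if the graded answer is wrong, append a
    'that answer is wrong, try again' turn and re-query.
    Note how the context grows on every failed attempt."""
    messages = [{"role": "user", "content": question}]
    for turn in range(1, max_turns + 1):
        answer = ask(messages)
        if check(answer):
            return answer, turn
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user", "content": "That answer is wrong. Try again."},
        ]
    return None, max_turns

# Stub model that gets it right on the second attempt.
attempts = iter(["71", "72"])
answer, turns = retry_until_correct(lambda m: next(attempts),
                                    lambda a: a == "72", "q")
print(answer, turns)  # -> 72 2
```

The cost is visible in the loop: every retry re-submits the whole growing history, which is exactly why it gets slow.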

Versus Qwen-3-1.7B

This is a simple test on a subset.

  • a self-evaluation of the answer compared to the ground truth, in case the regexp failed. This gives us some measure of the margin of error (0.2%).

Context is 6k. It is important to note that Qwen is MUCH slower (3~5x) than LFM2.5, hence the small, simple test.

=== GSM8K Evaluation Report ===
Total records: 1319

Numeric match (extracted == ground truth): 1194/1319 (90.5%)

Self-eval (1319 rated):
  CORRECT: 1196 (90.7%)
  INCORRECT: 118 (8.9%)
  UNSURE: 5 (0.4%)

Smart-eval: no data

Regex vs Self-eval (regex/judge):
  C/C: 1194
  I/C: 2
  I/I: 118
  I/U: 5
  Disagree: 2 (0.2%)

Extraction methods:
  boxed: 1267 (96.1%)
  the_answer_is: 40 (3.0%)
  fallback: 10 (0.8%)
  equals_eol: 2 (0.2%)
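The "Regex vs Self-eval" cross-table above is just a tally over two verdicts per question. A sketch of how it can be built (function and field names are mine):

```python
from collections import Counter

def agreement_table(rows):
    """Cross-tabulate the regex verdict (C/I) against the model's
    self-eval verdict (C/I/U). `rows` is a list of
    (regex_ok: bool, judge_label: str) pairs."""
    table = Counter()
    for regex_ok, judge in rows:
        # 'CORRECT' -> 'C', 'INCORRECT' -> 'I', 'UNSURE' -> 'U'
        table[("C" if regex_ok else "I") + "/" + judge[0]] += 1
    # The two verdicts conflict only on C/I and I/C cells.
    disagree = table["C/I"] + table["I/C"]
    return table, disagree

rows = [(True, "CORRECT")] * 3 + [(False, "CORRECT")] + [(False, "INCORRECT")] * 2
table, disagree = agreement_table(rows)
print(dict(table), disagree)  # -> {'C/C': 3, 'I/C': 1, 'I/I': 2} 1
```

Note that 'INCORRECT' and 'UNSURE' map to distinct first letters, so the one-character shorthand is unambiguous here.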

Turboquant

I tried to run turboquant, but it doesn't accept parallel queries, at least with the install I tried. So it almost completely defeats the purpose of compressing the KV cache for my use case.

It might still be interesting for long context, but it's extremely slow compared to plain vllm. It's all very early alpha anyway; wait and see for a usable turboquant.

Ah... and it doesn't support LFM2.5 somehow, despite being supposedly model-agnostic.

TODO

There is always more to do.
