I have a secret passion for LFM2.5-Thinking. It's a tiny 1.2B model, it's fast, it's a reasoning model, and it's good. Really good.
My tests are still in progress. All I can do is share some early results. I use the public GSM8k dataset, but with my own benchmarking scripts.
What is the GSM8k benchmark?
Grade School Math 8K, a dataset of 8.5K high-quality, linguistically diverse grade school math word problems requiring multi-step reasoning and elementary arithmetic operations.
The Top 10 leaderboard in 2026 goes up to 97%. Take note of the massive context size.
And this is what "State of the Art" results looked like in 2021: barely 35%.
Some early results
Questions: 1319 (test)
Context sizes to test: [1000, 2000, 3000, 4000, 5000, 6000, 7000]
Endpoint: http://192.168.1.110:8000 / lfm2.5-thinking
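For reference, the harness is roughly this. It's a simplified sketch, not my exact script: it assumes the endpoint speaks the OpenAI-compatible chat completions API that vLLM exposes, pulls GSM8K from the Hugging Face hub, and runs sequentially, whereas the real runs fire queries in parallel. The prompt wording and extraction are illustrative.

```python
# Simplified sketch of the benchmark loop (assumptions: OpenAI-compatible
# endpoint, GSM8K from the Hugging Face hub, sequential instead of parallel).
import re
import requests
from datasets import load_dataset

ENDPOINT = "http://192.168.1.110:8000/v1/chat/completions"
MODEL = "lfm2.5-thinking"
CONTEXT_SIZES = [1000, 2000, 3000, 4000, 5000, 6000, 7000]

def ask(question: str, max_tokens: int) -> str:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user",
                      "content": question + "\n\nPut the final answer in \\boxed{}."}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    resp = requests.post(ENDPOINT, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def gold(record: dict) -> str:
    # GSM8K stores the final numeric answer after "####"
    return record["answer"].split("####")[-1].strip().replace(",", "")

test = load_dataset("gsm8k", "main", split="test")  # 1319 questions
for max_tokens in CONTEXT_SIZES:
    correct = 0
    for i, rec in enumerate(test, 1):
        reply = ask(rec["question"], max_tokens)
        m = re.search(r"\\boxed\{([^}]*)\}", reply)
        if m and m.group(1).strip().replace(",", "") == gold(rec):
            correct += 1
        if i % 200 == 0:
            print(f"[{i}/{len(test)}] acc={correct}/{i} ({correct / i:.1%})")
    print(f"=== max_tokens={max_tokens}: {correct}/{len(test)} ===")
```

Raw output of the runs: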
=== max_tokens=1000 ===
[200/1319] acc=135/200 (67.5%) rate=3.9q/s
[400/1319] acc=251/400 (62.8%) rate=4.6q/s
[600/1319] acc=387/600 (64.5%) rate=5.0q/s
[800/1319] acc=512/800 (64.0%) rate=5.1q/s
[1000/1319] acc=640/1000 (64.0%) rate=5.3q/s
[1200/1319] acc=771/1200 (64.2%) rate=5.3q/s
Result: 851/1319 (64.5%) @ 5.48q/s
=== max_tokens=2000 ===
[200/1319] acc=163/200 (81.5%) rate=2.1q/s
[400/1319] acc=321/400 (80.2%) rate=2.3q/s
[600/1319] acc=479/600 (79.8%) rate=2.5q/s
[800/1319] acc=636/800 (79.5%) rate=2.5q/s
[1000/1319] acc=791/1000 (79.1%) rate=2.5q/s
[1200/1319] acc=956/1200 (79.7%) rate=2.6q/s
Result: 1055/1319 (80.0%) @ 2.63q/s
=== max_tokens=3000 ===
[200/1319] acc=171/200 (85.5%) rate=1.5q/s
[400/1319] acc=341/400 (85.2%) rate=1.5q/s
[600/1319] acc=505/600 (84.2%) rate=1.5q/s
[800/1319] acc=674/800 (84.2%) rate=1.5q/s
[1000/1319] acc=836/1000 (83.6%) rate=1.5q/s
[1200/1319] acc=1008/1200 (84.0%) rate=1.5q/s
Result: 1113/1319 (84.4%) @ 1.57q/s
=== max_tokens=4000 ===
[200/1319] acc=175/200 (87.5%) rate=1.1q/s
[400/1319] acc=348/400 (87.0%) rate=1.1q/s
[600/1319] acc=517/600 (86.2%) rate=1.1q/s
[800/1319] acc=683/800 (85.4%) rate=1.1q/s
[1000/1319] acc=852/1000 (85.2%) rate=1.1q/s
[1200/1319] acc=1033/1200 (86.1%) rate=1.1q/s
Result: 1139/1319 (86.4%) @ 1.17q/s
=== max_tokens=5000 ===
[200/1319] acc=176/200 (88.0%) rate=0.8q/s
[400/1319] acc=350/400 (87.5%) rate=0.9q/s
[600/1319] acc=523/600 (87.2%) rate=0.9q/s
[800/1319] acc=687/800 (85.9%) rate=0.9q/s
[1000/1319] acc=850/1000 (85.0%) rate=0.9q/s
[1200/1319] acc=1025/1200 (85.4%) rate=0.9q/s
Result: 1129/1319 (85.6%) @ 0.93q/s
=== max_tokens=6000 ===
[200/1319] acc=181/200 (90.5%) rate=0.7q/s
[400/1319] acc=351/400 (87.8%) rate=0.7q/s
[600/1319] acc=523/600 (87.2%) rate=0.7q/s
[800/1319] acc=696/800 (87.0%) rate=0.7q/s
[1000/1319] acc=863/1000 (86.3%) rate=0.7q/s
[1200/1319] acc=1048/1200 (87.3%) rate=0.7q/s
Result: 1153/1319 (87.4%) @ 0.73q/s
=== max_tokens=7000 ===
[200/1319] acc=172/200 (86.0%) rate=0.5q/s
[400/1319] acc=346/400 (86.5%) rate=0.6q/s
[600/1319] acc=520/600 (86.7%) rate=0.6q/s
[800/1319] acc=683/800 (85.4%) rate=0.6q/s
[1000/1319] acc=853/1000 (85.3%) rate=0.6q/s
[1200/1319] acc=1034/1200 (86.2%) rate=0.6q/s
Result: 1137/1319 (86.2%) @ 0.61q/s
=== Summary ===
max_tokens  accuracy  correct  total  rate (q/s)
1000 64.5% 851 1319 5.5
2000 80.0% 1055 1319 2.6
3000 84.4% 1113 1319 1.6
4000 86.4% 1139 1319 1.2
5000 85.6% 1129 1319 0.9
6000 87.4% 1153 1319 0.7
7000 86.2% 1137 1319 0.6
About boxed & fallback. The model is asked to put its result in a "boxed" format, but it sometimes fails to do so. I have some fallback parsing that tries to extract the answer anyway.
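The cascade looks roughly like this (a simplified sketch, not my exact parser); the method names match the ones counted in the extraction report further down:

```python
# Sketch of the cascading answer extraction (the real parser differs in details).
import re

NUM = r"-?[\d,]*\.?\d+"

def norm(s: str) -> str:
    # "1,234." -> "1234"
    return s.replace(",", "").replace(" ", "").rstrip(".").strip()

def extract_answer(reply: str):
    """Return (answer, method) or (None, 'none')."""
    m = re.search(r"\\boxed\{([^}]*)\}", reply)
    if m:
        return norm(m.group(1)), "boxed"
    m = re.search(rf"the answer is[^0-9-]*({NUM})", reply, re.IGNORECASE)
    if m:
        return norm(m.group(1)), "the_answer_is"
    hits = re.findall(rf"=\s*({NUM})\s*$", reply, re.MULTILINE)
    if hits:
        return norm(hits[-1]), "equals_eol"
    hits = re.findall(NUM, reply)
    if hits:
        return norm(hits[-1]), "fallback"  # last number anywhere in the reply
    return None, "none"
```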
More graphs (edited)
It doesn't benefit from more context beyond ~4000 tokens; the variance there is within the margin of error. Multi-turn / retry would probably benefit from it.
I'm not running my full suite, so there might be some false negatives here (improper parsing flagging a correct result as incorrect).
You'll hear me say it a lot on dev.to: I don't have enough compute power. Still, it's a good rough estimate.
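Quick sanity check on that claim, treating each question as an independent coin flip (an assumption):

```python
# Rough sanity check: binomial standard error on 1319 questions at ~86% accuracy.
from math import sqrt

n, p = 1319, 0.864                       # the 4000-token run
se = sqrt(p * (1 - p) / n)               # ~0.0094
print(f"95% CI: +/- {1.96 * se:.1%}")    # ~ +/-1.8%, wider than the 4k-7k spread
```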
Multi-turn (edit2)
I tried to bench it on multi-turn: if an answer is wrong, tell the AI that the answer is wrong so that it can try again. TL;DR: not worth it. The model might be too small. It works, but at the cost of a long context and much longer computation time. Might be worth it for some use cases, I suppose.
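The retry loop was essentially this (a sketch; it reuses the extract_answer() helper and the endpoint assumptions from above):

```python
# Sketch of the multi-turn retry: keep the failed attempt in context and ask again.
import requests

ENDPOINT = "http://192.168.1.110:8000/v1/chat/completions"
MODEL = "lfm2.5-thinking"

def solve_with_retries(question: str, gold: str,
                       max_turns: int = 3, max_tokens: int = 4000) -> bool:
    messages = [{"role": "user",
                 "content": question + "\n\nPut the final answer in \\boxed{}."}]
    for _ in range(max_turns):
        resp = requests.post(ENDPOINT, json={"model": MODEL, "messages": messages,
                                             "max_tokens": max_tokens}, timeout=600)
        reply = resp.json()["choices"][0]["message"]["content"]
        answer, _ = extract_answer(reply)
        if answer == gold:
            return True
        # The wrong attempt stays in context, so each turn gets longer and slower.
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",
                         "content": "That answer is wrong. Please try again."})
    return False
```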
Versus Qwen-3-1.7B
This is a simple test on a subset. On top of the regexp extraction, the report adds a self-evaluation of the answer against the ground truth in case the regexp failed; this gives us some estimate of the margin of error (0.2%).
Context is 6k. It is important to note that Qwen is MUCH slower (3~5x) than LFM2.5, hence the small, simple test.
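The self-eval step is roughly this (a sketch; it assumes the model judges its own reply against the ground truth and answers with a single verdict):

```python
# Sketch of the self-eval judge (assumption: one-word verdict against ground truth).
import requests

ENDPOINT = "http://192.168.1.110:8000/v1/chat/completions"

def self_eval(model: str, question: str, reply: str, gold: str) -> str:
    prompt = (
        f"Question: {question}\n\n"
        f"Proposed solution:\n{reply}\n\n"
        f"Ground-truth answer: {gold}\n\n"
        "Does the proposed solution arrive at the ground-truth answer? "
        "Answer with exactly one word: CORRECT, INCORRECT or UNSURE."
    )
    resp = requests.post(ENDPOINT, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 2000,  # leave room for the thinking trace
        "temperature": 0.0,
    }, timeout=600)
    verdict = resp.json()["choices"][0]["message"]["content"].upper()
    for v in ("INCORRECT", "CORRECT", "UNSURE"):  # check INCORRECT first (substring)
        if v in verdict:
            return v
    return "UNSURE"
```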
=== GSM8K Evaluation Report ===
Total records: 1319
Numeric match (extracted == ground truth): 1194/1319 (90.5%)
Self-eval (1319 rated):
CORRECT: 1196 (90.7%)
INCORRECT: 118 (8.9%)
UNSURE: 5 (0.4%)
Smart-eval: no data
Regex vs Self-eval (regex/judge):
C/C: 1194
I/C: 2
I/I: 118
I/U: 5
Disagree: 2 (0.2%)
Extraction methods:
boxed: 1267 (96.1%)
the_answer_is: 40 (3.0%)
fallback: 10 (0.8%)
equals_eol: 2 (0.2%)
Turboquant
I tried to run turboquant, but it doesn't accept parallel queries, at least with the install I tried. So it almost completely defeats the purpose of compressing the KV cache for my use case.
It might still be interesting for long context, but it's extremely slow compared to a normal vLLM. It's all very early alpha anyway; wait and see for a usable turboquant.
Ah... and somehow it doesn't support LFM2.5, despite supposedly being model-agnostic.
TODO
There is always more to do.




