← All Insights
AI Tools

Why did the reasoning model lose the planning test?

The Short Version

A 14 billion parameter reasoning model produced valid JSON plans across the 34 prompt batch and lied about what those plans achieved. A 30 billion parameter mixture of experts model without a reasoning trace produced the same JSON in a third of the time and caught the impossible requests the reasoning model missed. The reasoning trace gave the plan an air of due diligence the output did not earn. A model that produces a confident plan with a quietly wrong outcome is what does the most damage in a production agent loop. Run your own test before trusting the pitch.

I expected the 14 billion parameter reasoning model to plan better than the 30 billion parameter model without a reasoning trace. The data said the opposite. The reasoning model's failures look like successes from the outside, this is worth paying attention to.

The setup

In order for more people to use AI as a personal assistant it needs to be economical and that means it needs to run on consumer GPU cards. I am evaluating the core component of an AI assistant which is the part that can receive a request and make a plan of how to achieve it. The planner takes a request like "find the most recent invoice in Downloads, read it, extract the dollar amounts," and produces a JSON plan listing which tools to call, in what order, with what arguments. The plan goes to an executor that runs the tools. The planner is the step that decides everything else.

I chose two local LLM model candidates that have Q4_K_M quantization (4-bit weights with mixed precision for stability), and can run on a single RTX 3090 with 24 GB of VRAM with at 32K context window.

  • DeepSeek R1 distill, 14 billion parameters. A reasoning-tuned variant of Qwen 14B that emits a think block before its answer. Marketed as a planning-capable model.
  • Qwen3 30B A3B Instruct. A mixture-of-experts model. 30 billion parameters total, only 3 billion active for any given token. It does not have a reasoning trace and is pure instruction-following.

The bet I was making before the test ran was that the reasoning model wins. It is smaller and slower per token but the trace would translate into better judgment on novel requests. That bet did not pay off.

The test was a 34-prompt batch covering six categories:

  • Trivial requests needing one tool
  • Plans across multiple steps
  • Deliberately impossible requests where no tool matched
  • Ambiguous requests with missing recipients or file paths
  • Requests that needed user confirmation
  • Curveballs that did not fit the other categories

Each plan was validated against a Pydantic schema and the tool registry. The evaluation was looking for hallucinated tools, schema errors, and miss diagnosing which requests were impossible to achieve with the tools provided.

What happened on one request at a time

The DeepSeek model produced valid JSON for every prompt. Tool hallucinations were zero. The structural mechanics worked.

The judgment did not.

DeepSeek caught three of the six impossible prompts. The three it missed all failed the same way. When the requested action did not have a directly matching tool but an adjacent one existed, the model silently swapped without flagging it.

Example one. The prompt was "Text my wife that I'll be home late." The plan asked for the wife's email address and then sent an email titled "I'll Be Home Late." The model knew there was no SMS tool. It just decided email was close enough. The user asked for a text. The model sent an email.

Example two. The prompt was "Schedule a meeting with Sam next Tuesday at 2pm." The plan asked if Sam was the right recipient and then sent an email titled "Meeting Scheduled." There is no calendar tool. The model claimed it scheduled a meeting it did not actually schedule.

Example three is the one to study. The prompt was "Push the latest changes in this directory to GitHub." The plan called list_directory, then read_file, then summarize_text. The expected_outcome field read "the user's changes are pushed to GitHub." The assumptions field read "the user will execute the necessary Git commands after being prompted."

That plan does not push. It claims it does. The actual git work gets punted to the user with an assumption note that no human would read. Someone would assume the request was executed until something broke.

These are traces that look confident, with wrong conclusions. The reasoning trace gives the plan an air of due diligence the output does not earn.

In a separate failure category, DeepSeek invented world state. For the request "look up the thing we talked about yesterday," it built a plan with four steps starting with a call to list_directory on the path C:/Conversations. That directory does not exist. The model assumed a structure to satisfy the request rather than asking what the user meant. The catalog has an ask_user_clarification tool. The model did not use it.

The Qwen3 Instruct model, on the same prompts, with the same system instructions, caught all six impossible cases. It also ran roughly three times faster: about five seconds per request versus DeepSeek's fourteen.

What happened under concurrent load

One request at a time is one test. The other test is concurrency. A personal assistant that serves one prompt at a time is workable. An assistant that handles email triage while a chat session and a research task are also in flight is more useful, but only if the model holds up.

I ran the same 34 prompts through one, two, and four concurrent worker threads. Recent Ollama versions auto-parallelize when VRAM permits. Both models were tested under identical conditions.

Model                 Workers   Median latency   Throughput   Schema valid
-------------------   -------   --------------   ----------   ------------
Qwen3 Instruct              1            4.5 s   0.21 req/s         100%
Qwen3 Instruct              2            5.1 s   0.39 req/s         100%
Qwen3 Instruct              4            8.9 s   0.45 req/s         100%
DeepSeek R1 distill         1           13.4 s   0.07 req/s         100%
DeepSeek R1 distill         2           21.4 s   0.09 req/s         100%
DeepSeek R1 distill         4           45.9 s   0.08 req/s         100%

Per-request latency for Qwen3 went from 4.5 seconds at one worker to 8.9 seconds at four workers. About a 2x penalty. Throughput more than doubled.

Per-request latency for DeepSeek went from 13.4 seconds at one worker to 45.9 seconds at four workers. About a 3.4x penalty. Throughput barely moved, and actually got slightly worse going from two workers to four.

The mechanism is simple. A reasoning model generates more tokens per request because it writes a think block. Each parallel request pays that tax. The GPU is doing more work per request, so when GPU time is divided across workers, each request slows more. The gap visible on a single request stayed visible under load, just amplified.

Quality held. Both models produced valid JSON at every concurrency level. The throughput and latency math is what diverged.

A side story about prompts

Qwen3 Instruct did not get to 100% on impossible prompts on the first try. With the original system prompt, it caught all six impossible cases but overcorrected on ambiguous ones. The prompt "Email it to him" got marked impossible instead of triggering a clarification step. The model's own impossible_reason field said "the request is ambiguous and cannot be fulfilled without user clarification." It knew clarification would fix the request. It marked it impossible anyway.

I tightened the prompt with a rule that said "do not mark impossible when the user can clarify." That swung the model the other way. Now it asked for clarification even when no clarification could fix the request. For "delete every old file," it built a plan that ended with a call to ask_user_clarification asking "should I proceed with deleting old files?" There was no delete tool. No answer the user could give would unblock that plan.

The fix was a structural test the model could apply itself. Imagine the user answers your clarification question perfectly. Can the plan complete using only tools in the catalog? Yes means ambiguous, and clarification is the right move. No means impossible, and the steps list must be empty.

With that test in the system prompt, Qwen3 hit 100% on both categories. Two more iterations to lock the prompt. The validated version is saved as the baseline.

This is the part of the result worth taking somewhere else. The imagined-answer test works for any capability versus completeness distinction. When a request is fuzzy, ask whether the missing information could close the gap. If yes, the request is a clarification problem. If no, it is a capability problem. That framing carries over to other agent workloads.

Why this is the result that matters

A model without a reasoning trace that says "I cannot do this" is honest about its limits. A reasoning model that produces a plan claiming "code was pushed to GitHub" while the plan only reads files and writes a summary is producing a failure that looks like a success.

In a personal assistant with confirmation prompts turned off, that is the failure that becomes a real incident. The user gets a notification saying the changes were pushed. They believe it. The changes did not get pushed. Later someone may ask why the latest fix is not in the repo.

The reasoning trace is not free, and on this workload it added latency without providing value. The model paid the cost of writing roughly 600 tokens of internal deliberation and then produced a plan that lied about its own outcome. The instruct model produced the same JSON shape on the same prompts in a third of the time, and caught the cases the reasoning model missed.

The concurrency math increases the cost. Reasoning model per-request latency grew 3.4x from one worker to four. Instruct model latency grew 2x. A model that is already slower per request gets slower again, faster, when you ask it to share the GPU.

My Observations

The planner role on this hardware goes to the Qwen3 Instruct model. The pass criteria were schema validity, no hallucinated tools, impossible prompts caught correctly, low latency, and plan substance reviewed by hand. Qwen3 Instruct cleared all five at one worker and held quality up to four workers. DeepSeek failed the criterion for catching impossible prompts at the structural level, and failed the latency criterion at four workers.

A few cautions worth naming. This is one task. The planner role produces JSON output with a constrained schema. Reasoning models may still win on tasks that genuinely require internal deliberation across multiple steps producing text output, like long-form writing or proof construction. I would not generalize this result to "reasoning models lose at everything."

The result also depends on the tool catalog. The registry I tested against was tight, with deliberate omissions in the categories the impossible prompts targeted. A different catalog with broader coverage might surface different failure modes for the same models.

What does generalize is the failure shape. A reasoning model that produces a confident plan with a quietly wrong outcome is the kind of failure that does the most damage in a production agent loop. If you cannot trust that the model will actually achieve the stated goal, it will not be used. The reasoning trace is not a substitute for measuring outcome integrity.

The hardware side of this decision lives in a separate insight, "What Does It Take to Run a Local AI Assistant." That piece handles bandwidth ceilings, VRAM headroom, and when running locally pencils out. This piece picks up after, on which model to put on the hardware.

The practical takeaway

Cutting the API cord on a personal assistant is reasonable. Choosing the model that runs on the other end of that cord is harder, and the marketing sometimes does not match what the model actually does on your specific work. The pitch deck for a reasoning model says it plans better. On the planner workload I tested, the instruct model planned better and ran three times faster, with a smaller penalty under concurrent load.

Run your own test before trusting the pitch. The harness I built for this one is about 500 lines of plain Python, stub tools with realistic docstrings, and a 34-prompt batch covering the categories that matter. The marginal cost of running it on a new model is the time it takes for the model to download.

By the Numbers

Chain-of-thought prompting improves large language model performance on tasks requiring arithmetic, common sense, and symbolic reasoning, with the largest gains on tasks that benefit from explicit intermediate reasoning.

Wei et al., 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,' arXiv:2201.11903, 2022

Mixture of experts language models such as Mixtral 8x7B activate only about 13 billion of their 47 billion total parameters per token, sharply reducing the per-token compute and memory traffic relative to a dense model of equivalent total size.

Jiang et al., 'Mixtral of Experts,' arXiv:2401.04088, 2024

Have a Question About Your Business?

Book a free 30-minute call and we'll work through it together.

Start a Conversation