Now at 40-50tok/s generation and ~2000 tok/s prefill with a model that I've seen reason through race conditions and be able to trivially pull off any straight-forward coding task, and remain coherent at 500k context. With a preview checkpoint of the weights!
I'm excited for the future of local LLMs. There is some buy-in but apparently not an extreme amount to get access to models that can stand in the for the giants on all but the most challenging and/or hands-off coding tasks.
Not clear how you went from ~11-14 to ~40-50 tok/s. Is it by running the quant native model and adding a second Spark?
Cheers
Instructions to reproduce, and benchmarks here: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...
Moving to 2 sparks meant switching to vLLM with 2-way tensor parallelism and working multi-token prediction. The parallelism and MTP on top of better tuned kernels[1] gave an extremely nice boost! I was quite pleased. I've seen bursts up to 60tok/s at ~150k context - sometimes the MTP seems to really kick in (i.e. high acceptance rate on its tokens)
Currently running a custom vLLM build put together by some folks on the Nvidia forums[2], which speaks to how early support for the model is.
I've had positive experiences running GLM 4.7 via vLLM, tool calling works well and the inference is fast. Do you run DeepSeek V4 Flash on vLLM?
Eventually, I'm going to stop writing stuff like this @dang, because even though it is literally being read by a human, it's going to just be copy and pasted into a chatbot, which will actually spend the time trying to comprehend what I am saying.
(emphasis mine)
> Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model.[2]
> As DeepSeek-V3, DeepSeek-V4 series also set MTP modules and objectives. Given that the MTP strategy has been validated in DeepSeek-V3, we adopt the same strategy for DeepSeek-V4 series without modification.[3]
[1]: https://arxiv.org/pdf/2412.19437#subsection.2.2
[2]: https://arxiv.org/pdf/2412.19437#subsubsection.5.4.3
[3]: https://arxiv.org/pdf/2606.19348v1#subsection.2.1
Side comment: I feel you may be too cynical towards your fellow commenters.
You draft n tokens, and you verify them in a single forward pass.
Here's the vLLM flag:
--speculative-config '{{"method":"mtp","num_speculative_tokens":2}}'
They may have only trained at a depth of 1, but boy-howdy, does that little MTP head do a pretty good of successfully predicting that second token about 60-80% of the time.It works great. I'll keep my increased performance, and
> so i don't know why you are punching these documents into the chatbot, and asking it questions about them, and then it gives you the wrong answers
you keep whatever this is. I posted direct quotes from their papers which say "it speeds up inference" (paraphrasing). I don't feel there is anything I can do to turn this into a good-faith discussion. Beep boop.
Assuming that’s not true based on your phrasing, you’d be shooting yourself in the foot. Start using online models with the same quant at least benchmark as what you could run at home. Prepare for the at home model to be slower.
You don't even need to go that far. For example, with Exoscale Dedicated Inference[1] you just point it at the Hugging Face for the model and quantisation you want to test and it automagically spits out an OpenAI-compatible API endpoint.
[1] https://www.exoscale.com/ai-cloud-infrastructure/dedicated-i...
(I have no relationship with Exoscale, this particular product just crossed my radar recently)
Well, yes, I understood that.
Which is why I started with the words "You don't even need to go that far.".
To re-phrase what I said in clearer terms:
Instead of renting an instance, then messing around with configuring Linux and whatever via SSH or Ansible or whatever. Just point a Hugging Face link at this magic service and get a ready-to-go API back. Enabling you to test your desired model spec with minimum fuss.
Ultimately the guy wants his own hardware. So why waste time messing around with someone else's VM if you just want to test a specific model spec. That is the TL;DR.
I very much want local AI to win this in the end, but it’s extremely expensive to run good models at good speed locally right now. Minimax M2.5/2.7, Qwen 3.6, etc are pretty good for basic stuff, but pretty far off from competing with Opus/Fable.
While experimenting with quantization I found that there is a non-trivial tradeoff between quality and memory footprint. Overall my experience follows the reported pattern of "2-bit is mwah, 4-bit half decent and 6-bit required for programming. Still, although MiniMax-m2.7 is useable with the 6-bit quantizations that unsloth provides, it felt like such a breath of fresh air when I used the reference full-size model.
I find it difficult to say why. I had mostly the same setup as before (parsing had to be slightly adjusted in Zed). Aside from not experiencing the thinking loops (where minimax would get stuck generating the same sentences over and over) there is little evidence of any real improvement (although the average thinking time felt shorter).
I would recommend against very low quantizations of GLM 5.0/5.1/5.2 or Kimi 2.5/2.6. Smaller models were more reliable, and therefore more useful.
With the exception of DwarfStar + DS4-Flash with IQ2_XXS quantization, which somehow seems to not suffer as much as I'd thought. I'd still opt for a smaller model + at least Q8.
I've run 2 qwen/gemma @8bit with full context window side-by-side. Right now I have 4 models on my spark (qwen36moe, embedding, reranker, qwen3-1.7B) to support my markdown kb tool.
The setup is not as capable, but still good and gets better with models/algos. To me, it's more about the freedom to tinker, freedom from token bill anxiety, and potential right to compute should the government/oligarchy decides it gets to decide who can access which models.
Would you mind elaborating on this?
I shared a project in their #research channel where I used their qwen36moe quant to refresh my PhD research. The channel had a topic that ended with something like "and all things research..."
One of their people accused me of self-promotion, and I reiterated that I shared it in that channel because it was their quant doing something (I thought) interesting as a research model. The number of people interested in the topic can be counted on your hands (in binary).
They remained accusatory, made it personal, and then started deleting messages. I suppose I escalated a bit (from their perspective), saying how this was not a good first encounter, they could have asked me to move it instead of just deleting it. Then they deleted every message, including all of their own, and put me in timeout. Erased from history, unable to participate, and so I left.
A coworker of mine (ML guy) is also sus about their quants, not nefarious, more that their benchmark results do not mean they are better, possibly skewed / benchmaxxed.
Edit: 3.6 not 3.7!
Someone's optimistic
(Of course for all I know the 3.7 series is doing incredibly well in China, but I've seen almost no buzz around it from the circles that I inhabit.)
We know they have what it takes to fight back, and they know it... so I agree, there's no reason not be optimistic about future Qwen releases. But then I've never really understood what motivates these releases in the first place.
They aren't, though. GLM 5.2 is very far out in front of everybody else in the open-weight business when it comes to coding. They seem to have put a disproportionate effort into improving coding, and while it paid off for that, it does seems to have cost some efficiency.
You could say that GLM 5.2 is to DS4 as Fable is to Opus. Fable is is no better at a lot of tasks than Opus, but it codes like nothing else ever built.
fwiw, this is my high level process.
1. i keep comprehensive notes _while_ i'm experimenting(like checkpoints). this is a mix of commits and a append-only changelog file
2. if some part of what i've done seems like i can share with public, i create an outline of it. small paragraph with a few points
3. then i ask the model to merge the two to generate a post. i have my own style guide but ofcourse, the model idiosyncrasies will always creep in.
The tool_choice="auto" failure on Qwen3-Next isn't a parser issue — the model reasons inside <think>, decides, and never emits the tool call. No error, just empty tool_calls. The fix was swapping the backbone from Thinking to Instruct, not tuning any parser flag.
The "load the bigger model first, size the smaller against actual residency" playbook generalizes to anything with shared CUDA framework overhead. The ~5 GiB framework floor shows up even at small gpu_memory_utilization values — plan against actuals, not targets.
```
(...) - Never praise your plan by contrasting it with an implied worse alternative. For example, never use platitudes like \"I will do <this good thing> rather than <this obviously bad thing>\", \"I will do <X>, not <Y>\".
- Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query. (...)
```
It seems the OpenAI people added that first bullet to specifically address the tendency the model has, as seen in the parent comment. The goblin stuff coincidentally appears right after in the system prompt, so in included it as a bonus.
Though I concede it is not that much different than straightening the tie of your most valuable employee before you unwisely put them in front of a client and saying "please don't tell them about the regressions they didn't notice and remember, they don't want things explained in allegories drawn from the Silmarillion".
This may happen once we see finetuned GLM/Kimi/DeepSeek companies enter the market. I think it's not happening yet because of the hardware supply chain issues.