How to Run Qwen 3.5 Medium Locally: The 35B MoE Value King


Running frontier-grade models locally usually hits the same wall: VRAM. Dense 30B to 70B models can be incredible, but the moment you want decent context length, fast generation, and stable throughput, you are forced into aggressive quantization or expensive multi-GPU setups. Qwen 3.5 Medium flips that math with a Mixture of Experts (MoE) design, and the standout in the series is Qwen3.5-35B-A3B: a 35B-parameter model that activates only about 3B parameters per token.

This guide walks through a practical local LLM setup for Qwen 3.5 Medium on consumer GPUs (24GB class) and Apple Silicon, including how to choose the right runtime (Ollama, llama.cpp, vLLM-style servers, or MLX), how to pick weight formats, and how to handle Thinking Mode and multimodal workflows in a way that is reliable and repeatable.


What makes Qwen3.5-35B-A3B different (and why MoE matters)

MoE models are not “smaller models pretending to be big.” They are large models where each token routes through only a subset of specialized “experts.” Instead of paying the compute and memory cost of using every parameter on every token (dense), you pay for a small activated slice (sparse). That is why the Qwen3.5-35B-A3B profile feels like a sweet spot: you get 35B-scale capacity with ~3B active parameters per token, which often translates into strong reasoning and instruction-following while staying feasible on a single consumer machine.

Practically, this impacts three decisions you must make for local deployment:

  • Runtime choice: Not every local inference engine supports MoE equally well (or supports it efficiently).
  • Weight format: The same model can be served as GPU-friendly (BF16/FP16/FP8) or CPU/GPU mixed quant formats (GGUF, etc.).
  • Performance tuning: Context length and KV-cache settings can dominate VRAM usage, even if the model itself is sparse.

Thinking mode: when to use it

Many modern instruction models expose a “Thinking Mode” style of prompting where the model is encouraged to produce extra internal reasoning tokens before emitting a final answer. Locally, the trade-off is straightforward: Thinking Mode usually improves difficult reasoning (multi-step debugging, math, planning) but costs more tokens and latency. If you are building an agent, a code assistant, or anything that benefits from correctness over speed, it can be worth it. If you are doing chat, summarization, or classification, you will typically want it off.


Hardware planning: what you need to run Qwen 3.5 Medium

“Runs on 24GB” can mean very different things depending on your context length, batch size, and whether you keep everything on the GPU. Use the table below as a reality check for typical developer workflows (interactive chat, coding help, small agent loops). The goal is stable inference, not a benchmark screenshot.

  • Single 24GB NVIDIA GPU (e.g., RTX 4090 / 3090 class). Good for: fast local chat, coding, and tools. Watch: KV-cache VRAM at long context; GPU offload settings. Approach: use a GPU-first runtime (or GGUF with high GPU offload) and keep context reasonable.
  • 16GB NVIDIA GPU. Good for: casual local use. Watch: model fit and KV cache; quantization quality. Approach: use quantized weights (GGUF or similar) and keep context shorter.
  • Apple Silicon (unified memory). Good for: portable dev and private workflows. Watch: unified memory consumption; swap pressure. Approach: use MLX/Metal-friendly tooling and pick an appropriate quant.
  • CPU-only. Good for: testing, batch jobs, and automation. Watch: speed (tokens/sec). Approach: use llama.cpp with a quant; consider smaller Qwen 3.5 Medium variants if needed.

Rule of thumb: even if the model weights fit, long context is where VRAM disappears. If you push huge context windows, the KV cache becomes the bottleneck. For best “developer laptop or single GPU” experience, start with a moderate context and scale up only when you can measure stable VRAM headroom.
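To make that rule of thumb concrete, you can estimate KV-cache size from the model's attention dimensions. The sketch below uses hypothetical layer and head counts for illustration only; read the real values from the checkpoint's config.json before planning VRAM.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two cached tensors (K and V) per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dimensions, not the actual Qwen3.5-35B-A3B config:
cache_gib = kv_cache_bytes(
    n_layers=48, n_kv_heads=4, head_dim=128, seq_len=32768
) / 2**30
print(f"KV cache at 32k context: {cache_gib:.2f} GiB")  # → 3.00 GiB
```

Note how the cost scales linearly with context: under the same assumptions, 8k context needs a quarter of that, which is why "start moderate, then scale" is the safe order of operations.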


Choose your local runtime (Ollama vs llama.cpp vs vLLM vs MLX)

There is no single best runtime. The best choice depends on whether you prioritize simplicity (Ollama), maximum control (llama.cpp), scalable serving (vLLM-style servers), or Apple Silicon efficiency (MLX). Here is how to decide quickly.

Option A: Ollama (fastest path to a working local setup)

Ollama is ideal if you want a “pull the model and run it” experience and you are okay with a curated ecosystem. Typical workflow: pick a Qwen 3.5 Medium model that Ollama supports, run it, and then attach a UI like Open WebUI. Under the hood, you are still benefiting from quantized formats and a mature local inference stack, but you avoid 90% of the knobs on day one.

Example commands (adjust the model tag to the exact Qwen 3.5 Medium name available in your registry):

# 1) Pull the model (name/tag varies by registry)
ollama pull qwen3.5:35b-a3b

# 2) Run an interactive session
ollama run qwen3.5:35b-a3b

# 3) Run as a local server (for UIs or API clients)
ollama serve

When Ollama is the wrong tool: if you need fine-grained MoE tuning, advanced batching, custom CUDA builds, or you are serving multiple concurrent users with high throughput, you will likely want a dedicated server stack.
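Once `ollama serve` is running, any client can hit its local HTTP API. A minimal stdlib-only sketch (the model tag is an assumption; use whatever name your registry exposes):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model, prompt, stream=False):
    # Minimal JSON body for Ollama's /api/generate endpoint.
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(model, prompt):
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running `ollama serve`:
# print(generate("qwen3.5:35b-a3b", "Summarize MoE routing in one sentence."))
```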

Option B: llama.cpp (best control for quantized local inference)

llama.cpp is the go-to when you want maximum control over quantized inference and hardware offload. It is also the most common path when your model weights are in GGUF format. You can run fully on CPU, or offload layers to GPU for much better speed. For a 24GB GPU, you typically aim for high offload while leaving room for KV cache.

Example (with a GGUF file you have already downloaded):

# Build llama.cpp (example for macOS/Linux)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j

# Run (tune -ngl and -c for your VRAM and context needs)
./build/bin/llama-cli \
  -m /path/to/Qwen3.5-35B-A3B.gguf \
  -c 8192 \
  -ngl 99 \
  --temp 0.7 \
  --top-p 0.9 \
  -p "You are a helpful coding assistant. Explain what MoE is in two paragraphs."

Tuning tips that matter: -c (context) directly drives KV-cache size; -ngl sets how many layers are offloaded to the GPU; and concurrency is limited compared to server-first runtimes. If generation suddenly slows or crashes, reduce context first, then reduce offload.
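A quick back-of-the-envelope helper for the "high offload while leaving room for KV cache" budget. The 19 GiB weight size and 1.5 GiB overhead are illustrative assumptions, not measured values for any specific GGUF:

```python
def kv_headroom_gib(vram_gib, weights_gib, runtime_overhead_gib=1.5):
    # What remains for KV cache and activations once the quantized
    # weights and an assumed runtime overhead are resident on the GPU.
    return vram_gib - weights_gib - runtime_overhead_gib

# e.g. a 24 GiB card with a hypothetical ~19 GiB quantized GGUF:
headroom = kv_headroom_gib(vram_gib=24, weights_gib=19)  # → 3.5
```

If the headroom looks thin for your target context, either drop -ngl a little or pick a smaller quant before touching anything else.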

Option C: vLLM-style serving (best for APIs, concurrency, and “real app” deployments)

If you are building a local service that multiple apps or users will hit, you will want a server with efficient batching and an OpenAI-compatible endpoint. vLLM is a common choice in this category. The main advantage is that you stop thinking in “single chat session” terms and start thinking in “requests per second” and concurrency.

Example server launch (the exact flags depend on your installed version and the model repository):

# Install (example)
pip install -U vllm

# Serve a Hugging Face model repo locally
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-35B-A3B \
  --port 8000 \
  --max-model-len 8192

Best practice: when running MoE models on a single GPU, start with conservative max context and verify steady-state VRAM at load before raising --max-model-len. If you must support long context, plan for reduced concurrency.
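Because the endpoint is OpenAI-compatible, a client is just a chat-completions POST. A stdlib-only sketch (port and model name match the launch example above; adjust to your setup):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # OpenAI-compatible route

def build_chat_request(model, user_msg, max_tokens=512):
    # Standard OpenAI-style chat body accepted by the server.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
    }

def chat(model, user_msg):
    body = json.dumps(build_chat_request(model, user_msg)).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Requires the server from above to be running:
# print(chat("Qwen/Qwen3.5-35B-A3B", "Explain KV cache in one paragraph."))
```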

Option D: MLX on Mac (best for Apple Silicon portability)

On Apple Silicon, MLX-based tooling often provides the most stable “local on a MacBook” experience because it targets the Metal stack and unified memory. The trade-off is that model availability and quant formats can differ from the CUDA ecosystem. Use this route when you care about privacy and portability more than maximum tokens/second.

# Example (tooling and exact model identifiers vary by ecosystem)
pip install -U mlx-lm

# Run a compatible Qwen 3.5 Medium checkpoint
mlx_lm.generate \
  --model Qwen/Qwen3.5-35B-A3B \
  --prompt "Write a Python function to parse a log line into structured JSON." \
  --max-tokens 512

Step-by-step local LLM setup for Qwen 3.5 Medium (practical workflow)

This section is a “do this in order” checklist you can reuse, regardless of runtime.

  1. Pick your target machine. Decide whether you are optimizing for a 24GB NVIDIA GPU, a smaller GPU, or Apple Silicon unified memory.
  2. Choose a runtime based on your goal. Ollama for convenience, llama.cpp for quant/control, vLLM for serving, MLX for Mac.
  3. Select the correct weight format. For CUDA serving, BF16/FP16 (or FP8 where supported) is common. For local single-process inference, GGUF-style quant often wins on usability.
  4. Start with a moderate context. Use something like 4k to 8k tokens initially. Increase only after you measure VRAM headroom.
  5. Decide on Thinking Mode policy. Default it off for speed, then enable for tasks that need deeper reasoning (debugging, planning, math).
  6. Add a UI or API layer. For chat: Open WebUI or similar. For apps: OpenAI-compatible local endpoint (vLLM/Ollama).
  7. Measure and tune. Track tokens/sec, VRAM, and latency under real prompts. Tune context, batch sizes, GPU offload, and sampling.

Connect a local UI (optional but recommended)

A UI makes it easier to compare prompts, iterate on system instructions, and evaluate Thinking Mode behavior without writing a client. If your runtime exposes an HTTP server (Ollama or a vLLM OpenAI-compatible endpoint), most local UIs can connect with minimal configuration.

Two patterns work well:

  • Chat-first UI for interactive testing, prompt templates, and quick comparisons.
  • API-first server for app integration (tools, agents, RAG pipelines).

Enable “Thinking Mode” safely (without breaking your app)

In practice, Thinking Mode changes output structure. If your downstream code expects a single “final answer,” extra reasoning tokens can confuse parsers, JSON schemas, or tool-calling. The safest approach is to:

  • Keep Thinking Mode off by default for production endpoints.
  • Enable it per-request when you detect a hard task (or when the first attempt fails).
  • Strip or separate reasoning tokens before post-processing (especially if you require valid JSON).
# Two-pass pattern: fast attempt first, escalate to Thinking Mode on failure
def solve(prompt, llm, passes_checks):
    # 1) Fast attempt (thinking off)
    answer = llm(prompt, thinking=False)
    if not passes_checks(answer):
        # 2) Retry with thinking on
        answer = llm(prompt, thinking=True)
    return answer
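For the stripping step, here is a minimal sketch. It assumes the runtime wraps reasoning in <think>...</think> tags; the exact delimiter depends on the model's chat template and your server, so verify against real output first.

```python
import re

# Assumption: reasoning arrives inside <think>...</think> blocks.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thinking(text):
    # Drop all reasoning blocks, keep only the final answer.
    return THINK_RE.sub("", text).strip()

raw = '<think>The user wants JSON, check keys...</think>\n{"status": "ok"}'
print(strip_thinking(raw))  # → {"status": "ok"}
```

Run the stripped text, not the raw text, through your JSON validator or tool-call parser.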

Performance tuning: context length, KV cache, and quantization

Most “it doesn’t fit in VRAM” problems are actually “KV cache blew up” problems. Even with MoE, long context means large KV cache. If you need long context, you can still run Qwen 3.5 Medium locally, but you must treat context length as a first-class capacity planning variable.

Start with these defaults

  • Context: start at 4096 to 8192.
  • Sampling: temperature 0.6 to 0.8, top-p 0.85 to 0.95 for general assistant behavior.
  • Max tokens: cap generation to keep runaway thinking traces under control.
  • Concurrency: keep it low until you confirm steady VRAM under load.

Quantization guidance (pragmatic)

If you are deploying on a single consumer GPU, quantization is the lever that most often turns “almost works” into “works all day.” The exact quant scheme depends on your runtime and the format you download, but the strategy is consistent:

  • Prefer lighter quant (higher quality) if you have VRAM headroom and you care about reasoning and code accuracy.
  • Prefer stronger quant (smaller) if you need longer context or you are on 16GB (or less) VRAM.
  • Benchmark with your real prompts, not synthetic tests. MoE models can respond differently to quant choices than dense models.

Finally, treat “tokens/sec” and “first token latency” separately. For interactive use, first token latency matters more than raw throughput.
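Measuring the two separately is simple if your client streams tokens. A sketch that works with any token iterator (the fake stream is a stand-in for your real streaming client):

```python
import time

def measure_stream(token_iter):
    # Returns (first_token_latency_s, tokens_per_sec) for any token stream.
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return ttft, (count / total if total > 0 else 0.0)

# Simulated stream; swap in your client's streaming iterator:
def fake_stream(n=100, delay=0.001):
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = measure_stream(fake_stream())
```

If first-token latency creeps up while tokens/sec stays flat, look at prompt length and prefill, not at generation settings.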


Multimodal and “native vision” workflows (what to expect locally)

“Native multimodal” can mean different implementations depending on the checkpoint: some are single unified models, others use companion vision encoders, and local runtimes vary in how well they support the full pipeline. The practical advice is:

  • Confirm your runtime supports the exact multimodal variant you want to run locally (text-only support is more common than full vision).
  • Keep a text-only fallback for production apps, even if your ideal workflow is multimodal.
  • Test memory and latency with realistic image sizes and prompt patterns (vision can shift bottlenecks).

If you are building an app that needs OCR-like extraction, UI understanding, or screenshot debugging, consider starting with a text-only flow and then adding vision once you have a stable local server and evaluation harness.


Conclusion

Qwen 3.5 Medium, and especially Qwen3.5-35B-A3B, is compelling because it shifts local AI from “compromise” to “default.” The MoE design is the core enabler: you get 35B-scale capacity while activating roughly 3B parameters per token, which helps it run on realistic developer hardware. For most people, the quickest win is an Ollama-based setup with a UI; for power users, llama.cpp offers deep control over quantization and GPU offload; and for serious local serving, a vLLM-style OpenAI-compatible endpoint is the cleanest path to app integration.

Next steps: pick your runtime, start with moderate context (4k to 8k), measure VRAM and latency, then tune. Add Thinking Mode as a per-request tool for hard problems, not a default. Once you have stable local inference, you can layer on RAG, tools, and multimodal features with far less risk and far fewer surprises.
