By Sahil Kapoor in AI - 03 Jun 2026

Setting up DGX Spark with a MacBook

How I connected a MacBook to the NVIDIA DGX Spark with Tailscale, served Qwen3-Coder with vLLM, and got GitHub Copilot Agent Mode running on a local model.

📌

Series: Running Local, building a private AI stack from scratch. This is Part 2. Part 1 is here.

In Part 1 I talked about what the DGX Spark is. Getting a model running on it took a few hours. Making it useful took several days. My goal was simple: keep working entirely from macOS while turning the DGX into a private AI server that could replace cloud coding models for day-to-day development.

I still work on a Mac. Every tool I use daily is on macOS. VS Code, Copilot, the browser, everything. The DGX runs Ubuntu. I had no intention of switching operating systems. So the question became: how do you build a workflow where the Mac is what you touch and the DGX is what does the work?

Figuring that out took a few days, and a few things broke along the way.

Mac as screen, DGX as brain

The DGX sits on my desk, always on, running as a server. The Mac connects over the network, I open VS Code with Remote SSH into the DGX, and everything runs there. Every terminal command, every file, every model call. From the outside it looks like working locally.

For this to work from anywhere, not just at my desk, I needed the two machines to find each other reliably. I used Tailscale. Installing it on the DGX is a single command, then sudo tailscale up --hostname=nvidia-dgx-spark gives the machine its permanent name on my network. That is the entire setup.

After that ssh sahil@nvidia-dgx-spark works from the Mac, from any network, with nothing else to configure. The hostname is permanent, so every tool I set up later, VS Code, Copilot, the phone, points at the same address.

I also use NVIDIA Sync for file syncing between the two machines. It runs on top of Tailscale so there should be nothing extra to configure. Except enrollment kept failing with "Alias already exists." NVIDIA Sync reads your ~/.ssh/config and registers Host entries as its own aliases. I had dgx-spark.local in my SSH config and it clashed. Removing that entry fixed it.

Picking the right model

I downloaded the wrong version of Qwen3-Coder-Next first. There is a BF16 base model at 160GB and a pre-quantized FP8 version at 80GB. I grabbed the base model. It downloaded for hours and then refused to load because 160GB does not fit in 128GB of memory. If you want to understand why, this explanation of quantization formats is a good starting point. The short version: always download the FP8 checkpoint. Lesson learnt, downloaded Qwen/Qwen3-Coder-Next-FP8 this time and it loaded fine.

I also tried Nemotron, NVIDIA's own model. Seemed like the natural thing to test on NVIDIA hardware. It ran slow, noticeably slower than Qwen, and the output quality for coding tasks did not justify the wait. I went back to Qwen3-Coder-Next-FP8. Running it through vLLM on the Spark, it sits at around 43 tokens per second. Fast enough that it feels like a real coding assistant, not a batch job.

How hard can I push it

The 43 tok/s number kept bugging me. Was that good? What was the actual limit? So I spent an evening benchmarking properly.

First test: one hard task. I asked it to build a production Red-Black Tree in Rust, rotations, deletion rebalancing, full test module. Streamed the response and measured. First token came back in 206 ms. Decode ran at just under 50 tokens per second, a bit faster than my first casual measurement, until it hit the output cap mid test-suite.

The interesting part is why 50 and not more. Total parameter count does not affect speed. Active parameters do. Qwen3-Coder-Next is a mixture-of-experts model, 80B total parameters but only about 3B active per token. Every token pulls those active weights, about 3GB at FP8, through the memory bus. The theoretical ceiling works out to around 90 tok/s, and after MoE routing overhead and KV cache reads you land at roughly half of that. That one took me a while to internalise.

Second test: concurrency. Same prompt, fired in parallel, 1 to 256 simultaneous requests.

Concurrency	Per-request	Aggregate	TTFT
1	50.1 tok/s	49.7 tok/s	158 ms
16	15.9 tok/s	252.4 tok/s	445 ms
64	8.2 tok/s	521.9 tok/s	1.2 s
128	5.4 tok/s	673.8 tok/s	5.2 s
256	3.6 tok/s	897.4 tok/s	4.0 s

Aggregate throughput never plateaued. At 256 concurrent requests the Spark was pushing nearly 900 tokens per second total, and doubling concurrency kept buying 65 to 84 percent more aggregate every step. vLLM's continuous batching genuinely works. Per-request speed is the trade: at 256 concurrent, each request crawls at 3.6 tok/s, slower than human typing.

The real ceiling is not throughput, it is KV cache. 128GB minus an 80GB model leaves about 40GB for context. Past 256 concurrent requests the engine starts preempting.

What this means in practice: for me alone, 50 tok/s and snappy. For a team of 10 to 20 hitting it in parallel, 15 to 20 tok/s each, perfectly usable. That is enough to support a small engineering team from a single desktop machine. For batch jobs, roughly 23 million tokens a day from a golden box on my desk.

vLLM over Ollama

I looked at Ollama first. It is easier to set up and I use it for casual model exploration. But for a coding workflow with Copilot Agent Mode, it was not the right tool here.

Ollama is a wrapper around llama.cpp, which was built CPU-first with CUDA added later. On a machine built around NVIDIA's Blackwell architecture with 1TB/s memory bandwidth, that matters. vLLM was built from scratch for NVIDIA hardware. It uses PagedAttention for dynamic KV cache management, supports the tensor parallelism that Blackwell is designed around, and NVIDIA themselves recommend it as the inference stack for the DGX Spark. In practice, Ollama runs the same model at around 9 tok/s on the Spark. vLLM gets 43 tok/s. That gap is the architecture difference showing up in real numbers.

Model	Ollama	vLLM
Qwen3-Coder-Next-FP8	~9 tok/s	~43 tok/s

Installing vLLM itself is just a pip install. The part worth showing is the start script, because every flag in it is a decision. This lives on the DGX and is what actually serves the model:

#!/bin/bash
source ~/Developer/DGX/services/vllm/venv/bin/activate
exec vllm serve Qwen/Qwen3-Coder-Next-FP8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --host 0.0.0.0 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  2>&1 | tee -a ~/Developer/DGX/services/vllm/logs/vllm.log

--max-model-len 262144 is the 256K context window. --gpu-memory-utilization 0.85 leaves headroom for the OS, because on unified memory the GPU and the system share the same pool. --host 0.0.0.0 is what lets the Mac reach it over Tailscale instead of just localhost. And the last two flags are for Copilot.

--tool-call-parser qwen3_coder is the flag that took the longest to find. Copilot sends tool calls and vLLM needs to know how to parse them for this specific model. I tried hermes first. Crashed. Tried qwen3. Also crashed. The correct value is qwen3_coder. It is in vLLM's error output when you use the wrong one, but you would not know to look there unless something was already wrong.

I wrapped this in a systemd service so the model comes back on its own after a reboot or a crash. The DGX is meant to run unattended, and a server you have to manually restart is not a server. When I want to see what the model is doing, journalctl -u vllm -f follows the logs live.

VS Code and Copilot

I use VS Code Insiders over Cursor or Cline because Agent Mode is native and the custom endpoint is set up directly in Copilot settings. One fewer thing managing the connection.

🔄

Update: VS Code now supports custom model endpoints natively. The May 20 release (1.121) added the Custom Endpoint provider, and 1.122 brought it to Stable. You no longer need Insiders for this. Custom models work in Agent Mode too, as long as the model supports tool calling.

In VS Code you add the model through the Copilot model picker, Manage Models, Custom Endpoint. It writes this into your settings. The two fields that matter are the url, which points at vLLM on the DGX, and toolCalling, without which the model never shows up in Agent Mode:

[{
  "name": "Qwen3-Coder-Next",
  "vendor": "customendpoint",
  "apiKey": "none",
  "apiType": "chat-completions",
  "models": [{
    "id": "Qwen/Qwen3-Coder-Next-FP8",
    "name": "Qwen3-Coder-Next",
    "url": "http://nvidia-dgx-spark:8000/v1/chat/completions",
    "toolCalling": true,
    "vision": false,
    "maxInputTokens": 262144,
    "maxOutputTokens": 16000
  }]
}]

The url field needs the full path including /v1/chat/completions. I had it pointing to just the base URL once and got a 404 that took longer than it should have to debug.

Open Copilot Chat, pick the model from the picker, switch to Agent Mode.

What the workflow looks like now

The workflow itself is surprisingly boring, which is exactly what I wanted. I open VS Code on the Mac, connect to the DGX over SSH, and work as normal. Copilot Agent Mode sends requests to vLLM through Tailscale, Qwen processes them on the Spark, and responses come back fast enough that I stop thinking about where they are coming from.

The first real task I threw at it was not a benchmark or a demo. It was actual work: reading an existing codebase, making changes across multiple files, checking outputs, fixing mistakes, and iterating. That was the point where it stopped feeling like a local AI experiment and started feeling like infrastructure.

Whether it can even replace Claude Sonnet is still an open question, and that is the comparison that matters. Over the next few weeks I will be using it across real projects and comparing the two directly. But for the first time, a local coding assistant feels practical rather than aspirational.

Next: what OpenClaw actually looks like running on the Spark, and the first experiments with running a multi-agent platform.