

Published at 12:32 PM

Building a Local AI Code Completion Extension from Scratch (And Fighting Every Devil in the Details)

Today was one of those days where you set out to do one thing and end up solving twelve completely different problems just to get there. The goal was simple enough on paper: build a VS Code extension that does AI-powered code completion using a local LLM — no cloud calls, no subscriptions, no latency depending on some remote API’s mood.

Here’s how it actually went.

The Architecture Brainstorm

We actually thought before coding, which paid off enormously later. A few core decisions up front:

One Rust sidecar binary. Rather than fighting VS Code’s TypeScript sandbox or Zed’s WASM restrictions for GPU access, we went with a monolithic Rust binary that both IDE extensions talk to over a Unix socket via JSON-RPC. The IDE adapters are thin clients. All the intelligence lives in the sidecar.

Two models, two jobs: Qwen3-0.6B generates completions; all-MiniLM-L6-v2 embeds code chunks for retrieval.

This wasn’t obvious at first. There was a real temptation to use Qwen for embeddings too (“one model, simpler!”), but MiniLM embeds a chunk in ~1ms vs ~40ms for Qwen. When you’re re-indexing on every file save, that 40x difference adds up fast.

Three-layer context building:

Tree-sitter (AST parse) → MiniLM (retrieve relevant chunks) → Qwen (generate completion)

AST parsing figures out what’s in scope. MiniLM finds the most semantically relevant code from the project. Qwen gets rich, targeted context and generates the completion.

LoRA per-project learning — accept/reject signals accumulate in a buffer, and when the editor’s been idle for 5+ minutes, we fine-tune a rank-8 LoRA adapter on the project’s code style. Each project gets its own adapter.safetensors. The training loop is stubbed and ready for v2.

Building in Parallel

We split the work into 10 tasks and ran the core crates as parallel subagents — crates/retrieval/, crates/inference/, crates/trainer/, crates/sidecar/, extensions/vscode/, and extensions/zed/ all building simultaneously. This worked well. 71 tests passing across the workspace when everything came together.

The project structure ended up clean:

ast-complete/
├── crates/
│   ├── core/       # types, protocol, config
│   ├── ast/        # tree-sitter, 9 languages
│   ├── retrieval/  # MiniLM + vector index
│   ├── inference/  # Qwen3, generation, LoRA
│   ├── trainer/    # buffer, scheduler
│   └── sidecar/    # main binary, server, indexer
├── extensions/
│   ├── vscode/     # TypeScript
│   └── zed/        # Rust/WASM

The Gauntlet of Runtime Errors

This is where things got interesting. Everything compiled. Nothing worked. Here’s the full list of fires.

1. Logs on stdout killed the socket handshake

The VS Code extension spawns the sidecar and reads the socket path from the first line of stdout. But tracing_subscriber was writing to stdout by default — so the extension parsed an ANSI-colored log line as the socket path, failed to connect, and gave up immediately.

Fix: route tracing to stderr, print the socket path to stdout before anything else.

// Print socket path FIRST, before model loading
println!("{}", socket_path);
// Route logs to stderr so stdout stays clean for the handshake
let subscriber = tracing_subscriber::fmt()
    .with_writer(std::io::stderr)
    .finish();
tracing::subscriber::set_global_default(subscriber)?;

2. Extension connected before the socket existed

The socket path is determined before model loading begins — good for the handshake — but the server doesn’t start listening until after models load (~12 seconds). The extension tried once, got ENOENT, and gave up.

Fix: retry loop, polling every 2 seconds for up to 60 seconds.

for (let attempt = 0; attempt < 30; attempt++) {
    try {
        await this.connectToSocket(socketPath);
        return;
    } catch {
        this.log(`Socket not ready, retrying in 2s...`);
        await new Promise(r => setTimeout(r, 2000));
    }
}
throw new Error("Sidecar socket never became ready after 60s");

3. VS Code didn’t reinstall because the version was the same

code --install-extension silently skips reinstall if the version matches. We burned through 0.1.0 → 0.1.1 → 0.1.2 during debugging before figuring this out. Now build.sh patches the version automatically. Lesson: always bump the version when iterating on a local extension.

4. MiniLM BERT was implemented wrong

The parallel subagent that built crates/retrieval/ wrote a BERT model from scratch in Candle. Structurally plausible, but the weight tensor shapes didn’t match the actual all-MiniLM-L6-v2 checkpoint. The embedder crashed with a shape mismatch on load.

Fix: replace the custom BERT with candle_transformers::models::bert::BertModel. That’s what the library is for.

use candle_transformers::models::bert::{BertModel, Config as BertConfig};
// guaranteed to match the HuggingFace weights
let model = BertModel::load(vb, &config)?;

5. Qwen3 config had unknown fields

Candle’s Qwen3Config struct doesn’t handle unknown fields — and Qwen3-0.6B’s config.json has fields like sliding_window, head_dim, and rope_scaling that aren’t in the struct. Deserialization failed hard.

Fix: load the config as serde_json::Value, strip the unknown fields, then deserialize:

let mut raw: serde_json::Value = serde_json::from_str(&config_str)?;
if let Some(obj) = raw.as_object_mut() {
    for field in &["sliding_window", "head_dim", "kv_lora_rank", 
                   "rope_scaling", "rope_local_base_freq"] {
        obj.remove(*field);
    }
}
let config: Qwen3Config = serde_json::from_value(raw)?;

6. BF16 weights don’t mmap on Metal

The Qwen3 checkpoint is stored in BF16. VarBuilder::from_mmaped_safetensors with a Metal device fails on BF16 — Metal doesn’t support it as a native dtype through Candle’s memory mapping path.

Fix: load with explicit F32 dtype.

let vb = unsafe {
    VarBuilder::from_mmaped_safetensors(
        &[weights_path],
        DType::F32,  // not DType::BF16
        &device,
    )?
};

7. Metal deadlock between Qwen and MiniLM

This one was subtle. Qwen loads onto Metal. The background indexer starts embedding files using MiniLM — also on Metal. Metal doesn’t handle concurrent command queue dispatch from two separate model contexts well. The indexer would embed one file, then hang forever waiting for the device.

The fix was embarrassingly simple: run MiniLM on CPU. It’s 23M parameters. CPU handles it fine at ~5ms per chunk. Qwen stays on Metal for the latency-sensitive generation path.

// MiniLM always runs on CPU to avoid Metal contention with Qwen
let device = Device::Cpu;

8. The model was generating EOS immediately

This was the final boss. Everything connected and indexed. The provider fired. Every completion came back as an empty string.

Qwen3-0.6B is a chat-tuned model. Feed it raw code and it generates EOS on the first token because it doesn’t recognize the format. It expects the <|im_start|>system / <|im_end|> chat template.

Fix: wrap the code completion prompt in the chat template.

fn build_prompt(cursor_context: &str, retrieved: &str) -> String {
    format!(
        "<|im_start|>system\nYou are a code completion engine. \
         Complete the code at the cursor. Output ONLY the completion, no explanation.\
         <|im_end|>\n\
         <|im_start|>user\n\
         {context}\n\
         Code to complete:\n{cursor}\
         <|im_end|>\n\
         <|im_start|>assistant\n",
        context = retrieved,
        cursor = cursor_context,
    )
}

After that: Completion text: "return f'Hello, {name}!'" — correct, contextual, local, 513ms.

What Works

What I’d Do Differently

Ship the chat template on day one. The entire “empty completions” debugging adventure happened because nobody checked whether the base model vs. instruct model distinction mattered for code completion. It matters enormously. Always check.

Don’t implement model architectures from scratch. The custom BERT in embedder.rs cost an hour of debugging to replace with three lines using the existing candle_transformers BERT. Use the libraries.

Test GPU device sharing early. The Metal deadlock between Qwen and MiniLM would have been caught in integration testing. Running two models on the same Metal device needs explicit device management from the start — it’s not something you want to discover while staring at a stuck progress counter.

The Stack

71 passing tests, compiling clean, generating real completions locally. Not bad for a single session.