Vibe Coding

How to Build a Full-Local Voice Coding Stack on Windows: Dictaro + Ollama + Cursor

Build a full-local voice coding stack on Windows: Dictaro for dictation, Ollama for cleanup and code generation, Cursor as the IDE. Hardware requirements, model selection (Qwen 2.5 Coder, Mistral Small), step-by-step Ollama setup, and when the hybrid architecture is the right call.

Rosen Velikov

11 May 2026 — 10 min read

TLDR

A full-local voice coding stack on Windows means your voice, your code context, and your AI outputs all stay on your machine — no cloud API calls for any step of the dictation-to-generation pipeline. Dictaro handles the dictation; Ollama handles both the cleanup and the local model inference; Cursor (or any IDE) receives the prompt.
This is the right architecture for developers with proprietary codebases, client projects under NDA, or simply a preference for zero-cloud development workflows.
It is also, in 2026, a technically viable workflow on a mid-range Windows machine — the gap between local model quality and cloud API quality has narrowed to the point where local-first vibe coding is practical for most development tasks.
This guide covers hardware requirements, model selection, Dictaro + Ollama configuration, and the specific settings that make the full-local stack work for real development work rather than demo conditions.

Why Full-Local in 2026
What Full-Local Actually Means for Each Processing Step
Hardware Requirements for a Practical Local Stack
Model Selection: Dictation Cleanup and Code Generation
Setting Up Ollama on Windows
Connecting Dictaro to Ollama on Windows
The Full-Local Vibe Coding Workflow
What Full-Local Cannot Do Yet
Hybrid Architecture: Local Cleanup, Cloud Generation
Privacy and Use Cases That Require Full-Local

Why Full-Local in 2026

In 2025, "full-local AI" meant accepting meaningfully worse quality for the sake of privacy. Local models were substantially behind GPT-4 and Claude 3 Opus on code generation. Running Llama 3 for serious coding work required either a powerful GPU or patience with slow inference. The practical trade-off was real: privacy at the cost of output quality that affected productivity.

In 2026, that trade-off has narrowed. Llama 3.3 70B running on a mid-range RTX 4070 produces code quality that handles the majority of vibe coding tasks without the correction rounds that made earlier local models frustrating. Qwen 2.5 Coder 32B is, for code generation specifically, competitive with GPT-4o on a wide range of practical tasks. Mistral Small 3.1 runs on 12GB VRAM and produces cleanup and instruction-following quality sufficient for professional dictation workflows.

The privacy motivation is also better defined in 2026 than it was a year ago. Developers are clearer about which content they are comfortable routing through cloud APIs and which content they prefer to keep local. Proprietary algorithms, pre-launch product code, client codebase content under NDA, and security-sensitive implementations — these are the categories where local processing is a deliberate architecture choice rather than a blanket preference. The full-local stack provides this routing control for the entire voice-to-code pipeline.

What Full-Local Actually Means for Each Processing Step

A vibe coding session on Windows involves several processing steps, each of which can be either cloud-routed or local:

Step 1 — Voice transcription. Your speech is converted to raw text. In Dictaro, this step routes to Dictaro's own private servers — not Microsoft Azure, not Google Speech API. This step is not fully local in the Dictaro architecture. For fully local transcription, alternatives include faster-whisper running locally or Whisper.cpp on CPU. The trade-off: Dictaro's private server transcription is faster and more accurate than most local Whisper setups on CPU-only hardware; fully local transcription requires a capable GPU to match the latency.

Step 2 — Dictation cleanup. The raw transcript is sent through an AI model to remove filler words, format the output, and convert natural speech to prompt-register text. With Dictaro + Ollama BYOK, this step runs entirely on your Windows machine. The raw transcript goes to your local Ollama instance; the cleaned prompt comes back to your cursor. No network call. No cloud API. Complete local processing.

Step 3 — AI code generation. The cleaned prompt is submitted to an AI coding tool for code generation. In a full-local stack, this step uses a local model running in Ollama or LM Studio — accessed via a local Cursor configuration (using Ollama as the custom API endpoint), Open WebUI, or Continue (VS Code extension) with a local model backend. In a hybrid stack, this step uses a cloud API (Cursor's Claude or GPT-4o backend) while only the cleanup step is local.

The full-local stack covers Steps 2 and 3 locally. Step 1 uses Dictaro's private server transcription unless you also configure a local Whisper setup for the transcription step, which is covered briefly at the end of this guide.

Hardware Requirements for a Practical Local Stack

Local model performance depends primarily on VRAM for the model and RAM for the system. A practical full-local development stack requires:

Minimum (cleanup only, hybrid for code generation): 8GB VRAM (RTX 3060, RTX 4060, RX 7600). Sufficient for running Mistral Small 3.1 8B or Llama 3.2 8B for dictation cleanup. Code generation uses a cloud API. This is the hybrid architecture: local cleanup, cloud code generation.

Mid-range (cleanup + code generation with 32B model): 24GB VRAM (RTX 4090, RTX 3090, or two GPUs via Ollama's multi-GPU support). Sufficient for Qwen 2.5 Coder 32B for both cleanup and code generation. This is the practical sweet spot for the full-local stack on a high-end consumer GPU.

Budget approach on CPU: 32GB RAM minimum (64GB recommended). Llama 3.2 3B on CPU produces acceptable cleanup quality and runs fast enough for real-time dictation. Code generation on CPU is slow (30-120 seconds per response on Llama 3.3 70B) — feasible for tasks where generation time is not the bottleneck, impractical for rapid iteration workflows.

Quantization: All models in the Ollama library are available in quantised versions (Q4_K_M, Q5_K_M, Q8_0) that reduce VRAM requirements significantly. A Q4_K_M quantised Llama 3.3 70B model runs on 40GB VRAM (two 24GB GPUs) rather than the 140GB required for the full-precision version. Quality loss from Q4_K_M quantization is minimal for most dictation cleanup and code generation tasks.

Recommended GPU for a balanced full-local stack in 2026: RTX 4090 (24GB VRAM). Sufficient for Qwen 2.5 Coder 32B at Q5_K_M quantization for both cleanup and code generation, with acceptable inference latency (3-8 seconds per response for 500-token outputs).

Model Selection: Dictation Cleanup and Code Generation

Two different models serve two different tasks in this stack. The optimal model for each task has different requirements.

For dictation cleanup

The cleanup model converts raw dictated speech to clean, formatted prompt text. The requirements are: fast inference (you want the cleaned text to appear quickly after releasing the hotkey), strong instruction-following (the model must apply your cleanup prompt accurately), and good grammatical output. Code generation quality is irrelevant for this step.

Mistral Small 3.1 8B: Excellent cleanup quality for 8GB VRAM. Fast inference. Strong instruction following. Good choice for the hybrid architecture where code generation uses a cloud API.
Llama 3.2 3B: Sufficient cleanup quality for minimal VRAM (4GB). Slower on GPU than expected due to architecture; fast on CPU for short texts. Best for budget setups.
Qwen 2.5 7B: Strong multilingual cleanup — relevant if you dictate in languages other than English. Good instruction following at 7B scale.
Phi-4 14B: Microsoft's instruction-tuned model. Excellent cleanup quality at 14B scale. Requires 10GB VRAM at Q4_K_M. Better cleanup quality than 7-8B models for complex custom prompt instructions.

For code generation

The code generation model receives the cleaned prompt from Dictaro (via Cursor or Open WebUI) and generates code. Requirements: strong code generation quality, ability to follow multi-step technical specifications, context window length sufficient for complex codebases.

Qwen 2.5 Coder 32B: The current best local model for code generation as of mid-2026 on most benchmarks. Requires 20GB+ VRAM at Q4_K_M. 128K context window. Competitive with GPT-4o on HumanEval and practical code generation benchmarks.
Llama 3.3 70B: Strong general capability including code. Better reasoning than Qwen 2.5 Coder for architectural and design questions. Requires 40GB+ VRAM at Q4_K_M (two GPUs or large single GPU).
DeepSeek-Coder-V2 16B: Good code generation quality at 16B scale for 10-12GB VRAM. A practical choice for single RTX 4080 setups.
Codestral 22B: Mistral's code-specific model. Strong performance on code completion and generation. 16GB VRAM at Q4_K_M.

Setting Up Ollama on Windows

Ollama is a local model runtime that runs models as a local API server. It is the standard backend for local AI on Windows in 2026.

Step 1: Install Ollama. Download the Windows installer from ollama.com. Run it. Ollama installs as a Windows service that starts automatically. The API runs at localhost:11434 by default.

Step 2: Pull your models. Open Windows Terminal and run:

For cleanup (8B cleanup model): ollama pull mistral-small3.1
For code generation (32B code model): ollama pull qwen2.5-coder:32b-instruct-q4_K_M

The pull downloads the model files from Ollama's model library. The Q4_K_M suffix specifies the quantisation level — the practical balance of quality and size for most setups.

Step 3: Verify the API. Open a browser and navigate to localhost:11434. You should see the Ollama API running. This is the endpoint that Dictaro and Cursor will call for local inference.

Step 4: Optional — Open WebUI. Open WebUI provides a web interface for interacting with Ollama models, similar to ChatGPT's interface but running entirely locally. Install via Docker Desktop on Windows or with the standalone installer. Access it at localhost:8080. This is useful as a fallback chat interface if Cursor's local model integration is not configured.

Connecting Dictaro to Ollama on Windows

Step 1: Open Dictaro settings. Right-click the Dictaro system tray icon and select Settings.

Step 2: Navigate to AI Cleanup. Find the AI Cleanup section and select Ollama as the provider.

Step 3: Enter the Ollama endpoint. Enter http://localhost:11434 as the base URL. This is Ollama's default API endpoint.

Step 4: Select the model. Enter the model name — for example, mistral-small3.1 — as the model identifier. This must match the model name from your ollama list output.

Step 5: Configure the cleanup prompt. For vibe coding on Windows, a recommended local cleanup prompt:

"Convert this dictated speech to a clean AI coding instruction. Preserve all file paths, function names, class names, variable names, technical terms, and exact numeric values exactly as spoken. Format numbered requirements as numbered lists. Remove filler words and hesitations. Output as a direct, precise coding instruction ready to submit to an AI coding assistant."

Step 6: Test the pipeline. Hold the transcription hotkey, speak a test sentence, release, then press the cleanup hotkey. The cleanup call routes to your local Ollama instance. Response time depends on your GPU — 1-3 seconds on a mid-range GPU for a 100-200 token cleanup call.

The Full-Local Vibe Coding Workflow

With Dictaro + Ollama configured, the complete workflow for a local vibe coding session on Windows:

Cursor with a local model backend

Cursor allows custom API endpoints. In Cursor's settings, under Models, add a custom model pointing to http://localhost:11434/v1 (Ollama's OpenAI-compatible API endpoint). Set the model identifier to your code generation model (e.g., qwen2.5-coder:32b-instruct-q4_K_M). Cursor routes AI requests to your local Ollama instance rather than Anthropic or OpenAI.

The complete session flow: hold Dictaro hotkey, speak the specification, release, press cleanup hotkey (Ollama processes cleanup locally), paste or the text inserts at cursor in Cursor's chat, press Enter (Cursor routes to Ollama for code generation), accept the result. Every AI processing step — cleanup and generation — runs on your machine.

Continue (VS Code extension)

Continue is an open-source VS Code extension that integrates local models directly into the editor. Configure it with your Ollama endpoint and code generation model. Dictate into Continue's chat panel via Dictaro's system-wide hotkey. Continue routes to Ollama for generation.

Open WebUI as the interface

For workflows that do not require IDE integration — architectural design sessions, documentation drafting, code review commentary — Open WebUI provides a clean chat interface for your local models. Dictaro's hotkey works in Open WebUI's text input field. Dictate, cleanup runs via Ollama, paste cleaned prompt into Open WebUI, generation runs via Ollama. Entirely local from speech to response.

What Full-Local Cannot Do Yet

Honesty about the current limits matters for evaluating whether the full-local stack is appropriate for a specific workflow:

Context window length: Cloud models like Claude 3.7 Sonnet support 200K token context windows. Local models at Q4_K_M quantization on a single 24GB GPU support practical context lengths of 8K-32K tokens for real-time inference. For very large codebase context submissions, cloud models still have a structural advantage in context length that affects output quality for tasks dependent on full-codebase understanding.

Cursor Agent mode: Cursor's agentic coding mode — which autonomously runs commands, reads files, and iterates on code — currently works with cloud APIs and not with local model backends via the custom endpoint configuration. Agentic tasks that require Cursor's built-in orchestration remain cloud-dependent.

Inference latency on mid-range hardware: A 32B model at Q4_K_M generates at 15-30 tokens per second on an RTX 4090. A 200-token response takes 7-13 seconds. For rapid iteration tasks where response latency matters, the hybrid architecture (local cleanup, cloud generation) provides faster development velocity with local cleanup for the privacy-sensitive step.

Hybrid Architecture: Local Cleanup, Cloud Generation

The practical middle ground for most developers: route the dictation cleanup step through a local Ollama model (sensitive prompt content stays local) while routing code generation through a cloud API (Cursor's Claude or GPT-4o backend for maximum generation quality and speed).

This architecture means:

Your spoken specifications — which contain the most sensitive content (architecture details, business logic, naming conventions) — are cleaned locally. The raw speech and the proprietary context it contains do not leave your machine during the cleanup step.
The cleaned prompt — which has already been stripped of spoken-language noise and formatted as a precise coding instruction — is submitted to the cloud generation API. The cloud model receives a clean specification prompt rather than the raw dictation content.
Code generation quality remains at cloud API level.

For developers who are comfortable with cloud code generation (their code goes to Anthropic or OpenAI anyway via Cursor) but want to remove the dictation vendor infrastructure from the content processing chain: this hybrid is the right architecture. Only Dictaro's private transcription servers and your chosen local Ollama model process the dictation content itself. The cloud API receives the cleaned output.

Privacy and Use Cases That Require Full-Local

The full-local stack — cleanup and generation both via Ollama — is appropriate when:

The codebase is subject to NDA or trade secret protection and routing it through cloud APIs creates a contractual or legal risk for the developer or their firm.
The content includes security-critical implementations — cryptographic routines, authentication logic, vulnerability patches — where exposing the code to cloud infrastructure before deployment is a security risk.
The developer operates in a regulated environment where all AI tool usage must be through approved, air-gapped or on-premises infrastructure. In financial services, defence contracting, healthcare technology, or government-adjacent development, cloud AI tools may be prohibited for certain workstreams regardless of provider.
The project involves pre-launch intellectual property where the competitive value lies in the code itself and cloud API exposure creates a disclosure risk.

For most other development contexts, the hybrid architecture (local cleanup, cloud generation) provides sufficient privacy for the dictation step while maintaining cloud model quality for code generation.

For the full BYOK and Ollama architecture in Dictaro: What Is BYOK in Dictation Apps?

For the vibe coding overview — system-wide Windows coverage versus IDE-chat-only voice: Voice Dictation for Vibe Coding on Windows

For the developer-specific setup guide: How to Set Up Voice Dictation on Windows

Dictaro is a Windows-only AI dictation app. System-wide operation on Windows 10 and 11. AI text cleanup with BYOK for OpenAI, Anthropic, Groq, Ollama, LM Studio, Gemini, OpenRouter, and more. No account required. Download and start dictating in under two minutes.