
I Ran 100% Local AI for 6 Months.
Here's What I Learned.

What building Sidekick — a fully local AI assistant using Ollama, Graph RAG, and Flutter — taught me about privacy, performance, and the future of AI deployment.

Six months ago, I stopped sending data to OpenAI. Not because I couldn't afford it, not because the models weren't good enough — but because I started asking a question nobody around me seemed to be asking: where is all of this going, data-wise?

I'm a technical leader who builds systems at scale. I think about data governance the way a structural engineer thinks about load-bearing walls. And when I looked honestly at our team's AI usage patterns — developers pasting proprietary code snippets into ChatGPT, customer queries being processed through cloud APIs, internal documents being embedded in third-party vector stores — I saw a problem that wasn't going to stay invisible forever. So I decided to do something about it before someone else made the decision for me.

I built Sidekick: a 100% local AI assistant. Python backend, Flutter frontend, Ollama for model serving, Graph RAG for multi-document reasoning. Everything runs on-device. No data leaves your machine. After six months of running it as my primary AI tool, I have opinions. Strong ones.

Here's what I actually learned — the good, the complicated, and the parts that genuinely surprised me.

01 — The Privacy Problem

The Privacy Time Bomb

Start with a number that should make every enterprise CTO uncomfortable: €1.2 billion. That's the fine Ireland's Data Protection Commission levied against Meta in May 2023 for transferring EU user data to US servers under the old Standard Contractual Clauses framework. Largest GDPR fine in history at the time. And it was for a data transfer practice that thousands of companies are still doing today — including, almost certainly, companies that are piping sensitive data through AI APIs without much thought about it.

The Meta case wasn't about malice. It was about a legal framework that hadn't caught up with how data was actually moving. That's precisely where we are with enterprise AI right now. The data is moving. The legal frameworks haven't caught up. The fines, when they come, will be spectacular.

The EU AI Act (2024) added another layer. Certain AI applications are now classified as "high-risk" — AI used in employment decisions, credit scoring, law enforcement, critical infrastructure. High-risk classification means mandatory conformity assessments, human oversight requirements, and extensive documentation. If you're using cloud AI APIs for any of these use cases, you're not just accepting commercial risk. You're accepting regulatory risk that your legal team probably hasn't fully mapped yet. Most haven't, honestly.

And then there's HIPAA in healthcare. Using a third-party AI API to process patient records? That's potentially a covered entity using a business associate without an appropriate Business Associate Agreement. The Office for Civil Rights has been ramping up enforcement. Healthcare organizations that thought they were being innovative by rushing AI integration are learning, usually the hard way, that innovation and compliance aren't automatically the same thing.

"Enterprises are one security audit away from discovering that their AI integration strategy is also their biggest regulatory liability."

— Personal observation after reviewing AI integration patterns across multiple enterprise deployments

The uncomfortable truth is that most enterprises have no idea what data is flowing through their AI APIs. There's no log that says "developer X pasted a client contract into Claude at 3:47 PM on March 2nd." That data is gone. Or more precisely, it's somewhere you don't control, governed by a Terms of Service that can change, belonging to a company that can get acquired, hacked, or subpoenaed.

Local AI isn't just a privacy preference. In many regulated industries and jurisdictions, it's becoming a compliance necessity — one that's moving faster than most organizations are planning for.

Practical Action

Before your next AI integration, do a data classification audit. Map exactly what categories of data will flow through your AI API calls. Classify each by sensitivity: public, internal, confidential, regulated. If any regulated data (HIPAA, GDPR special categories, PCI) is in scope — stop and get legal involved before building.

If you can't answer "where does this data go after the API call?" for your AI provider — that's your answer about the risk level.
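The audit itself can start as something embarrassingly simple: an inventory of planned data flows, tagged by sensitivity, with regulated categories hard-blocked. A minimal sketch — the category names and example flows below are illustrative, not a compliance tool:

```python
# Illustrative data classification pass over planned AI data flows.
# The tags and example flows are made up for demonstration purposes.

REGULATED = {"hipaa", "gdpr_special", "pci"}

def classify(flows):
    """Partition planned API data flows; anything touching a regulated
    category gets blocked pending legal review."""
    blocked = [f for f in flows if f["tags"] & REGULATED]
    allowed = [f for f in flows if not (f["tags"] & REGULATED)]
    return blocked, allowed

flows = [
    {"name": "support-ticket summarization", "tags": {"internal"}},
    {"name": "patient-note drafting",        "tags": {"hipaa"}},
    {"name": "changelog generation",         "tags": {"public"}},
]

blocked, allowed = classify(flows)
for f in blocked:
    print(f"STOP - get legal involved before building: {f['name']}")
```

A spreadsheet does the same job; the point is that the classification exists before the integration does.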

02 — The Performance Question

The Performance Reality Check

When I started building Sidekick, the most common objection I heard from other engineers was: "local models are too slow." That was true eighteen months ago. It's categorically not true today, and the gap is closing faster than most people track.

Specific numbers, because vague comparisons don't help anyone make real decisions.

On my Apple M2 MacBook Pro, Llama 3.1 8B with Q4_K_M quantization via llama.cpp generates approximately 60 tokens per second. Average human reading speed is around 200-250 words per minute, which works out to roughly 3-4 words per second — call it 5 tokens per second, since a word averages a bit more than one token. The model is generating text more than 10x faster than you can read it. For interactive use cases — code completion, question answering, summarization — local inference on modern Apple Silicon isn't slow. The bottleneck shifts from generation speed to prompt processing time.

For larger models, the Apple M2 Ultra can run 70B parameter models at 20+ tokens per second. That's not toy performance. GPT-3-class capability, running entirely on hardware you own, with no API calls, no latency jitter from network round trips, and no per-token cost. The Qualcomm Snapdragon X Elite chips deliver 45 TOPS of on-device AI compute, which means the performance curve is extending to laptops and edge devices well beyond Apple Silicon.

The key technical enablers are llama.cpp and the GGUF quantization format. Quantization reduces model precision from 16-bit or 32-bit floating point to 4-bit or 8-bit integers, shrinking model size by 75% with acceptable quality degradation. A 70B model that would normally require 140GB of VRAM can run in roughly 40GB at Q4 quantization — well within the unified memory of an M2 Ultra Mac Studio.

60 tokens/sec — Llama 3.1 8B on M2
20+ tokens/sec — 70B models on M2 Ultra
75% model size reduction via Q4 quantization
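The memory arithmetic behind those numbers is worth making explicit. A back-of-envelope sketch (the effective bits-per-weight values for the GGUF quant levels are approximate, and the estimate ignores KV cache and runtime overhead, so real usage is higher):

```python
# Back-of-envelope model size estimate at different quantization levels.
# Effective bits-per-weight figures are approximate; KV cache, activations,
# and runtime overhead are deliberately ignored.

def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for name, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    print(f"70B @ {name}: ~{model_size_gb(70, bits):.0f} GB")
```

Run it and the 140GB-to-40GB drop from the paragraph above falls straight out of the arithmetic.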

One counterintuitive finding from running Sidekick: for interactive use cases, local actually has lower perceived latency than cloud APIs. With cloud APIs, there's a network round trip before the first token arrives — typically 500ms to 2 seconds depending on load. With local inference, the first token can appear in under 100ms. Even if the overall generation is slightly slower end-to-end, the experience feels faster because the response starts immediately. That matters more than people expect.
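You can measure this yourself against a local Ollama instance. A sketch that times the first streamed chunk from the /api/generate endpoint (it assumes the Ollama daemon is running on the default port with llama3.1 pulled; the endpoint streams newline-delimited JSON objects):

```python
# Measure time-to-first-token from Ollama's streaming /api/generate endpoint.
# Assumes the Ollama daemon is running locally with the llama3.1 model pulled.
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def parse_chunk(line: bytes) -> str:
    """Extract the text from one newline-delimited JSON streaming chunk."""
    return json.loads(line).get("response", "")

def time_to_first_token(prompt: str, model: str = "llama3.1") -> float:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": True})
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # one JSON object per line while streaming
            if parse_chunk(line):
                return time.perf_counter() - start
    return float("nan")

# Usage, with the daemon running:
#   print(f"{time_to_first_token('Say hi.') * 1000:.0f} ms to first token")
```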

That said, I want to be honest about where local still falls short. For workloads requiring frontier-class reasoning — complex multi-step analysis, nuanced language tasks where a single word choice matters — the 70B local models are meaningfully behind GPT-4-class. The gap is narrowing with each model generation, but it exists today. The right framing isn't "local vs. cloud." It's "local-first, cloud-optional for specific workloads where you actually need the extra capability."

03 — What Building Sidekick Taught Me

Six Months of Running It Myself

Building a system and actually using it as your primary tool are very different experiences. Here's what I learned from the second part.

Users Trust Local AI More — and That Changes Behavior

This was the finding I didn't anticipate. When I demoed Sidekick to colleagues, the most common reaction was relief. Not excitement about the tech — relief. Relief that they could paste in a confidential client document without worrying. Relief that the conversation history wasn't being stored somewhere outside their control. The psychological overhead of using cloud AI — the nagging question of "should I really be putting this in here?" — just disappeared when everything was local.

That trust change had a real impact on how people used the tool. They shared more complete context. They asked more detailed questions. They brought it into sensitive workflows they'd previously kept manual. The quality of interactions went up because users weren't self-censoring their inputs. There's a lesson there that gets missed in most AI adoption discussions: trust is a feature. Not a soft, nice-to-have feature — a fundamental one that determines whether people actually use the thing.

Graph RAG Was the Architecture Breakthrough

Standard RAG works well when your documents are independent chunks and your questions are straightforward. It fails badly when you need to synthesize information across multiple documents or answer questions that require following a chain of relationships. I hit this wall early and it was frustrating.

The Microsoft Research paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" (Edge et al., arXiv 2404.16130, April 2024) formalized something I'd been discovering empirically: building a knowledge graph on top of your documents enables a fundamentally different class of reasoning. Instead of "find the three chunks most similar to this query," you can traverse a graph of entities, relationships, and concepts to answer questions that require multi-hop reasoning. The paper arrived at almost exactly the right time for what I was trying to build.

For Sidekick, this meant I could ask things like "how does the architecture decision in document A affect the requirements described in document B?" and get a coherent answer that demonstrated actual understanding of the connection, not just semantic similarity matching. That's the difference between a document search tool and something that reasons.

"Standard RAG is a document search system. Graph RAG is a reasoning system. The difference matters enormously in production."

Ollama Made This Viable for the First Time

Credit where it's due: the Ollama project fundamentally changed the developer experience for local AI. Before Ollama, running a local model meant compiling llama.cpp, managing model weights manually, writing your own inference API, and debugging CUDA configurations. It was a systems engineering project before you even got to the interesting part. Most people gave up there.

Ollama wraps all of that into a clean API that mirrors OpenAI's interface, handles model downloading and management, and abstracts the hardware acceleration layer. For Sidekick, this meant I could focus on the application architecture — the RAG pipeline, the graph construction, the Flutter UI — rather than model serving infrastructure. That's a real unlock. I'd still be fighting CUDA configs if Ollama hadn't existed.

04 — The Stack

The Architecture That Actually Works

Here's the actual technical stack I landed on after six months of iteration. I'm sharing this not as a prescription but as a reference point. The specific tools matter less than the patterns underneath them — the tools will shift, but the patterns won't.

Model Serving: Ollama

Ollama handles model serving via a REST API on localhost:11434. It also exposes an OpenAI-compatible endpoint, which means existing LLM libraries (LangChain, LlamaIndex) work with minimal modification. I settled on Llama 3.1 8B for general use — fast, capable enough for most things — with Llama 3.1 70B for complex reasoning tasks on higher-end hardware.
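Pointing an OpenAI-style chat call at the local endpoint is a one-line URL change. A minimal sketch using only the standard library (assumes the daemon is running with llama3.1 pulled; no data leaves the machine):

```python
# Send an OpenAI-style chat request to the local Ollama endpoint.
# Assumes the Ollama daemon is running locally with llama3.1 pulled.
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"  # OpenAI-compatible route

def chat_payload(prompt: str, model: str = "llama3.1") -> dict:
    """Build an OpenAI-style chat request body; Ollama accepts the same shape."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def local_chat(prompt: str, model: str = "llama3.1") -> str:
    req = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(chat_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Usage, with the daemon running:
#   print(local_chat("Summarize GGUF quantization in one sentence."))
```

The same shape works through the official OpenAI client by overriding its base URL, which is exactly why existing libraries need so little modification.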

Quantization: GGUF via llama.cpp

The GGUF format (which replaced GGML) is the standard for quantized model distribution. Q4_K_M hits the best quality/size/speed tradeoff for most use cases. Q8_0 gives near-full-precision quality at moderate size reduction. Q2_K sacrifices noticeable quality but enables very large models on limited hardware — useful in a pinch, not something I'd recommend as a default.

Vector Store: ChromaDB

ChromaDB runs embedded — no separate server process — and handles the standard RAG retrieval pipeline well. For larger document collections, FAISS or Qdrant are better choices. The key pattern here: your vector store is ephemeral and reconstructable from source documents. Don't treat it as your primary data store. I made this mistake early and paid for it.

Graph RAG Layer: Custom Python

I built the Graph RAG implementation on top of NetworkX with entity extraction via a local spaCy model. The pipeline: extract named entities and relationships from documents, build a knowledge graph, then at query time use both vector similarity and graph traversal to construct the retrieval context. Microsoft's GraphRAG library (open-sourced in mid-2024) is now a better starting point than building from scratch — if I were starting today, I'd use that.
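The traversal half of that pipeline is easy to sketch. The (entity, relation, entity) triples below are hard-coded stand-ins for what the spaCy extraction pass would produce; the multi-hop step is just a path query over the resulting NetworkX graph:

```python
# Sketch of the Graph RAG traversal step. The triples stand in for
# the output of the spaCy entity/relation extraction pass.
import networkx as nx

triples = [
    ("Document A",        "decides",    "Local REST bridge"),
    ("Local REST bridge", "constrains", "Latency budget"),
    ("Document B",        "specifies",  "Latency budget"),
]

G = nx.Graph()
for src, rel, dst in triples:
    G.add_edge(src, dst, relation=rel)

# Multi-hop question: how does Document A relate to Document B?
path = nx.shortest_path(G, "Document A", "Document B")
print(" -> ".join(path))
```

Vector similarity alone would never surface that chain, because Document A and Document B may share no overlapping text at all; the connection lives in the intermediate entities.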

Frontend: Flutter

Flutter was the right choice for cross-platform desktop UI. One codebase targeting macOS, Windows, and Linux. The conversation UI, document management interface, and settings panel are all Flutter. Communication with the Python backend happens over a local REST API. It works well, though I'll admit the Flutter-Python bridge had some rough edges early on.

Getting Started Quickly

If you want to experiment with local AI without committing to a full stack, start here: install Ollama (brew install ollama on macOS), pull a model (ollama pull llama3.1), and hit the API at localhost:11434. You're running local AI in under 10 minutes. Add ChromaDB for RAG, add the Microsoft GraphRAG library for complex document reasoning, and you have 80% of Sidekick's architecture.

05 — Being Honest

3 Cases Where Cloud AI Still Wins

I've been arguing for local AI, and I believe in it. But I'd lose credibility if I pretended it's universally superior. There are cases where cloud AI is clearly the right choice, and being honest about this is what separates a considered position from advocacy.

1. Massive Training and Fine-tuning Workloads

If you need to train or significantly fine-tune a model, you need GPU clusters. The compute economics here are brutal — training a 7B parameter model from scratch on a single A100 GPU would take years, not months. Cloud GPU providers give you burst access to hundreds of GPUs for hours. This is where the cloud's fundamental advantage, elastic compute, is genuinely irreplaceable. No amount of on-device hardware changes this math.

2. Frontier Reasoning Tasks

GPT-4, Claude 3.5 Sonnet, and Gemini Ultra are meaningfully ahead of the best available open-source models for complex reasoning tasks. If you're doing multi-step financial analysis, complex code architecture design, or nuanced legal interpretation, the quality gap is real. Narrowing with each model generation, but real today. Use the right tool. Don't use a hammer when you need a scalpel — and don't pretend local 70B is a scalpel when it isn't yet.

3. Teams Without the Infrastructure Appetite

Running local AI requires someone who can maintain it. Model updates, hardware management, debugging inference issues — this is a real operational cost that doesn't show up in any tutorial. For a small team shipping a product fast, the engineering opportunity cost of maintaining local AI infrastructure might outweigh the privacy benefits. Know your team's actual constraints, not the ones you wish you had.

Conclusion

The Future Is Local-First, Cloud-Optional

After six months, here's where I've landed: the dichotomy between "cloud AI" and "local AI" is going to dissolve, and it's going to dissolve faster than most organizations are planning for.

The model quality gap between open-source and frontier models is closing with every six-month model generation cycle. The hardware to run powerful models locally is now in the laptops that engineers already carry. The tooling — Ollama, llama.cpp, ChromaDB, GraphRAG — has crossed the threshold from experimental to production-viable. That shift happened quietly, and a lot of people missed it.

Meanwhile, the regulatory pressure is only building. The EU AI Act is being implemented. GDPR enforcement is getting more aggressive. Healthcare, finance, and legal sectors are watching HIPAA and data sovereignty rules evolve in real time. The question isn't whether data privacy concerns will force a reckoning with how enterprises use cloud AI. It's when, and whether your organization is ahead of it or behind it.

"The organizations that build local AI capability now will have an architectural advantage when regulation catches up. The ones that wait will be retrofitting under pressure."

The smart play right now is to start building hybrid capability: understand where your data is genuinely sensitive, route those workloads locally, and use cloud APIs for what they're actually best at — frontier reasoning and burst compute. Don't wait for regulation to force your hand. By then you're retrofitting under pressure, and that never goes well.

I built Sidekick to scratch my own itch. It became a genuinely useful tool, and building it taught me things about local AI I couldn't have learned any other way. If you're thinking about this space — start building. The hardware is ready. The tooling is ready.

RS Arun
Technical Leader · Builder of Sidekick, a 100% local AI assistant · 12+ years in engineering and system design