Running AI Yourself
The most complete form of tool sovereignty is running AI on hardware you control. When the model runs on your machine — or a server you own or lease — no external provider can change its behavior, restrict your access, raise your prices, or observe your inputs. This is self-hosting, and it is increasingly practical for a wide range of use cases. It is not free, and it is not always the right choice. But understanding what it takes, what it costs, and what it gives you is essential knowledge for any sovereign AI practitioner.
What Self-Hosting Means
Self-hosting means you download an AI model's weights to storage you control, run an inference engine on your own hardware, and query the model through a local interface rather than an external API. Your inputs and outputs never leave your infrastructure. No one else's servers are involved.
The tools for doing this have become remarkably accessible. Ollama is a widely used application that allows you to run dozens of open models (including Llama, Mistral, Gemma, Phi, and many others) on a personal laptop or desktop with a single command. LM Studio provides a graphical interface for the same purpose. llama.cpp is a high-performance inference engine written in C++ that runs efficiently on CPU and GPU hardware. For production deployments at scale, vLLM and Hugging Face's Text Generation Inference (TGI) are widely used server frameworks.
This ecosystem means that running a capable AI model locally is no longer a research-only capability. A student with a modern laptop can run a 7-billion-parameter model with reasonable performance. A developer with a desktop GPU can run a 13-billion-parameter model quickly. A team with a GPU server can run 70-billion-parameter models with performance approaching the closed frontier.
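To make "query the model through a local interface" concrete, here is a minimal sketch that sends one prompt to an Ollama server running on the same machine. It assumes Ollama is listening on its default local address and that a model has already been downloaded; the model name `llama3` and the helper names `build_request` and `ask_local_model` are illustrative choices, not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint. Adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a POST request asking for a single, non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt: str, model: str = "llama3") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        # With "stream": False the server replies with one JSON object
        # whose "response" field holds the generated text.
        return json.loads(resp.read())["response"]
```

With the server running, `ask_local_model("Summarize the benefits of self-hosting in one sentence.")` would return the model's reply as a string. The request goes to `localhost`, so the prompt never leaves the machine.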
Consumer GPUs with 16-24 GB of VRAM can run 13B-parameter models at inference speeds adequate for interactive use. Models can also be quantized, meaning their weights are stored at lower numerical precision, which shrinks the model at the price of a small accuracy trade-off so that it fits on hardware with less memory. A quantized 7B model runs acceptably on many modern laptops with a dedicated GPU, and even on CPU-only hardware at reduced speed. The capability threshold for practical local AI is now within reach for serious individual practitioners.
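A rough way to see why quantization matters is to estimate how much memory the weights alone occupy at different precisions. The sketch below is a back-of-the-envelope calculation, not a sizing tool: it assumes every weight uses the same number of bits and ignores the extra memory needed for the context cache, activations, and runtime overhead. Real quantization formats also tend to use slightly more than their nominal bits per weight. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare 16-bit (unquantized) and 4-bit (quantized) weights for two model sizes.
for params, bits in [(7, 16), (7, 4), (13, 16), (13, 4)]:
    print(f"{params}B at {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions a 7B model drops from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is the difference between needing a large GPU and fitting on a laptop.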
What Self-Hosting Costs
Self-hosting is not free. Its costs are real and must be weighed against its benefits.
Hardware cost: A capable GPU for serious local inference starts at several hundred dollars for older consumer cards and rises to thousands for high-end current cards. Enterprise GPU hardware is far more expensive. These are capital costs paid upfront, not variable costs per query.
Electricity: A GPU running under load draws a meaningful amount of power. A high-end GPU running continuously for a month may add $20-60 to an electricity bill, depending on local rates. That is manageable for most users, but real.
Operational overhead: Unlike an API, a self-hosted model requires setup, maintenance, model updates, and troubleshooting. When something goes wrong, there is no support team to call. This operational burden falls on you.
Model capability gap: For many tasks, the best open models available for self-hosting are genuinely excellent. But for tasks requiring frontier-level reasoning (complex multi-step analysis, advanced code generation, nuanced judgment calls), the gap between current open models and frontier closed models may matter. The sovereign practitioner must honestly assess whether that gap affects their specific use cases.
Development time: Integrating a self-hosted model into an application requires more work than using an API. Authentication, rate limiting, monitoring, and scaling are all handled by the provider in the API case, and all become your responsibility when you self-host.
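The electricity figure above can be checked with simple arithmetic: energy in kilowatt-hours is power draw times hours divided by 1000, and cost is energy times the local rate. The sketch below assumes a hypothetical 300 W sustained draw and two illustrative rates of $0.10 and $0.25 per kWh; substitute your own hardware's draw and your utility's rate.

```python
def monthly_electricity_cost(watts: float, rate_per_kwh: float,
                             hours: float = 24 * 30) -> float:
    """Cost of running a device at a constant power draw for a month."""
    kwh = watts * hours / 1000  # watt-hours converted to kilowatt-hours
    return kwh * rate_per_kwh


# Assumed values for illustration only: 300 W continuous draw, two example rates.
for rate in (0.10, 0.25):
    cost = monthly_electricity_cost(300, rate)
    print(f"300 W at ${rate:.2f}/kWh: ${cost:.2f} per month")
```

With these assumed inputs the result is roughly $22 to $54 per month. A machine that sits idle most of the day would cost considerably less.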
What Self-Hosting Gives You
Against these costs, self-hosting offers a distinctive set of benefits that no managed service can replicate.
Complete data privacy: Your prompts and outputs exist only on your hardware. They cannot be logged by a third party, used for training, subpoenaed from a provider, or accessed in a data breach at a remote server. For sensitive personal, professional, legal, or medical content, this matters enormously.
Operational independence: Once the model is running, it continues to run regardless of whether the original provider's servers are up, their API is rate-limited, their pricing has changed, or their company exists. You are not dependent on any external service for your tool to work.
Customization freedom: You can modify the model's system prompt, apply custom fine-tuning, modify the inference code, add pre- and post-processing, and change anything about how the model behaves, all without any provider's permission or awareness.
Cost at scale: At high query volumes, the per-query cost of self-hosted inference is very low once hardware is paid for. For applications that make millions of queries, the economics often favor self-hosting decisively over API pricing.
Offline operation: A self-hosted model works with no internet connection. For field applications, air-gapped environments, or situations where connectivity cannot be guaranteed, this is a hard requirement that only local deployment can meet.
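The "cost at scale" point can be made concrete by asking how many queries it takes for the per-query saving to repay the hardware. The sketch below is a simplified model under stated assumptions: a fixed upfront hardware cost, a constant API price per query, and a constant local cost per query (mostly electricity). All the numbers in the example are hypothetical, as is the function name `break_even_queries`.

```python
import math
from typing import Optional


def break_even_queries(hardware_cost: float, api_cost_per_query: float,
                       local_cost_per_query: float) -> Optional[int]:
    """Number of queries after which self-hosting becomes cheaper than the API.

    Returns None when the local per-query cost is not lower than the API's,
    because the hardware is then never paid back.
    """
    saving_per_query = api_cost_per_query - local_cost_per_query
    if saving_per_query <= 0:
        return None
    return math.ceil(hardware_cost / saving_per_query)


# Hypothetical inputs: $2,000 of hardware, $0.01 per API query, $0.001 per local query.
print(break_even_queries(2000, 0.01, 0.001))
```

An application making millions of queries passes a threshold like this quickly, while an occasional user may never reach it. That is why query volume is the deciding factor in this comparison.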
Match each self-hosting tool to its primary purpose.
A journalist investigates a government corruption case and wants to use an AI model to analyze thousands of sensitive leaked documents. She is concerned about document contents being exposed. What deployment approach best addresses her concern?
A developer finds that a 7-billion-parameter open model runs too slowly on CPU-only hardware for interactive use. Which technique allows the model to run with reduced memory requirements and acceptable speed on the same hardware, with a small trade-off in output quality?
Self-Hosting Cost-Benefit Analysis
You are advising a small independent law firm (5 attorneys, 3 staff) that currently spends $800/month on closed AI API subscriptions. They have expressed interest in self-hosting for privacy reasons: client privilege makes them uncomfortable with any client communications or case strategy being transmitted to external servers.
Research or estimate the following for a realistic self-hosting setup for this firm:
1. Hardware requirements: What GPU hardware would adequately run a capable open model? What is the current approximate cost of that hardware?
2. Model selection: Name a specific open model (with version) that would serve legal drafting, document summarization, and research tasks. What are its performance characteristics?
3. Software stack: What tools would you use for inference serving, model management, and potentially a chat interface for the attorneys?
4. Monthly operating cost: Estimate electricity cost, any cloud backup costs, and the time cost of monthly maintenance (hours per month times a notional hourly rate).
5. Break-even analysis: At $800/month in current API spend, how many months would it take to recover the hardware investment?
6. Capability assessment: Is there any task the firm currently uses AI for where the open model may underperform the closed API? How would you handle that gap?
Present your analysis as a one-page recommendation memo addressed to the firm's managing partner.
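For the operating-cost and break-even items, one possible way to structure the calculation is sketched below. It assumes the hardware is recovered out of the monthly saving, that is, the current API spend minus the new monthly operating cost. Only the $800/month figure comes from the scenario; the hardware price, electricity cost, backup cost, maintenance hours, and hourly rate are placeholders to be replaced with your own research.

```python
from typing import Optional


def break_even_months(hardware_cost: float, monthly_api_spend: float,
                      monthly_electricity: float, monthly_backup: float,
                      maintenance_hours: float, hourly_rate: float) -> Optional[float]:
    """Months needed for monthly savings to repay the hardware investment.

    Returns None if self-hosting costs as much or more per month than the API.
    """
    monthly_operating = (monthly_electricity + monthly_backup
                         + maintenance_hours * hourly_rate)
    monthly_saving = monthly_api_spend - monthly_operating
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Placeholder inputs except the $800 API spend, which comes from the scenario.
months = break_even_months(
    hardware_cost=6000,       # hypothetical GPU workstation price
    monthly_api_spend=800,    # from the scenario
    monthly_electricity=40,   # hypothetical
    monthly_backup=20,        # hypothetical
    maintenance_hours=4,      # hypothetical
    hourly_rate=100,          # hypothetical notional rate
)
print(f"Break-even after about {months:.1f} months")
```

The time cost of maintenance tends to dominate the operating cost, so the result is sensitive to the hours and hourly rate you assume.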