Private, Production-Ready, Custom LLM Stack Options

If you want generative AI without sending data to third parties, you need a private, custom stack. This guide shows how to stand up task-specific models—now including Coding, Software Engineering, Math/Reasoning, Legal Compliance, Web Content Conversion, Safety/Moderation, Creative Writing, plus domain tracks for Law and Healthcare—all served behind TLS and authentication, with optional RAG, tools, observability, and air-gapped operation. You’ll get copy-paste playbooks for quick prototypes and for production.
At the heart of the stack is a collection of curated, task-specific models optimized for real-world use cases. For code generation and development tasks, Qwen2.5-Coder is the recommended model. For more advanced software engineering workflows such as multi-file reasoning, architecture decisions, or code review, larger models like DeepSeek-Coder-V2-Instruct and CodeLlama-34B-Instruct take over. Mathematical reasoning and structured problem solving are delegated to models such as Qwen2.5-Math and DeepSeek-R1-distill. ReaderLM is highlighted as a specialized model for web content cleanup and conversion, ideal for preparing unstructured documents for RAG systems.
Crucially, the stack includes dedicated safety and moderation layers using models like Llama-Guard 3, which can intercept and filter both incoming prompts and outgoing completions. Creative generation is handled by community favorites such as MythoMax-L2. Domain-specific requirements are also addressed: in the legal field, models like Law-GPT, Legal-BERT, and LegalLLaMA are included, while healthcare applications are supported by BioMistral, BioGPT, Clinical Camel, and others. In many cases, domain adaptation through LoRA-finetuned general-purpose models such as Llama 3.1 or Qwen2.5 provides superior results on organization-specific data. These can be paired with schema-validated prompts, redaction rules, and manual review checkpoints to ensure compliance and safety.
The stack also supports Retrieval-Augmented Generation (RAG) workflows, enabling private document ingestion and vector-based retrieval using open embedding models and tools like FAISS or Qdrant. Agent capabilities are added through safe tool execution (such as Python for math problems or repo analysis for engineering tasks) and web fetchers sanitized through ReaderLM. Each model is served through a dedicated endpoint, with API gating, access logs, and strict control over inference behavior.
Architecture and Approach
At its simplest, a private LLM stack begins with Ollama or llama.cpp on a single workstation. This lets you experiment quickly with small or medium-sized models. OpenWebUI provides a lightweight interface, while a local embedding store like FAISS or SQLite can support basic retrieval-augmented generation. Security at this level is straightforward: bind services to localhost and, if needed, place a reverse proxy with authentication in front.
For production workloads, the architecture becomes more layered. Here, vLLM or Hugging Face TGI act as the primary serving engines, exposing OpenAI-compatible APIs with streaming and batching. A reverse proxy such as Caddy provides TLS, Basic Auth or API keys, IP allowlists, and optional mTLS. Each role-specific model—whether for coding, software engineering, legal work, or medical language—runs as a separate containerized service with its own route, authentication, and policies. A safety layer, often powered by Llama-Guard 3, sits at the perimeter to filter both inbound prompts and outbound responses. RAG pipelines and tool integrations can be added as separate microservices, while observability comes from Prometheus or OpenTelemetry collectors feeding into Grafana dashboards. For the most sensitive deployments, model weights and images are mirrored internally, with all egress blocked to enforce true air-gapped operation.
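To make the serving contract concrete, here is a minimal client sketch against such a deployment, written for an OpenAI-compatible route published behind Caddy. The base URL, route, API key, and model name are placeholders for your own environment, not values prescribed by this guide.

# Minimal client sketch for an OpenAI-compatible vLLM/TGI route behind the reverse proxy.
# base_url, api_key, and model are hypothetical placeholders for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.example/coder/v1",   # route published by Caddy
    api_key="YOUR_PER_ROUTE_API_KEY",                    # enforced at the proxy and/or engine
)

resp = client.chat.completions.create(
    model="qwen2.5-coder-7b-instruct",                   # whatever name the serving engine registers
    messages=[
        {"role": "system", "content": "You are a private coding assistant."},
        {"role": "user", "content": "Write a Python function that parses ISO 8601 timestamps."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)

The same client works unchanged against any of the role-specific routes described below; only the base URL, key, and system prompt change.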
Reference Architecture
Tier A — Rapid Prototyping
- Serving: Ollama (single box, GPU or CPU)
- UI: OpenWebUI
- Storage: Local embeddings/vector store (FAISS/SQLite)
- Security: Localhost only or reverse proxy with basic auth
Tier B — Production
- Serving: vLLM (primary) or Hugging Face TGI (alternative), OpenAI-compatible APIs
- Reverse Proxy: Caddy (TLS, Basic Auth/API keys, rate limits, IP allowlists)
- Models by Job-to-be-Done:
  - Coding: Qwen2.5-Coder
  - Software Engineering (architecture/code review): DeepSeek-Coder-V2-Instruct, CodeLlama-34B-Instruct, Qwen2.5-Coder-32B
  - Math/Reasoning: Qwen2.5-Math and/or DeepSeek-R1-distill
  - Web Content Conversion: ReaderLM (HTML→Markdown)
  - Safety/Moderation: Llama-Guard 3 (ingress and egress)
  - Creative Writing: MythoMax-L2
  - Law (domain): Legal-tuned open models (e.g., Law-GPT, Legal-BERT, LegalLLaMA), or LoRA-tuned Llama/Qwen
  - Healthcare (domain): BioMistral, BioGPT, Clinical Camel/MedAlpaca (plus your own LoRA)
  - Generalist Assistants: Llama 3.1, Qwen2.5-Instruct, Mixtral, Gemma 2, optional DBRX
- Agents: RAG + tools (Python/sympy, repo access, sanitized web fetch → ReaderLM)
- Observability: Prometheus/Grafana or OpenTelemetry collector
- Governance: Redaction/segregated logs, retention policies
- Air-Gap: Mirrored model cache, internal container registry, egress blocked
Models by Role (Why These)
- Coding — Qwen2.5-Coder (7B→32B): Strong code generation & instructions; great for snippet-level tasks and IDE-style completions.
- Software Engineering — DeepSeek-Coder-V2-Instruct (16B/33B), CodeLlama-34B-Instruct, Qwen2.5-Coder-32B: Use when you need architecture tradeoffs, multi-file reasoning, refactors, code review, and documentation—not just line-level code.
- Math/Reasoning — Qwen2.5-Math + DeepSeek-R1-distill: Math for explicit derivations; R1-distill for broader multi-step reasoning beyond strict math.
- Web Content Conversion — ReaderLM: Purpose-built to convert messy HTML into clean Markdown/text for ETL and RAG.
- Safety/Moderation — Llama-Guard 3: Open safety classifier you can tailor to policy; gate before and after model responses.
- Creative Writing — MythoMax-L2-13B: Community favorite for ideation, tone, and narrative exploration.
- Law (domain) — Law-GPT / Legal-BERT / LegalLLaMA (or LoRA-tuned Llama/Qwen): Legal vocab/citation patterns; great for case summaries, contract commentary, statute Q&A. Always pair with organization-specific policy JSON and human review.
- Healthcare (domain) — BioMistral, BioGPT, Clinical Camel/MedAlpaca: For biomedical/clinical language, drug names, ICD/CPT-style text, and PubMed-like corpora. Apply strict PHI handling and log redaction.
- Generalist Assistants — Llama 3.1 / Qwen2.5-Instruct / Mixtral / Gemma 2: Reliable all-purpose chat; pick based on hardware and licensing.
Tip: For law/healthcare, you can often get farther by LoRA-tuning a strong generalist (Llama 3.1/Qwen2.5) on your private corpus than by relying solely on small niche models. Keep the safety sidecar enabled and version your policies.
Security, Customization, and Operations
Running custom LLMs privately means placing security at the core. Model services should never be exposed directly to the internet. Instead, they should sit behind a reverse proxy that enforces TLS, authentication, and rate-limiting. Secrets such as Hugging Face tokens or API keys must be handled via environment variables or a vault system. Logs should be carefully designed to avoid leaking sensitive content. For healthcare and legal use, access controls and retention policies are non-negotiable.
Customization is what makes the stack valuable to each organization. LoRA and QLoRA allow lightweight fine-tuning on proprietary datasets without the overhead of full retraining. Quantization, using formats like GGUF for CPU deployments or GPTQ/AWQ for GPUs, reduces VRAM requirements and makes deployment possible on commodity hardware. Context windows should be chosen pragmatically: while some models advertise 64k or 128k tokens, throughput can degrade significantly. For most tasks, 8k to 32k offers a good balance. Prompting standards—such as requiring JSON outputs, LaTeX derivations, or refusal templates—help enforce reliability.
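As a sketch of the LoRA path described above, the snippet below attaches an adapter to a base model with Hugging Face PEFT. The base model id, target modules, and hyperparameters are illustrative assumptions, not a tuned recipe; pair this with your own dataset and trainer (or Unsloth) for real runs.

# Minimal LoRA sketch with Hugging Face PEFT; values below are illustrative, not a recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"   # assumed HF repo id; use your internally mirrored copy
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # common attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the small adapter is trainable; base weights stay frozen

The resulting adapter can later be merged into the base weights or mounted at serving time (vLLM supports LoRA adapters), so one shared base model can back several domain variants.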
Operations extend beyond deployment. Observability is essential for ensuring models are fast, reliable, and aligned. Metrics around latency, token throughput, error rates, and moderation outcomes should be tracked, visualized, and tied to service level objectives. Offline evaluation sets allow teams to benchmark accuracy in coding, legal reasoning, or medical QA. Red-teaming should be a regular practice, with suites that attempt jailbreaks, prompt injections, and PII or PHI exfiltration. Findings should feed back into prompt templates, refusal strategies, or safety policies.
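A small service-side sketch of the metrics habit described above, using the prometheus_client library; the metric names and label sets are examples rather than a required schema.

# Sketch: per-route request counters and latency histograms feeding Grafana dashboards and SLOs.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "LLM requests by route and status", ["route", "status"])
LATENCY = Histogram("llm_request_seconds", "End-to-end request latency", ["route"])

def handle(route, fn):
    start = time.perf_counter()
    try:
        result = fn()
        REQUESTS.labels(route=route, status="ok").inc()
        return result
    except Exception:
        REQUESTS.labels(route=route, status="error").inc()
        raise
    finally:
        LATENCY.labels(route=route).observe(time.perf_counter() - start)

start_http_server(9100)   # Prometheus scrapes this port; Grafana plots the series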
- Don’t expose model ports to the internet. Bind to 127.0.0.1 or a private VLAN.
- Put Caddy in front with TLS, Basic Auth, API keys, rate-limits, and (optionally) IP allowlists or mTLS.
- Secrets: Use env files or a secrets manager; rotate HF tokens and keys.
- Logging: Redact inputs/outputs (especially PHI/PII); segregate safety logs from app logs; set retention.
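As a starting point for the redaction rule above, here is a rough sketch that masks a few obvious identifier patterns before anything is written to logs. The regexes are illustrative and will not catch all PII/PHI; treat this as scaffolding, not a compliance control.

# Redact obvious identifiers before logging; extend the patterns for your own data types.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach jane.doe@example.com or 555-123-4567 about record 123-45-6789"))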
RAG & Tools (Your Private Agent Platform)
RAG Pipeline: loaders → clean/convert (ReaderLM) → chunk → embeddings → vector DB → retriever → (optional re-rank) → prompt.
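Here is a minimal retrieval sketch covering the embed, index, and retrieve steps of that pipeline, using sentence-transformers and FAISS. The embedding model, chunks, and prompt template are placeholders; real pipelines add loaders, ReaderLM cleanup, chunking, and optional re-ranking around this core.

# Embed → index → retrieve → prompt, with a local embedding model and an in-memory FAISS index.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")   # any local embedding model works

chunks = [
    "Reset a forgotten VPN password via the internal self-service portal.",
    "Production deploys require two approvals and a change ticket.",
]
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])   # inner product equals cosine on normalized vectors
index.add(vectors)

query = "How do I reset my VPN password?"
qvec = embedder.encode([query], normalize_embeddings=True)
scores, ids = index.search(qvec, 1)
context = chunks[ids[0][0]]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)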
Tools to wire in:
- Python/sympy for exact math and controlled code execution (a minimal sketch follows this list)
- Repo access (read-only) for code search, test triggers, PR suggestions
- Web fetcher → ReaderLM sanitizer (strip scripts/trackers; preserve tables/code)
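The Python/sympy tool mentioned above can stay tiny: the model proposes an expression and the service evaluates it exactly, instead of trusting generated arithmetic. A minimal sketch:

# Exact math "tool": evaluate model-proposed expressions with sympy rather than trusting them.
import sympy as sp

x = sp.symbols("x")
expr = sp.integrate(sp.sin(x) * sp.exp(x), x)      # e.g., a derivation step to verify
roots = sp.solve(sp.Eq(x**2 - 5 * x + 6, 0), x)    # exact roots, no floating-point drift
print(expr)    # exp(x)*sin(x)/2 - exp(x)*cos(x)/2
print(roots)   # [2, 3]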
Guardrails: schema-constrained JSON, output validators, Llama-Guard on ingress & egress, and—where needed—PII/PHI detectors.
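A hedged sketch of the ingress/egress gate follows. It assumes Llama-Guard 3 is served on its own OpenAI-compatible route and that its verdict text begins with "safe" or "unsafe", which is how the Llama-Guard prompt format reports classifications; route, key, and model name are placeholders.

# Gate both directions: check the user prompt before inference and the completion before returning it.
from openai import OpenAI

guard = OpenAI(base_url="https://llm.internal.example/guard/v1", api_key="GUARD_KEY")   # placeholders

def allowed(text: str) -> bool:
    verdict = guard.chat.completions.create(
        model="llama-guard-3-8b",                    # whatever name the serving engine registers
        messages=[{"role": "user", "content": text}],
        max_tokens=16,
    ).choices[0].message.content.strip().lower()
    return verdict.startswith("safe")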
Customization Paths
- LoRA/QLoRA with Unsloth/PEFT; mount adapters on base models in vLLM.
- Quantization: GGUF (llama.cpp) for CPU/low-VRAM; GPTQ/AWQ for GPU efficiency.
- Context Windows: Keep pragmatic windows (8k–32k) for throughput; extend only when necessary.
- Prompt Standards: Per-role system prompts (coder, software engineer, math, web convert, safety, creative, law, healthcare), JSON schemas, and refusal patterns.
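For the prompt-standards item, a small validator sketch with the jsonschema library; the schema itself is an illustrative example of a per-role output contract, not one defined by this guide.

# Reject (or retry) any completion that does not match the role's JSON schema.
import json
from jsonschema import validate   # raises jsonschema.ValidationError on mismatch

CONTRACT_SUMMARY_SCHEMA = {
    "type": "object",
    "properties": {
        "parties": {"type": "array", "items": {"type": "string"}},
        "risk_level": {"type": "string", "enum": ["low", "medium", "high"]},
        "summary": {"type": "string"},
    },
    "required": ["parties", "risk_level", "summary"],
}

def check(completion_text: str) -> dict:
    data = json.loads(completion_text)   # raises on non-JSON output
    validate(instance=data, schema=CONTRACT_SUMMARY_SCHEMA)
    return data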
Observability, Evals & Red-Team
- Metrics: latency, tokens, queueing, error rates; per-model dashboards and SLOs.
- Offline evals: code pass rates, multi-file tasks (SE), math accuracy, moderation precision/recall, retrieval hit rate.
- Red-team: jailbreak suites, prompt-leak checks, PHI/PII exfil tests; tune policies and refusals.
Hardware & Cost Notes
The VRAM requirements of each model vary. Seven-billion parameter models run comfortably on a 12–16 GB GPU, and with GGUF quantization can even be hosted on CPUs for lighter use. Thirteen-billion parameter models generally require 20–24 GB, while models in the 30B range or mixture-of-experts configurations may demand 48 GB or multiple GPUs with tensor or replica parallelism. In practice, the biggest throughput drivers are batching, context length, tokenizer efficiency, and kv-cache size.
- 7B models: single 12–16 GB GPU (or CPU via GGUF at lower throughput).
- 13B–34B models: 20–48 GB GPUs (or quantize/parallelize).
- MoE/32B+: multi-GPU or high-VRAM nodes; consider tensor/replica parallelism.
- Throughput drivers: batching, context length, tokenizer speed, kv-cache size.
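As a rough sanity check on the sizing above, here is a back-of-the-envelope estimator for weight memory only; kv-cache, activations, and serving overhead add more, and the bits-per-parameter figures are approximate rules of thumb.

# Weight memory ≈ parameters × bits-per-parameter / 8. Ignores kv-cache and runtime overhead.
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for name, params, bits in [
    ("7B @ FP16", 7, 16),
    ("7B @ ~4-bit GGUF", 7, 4.5),    # quantized formats carry some per-block overhead
    ("34B @ ~4-bit", 34, 4.5),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")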
Deployment Playbooks
Most organizations begin with a single-box prototype. Using Ollama and OpenWebUI, you can quickly test Qwen2.5-Coder, DeepSeek-R1, ReaderLM, or MythoMax without complex setup. For production, containerized deployments with vLLM or TGI behind Caddy become the norm. Each model runs as its own service, with unique routes, API keys, and Basic Auth credentials. Legal and healthcare endpoints can be placed under stricter logging and review workflows. Safety sidecars run at both ingress and egress to filter input and output. With this pattern, you can scale from a developer laptop to an enterprise cluster without changing the fundamental architecture.
Playbook A — Single Box (Prototype)
Ollama + OpenWebUI (GPU or CPU)
# Examples (expand as needed)
ollama pull qwen2.5-coder:7b-instruct
ollama pull qwen2.5-math:7b-instruct
ollama pull deepseek-r1:8b
# For engineering-scale: use larger code models via Ollama or vLLM
- Keep bindings to localhost; place a simple proxy with auth in front if you must expose it.
Playbook B — Production (vLLM/TGI + Caddy + Guard + RAG + Tools)
- Separate endpoints for coder vs software-engineering models (different system prompts); see the client sketch after this list.
- Keep Llama-Guard on both ingress and egress.
- Add /law and /healthcare routes with stricter policies and logging rules.
- Use compose for reproducible deployments and per-route Basic Auth/API keys.
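A small client-side sketch of that per-route separation, referenced above: each role gets its own base URL, key, and system prompt behind Caddy. URLs, keys, and model names are placeholders for your deployment.

# One helper, many routes: role-specific endpoints with their own credentials and system prompts.
from openai import OpenAI

ROUTES = {
    "coder":       ("https://llm.internal.example/coder/v1", "CODER_KEY", "qwen2.5-coder-32b"),
    "engineering": ("https://llm.internal.example/eng/v1",   "ENG_KEY",   "deepseek-coder-v2-instruct"),
    "law":         ("https://llm.internal.example/law/v1",   "LAW_KEY",   "legal-lora-llama-3.1"),
}

def ask(route: str, system: str, user: str) -> str:
    base_url, key, model = ROUTES[route]
    client = OpenAI(base_url=base_url, api_key=key)
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    return out.choices[0].message.content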
Industry Applications
Marketing agencies might combine coding, creative writing, and safety models to build brand-safe content pipelines. Legal firms benefit from law-specific LoRA models integrated with internal RAG databases of contracts and case law. Healthcare providers can use biomedical LLMs for literature review or clinical note summarization, but always with PHI safeguards and clinician review. Engineering teams, meanwhile, get the most value from separating coding and software engineering models: one handles snippet completion, while the other supports architecture decisions, refactors, and documentation.
- Agency: coding + creative + safety; RAG on brand assets; LoRA for tone.
- Legal/Finance: strict moderation, PII redaction, air-gapped cache; RAG on controlled corpora; LoRA on internal documents.
- Healthcare: PHI-aware logging, restricted prompts, clinician-in-the-loop; LoRA on approved clinical notes.
- Product/Eng: software-engineering model gated by unit/integration tests; repo-aware suggestions; ADR (architecture decision record) generation.
1. Software Engineering / Full-Stack Development
For more than just code completion — think architecture decisions, code review, refactoring, documentation generation, and integration advice — you want models that combine code + general reasoning:
- DeepSeek-Coder-V2-Instruct (16B or 33B): Excellent at multi-language coding, understands frameworks, can walk through architecture tradeoffs.
- CodeLlama-34B-Instruct: Strong on large codebases and multi-file reasoning.
- Qwen2.5-Coder-32B (the larger sibling of the coder models above): Larger context + better reasoning for big projects.
Deployment: Serve these alongside your math and generalist models; give them more VRAM and longer context for multi-file reasoning. Keep them in a separate “engineering” endpoint so your prompt templates can differ from “coding snippet” use cases.
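A hedged sketch of what "more VRAM and longer context" looks like with vLLM's offline Python API; the model id, window, and GPU count are assumptions to size against your own hardware, and in production you would expose the same model through vLLM's OpenAI-compatible server instead.

# Larger engineering model: longer window and tensor parallelism across two GPUs (illustrative values).
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",   # assumed HF repo id; mirror it internally
    max_model_len=32768,           # longer window for multi-file reasoning
    tensor_parallel_size=2,        # split weights across GPUs if one card is not enough
)
params = SamplingParams(temperature=0.2, max_tokens=1024)
outputs = llm.generate(["Review this module for thread-safety issues:\n<paste code here>"], params)
print(outputs[0].outputs[0].text)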
2. Law-Specialized Models
For legal tasks like case summarization, contract review, or statute Q&A, you want a model trained on legal corpora:
- Law-GPT / Legal-BERT / LexLM — domain-specific open models with legal vocab & reasoning.
- Pythia-Legal / LegalLLaMA — open fine-tunes of LLaMA on U.S. case law + contracts.
- BloombergGPT (closed weights, but the concept shows the value of finance/law-domain training — for law, you’d emulate it with open-weight models).
Note: You can also take Llama 3.1 or Qwen2.5 and LoRA-finetune on your jurisdiction’s case law & statutes for better results while keeping them private.
3. Healthcare / Biomedical Models
For clinical language, drug names, and medical reasoning:
- BioMistral — tuned for biomedical + clinical text.
- BioGPT / PubMedBERT — Microsoft’s biomedical LMs.
- Clinical Camel / MedAlpaca — tuned for medical dialogues & diagnosis explanation.
- Med-PaLM 2 (Google) — not open weights, but open analogues like OpenBioLLM are emerging.
HIPAA note: Even with a medical LLM, you still have to wrap it in the same safety + logging + governance stack we’ve described — and ensure you don’t store PHI in logs unless encrypted + access-controlled.
4. Integration into Our Stack
We’d simply add three more “job” rows to the Models by Job-to-be-Done matrix above:
- Software Engineering — DeepSeek-Coder-V2-Instruct, CodeLlama-34B, Qwen2.5-Coder-32B.
- Law — Law-GPT, Legal-BERT, LexLM, or LoRA-tuned Llama/Qwen.
- Healthcare — BioMistral, BioGPT, Clinical Camel, MedAlpaca.
They’d follow the same “Proto → Prod” pattern (Ollama for quick use, vLLM/TGI for production), get their own Basic Auth route in Caddy, and be quantized and served according to available VRAM.
FAQ
Can this run fully offline? Yes. Mirror images/models to an internal registry and a local model cache, then block egress.
How private is it? When configured as above, no external inference calls are made; logs can be redacted, encrypted, and access-controlled.
Do I need domain-specific models for law/healthcare? They help, but LoRA-tuned generalists often outperform small niche models on your own corpus. Keep humans in the loop for regulated outputs.
Conclusion
A private LLM stack allows organizations to enjoy the benefits of modern AI while keeping sensitive data under their own control. By combining generalist assistants with models purpose-built for coding, software engineering, math, content conversion, moderation, creativity, law, and healthcare, you can meet a wide range of needs. The path from prototyping to production is straightforward: start small with Ollama, expand into vLLM/TGI and Caddy for serving and security, add RAG and tools for agentic workflows, and wrap everything with observability and governance. Done correctly, this approach provides enterprise-grade AI capabilities—without surrendering privacy or compliance.