Private, Production-Ready, Custom LLM Stack Options

If you want generative AI without sending data to third parties, you need a private, custom stack. This guide shows how to stand up task-specific models—now including Coding, Software Engineering, Math/Reasoning, Legal Compliance, Web Content Conversion, Safety/Moderation, Creative Writing, plus domain tracks for Law and Healthcare—all served behind TLS and authentication, with optional RAG, tools, observability, and air-gapped operation. You’ll get copy-paste playbooks for quick prototypes and for production.
At the heart of the stack is a collection of curated, task-specific models optimized for real-world use cases. For code generation and development tasks, Qwen2.5-Coder is the recommended model. For more advanced software engineering workflows such as multi-file reasoning, architecture decisions, or code review, larger models like DeepSeek-Coder-V2-Instruct and CodeLlama-34B-Instruct take over. Mathematical reasoning and structured problem solving are delegated to models such as Qwen2.5-Math and DeepSeek-R1-distill. ReaderLM is highlighted as a specialized model for web content cleanup and conversion, ideal for preparing unstructured documents for RAG systems.
Crucially, the stack includes dedicated safety and moderation layers using models like Llama-Guard 3, which can intercept and filter both incoming prompts and outgoing completions. Creative generation is handled by community favorites such as MythoMax-L2. Domain-specific requirements are also addressed: in the legal field, models like Law-GPT, Legal-BERT, and LegalLLaMA are included, while healthcare applications are supported by BioMistral, BioGPT, Clinical Camel, and others. In many cases, domain adaptation through LoRA-finetuned general-purpose models such as Llama 3.1 or Qwen2.5 provides superior results on organization-specific data. These can be paired with schema-validated prompts, redaction rules, and manual review checkpoints to ensure compliance and safety.
The stack also supports Retrieval-Augmented Generation (RAG) workflows, enabling private document ingestion and vector-based retrieval using open embedding models and tools like FAISS or Qdrant. Agent capabilities are added through safe tool execution (such as Python for math problems or repo analysis for engineering tasks) and web fetchers sanitized through ReaderLM. Each model is served through a dedicated endpoint, with API gating, access logs, and strict control over inference behavior.
Architecture and Approach
At its simplest, a private LLM stack begins with Ollama or llama.cpp on a single workstation. This lets you experiment quickly with small or medium-sized models. OpenWebUI provides a lightweight interface, while a local embedding store like FAISS or SQLite can support basic retrieval-augmented generation. Security at this level is straightforward: bind services to localhost and, if needed, place a reverse proxy with authentication in front.
For production workloads, the architecture becomes more layered. Here, vLLM or Hugging Face TGI act as the primary serving engines, exposing OpenAI-compatible APIs with streaming and batching. A reverse proxy such as Caddy provides TLS, Basic Auth or API keys, IP allowlists, and optional mTLS. Each role-specific model—whether for coding, software engineering, legal work, or medical language—runs as a separate containerized service with its own route, authentication, and policies. A safety layer, often powered by Llama-Guard 3, sits at the perimeter to filter both inbound prompts and outbound responses. RAG pipelines and tool integrations can be added as separate microservices, while observability comes from Prometheus or OpenTelemetry collectors feeding into Grafana dashboards. For the most sensitive deployments, model weights and images are mirrored internally, with all egress blocked to enforce true air-gapped operation.
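To make the serving contract concrete, here is a minimal client sketch against such a deployment, written for an OpenAI-compatible route published behind Caddy. The base URL, route, API key, and model name are placeholders for your own environment, not values prescribed by this guide.

# Minimal client sketch for an OpenAI-compatible vLLM/TGI route behind the reverse proxy.
# base_url, api_key, and model are hypothetical placeholders for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.example/coder/v1",   # route published by Caddy
    api_key="YOUR_PER_ROUTE_API_KEY",                    # enforced at the proxy and/or engine
)

resp = client.chat.completions.create(
    model="qwen2.5-coder-7b-instruct",                   # whatever name the serving engine registers
    messages=[
        {"role": "system", "content": "You are a private coding assistant."},
        {"role": "user", "content": "Write a Python function that parses ISO 8601 timestamps."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)

The same client works unchanged against any of the role-specific routes described below; only the base URL, key, and system prompt change.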
Reference Architecture
Tier A — Rapid Prototyping
- Serving: Ollama (single box, GPU or CPU)
- UI: OpenWebUI
- Storage: Local embeddings/vector store (FAISS/SQLite)
- Security: Localhost only or reverse proxy with basic auth
Tier B — Production
- Serving: vLLM (primary) or Hugging Face TGI (alternative), OpenAI-compatible APIs
- Reverse Proxy: Caddy (TLS, Basic Auth/API keys, rate limits, IP allowlists)
- Models by Job-to-be-Done:
  - Coding: Qwen2.5-Coder
  - Software Engineering (architecture/code review): DeepSeek-Coder-V2-Instruct, CodeLlama-34B-Instruct, Qwen2.5-Coder-32B
  - Math/Reasoning: Qwen2.5-Math and/or DeepSeek-R1-distill
  - Web Content Conversion: ReaderLM (HTML→Markdown)
  - Safety/Moderation: Llama-Guard 3 (ingress and egress)
  - Creative Writing: MythoMax-L2
  - Law (domain): Legal-tuned open models (e.g., Law-GPT, Legal-BERT, LegalLLaMA), or LoRA-tuned Llama/Qwen
  - Healthcare (domain): BioMistral, BioGPT, Clinical Camel/MedAlpaca (plus your own LoRA)
  - Generalist Assistants: Llama 3.1, Qwen2.5-Instruct, Mixtral, Gemma 2, optional DBRX
- Agents: RAG + tools (Python/sympy, repo access, sanitized web fetch → ReaderLM)
- Observability: Prometheus/Grafana or OpenTelemetry collector
- Governance: Redaction/segregated logs, retention policies
- Air-Gap: Mirrored model cache, internal container registry, egress blocked
Models by Role (Why These)
- Coding — Qwen2.5-Coder (7B→32B): Strong code generation & instructions; great for snippet-level tasks and IDE-style completions.
- Software Engineering — DeepSeek-Coder-V2-Instruct (16B/33B), CodeLlama-34B-Instruct, Qwen2.5-Coder-32B: Use when you need architecture tradeoffs, multi-file reasoning, refactors, code review, and documentation—not just line-level code.
- Math/Reasoning — Qwen2.5-Math + DeepSeek-R1-distill: Math for explicit derivations; R1-distill for broader multi-step reasoning beyond strict math.
- Web Content Conversion — ReaderLM: Purpose-built to convert messy HTML into clean Markdown/text for ETL and RAG.
- Safety/Moderation — Llama-Guard 3: Open safety classifier you can tailor to policy; gate before and after model responses.
- Creative Writing — MythoMax-L2-13B: Community favorite for ideation, tone, and narrative exploration.
- Law (domain) — Law-GPT / Legal-BERT / LegalLLaMA (or LoRA-tuned Llama/Qwen): Legal vocab/citation patterns; great for case summaries, contract commentary, statute Q&A. Always pair with organization-specific policy JSON and human review.
- Healthcare (domain) — BioMistral, BioGPT, Clinical Camel/MedAlpaca: For biomedical/clinical language, drug names, ICD/CPT-style text, and PubMed-like corpora. Apply strict PHI handling and log redaction.
- Generalist Assistants — Llama 3.1 / Qwen2.5-Instruct / Mixtral / Gemma 2: Reliable all-purpose chat; pick based on hardware and licensing.
Tip: For law/healthcare, you can often get farther by LoRA-tuning a strong generalist (Llama 3.1/Qwen2.5) on your private corpus than by relying solely on small niche models. Keep the safety sidecar enabled and version your policies.
Security, Customization, and Operations
Running custom LLMs privately means placing security at the core. Model services should never be exposed directly to the internet. Instead, they should sit behind a reverse proxy that enforces TLS, authentication, and rate-limiting. Secrets such as Hugging Face tokens or API keys must be handled via environment variables or a vault system. Logs should be carefully designed to avoid leaking sensitive content. For healthcare and legal use, access controls and retention policies are non-negotiable.
Customization is what makes the stack valuable to each organization. LoRA and QLoRA allow lightweight fine-tuning on proprietary datasets without the overhead of full retraining. Quantization, using formats like GGUF for CPU deployments or GPTQ/AWQ for GPUs, reduces VRAM requirements and makes deployment possible on commodity hardware. Context windows should be chosen pragmatically: while some models advertise 64k or 128k tokens, throughput can degrade significantly. For most tasks, 8k to 32k offers a good balance. Prompting standards—such as requiring JSON outputs, LaTeX derivations, or refusal templates—help enforce reliability.
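As a sketch of the LoRA path described above, the snippet below attaches an adapter to a base model with Hugging Face PEFT. The base model id, target modules, and hyperparameters are illustrative assumptions, not a tuned recipe; pair this with your own dataset and trainer (or Unsloth) for real runs.

# Minimal LoRA sketch with Hugging Face PEFT; values below are illustrative, not a recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"   # assumed HF repo id; use your internally mirrored copy
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # common attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the small adapter is trainable; base weights stay frozen

The resulting adapter can later be merged into the base weights or mounted at serving time (vLLM supports LoRA adapters), so one shared base model can back several domain variants.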
Operations extend beyond deployment. Observability is essential for ensuring models are fast, reliable, and aligned. Metrics around latency, token throughput, error rates, and moderation outcomes should be tracked, visualized, and tied to service level objectives. Offline evaluation sets allow teams to benchmark accuracy in coding, legal reasoning, or medical QA. Red-teaming should be a regular practice, with suites that attempt jailbreaks, prompt injections, and PII or PHI exfiltration. Findings should feed back into prompt templates, refusal strategies, or safety policies.
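A small service-side sketch of the metrics habit described above, using the prometheus_client library; the metric names and label sets are examples rather than a required schema.

# Sketch: per-route request counters and latency histograms feeding Grafana dashboards and SLOs.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "LLM requests by route and status", ["route", "status"])
LATENCY = Histogram("llm_request_seconds", "End-to-end request latency", ["route"])

def handle(route, fn):
    start = time.perf_counter()
    try:
        result = fn()
        REQUESTS.labels(route=route, status="ok").inc()
        return result
    except Exception:
        REQUESTS.labels(route=route, status="error").inc()
        raise
    finally:
        LATENCY.labels(route=route).observe(time.perf_counter() - start)

start_http_server(9100)   # Prometheus scrapes this port; Grafana plots the series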
- Don’t expose model ports to the internet. Bind to 127.0.0.1 or a private VLAN.
- Put Caddy in front with TLS, Basic Auth, API keys, rate-limits, and (optionally) IP allowlists or mTLS.
- Secrets: Use env files or a secrets manager; rotate HF tokens and keys.
- Logging: Redact inputs/outputs (especially PHI/PII); segregate safety logs from app logs; set retention.
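As a starting point for the redaction rule above, here is a rough sketch that masks a few obvious identifier patterns before anything is written to logs. The regexes are illustrative and will not catch all PII/PHI; treat this as scaffolding, not a compliance control.

# Redact obvious identifiers before logging; extend the patterns for your own data types.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach jane.doe@example.com or 555-123-4567 about record 123-45-6789"))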
RAG & Tools (Your Private Agent Platform)
RAG Pipeline: loaders → clean/convert (ReaderLM) → chunk → embeddings → vector DB → retriever → (optional re-rank) → prompt.
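Here is a minimal retrieval sketch covering the embed, index, and retrieve steps of that pipeline, using sentence-transformers and FAISS. The embedding model, chunks, and prompt template are placeholders; real pipelines add loaders, ReaderLM cleanup, chunking, and optional re-ranking around this core.

# Embed → index → retrieve → prompt, with a local embedding model and an in-memory FAISS index.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")   # any local embedding model works

chunks = [
    "Reset a forgotten VPN password via the internal self-service portal.",
    "Production deploys require two approvals and a change ticket.",
]
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])   # inner product equals cosine on normalized vectors
index.add(vectors)

query = "How do I reset my VPN password?"
qvec = embedder.encode([query], normalize_embeddings=True)
scores, ids = index.search(qvec, 1)
context = chunks[ids[0][0]]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)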
Tools to wire in:
- Python/sympy for exact math and controlled code execution (a minimal sketch follows this list)
- Repo access (read-only) for code search, test triggers, PR suggestions
- Web fetcher → ReaderLM sanitizer (strip scripts/trackers; preserve tables/code)
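The Python/sympy tool mentioned above can stay tiny: the model proposes an expression and the service evaluates it exactly, instead of trusting generated arithmetic. A minimal sketch:

# Exact math "tool": evaluate model-proposed expressions with sympy rather than trusting them.
import sympy as sp

x = sp.symbols("x")
expr = sp.integrate(sp.sin(x) * sp.exp(x), x)      # e.g., a derivation step to verify
roots = sp.solve(sp.Eq(x**2 - 5 * x + 6, 0), x)    # exact roots, no floating-point drift
print(expr)    # exp(x)*sin(x)/2 - exp(x)*cos(x)/2
print(roots)   # [2, 3]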
Guardrails: schema-constrained JSON, output validators, Llama-Guard on ingress & egress, and—where needed—PII/PHI detectors.
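A hedged sketch of the ingress/egress gate follows. It assumes Llama-Guard 3 is served on its own OpenAI-compatible route and that its verdict text begins with "safe" or "unsafe", which is how the Llama-Guard prompt format reports classifications; route, key, and model name are placeholders.

# Gate both directions: check the user prompt before inference and the completion before returning it.
from openai import OpenAI

guard = OpenAI(base_url="https://llm.internal.example/guard/v1", api_key="GUARD_KEY")   # placeholders

def allowed(text: str) -> bool:
    verdict = guard.chat.completions.create(
        model="llama-guard-3-8b",                    # whatever name the serving engine registers
        messages=[{"role": "user", "content": text}],
        max_tokens=16,
    ).choices[0].message.content.strip().lower()
    return verdict.startswith("safe")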
Customization Paths
- LoRA/QLoRA with Unsloth/PEFT; mount adapters on base models in vLLM.
- Quantization: GGUF (llama.cpp) for CPU/low-VRAM; GPTQ/AWQ for GPU efficiency.
- Context Windows: Keep pragmatic windows (8k–32k) for throughput; extend only when necessary.
- Prompt Standards: Per-role system prompts (coder, software engineer, math, web convert, safety, creative, law, healthcare), JSON schemas, and refusal patterns.
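For the prompt-standards item, a small validator sketch with the jsonschema library; the schema itself is an illustrative example of a per-role output contract, not one defined by this guide.

# Reject (or retry) any completion that does not match the role's JSON schema.
import json
from jsonschema import validate   # raises jsonschema.ValidationError on mismatch

CONTRACT_SUMMARY_SCHEMA = {
    "type": "object",
    "properties": {
        "parties": {"type": "array", "items": {"type": "string"}},
        "risk_level": {"type": "string", "enum": ["low", "medium", "high"]},
        "summary": {"type": "string"},
    },
    "required": ["parties", "risk_level", "summary"],
}

def check(completion_text: str) -> dict:
    data = json.loads(completion_text)   # raises on non-JSON output
    validate(instance=data, schema=CONTRACT_SUMMARY_SCHEMA)
    return data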
Observability, Evals & Red-Team
- Metrics: latency, tokens, queueing, error rates; per-model dashboards and SLOs.
- Offline evals: code pass rates, multi-file tasks (SE), math accuracy, moderation precision/recall, retrieval hit rate.
- Red-team: jailbreak suites, prompt-leak checks, PHI/PII exfil tests; tune policies and refusals.
Hardware & Cost Notes
The VRAM requirements of each model vary. Seven-billion parameter models run comfortably on a 12–16 GB GPU, and with GGUF quantization can even be hosted on CPUs for lighter use. Thirteen-billion parameter models generally require 20–24 GB, while models in the 30B range or mixture-of-experts configurations may demand 48 GB or multiple GPUs with tensor or replica parallelism. In practice, the biggest throughput drivers are batching, context length, tokenizer efficiency, and kv-cache size.
- 7B models: single 12–16 GB GPU (or CPU via GGUF at lower throughput).
- 13B–34B models: 20–48 GB GPUs (or quantize/parallelize).
- MoE/32B+: multi-GPU or high-VRAM nodes; consider tensor/replica parallelism.
- Throughput drivers: batching, context length, tokenizer speed, kv-cache size.
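As a rough sanity check on the sizing above, here is a back-of-the-envelope estimator for weight memory only; kv-cache, activations, and serving overhead add more, and the bits-per-parameter figures are approximate rules of thumb.

# Weight memory ≈ parameters × bits-per-parameter / 8. Ignores kv-cache and runtime overhead.
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for name, params, bits in [
    ("7B @ FP16", 7, 16),
    ("7B @ ~4-bit GGUF", 7, 4.5),    # quantized formats carry some per-block overhead
    ("34B @ ~4-bit", 34, 4.5),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")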
Deployment Playbooks
Most organizations begin with a single-box prototype. Using Ollama and OpenWebUI, you can quickly test Qwen2.5-Coder, DeepSeek-R1, ReaderLM, or MythoMax without complex setup. For production, containerized deployments with vLLM or TGI behind Caddy become the norm. Each model runs as its own service, with unique routes, API keys, and Basic Auth credentials. Legal and healthcare endpoints can be placed under stricter logging and review workflows. Safety sidecars run at both ingress and egress to filter input and output. With this pattern, you can scale from a developer laptop to an enterprise cluster without changing the fundamental architecture.
Playbook A — Single Box (Prototype)
Ollama + OpenWebUI (GPU or CPU)
# Examples (expand as needed)
ollama pull qwen2.5-coder:7b-instruct
ollama pull qwen2.5-math:7b-instruct
ollama pull deepseek-r1:8b
# For engineering-scale: use larger code models via Ollama or vLLM
- Keep bindings to localhost; place a simple proxy with auth in front if you must expose it.
Playbook B — Production (vLLM/TGI + Caddy + Guard + RAG + Tools)
- Separate endpoints for coder vs software-engineering models (different system prompts); see the client sketch after this list.
- Keep Llama-Guard on both ingress and egress.
- Add /law and /healthcare routes with stricter policies and logging rules.
- Use compose for reproducible deployments and per-route Basic Auth/API keys.
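A small client-side sketch of that per-route separation, referenced above: each role gets its own base URL, key, and system prompt behind Caddy. URLs, keys, and model names are placeholders for your deployment.

# One helper, many routes: role-specific endpoints with their own credentials and system prompts.
from openai import OpenAI

ROUTES = {
    "coder":       ("https://llm.internal.example/coder/v1", "CODER_KEY", "qwen2.5-coder-32b"),
    "engineering": ("https://llm.internal.example/eng/v1",   "ENG_KEY",   "deepseek-coder-v2-instruct"),
    "law":         ("https://llm.internal.example/law/v1",   "LAW_KEY",   "legal-lora-llama-3.1"),
}

def ask(route: str, system: str, user: str) -> str:
    base_url, key, model = ROUTES[route]
    client = OpenAI(base_url=base_url, api_key=key)
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    return out.choices[0].message.content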
Industry Applications
Marketing agencies might combine coding, creative writing, and safety models to build brand-safe content pipelines. Legal firms benefit from law-specific LoRA models integrated with internal RAG databases of contracts and case law. Healthcare providers can use biomedical LLMs for literature review or clinical note summarization, but always with PHI safeguards and clinician review. Engineering teams, meanwhile, get the most value from separating coding and software engineering models: one handles snippet completion, while the other supports architecture decisions, refactors, and documentation.
- Agency: coding + creative + safety; RAG on brand assets; LoRA for tone.
- Legal/Finance: strict moderation, PII redaction, air-gapped cache; RAG on controlled corpora; LoRA on internal documents.
- Healthcare: PHI-aware logging, restricted prompts, clinician-in-the-loop; LoRA on approved clinical notes.
- Product/Eng: software-engineering model gated by unit/integration tests; repo-aware suggestions; ADR (architecture decision record) generation.
1. Software Engineering / Full-Stack Development
For more than just code completion — think architecture decisions, code review, refactoring, documentation generation, and integration advice — you want models that combine code + general reasoning:
- DeepSeek-Coder-V2-Instruct (16B or 33B): Excellent at multi-language coding, understands frameworks, can walk through architecture tradeoffs.
- CodeLlama-34B-Instruct: Strong on large codebases and multi-file reasoning.
- Qwen2.5-Coder-32B (the larger sibling of the coder models above): Larger context + better reasoning for big projects.
Deployment: Serve these alongside your math and generalist models; give them more VRAM and longer context for multi-file reasoning. Keep them in a separate “engineering” endpoint so your prompt templates can differ from “coding snippet” use cases.
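A hedged sketch of what "more VRAM and longer context" looks like with vLLM's offline Python API; the model id, window, and GPU count are assumptions to size against your own hardware, and in production you would expose the same model through vLLM's OpenAI-compatible server instead.

# Larger engineering model: longer window and tensor parallelism across two GPUs (illustrative values).
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",   # assumed HF repo id; mirror it internally
    max_model_len=32768,           # longer window for multi-file reasoning
    tensor_parallel_size=2,        # split weights across GPUs if one card is not enough
)
params = SamplingParams(temperature=0.2, max_tokens=1024)
outputs = llm.generate(["Review this module for thread-safety issues:\n<paste code here>"], params)
print(outputs[0].outputs[0].text)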
2. Law-Specialized Models
For legal tasks like case summarization, contract review, or statute Q&A, you want a model trained on legal corpora:
- Law-GPT / Legal-BERT / LexLM — domain-specific open models with legal vocab & reasoning.
- Pythia-Legal / LegalLLaMA — open fine-tunes of LLaMA on U.S. case law + contracts.
- BloombergGPT (closed weights, but the concept shows the value of finance/law-domain training — for law, you’d emulate it with open-weight models).
Note: You can also take Llama 3.1 or Qwen2.5 and LoRA-finetune on your jurisdiction’s case law & statutes for better results while keeping them private.
3. Healthcare / Biomedical Models
For clinical language, drug names, and medical reasoning:
- BioMistral — tuned for biomedical + clinical text.
- BioGPT / PubMedBERT — Microsoft’s biomedical LMs.
- Clinical Camel / MedAlpaca — tuned for medical dialogues & diagnosis explanation.
- Med-PaLM 2 (Google) — not open weights, but open analogues like OpenBioLLM are emerging.
HIPAA note: Even with a medical LLM, you still have to wrap it in the same safety + logging + governance stack we’ve described — and ensure you don’t store PHI in logs unless encrypted + access-controlled.
4. Integration into Our Stack
We’d simply add three more “job” rows to the Models by Job-to-be-Done matrix above:
- Software Engineering — DeepSeek-Coder-V2-Instruct, CodeLlama-34B, Qwen2.5-Coder-32B.
- Law — Law-GPT, Legal-BERT, LexLM, or LoRA-tuned Llama/Qwen.
- Healthcare — BioMistral, BioGPT, Clinical Camel, MedAlpaca.
They’d follow the same “Proto → Prod” pattern (Ollama for quick use, vLLM/TGI for production), get their own Basic Auth route in Caddy, and be quantized and served according to available VRAM.
FAQ
Can this run fully offline? Yes. Mirror images/models to an internal registry and a local model cache, then block egress.
How private is it? When configured as above, no external inference calls are made; logs can be redacted, encrypted, and access-controlled.
Do I need domain-specific models for law/healthcare? They help, but LoRA-tuned generalists often outperform small niche models on your own corpus. Keep humans in the loop for regulated outputs.
Conclusion
A private LLM stack allows organizations to enjoy the benefits of modern AI while keeping sensitive data under their own control. By combining generalist assistants with models purpose-built for coding, software engineering, math, content conversion, moderation, creativity, law, and healthcare, you can meet a wide range of needs. The path from prototyping to production is straightforward: start small with Ollama, expand into vLLM/TGI and Caddy for serving and security, add RAG and tools for agentic workflows, and wrap everything with observability and governance. Done correctly, this approach provides enterprise-grade AI capabilities—without surrendering privacy or compliance.