// SERVICE — CLAUDE CODE PRACTICE
Private Claude Code Deployment
Run Claude Code or OpenCode against a private local AI model behind your firewall. One hundred percent private prompts and processing. Zero data egress. No cloud dependency. Built for healthcare, legal, financial services, government contractors, and IP-sensitive organizations that want agentic coding productivity without giving up data sovereignty.
The problem with cloud Claude Code for regulated work
Claude Code is one of the strongest agentic coding tools available in 2026. Run against Anthropic's cloud API, it ships production software ten times faster than traditional development. But for many organizations, cloud execution is a non-starter:
- Healthcare — HIPAA-protected health information (PHI) cannot leave the covered entity's controlled infrastructure. Every prompt that includes PHI context is a potential breach.
- Legal — attorney-client privilege attaches to every document your coding agent touches. Cloud execution puts that privilege at risk.
- Financial services — SOC 2, PCI DSS, and SEC requirements limit where customer data and trading systems can be processed.
- Government contractors and defense-adjacent firms — CUI, ITAR, EAR, and classification-adjacent work cannot run on commercial cloud APIs.
- IP-sensitive organizations — the codebase is the competitive advantage. Sending it to a third party for AI inference is unacceptable regardless of contractual assurances.
Private Claude Code Deployment solves this by pointing the Claude Code CLI at a local OpenAI-compatible endpoint serving a model that runs on your own hardware, inside your own network. The Claude Code interface is identical; the data never leaves your firewall.
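Before any agent is wired in, the endpoint itself can be smoke-tested with one HTTP request. A minimal sketch, assuming an Ollama backend on its default port with a qwen2.5-coder model already pulled (host, port, and model tag are placeholders for your deployment):

```python
# Smoke-test a local OpenAI-compatible endpoint before pointing a
# coding agent at it. Host, port, and model tag are assumptions:
# substitute whatever your deployment actually serves.
import requests

ENDPOINT = "http://localhost:11434/v1/chat/completions"  # Ollama's default port

resp = requests.post(
    ENDPOINT,
    json={
        "model": "qwen2.5-coder:32b",
        "messages": [{"role": "user", "content": "Reverse a string in Python, one line."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

One wiring detail worth knowing: Claude Code natively speaks Anthropic's Messages API, so deployments like this typically put a thin translation proxy in front of the OpenAI-compatible backend, while agents such as OpenCode and Aider can talk to the endpoint above directly.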
How the deployment works
01. Backend inference
We stand up an OpenAI-compatible inference endpoint on your hardware. The runtime depends on your scale: Ollama for operational simplicity and single-workstation deployments, vLLM for multi-engineer throughput, llama.cpp for edge and CPU deployments. GPU allocation, batching, KV-cache sizing, and context-window policy are tuned per deployment.
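The reason vLLM wins at multi-engineer scale is continuous batching: one GPU serves many concurrent requests without serializing them. A rough sanity check, sketched with the openai client pointed at an assumed local vLLM port and model name (both placeholders):

```python
# Rough concurrency check: fire N requests at once and time the batch.
# On a well-tuned backend this completes in far less than N times the
# single-request latency. Endpoint and model name are assumptions.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused-locally")

async def one_request(i: int) -> None:
    await client.chat.completions.create(
        model="Qwen/Qwen2.5-Coder-32B-Instruct",
        messages=[{"role": "user", "content": f"Summarize request {i} in one sentence."}],
        max_tokens=64,
    )

async def main(n: int = 8) -> None:
    start = time.perf_counter()
    await asyncio.gather(*(one_request(i) for i in range(n)))
    print(f"{n} concurrent requests in {time.perf_counter() - start:.1f}s")

asyncio.run(main())
```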
02. Model selection
We shortlist candidate open-weight coding models and benchmark them against your actual workloads:
- Qwen 2.5 Coder (7B, 14B, 32B) — currently the strongest open-weight coding family per public benchmarks
- DeepSeek Coder V2 (lite + full) — strong on instruction-following and long-context code tasks
- Llama 3 Instruct (8B, 70B) — safe default for teams that want a general-purpose model
- Mistral Large — strong tool-use performance (weights ship under Mistral's research license rather than a fully open license)
- StarCoder 2 (15B) — code-specialized, lighter-weight
03. Claude Code integration
We configure Claude Code to route to your local endpoint. Custom skills, hooks, and MCP servers specific to your stack are developed during the engagement. OpenCode, Aider, and Continue.dev can be wired to the same backend for teams that want multiple coding-agent options.
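Because all of these agents speak the same OpenAI-compatible wire format, a quick way to confirm a shared backend is ready for them is to list the models it exposes (endpoint URL is a placeholder):

```python
# List the models a shared local backend exposes. Every agent wired to
# this endpoint (Claude Code via proxy, OpenCode, Aider, Continue.dev)
# selects from this same list.
import requests

models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
for m in models["data"]:
    print(m["id"])
```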
04. Access control and audit
We add per-engineer authentication (Mid and Large tiers integrate with your SSO), per-team rate limiting, and full audit logging of prompts, completions, tool invocations, and costs. Audit logs ship to your SIEM (Splunk, Elastic, Datadog, Grafana Loki) for compliance review.
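To make the audit layer concrete, here is a minimal sketch (an illustration, not the production implementation, which depends on your gateway and SIEM): a FastAPI reverse proxy in front of an assumed vLLM backend that emits one JSON audit record per completion for a log shipper to pick up:

```python
# Minimal audit-logging reverse proxy. Forwards chat completions to an
# assumed upstream (vLLM on :8000) and prints one JSON record per call.
# Run with: uvicorn audit_proxy:app --port 9000
import hashlib
import json
import time

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
upstream = httpx.AsyncClient(base_url="http://localhost:8000", timeout=120.0)

@app.post("/v1/chat/completions")
async def proxy(request: Request) -> dict:
    body = await request.json()
    start = time.time()
    resp = await upstream.post("/v1/chat/completions", json=body)
    payload = resp.json()
    record = {
        "ts": int(start),
        "engineer": request.headers.get("x-engineer-id", "unknown"),  # set by your auth layer
        "model": body.get("model"),
        # Hash the prompt here; full content can go to a separate
        # restricted store if your compliance posture requires it.
        "prompt_sha256": hashlib.sha256(
            json.dumps(body.get("messages", []), sort_keys=True).encode()
        ).hexdigest(),
        "usage": payload.get("usage"),
        "latency_s": round(time.time() - start, 3),
    }
    print(json.dumps(record))  # stdout -> log shipper -> SIEM
    return payload
```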
05. Knowledge transfer
We hand over runbooks, deliver operational training for your sysadmin or devops team, and provide 30 days of post-launch support. Deployments can be bundled with our Team Enablement Workshop at a reduced rate for teams that want engineer-level training as part of the deployment.
Deployment tiers
Small — $40,000
- Single workstation or small server, one GPU (RTX 4090 / L40S / A100)
- 1-3 engineers sharing the endpoint
- 7B-30B parameter open-weight models
- Ollama backend, simple authentication, local audit logging
- Custom skills and hooks for the client's stack
- 1-2 week deployment timeline
Mid — $65,000
- Dedicated inference server (multi-GPU or A100/H100)
- Shared endpoint for 10-20 engineers
- 30B-70B parameter models with quantization strategy
- vLLM backend for concurrency, SSO integration
- Per-engineer access control, SIEM-compatible audit logging
- MCP integrations for 2-3 client systems
- 3-4 week deployment timeline
Large — $100,000
- Multi-node, high-throughput inference cluster
- Load-balanced Claude Code endpoints
- 70B+ parameter models with production quantization
- Full observability stack (Prometheus, Grafana, cost tracking)
- CI/CD for model updates, automated eval harness
- Security review, threat model, governance framework
- 30-day post-launch support; optional maintenance retainer
- Air-gapped delivery available
- 6-10 week deployment timeline
Who this is for
- Healthcare — HIPAA-covered entities that want coding agents that never touch PHI
- Legal practices — attorney-client privileged codebases and discovery platforms
- Financial services — trading systems, customer data pipelines, compliance-heavy workflows
- Government contractors — CUI, ITAR/EAR-adjacent work, pre-classification environments
- Defense-adjacent — contractors who will eventually need the capability in fully air-gapped networks
- Biotech and pharma — IP-sensitive research codebases
- Robotics companies — proprietary control systems and simulation pipelines
- Any organization whose codebase is the competitive advantage
Relationship to our other services
This service builds on our existing Local AI Deployment service and adds the Claude-Code-specific integration layer on top. Existing Local AI Deployment clients can upgrade to Private Claude Code Deployment at a reduced rate. For teams that want training on the deployed infrastructure, add our Training & Enablement service. For ongoing operation by our team rather than yours, see the Claude Code Agency retainer.
Frequently asked questions
What is a private Claude Code deployment?
Running Claude Code or OpenCode against an open-weight AI model hosted on your own hardware, inside your own network. No prompts, context, or completions leave your infrastructure. The CLI points at a local OpenAI-compatible endpoint (Ollama, vLLM, or equivalent) instead of the Anthropic cloud API.
Which AI models can power it?
Any open-weight model with strong coding capability: Qwen 2.5 Coder, DeepSeek Coder V2, Llama 3 Instruct, Mistral Large, Gemma 3, StarCoder 2. We benchmark candidates against your actual workloads during the engagement.
Does it work with OpenCode and other coding agents?
Yes. Any coding agent with OpenAI-compatible endpoint support — Claude Code, OpenCode, Aider, Continue.dev — can be wired to the same private local backend.
What hardware do we need?
Small (1-3 engineers, 7B-30B models): one RTX 4090 / L40S / A100. Mid (10-20 engineers, 30B-70B models): multi-GPU or single H100. Large (70B+, high concurrency): H100 or H200 cluster. Apple Silicon Ultra works for Mac shops at Small/Mid. We spec exact hardware in discovery.
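The tier boundaries above follow from simple VRAM arithmetic: weight memory is roughly parameter count times bytes per parameter, plus headroom for KV cache and runtime overhead. A back-of-envelope sketch (rules of thumb, not a sizing guarantee):

```python
# Back-of-envelope VRAM estimate: weights = params * bits / 8, plus
# ~30% headroom for KV cache, activations, and runtime overhead.
# Rules of thumb only; real sizing happens in discovery.
def vram_estimate_gb(params_billion: float, bits_per_param: int, headroom: float = 1.3) -> float:
    weights_gb = params_billion * bits_per_param / 8  # 1B params at 8-bit ~ 1 GB
    return round(weights_gb * headroom, 1)

for params, bits in [(32, 4), (32, 8), (70, 4), (70, 8)]:
    print(f"{params}B @ {bits}-bit ~= {vram_estimate_gb(params, bits)} GB")
```

This is why a 32B model quantized to 4-bit fits a single 24 GB RTX 4090 only tightly, while 70B-class models at any useful precision push you onto 80 GB-class cards or multi-GPU rigs.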
How is this different from cloud Claude Code?
Two things: data governance (nothing leaves your network, versus Anthropic processing every prompt) and model choice (any open-weight model, versus Anthropic's Claude family). Cloud is easier to stand up, and Claude models are arguably stronger at coding; private gives you sovereignty over your data and your model. For regulated or IP-sensitive work, private is the only viable option.
Can we run fully air-gapped?
Yes. Fully air-gapped deployments are supported at the Large tier. Model weights are downloaded and verified ahead of installation; the deployment then runs with no outbound connectivity. Model updates happen via manual air-gap transfer.
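One concrete piece of the "downloaded and verified" step, sketched: hash every weight file against a manifest generated on the connected side before the transfer media is accepted (manifest name and format are assumptions; tools like sha256sum produce the same layout):

```python
# Verify model weight files against a SHA-256 manifest before an
# air-gap transfer is accepted. Manifest lines: "<hexdigest>  <path>",
# generated on the connected side (e.g. with sha256sum).
import hashlib
import sys
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

failures = 0
for line in Path("weights.sha256").read_text().splitlines():
    digest, name = line.split(maxsplit=1)
    if sha256_of(Path(name.strip())) != digest:
        print(f"MISMATCH: {name}")
        failures += 1

sys.exit(1 if failures else 0)
```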
How long does deployment take?
Small: 1-2 weeks. Mid: 3-4 weeks. Large: 6-10 weeks. The variable is usually your internal process (procurement, security review, network access), not the technical work on our side.
Do you train our team on the deployment?
Every deployment includes runbooks, operational training for your devops team, and 30 days of post-launch support. For deeper engineer-level Claude Code training, add our Training & Enablement service (it can be bundled at a reduced rate).
Ready to deploy privately?
Book a free 30-minute readiness call to discuss your team size, compliance posture, and hardware. We'll tell you honestly which tier fits and whether private deployment is the right answer for you.