/dev/shm • Performance Guide

Serve LLM Weights
from Shared Memory

Keep models on disk. Mirror them into RAM. Point your inference engine at the fast copy. Verify every step.

How It Works

The canonical model files always live on persistent storage. At boot, they're synced into /dev/shm (a tmpfs filesystem backed by RAM). Your inference engine reads from the RAM copy for maximum throughput.
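Before relying on the RAM copy, it's worth confirming that /dev/shm really is a tmpfs mount and seeing how much room it currently has. An unprivileged check (GNU stat, Linux):

```shell
# Filesystem type of /dev/shm; "tmpfs" means RAM-backed
fstype=$(stat -f -c %T /dev/shm)
echo "/dev/shm type: $fstype"

# Current size and usage
df -h /dev/shm
```

The default tmpfs size is typically half of physical RAM, which is why the resize step below matters for large models.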

Persistent Disk (/data/models/)  →  rsync  →  Shared Memory (/dev/shm/models/)  →  read  →  Inference Engine (ollama / vllm / …)

Prerequisites & Setup

Configure your paths and shared memory size, then verify your system is ready.

Resize /dev/shm Permanently

/etc/fstab
tmpfs  /dev/shm  tmpfs  defaults,size=64g  0  0
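The fstab entry only takes effect at the next boot. To resize the live mount immediately, remount with the same size value (64g here mirrors the example above):

```shell
# Resize the live tmpfs mount; no reboot required
sudo mount -o remount,size=64g /dev/shm

# Confirm the new size took effect
df -h /dev/shm
```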

Load Script

load_models_to_shm.sh
#!/usr/bin/env bash
set -euo pipefail

SRC="/data/models"
DST="/dev/shm/models"

# Fail early if the model set will not fit in shared memory
need=$(du -sb "$SRC" | cut -f1)
avail=$(df -B1 --output=avail /dev/shm | tail -n 1)
if [ "$need" -gt "$avail" ]; then
  echo "Not enough space in /dev/shm: need ${need} bytes, have ${avail}" >&2
  exit 1
fi

mkdir -p "$DST"
rsync -ah --inplace --delete --info=progress2 "$SRC/" "$DST/"
echo "Models loaded into shared memory."

systemd Service (Auto-load at Boot)

/etc/systemd/system/shm-models.service
[Unit]
Description=Load LLM weights into /dev/shm
After=local-fs.target
Before=ollama.service

[Service]
Type=oneshot
ExecStart=/usr/bin/rsync -a --inplace --delete /data/models/ /dev/shm/models/
ExecStop=/bin/rm -rf /dev/shm/models
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Verify SHM Contents

Run this in your terminal to confirm models are loaded:

terminal
df -h /dev/shm
echo "---"
echo "Files on disk:"
ls -lh /data/models/
echo "---"
echo "Files in shm:"
ls -lh /dev/shm/models/
echo "---"
diff <(ls /data/models/) <(ls /dev/shm/models/) && echo "✓ In sync" || echo "✗ Out of sync"

Engine Configuration

Each engine below has its own configuration and verification steps.

Ollama: blob-store model management; needs an env var or symlink redirect. [High SHM benefit]
llama.cpp: direct GGUF path; add --no-mmap for the full shm advantage. [High SHM benefit]
vLLM: HuggingFace safetensors; faster cold start from shm → GPU. [Load-time benefit]
TGI (HuggingFace): Docker or native; mount a shm volume for the model directory. [Load-time benefit]
LocalAI: multi-format support; point --models-path at shm. [High SHM benefit]
koboldcpp: GGUF-based; same benefits as llama.cpp. [High SHM benefit]

Ollama Configuration

Ollama uses its own internal blob store. You can redirect it via the OLLAMA_MODELS environment variable or a symlink. Whichever option you choose, the boot loader must also mirror the Ollama store itself, e.g. by adding a second rsync of /data/ollama-models/ into /dev/shm/ollama-models/ alongside the /data/models sync.

Option A: Environment Variable (Recommended)

/etc/systemd/system/ollama.service.d/override.conf
[Unit]
After=shm-models.service
Requires=shm-models.service

[Service]
Environment="OLLAMA_MODELS=/dev/shm/ollama-models"

Option B: Symlink

terminal
sudo systemctl stop ollama
sudo rsync -a /usr/share/ollama/.ollama/models/ /data/ollama-models/
sudo mkdir -p /dev/shm/ollama-models
sudo rsync -a --inplace --delete /data/ollama-models/ /dev/shm/ollama-models/
sudo ln -sfn /dev/shm/ollama-models /usr/share/ollama/.ollama/models
sudo systemctl start ollama

Sync New Pulls Back to Disk

crontab
# Sync every 10 minutes
*/10 * * * * rsync -a --inplace /dev/shm/ollama-models/ /data/ollama-models/

Verify Model Location

Run these commands to confirm Ollama is reading from shm:

terminal
# Check where Ollama thinks its models are
ollama list

# Verify the OLLAMA_MODELS env var is set
systemctl show ollama | grep -i environment

# Check if symlink points to shm
ls -la /usr/share/ollama/.ollama/models

# Watch file access in real-time (run while generating; pgrep -o picks the main daemon)
sudo strace -e openat -p $(pgrep -o ollama) 2>&1 | grep -i "shm\|model"

# Verify shm has Ollama blobs
du -sh /dev/shm/ollama-models/
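
After the file-level checks, a generation request against the Ollama HTTP API (default port 11434) confirms end-to-end operation. llama3 below is a placeholder for a model you have actually pulled:

```shell
# Non-streaming generation; a healthy server returns JSON with a "response" field
reply=$(curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3","prompt":"Say hi","stream":false}' || true)
echo "${reply:-no response - is ollama running?}"
```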

llama.cpp / llama-server

The simplest setup: just point -m at the shm path. --no-mmap reads the weights into the process's own memory instead of mapping the tmpfs file; some workloads prefer this, but it keeps a second copy of the model in RAM for the life of the process, so budget roughly 2× the model size.

terminal — llama-server
llama-server \
  -m /dev/shm/models/your-model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --no-mmap

/etc/systemd/system/llamacpp.service
[Unit]
Description=llama.cpp API server
After=shm-models.service
Requires=shm-models.service

[Service]
ExecStart=/usr/local/bin/llama-server \
  -m /dev/shm/models/your-model.gguf \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 99 --no-mmap --ctx-size 8192
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target

Verify File Source

terminal
# Confirm the process has shm files open
sudo lsof -p $(pgrep llama-server) | grep shm

# Check memory-mapped regions
sudo cat /proc/$(pgrep llama-server)/maps | grep shm

# Quick benchmark: time loading from disk vs shm
time llama-cli -m /data/models/model.gguf -p "test" -n 1 --n-gpu-layers 0
time llama-cli -m /dev/shm/models/model.gguf -p "test" -n 1 --n-gpu-layers 0 --no-mmap
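
A quick check that the server itself is answering (port 8080 as configured above):

```shell
# Health probe, then a tiny completion request
health=$(curl -s http://localhost:8080/health || true)
echo "${health:-no response - is llama-server running?}"

curl -s http://localhost:8080/completion \
  -d '{"prompt":"Say hi","n_predict":8}' \
  || echo "completion request failed"
```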

vLLM

vLLM loads HuggingFace-format models (safetensors) and pushes them onto GPU VRAM. The shm benefit is a faster initial load — once in VRAM, inference speed is the same.

terminal — vLLM OpenAI server
python -m vllm.entrypoints.openai.api_server \
  --model /dev/shm/models/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --max-model-len 8192

Note: Since vLLM copies weights to GPU VRAM, shm primarily accelerates cold-start time. For models >30GB this can save 30–60+ seconds of load time.

Verify & Benchmark Load Time

terminal
# Time server startup from disk: run in the foreground and hit Ctrl+C
# once "Uvicorn running" appears; the time builtin still prints the elapsed figure
time python -m vllm.entrypoints.openai.api_server \
  --model /data/models/Meta-Llama-3-70B-Instruct \
  --max-model-len 512   # short context for a quick test

# Repeat from shm and compare
time python -m vllm.entrypoints.openai.api_server \
  --model /dev/shm/models/Meta-Llama-3-70B-Instruct \
  --max-model-len 512

# Check GPU memory usage
nvidia-smi
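
Once the server is up, the OpenAI-compatible API can confirm which model path it loaded:

```shell
# Lists the served model; the id should be the /dev/shm/... path
models=$(curl -s http://localhost:8000/v1/models || true)
echo "${models:-no response - is vLLM running?}"
```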

Text Generation Inference (TGI)

HuggingFace's production inference server. Docker deployment is most common — mount the shm model directory as a volume.

terminal — Docker
docker run --gpus all --shm-size 64g \
  -v /dev/shm/models:/models \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /models/Meta-Llama-3-70B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192

Important: Docker's --shm-size flag is for PyTorch's internal shared memory (tensor parallelism), which is separate from your model-loading shm strategy. You need both.

Test TGI Endpoint
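
A minimal request against TGI's /generate endpoint (port 8080 as mapped in the Docker command above):

```shell
# Short generation request; a healthy server returns JSON with "generated_text"
reply=$(curl -s http://localhost:8080/generate \
  -X POST -H 'Content-Type: application/json' \
  -d '{"inputs":"Say hi","parameters":{"max_new_tokens":16}}' || true)
echo "${reply:-no response - is TGI running?}"
```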

LocalAI

Multi-format engine supporting GGUF, HuggingFace, and more. Point the models directory at shm.

terminal — Docker
docker run -p 8080:8080 \
  -v /dev/shm/models:/models \
  localai/localai:latest \
  --models-path /models

terminal — Native
local-ai --models-path /dev/shm/models --address :8080

Test LocalAI Endpoint
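
LocalAI speaks the OpenAI API, so listing models is the quickest check that it found the shm directory:

```shell
# Should list the files LocalAI discovered under the models path
models=$(curl -s http://localhost:8080/v1/models || true)
echo "${models:-no response - is LocalAI running?}"
```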

koboldcpp

GGUF-based inference. Same shm advantages as llama.cpp — continuous weight reads from RAM benefit greatly.

terminal
koboldcpp \
  --model /dev/shm/models/model.gguf \
  --contextsize 8192 \
  --gpulayers 99 \
  --host 0.0.0.0 \
  --port 5001

Test koboldcpp Endpoint
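
koboldcpp exposes the KoboldAI API; asking it which model is loaded confirms the shm path (port 5001 as above):

```shell
# Returns JSON naming the loaded model file
model=$(curl -s http://localhost:5001/api/v1/model || true)
echo "${model:-no response - is koboldcpp running?}"
```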

Sync Operations

Event                  | Direction  | Command
Boot / manual load     | Disk → SHM | rsync -a --inplace --delete /data/models/ /dev/shm/models/
ollama pull            | SHM → Disk | rsync -a --inplace /dev/shm/ollama-models/ /data/ollama-models/
Model updated on disk  | Disk → SHM | re-run the boot load script
Shutdown               | Clean SHM  | rm -rf /dev/shm/models (systemd ExecStop)

SHM Benefit by Engine

Engine                | Format      | Benefit   | Why
llama.cpp (CPU)       | GGUF        | High      | Weights read from RAM every token
llama.cpp (GPU split) | GGUF        | Medium    | Faster layer upload to VRAM
Ollama                | GGUF blobs  | High      | Same as llama.cpp; faster model swap
vLLM                  | Safetensors | Load only | Once in VRAM, shm irrelevant
TGI                   | Safetensors | Load only | Same as vLLM
koboldcpp             | GGUF        | High      | Same as llama.cpp
LocalAI               | Mixed       | High      | Often uses a llama.cpp backend