Keep models on disk. Mirror them into RAM. Point your inference engine at the fast copy. Verify every step.
The canonical model files always live on persistent storage. At boot, they're synced into /dev/shm (a tmpfs filesystem backed by RAM). Your inference engine reads from the RAM copy for maximum throughput.
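Before relying on this, it is worth confirming that /dev/shm really is tmpfs on your machine; a quick check straight from /proc/mounts:

```shell
#!/usr/bin/env sh
# Report the filesystem type backing /dev/shm (expected: tmpfs).
fstype=$(awk '$2 == "/dev/shm" { print $3; exit }' /proc/mounts)
echo "/dev/shm is backed by: $fstype"

# Show the mount's size and current usage.
df -h /dev/shm
```

If this prints anything other than `tmpfs`, the "fast copy" would just be another disk directory.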
Configure your paths and shared memory size, then verify your system is ready.
- Check the current /dev/shm size with `df -h /dev/shm`.
- Resize it for the current session:

  ```bash
  sudo mount -o remount,size=64g /dev/shm
  ```

- Make the size persistent across reboots by adding this line to `/etc/fstab`:

  ```
  tmpfs /dev/shm tmpfs defaults,size=64g 0 0
  ```

- Install `rsync` if not present.
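Before committing to a size like `64g`, it helps to sanity-check the model directory's footprint against total RAM. A minimal sketch (the `check_fit` helper and its 50% threshold are our own assumptions, not part of the setup above):

```shell
#!/usr/bin/env sh
# check_fit DIR - report DIR's size and warn if it exceeds half of total RAM.
check_fit() {
  dir=$1
  model_bytes=$(du -sb "$dir" | awk '{ print $1 }')
  ram_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
  ram_bytes=$((ram_kb * 1024))
  echo "models: ${model_bytes} bytes, ram: ${ram_bytes} bytes"
  if [ "$model_bytes" -gt $((ram_bytes / 2)) ]; then
    echo "WARNING: models exceed 50% of RAM; size /dev/shm carefully."
    return 1
  fi
}

# Example usage (replace with your model directory):
# check_fit /data/models
```

Remember that tmpfs pages compete with everything else for RAM: the OS, the inference engine's own allocations, and KV-cache all need headroom.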
The load script mirrors the canonical copy into shared memory:

```bash
#!/usr/bin/env bash
set -euo pipefail

SRC="/data/models"
DST="/dev/shm/models"

mkdir -p "$DST"
rsync -ah --inplace --delete --info=progress2 "$SRC/" "$DST/"
echo "Models loaded into shared memory."
```
To run the load at boot, wrap it in a systemd oneshot unit (saved, e.g., as `/etc/systemd/system/shm-models.service`):

```ini
[Unit]
Description=Load LLM weights into /dev/shm
After=local-fs.target
Before=ollama.service

[Service]
Type=oneshot
ExecStart=/usr/bin/rsync -a --inplace --delete /data/models/ /dev/shm/models/
ExecStop=/bin/rm -rf /dev/shm/models
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```
Run this in your terminal to confirm models are loaded:
```bash
ls -lh /dev/shm/models/
df -h /dev/shm

echo "---"
echo "Files on disk:"
ls -lh /data/models/
echo "---"
echo "Files in shm:"
ls -lh /dev/shm/models/
echo "---"

diff <(ls /data/models/) <(ls /dev/shm/models/) && echo "✓ In sync" || echo "✗ Out of sync"
```
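The diff above only compares filenames. A stricter check compares content checksums, which also catches a truncated or corrupted copy; a sketch (bash, and `in_sync` is our own helper name):

```shell
#!/usr/bin/env bash
# in_sync SRC DST - compare file contents (not just names) between two trees.
# Prints "in sync" and returns 0 when every file matches, 1 otherwise.
in_sync() {
  local src=$1 dst=$2
  if diff <(cd "$src" && find . -type f -exec sha256sum {} + | sort -k2) \
          <(cd "$dst" && find . -type f -exec sha256sum {} + | sort -k2) \
          > /dev/null; then
    echo "in sync"
  else
    echo "out of sync"
    return 1
  fi
}

# Example usage (paths are illustrative):
# in_sync /data/models /dev/shm/models
```

Checksumming tens of gigabytes takes a while, so this suits a post-boot sanity check rather than a frequent cron job.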
Each engine needs its own configuration. The sections below cover engine-specific setup and the verification tests to run against it.
Ollama uses its own internal blob store. You can redirect it via the OLLAMA_MODELS environment variable or a symlink.
A systemd drop-in (e.g. via `sudo systemctl edit ollama`) sets the environment variable and orders Ollama after the shm load:

```ini
[Unit]
After=shm-models.service
Requires=shm-models.service

[Service]
Environment="OLLAMA_MODELS=/dev/shm/ollama-models"
```
For the symlink approach, migrate the existing blob store once:

```bash
# Stop Ollama and snapshot its existing blob store to disk
sudo systemctl stop ollama
sudo rsync -a /usr/share/ollama/.ollama/models/ /data/ollama-models/

# Mirror the blobs into shared memory
sudo mkdir -p /dev/shm/ollama-models
sudo rsync -a --inplace --delete /data/ollama-models/ /dev/shm/ollama-models/

# Point Ollama's model directory at the shm copy and restart
sudo ln -sfn /dev/shm/ollama-models /usr/share/ollama/.ollama/models
sudo systemctl start ollama
```
With the model directory pointed at shm, new pulls land in RAM only, so write them back to disk on a schedule (crontab entry):

```
# Sync every 10 minutes
*/10 * * * * rsync -a --inplace /dev/shm/ollama-models/ /data/ollama-models/
```
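An `ollama pull` can still be writing blobs while the cron job runs, so it is worth guarding the write-back with a lock. A portable sketch using an atomic `mkdir` as the lock (`sync_back` is our own name, and `cp -a` stands in for `rsync` so the example is self-contained):

```shell
#!/usr/bin/env sh
# sync_back SRC DST LOCKDIR - mirror SRC into DST unless another run is active.
# mkdir either creates the lock directory atomically or fails: a portable lock.
sync_back() {
  src=$1 dst=$2 lock=$3
  if ! mkdir "$lock" 2>/dev/null; then
    echo "skipped: another sync is running"
    return 1
  fi
  mkdir -p "$dst"
  cp -a "$src/." "$dst/"   # production: rsync -a --inplace "$src/" "$dst/"
  rmdir "$lock"
  echo "synced"
}

# Example usage (paths are illustrative):
# sync_back /dev/shm/ollama-models /data/ollama-models /var/lock/ollama-sync
```

In production, wrapping the rsync in `flock -n <lockfile>` achieves the same in one line.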
Run these commands to confirm Ollama is reading from shm:
```bash
# Check where Ollama thinks its models are
ollama list

# Verify the OLLAMA_MODELS env var is set
systemctl show ollama | grep -i environment

# Check if symlink points to shm
ls -la /usr/share/ollama/.ollama/models

# Watch file access in real-time (run while generating)
sudo strace -e openat -p $(pgrep ollama) 2>&1 | grep -i "shm\|model"

# Verify shm has Ollama blobs
du -sh /dev/shm/ollama-models/
```
The simplest setup: just point `-m` at the shm path. Use `--no-mmap` to have llama.cpp read the weights directly rather than memory-mapping a file that is already resident in RAM.
```bash
llama-server \
  -m /dev/shm/models/your-model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --no-mmap
```
As a systemd service:

```ini
[Unit]
Description=llama.cpp API server
After=shm-models.service
Requires=shm-models.service

[Service]
ExecStart=/usr/local/bin/llama-server \
    -m /dev/shm/models/your-model.gguf \
    --host 0.0.0.0 --port 8080 \
    --n-gpu-layers 99 --no-mmap --ctx-size 8192
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
```
```bash
# Confirm the process has shm files open
sudo lsof -p $(pgrep llama-server) | grep shm

# Check memory-mapped regions
sudo cat /proc/$(pgrep llama-server)/maps | grep shm

# Quick benchmark: time loading from disk vs shm
time llama-cli -m /data/models/model.gguf -p "test" -n 1 --n-gpu-layers 0
time llama-cli -m /dev/shm/models/model.gguf -p "test" -n 1 --n-gpu-layers 0 --no-mmap
```
vLLM loads HuggingFace-format models (safetensors) and pushes them into GPU VRAM. The shm benefit is a faster initial load; once the weights are in VRAM, inference speed is the same.
```bash
python -m vllm.entrypoints.openai.api_server \
  --model /dev/shm/models/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --max-model-len 8192
```
```bash
# Time the vLLM server startup from disk
time python -m vllm.entrypoints.openai.api_server \
  --model /data/models/Meta-Llama-3-70B-Instruct \
  --max-model-len 512 &   # short context for quick test
# Wait until "Uvicorn running" appears, then Ctrl+C

# Time the vLLM server startup from shm
time python -m vllm.entrypoints.openai.api_server \
  --model /dev/shm/models/Meta-Llama-3-70B-Instruct \
  --max-model-len 512 &

# Check GPU memory usage
nvidia-smi
```
HuggingFace's production inference server. Docker deployment is most common — mount the shm model directory as a volume.
```bash
docker run --gpus all --shm-size 64g \
  -v /dev/shm/models:/models \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /models/Meta-Llama-3-70B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192
```
The `--shm-size` flag sizes the container's own shared memory, which PyTorch uses for inter-process communication during tensor parallelism; it is separate from your model-loading shm strategy. You need both.
Multi-format engine supporting GGUF, HuggingFace, and more. Point the models directory at shm.
Docker:

```bash
docker run -p 8080:8080 \
  -v /dev/shm/models:/models \
  localai/localai:latest
```

Or with the binary:

```bash
local-ai --models-path /dev/shm/models --address :8080
```
GGUF-based inference with the same shm advantages as llama.cpp: weights are read from RAM continuously during generation, so the fast copy pays off on every token.
```bash
koboldcpp \
  --model /dev/shm/models/model.gguf \
  --contextsize 8192 \
  --gpulayers 99 \
  --host 0.0.0.0 \
  --port 5001
```
| Event | Direction | Command |
|---|---|---|
| Boot / manual load | Disk → SHM | `rsync -a --inplace --delete /data/models/ /dev/shm/models/` |
| `ollama pull` | SHM → Disk | `rsync -a --inplace /dev/shm/ollama-models/ /data/ollama-models/` |
| Model updated on disk | Disk → SHM | Re-run boot load script |
| Shutdown | Clean SHM | `rm -rf /dev/shm/models` (systemd `ExecStop`) |
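The lifecycle in the table can be collapsed into one helper; a sketch (the `shm_models` name and the `load|save|clean` interface are ours, with `cp -a` standing in for the rsync calls so the example is self-contained):

```shell
#!/usr/bin/env sh
# shm_models load|save|clean - one entry point for the sync lifecycle.
#   load : disk -> shm (boot, or model updated on disk)
#   save : shm -> disk (after a pull lands in the RAM copy)
#   clean: remove the shm copy (shutdown)
shm_models() {
  disk=${DISK:-/data/models}
  shm=${SHM:-/dev/shm/models}
  case $1 in
    load)  mkdir -p "$shm"  && cp -a "$disk/." "$shm/" ;;
    save)  mkdir -p "$disk" && cp -a "$shm/." "$disk/" ;;
    clean) rm -rf "$shm" ;;
    *)     echo "usage: shm_models load|save|clean" >&2; return 2 ;;
  esac
}
```

One entry point also makes the systemd wiring simpler: `ExecStart` calls `load`, `ExecStop` calls `clean`.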
| Engine | Format | Benefit | Why |
|---|---|---|---|
| llama.cpp (CPU) | GGUF | ● High | Weights read from RAM every token |
| llama.cpp (GPU split) | GGUF | ● Medium | Faster layer upload to VRAM |
| Ollama | GGUF blobs | ● High | Same as llama.cpp; faster model swap |
| vLLM | Safetensors | ● Load only | Once in VRAM, shm irrelevant |
| TGI | Safetensors | ● Load only | Same as vLLM |
| koboldcpp | GGUF | ● High | Same as llama.cpp |
| LocalAI | Mixed | ● High | Often uses llama.cpp backend |