Keep models on disk. Mirror them into RAM. Point your inference engine at the fast copy. Verify every step.
The canonical model files always live on persistent storage. At boot, they're synced into /dev/shm (a tmpfs filesystem backed by RAM). Your inference engine reads from the RAM copy for maximum throughput.
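Before relying on this, it is worth confirming that /dev/shm really is tmpfs on your machine; a quick check straight from /proc/mounts:

```shell
#!/usr/bin/env sh
# Report the filesystem type backing /dev/shm (expected: tmpfs).
fstype=$(awk '$2 == "/dev/shm" { print $3; exit }' /proc/mounts)
echo "/dev/shm is backed by: $fstype"

# Show the mount's size and current usage.
df -h /dev/shm
```

If this prints anything other than `tmpfs`, the "fast copy" would just be another disk directory.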
Configure your paths and shared memory size, then verify your system is ready.
- Check the current /dev/shm size with `df -h /dev/shm`.
- Resize it for the current session:

  ```bash
  sudo mount -o remount,size=64g /dev/shm
  ```

- Make the size persistent across reboots by adding this line to `/etc/fstab`:

  ```
  tmpfs /dev/shm tmpfs defaults,size=64g 0 0
  ```

- Install `rsync` if not present.
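Before committing to a size like `64g`, it helps to sanity-check the model directory's footprint against total RAM. A minimal sketch (the `check_fit` helper and its 50% threshold are our own assumptions, not part of the setup above):

```shell
#!/usr/bin/env sh
# check_fit DIR - report DIR's size and warn if it exceeds half of total RAM.
check_fit() {
  dir=$1
  model_bytes=$(du -sb "$dir" | awk '{ print $1 }')
  ram_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
  ram_bytes=$((ram_kb * 1024))
  echo "models: ${model_bytes} bytes, ram: ${ram_bytes} bytes"
  if [ "$model_bytes" -gt $((ram_bytes / 2)) ]; then
    echo "WARNING: models exceed 50% of RAM; size /dev/shm carefully."
    return 1
  fi
}

# Example usage (replace with your model directory):
# check_fit /data/models
```

Remember that tmpfs pages compete with everything else for RAM: the OS, the inference engine's own allocations, and KV-cache all need headroom.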
The load script mirrors the canonical copy into shared memory:

```bash
#!/usr/bin/env bash
set -euo pipefail

SRC="/data/models"
DST="/dev/shm/models"

mkdir -p "$DST"
rsync -ah --inplace --delete --info=progress2 "$SRC/" "$DST/"
echo "Models loaded into shared memory."
```
To run the load at boot, wrap it in a systemd oneshot unit (saved, e.g., as `/etc/systemd/system/shm-models.service`):

```ini
[Unit]
Description=Load LLM weights into /dev/shm
After=local-fs.target
Before=ollama.service

[Service]
Type=oneshot
ExecStart=/usr/bin/rsync -a --inplace --delete /data/models/ /dev/shm/models/
ExecStop=/bin/rm -rf /dev/shm/models
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```
Run this in your terminal to confirm models are loaded:
```bash
ls -lh /dev/shm/models/
df -h /dev/shm

echo "---"
echo "Files on disk:"
ls -lh /data/models/
echo "---"
echo "Files in shm:"
ls -lh /dev/shm/models/
echo "---"

diff <(ls /data/models/) <(ls /dev/shm/models/) && echo "✓ In sync" || echo "✗ Out of sync"
```
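The diff above only compares filenames. A stricter check compares content checksums, which also catches a truncated or corrupted copy; a sketch (bash, and `in_sync` is our own helper name):

```shell
#!/usr/bin/env bash
# in_sync SRC DST - compare file contents (not just names) between two trees.
# Prints "in sync" and returns 0 when every file matches, 1 otherwise.
in_sync() {
  local src=$1 dst=$2
  if diff <(cd "$src" && find . -type f -exec sha256sum {} + | sort -k2) \
          <(cd "$dst" && find . -type f -exec sha256sum {} + | sort -k2) \
          > /dev/null; then
    echo "in sync"
  else
    echo "out of sync"
    return 1
  fi
}

# Example usage (paths are illustrative):
# in_sync /data/models /dev/shm/models
```

Checksumming tens of gigabytes takes a while, so this suits a post-boot sanity check rather than a frequent cron job.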
Each engine needs its own configuration. The sections below cover engine-specific setup and the verification tests to run against it.
Ollama uses its own internal blob store. You can redirect it via the OLLAMA_MODELS environment variable or a symlink.
A systemd drop-in (e.g. via `sudo systemctl edit ollama`) sets the environment variable and orders Ollama after the shm load:

```ini
[Unit]
After=shm-models.service
Requires=shm-models.service

[Service]
Environment="OLLAMA_MODELS=/dev/shm/ollama-models"
```
For the symlink approach, migrate the existing blob store once:

```bash
# Stop Ollama and snapshot its existing blob store to disk
sudo systemctl stop ollama
sudo rsync -a /usr/share/ollama/.ollama/models/ /data/ollama-models/

# Mirror the blobs into shared memory
sudo mkdir -p /dev/shm/ollama-models
sudo rsync -a --inplace --delete /data/ollama-models/ /dev/shm/ollama-models/

# Point Ollama's model directory at the shm copy and restart
sudo ln -sfn /dev/shm/ollama-models /usr/share/ollama/.ollama/models
sudo systemctl start ollama
```
With the model directory pointed at shm, new pulls land in RAM only, so write them back to disk on a schedule (crontab entry):

```
# Sync every 10 minutes
*/10 * * * * rsync -a --inplace /dev/shm/ollama-models/ /data/ollama-models/
```
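An `ollama pull` can still be writing blobs while the cron job runs, so it is worth guarding the write-back with a lock. A portable sketch using an atomic `mkdir` as the lock (`sync_back` is our own name, and `cp -a` stands in for `rsync` so the example is self-contained):

```shell
#!/usr/bin/env sh
# sync_back SRC DST LOCKDIR - mirror SRC into DST unless another run is active.
# mkdir either creates the lock directory atomically or fails: a portable lock.
sync_back() {
  src=$1 dst=$2 lock=$3
  if ! mkdir "$lock" 2>/dev/null; then
    echo "skipped: another sync is running"
    return 1
  fi
  mkdir -p "$dst"
  cp -a "$src/." "$dst/"   # production: rsync -a --inplace "$src/" "$dst/"
  rmdir "$lock"
  echo "synced"
}

# Example usage (paths are illustrative):
# sync_back /dev/shm/ollama-models /data/ollama-models /var/lock/ollama-sync
```

In production, wrapping the rsync in `flock -n <lockfile>` achieves the same in one line.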
Run these commands to confirm Ollama is reading from shm:
```bash
# Check where Ollama thinks its models are
ollama list

# Verify the OLLAMA_MODELS env var is set
systemctl show ollama | grep -i environment

# Check if symlink points to shm
ls -la /usr/share/ollama/.ollama/models

# Watch file access in real-time (run while generating)
sudo strace -e openat -p $(pgrep ollama) 2>&1 | grep -i "shm\|model"

# Verify shm has Ollama blobs
du -sh /dev/shm/ollama-models/
```
The simplest setup: just point `-m` at the shm path. Use `--no-mmap` to have llama.cpp read the weights directly rather than memory-mapping a file that is already resident in RAM.
```bash
llama-server \
  -m /dev/shm/models/your-model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --no-mmap
```
As a systemd service:

```ini
[Unit]
Description=llama.cpp API server
After=shm-models.service
Requires=shm-models.service

[Service]
ExecStart=/usr/local/bin/llama-server \
    -m /dev/shm/models/your-model.gguf \
    --host 0.0.0.0 --port 8080 \
    --n-gpu-layers 99 --no-mmap --ctx-size 8192
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
```
```bash
# Confirm the process has shm files open
sudo lsof -p $(pgrep llama-server) | grep shm

# Check memory-mapped regions
sudo cat /proc/$(pgrep llama-server)/maps | grep shm

# Quick benchmark: time loading from disk vs shm
time llama-cli -m /data/models/model.gguf -p "test" -n 1 --n-gpu-layers 0
time llama-cli -m /dev/shm/models/model.gguf -p "test" -n 1 --n-gpu-layers 0 --no-mmap
```
vLLM loads HuggingFace-format models (safetensors) and pushes them into GPU VRAM. The shm benefit is a faster initial load; once the weights are in VRAM, inference speed is the same.
```bash
python -m vllm.entrypoints.openai.api_server \
  --model /dev/shm/models/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --max-model-len 8192
```
```bash
# Time the vLLM server startup from disk
time python -m vllm.entrypoints.openai.api_server \
  --model /data/models/Meta-Llama-3-70B-Instruct \
  --max-model-len 512 &   # short context for quick test
# Wait until "Uvicorn running" appears, then Ctrl+C

# Time the vLLM server startup from shm
time python -m vllm.entrypoints.openai.api_server \
  --model /dev/shm/models/Meta-Llama-3-70B-Instruct \
  --max-model-len 512 &

# Check GPU memory usage
nvidia-smi
```
HuggingFace's production inference server. Docker deployment is most common — mount the shm model directory as a volume.
```bash
docker run --gpus all --shm-size 64g \
  -v /dev/shm/models:/models \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /models/Meta-Llama-3-70B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192
```
The `--shm-size` flag sizes the container's own shared memory, which PyTorch uses for inter-process communication during tensor parallelism; it is separate from your model-loading shm strategy. You need both.
Multi-format engine supporting GGUF, HuggingFace, and more. Point the models directory at shm.
Docker:

```bash
docker run -p 8080:8080 \
  -v /dev/shm/models:/models \
  localai/localai:latest
```

Or with the binary:

```bash
local-ai --models-path /dev/shm/models --address :8080
```
GGUF-based inference with the same shm advantages as llama.cpp: weights are read from RAM continuously during generation, so the fast copy pays off on every token.
```bash
koboldcpp \
  --model /dev/shm/models/model.gguf \
  --contextsize 8192 \
  --gpulayers 99 \
  --host 0.0.0.0 \
  --port 5001
```
| Event | Direction | Command |
|---|---|---|
| Boot / manual load | Disk → SHM | `rsync -a --inplace --delete /data/models/ /dev/shm/models/` |
| `ollama pull` | SHM → Disk | `rsync -a --inplace /dev/shm/ollama-models/ /data/ollama-models/` |
| Model updated on disk | Disk → SHM | Re-run boot load script |
| Shutdown | Clean SHM | `rm -rf /dev/shm/models` (systemd `ExecStop`) |
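The lifecycle in the table can be collapsed into one helper; a sketch (the `shm_models` name and the `load|save|clean` interface are ours, with `cp -a` standing in for the rsync calls so the example is self-contained):

```shell
#!/usr/bin/env sh
# shm_models load|save|clean - one entry point for the sync lifecycle.
#   load : disk -> shm (boot, or model updated on disk)
#   save : shm -> disk (after a pull lands in the RAM copy)
#   clean: remove the shm copy (shutdown)
shm_models() {
  disk=${DISK:-/data/models}
  shm=${SHM:-/dev/shm/models}
  case $1 in
    load)  mkdir -p "$shm"  && cp -a "$disk/." "$shm/" ;;
    save)  mkdir -p "$disk" && cp -a "$shm/." "$disk/" ;;
    clean) rm -rf "$shm" ;;
    *)     echo "usage: shm_models load|save|clean" >&2; return 2 ;;
  esac
}
```

One entry point also makes the systemd wiring simpler: `ExecStart` calls `load`, `ExecStop` calls `clean`.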
| Engine | Format | Benefit | Why |
|---|---|---|---|
| llama.cpp (CPU) | GGUF | ● High | Weights read from RAM every token |
| llama.cpp (GPU split) | GGUF | ● Medium | Faster layer upload to VRAM |
| Ollama | GGUF blobs | ● High | Same as llama.cpp; faster model swap |
| vLLM | Safetensors | ● Load only | Once in VRAM, shm irrelevant |
| TGI | Safetensors | ● Load only | Same as vLLM |
| koboldcpp | GGUF | ● High | Same as llama.cpp |
| LocalAI | Mixed | ● High | Often uses llama.cpp backend |