aiDAPTIVCache Inference
This guide demonstrates how to deploy a high-performance LLM inference service using aiDAPTIVCache within the OtterScale cluster.
Overview
aiDAPTIVCache Inference provides a high-performance LLM inference service based on vLLM, supporting:
- High-throughput serving
- Tensor parallelism (Multi-GPU)
- KV Cache offloading
- Dynamic LoRA loading
- OpenAI-compatible API
Prerequisites
Prepare Model
You need to make your model files accessible to the inference service. There are two methods to achieve this, both configured via the prescript field in the Helm chart values.
Method 1: Using NFS Storage (Recommended)
This method uses OtterScale’s NFS File System to store and access model files.
1. Create an NFS File System in OtterScale:

   Navigate to the Storage section:

   - Go to Storage → File System
   - Create a new NFS File System or use an existing one
   - Record the NFS server address (format: 10.102.197.0:/volumes/_nogroup/xxx)

2. Upload your model files to NFS:

   Mount the NFS on a machine that has access and copy your model files:

   ```bash
   # Mount NFS on a node with access
   mkdir -p /mnt/nfs
   mount -t nfs4 10.102.197.0:/volumes/_nogroup/xxx /mnt/nfs

   # Copy your model files to NFS
   cp -r /path/to/your-model /mnt/nfs/models/

   # Verify files
   ls /mnt/nfs/models/
   ```

3. Configure the prescript to mount NFS:

   In the Helm chart values, you’ll configure the prescript to mount this NFS (see the NFS Mount Configuration section below).
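Before pointing the chart at the NFS path, it can help to sanity-check that the copy is complete. A minimal sketch — the required file names are an assumption and vary by model format; `model_dir_ok` is a hypothetical helper, not part of the chart:

```python
import os

REQUIRED = ("config.json",)  # assumption: minimal Hugging Face layout; names vary by model

def model_dir_ok(path: str) -> bool:
    """Rough sanity check that a copied model directory looks complete:
    the config file is present plus at least one weights file."""
    if not os.path.isdir(path):
        return False
    names = os.listdir(path)
    has_weights = any(n.endswith((".safetensors", ".bin")) for n in names)
    return has_weights and all(r in names for r in REQUIRED)
```

For example, `model_dir_ok("/mnt/nfs/models/your-model")` should return True before you deploy.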
Method 2: Using SCP to Copy Models
This method copies model files directly into the pod using SCP during initialization.
1. Prepare a remote server with model files:

   Ensure you have a remote server (accessible from the cluster) that contains your model files.

2. Configure the prescript with SCP:

   Use the following prescript template in your Helm chart values:

   ```yaml
   prescript: |
     # Install required tools
     apt-get update && apt-get install -y sshpass

     # Create model directory
     mkdir -p /mnt/data/models

     # Copy model from remote server using SCP
     echo "Copying model from remote server..."
     sshpass -p 'your-password' scp -o StrictHostKeyChecking=no -r \
       user@remote-host:/path/to/your-model /mnt/data/models/

     if [ $? -eq 0 ]; then
       echo "Model copied successfully!"
       ls -lh /mnt/data/models/
     else
       echo "Failed to copy model. Exiting..."
       exit 1
     fi
   ```
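Because a dropped connection can leave a partial copy behind, comparing checksums of the source and copied weights is a cheap safeguard. A stdlib sketch — run it against the same file on both ends and compare the digests (`sha256_of` is an illustrative helper, not part of the chart):

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks so large
    model weight files are never loaded into RAM at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()
```

If the digest of `/mnt/data/models/your-model/model.safetensors` matches the one computed on the remote server, the transfer is intact.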
Install Helm Chart
1. Navigate to the Application Store:

   In the OtterScale web interface:

   - Go to Application → Store
   - This opens the Helm chart repository

2. Import the Helm Chart:

   - Click the Import button at the top of the page
   - Enter the Helm Chart URL in the dialog:
     https://github.com/otterscale/charts/releases/download/aidaptivcache-inference-0.1.3/aidaptivcache-inference-0.1.3.tgz
   - Click Confirm to import

3. Install the Chart:

   - Find aidaptivcache-inference in the Store list
   - Click the Install button
   - In the Install Release dialog:
     - Enter a Name for your deployment (e.g., llama-inference)
     - Enter a Namespace (e.g., inference)
     - Click the View/Edit button to open the configuration editor
   - Edit the values.yaml to configure your inference service (see the Configuration Guide below)
   - Click Confirm to start the installation
Configuration Guide
Now that you’ve opened the configuration editor via View/Edit, you need to configure the following fields in the values.yaml.
Basic Configuration
Image Settings

```yaml
image:
  repository: docker.io/library/aidaptiv
  tag: vNXUN_3_03AA
  pullPolicy: IfNotPresent
```

- repository: Container image address
- tag: Image version tag
- pullPolicy: Image pull policy (IfNotPresent/Always/Never)
Deployment Configuration
```yaml
deployment:
  name: vllm-api
  replicas: 1
```

- name: Kubernetes Deployment name
- replicas: Number of Pod replicas (typically 1 due to limited GPU resources)
vLLM Configuration
Environment Variables

```yaml
vllm:
  env:
    vllmUseV1: "1"
    vllmWorkerMultiprocMethod: "spawn"
    tiktokenEncodingsBase: ""
```

- vllmUseV1: Use the vLLM v1 API
- vllmWorkerMultiprocMethod: Multi-process startup method (spawn/fork)
Command Line Arguments (Important)
```yaml
vllm:
  args:
    model: /mnt/data/model/Meta-Llama-3.1-8B-Instruct/
    nvmePath: /mnt/nvme0
    port: 8000
    gpuMemoryUtilization: 0.9
    maxModelLen: 32768
    tensorParallelSize: 4
    dramKvOffloadGb: 0
    ssdKvOffloadGb: 500
    noResumeKvCache: true
    disableGpuReuse: false
    enableChunkedPrefill: true
```

Key Parameter Explanations:
- model (required):
  - Model input path
  - Must correspond to the container mount path
  - Example: /mnt/data/model/Meta-Llama-3.1-8B-Instruct/
  - Note: This is the container path, not the NFS server path!

- nvmePath (required):
  - NVMe cache path
  - Used for KV Cache offloading
  - Typically uses node NVMe: /mnt/nvme0

- port:
  - vLLM API service port
  - Default: 8000

- gpuMemoryUtilization:
  - GPU memory utilization ratio (0.0-1.0)
  - Recommended 0.8-0.9 to reserve some memory buffer

- maxModelLen:
  - Maximum sequence length
  - Adjust based on model and GPU memory

- tensorParallelSize:
  - Number of GPUs for tensor parallelism
  - Must match the otterscale.com/vgpu value in the resource configuration
  - Example: If tensorParallelSize: 4, you must set otterscale.com/vgpu: 4

- dramKvOffloadGb:
  - KV Cache offload capacity to DRAM (GB)
  - Set to 0 to disable DRAM offload

- ssdKvOffloadGb:
  - KV Cache offload capacity to SSD (GB)
  - Used with nvmePath

- enableChunkedPrefill:
  - Enable chunked prefill for better long-text performance
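The coupling between tensorParallelSize and otterscale.com/vgpu is easy to get wrong, so a small pre-install check on the parsed values can catch it before deployment. A sketch that assumes values.yaml has already been loaded into a dict (e.g. with PyYAML); the helper name is illustrative:

```python
def check_gpu_consistency(values: dict) -> list[str]:
    """Return a list of problems found in the GPU-related settings
    of a parsed values.yaml; an empty list means the settings agree."""
    problems = []
    args = values.get("vllm", {}).get("args", {})
    tp = args.get("tensorParallelSize", 1)

    # otterscale.com/vgpu must equal tensorParallelSize in requests and limits
    for section in ("requests", "limits"):
        vgpu = values.get("resources", {}).get(section, {}).get("otterscale.com/vgpu")
        if vgpu != tp:
            problems.append(
                f"resources.{section}['otterscale.com/vgpu'] is {vgpu}, "
                f"but vllm.args.tensorParallelSize is {tp}"
            )

    # gpuMemoryUtilization is a ratio in (0, 1]
    util = args.get("gpuMemoryUtilization", 0.9)
    if not 0.0 < util <= 1.0:
        problems.append(f"gpuMemoryUtilization must be in (0, 1], got {util}")
    return problems
```

Running this against your edited values before clicking Confirm turns a pod-scheduling failure into an immediate, readable error message.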
Optional Parameters (commented by default):
```yaml
# disableLongToken: true      # Disable long token support
# resumeKvCache: true         # Resume KV Cache
# cleanObsoleteKvCache: true  # Clean obsolete KV Cache
# enablePrefixCaching: true   # Enable prefix caching
# enforceEager: true          # Force eager mode
```

LoRA Configuration
```yaml
vllm:
  lora:
    enable: false
    modules: ""
    maxRank: 32
```

Enable LoRA Example:

```yaml
vllm:
  lora:
    enable: true
    modules: "lora=/mnt/data/lora-adapters/llama3.1-8B-lora/"
    maxRank: 32
```

- enable: Set to true to enable LoRA
- modules: LoRA adapter path, format: lora=/path/to/adapter/
- maxRank: Maximum LoRA rank
Note: All three parameters must be configured together to enable LoRA.
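Assuming the chart passes modules through to vLLM's dynamic LoRA serving, requests then select the adapter by the name on the left side of the `name=path` pair (here `lora`) in the model field. A sketch of extracting that name from the chart-style modules string; the parsing helper is illustrative, not part of the chart:

```python
def lora_adapter_names(modules: str) -> list[str]:
    """Parse a modules string like
    'lora=/mnt/data/lora-adapters/llama3.1-8B-lora/' into adapter names.
    Multiple adapters are assumed to be space-separated name=path pairs."""
    return [entry.split("=", 1)[0] for entry in modules.split() if "=" in entry]

# A request against the adapter uses the adapter name as the model:
payload = {
    "model": lora_adapter_names("lora=/mnt/data/lora-adapters/llama3.1-8B-lora/")[0],
    "prompt": "Hello",
    "max_tokens": 32,
}
```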
NFS Mount Configuration (For Method 1)
If you’re using NFS storage to access your model files, configure the prescript to mount your NFS storage.
```yaml
prescript: |
  apt install -y nfs-common
  echo "Starting NFS mount process..."
  mkdir -p /mnt/data
  TIMEOUT=300
  ELAPSED=0
  while [ $ELAPSED -lt $TIMEOUT ]; do
    echo "Attempting to mount NFS to /mnt/data"
    mount -t nfs4 -o nfsvers=4.1 -v 10.102.197.0:/volumes/_nogroup/your-nfs-path /mnt/data
    if mountpoint -q /mnt/data; then
      echo "NFS mount successful!"
      break
    else
      echo "Mount failed, retrying in 5 seconds... (${ELAPSED}s/${TIMEOUT}s)"
      sleep 5
      ELAPSED=$((ELAPSED + 5))
    fi
  done
  if [ $ELAPSED -ge $TIMEOUT ]; then
    echo "NFS mount timeout after ${TIMEOUT} seconds. Exiting..."
    exit 1
  fi
```

Configuration Details:

- Replace 10.102.197.0:/volumes/_nogroup/your-nfs-path with your actual NFS server address from the File System page
- The script installs the nfs-common package (required for NFS mounting)
- Implements retry logic with a 5-minute timeout
- Mounts NFS to /mnt/data, which is used in the model path configuration
Service Configuration
```yaml
service:
  type: NodePort
  port: 8000
  targetPort: 8000
  # nodePort: 30299
```

- type: Service type (NodePort/ClusterIP/LoadBalancer)
- port: Service external port
- targetPort: Container internal port
- nodePort: (Optional) Specify a fixed NodePort
Volume Configuration
```yaml
volumes:
  dshm:
    enabled: true
    sizeLimit: 30Gi
```

- dshm:
  - enabled: Enable /dev/shm (shared memory)
  - sizeLimit: Shared memory size (required for vLLM)
Resource Configuration
```yaml
resources:
  requests:
    otterscale.com/vgpu: 1
    otterscale.com/vgpumem-percentage: 60
    phison.com/ai100: 1
  limits:
    otterscale.com/vgpu: 1
    otterscale.com/vgpumem-percentage: 60
    phison.com/ai100: 1
```

- otterscale.com/vgpu: vGPU count
  - Must match vllm.args.tensorParallelSize for multi-GPU deployment
  - Example: For 4-GPU tensor parallelism, set both tensorParallelSize: 4 and otterscale.com/vgpu: 4
- otterscale.com/vgpumem-percentage: vGPU memory percentage (0-100)
- otterscale.com/vgpumem: Alternative to percentage; directly specify the memory amount (MB)
- phison.com/ai100: Phison aiDAPTIVCache accelerator count
Complete Configuration Examples
Example: Basic Inference Service

```yaml
image:
  repository: docker.io/library/aidaptiv
  tag: vNXUN_3_03AA
  pullPolicy: IfNotPresent

deployment:
  name: llama3-inference
  replicas: 1

vllm:
  env:
    vllmUseV1: "1"
    vllmWorkerMultiprocMethod: "spawn"
  args:
    model: /mnt/data/model/Meta-Llama-3.1-8B-Instruct/
    nvmePath: /mnt/nvme0
    port: 8000
    gpuMemoryUtilization: 0.9
    maxModelLen: 32768
    tensorParallelSize: 1
    ssdKvOffloadGb: 500
    enableChunkedPrefill: true
  lora:
    enable: false

service:
  type: NodePort
  port: 8000
  targetPort: 8000

securityContext:
  privileged: true

# NFS Mount Script
prescript: |
  apt install -y nfs-common
  echo "Starting NFS mount process..."
  mkdir -p /mnt/data
  TIMEOUT=300
  ELAPSED=0
  while [ $ELAPSED -lt $TIMEOUT ]; do
    echo "Attempting to mount NFS to /mnt/data"
    mount -t nfs4 -o nfsvers=4.1 -v 10.102.197.0:/volumes/_nogroup/models /mnt/data
    if mountpoint -q /mnt/data; then
      echo "NFS mount successful!"
      break
    else
      echo "Mount failed, retrying in 5 seconds... (${ELAPSED}s/${TIMEOUT}s)"
      sleep 5
      ELAPSED=$((ELAPSED + 5))
    fi
  done
  if [ $ELAPSED -ge $TIMEOUT ]; then
    echo "NFS mount timeout after ${TIMEOUT} seconds. Exiting..."
    exit 1
  fi

volumes:
  dshm:
    enabled: true
    sizeLimit: 30Gi

resources:
  requests:
    otterscale.com/vgpu: 1
    otterscale.com/vgpumem-percentage: 80
    phison.com/ai100: 1
  limits:
    otterscale.com/vgpu: 1
    otterscale.com/vgpumem-percentage: 80
    phison.com/ai100: 1
```

Using the Inference Service
1. Get the Service Address:

   - Go to Application → Services
   - Find your inference service and note the NodePort.

2. Test the API Connection:

   ```bash
   # Get Node IP
   NODE_IP="your-node-ip"
   NODE_PORT="your-node-port"

   # Test health check
   curl http://${NODE_IP}:${NODE_PORT}/health
   ```

3. Use the OpenAI-Compatible API:

   ```bash
   curl http://${NODE_IP}:${NODE_PORT}/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "Meta-Llama-3.1-8B-Instruct",
       "prompt": "What is the capital of France?",
       "max_tokens": 100,
       "temperature": 0.7
     }'
   ```

4. Use the Chat Completions API:

   ```bash
   curl http://${NODE_IP}:${NODE_PORT}/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "Meta-Llama-3.1-8B-Instruct",
       "messages": [
         {"role": "system", "content": "You are a helpful assistant."},
         {"role": "user", "content": "What is machine learning?"}
       ],
       "max_tokens": 200
     }'
   ```

5. Python Client Example:

   ```python
   from openai import OpenAI

   # Connect to vLLM service
   client = OpenAI(
       base_url=f"http://{NODE_IP}:{NODE_PORT}/v1",
       api_key="dummy",  # vLLM doesn't require a real API key
   )

   # Perform inference
   response = client.chat.completions.create(
       model="Meta-Llama-3.1-8B-Instruct",
       messages=[
           {"role": "user", "content": "Explain quantum computing in simple terms"}
       ],
       max_tokens=150,
       temperature=0.8,
   )
   print(response.choices[0].message.content)
   ```
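If the openai package is not available, the same endpoint can be reached with only the standard library, since the API is plain JSON over HTTP. A sketch — the base URL is a placeholder built from your NodePort, and the helper names are illustrative:

```python
import json
from urllib import request

def build_chat_body(model: str, messages: list, **params) -> bytes:
    """Serialize an OpenAI-style chat completion request body."""
    return json.dumps({"model": model, "messages": messages, **params}).encode()

def chat(base_url: str, model: str, messages: list, **params) -> dict:
    """POST to the service's chat completions endpoint and decode the reply."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_chat_body(model, messages, **params),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires the running service):
# out = chat("http://NODE_IP:NODE_PORT", "Meta-Llama-3.1-8B-Instruct",
#            [{"role": "user", "content": "Hi"}], max_tokens=32)
# print(out["choices"][0]["message"]["content"])
```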
Monitoring and Debugging
1. Check Deployment Status:

   Navigate to the Workloads page:

   - Go to Application → Workloads
   - Find your deployment

2. View Pod Logs:

   - Click on the Pod corresponding to the Deployment
   - View the Logs tab
   - Monitor model loading, inference requests, etc.

3. Check Resource Usage:

   - View GPU and memory usage on the Pod details page
   - Check for OOM (Out of Memory) errors
Performance Tuning Recommendations
Memory Optimization

- Small models (< 7B):
  - gpuMemoryUtilization: 0.9
  - ssdKvOffloadGb: 0 (no offload needed)

- Medium models (7B-13B):
  - gpuMemoryUtilization: 0.85
  - ssdKvOffloadGb: 200-500

- Large models (> 13B):
  - gpuMemoryUtilization: 0.8
  - ssdKvOffloadGb: 500-1000
  - Consider multi-GPU: tensorParallelSize: 2-4 with otterscale.com/vgpu: 2-4
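The sizing tiers above can be captured as a small lookup when scripting deployments. A sketch with thresholds copied from the tiers; the concrete values returned for the ranged settings are mid-range picks, not recommendations from the chart:

```python
def tuning_for(params_b: float) -> dict:
    """Suggest starting values from the sizing tiers above
    (params_b is the model's parameter count in billions)."""
    if params_b < 7:      # small models: no offload needed
        return {"gpuMemoryUtilization": 0.9, "ssdKvOffloadGb": 0,
                "tensorParallelSize": 1}
    if params_b <= 13:    # medium models
        return {"gpuMemoryUtilization": 0.85, "ssdKvOffloadGb": 300,
                "tensorParallelSize": 1}
    # large models: consider multi-GPU tensor parallelism
    return {"gpuMemoryUtilization": 0.8, "ssdKvOffloadGb": 1000,
            "tensorParallelSize": 2}
```

Remember that whatever tensorParallelSize comes out of this must be mirrored in otterscale.com/vgpu, as noted in the Resource Configuration section.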
Latency vs Throughput Trade-off
- Low latency priority:

  ```yaml
  maxModelLen: 8192           # Shorter sequences
  enableChunkedPrefill: false
  dramKvOffloadGb: 0
  ssdKvOffloadGb: 0
  ```

- High throughput priority:

  ```yaml
  maxModelLen: 32768          # Longer sequences
  enableChunkedPrefill: true
  enablePrefixCaching: true
  ssdKvOffloadGb: 500
  ```