# aiDAPTIVCache Finetune
This guide demonstrates how to fine-tune Large Language Models (LLMs) using aiDAPTIVCache within the OtterScale cluster.
## Overview

aiDAPTIVCache finetune provides efficient model fine-tuning capabilities with support for LoRA, full-parameter training, and other training modes. Key features:
- Distributed training support (Multi-GPU)
- LoRA fine-tuning support
- Customizable training datasets
- Flexible resource allocation (vGPU, memory)
## Prerequisites

### Prepare Model and Data

You need to make your model files accessible to the fine-tuning job. There are two methods to achieve this, both configured via the `prescript` field in the Helm chart values.

#### Method 1: Using NFS Storage (Recommended)

This method uses OtterScale's NFS File System to store and access model files.
1. **Create NFS File System in OtterScale:**

   Navigate to the Storage section:

   - Go to Storage → File System
   - Create a new NFS File System or use an existing one
   - Record the NFS server address (format: `10.102.197.0:/volumes/_nogroup/xxx`)

2. **Upload your model files to NFS:**

   Mount the NFS on a machine that has access and copy your model files:

   ```bash
   # Mount NFS on a node with access
   mkdir -p /mnt/nfs
   mount -t nfs4 10.102.197.0:/volumes/_nogroup/xxx /mnt/nfs

   # Copy your model files to NFS
   cp -r /path/to/your-model /mnt/nfs/models/

   # Verify files
   ls /mnt/nfs/models/
   ```

3. **Configure prescript to mount NFS:**

   In the Helm chart values, you'll configure the `prescript` to mount this NFS (see the NFS Mount Configuration section below).
#### Method 2: Using SCP to Copy Models

This method copies model files directly into the pod using SCP during initialization.
1. **Prepare a remote server with model files:**

   Ensure you have a remote server (accessible from the cluster) that contains your model files.

2. **Configure prescript with SCP:**

   Use the following prescript template in your Helm chart values:

   ```yaml
   prescript: |
     # Install required tools
     apt-get update && apt-get install -y sshpass

     # Create model directory
     mkdir -p /mnt/data/models

     # Copy model from remote server using SCP
     echo "Copying model from remote server..."
     sshpass -p 'your-password' scp -o StrictHostKeyChecking=no -r \
       user@remote-host:/path/to/your-model /mnt/data/models/

     if [ $? -eq 0 ]; then
       echo "Model copied successfully!"
       ls -lh /mnt/data/models/
     else
       echo "Failed to copy model. Exiting..."
       exit 1
     fi
   ```

   You still need an NFS mount for storing training outputs. Add NFS mount commands after the SCP section:
   ```yaml
   prescript: |
     # Install tools
     apt-get update && apt-get install -y sshpass nfs-common

     # Copy model via SCP
     mkdir -p /mnt/data/models
     echo "Copying model from remote server..."
     sshpass -p 'your-password' scp -o StrictHostKeyChecking=no -r \
       user@remote-host:/path/to/model /mnt/data/models/

     # Mount NFS for output storage
     mkdir -p /mnt/data/output
     echo "Mounting NFS for output storage..."
     mount -t nfs4 -o nfsvers=4.1 10.102.197.0:/volumes/_nogroup/output-path /mnt/data/output

     echo "Setup complete!"
   ```
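Hard-coding a password in `values.yaml` is fragile and visible to anyone who can read the release values. If your remote server accepts key-based authentication, a minimal sketch of the same copy step without `sshpass` (assuming a private key has been made available inside the pod, for example at `/root/.ssh/id_rsa`; how you inject it, such as via a mounted Secret, depends on your setup):

```bash
# Sketch: key-based SCP instead of a plaintext password.
# Assumes /root/.ssh/id_rsa was injected into the pod (e.g. from a Secret).
mkdir -p /mnt/data/models
scp -i /root/.ssh/id_rsa -o StrictHostKeyChecking=no -r \
  user@remote-host:/path/to/your-model /mnt/data/models/
```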
## Install Helm Chart

1. **Navigate to Application Store:**

   In the OtterScale web interface:

   - Go to Application → Store
   - This opens the Helm chart repository

2. **Import Helm Chart:**

   - Click the Import button at the top of the page
   - Enter the Helm Chart URL in the dialog:
     `https://github.com/otterscale/charts/releases/download/aidaptivcache-finetune-0.1.3/aidaptivcache-finetune-0.1.3.tgz`
   - Click Confirm to import

3. **Install Chart:**

   - Find `aidaptivcache-finetune` in the Store list
   - Click the Install button
   - In the Install Release dialog:
     - Enter a Name for your deployment (e.g., `ft1`)
     - Enter a Namespace (e.g., `ft1`)
     - Click the View/Edit button to open the configuration editor
   - Edit the `values.yaml` to configure your fine-tuning job (see Configuration Guide below)
   - Click Confirm to start the installation
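If you prefer the command line over the web interface, Helm 3 can install the chart straight from the release URL. A sketch, assuming your kubeconfig points at the cluster and reusing the example name and namespace `ft1`:

```bash
# Install the chart directly from the .tgz release URL
helm install ft1 \
  https://github.com/otterscale/charts/releases/download/aidaptivcache-finetune-0.1.3/aidaptivcache-finetune-0.1.3.tgz \
  --namespace ft1 --create-namespace \
  -f values.yaml   # your edited configuration file
```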
## Configuration Guide

Now that you've opened the configuration editor via View/Edit, configure the following fields in the `values.yaml`.
### Basic Configuration

#### Image Settings

```yaml
image:
  repository: docker.io/library/aidaptiv
  tag: vNXUN_2_05BA0
  pullPolicy: IfNotPresent
```

- `repository`: Container image address
- `tag`: Image version tag
- `pullPolicy`: Image pull policy (IfNotPresent/Always/Never)
#### Job Configuration

```yaml
job:
  name: finetune-job
  backoffLimit: 1
  restartPolicy: Never
  ttlSecondsAfterFinished: 60
```

- `name`: Kubernetes Job name
- `backoffLimit`: Number of retry attempts on failure
- `restartPolicy`: Restart policy (Never/OnFailure)
- `ttlSecondsAfterFinished`: Time to retain the Job after completion (seconds)
### Training Configuration

The training configuration is divided into three main sections: `expConfig`, `envConfig`, and `trainDataConfig`.
#### 1. Experiment Configuration (expConfig)

The `expConfig` section controls GPU resources, distributed training settings, and training hyperparameters.

**Process Settings:**

```yaml
expConfig:
  processSettings:
    numGpus: 1                 # Number of GPUs to use
    specifyGpus: null          # Specific GPU IDs (e.g., "0,1,2,3")
    masterPort: 8299           # Master port for distributed training
    multiNodeSettings:
      enable: false            # Enable multi-node training
      masterAddr: "127.0.0.1"  # Master node address
```

**Run Settings:**
```yaml
expConfig:
  runSettings:
    taskType: "text-generation"  # Task type
    taskMode: "train"            # Mode: train/eval/inference
    perDeviceTrainBatchSize: 4   # Batch size per device
    perUpdateTotalBatchSize: 16  # Total batch size (gradient accumulation)
    numTrainEpochs: 1            # Number of training epochs
    maxIter: 12                  # Maximum iterations
    maxSeqLen: 2048              # Maximum sequence length
    triton: true                 # Enable Triton optimization
    precisionMode: 1             # Precision mode (0: FP32, 1: Mixed)
```
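The two batch-size fields interact: assuming the trainer derives gradient accumulation from them in the usual way (not confirmed by the chart docs), the number of accumulation steps is `perUpdateTotalBatchSize / (perDeviceTrainBatchSize × numGpus)` = 16 / (4 × 1) = 4, i.e. four forward/backward passes per optimizer update.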
**LoRA Settings:**

```yaml
expConfig:
  runSettings:
    lora:
      enableLora: false          # Enable LoRA fine-tuning
      loraRank: 8                # LoRA rank
      loraAlpha: 16              # LoRA alpha parameter
      loraTaskType: "CAUSAL_LM"  # Task type for LoRA
      loraTargetModules: null    # Target modules (null for auto)
```
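In the standard LoRA formulation (how aiDAPTIV maps these fields is worth verifying against your version), each targeted weight matrix gains two trainable low-rank factors of rank `loraRank`, and their contribution is scaled by `loraAlpha / loraRank`, here 16 / 8 = 2. A rank of 8 keeps the trainable parameters to a small fraction of the full model, which is why LoRA runs need far less memory than full-parameter training.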
**Learning Rate and Optimizer:**

```yaml
expConfig:
  runSettings:
    lrScheduler:
      mode: 1                  # LR scheduler mode
      learningRate: 0.000007   # Learning rate
    optimizer:
      beta1: 0.9               # Adam beta1
      beta2: 0.95              # Adam beta2
      eps: 0.00000001          # Epsilon
      weightDecay: 0.01        # Weight decay
```

#### 2. Environment Configuration (envConfig)
The `envConfig` section defines all file paths used during training.

```yaml
envConfig:
  pathSettings:
    modelNameOrPath: "/mnt/data/models/TinyLlama-1.1B-Chat-v1.0"  # Model input path
    nvmePath: "/mnt/nvme0"                                        # NVMe cache path
    outputDir: "/mnt/data/output"                                 # Training output path
    trainDataPath:                                                # Training data config
      - /config/train_data/QA_dataset_config.yaml
    logName: "output.log"                                         # Log file name
```

**Key Path Explanations:**
- `modelNameOrPath` (Required):
  - Path to the pre-trained model
  - Must point to where the prescript mounts the model from NFS
  - Example: `/mnt/data/models/TinyLlama-1.1B-Chat-v1.0`

- `outputDir` (Required):
  - Where fine-tuned model weights will be saved
  - Must be on the NFS mount for persistence: `/mnt/data/output`

- `nvmePath` (Required):
  - NVMe device path for temporary storage and cache
  - Typically uses the node's NVMe: `/mnt/nvme0`

- `trainDataPath`:
  - Path to training data configuration file(s)
  - Supports multiple datasets (array format)
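A quick preflight check in the `prescript` can catch path mistakes before the trainer starts. A minimal sketch, assuming the example layout above (`/mnt/data` as the NFS mount, `/mnt/nvme0` as the cache path):

```bash
# Preflight: verify the configured paths before training begins
MODEL=/mnt/data/models/TinyLlama-1.1B-Chat-v1.0
mountpoint -q /mnt/data || { echo "/mnt/data is not mounted"; exit 1; }
[ -d "$MODEL" ] || { echo "Model not found at $MODEL"; exit 1; }
[ -d /mnt/nvme0 ] || { echo "NVMe cache path /mnt/nvme0 missing"; exit 1; }
mkdir -p /mnt/data/output   # ensure the output dir exists on the NFS mount
```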
#### 3. Training Data Configuration (trainDataConfig)

The `trainDataConfig` section defines the dataset format and prompts.

```yaml
trainDataConfig: |
  instruction-dataset:
    data_path: "HuggingFaceH4/instruction-dataset"  # HuggingFace dataset or local path
    strategy: "QA"                                  # Data strategy (QA/Chat)
    system_prompt: "A chat between a curious user and an artificial intelligence assistant."
    user_prompt: "{question}"                       # User prompt template
    question_key: "prompt"                          # Column name for questions
    answer_key: "completion"                        # Column name for answers
    exp_type: train                                 # Experiment type: train/eval/inference
    label_key: "completion"                         # Label column (same as answer_key)
```

**Configuration Fields:**
- `data_path`: HuggingFace dataset name or local file path
- `strategy`: Data processing strategy (QA/Chat/Custom)
- `system_prompt`: System instruction for the model
- `user_prompt`: Template for user questions (use the `{question}` placeholder)
- `question_key`: Dataset column containing questions
- `answer_key`: Dataset column containing answers
- `exp_type`: Must be `train` for training
- `label_key`: Column used as training labels (typically same as `answer_key`)
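If you point `data_path` at a local file instead of a Hub dataset, each record needs columns matching `question_key` and `answer_key`. A hypothetical example written from the `prescript` (whether your aiDAPTIV version accepts local JSON Lines files, and at what path, is an assumption to verify):

```bash
# Hypothetical local dataset with columns matching
# question_key="prompt" and answer_key="completion"
mkdir -p /mnt/data/datasets
cat > /mnt/data/datasets/my_train.jsonl <<'EOF'
{"prompt": "What is NFS?", "completion": "NFS is a protocol for accessing remote files over a network as if they were local."}
{"prompt": "What does LoRA stand for?", "completion": "LoRA stands for Low-Rank Adaptation."}
EOF
```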
### Resource Configuration

```yaml
resources:
  limits:
    otterscale.com/vgpu: 1
    otterscale.com/vgpumem-percentage: 60
    phison.com/ai100: 1
  requests:
    otterscale.com/vgpu: 1
    otterscale.com/vgpumem-percentage: 60
    phison.com/ai100: 1
```

- `otterscale.com/vgpu`: vGPU count
- `otterscale.com/vgpumem-percentage`: vGPU memory percentage (0-100)
- `phison.com/ai100`: Phison aiDAPTIVCache accelerator count
### NFS Mount Configuration (Required)

To access your model files and save training outputs, you need to configure the prescript to mount your NFS storage.

```yaml
prescript: |
  apt-get update && apt-get install -y nfs-common
  echo "Starting NFS mount process..."
  mkdir -p /mnt/data
  TIMEOUT=300
  ELAPSED=0
  while [ $ELAPSED -lt $TIMEOUT ]; do
    echo "Attempting to mount NFS to /mnt/data"
    mount -t nfs4 -o nfsvers=4.1 -v 10.102.197.0:/volumes/_nogroup/your-nfs-path /mnt/data
    if mountpoint -q /mnt/data; then
      echo "NFS mount successful!"
      break
    else
      echo "Mount failed, retrying in 5 seconds... (${ELAPSED}s/${TIMEOUT}s)"
      sleep 5
      ELAPSED=$((ELAPSED + 5))
    fi
  done
  if [ $ELAPSED -ge $TIMEOUT ]; then
    echo "NFS mount timeout after ${TIMEOUT} seconds. Exiting..."
    exit 1
  fi
```

**Configuration Details:**

- Replace `10.102.197.0:/volumes/_nogroup/your-nfs-path` with your actual NFS server address from the File System page
- The script installs the `nfs-common` package (required for NFS mounting)
- Implements retry logic with a 5-minute timeout
- Mounts NFS to `/mnt/data`, which is used in the path configuration
**Optional Post-execution Script:**

```yaml
postscript: |
  echo "Training job completed"
  echo "Model saved to: $outputDir"
```

### Complete Configuration Example
Section titled “Complete Configuration Example”image: repository: docker.io/library/aidaptiv tag: vNXUN_2_05BA0 pullPolicy: IfNotPresent
job: name: finetune-llama-job backoffLimit: 1 restartPolicy: Never ttlSecondsAfterFinished: 60
securityContext: privileged: true
# NFS Mount Script (Required)prescript: | apt install -y nfs-common echo "Starting NFS mount process..." mkdir -p /mnt/data TIMEOUT=300 ELAPSED=0 while [ $ELAPSED -lt $TIMEOUT ]; do echo "Attempting to mount NFS to /mnt/data" mount -t nfs4 -o nfsvers=4.1 -v 10.102.197.0:/volumes/_nogroup/my-nfs-path /mnt/data if mountpoint -q /mnt/data; then echo "NFS mount successful!" break else echo "Mount failed, retrying in 5 seconds... (${ELAPSED}s/${TIMEOUT}s)" sleep 5 ELAPSED=$((ELAPSED + 5)) fi done if [ $ELAPSED -ge $TIMEOUT ]; then echo "NFS mount timeout after ${TIMEOUT} seconds. Exiting..." exit 1 fi
envConfig: pathSettings: modelNameOrPath: "/mnt/data/models/TinyLlama-1.1B-Chat-v1.0" nvmePath: "/mnt/nvme0" outputDir: "/mnt/data/output" trainDataPath: - /config/train_data/QA_dataset_config.yaml logName: "finetune_output.log"
expConfig: processSettings: numGpus: 1 masterPort: 8299 multiNodeSettings: enable: false
runSettings: taskType: "text-generation" taskMode: "train" perDeviceTrainBatchSize: 4 perUpdateTotalBatchSize: 16 numTrainEpochs: 1 maxIter: 100 maxSeqLen: 2048 triton: true
lrScheduler: mode: 1 learningRate: 0.000007
lora: enableLora: true loraRank: 8 loraAlpha: 16 loraTaskType: "CAUSAL_LM"
trainDataConfig: | instruction-dataset: data_path: "HuggingFaceH4/instruction-dataset" strategy: "QA" system_prompt: "A chat between a curious user and an artificial intelligence assistant." user_prompt: "{question}" question_key: "prompt" answer_key: "completion" exp_type: train label_key: "completion"
resources: limits: otterscale.com/vgpu: 1 otterscale.com/vgpumem-percentage: 60 phison.com/ai100: 1 requests: otterscale.com/vgpu: 1 otterscale.com/vgpumem-percentage: 60 phison.com/ai100: 1Monitor Training Progress
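After the release is installed, you can confirm which values Helm actually recorded. A sketch using the example names (release `ft1` in namespace `ft1`):

```bash
# Show the user-supplied values stored with the release
helm get values ft1 -n ft1
```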
## Monitor Training Progress

1. **Check Job Status:**

   Navigate to the Jobs page:

   - Go to Applications → Jobs
   - Find your fine-tune job under your namespace and check its status (a CLI alternative is sketched at the end of this section).
2. **Retrieve Output Model:**

   After training completes, the fine-tuned model will be saved in the path specified by `outputDir`:

   ```bash
   # Access via the same NFS mount used during training
   # Mount the NFS on your local machine or a node
   mount -t nfs4 10.102.197.0:/volumes/_nogroup/your-nfs-path /mnt/nfs
   ls /mnt/nfs/output/
   ```
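If you prefer the command line for monitoring, a sketch assuming the names from the complete example (namespace `ft1`, job `finetune-llama-job`):

```bash
# Watch Job status until it completes
kubectl -n ft1 get jobs -w

# Inspect events and stream training logs
kubectl -n ft1 describe job finetune-llama-job
kubectl -n ft1 logs -f job/finetune-llama-job
```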