Distributed training with Kubernetes

Kensho's infrastructure team details how they used multi-node distributed training on Kubernetes to tackle the cost and complexity of training large neural networks at scale.

Proof of concept

NCCL tests

Enabling password-less SSH

# Install SSH client/server (update the package index first so the install
# doesn't fail on a stale or empty index).
RUN apt-get update && \
    apt-get install -y -qq openssh-client openssh-server

# Enable passwordless SSH.
RUN mkdir -p $HOME/.ssh /run/sshd && \
    # Remove existing host keys.
    rm -f /etc/ssh/ssh_host* && \
    # Regenerate SSH host keys.
    ssh-keygen -A && \
    # Use the host keys as client keys too.
    cp /etc/ssh/ssh_host_rsa_key $HOME/.ssh/id_rsa && \
    cp /etc/ssh/ssh_host_rsa_key.pub $HOME/.ssh/id_rsa.pub && \
    # Strip the trailing user field with 'cut' and add the public key
    # to the authorized_keys file.
    cat /etc/ssh/ssh_host_rsa_key.pub | cut -d' ' -f1,2 > $HOME/.ssh/authorized_keys
  • NCCL creates shared memory segments in /dev/shm for inter-process communication within a node, and the container default (typically 64 MiB) is too small, so a dedicated memory-backed ephemeral volume was mounted at /dev/shm to provide extra shared memory. See the official NCCL troubleshooting documentation.

  • We use Linkerd as a Kubernetes service mesh, which interferes with NCCL communication, so we had to disable it on the relevant pods. If you use a different service mesh, you might encounter similar networking issues.

Pod template for StatefulSet

metadata:
  annotations:
    # Disable service mesh as it might interfere with networking.
    linkerd.io/inject: disabled
spec:
  containers:
    - name: multinode
      image: your-image-name # Image built from the Dockerfile above.
      args:
        - /usr/sbin/sshd
        - -D # sshd won't daemonize.
      ports:
        - containerPort: 22
          name: ssh
      resources:
        limits:
          # Max number of GPUs in the host.
          nvidia.com/gpu: '4'
      ...
      volumeMounts:
        # Shared memory for comms within nodes.
        - mountPath: /dev/shm
          name: shm
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 13Gi
      name: shm
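
With sshd running in every pod and each pod trusting the shared host key, the NCCL tests can then be launched over MPI from any one pod. A rough sketch, assuming the nccl-tests binaries are on the PATH inside the image and the StatefulSet is named multinode behind a headless service of the same name:

# 8 processes total: 4 GPUs on each of the two pods.
mpirun --allow-run-as-root \
    -np 8 \
    -H multinode-0.multinode:4,multinode-1.multinode:4 \
    -x NCCL_DEBUG=INFO \
    all_reduce_perf -b 8 -e 1G -f 2 -g 1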

PyTorch

Job manifest

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: multinode
  name: multinode
spec:
  completionMode: Indexed
  completions: 2 # Should match the number of nodes.
  parallelism: 2 # Should match the number of nodes.
  template:
    spec:
      containers:
        - image: your-torchrun-image
          name: multinode
          args:
            - sh
            - -c
            - torchrun --nproc_per_node $NGPUS --nnodes $NNODES --node_rank $JOB_COMPLETION_INDEX --master_addr $MASTER_ADDR --master_port $MASTER_PORT train.py
          env:
            # Node with rank 0 is chosen as the master node.
            - name: MASTER_ADDR
              value: multinode-0.multinode
            - name: MASTER_PORT
              value: '29500'
            # Number of training nodes.
            - name: NNODES
              value: '2'
            # Number of GPUs in each machine.
            - name: NGPUS
              value: '4'
          ports:
            - containerPort: 29500
              name: nccl
          resources:
            limits:
              # Request the max available number of GPUs in your machine.
              nvidia.com/gpu: '4'
          volumeMounts:
            # emptyDir mounted as shared memory for communication within nodes.
            - mountPath: /dev/shm
              name: shm
      restartPolicy: Never
      # Required for pod-to-pod communication in Indexed Jobs.
      subdomain: multinode
      volumes:
        - emptyDir:
            medium: Memory
            sizeLimit: 13Gi
          name: shm
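
The torchrun launcher sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker process it spawns, and init_process_group reads MASTER_ADDR and MASTER_PORT from the environment by default. A minimal sketch of a train.py compatible with the manifest above (the linear model and random data are placeholders for a real training setup):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each of the $NGPUS processes on a node.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # The default "env://" init reads MASTER_ADDR/MASTER_PORT from the manifest env.
    dist.init_process_group(backend="nccl")

    # Placeholder model; substitute your real architecture and data loader.
    model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):
        inputs = torch.randn(32, 128, device=local_rank)
        targets = torch.randint(0, 10, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across all 2x4 workers here.
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()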

Moving to production

An architectural diagram of our distributed deep learning training framework, Kensho Quilt.

Orchestration improvements with Airflow

Storage

Our production storage design centers on S3, which gives us:

  • Isolation of different training data for each training run.

  • Prevention of deadlocks during read/write operations by ensuring that each node within a run has isolated access to the training data.

  • Facilitation of distributing model checkpoints across nodes by storing them on S3.

  • Elimination of initial data download time by streaming the data instead; a sketch of such a streaming reader follows this list.
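
One way to stream training data straight from S3 is an IterableDataset that shards the object listing by global rank, so no two nodes ever contend for the same object. A minimal sketch using boto3; the bucket name and per-run prefix are hypothetical:

import os

import boto3
from torch.utils.data import IterableDataset

class S3LineStream(IterableDataset):
    """Streams newline-delimited records from S3, sharded by global rank."""

    def __init__(self, bucket, prefix, rank, world_size):
        self.bucket, self.prefix = bucket, prefix
        self.rank, self.world_size = rank, world_size

    def __iter__(self):
        s3 = boto3.client("s3")
        # For brevity this ignores pagination (ListObjectsV2 returns at
        # most 1,000 keys per call).
        listing = s3.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)
        keys = sorted(obj["Key"] for obj in listing.get("Contents", []))
        # Each rank reads a disjoint subset of objects: isolated access,
        # no read contention between nodes.
        for key in keys[self.rank :: self.world_size]:
            body = s3.get_object(Bucket=self.bucket, Key=key)["Body"]
            for line in body.iter_lines():
                yield line

# Hypothetical bucket and per-run prefix; the prefix isolates this run's data.
dataset = S3LineStream("training-data-bucket", "runs/example-run/data/",
                       rank=int(os.environ["RANK"]),
                       world_size=int(os.environ["WORLD_SIZE"]))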

Monitoring & metrics

Tracking metrics at both the training and GPU levels is instrumental in understanding model behavior and catching problems early. A robust tool for monitoring training progress and experiment results is Weights & Biases (WandB), widely recognized as an industry standard: it collects training and system diagnostics alike, such as loss, accuracy, CPU/GPU load, power consumption, and memory usage, and provides an interface for visualizing results and collaborating on model training.
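
Wiring WandB into a multi-node run is mostly a matter of making sure only one process creates the run. A minimal sketch, with a hypothetical project name; WandB collects the system-level diagnostics (GPU load, power draw, memory) automatically once a run is active:

import os

import wandb

def init_metrics():
    # Only the global rank-0 worker opens a run, so the whole multi-node job
    # shows up as a single experiment rather than one run per process.
    if int(os.environ.get("RANK", "0")) == 0:
        wandb.init(project="quilt-multinode",  # hypothetical project name
                   config={"nnodes": os.environ.get("NNODES"),
                           "ngpus": os.environ.get("NGPUS")})

def log_step(step, loss):
    # Training-level metrics are logged explicitly; system metrics are
    # captured by WandB itself in the background.
    if int(os.environ.get("RANK", "0")) == 0:
        wandb.log({"train/loss": loss}, step=step)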

Failure recovery
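
With model checkpoints stored on S3 (see Storage above), recovering from a lost node amounts to restarting the Job and resuming from the last uploaded checkpoint. A minimal sketch, assuming a DDP-wrapped model and a hypothetical bucket layout:

import io

import boto3
import torch

s3 = boto3.client("s3")
BUCKET, KEY = "training-data-bucket", "runs/example-run/checkpoint.pt"  # hypothetical

def save_checkpoint(model, optimizer, step):
    # Rank 0 serializes the unwrapped model and uploads it to S3, so any
    # replacement node can resume without relying on local disk state.
    buf = io.BytesIO()
    torch.save({"model": model.module.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, buf)
    buf.seek(0)
    s3.upload_fileobj(buf, BUCKET, KEY)

def load_checkpoint(model, optimizer):
    # Every rank downloads and restores the same checkpoint on restart.
    buf = io.BytesIO()
    s3.download_fileobj(BUCKET, KEY, buf)
    buf.seek(0)
    state = torch.load(buf, map_location="cpu")
    model.module.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]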

Networking

Network performance plays a pivotal role in the efficiency and cost-effectiveness of long training runs. Most major cloud providers offer networking solutions tailored to applications that demand heavy inter-node communication; for Kensho, running on AWS, that means Elastic Fabric Adapter (EFA). Using EFA from Kubernetes comes with a few requirements:

  • Nodes must be confined to the same Availability Zone.

  • The EFA Kubernetes device plugin is needed to detect the EFA interfaces and advertise them to the cluster.

  • Kubernetes pods must declare the resource type vpc.amazonaws.com/efa in their resource requests, akin to CPU and memory, to access their host's EFA interfaces; see the fragment after this list.
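
In practice, the last requirement is a one-line addition to the container's resource limits in the Job manifest; the number of interfaces that can be requested depends on the instance type:

resources:
  limits:
    nvidia.com/gpu: '4'
    # Expose the host's EFA interface(s) to the pod; the available count
    # varies by instance type.
    vpc.amazonaws.com/efa: '1'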

Cost

As models become more advanced, training expenses increase, requiring a careful assessment of cost-performance trade-offs. The main consideration is the choice between a fully on-premise setup, a fully cloud service setup, or a hybrid on-premise and cloud approach. On-premise setups involve significant upfront costs and may not be ideal for initial model development. Cloud providers offer flexibility with on-demand and pay-as-you-go services, but other factors such as machine types, storage devices, and networking infrastructure also influence cost. For example, AWS offers specialized GPU-optimized machines like P4d instances with EFA, which enhance node communication, possibly leading to reduced training time and overall cost. But for smaller training jobs, the benefits of using P4d instances might not justify the cost.

Conclusion
