This document describes how to run a PyTorch training job.
The following steps are based on the official distributed training examples of PyTorch-Operator.
The training code is the mnist.py sample from the official Kubeflow website.
Creating the training image is straightforward: start from an official image based on PyTorch 1.0, copy the above code into the image, and configure entrypoint. (If entrypoint is not configured, you can instead specify the startup command when submitting the PyTorchJob.)
Note: The training code is written for PyTorch 1.0. Because APIs may be incompatible across PyTorch versions, you may need to adjust the training code if you run it on a different PyTorch version.
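The training script is expected to accept startup arguments such as the --backend option used later in this document. The following is a minimal sketch of how such a script might parse that option and decide whether to initialize distributed training. It is not the actual mnist.py code. The helper names (parse_args, should_distribute, init_distributed) are hypothetical, and it assumes that the operator injects WORLD_SIZE (together with RANK, MASTER_ADDR, and MASTER_PORT) into each replica.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the startup arguments passed through entrypoint or args."""
    parser = argparse.ArgumentParser(description="Distributed MNIST (sketch)")
    parser.add_argument(
        "--backend",
        choices=["nccl", "gloo", "mpi"],
        default="gloo",
        help="torch.distributed backend; nccl requires NVIDIA GPUs",
    )
    return parser.parse_args(argv)


def should_distribute():
    """Return True when more than one process takes part in the job.

    Assumption: WORLD_SIZE is set in the container environment by the operator.
    """
    return int(os.environ.get("WORLD_SIZE", "1")) > 1


def init_distributed(backend):
    """Initialize the default process group if the job is distributed."""
    # Imported lazily so the argument handling above works without PyTorch.
    import torch.distributed as dist

    if dist.is_available() and should_distribute():
        # With no init_method given, PyTorch reads MASTER_ADDR, MASTER_PORT,
        # RANK, and WORLD_SIZE from the environment.
        dist.init_process_group(backend=backend)
        return True
    return False
```

A script built this way would call parse_args() at startup and pass args.backend to init_distributed() before creating the model.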
Prepare a PyTorchJob YAML file that defines one master and one worker.
Note
- Replace the <training image> placeholder with the address of the uploaded training image.
- Because GPU resources are requested in the resource configuration, set the training backend to "nccl" in args. For jobs that use no (NVIDIA) GPU resources, use another backend such as gloo.
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: <training image>
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: <training image>
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
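For jobs without GPU resources, the note above suggests another backend such as gloo. One way to keep the backend and the GPU limit consistent is to generate the manifest instead of editing it by hand. The sketch below builds the same structure as the YAML above as a Python dict and writes it as JSON, which kubectl create -f also accepts. The function names and the rule "request one GPU only when the backend is nccl" are illustrative assumptions, not part of PyTorch-Operator.

```python
import json


def build_pytorch_job(name, image, backend="nccl", workers=1):
    """Return a PyTorchJob manifest with one master and `workers` workers."""

    def replica(replicas):
        container = {
            "name": "pytorch",
            "image": image,
            "args": ["--backend", backend],
        }
        # Assumption for this sketch: only nccl jobs request a GPU.
        if backend == "nccl":
            container["resources"] = {"limits": {"nvidia.com/gpu": 1}}
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "metadata": {
                    "annotations": {"sidecar.istio.io/inject": "false"}
                },
                "spec": {"containers": [container]},
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


def write_manifest(job, path):
    """Write the manifest as JSON so it can be passed to kubectl create -f."""
    with open(path, "w") as f:
        json.dump(job, f, indent=2)
```

For example, build_pytorch_job("pytorch-dist-mnist-gloo", "<training image>", backend="gloo") produces a CPU-only variant of the job with no GPU limit. The job name here is a made-up example.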
Run the following command to submit the PyTorchJob with kubectl:
kubectl create -f ./pytorch_job_mnist_nccl.yaml
Run the following command to view the PyTorchJob:
kubectl get -o yaml pytorchjobs pytorch-dist-mnist-nccl
Run the following command to view Pods created by the PyTorch job:
kubectl get pods -l pytorch_job_name=pytorch-dist-mnist-nccl
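To check the job from a script rather than by reading the command output, the same pod query can be run with -o json and summarized. The following is a rough sketch under the assumption that kubectl is on the PATH and configured for the cluster. The helper names fetch_pods and pod_phases are hypothetical, and the label selector is the one used in the command above.

```python
import json
import subprocess


def pod_phases(pod_list):
    """Map each pod name to its phase from a parsed `kubectl get pods -o json` result."""
    return {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in pod_list.get("items", [])
    }


def fetch_pods(job_name):
    """Run kubectl and return the pod list of the given PyTorchJob as a dict."""
    result = subprocess.run(
        ["kubectl", "get", "pods",
         "-l", f"pytorch_job_name={job_name}",
         "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling pod_phases(fetch_pods("pytorch-dist-mnist-nccl")) returns a dict of pod names and phases such as Pending, Running, or Succeeded.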