This document describes how to run a TF training job.
The following steps are based on the official TF-Operator example of distributed training in parameter server/worker mode.
This example uses the dist_mnist.py code sample from the official Kubeflow website.
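In parameter server/worker mode, TF-Operator passes the cluster layout to each replica through the TF_CONFIG environment variable. The sketch below shows one way a training script could read it. The helper name parse_tf_config and the host names in the example are hypothetical, chosen for illustration only, and this is not necessarily how dist_mnist.py itself handles the variable.

```python
import json


def parse_tf_config(raw):
    """Return (task_type, task_index, cluster) from a TF_CONFIG JSON string.

    An empty or missing value yields (None, 0, {}), which a script could
    treat as "run locally, not as part of a cluster".
    """
    config = json.loads(raw or "{}")
    task = config.get("task", {})
    cluster = config.get("cluster", {})
    return task.get("type"), int(task.get("index", 0)), cluster


# Hypothetical TF_CONFIG value for worker 1 of a job with
# two parameter servers and four workers (addresses are illustrative).
example = json.dumps({
    "cluster": {
        "ps": ["job-ps-0:2222", "job-ps-1:2222"],
        "worker": ["job-worker-%d:2222" % i for i in range(4)],
    },
    "task": {"type": "worker", "index": 1},
})

task_type, task_index, cluster = parse_tf_config(example)
print(task_type, task_index, len(cluster["ps"]), len(cluster["worker"]))
```

Inside a real pod, the script would pass os.environ.get("TF_CONFIG") to the helper instead of the example string.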
Creating the image is straightforward: take an official image based on TensorFlow 1.5.0, copy dist_mnist.py into it, and configure the entrypoint.
Note: If the entrypoint is not configured, you can instead configure the container startup command when submitting the TFJob.
Prepare a TFJob YAML file (named tf_job_mnist.yaml in the commands that follow) that defines two parameter servers and four workers.
Note: You need to replace the <training image> placeholder with the address of the uploaded training image.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dist-mnist-for-e2e-test"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: <training image>
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: <training image>
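If you prefer to generate the manifest rather than edit it by hand, the same structure can be built as a Python dictionary and printed as JSON, which kubectl also accepts. This is a minimal sketch: the function build_tfjob and the script path /opt/dist_mnist.py are hypothetical, and the optional command argument corresponds to setting the container startup command when the image has no entrypoint.

```python
import json


def build_tfjob(name, image, ps_replicas=2, worker_replicas=4, command=None):
    """Build a TFJob manifest with PS and Worker replica specs."""

    def replica(count):
        container = {"name": "tensorflow", "image": image}
        if command:
            # Only needed when the image has no entrypoint configured.
            container["command"] = list(command)
        return {
            "replicas": count,
            "restartPolicy": "Never",
            "template": {"spec": {"containers": [container]}},
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica(ps_replicas),
                "Worker": replica(worker_replicas),
            }
        },
    }


manifest = build_tfjob(
    "dist-mnist-for-e2e-test",
    "<training image>",  # replace with the address of the uploaded image
    command=["python", "/opt/dist_mnist.py"],  # hypothetical path
)
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file lets you submit it with kubectl create -f in the same way as the YAML file.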
Run the following command to submit the TFJob with kubectl:
kubectl create -f ./tf_job_mnist.yaml
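Run the following commands to view the job status and the pods that belong to the job:
kubectl get tfjob dist-mnist-for-e2e-test -o yaml
kubectl get pods -l tf-job-name=dist-mnist-for-e2e-test
Note: The label key depends on the operator version. If the selector returns nothing, run kubectl get pods --show-labels to check which labels the pods carry.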
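The job status returned by kubectl includes a list of conditions such as Created, Running, Succeeded, and Failed. The sketch below shows one way to extract the current state from that output. The function names are hypothetical, and the code assumes the conditions are listed in chronological order, so it treats the last condition whose status is "True" as the current one.

```python
import json
import subprocess


def current_condition(tfjob):
    """Return the type of the last condition with status "True", or None."""
    conditions = tfjob.get("status", {}).get("conditions", [])
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    return active[-1] if active else None


def fetch_tfjob(name):
    """Fetch a TFJob as a dict; requires kubectl and cluster access."""
    result = subprocess.run(
        ["kubectl", "get", "tfjob", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Illustrative status for a finished job (not real cluster output).
sample = {
    "status": {
        "conditions": [
            {"type": "Created", "status": "True"},
            {"type": "Running", "status": "False"},
            {"type": "Succeeded", "status": "True"},
        ]
    }
}
print(current_condition(sample))
```

Against a live cluster, current_condition(fetch_tfjob("dist-mnist-for-e2e-test")) would report the state of the job submitted above.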