This series of documents describes how to deploy deep learning in TKE Serverless, from direct TensorFlow deployment to subsequent Kubeflow deployment, and is intended to provide a comprehensive scheme for implementing container-based deep learning.
This document proceeds to run a deep learning task in TKE Serverless with a self-built cluster after the steps in Building Deep Learning Container Image are completed. The self-built image has been uploaded to the image repository ccr.ccs.tencentyun.com/carltk/tensorflow-model, which can be pulled directly with no rebuild required.
Please create a TKE Serverless cluster as instructed in Connecting to a Cluster.
Note: As you need to run a GPU-based training task, when creating the cluster, pay attention to the resources supported in the availability zone of the selected container network, and be sure to select an AZ that supports GPU, as shown below:
The container will be automatically deleted and its resources released after the task ends. Therefore, to persistently store models and data, we recommend you mount an external storage service such as CBS, CFS, or COS.
In this example, CFS is used as an NFS disk to persistently store data with frequent reads and writes.
Note: The CFS file system must be created in the same region as the cluster.
Note: Note down the IPv4 address in the mount target details, such as `10.0.0.161:/`, which will be used as the NFS path in the subsequent mount configuration.
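In a YAML manifest, the CFS mount target can be referenced as an ordinary Kubernetes NFS volume. The fragment below is a sketch: the server IP is the mount target address noted above, while the volume name and mount path are illustrative, not values fixed by the sample image.

```yaml
# Pod spec fragment: mount the CFS file system as an NFS volume.
spec:
  containers:
    - name: tensorflow-model
      image: ccr.ccs.tencentyun.com/carltk/tensorflow-model
      volumeMounts:
        - name: data-vol
          mountPath: /data        # illustrative path for models and datasets
  volumes:
    - name: data-vol
      nfs:
        server: 10.0.0.161        # IPv4 address of the CFS mount target
        path: /                   # NFS export path from the mount target details
```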
This task uses the MNIST handwritten digit recognition dataset and two-layer CNN as an example. The sample image is the self-built image created in the previous chapter. If you need to use a custom image, please see Creating Deep Learning Container Image. Two task creation methods are provided below:
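For reference, a minimal sketch of the kind of two-layer CNN this task trains on MNIST. The layer sizes and function name here are assumptions for illustration, not the exact contents of the sample image's training script.

```python
import tensorflow as tf

def build_model() -> tf.keras.Model:
    """Two-layer CNN for 28x28 grayscale MNIST digits (illustrative sketch)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),   # first conv layer
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),   # second conv layer
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),    # 10 digit classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```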
Given that a deep learning training task runs to completion, a Job is used for deployment in this document. For more information on how to deploy a Job, please see Job Management. The following is an example of deployment in the console:
Note
- As the dataset may need to be downloaded online, you need to configure public network access for the cluster. For more information, please see Public Network Access.
- After selecting a GPU model, when setting the request and limit, you need to assign the container CPU and memory resources that meet the resource specifications; the values do not need to be exact. When configuring in the console, you can also delete the default configuration and leave the fields empty, i.e., "unlimited" resources, which still have corresponding billing specifications. This approach is recommended.
- The container running command is inherited from Docker's `CMD` field, whose preferred form is `exec`. If you do not call `shell`, there will be no normal shell processing. Therefore, to run a command in the shell form, you need to add `"sh"` and `"-c"` at the beginning. When you enter multiple commands and parameters in the console, enter one per line (separated by line breaks).
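The notes above can be sketched as a Job manifest. This is illustrative only: the Job name, script path, and GPU type are assumptions, and the exact annotation keys for specifying a GPU model in TKE Serverless should be verified against the current Tencent Cloud documentation.

```yaml
# Illustrative Job manifest for the MNIST training task.
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-mnist                       # example name
spec:
  template:
    metadata:
      annotations:
        # GPU model annotation for TKE Serverless; confirm the exact key
        # and supported values in the official docs for your region.
        eks.tke.cloud.tencent.com/gpu-type: T4
    spec:
      containers:
        - name: tensorflow-model
          image: ccr.ccs.tencentyun.com/carltk/tensorflow-model
          # Shell-form command: prepend "sh" and "-c" as described above.
          command:
            - "sh"
            - "-c"
            - "python /app/train.py"   # script path is illustrative
          resources:
            limits:
              nvidia.com/gpu: "1"      # one GPU for the training task
      restartPolicy: Never
```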
You can view the running result either in the console or on the command line:
After creating a Job, you will be redirected to the Job management page by default. You can also enter the page as follows:
Deployment in TKE is almost the same as that in TKE Serverless. Taking deployment through kubectl with a YAML file as an example, TKE has the following differences: `annotations` and `resources` are not needed. In practice, you can keep `annotations`, which TKE will not process, but we recommend you comment out `resources`, as it may impose unreasonable resource requirements.

If you encounter any problems during this practice, please see FAQs for troubleshooting.