
Using GPU Instance to Train ViT Model
Last updated: 2024-01-11 17:11:13
Note:
This document is written by a Cloud GPU Service user and is for study and reference only.

Overview

This document describes how to use a GPU instance to train a ViT model offline to complete a simple image classification task.

ViT Model Overview

The Vision Transformer (ViT) model was proposed by Alexey Dosovitskiy et al. and achieves state-of-the-art (SOTA) results on multiple image recognition tasks.



ViT splits an input image into multiple patches. Each patch is embedded, added to a position embedding, and concatenated with a learnable class token before being fed into the Transformer encoder. The encoder output at the class token position is then passed through a classification head to produce the final prediction. During pretraining, the prediction target can instead be a masked patch.
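Conceptually, the forward pass described above can be summarized with the following minimal PyTorch sketch. It is for illustration only; the training later in this document uses the vit_tiny_patch16_224 implementation from pytorch-image-models, and all dimensions here are assumptions:

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=5):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: split the image into patches and project each patch to `dim`
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable class token and position embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        b = x.shape[0]
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls_token.expand(b, -1, -1)               # (B, 1, dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend class token, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                            # classify from the class token output

model = TinyViT()
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 5])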

Instance Environment

Instance type: You can select a GN7 or GN8 model. According to the GPU performance comparison in Tesla P40 vs Tesla T4, the Turing-architecture T4 outperforms the Pascal-architecture P40, so GN7.5XLARGE80 is selected in this document.
Region: As large datasets may need to be uploaded, we recommend selecting the region with the lowest latency. This document uses an online ping tool for testing; since the Chongqing region, where GN7 instances are available, has the lowest latency from the test location, it is selected in this example.
System disk: 100 GB Premium Cloud Storage disk.
Operating system: Ubuntu 18.04.
Bandwidth: 5 Mbps.
Local operating system: macOS.

Directions

Setting up passwordless login for your instance (optional)

1. (Optional) Configure a server alias in ~/.ssh/config on your local machine. In this document, the alias tcg is used (a sample configuration is shown after these steps).
2. Run the ssh-copy-id command to copy the SSH public key of your local machine to the GPU instance.
3. Run the following command in the GPU instance to disable password login to enhance security:
echo 'PasswordAuthentication no' | sudo tee -a /etc/ssh/sshd_config
4. Run the following command to restart the SSH service:
sudo systemctl restart sshd
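Below is a sample ~/.ssh/config entry for the alias configured in step 1 (a minimal sketch; the IP address, user name, and key path are placeholders that depend on your instance):

Host tcg
    HostName <public IP of your GPU instance>
    User ubuntu
    IdentityFile ~/.ssh/id_rsa

With this entry in place, ssh tcg and ssh-copy-id tcg address the GPU instance directly.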

Configuring the PyTorch-GPU development environment

To develop with PyTorch on the GPU, you need to further configure the environment as follows:
1. Install the NVIDIA graphics card driver.
Run the following command to install the NVIDIA graphics card driver:
sudo apt install nvidia-driver-418
After the installation is completed, run the following command to check whether the installation is successful:
nvidia-smi
If the following result is returned, the installation is successful.



2. Configure the conda environment.
Run the following commands to configure the conda environment:
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh
chmod +x Miniconda3-py39_4.11.0-Linux-x86_64.sh
./Miniconda3-py39_4.11.0-Linux-x86_64.sh
rm Miniconda3-py39_4.11.0-Linux-x86_64.sh
3. Edit the ~/.condarc file to add the following software source information, which replaces the default conda software sources with the Tsinghua mirror.
channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
4. Run the following command to set the pip source to the Tencent Cloud image source.
pip config set global.index-url https://mirrors.cloud.tencent.com/pypi/simple
5. Install PyTorch.
Run the following command to install PyTorch:
conda install pytorch torchvision cudatoolkit=11.4 -c pytorch --yes
Run the following commands to check whether PyTorch is installed successfully:
python
import torch
print(torch.cuda.is_available())
If the following result is returned, PyTorch is installed successfully:




Preparing the experiment data

The task in this training is image classification, using the flower classification dataset referenced in the Tencent Cloud online documentation. The dataset contains five classes of flowers and is 218 MB in size. Below are sample images from each class:



In the raw dataset, the images of each class are stored in a folder named after that class. You need to convert the data to the standard ImageNet directory layout and split it into training and validation sets at a 4:1 ratio. Use the following code to perform the conversion:
# Split data into a training set and a validation set, train:val = scale:1
import shutil
import os
import math

scale = 4
data_path = '../raw'
data_dst = '../train_val'

# Create the ImageNet directory structure
os.mkdir(data_dst)
os.mkdir(os.path.join(data_dst, 'train'))
os.mkdir(os.path.join(data_dst, 'validation'))

for item in os.listdir(data_path):
    item_path = os.path.join(data_path, item)
    if os.path.isdir(item_path):
        train_dst = os.path.join(data_dst, 'train', item)
        val_dst = os.path.join(data_dst, 'validation', item)
        os.mkdir(train_dst)
        os.mkdir(val_dst)
        files = os.listdir(item_path)
        print(f'Class {item}:\n\t Total sample count is {len(files)}')
        split_idx = math.floor(len(files) * scale / (1 + scale))
        print(f'\t Train sample count is {split_idx}')
        print(f'\t Val sample count is {len(files) - split_idx}\n')
        for idx, file in enumerate(files):
            file_path = os.path.join(item_path, file)
            # The first split_idx files go to the training set, the rest to validation
            if idx < split_idx:
                shutil.copy(file_path, train_dst)
            else:
                shutil.copy(file_path, val_dst)

print(f'Split complete. File path: {data_dst}')
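Assuming the script above is saved as split_dataset.py (the file name is arbitrary) in a directory alongside ../raw, run it once to produce the ../train_val directory:

python3 split_dataset.py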
Below is the dataset overview:
Class roses:
    Total sample count is 641
    Train sample count is 512
    Validation sample count is 129
Class sunflowers:
    Total sample count is 699
    Train sample count is 559
    Validation sample count is 140
Class tulips:
    Total sample count is 799
    Train sample count is 639
    Validation sample count is 160
Class daisy:
    Total sample count is 633
    Train sample count is 506
    Validation sample count is 127
Class dandelion:
    Total sample count is 898
    Train sample count is 718
    Validation sample count is 180
To accelerate the training process, you need to further convert the dataset into a format that works with the NVIDIA Data Loading Library (DALI). DALI can offload data preprocessing from the CPU to the GPU. Since the data is already in the ImageNet format, you can simply run the following commands to prepare it for DALI:
git clone https://github.com/ver217/imagenet-tools.git
cd imagenet-tools && python3 make_tfrecords.py \
    --raw_data_dir="../train_val" \
    --local_scratch_dir="../train_val_tfrecord" && \
python3 make_idx.py --tfrecord_root="../train_val_tfrecord"

Model training result

To facilitate subsequent training of large distributed models, this document describes how to train and develop a model based on the distributed training framework Colossal-AI. Colossal-AI provides a set of easy-to-use APIs that enable you to perform data, model, pipeline, and mixed parallel training.
Based on the demo provided by Colossal-AI, this document uses the ViT implementation integrated in the pytorch-image-models (timm) repository. The smallest model, vit_tiny_patch16_224, is used at a resolution of 224×224, where each image is divided into 16×16-pixel patches.
1. Run the following commands to install Colossal-AI and pytorch-image-models as instructed in Start Locally:
pip install colossalai==0.1.5+torch1.11cu11.3 -f https://release.colossalai.org
pip install timm
2. Write the following model training code based on the demo provided by Colossal-AI:
from pathlib import Path
from colossalai.logging import get_dist_logger
import colossalai
import torch
import os
from colossalai.core import global_context as gpc
from colossalai.utils import get_dataloader, MultiTimer
from colossalai.trainer import Trainer, hooks
from colossalai.nn.metric import Accuracy
from torchvision import transforms
from colossalai.nn.lr_scheduler import CosineAnnealingLR
from tqdm import tqdm
from titans.utils import barrier_context
from colossalai.nn.lr_scheduler import LinearWarmupLR
from timm.models import vit_tiny_patch16_224
from titans.dataloader.imagenet import build_dali_imagenet
from mixup import MixupAccuracy, MixupLoss


def main():
    parser = colossalai.get_default_parser()
    args = parser.parse_args()
    colossalai.launch_from_torch(config='./config.py')

    logger = get_dist_logger()

    # build model
    model = vit_tiny_patch16_224(num_classes=5, drop_rate=0.1)

    # build dataloader
    root = os.environ.get('DATA', '../train_val_tfrecord')
    train_dataloader, test_dataloader = build_dali_imagenet(
        root, rand_augment=True)

    # build criterion
    criterion = MixupLoss(loss_fn_cls=torch.nn.CrossEntropyLoss)

    # optimizer
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

    # lr_scheduler
    lr_scheduler = CosineAnnealingLR(
        optimizer, total_steps=gpc.config.NUM_EPOCHS)

    engine, train_dataloader, test_dataloader, _ = colossalai.initialize(
        model,
        optimizer,
        criterion,
        train_dataloader,
        test_dataloader,
    )

    # build a timer to measure time
    timer = MultiTimer()

    # create a trainer object
    trainer = Trainer(engine=engine, timer=timer, logger=logger)

    # define the hooks to attach to the trainer
    hook_list = [
        hooks.LossHook(),
        hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
        hooks.AccuracyHook(accuracy_func=MixupAccuracy()),
        hooks.LogMetricByEpochHook(logger),
        hooks.LogMemoryByEpochHook(logger),
        hooks.LogTimingByEpochHook(timer, logger),
        hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
        hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
    ]

    # start training
    trainer.fit(train_dataloader=train_dataloader,
                epochs=gpc.config.NUM_EPOCHS,
                test_dataloader=test_dataloader,
                test_interval=1,
                hooks=hook_list,
                display_progress=True)


if __name__ == '__main__':
    main()
Below is the specific model configuration:
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 128
DROP_RATE = 0.1
NUM_EPOCHS = 200

CONFIG = dict(fp16=dict(mode=AMP_TYPE.TORCH))

gradient_accumulation = 16
clip_grad_norm = 1.0

dali = dict(
    gpu_aug=True,
    mixup_alpha=0.2
)
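Assuming the training code above is saved as train.py alongside config.py, a single-GPU run can be launched roughly as follows (a sketch; colossalai.launch_from_torch reads the distributed environment variables set by torchrun):

torchrun --nproc_per_node=1 train.py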
Below is the model execution process. Each epoch takes less than 20 seconds:



The result shows that the highest accuracy of the model on the validation dataset is 66.62%. You can also increase the number of model parameters, for example, by switching to a larger ViT variant from pytorch-image-models.
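Because the TensorboardHook above writes metrics to ./tb_logs, you can also inspect the training curves with TensorBoard (assuming the tensorboard package is installed on the instance) and view them locally through an SSH tunnel:

# On the GPU instance
tensorboard --logdir ./tb_logs --port 6006
# On your local machine, then open http://localhost:6006
ssh -L 6006:localhost:6006 tcg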



Summary

The biggest problem encountered in this example was that cloning from GitHub was very slow. A tunnel and ProxyChains were initially used for acceleration, but such operations violate the CVM use rules and caused a period of unavailability. The problem was eventually resolved by removing the proxy and submitting a ticket. Using a public network proxy does not comply with the CVM use regulations; to guarantee the stable operation of your business, do not violate them.

References

[1] Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv preprint arXiv:2010.11929 (2020).
[2] NVIDIA/DALI
[3] Bian, Zhengda, et al. "Colossal-AI: A Unified Deep Learning System for Large-Scale Parallel Training." arXiv preprint arXiv:2110.14883 (2021).