Are You Sure That You Can Implement Image Classification Networks?

Yoonwoo Jeong · Published in Towards AI · 6 min read · Mar 25, 2022


Image from: https://machinelearningmastery.com/applications-of-deep-learning-for-computer-vision/

Before my recent paper submission, I firmly believed that I could easily implement image classification networks, thanks to well-constructed deep learning frameworks such as PyTorch and JAX. Although these frameworks provide modules for the networks, I realized there are many implementation details to consider in practice. My very first implementation achieved 70% accuracy on my toy dataset, and to reach the desired performance I was forced to use many GPUs to enlarge the effective batch size, which made training slow and expensive. In contrast, after applying the techniques I introduce below, the model achieved 14%p better accuracy than the first model with four times less training time and four times fewer GPUs. It was an amazing experience: the total expected GPU time was reduced to 1/16, even though the model architecture is identical.

Many lectures on deep learning omit the details of actual implementation. Hence, I am going to share personal experiences that might be beneficial for your future research. I referred to a well-known paper, “Bag of Tricks for Image Classification with Convolutional Neural Networks,” which shares useful techniques for designing the training process.

Here is the list of things that I’ve learned in my project.

  • Use learning rate warmup and cosine annealing for the learning rate schedule.
  • Set the learning rate differently depending on the number of GPUs (i.e., the effective batch size) in use.
  • Apply label smoothing for better training stability.
  • Remove bias decay.
  • Take advantage of PyTorch Lightning for convenient implementation.
  • Several minor options in PyTorch can further speed up and simplify your code.

Linear Warmup + Cosine Annealing

When I used an exponentially decaying learning rate scheduler, I had to use a large batch size to reach the desired performance quickly. Accordingly, I had to train on multiple GPUs and slightly modify the network to use SyncBatchNorm instead of regular BatchNorm. Reaching 60% accuracy required training the network for 50 epochs, in other words, about a day. As recommended in the paper “Bag of Tricks for Image Classification with Convolutional Neural Networks,” I changed the learning rate scheduler.

The proposed learning rate scheduler has two stages: linear warmup and cosine annealing. The diagram below shows the full shape of the schedule.

In the warmup stage, the learning rate increases linearly from 0 to the target learning rate. Using a large learning rate right away is numerically unstable because all parameters are randomly initialized at the beginning of training. Indeed, simple decay strategies such as step decay and exponential decay, which start directly at the target learning rate, failed to train ResNet34 and ResNet50 in my toy experiments, even though ResNet18 trained fine. Beyond that, the continuity of the learning rate curve further improves training stability. In my experiments, 5 epochs of linear warmup were generally sufficient. After changing the learning rate scheduler, I could reduce the effective batch size from 4096 to 256 without performance degradation, thanks to the improved training stability.
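As a concrete sketch, here is one way to build such a schedule in plain PyTorch with LambdaLR; the warmup length, total epochs, and base learning rate below are illustrative placeholders rather than the exact values from my experiments.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine_scheduler(optimizer, warmup_epochs, total_epochs):
    """Linear warmup to the base lr, then cosine annealing down to 0."""
    def lr_factor(epoch):
        if epoch < warmup_epochs:
            # Warmup: ramp the lr up over the first `warmup_epochs` epochs.
            return (epoch + 1) / warmup_epochs
        # Cosine annealing: decay from the base lr toward 0 over the remaining epochs.
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_factor)

model = torch.nn.Linear(10, 2)  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = warmup_cosine_scheduler(optimizer, warmup_epochs=5, total_epochs=100)

for epoch in range(100):
    # ... one epoch of training here ...
    scheduler.step()  # step the schedule once per epoch
```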

According to the popular paper archive “Papers with Code,” this learning rate schedule is widely used across a variety of tasks.

Batch-size Dependent Learning Rate

Changing the batch size during training does not change the expectation of the stochastic gradient, but it does change its variance. We can therefore increase the learning rate when using a large batch size, since the gradient variance is smaller than with a small batch size. Prior research substantiates that scaling the learning rate linearly with the batch size works better empirically: double the learning rate when you double the effective batch size, and halve it when the effective batch size is halved.
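In code, this linear scaling rule is a one-liner; the reference values below (learning rate 0.1 at batch size 256) follow the convention used in the Bag of Tricks paper and are not tuned for any particular setup.

```python
def scaled_lr(effective_batch_size, base_lr=0.1, base_batch_size=256):
    # Linear scaling rule: the learning rate grows in proportion to the
    # effective batch size (batch size per GPU * number of GPUs).
    return base_lr * effective_batch_size / base_batch_size

print(scaled_lr(512))  # 0.2  -> batch size doubled, lr doubled
print(scaled_lr(128))  # 0.05 -> batch size halved, lr halved
```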

Label Smoothing

Most classification networks with full supervision are trained with CELoss, an abbreviation of CrossEntropyLoss. The CELoss is formally defined as:
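In standard notation, with predicted class probabilities p_i and target distribution q_i over K classes, the loss for a single example is

$$\ell(p, q) = -\sum_{i=1}^{K} q_i \log p_i,$$

which reduces to $-\log p_y$ when $q$ is the one-hot vector of the true class $y$.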

If we derive the optimal solution for this training objective, the optimal logits are infinite (the log-probabilities of the non-target classes go to negative infinity), which floating-point arithmetic in deep learning libraries cannot handle. Moreover, chasing this extreme optimum encourages the output scores to be dramatically distinctive, potentially leading to overfitting. Thus, it is recommended to use a softened version of CrossEntropyLoss.

Rather than using a hard 0/1 target, label smoothing redistributes a small portion of the target probability mass from the true class to the other classes, producing a smoother target distribution. The optimal solution of the smoothed CrossEntropyLoss is no longer infinite. In my case, I observed a 2%p accuracy improvement after applying label smoothing. For a more detailed description of CrossEntropyLoss, please refer to the link below:
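In recent PyTorch versions (1.10 and later), label smoothing is built into the loss itself; the sketch below uses a smoothing factor of 0.1, which is a common default rather than a value taken from my experiments.

```python
import torch
import torch.nn as nn

# With label_smoothing=0.1, each one-hot target is mixed with a uniform
# distribution over the classes, so the optimal logits remain finite.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 1000)            # batch of 8, 1000 classes
targets = torch.randint(0, 1000, (8,))   # integer class labels
loss = criterion(logits, targets)
```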

Remove Bias Decay

Weight decay is a great regularization technique to avoid overfitting. As pointed out in the paper “Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes,” applying weight decay to bias terms is not preferable for faster training. Thus, I removed the bias decay in my model. In my experiments the effect was negligible, but the paper below reports remarkable differences.
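A common way to implement this in PyTorch is to split the parameters into two groups and set weight_decay=0 for biases (and, typically, other 1-D parameters such as BatchNorm scales and offsets); the snippet below is a sketch of that idea, not my exact training code.

```python
import torch
import torchvision

def param_groups_without_bias_decay(model, weight_decay=1e-4):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1-D parameters cover biases and BatchNorm weights/biases.
        if param.ndim <= 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = torchvision.models.resnet18()  # example network
optimizer = torch.optim.SGD(param_groups_without_bias_decay(model),
                            lr=0.1, momentum=0.9)
```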

Miscellaneous Tips

From here on, I introduce several engineering tips that can make your code simpler and more efficient.

Put the data on local or SSD storage, not on an HDD!

I spent a lot of time on this issue. It was frustrating because there were no bugs; the code was just slow. After profiling the run, I found that disk I/O was the bottleneck. Since my remote server has a large HDD, I used it to store my large-scale toy dataset. However, heavy disk I/O against an HDD can slow your code down dramatically. If you cannot fit the whole dataset in RAM, I strongly recommend putting the data on local storage, or at least on an SSD. (No HDD, please.)
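If you suspect the same problem, a quick sanity check is to time one pass over the DataLoader without any model computation; the dataset path, transforms, and loader settings below are placeholders for your own setup.

```python
import time
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms

# Hypothetical dataset location: point this at your own data (ideally on an SSD).
dataset = torchvision.datasets.ImageFolder(
    "/data/my_toy_dataset/train",
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]),
)
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)

start = time.perf_counter()
for images, labels in loader:  # no forward/backward pass: pure data pipeline
    pass
print(f"one epoch of data loading only: {time.perf_counter() - start:.1f}s")
```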

16-bit (half-)precision training with PyTorch Lightning

As suggested in the official PyTorch Lightning documentation, using 16-bit precision can speed up training. On recent GPUs based on the Volta and Turing architectures, half precision dramatically reduces training time: according to the official NVIDIA documentation, up to a 3x overall speedup has been observed on the most arithmetically intense model architectures, at the cost of only a slight accuracy drop. PyTorch Lightning exposes half-precision training through a single argument, and multi-GPU training is enabled just as easily.

You can simply change the precision with a single Trainer argument when using PyTorch Lightning.
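As a sketch (the exact Trainer flags depend on your Lightning version; this follows the 1.x-style API that was current when this article was written, and MyLightningModule is a placeholder for your own module):

```python
import pytorch_lightning as pl

# model = MyLightningModule()  # your LightningModule

trainer = pl.Trainer(
    gpus=4,          # multi-GPU training is enabled the same way
    precision=16,    # switch to 16-bit (mixed) precision with one argument
    max_epochs=100,
)
# trainer.fit(model)
```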

In this article, I have shared practical advice that can make your implementations faster and more convenient. I am sure many readers are already familiar with the principles of image classification, since plenty of lectures cover them thoroughly. I was also confident about my implementation skills, but this experience prompted some self-reflection. I hope this article helps you extend your implementation skills.

Did you enjoy this article? Visit my personal page for more articles!
Writer: Yoonwoo Jeong
Affiliation: POSTECH, computer vision lab
Mail: jeongyw12382@postech.ac.kr
Personal Page: http://github.com/jeongyw12382
