Computer Vision

A Useful New Image Classification Method That Uses Neither CNNs nor Attention

[Paper summary] MLP-Mixer

Makoto TAKAMATSU
Published in Towards AI
4 min read · May 19, 2021


In this post, I would like to introduce MLP-Mixer, presented by Google Research's Brain Team (the same team behind Vision Transformers (ViT)) in May 2021. Interestingly, MLP-Mixer, whose design builds on ViT, can be trained on large datasets almost three times faster while achieving results comparable to state-of-the-art models such as ViT and BiT.

MLP-Mixer is based on multilayer perceptrons (MLPs) repeatedly applied across spatial locations and feature channels, and it has attracted attention as an image classification architecture that uses neither CNNs nor Transformers.

The advantages of using only MLPs are architectural simplicity and computational speed. In addition, the computational complexity of MLP-Mixer is linear in the number of input patches, unlike that of Vision Transformers (ViT), which is quadratic in the number of input patches.
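As a rough illustration of this scaling (the numbers below are my own back-of-the-envelope assumptions, not figures from the paper), compare how the per-layer token-mixing cost grows with the number of patches S when the hidden widths stay fixed:

```python
# Back-of-the-envelope scaling sketch (illustrative only, not from the paper).
# Self-attention compares every patch with every other patch, so its token mixing
# costs on the order of S^2 * C operations, while a token-mixing MLP with a fixed
# hidden width D_S costs on the order of S * D_S * C, i.e. linear in S.
C, D_S = 512, 256
for S in (196, 784, 3136):  # 224x224 images cut into 16-, 8-, and 4-pixel patches
    print(f"S={S:5d}  attention ~ {S * S * C:>13,}  token-MLP ~ {S * D_S * C:>13,}")
```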

Vision Transformers (ViT) continue the long-lasting trend of removing hand-crafted visual features and inductive biases from models and rely further on learning from raw data.

We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. “mixing” the per-location features), and one with MLPs applied across patches (i.e. “mixing” spatial information).

https://arxiv.org/abs/2105.01601

Figure 1: MLP-Mixer architecture.

Points

・A simple multilayer perceptron (MLP)-based model competes with CNN- and attention-based models and trains roughly 3x faster on large datasets (on the order of 100 million images).

・The MLP-Mixer architecture is based on the multilayer perceptron (MLP) and relies only on basic matrix multiplication routines, data layout changes (reshaping and transposition), and scalar nonlinearities.

・When pre-trained on a large dataset (roughly 100 million images), it reaches performance comparable to CNNs and Transformers, with an attractive trade-off between accuracy and computational cost.

MLP-Mixer Architecture

As shown in Fig. 2, MLP-Mixer is similar in architecture to Vision Transformers (ViT) but does not use self-attention. Like ViT, MLP-Mixer first divides the input image into small patches and then feeds each patch through a fully-connected layer to obtain a latent embedding. In other words, every patch in the image corresponds to a vector.

Figure 2: (From left to right) Architecture of Vision Transformers (ViT), Architecture of MLP-Mixer.
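The following is a minimal PyTorch sketch of this per-patch embedding step (the class name and sizes are my assumptions, not the official code). A Conv2d whose kernel size equals its stride is equivalent to cutting the image into non-overlapping patches and pushing each one through the same fully-connected layer:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""
    def __init__(self, patch_size=16, in_channels=3, hidden_dim=512):
        super().__init__()
        # kernel_size == stride: one shared fully-connected layer per patch
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, hidden_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (batch, 196 patches, hidden_dim)

print(PatchEmbed()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 196, 512])
```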

As shown in Figure 3, MLP-Mixer consists of a per-patch linear embedding, a stack of Mixer blocks, and a classifier head. In contrast to Vision Transformers (ViT), Mixer blocks use no positional embedding. This is due to a property of the token-mixing MLP described below: it operates on the patch dimension directly and is therefore already sensitive to the order of the input patches.

Figure 3: Architecture of the MLP-Mixer.

The model takes linearly projected image patches as input and maintains a "patches × channels" representation throughout. Each Mixer layer consists of two MLP blocks, the channel-mixing MLP and the token-mixing MLP, as shown in Fig. 4; a code sketch of the layer follows the two descriptions below. The Mixer layer relies only on basic matrix multiplications, data layout changes (reshaping and transposition), and scalar nonlinearities.

Figure 4: Details of the Mixer layer: token mixing MLP (green part), channel-mixing MLP (orange part).

The token-mixing MLP and the channel-mixing MLP let information interact along both input dimensions: across patches and across channels.

・Token-mixing MLP: mixes features of patches at different spatial locations

The token-mixing MLP (green part) transposes the input table and operates on its columns: each column holds one channel's values across all patches, and the same MLP weights are shared across every channel, so different channels use the same patch-mixing weights. Mixing patches this way is equivalent to a single-channel depth-wise convolution with a full receptive field.

・Channel-mixing MLP: mixes features across different channels

The channel-mixing MLP (orange part) transposes the table back and operates on its rows: each row holds one patch's values across all channels, and the same MLP weights are shared across every patch, so different patches use the same channel-mixing weights. Mixing channels this way is equivalent to a 1×1 convolution.
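Putting the two together, here is a minimal PyTorch sketch of one Mixer layer (an illustration under assumed shapes, not the official implementation; the sizes roughly follow the paper's smallest model, Mixer-S/16). Both blocks are two-layer MLPs with a GELU nonlinearity, each wrapped in LayerNorm and a skip connection; the only difference is which dimension the MLP runs over:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MlpBlock(nn.Module):
    """Two fully-connected layers with a GELU in between."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MixerLayer(nn.Module):
    def __init__(self, num_patches, hidden_dim, tokens_mlp_dim, channels_mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.token_mix = MlpBlock(num_patches, tokens_mlp_dim)     # runs over patches
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.channel_mix = MlpBlock(hidden_dim, channels_mlp_dim)  # runs over channels

    def forward(self, x):                      # x: (batch, patches, channels)
        # Token mixing: transpose so the MLP sees the patch dimension;
        # the same weights are shared by every channel.
        y = self.norm1(x).transpose(1, 2)      # (batch, channels, patches)
        x = x + self.token_mix(y).transpose(1, 2)
        # Channel mixing: the MLP sees the channel dimension;
        # the same weights are shared by every patch.
        return x + self.channel_mix(self.norm2(x))

out = MixerLayer(num_patches=196, hidden_dim=512,
                 tokens_mlp_dim=256, channels_mlp_dim=2048)(torch.randn(2, 196, 512))
print(out.shape)  # torch.Size([2, 196, 512])
```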

Experimental results

As the results table in the paper shows, the proposed method achieves accuracy on par with state-of-the-art image classification models such as ViT and BiT at a lower computational cost.

The paper's figures also show that MLP-Mixer benefits strongly from dataset size: as the pre-training dataset grows, its classification accuracy becomes comparable to ViT's.

Conclusion

MLP-Mixer is a network built from a very simple component, the multilayer perceptron (MLP), yet it demonstrates some very interesting results, such as a low computational cost for training and classification accuracy that improves as the dataset grows.

Google Research, Brain Team also stated the following.

"Hopefully, these results spark further research beyond the realms of well-established models based on convolutions and self-attention transformers."

References

[MLP-Mixer] https://arxiv.org/abs/2105.01601

[ViT] https://arxiv.org/abs/2010.11929

[MLP-Mixer implementation in timm] https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py
