By Manzil Zaheer, Guru Guruganesh, Avinava Dubey et al (Google Research), 2020
The authors present a Transformer attention model whose complexity is linear in the sequence length. They prove mathematically that the model is Turing complete (and thus as powerful as the original quadratic attention model) and report new state-of-the-art results on many NLP tasks involving long sequences (e.g. question answering and summarization), as well as on genomics data.
By Sinong Wang, Belinda Z. Li, Madian Khabsa et al (Facebook AI Research), 2020
This paper proposes an approximation of self-attention for Transformer architectures whose time and space complexity are linear in the sequence length. On benchmark datasets, the resulting model performs similarly to RoBERTa, which is based on the original Transformer with its much less efficient quadratic attention.
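One common way to obtain attention that is linear in the sequence length is to project the keys and values down to a fixed number of rows before computing attention. The sketch below illustrates that idea in plain Python. It is not the authors' implementation: the function name `low_rank_attention`, the projection matrices `e` and `f`, and the sizes `n`, `d`, `r` are assumptions made for illustration, and the projections here are random rather than learned.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([x / total for x in exps])
    return out

def low_rank_attention(q, k, v, e, f):
    """q, k, v are n x d; e and f are r x n projections (hypothetical names).

    The score matrix is n x r instead of n x n, so for a fixed r the cost
    grows linearly with the sequence length n.
    """
    d = len(q[0])
    k_proj = matmul(e, k)                      # r x d
    v_proj = matmul(f, v)                      # r x d
    scores = matmul(q, transpose(k_proj))      # n x r
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scores)             # each row sums to 1
    return matmul(weights, v_proj)             # n x d

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d, r = 8, 4, 2  # sequence length, model width, projected length (assumed values)
q, k, v = rand_matrix(n, d), rand_matrix(n, d), rand_matrix(n, d)
e, f = rand_matrix(r, n), rand_matrix(r, n)
out = low_rank_attention(q, k, v, e, f)
```

The output keeps the shape of ordinary attention (`n` rows of width `d`), so a layer of this kind could in principle be swapped in without changing the rest of the network.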
By Nicolas Carion, Francisco Massa, Gabriel Synnaeve et al (Facebook AI Research), 2020
This paper describes a fully end-to-end object detection system that combines convolutional networks with Transformers. The new model performs on par with Faster R-CNN and can be generalized to other tasks such as panoptic segmentation.
By Yi Tay, Dara Bahri, Donald Metzler et al (Google Research), 2020
It is commonly assumed that self-attention is largely responsible for the superior performance of Transformer models on various NLP tasks. This paper challenges that view: it suggests that replacing the attention weights computed by self-attention layers with random or simply synthesized ones is sufficient to achieve similar results with better efficiency.
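To make the idea concrete, here is a minimal sketch of a layer in which the mixing weights do not depend on the input tokens at all: they come from a fixed matrix of logits that would be randomly initialized (and optionally trained). The function name `random_synthesizer` and the variable names are assumptions for illustration, not the paper's code.

```python
import math
import random

def softmax(row):
    """Softmax of one row, shifted by the maximum for numerical stability."""
    mx = max(row)
    exps = [math.exp(x - mx) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def random_synthesizer(values, logits):
    """Mix the n x d `values` with weights softmax(logits), where `logits` is n x n.

    Unlike dot-product attention, no query-key products are computed:
    `logits` is independent of the input sequence.
    """
    width = len(values[0])
    mixed = []
    for logit_row in logits:
        weights = softmax(logit_row)
        mixed.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return mixed

random.seed(0)
n = 4  # sequence length (assumed value)
values = [[float(i), float(i) * 2.0] for i in range(n)]
logits = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]  # random, input-independent
mixed = random_synthesizer(values, logits)
```

Each output row is a weighted average of the value rows, exactly as in ordinary attention; only the source of the weights differs.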
By Prannay Khosla, Piotr Teterwak, Chen Wang et al (Google Research), 2020
Contrastive loss has recently been shown to be very effective at learning deep neural network representations in the self-supervised setting. The authors adapt it to supervised learning and achieve better results than those obtained with cross-entropy loss for ResNet-50 and ResNet-200.
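A rough sketch of how a contrastive loss can use labels: every other sample that shares the anchor's label is treated as a positive, and everything else in the batch as a negative. The code below is one plausible formulation written in plain Python, not the authors' implementation; the function name, the default `temperature`, and the choice to average only over anchors that have at least one positive are assumptions.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average, over anchors, of the negative mean log-probability of their positives."""
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # assumption: anchors without positives are skipped
        # Temperature-scaled cosine similarity to every other sample.
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        # Log of the softmax denominator, computed stably.
        mx = max(sims.values())
        log_denom = mx + math.log(sum(math.exp(s - mx) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(points, [0, 0, 1, 1])    # labels follow the clusters
loss_mismatched = supervised_contrastive_loss(points, [0, 1, 0, 1])  # labels cut across the clusters
```

When the labels agree with the geometry of the embeddings, the loss is lower than when same-label points sit far apart, which is the behavior the training signal relies on.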
The authors suggest a new ResNet-like network architecture that incorporates attention across groups of feature maps. Compared to previous attention models such as SENet and SKNet, the new attention block applies the squeeze-and-attention operation separately to each of the selected groups, which is done in a computationally efficient way and implemented in a simple modular structure.
This paper describes a new training approach for Transformer network architectures used for language modeling tasks. The authors demonstrate that their technique results in greatly improved training efficiency and better performance on common benchmark datasets (GLUE, SQuAD) compared to other state-of-the-art NLP models of similar size.