
16 posts tagged with "attention-mechanism"


· 10 min read

Paper: Gated Channel Transformation for Visual Recognition

Authors: Zongxin Yang, Linchao Zhu, Yu Wu, and Yi Yang

Code: https://github.com/z-x-yang/GCT

Abstract

  • The GCT module is a generally applicable gated transformation unit that can be optimized jointly with the network weights.
  • Unlike SENet, which learns channel relationships implicitly through fully connected layers, GCT models the relationships between channels explicitly with interpretable variables, deciding whether channels compete or cooperate.

Keywords: interpretability, explicit relationships, gating

Introduction

  • A single convolutional layer operates only on a local neighborhood around each spatial position of the feature map, which can cause local ambiguity. There are generally two ways to address this: increase the network depth, as in VGG and ResNet, or increase the network width to gather more global information, as in GENet, which makes heavy use of neighborhood embeddings, and SENet, which models channel relationships through global embeddings.
  • However, the fc layers used in SENet raise several problems:
    1. Because of the fc layers, and to keep the parameter count down, the module cannot be applied at every layer.
    2. The fc parameters are hard to interpret, making it difficult to analyze the relationships between channels; this is effectively implicit learning.
    3. Placing the module after certain layers causes problems.
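The following is a minimal PyTorch sketch of a GCT-style gating unit based on the formulation summarized above (a per-channel l2-norm embedding, channel normalization, and a tanh gate). The epsilon value and parameter initializations are assumptions for illustration, not the authors' reference code.

```python
# A minimal sketch of a GCT-style gating unit (assumed hyper-parameters,
# not the official implementation from the linked repository).
import torch
import torch.nn as nn

class GCT(nn.Module):
    """Gated Channel Transformation: embedding -> channel normalization -> gating."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        # Three sets of explicit, interpretable per-channel variables.
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))   # embedding weight
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))  # gating weight
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # gating bias
        self.eps = eps

    def forward(self, x):                       # x: (B, C, H, W)
        # Global context embedding: l2 norm over each channel's spatial positions.
        embedding = self.alpha * x.pow(2).sum(dim=(2, 3), keepdim=True).add(self.eps).sqrt()
        # Channel normalization: compare each channel's embedding with the others.
        norm = embedding * (embedding.size(1) ** 0.5) / \
               embedding.pow(2).sum(dim=1, keepdim=True).add(self.eps).sqrt()
        # Gating: 1 + tanh keeps an identity path, so channels can compete or cooperate.
        gate = 1.0 + torch.tanh(self.gamma * norm + self.beta)
        return x * gate
```

Because the gate defaults to the identity mapping when gamma and beta are zero, the unit can be inserted before every convolution without disrupting a pretrained backbone.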

· 8 min read

Paper: CBAM: Convolutional Block Attention Module

Authors: Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, Korea Advanced Institute of Science and Technology, Daejeon, Korea

Abstract

  • CBAM (Convolutional Block Attention Module) is a simple yet effective attention module for feed-forward convolutional neural networks.
  • It is a mixed-domain attention mechanism that sequentially infers attention maps along the channel and spatial dimensions.
  • CBAM is a lightweight, general-purpose module that can be integrated seamlessly into any CNN.

Keywords: object recognition, attention mechanism, gated convolution

Introduction

  • Convolutional neural networks (CNNs) have significantly improved performance on vision tasks thanks to their rich representational power. Current research focuses mainly on three important factors of a network: depth, width, and cardinality.
  • From LeNet to residual networks, networks have become deeper and more expressive; GoogLeNet showed that width is another important factor for improving performance; Xception and ResNeXt increase cardinality, obtaining stronger representational power than depth or width while saving parameters (as argued in the ResNeXt paper).
  • Beyond these factors, this paper investigates a different aspect of architecture design: attention.

· 6 min read

Paper: Involution: Inverting the Inherence of Convolution for Visual Recognition

Authors: Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, Qifeng Chen, The Hong Kong University of Science and Technology, ByteDance AI Lab, Peking University, Beijing University of Posts and Telecommunications

Convolution

  1. Spatially agnostic: the same kernel is used at every position.
    • Pros: parameter sharing and translation equivariance.
    • Cons: the kernel cannot adapt to individual positions, and it cannot be too large, so the receptive field can only be enlarged, and long-range relationships captured, by stacking layers.
  2. Channel-specific: different kernels are used for different channels.
    • Pros: fully exploits the information in each channel.
    • Cons: redundancy across channels.

The convolution kernel has shape B, C_out, C_in, K, K.

Involution

Unlike convolution, involution has exactly the opposite properties:

  1. Spatially specific: the kernel is privatized for each position.
  2. Channel-agnostic: the kernel is shared across channels.

The involution kernel has shape B, G, K×K, H, W.
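Below is a rough PyTorch sketch of one involution layer following the kernel shape above. The kernel-generation path (a reduce/span pair of 1×1 convolutions) matches the paper's description, but the hyper-parameter values (`kernel_size`, `groups`, `reduction`) are assumptions here; see the official repo for the reference implementation.

```python
# A sketch of involution: a spatial-specific, channel-shared kernel of shape
# (B, G, K*K, H, W) is generated from the input and applied to unfolded windows.
import torch
import torch.nn as nn

class Involution(nn.Module):
    def __init__(self, channels, kernel_size=7, groups=16, reduction=4, stride=1):
        super().__init__()
        self.k, self.groups, self.stride = kernel_size, groups, stride
        self.group_channels = channels // groups
        # Kernel generation: each spatial position produces its own G x K*K kernel.
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, groups * kernel_size ** 2, 1)
        self.pool = nn.AvgPool2d(stride, stride) if stride > 1 else nn.Identity()
        self.unfold = nn.Unfold(kernel_size, padding=(kernel_size - 1) // 2, stride=stride)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        h_out, w_out = h // self.stride, w // self.stride
        # Spatial-specific kernel: (B, G, K*K, H_out, W_out).
        kernel = self.span(self.reduce(self.pool(x)))
        kernel = kernel.view(b, self.groups, self.k ** 2, h_out, w_out).unsqueeze(2)
        # Unfolded neighborhoods: (B, G, C/G, K*K, H_out, W_out).
        patches = self.unfold(x).view(b, self.groups, self.group_channels,
                                      self.k ** 2, h_out, w_out)
        # Multiply-add over K*K, sharing the kernel across channels within a group.
        out = (kernel * patches).sum(dim=3)
        return out.view(b, c, h_out, w_out)
```

Because the kernel is generated from the input itself, a large kernel size (e.g. 7) costs far fewer parameters than the equivalent convolution.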

· 21 min read

Paper: Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

Code: https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py

Preface

Encoder-decoder models based on RNNs or CNNs dominate NLP, but they are not without flaws:

  • RNN models such as LSTM and GRU are constrained by their inherently sequential structure and cannot compute in parallel; when sequences are long, they are especially inefficient. Recent work such as factorization tricks [1] and conditional computation [2] has improved efficiency and performance to some extent, but the limitation of sequential computation remains.
  • CNN models such as the Extended Neural GPU [3], ByteNet [4], and ConvS2S [5] can compute in parallel, but learning dependencies between two arbitrary positions remains difficult, with computational cost growing linearly or logarithmically with distance.

Google instead abandoned the structures these mainstream models rely on and proposed the Transformer, built entirely on attention, with advantages the other models cannot match:

  • The Transformer can be trained efficiently in parallel and is therefore very fast; it was trained in 3.5 days on 8 GPUs.
  • For learning long-range dependencies, the Transformer reduces the path length to a constant number of operations, and uses multi-head attention to counteract the reduced effective resolution caused by averaging over attention-weighted positions (see the sketch below).
  • The Transformer is an auto-encoding model, so it can use left and right context at the same time.
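As a reference for the constant path length and the multi-head design mentioned above, here is a minimal sketch of scaled dot-product attention; shapes and masking are simplified, and the full model in the linked repository adds the projections, multiple heads, and positional encodings.

```python
# Scaled dot-product attention: every position attends to every other position
# in a single step, which is why the path length between positions is constant.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (batch, heads, len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)                # attention distribution
    return weights @ v                                 # weighted sum of the values

# Multi-head attention simply runs several of these in parallel on linearly
# projected q/k/v and concatenates the results.
```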

Overall Structure

The overall structure of the Transformer is an encoder-decoder. Auto-encoding models are mainly used for language understanding; for generation tasks, autoregressive models still have the edge.

(figure: the overall Transformer architecture)

We can divide it into four parts: the input, the encoder block, the decoder block, and the output.

Let's walk through the structure in order. Please study the figure above carefully before reading on, and refer back to it as you read.

· 9 min read
Gavin Gong

Squeeze-and-Excitation Networks (SENet) is an image recognition architecture published in 2017 by the autonomous-driving company Momenta. It models the correlations between feature channels and strengthens the important features to improve accuracy. The architecture won the ILSVRC 2017 classification competition with a top-5 error of 2.251%, about 25% lower (relatively) than the 2016 winner.

The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%. Models and code are available at this https URL.

(figure: the structure of the SE block)

The main innovation of SENet is a single module. As shown in the figure above, Ftr is a regular convolution; its input X (C'×W'×H') and output U (C×W×H) already exist in a conventional architecture. The SENet module is the part after U. Through this design, SENet introduces a form of attention.
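A minimal sketch of the part after U, i.e. the SE block (squeeze by global average pooling, excitation by a two-layer bottleneck, then channel-wise rescaling); the reduction ratio of 16 is the commonly used default and an assumption here.

```python
# SE block: squeeze -> excitation -> rescale.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # squeeze: global spatial average
        self.fc = nn.Sequential(                        # excitation: bottleneck FC layers
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, u):                               # u: (B, C, H, W), output of Ftr
        b, c, _, _ = u.shape
        s = self.pool(u).view(b, c)                     # (B, C) channel descriptor
        w = self.fc(s).view(b, c, 1, 1)                 # per-channel gate in (0, 1)
        return u * w                                    # recalibrated features
```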

· 7 min read
Gavin Gong


Sanghyun Woo, Jongchan Park, Joon-Young Lee, In So Kweon

We propose Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs. We validate our CBAM through extensive experiments on ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show consistent improvements in classification and detection performances with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.

CBAM combines channel attention and spatial attention, and achieves good results by stacking the two within a single module. Besides working on a network's depth and width, attention is another direction for improving CNN performance. Attention not only tells the model where to focus, it also improves the representation of what is focused on. The goal is to increase expressiveness by using attention to emphasize important features and suppress unnecessary ones.

To emphasize meaningful features along both the spatial and channel dimensions, the authors apply a channel attention module and a spatial attention module in sequence, each refining what the network learns along its own dimension. Splitting the attention process into separate channel and spatial parts not only saves parameters and computation, it also keeps CBAM a plug-and-play module that can be integrated into existing architectures.
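A compact sketch of the two sub-modules applied in sequence, assuming the usual average+max pooling descriptors and a 7×7 convolution for the spatial map; treat it as an illustration rather than the authors' exact code.

```python
# CBAM: channel attention first, then spatial attention.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP for both descriptors
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))   # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))    # max-pooled descriptor
        return torch.sigmoid(avg + mx)                     # (B, C, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                  # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)                   # (B, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # refine along the channel dimension first
        return x * self.sa(x)   # then along the spatial dimension
```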

· 19 min read
Gavin Gong

The non-local block is a popular module for strengthening the context modeling ability of a regular convolutional neural network.

Non-local aims to build long-range relationships between pixels within a single layer, and is a form of self-attention. Common CNN and RNN structures operate on local regions. In a convolutional network, for example, each convolution relates only the pixels within a small region (because the kernel is small).

To build long-range dependencies between pixels, that is, relationships between non-adjacent pixels in an image, this paper takes a different route and proposes non-local operations for building non-local neural networks, tackling a core problem of deep networks: capturing long-range dependencies.

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code is available at this https URL .

The paper proposes non-local operations to build non-local neural networks, solving the problem of long-range pixel dependencies; the original paper is well worth reading. Inspired by the classical non-local means method in computer vision, the non-local operation computes the response at a position as a weighted sum of the features at all positions. This operation can be plugged into many computer vision architectures and performs well on video classification, object detection, recognition, segmentation, and other tasks.
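A sketch of a non-local block in its embedded-Gaussian form, where the response at each position is a softmax-weighted sum over the features at all positions; the 1×1 embeddings and the halved intermediate channel count follow the common instantiation and are assumptions here.

```python
# Non-local block (embedded Gaussian): all-pairs relations in one layer.
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels, inter_channels=None):
        super().__init__()
        inter = inter_channels or channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)   # query embedding
        self.phi = nn.Conv2d(channels, inter, 1)     # key embedding
        self.g = nn.Conv2d(channels, inter, 1)       # value embedding
        self.out = nn.Conv2d(inter, channels, 1)     # restore channel count

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2) # (B, HW, C')
        k = self.phi(x).flatten(2)                   # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)     # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)          # (B, HW, HW): all-pairs weights
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                       # residual connection
```

The HW×HW attention map is what makes the block expensive, which is exactly the observation GCNet (below) builds on.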

· 10 min read
Gavin Gong

Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, Han Hu

GCNet (original paper: GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond) follows a line of thinking similar to DPN: it examines the strengths and weaknesses of Non-local and SENet in depth, then combines their advantages to propose GCNet.

The Non-Local Network (NLNet) presents a pioneering approach for capturing long-range dependencies, via aggregating query-specific global context to each query position. However, through a rigorous empirical analysis, we have found that the global contexts modeled by non-local network are almost the same for different query positions within an image. In this paper, we take advantage of this finding to create a simplified network based on a query-independent formulation, which maintains the accuracy of NLNet but with significantly less computation. We further observe that this simplified design shares similar structure with Squeeze-Excitation Network (SENet). Hence we unify them into a three-step general framework for global context modeling. Within the general framework, we design a better instantiation, called the global context (GC) block, which is lightweight and can effectively model the global context. The lightweight property allows us to apply it for multiple layers in a backbone network to construct a global context network (GCNet), which generally outperforms both simplified NLNet and SENet on major benchmarks for various recognition tasks. The code and configurations are released at this https URL.

(figure: (a) the global context modeling framework, (b) the simplified non-local block, (c) the SE block)

GCNet proposes a module framework called the global context modeling framework ((a) in the figure above) and divides it into three steps: context modeling, transform, and fusion.

The paper combines the context modeling step of Non-Local Neural Networks ((b) in the figure above is its simplified form) with the transform step of Squeeze-and-Excitation Networks ((c) in the figure above is one instantiation) into a new module, the Global Context (GC) block, training spatial and channel-wise attention at the same time. This is a good paper; if you are interested, please read the original.
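A sketch of the three steps of a GC block (query-independent context modeling, an SE-style bottleneck transform, and fusion by broadcast addition); the reduction ratio and the LayerNorm inside the bottleneck follow the common instantiation and are assumptions here.

```python
# GC block: context modeling -> transform -> fusion.
import torch
import torch.nn as nn

class GCBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.context_mask = nn.Conv2d(channels, 1, 1)    # query-independent attention map
        self.transform = nn.Sequential(                  # SE-like bottleneck transform
            nn.Conv2d(channels, channels // reduction, 1),
            nn.LayerNorm([channels // reduction, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, h, w = x.shape
        # 1. Context modeling: one softmax attention map shared by every query position.
        weights = torch.softmax(self.context_mask(x).view(b, 1, h * w), dim=-1)
        context = (x.view(b, c, h * w) @ weights.transpose(1, 2)).view(b, c, 1, 1)
        # 2. Transform the global context, 3. fuse it back by broadcast addition.
        return x + self.transform(context)
```

Because the attention map no longer depends on the query position, the cost drops from HW×HW to HW, which is why the block is light enough to place at many layers.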

· 9 min read
Gavin Gong

Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, Han Hu

The non-local block is a popular module for strengthening the context modeling ability of a regular convolutional neural network. This paper first studies the non-local block in depth, where we find that its attention computation can be split into two terms, a whitened pairwise term accounting for the relationship between two pixels and a unary term representing the saliency of every pixel. We also observe that the two terms trained alone tend to model different visual clues, e.g. the whitened pairwise term learns within-region relationships while the unary term learns salient boundaries. However, the two terms are tightly coupled in the non-local block, which hinders the learning of each. Based on these findings, we present the disentangled non-local block, where the two terms are decoupled to facilitate learning for both terms. We demonstrate the effectiveness of the decoupled design on various tasks, such as semantic segmentation on Cityscapes, ADE20K and PASCAL Context, object detection on COCO, and action recognition on Kinetics.

As the title suggests, this paper analyzes the attention mechanism inside the Non-Local block of Non-Local Neural Networks and disentangles its design. The disentangled attention splits into two parts: a pairwise term representing the relationship between pixels, and a unary term representing the saliency of each pixel on its own. In the Non-Local block these two terms are tightly coupled. The paper finds that when the two parts are trained separately, they model different visual cues and achieve good results.
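A simplified sketch of the disentangled attention weights described above: a whitened pairwise term (queries and keys with their means removed) plus a separate unary term, each normalized by its own softmax. This is a rough reading of the decomposition, not the authors' exact implementation.

```python
# Disentangled non-local attention weights: whitened pairwise term + unary term.
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    def __init__(self, channels, inter_channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, inter_channels, 1)  # queries
        self.phi = nn.Conv2d(channels, inter_channels, 1)    # keys
        self.unary = nn.Conv2d(channels, 1, 1)               # per-pixel saliency

    def forward(self, x):                                     # x: (B, C, H, W)
        b, _, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)          # (B, HW, C')
        k = self.phi(x).flatten(2)                            # (B, C', HW)
        # Whitening: subtract the means so only pure pairwise relations remain.
        q = q - q.mean(dim=1, keepdim=True)
        k = k - k.mean(dim=2, keepdim=True)
        pairwise = torch.softmax(q @ k, dim=-1)               # within-region relationships
        unary = torch.softmax(self.unary(x).view(b, 1, h * w), dim=-1)  # salient boundaries
        return pairwise + unary                                # (B, HW, HW) attention weights
```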

The whole paper, from the analysis of Non-Local to the proposal of the new method, is very well organized. If you have time, please read the original paper, Disentangled Non-Local Neural Networks.

Before reading this post, please read Non-Local Neural Networks first.

· 11 min read

Paper: VOLO: Vision Outlooker for Visual Recognition

Authors: Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan

Code: https://github.com/sail-sg/volo

Abstract

  • Visual recognition has been dominated by CNNs for years. Although self-attention-based ViTs have shown great potential on ImageNet classification, their performance still lags behind state-of-the-art CNNs when no extra data is used. In this work, the goal is to close this performance gap and show that attention-based models can indeed outperform CNNs.
  • The authors find that the main factor limiting ViTs on ImageNet classification is their inefficiency in encoding fine-level features into the token representations. To address this, they introduce a new outlook attention and present a simple and general architecture called Vision Outlooker (VOLO). Outlook attention efficiently encodes fine-level features and context into the token representations, which are crucial for recognition performance but often ignored by self-attention.
  • Experiments show that, without any extra training data, VOLO reaches 87.1% top-1 accuracy on ImageNet-1K classification, the first model to exceed 87%. Pretrained VOLO models also transfer well to downstream tasks such as semantic segmentation, reaching 84.3% mIoU on the Cityscapes validation set and 54.3% mIoU on the ADE20K validation set, both new records.

Summary: this paper proposes a new attention mechanism, Outlook Attention. Unlike self-attention, which models global long-range relationships only coarsely, outlook attention encodes neighborhood features more finely within a local window, making up for self-attention's shortcomings in encoding finer-grained features.

Outlooker Attention

The Outlooker module can be viewed as a structure with two separate stages: the first stacks Outlookers to generate refined token representations, and the second deploys a series of transformers to aggregate global information. Before each stage, a patch embedding module maps the input to the required shape.
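To make the first stage concrete, below is a rough single-head sketch of outlook attention with stride 1: each position generates the attention weights for its own K×K neighborhood directly from its feature, with no query-key dot products. The layer names and the omission of multi-head and scaling details are simplifications, not the official VOLO code.

```python
# Outlook attention (single head, stride 1): per-position K*K x K*K attention
# weights are generated by a linear layer and applied to the unfolded local window.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.v = nn.Linear(dim, dim)                        # value projection
        self.attn = nn.Linear(dim, kernel_size ** 4)        # K*K x K*K weights per position
        self.proj = nn.Linear(dim, dim)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                   # x: (B, H, W, C)
        b, h, w, c = x.shape
        k2 = self.k ** 2
        # Unfolded values: each K x K window becomes a set of K*K tokens.
        v = self.v(x).permute(0, 3, 1, 2)                   # (B, C, H, W)
        v = self.unfold(v).view(b, c, k2, h * w)            # (B, C, K*K, H*W)
        # Attention generated directly from the center feature of each window.
        attn = self.attn(x).view(b, h * w, k2, k2).softmax(dim=-1)
        # Weighted aggregation inside each local window.
        out = torch.einsum('bnij,bcjn->bcin', attn, v)      # (B, C, K*K, H*W)
        # Fold the overlapping windows back onto the feature map.
        out = F.fold(out.reshape(b, c * k2, h * w), (h, w), self.k, padding=self.k // 2)
        return self.proj(out.permute(0, 2, 3, 1))           # (B, H, W, C)
```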