6 posts tagged with "non-convolution"

· 6 min read

Paper: Involution: Inverting the Inherence of Convolution for Visual Recognition

Authors: Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, Qifeng Chen (The Hong Kong University of Science and Technology, ByteDance AI Lab, Peking University, Beijing University of Posts and Telecommunications)

Convolution

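  1. Spatial-agnostic: the same kernel is used at every position
    • Pros: parameter sharing, translation equivariance
    • Cons: the parameters cannot be adapted flexibly, and the kernel cannot be made too large, so the only way to enlarge the receptive field and capture long-range relationships is to stack layers
  2. Channel-specific: different kernels for different channels
    • Pros: extracts the information in different channels thoroughly
    • Cons: redundancy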

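The convolution kernel has shape (C_out, C_in, K, K): the same weights are applied to all B samples in a batch and to every spatial position.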

Involution

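In contrast to convolution, involution has exactly the opposite properties:

  1. Spatial-specific: the kernel is private to each position
  2. Channel-agnostic: the kernel is shared across channels

The involution kernel has shape (B, G, K·K, H, W), where G is the number of channel groups that share a kernel.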
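
The following is a minimal NumPy sketch of how a kernel of shape (B, G, K·K, H, W) could be applied to a feature map. The function name `involution_apply` is hypothetical, K is assumed odd with zero padding, and the kernel is taken as given: how it is generated from the input is not covered here.

```python
import numpy as np

def involution_apply(x, kernel, K):
    """Apply a per-position kernel (assumed layout, not the authors' code).

    x:      (B, C, H, W) feature map
    kernel: (B, G, K*K, H, W), one K*K kernel per sample, group and position
    """
    B, C, H, W = x.shape
    G = kernel.shape[1]
    assert C % G == 0 and kernel.shape == (B, G, K * K, H, W)
    pad = K // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    # Unfold: gather the K*K shifted views of the zero-padded input.
    patches = np.stack(
        [xp[:, :, i:i + H, j:j + W] for i in range(K) for j in range(K)],
        axis=2,
    )  # (B, C, K*K, H, W)
    patches = patches.reshape(B, G, C // G, K * K, H, W)
    # The same kernel is broadcast over the C // G channels of each group,
    # but differs from position to position.
    out = (patches * kernel[:, :, None]).sum(axis=3)
    return out.reshape(B, C, H, W)
```

With a kernel that is 1 at the centre tap and 0 elsewhere, the operation reduces to the identity, which is a convenient sanity check.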

· 19 min read
Gavin Gong

The non-local block is a popular module for strengthening the context modeling ability of a regular convolutional neural network.

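Non-local aims to build long-range pixel relationships with a single layer and is a form of self-attention. Common CNN and RNN structures operate on local regions. In a convolutional network, for example, each convolution tries to relate the pixels within a certain region, but the range of that relationship is usually small because the kernel is small.

To model long-range dependencies between pixels, that is, relationships between non-adjacent pixels in an image, the paper takes a different route and proposes building non-local neural networks from non-local operations. In this way it addresses a core problem of deep neural networks: capturing long-range dependencies.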

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code is available at this https URL .

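By building non-local neural networks from non-local operations, the paper tackles the problem of long-range pixel dependencies, and the original paper is well worth reading. Inspired by the classical non-local means method in computer vision, the authors' non-local operation computes the response at a position as a weighted sum of the features at all positions. The operation can be plugged into many computer vision architectures and performs well on video classification, object detection, segmentation, and pose estimation.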
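
The "weighted sum of the features at all positions" can be sketched in a few lines of NumPy. This is a simplified illustration: it assumes the weights come from a softmax over plain dot-product similarities and omits any learned projections; the function name `non_local` is hypothetical.

```python
import numpy as np

def non_local(x):
    """x: (N, C) features at N flattened positions. Returns (N, C)."""
    scores = x @ x.T                               # (N, N) pairwise similarity
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilise the softmax
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)           # each row sums to 1
    # Response at position i = weighted sum of the features at all positions j.
    return w @ x
```

Because every position attends to every other position, the cost grows quadratically with N, which is the price paid for covering the whole feature map in a single layer.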

· 9 min read
Gavin Gong

Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, Han Hu

The non-local block is a popular module for strengthening the context modeling ability of a regular convolutional neural network. This paper first studies the non-local block in depth, where we find that its attention computation can be split into two terms, a whitened pairwise term accounting for the relationship between two pixels and a unary term representing the saliency of every pixel. We also observe that the two terms trained alone tend to model different visual clues, e.g. the whitened pairwise term learns within-region relationships while the unary term learns salient boundaries. However, the two terms are tightly coupled in the non-local block, which hinders the learning of each. Based on these findings, we present the disentangled non-local block, where the two terms are decoupled to facilitate learning for both terms. We demonstrate the effectiveness of the decoupled design on various tasks, such as semantic segmentation on Cityscapes, ADE20K and PASCAL Context, object detection on COCO, and action recognition on Kinetics.

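As the title suggests, this paper analyzes the attention mechanism in the Non-Local block of Non-Local Neural Networks and disentangles its design. After disentangling, the attention splits into two parts: a pairwise term that represents the relationship between pixels, and a unary term that represents the saliency of each pixel. In the Non-Local block these two terms are tightly coupled. The paper finds that, when trained separately, the two terms model different visual cues and each works well.

The whole paper, from the analysis of Non-Local to the proposed method, is very well organized. If you have time, please read the original paper, Disentangled Non-Local Neural Networks.

Before reading this post, please read Non-Local Neural Networks first.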
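
The split into a whitened pairwise term and a unary term can be checked numerically. The sketch below assumes dot-product attention with a softmax over key positions and uses random data; the variable names are illustrative. It only verifies the decomposition and does not show how the paper recombines the two terms.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))   # queries at 5 positions (random stand-in data)
k = rng.normal(size=(5, 4))   # keys at the same positions
mu_q, mu_k = q.mean(axis=0), k.mean(axis=0)

# Whitened pairwise term: similarity after removing the mean query and key.
pairwise = (q - mu_q) @ (k - mu_k).T          # (5, 5)
# Unary term: one score per key position, independent of the query.
unary = k @ mu_q                              # (5,)

coupled = softmax(q @ k.T)
recombined = softmax(pairwise + unary[None, :])
```

The two attention maps agree because the remaining terms of the expansion are constant along each row and cancel inside the softmax.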

· 11 min read

Paper: VOLO: Vision Outlooker for Visual Recognition

Authors: Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan

Code: https://github.com/sail-sg/volo

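Abstract

  • Visual recognition has been dominated by CNNs for years. Self-attention-based ViTs have shown great potential in ImageNet classification, but without extra data the performance of Transformers still falls short of state-of-the-art CNN models. In this work, we aim to close this performance gap and show that attention-based models can indeed outperform CNNs.
  • We also find that the main factor limiting ViTs in ImageNet classification is their low efficiency in encoding fine-level features into token representations. To address this, we introduce a novel outlook attention and propose a simple and general architecture called Vision Outlooker (VOLO). Outlook attention encodes fine-level features and context more efficiently into token representations, which are critical for recognition performance but largely ignored by self-attention.
  • Experiments show that, without any extra training data, VOLO reaches 87.1% top-1 accuracy on ImageNet-1K classification, making it the first model to exceed 87%. The pretrained VOLO models also transfer well to downstream tasks such as semantic segmentation: we obtain 84.3% mIoU on the Cityscapes validation set and 54.3% mIoU on the ADE20K validation set, both setting new records.

Summary: this paper proposes a new attention mechanism, Outlook Attention. Unlike Self Attention, which coarsely models global long-range relationships, Outlook Attention encodes neighborhood features at a finer level, making up for Self Attention's weakness in encoding fine-level features.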

OutLooker Attention

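VOLO can be viewed as an architecture with two separate stages. The first stage consists of a stack of OutLookers that generate fine-level token representations; the second stage deploys a sequence of transformer blocks to aggregate global information. At the beginning of each stage, a patch embedding module maps the input to the designed shape.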

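· 7 min read

Paper: SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks

Authors: Lingxiao Yang, Ru-Yuan Zhang, Lida Li, Xiaohua Xie

Code: https://github.com/ZjjConan/SimAM

Introduction

This paper proposes a simple and effective 3D attention module. Based on well-known neuroscience theories, it defines an energy function and derives a fast closed-form solution that assigns a weight to every neuron. The main contributions are:

  • Inspired by the attention mechanisms of the human brain, we propose an attention module with 3D weights and design an energy function to compute them;
  • We derive a closed-form solution of the energy function, which speeds up the weight computation and keeps the whole module lightweight;
  • We embed the module into existing ConvNets and verify its flexibility and effectiveness on different tasks.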
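
A minimal NumPy sketch of a parameter-free 3D weighting in this spirit. It assumes the closed-form solution reduces to per-channel mean and variance statistics followed by a sigmoid; the function name `simam` and the default `lam=1e-4` are assumptions for illustration, not values taken from this post.

```python
import numpy as np

def simam(x, lam=1e-4):
    """x: (B, C, H, W). Returns x scaled by a per-neuron attention weight."""
    n = x.shape[2] * x.shape[3] - 1
    # Squared deviation of every neuron from its channel mean.
    d = (x - x.mean(axis=(2, 3), keepdims=True)) ** 2
    var = d.sum(axis=(2, 3), keepdims=True) / n
    # Neurons that stand out from their channel get a larger value here.
    inv_energy = d / (4.0 * (var + lam)) + 0.5
    # Sigmoid squashes the score into a (0, 1) weight; no learned parameters.
    return x * (1.0 / (1.0 + np.exp(-inv_energy)))
```

The weight has the same shape as the input, so each neuron gets its own scale rather than one per channel or one per spatial position.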

· 21 min read
Gavin Gong

Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, Fengbo Ren

Deep neural networks have achieved remarkable success in computer vision tasks. Existing neural networks mainly operate in the spatial domain with fixed input sizes. For practical applications, images are usually large and have to be downsampled to the predetermined input size of neural networks. Even though the downsampling operations reduce computation and the required communication bandwidth, it removes both redundant and salient information obliviously, which results in accuracy degradation. Inspired by digital signal processing theories, we analyze the spectral bias from the frequency perspective and propose a learning-based frequency selection method to identify the trivial frequency components which can be removed without accuracy loss. The proposed method of learning in the frequency domain leverages identical structures of the well-known neural networks, such as ResNet-50, MobileNetV2, and Mask R-CNN, while accepting the frequency-domain information as the input. Experiment results show that learning in the frequency domain with static channel selection can achieve higher accuracy than the conventional spatial downsampling approach and meanwhile further reduce the input data size. Specifically for ImageNet classification with the same input size, the proposed method achieves 1.41% and 0.66% top-1 accuracy improvements on ResNet-50 and MobileNetV2, respectively. Even with half input size, the proposed method still improves the top-1 accuracy on ResNet-50 by 1%. In addition, we observe a 0.8% average precision improvement on Mask R-CNN for instance segmentation on the COCO dataset.

Comments: Accepted to CVPR 2020
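
To make "frequency-domain information as the input" concrete, here is a rough NumPy sketch. It assumes an 8×8 block DCT, with coefficients of the same frequency grouped into one channel; the hand-picked low-frequency subset stands in for the selection step and is purely illustrative. The names `dct_matrix` and `to_frequency_channels` are hypothetical.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def to_frequency_channels(img, n=8):
    """img: (H, W), H and W divisible by n. Returns (n*n, H//n, W//n)."""
    H, W = img.shape
    D = dct_matrix(n)
    # Cut the image into non-overlapping n x n blocks.
    blocks = img.reshape(H // n, n, W // n, n).transpose(0, 2, 1, 3)
    coeffs = D @ blocks @ D.T          # 2-D DCT of every block
    # Move the frequency index (u, v) to the front: one channel per frequency.
    return coeffs.transpose(2, 3, 0, 1).reshape(n * n, H // n, W // n)

freq = to_frequency_channels(np.ones((16, 16)))
# Illustrative static selection: keep channels with u + v < 4 (assumption).
keep = [u * 8 + v for u in range(8) for v in range(8) if u + v < 4]
selected = freq[keep]
```

The spatial resolution drops by a factor of 8 in each direction while the channel count grows to 64, so discarding channels is what reduces the input data size.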