
· 12 min read
Zerorains

Paper: RefineMask: Towards High-Quality Instance Segmentation with Fine-Grained Features

Authors: Gang Zhang, Xin Lu, Jingru Tan, Jianmin Li, Zhaoxiang Zhang, Quanquan Li, Xiaolin Hu

Venue: CVPR 2021

Code: https://github.com/zhanggang001/RefineMask

Original Abstract

The two-stage methods for instance segmentation, e.g. Mask R-CNN, have achieved excellent performance recently. However, the segmented masks are still very coarse due to the downsampling operations in both the feature pyramid and the instance-wise pooling process, especially for large objects. In this work, we propose a new method called RefineMask for high-quality instance segmentation of objects and scenes, which incorporates fine-grained features during the instance-wise segmenting process in a multi-stage manner. Through fusing more detailed information stage by stage, RefineMask is able to refine high-quality masks consistently. RefineMask succeeds in segmenting hard cases such as bent parts of objects that are over-smoothed by most previous methods and outputs accurate boundaries. Without bells and whistles, RefineMask yields significant gains of 2.6, 3.4, 3.8 AP over Mask R-CNN on COCO, LVIS, and Cityscapes benchmarks respectively at a small amount of additional computational cost. Furthermore, our single-model result outperforms the winner of the LVIS Challenge 2020 by 1.3 points on the LVIS test-dev set and establishes a new state-of-the-art.

Summary

Even though two-stage instance segmentation networks such as Mask R-CNN already perform excellently, the downsampling operations in both the feature pyramid and the instance-wise pooling process leave the segmentation masks very coarse, especially for large objects.

This paper proposes RefineMask, a method for high-quality instance segmentation of objects and scenes, which incorporates fine-grained features into the instance segmentation process in a multi-stage manner. By fusing more detailed information stage by stage, RefineMask consistently produces high-quality masks.
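
To make the stage-by-stage idea concrete, here is an illustrative PyTorch sketch of one refinement stage: upsample the current instance features and fuse finer-grained features pooled at the higher resolution, predicting a mask at each stage. The module name, widths, and fusion details are assumptions for illustration, not RefineMask's exact design (the paper's fusion and boundary handling are more elaborate).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineStageSketch(nn.Module):
    """One refinement stage in the spirit of RefineMask (illustrative only):
    upsample coarse instance features and fuse finer-grained features."""
    def __init__(self, ch: int = 128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1),  # merge coarse + fine features
            nn.ReLU(inplace=True),
        )
        self.mask_head = nn.Conv2d(ch, 1, 1)      # per-stage mask prediction

    def forward(self, coarse_feat: torch.Tensor, fine_feat: torch.Tensor):
        # 2x upsample at each stage, so later stages work at higher resolution
        up = F.interpolate(coarse_feat, scale_factor=2, mode="bilinear",
                           align_corners=False)
        fused = self.fuse(torch.cat([up, fine_feat], dim=1))
        return fused, self.mask_head(fused)       # features + refined mask
```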

· 9 min read
PommesPeter

Paper: Low-Light Enhancement Network with Global Awareness

Authors: Wenjing Wang, Chen Wei, Wenhan Yang, Jiaying Liu

Code: https://github.com/weichen582/GLADNet

This is a paper on low-light enhancement using neural networks.

  • It first estimates the illumination of the image and then adjusts the original image according to that estimate.
  • During the adjustment, details in the image are reconstructed so that the result looks more natural.

Abstract

In this paper, we address the problem of low-light enhancement. Our key idea is to first calculate a global illumination estimation for the low-light input, then adjust the illumination under the guidance of the estimation and supplement the details using a concatenation with the original input. Considering that, we propose a GLobal illumination-Aware and Detail-preserving Network (GLADNet). The input image is rescaled to a certain size and then put into an encoder-decoder network to generate global priori knowledge of the illumination. Based on the global prior and the original input image, a convolutional network is employed for detail reconstruction. For training GLADNet, we use a synthetic dataset generated from RAW images. Extensive experiments demonstrate the superiority of our method over other compared methods on the real low-light images captured in various conditions.

This paper addresses the problem of low-light enhancement. The key idea is to take a low-light image, first estimate its global illumination, then adjust the brightness under the guidance of that estimate, and concatenate with the original image to supplement details. The proposed GLADNet resizes the input image to a fixed size and feeds it through an encoder-decoder network, using the generated illumination as a prior; the prior and the original image are then fed into a convolutional network for detail reconstruction.
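
A rough sketch of that two-step pipeline, for intuition only: the working resolution, layer counts, and widths below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLADNetSketch(nn.Module):
    """Illustrative GLADNet-style pipeline: estimate global illumination at a
    fixed scale, then reconstruct details from the estimate concatenated with
    the original input. Depths/widths here are assumptions."""
    def __init__(self, ch: int = 64, est_size: int = 96):
        super().__init__()
        self.est_size = est_size
        self.down = nn.Sequential(              # tiny "encoder"
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.up = nn.Sequential(                # tiny "decoder"
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1),
        )
        self.detail = nn.Sequential(            # detail reconstruction network
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        # rescale to a fixed size so the estimate sees the whole image
        small = F.interpolate(x, (self.est_size, self.est_size),
                              mode="bilinear", align_corners=False)
        est = self.up(self.down(small))         # global illumination prior
        est = F.interpolate(est, (h, w), mode="bilinear", align_corners=False)
        # concatenate the prior with the original input to supplement details
        return self.detail(torch.cat([x, est], dim=1))
```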

· 9 min read
Gavin Gong

Squeeze-and-Excitation Networks (SENet) is an image recognition architecture published in 2017 by the autonomous-driving company Momenta. It models the interdependencies between feature channels and strengthens important features to improve accuracy. The architecture won the ILSVRC 2017 classification competition, reaching a top-5 error of 2.251%, about 25% lower (relative) than the 2016 winner.

The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%. Models and code are available at this https URL.

[Figure: the Squeeze-and-Excitation (SE) block]

The main novelty of SENet is a single module. As shown in the figure above, $F_{tr}$ is a conventional convolution whose input $X$ ($C' \times W' \times H'$) and output $U$ ($C \times W \times H$) already exist in conventional architectures; the SENet module is the part after $U$. Through this design, SENet introduces a form of attention.
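
The module itself is compact. Below is a minimal PyTorch sketch of an SE block following the paper's squeeze (global average pooling) and excitation (bottleneck MLP with a sigmoid gate) steps; `reduction=16` is the ratio reported in the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: recalibrates channel-wise responses."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze to a bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # restore dimensionality
            nn.Sigmoid(),                                # per-channel gates in (0, 1)
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        s = u.mean(dim=(2, 3))           # squeeze: global average pooling -> (B, C)
        w = self.fc(s).view(b, c, 1, 1)  # excitation: per-channel weights
        return u * w                     # rescale U channel-wise
```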

· 10 min read
Gavin Gong

BiSeNet aims at faster real-time semantic segmentation. In semantic segmentation, spatial resolution and receptive field are hard to satisfy at the same time, especially in the real-time setting; existing methods usually achieve speed-ups with small input images or lightweight backbone models. But small images lose much of the spatial information in the original image, while lightweight models damage spatial information by pruning channels. BiSeNet integrates a Spatial Path (SP) and a Context Path (CP) to address the loss of spatial information and the shrinking of the receptive field, respectively.

Semantic segmentation requires both rich spatial information and sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve the spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain sufficient receptive field. On top of the two paths, we introduce a new Feature Fusion Module to combine features efficiently. The proposed architecture makes a right balance between the speed and segmentation performance on Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048x1024 input, we achieve 68.4% Mean IOU on the Cityscapes test dataset with speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than the existing methods with comparable performance.

Original paper: BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation. After reading it, you will find that many of its ideas are inspired by SENet (Squeeze-and-Excitation Networks).
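
One place that influence shows clearly is BiSeNet's Feature Fusion Module, which fuses the two paths' features and then reweights channels with an SE-style gate. A minimal sketch (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Sketch of BiSeNet's FFM: fuse Spatial Path and Context Path features,
    then reweight channels SE-style (hence the SENet influence)."""
    def __init__(self, cin: int, cout: int, reduction: int = 4):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cin, cout, 1, bias=False),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )
        self.gate = nn.Sequential(               # SE-like channel attention
            nn.Conv2d(cout, cout // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(cout // reduction, cout, 1),
            nn.Sigmoid(),
        )

    def forward(self, sp: torch.Tensor, cp: torch.Tensor) -> torch.Tensor:
        x = self.fuse(torch.cat([sp, cp], dim=1))           # concat the two paths
        w = self.gate(x.mean(dim=(2, 3), keepdim=True))     # global pooled gate
        return x + x * w                                    # residual reweighting
```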

· 17 min read
Gavin Gong

Mingyuan Fan, Shenqi Lai, Junshi Huang, Xiaoming Wei, Zhenhua Chai, Junfeng Luo, Xiaolin Wei


BiSeNet has been proved to be a popular two-stream network for real-time segmentation. However, its principle of adding an extra path to encode spatial information is time-consuming, and the backbones borrowed from pretrained tasks, e.g., image classification, may be inefficient for image segmentation due to the deficiency of task-specific design. To handle these problems, we propose a novel and efficient structure named Short-Term Dense Concatenate network (STDC network) by removing structure redundancy. Specifically, we gradually reduce the dimension of feature maps and use the aggregation of them for image representation, which forms the basic module of STDC network. In the decoder, we propose a Detail Aggregation module by integrating the learning of spatial information into low-level layers in single-stream manner. Finally, the low-level features and deep features are fused to predict the final segmentation results. Extensive experiments on Cityscapes and CamVid dataset demonstrate the effectiveness of our method by achieving promising trade-off between segmentation accuracy and inference speed. On Cityscapes, we achieve 71.9% mIoU on the test set with a speed of 250.4 FPS on NVIDIA GTX 1080Ti, which is 45.2% faster than the latest methods, and achieve 76.8% mIoU with 97.0 FPS while inferring on higher resolution images.

Before reading this post, please first read BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation.

The paper notes that BiSeNet has been shown to be a good two-stream real-time segmentation network. However, in BiSeNet:

  • Opening a separate network path just to encode spatial information is computationally expensive.
  • The pretrained lightweight backbones are borrowed directly from other tasks (e.g., classification and object detection) and are not very efficient when reused for segmentation.

Therefore, the authors propose the Short-Term Dense Concatenate network (STDC network) to replace the context path in BiSeNet. Its core idea is to remove redundant structure and further speed up segmentation. Concretely, the dimension of the feature maps is gradually reduced, and the maps are aggregated for image representation, forming the basic module of the STDC network (see the sketch below). In the decoder, a Detail Aggregation module is proposed to integrate the learning of spatial information into the low-level layers in a single-stream manner, replacing the spatial path in BiSeNet. Finally, the low-level features and deep features are fused to predict the final segmentation result.
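
For intuition, here is a minimal sketch of the stride-1 STDC module: channel widths shrink block by block and all intermediate maps are concatenated. The strided variant and exact configurations in the paper differ in details.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin: int, cout: int, k: int = 3) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class STDCModule(nn.Module):
    """Short-Term Dense Concatenate module, stride-1 variant (a sketch).
    Widths shrink (N/2, N/4, N/8, N/8) and all intermediate maps are
    concatenated, keeping multi-scale features at a small cost."""
    def __init__(self, cin: int, cout: int):
        super().__init__()
        self.b1 = conv_bn_relu(cin, cout // 2, k=1)   # 1x1 conv first
        self.b2 = conv_bn_relu(cout // 2, cout // 4)
        self.b3 = conv_bn_relu(cout // 4, cout // 8)
        self.b4 = conv_bn_relu(cout // 8, cout // 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.b1(x)
        f2 = self.b2(f1)
        f3 = self.b3(f2)
        f4 = self.b4(f3)
        # N/2 + N/4 + N/8 + N/8 = N output channels
        return torch.cat([f1, f2, f3, f4], dim=1)
```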

[Figure: overall architecture of the proposed segmentation network]

Note: the part inside the red dashed box in the figure above is the newly proposed STDC network; ARM denotes the Attention Refinement Module and FFM the Feature Fusion Module. Both modules are designs that already existed in BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation.

If you are interested, please read the original paper, Rethinking BiSeNet For Real-time Semantic Segmentation.

· 7 min read
Gavin Gong


Sanghyun Woo, Jongchan Park, Joon-Young Lee, In So Kweon

We propose Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs. We validate our CBAM through extensive experiments on ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show consistent improvements in classification and detection performances with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.

CBAM combines channel attention and spatial attention, and achieves good results by stacking the two within a single module. Besides increasing a CNN's depth or width, another direction for improving performance is attention: attention tells the network where to focus and sharpens the representation at those locations. The goal is to increase representational power by emphasizing important features and suppressing unnecessary ones.

To emphasize meaningful features along the two dimensions of channel and space, the authors apply a channel attention module and a spatial attention module in sequence, each refining what the network learns along its own dimension. Splitting attention into independent channel and spatial parts not only saves parameters and computation, but also keeps CBAM a plug-and-play module that can be integrated into existing network architectures.
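
A minimal PyTorch sketch of the two sub-modules and their sequential arrangement (channel attention first, then spatial attention, both built from average- and max-pooled descriptors as in the paper); `reduction=16` and the 7x7 spatial kernel follow the paper's defaults.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP for both descriptors
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(x.mean(dim=(2, 3)))        # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))         # max-pooled descriptor
        return torch.sigmoid(avg + mx)[..., None, None]

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)         # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)          # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention first, then spatial attention, applied sequentially."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.ca(x)
        return x * self.sa(x)
```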

· 19 min read
Gavin Gong

The non-local block is a popular module for strengthening the context modeling ability of a regular convolutional neural network.

Non-local aims to build long-range pixel relationships within a single layer and is a form of self-attention. Common CNN and RNN structures operate on local regions: in a convolutional network, for example, each convolution relates only the pixels within a small neighborhood (since the kernel is small).

To build long-range dependencies between pixels, i.e., relationships between non-adjacent pixels in an image, this paper takes a different route and constructs non-local neural networks from non-local operations, addressing a core problem of deep neural networks: capturing long-range dependencies.

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code is available at this https URL .

The paper constructs non-local neural networks from non-local operations and solves the problem of long-range pixel dependencies; the original paper is well worth reading. Inspired by the classical non-local means method in computer vision, the authors' non-local operation computes the response at a position as a weighted sum of the features at all positions. This non-local operation can be applied in many computer vision frameworks and performs well on video classification, object classification, recognition, segmentation, and other tasks.
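
A minimal sketch of the embedded-Gaussian instantiation of the non-local block: 1x1 convolutions embed queries, keys, and values, a softmax over all positions forms the weights, and a residual connection lets the block drop into existing networks. Halving the inner channels follows the paper's bottleneck design.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block (2D): each position's response is a
    weighted sum of the features at *all* positions."""
    def __init__(self, channels: int, inner: int | None = None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)  # query embedding
        self.phi = nn.Conv2d(channels, inner, 1)    # key embedding
        self.g = nn.Conv2d(channels, inner, 1)      # value embedding
        self.out = nn.Conv2d(inner, channels, 1)    # restore channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (B, HW, C')
        k = self.phi(x).flatten(2)                    # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)           # affinities over all positions
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                        # residual connection
```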

· 10 min read
Gavin Gong

Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, Han Hu

GCNet (original paper: GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond, later extended as Global Context Networks) follows a research approach similar to DPN: it analyzes the strengths and weaknesses of Non-local and SENet in depth, then combines their advantages to propose GCNet.

The Non-Local Network (NLNet) presents a pioneering approach for capturing long-range dependencies, via aggregating query-specific global context to each query position. However, through a rigorous empirical analysis, we have found that the global contexts modeled by non-local network are almost the same for different query positions within an image. In this paper, we take advantage of this finding to create a simplified network based on a query-independent formulation, which maintains the accuracy of NLNet but with significantly less computation. We further observe that this simplified design shares similar structure with Squeeze-Excitation Network (SENet). Hence we unify them into a three-step general framework for global context modeling. Within the general framework, we design a better instantiation, called the global context (GC) block, which is lightweight and can effectively model the global context. The lightweight property allows us to apply it for multiple layers in a backbone network to construct a global context network (GCNet), which generally outperforms both simplified NLNet and SENet on major benchmarks for various recognition tasks. The code and configurations are released at this https URL.

[Figure: (a) the global context modeling framework; (b) a simplified non-local block; (c) an SE block]

GCNet proposes a module framework called the global context modeling framework ((a) in the figure above) and divides it into three steps: context modeling, transform, and fusion.

The paper combines the context modeling step of Non-Local Neural Networks ((b) in the figure above is its simplified form) with the transform step of Squeeze-and-Excitation Networks ((c) in the figure above is one of its forms) into a new module, the Global Context (GC) block, which learns spatial and channel-wise attention at the same time. This is a very good paper; if you are interested, please read the original.
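
A minimal sketch of the GC block's three steps: context modeling (a softmax-normalized spatial attention map pools a single global context vector), an SE-style bottleneck transform, and fusion by broadcast addition. The bottleneck ratio below is an assumption in the spirit of SENet.

```python
import torch
import torch.nn as nn

class GCBlock(nn.Module):
    """Global Context block: context modeling -> transform -> fusion."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mask = nn.Conv2d(channels, 1, 1)  # context modeling: per-position weight
        hidden = channels // reduction
        self.transform = nn.Sequential(        # SE-style bottleneck transform
            nn.Conv2d(channels, hidden, 1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # query-independent attention: one weight per spatial position
        w_attn = torch.softmax(self.mask(x).view(b, 1, h * w), dim=-1)
        context = x.view(b, c, h * w) @ w_attn.transpose(1, 2)  # (B, C, 1)
        context = context.view(b, c, 1, 1)
        return x + self.transform(context)     # fusion: broadcast addition
```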

· 9 min read
Gavin Gong

Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, Han Hu

The non-local block is a popular module for strengthening the context modeling ability of a regular convolutional neural network. This paper first studies the non-local block in depth, where we find that its attention computation can be split into two terms, a whitened pairwise term accounting for the relationship between two pixels and a unary term representing the saliency of every pixel. We also observe that the two terms trained alone tend to model different visual clues, e.g. the whitened pairwise term learns within-region relationships while the unary term learns salient boundaries. However, the two terms are tightly coupled in the non-local block, which hinders the learning of each. Based on these findings, we present the disentangled non-local block, where the two terms are decoupled to facilitate learning for both terms. We demonstrate the effectiveness of the decoupled design on various tasks, such as semantic segmentation on Cityscapes, ADE20K and PASCAL Context, object detection on COCO, and action recognition on Kinetics.

As its title suggests, this paper analyzes the attention mechanism inside the Non-Local block of Non-Local Neural Networks and disentangles its design. After disentanglement, the attention splits into two parts: a pairwise term representing relationships between pixels, and a unary term representing a kind of per-pixel saliency. The two terms are tightly coupled in the Non-local block; the paper finds that when the two parts are trained separately, they model different visual clues and achieve good results.
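
The split can be written compactly. With queries $q_i$, keys $k_j$, and their means $\mu_q$, $\mu_k$ over all positions (notation assumed here), terms that do not depend on $j$ cancel inside the softmax over $j$, leaving:

```latex
% Whitened pairwise term + unary term (up to softmax normalization over j):
q_i^{\top} k_j \;\equiv\;
\underbrace{(q_i - \mu_q)^{\top}(k_j - \mu_k)}_{\text{whitened pairwise term}}
\;+\;
\underbrace{\mu_q^{\top} k_j}_{\text{unary term}}
```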

The whole paper, from the analysis of Non-Local to the proposal of the new method, is very well organized. If you have time, please read the original paper, Disentangled Non-Local Neural Networks.

Before reading this post, please first read Non-Local Neural Networks.

· 21 min read
PommesPeter

Paper: Deep Retinex Decomposition for Low-Light Enhancement

Authors: Chen Wei, Wenjing Wang, Wenhan Yang, Jiaying Liu

Code: https://github.com/weichen582/RetinexNet

This is a paper on low-light enhancement using a convolutional neural network.

  • It uses a decomposition network and an enhancement network: the decomposition network is built from Retinex theory, and enhancement is applied after decomposition.

Abstract

Retinex model is an effective tool for low-light image enhancement. It assumes that observed images can be decomposed into the reflectance and illumination. Most existing Retinex-based methods have carefully designed hand-crafted constraints and parameters for this highly ill-posed decomposition, which may be limited by model capacity when applied in various scenes. In this paper, we collect a LOw-Light dataset (LOL) containing low/normal-light image pairs and propose a deep Retinex-Net learned on this dataset, including a Decom-Net for decomposition and an Enhance-Net for illumination adjustment. In the training process for Decom-Net, there is no ground truth of decomposed reflectance and illumination. The network is learned with only key constraints including the consistent reflectance shared by paired low/normal-light images, and the smoothness of illumination. Based on the decomposition, subsequent lightness enhancement is conducted on illumination by an enhancement network called Enhance-Net, and for joint denoising there is a denoising operation on reflectance. The Retinex-Net is end-to-end trainable, so that the learned decomposition is by nature good for lightness adjustment. Extensive experiments demonstrate that our method not only achieves visually pleasing quality for low-light enhancement but also provides a good representation of image decomposition.

The Retinex model is an effective tool for low-light image enhancement. It assumes that an observed image can be decomposed into reflectance and illumination. Most existing Retinex-based methods carefully design hand-crafted constraints and parameters for this highly ill-posed decomposition, which may be limited by model capacity when applied across varied scenes. This paper collects a LOw-Light dataset (LOL) of low/normal-light image pairs and proposes a deep Retinex-Net learned on it, consisting of a Decom-Net for decomposition and an Enhance-Net for illumination adjustment. During Decom-Net training there is no ground truth for the decomposed reflectance and illumination; the network is learned only from key constraints, including the consistent reflectance shared by a paired low/normal-light image and the smoothness of the illumination. Based on the decomposition, the Enhance-Net then performs lightness enhancement on the illumination, and for joint denoising a denoising operation is applied to the reflectance. Retinex-Net is end-to-end trainable, so the learned decomposition is by nature well suited to lightness adjustment. Extensive experiments show that the method not only achieves visually pleasing low-light enhancement but also provides a good representation for image decomposition.
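
To make the "no ground-truth decomposition" point concrete, here is an illustrative sketch of two of the key constraints named in the abstract: the reconstruction constraint S = R ⊙ I for each image, and reflectance consistency between a low/normal-light pair. Function and argument names are assumptions; the paper's illumination-smoothness term is omitted.

```python
import torch

def retinex_decom_losses(r_low, i_low, r_high, i_high, s_low, s_high):
    """Sketch of Decom-Net's core constraints (names are illustrative).
    Decom-Net predicts reflectance R and illumination I such that S = R * I
    for both images of a pair, with no ground-truth decomposition available."""
    # reconstruction: each image must be recovered from its own R and I
    recon = (torch.abs(r_low * i_low - s_low).mean() +
             torch.abs(r_high * i_high - s_high).mean())
    # reflectance consistency: a paired scene should share one reflectance
    consistency = torch.abs(r_low - r_high).mean()
    return recon, consistency
```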