
· 15 min read
PommesPeter

Paper: MSR-net: Low-light Image Enhancement Using Deep Convolutional Network

Authors: Liang Shen, Zihan Yue, Fan Feng, Quan Chen, Shihao Liu, Jie Ma

Code: None

This post covers a paper on low-light image enhancement with a convolutional neural network built on Retinex theory.

  • Builds a convolutional neural network model on top of the traditional multi-scale Retinex (MSR) theory
  • Directly learns an end-to-end mapping between dark and bright images

Abstract

Images captured in low-light conditions usually suffer from very low contrast, which increases the difficulty of subsequent computer vision tasks to a great extent. In this paper, a low-light image enhancement model based on convolutional neural network and Retinex theory is proposed. Firstly, we show that multi-scale Retinex is equivalent to a feedforward convolutional neural network with different Gaussian convolution kernels. Motivated by this fact, we consider a Convolutional Neural Network (MSR-net) that directly learns an end-to-end mapping between dark and bright images. Different fundamentally from existing approaches, low-light image enhancement in this paper is regarded as a machine learning problem. In this model, most of the parameters are optimized by back-propagation, while the parameters of traditional models depend on the artificial setting. Experiments on a number of challenging images reveal the advantages of our method in comparison with other state-of-the-art methods from the qualitative and quantitative perspective.

This paper proposes a low-light image enhancement model based on a convolutional neural network and Retinex theory. It shows that multi-scale Retinex is equivalent to a feedforward convolutional neural network with different Gaussian convolution kernels, and, motivated by this, builds a CNN (MSR-net) that directly learns an end-to-end mapping between dark and bright images.
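
For reference, the textbook multi-scale Retinex formula (not reproduced in this note, so the notation here is the standard one rather than the paper's) is a weighted sum of single-scale Retinex outputs, each subtracting a Gaussian-blurred log image, which is exactly a fixed feedforward network of Gaussian convolutions:

$$
\mathrm{MSR}(x, y) = \sum_{n=1}^{N} w_n \Big( \log I(x, y) - \log \big[ F_n(x, y) * I(x, y) \big] \Big), \qquad F_n(x, y) = K_n e^{-\frac{x^2 + y^2}{2\sigma_n^2}}, \quad \sum_{n=1}^{N} w_n = 1
$$

where $*$ denotes convolution, $F_n$ is the Gaussian surround function at scale $\sigma_n$, and $K_n$ normalizes $F_n$ to sum to one.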

· 11 min read
PommesPeter

Paper: LLCNN: A convolutional neural network for low-light image enhancement

Authors: Li Tao, Chuang Zhu, Guoqing Xiang, Yuan Li, Huizhu Jia, Xiaodong Xie

Code: https://github.com/BestJuly/LLCNN

This note was written by PommesPeter.

This post covers a paper on low-light image enhancement with a convolutional neural network.

  • Uses a convolutional neural network for low-light enhancement
  • Uses an SSIM loss to better evaluate image quality and to help the gradients converge

Abstract

In this paper, we propose a CNN based method to perform low-light image enhancement. We design a special module to utilize multiscale feature maps, which can avoid gradient vanishing problem as well. In order to preserve image textures as much as possible, we use SSIM loss to train our model. The contrast of low-light images can be adaptively enhanced using our method. Results demonstrate that our CNN based method outperforms other contrast enhancement methods.

This paper proposes a CNN-based low-light image enhancement method. A special module is designed to exploit multi-scale feature maps, which also helps avoid the vanishing-gradient problem. To preserve image texture as much as possible, the model is trained with an SSIM loss, and the contrast of low-light images is enhanced adaptively.
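
For reference, the SSIM index between two image patches x and y takes its standard form below ($C_1$, $C_2$ are small constants that avoid division by zero); a common way to train with it, and presumably what the paper does, is to minimize one minus the mean SSIM over the image:

$$
\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \qquad \mathcal{L}_{\mathrm{SSIM}} = 1 - \frac{1}{N} \sum_{p=1}^{N} \mathrm{SSIM}(p)
$$

where the statistics are computed over a local window around each pixel p.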

· 11 min read

Paper: VOLO: Vision Outlooker for Visual Recognition

Authors: Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan

Code: https://github.com/sail-sg/volo

Abstract

  • Visual recognition has been dominated by CNNs for years. Although self-attention-based ViTs have shown great potential on ImageNet classification, without extra data their performance still lags behind state-of-the-art CNN models. In this work, the authors aim to close this gap and show that attention-based models can indeed outperform CNNs.
  • Meanwhile, they find that a major factor limiting ViTs on ImageNet classification is their low efficiency in encoding fine-level features into token representations. To address this, they introduce a new outlook attention and propose a simple and general architecture called Vision Outlooker (VOLO). Outlook attention encodes fine-level features and contextual information into token representations more efficiently; these are crucial for recognition performance but tend to be ignored by self-attention.
  • Experiments show that, without any extra training data, VOLO reaches 87.1% top-1 accuracy on ImageNet-1K classification, the first model to exceed 87%. The pre-trained VOLO also transfers well to downstream tasks such as semantic segmentation, reaching 84.3% mIoU on the Cityscapes validation set and 54.3% mIoU on the ADE20K validation set, both new records.

Summary: This paper proposes a new attention mechanism, Outlook Attention. Unlike self-attention, which coarsely models global long-range relations, outlook attention encodes neighborhood features at a finer granularity, making up for self-attention's weakness in encoding fine-level features.

OutLooker Attention

The Outlooker module can be viewed as a structure with two separate stages: the first stacks Outlookers to generate fine-level token representations, and the second deploys a series of transformers to aggregate global information. Before each stage, a patch embedding module maps the input to the designated shape. A simplified sketch of outlook attention follows.
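
Below is a minimal single-head, stride-1 PyTorch-style sketch of outlook attention, written from the description above rather than copied from the official VOLO code: a linear layer on each center token directly produces the K^2 x K^2 attention weights that aggregate the unfolded K x K neighborhood of values, which are then folded back.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Single-head, stride-1 sketch of outlook attention over K x K neighborhoods."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.K = kernel_size
        self.pad = kernel_size // 2
        self.v = nn.Linear(dim, dim)                  # value projection
        self.attn = nn.Linear(dim, kernel_size ** 4)  # K^2 x K^2 weights per location
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                             # x: (B, H, W, C)
        B, H, W, C = x.shape
        K = self.K
        # unfold the values of every K x K neighborhood -> (B, H*W, K*K, C)
        v = self.v(x).permute(0, 3, 1, 2)             # (B, C, H, W)
        v = F.unfold(v, K, padding=self.pad)          # (B, C*K*K, H*W)
        v = v.reshape(B, C, K * K, H * W).permute(0, 3, 2, 1)
        # attention weights come straight from the center token, no query-key product
        a = self.attn(x).reshape(B, H * W, K * K, K * K).softmax(dim=-1)
        # aggregate neighborhood values, then fold the overlapping windows back
        out = (a @ v).permute(0, 3, 2, 1).reshape(B, C * K * K, H * W)
        out = F.fold(out, (H, W), K, padding=self.pad)  # (B, C, H, W)
        return self.proj(out.permute(0, 2, 3, 1))       # back to (B, H, W, C)
```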

· 11 min read

Paper: Polarized Self-Attention: Towards High-quality Pixel-wise Regression

Authors: Huajun Liu, Fuqiang Liu, Xinyi Fan

Code: https://github.com/DeLightCMU/PSA

This note was written by AsTheStarsFall.

Abstract

Fine-grained pixel-wise tasks such as semantic segmentation have always been important in computer vision. Unlike classification or detection, they require the model to capture long-range dependencies over high-resolution input/output features at low computational cost in order to estimate highly non-linear pixel semantics. Attention mechanisms in CNNs can capture long-range dependencies, but they tend to be complex and sensitive to noise.

This paper proposes a plug-and-play polarized self-attention (PSA) module with two key designs that ensure high-quality pixel-wise regression (a sketch of the channel-only branch follows the list):

  1. Polarized filtering: keep a relatively high resolution in both the channel and spatial dimensions (C/2 channels and the full [H, W] spatial size), which further reduces the information loss caused by low resolution, few channels, and upsampling.
  2. Enhancement: apply a non-linear function that fits the output distribution of fine-grained regression.
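
A minimal sketch of the channel-only half of polarized filtering, assuming the parallel layout; the layer names and the exact normalization are my own reading of the description above, not the official PSA code. The spatial branch mirrors this one with the roles of channel and space swapped.

```python
import torch
import torch.nn as nn

class PolarizedChannelAttention(nn.Module):
    """Sketch of the channel-only branch of polarized filtering."""
    def __init__(self, channels):
        super().__init__()
        c2 = channels // 2
        self.wv = nn.Conv2d(channels, c2, 1)   # keep C/2 channels
        self.wq = nn.Conv2d(channels, 1, 1)    # collapse to a single spatial query
        self.wz = nn.Conv2d(c2, channels, 1)   # restore the full channel dimension
        self.ln = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        v = self.wv(x).reshape(B, C // 2, H * W)                     # (B, C/2, HW)
        q = self.wq(x).reshape(B, 1, H * W).softmax(dim=-1)          # attention over all HW positions
        z = torch.matmul(v, q.transpose(1, 2))                       # (B, C/2, 1)
        z = self.wz(z.unsqueeze(-1))                                  # (B, C, 1, 1)
        w = torch.sigmoid(self.ln(z.reshape(B, C)).reshape(B, C, 1, 1))
        return x * w                                                  # channel-wise reweighting
```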

· 7 min read

Paper: SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks

Authors: Lingxiao Yang, Ru-Yuan Zhang, Lida Li, Xiaohua Xie

Code: https://github.com/ZjjConan/SimAM

Introduction

This paper proposes a simple and effective 3D attention module. Based on a well-known neuroscience theory, it defines an energy function and derives a fast closed-form solution, which assigns a weight to every neuron. The main contributions are as follows (a minimal sketch comes after the list):

  • Inspired by attention mechanisms in the human brain, an attention module with full 3D weights is proposed, together with an energy function to compute those weights;
  • A closed-form solution of the energy function is derived, which speeds up the weight computation and keeps the whole module lightweight;
  • The module is embedded into existing ConvNets and its flexibility and effectiveness are verified on different tasks.
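
The closed-form solution is small enough to sketch directly. The snippet below follows the description above (per-channel mean and variance over spatial positions, a small regularizer lambda); the constant 0.5 and the final sigmoid are how I recall the released implementation, so treat them as assumptions.

```python
import torch

def simam(x, e_lambda=1e-4):
    """Parameter-free 3D attention: weight every neuron by its inverse minimal energy.

    x: feature map of shape (B, C, H, W)
    """
    n = x.shape[2] * x.shape[3] - 1                     # number of other neurons in the channel
    d = (x - x.mean(dim=(2, 3), keepdim=True)) ** 2     # (t - mu)^2 for every position
    v = d.sum(dim=(2, 3), keepdim=True) / n             # channel-wise variance estimate
    e_inv = d / (4 * (v + e_lambda)) + 0.5              # inverse of the minimal energy e_t^*
    return x * torch.sigmoid(e_inv)                     # lower energy -> more distinctive -> larger weight
```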

· 20 min read
Gavin Gong

Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, Lei Li

We present a new, embarrassingly simple approach to instance segmentation in images. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that have made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the 'detect-then-segment' strategy as used by Mask R-CNN, or predict category masks first then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance mask segmentation into a classification-solvable problem. Now instance segmentation is decomposed into two classification tasks. We demonstrate a much simpler and flexible instance segmentation framework with strong performance, achieving on par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this very simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation.

Compared with semantic segmentation, instance segmentation must not only predict the semantic category of every pixel but also decide which instance each pixel belongs to. Earlier two-stage methods mainly fall into two groups:

  1. Detect then segment: e.g. Mask R-CNN, which first detects every instance and then semantically segments it, so the segmented pixels all belong to that instance.
  2. Segment then group: first run semantic segmentation over the whole image to predict each pixel's category, then learn an embedding vector and use clustering to pull pixels of the same instance together so that they end up in the same group (the same instance).

Work on single-stage instance segmentation, influenced by single-stage object detection, also falls roughly into two groups: one inspired by one-stage, anchor-based detectors such as YOLO and RetinaNet, with YOLACT and SOLO as representatives; the other inspired by anchor-free detectors such as FCOS, with PolarMask and AdaptIS as representatives. None of these approaches are particularly direct or simple. SOLO's starting point is to do instance segmentation in a simpler, more direct way.

Based on statistics of the MS COCO dataset, the authors note that the validation subset contains 36,780 objects in total, and for 98.3% of object pairs the distance between centers is greater than 30 pixels. Among the remaining 1.7% of pairs, 40.5% have a size ratio larger than 1.5. Rare cases such as two objects crossing like an X are not considered. In short, in most cases two instances in an image either have different center locations or different sizes.

The authors therefore propose to distinguish instances by their location and shape in the image: within one image, two objects with exactly the same location and shape are the same instance. Since shape has many aspects, the paper simply uses size to describe it. A schematic head in this spirit is sketched below.
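
To make the "instance category" idea concrete, here is a schematic single-level head in the spirit of SOLO: an S x S category grid plus one mask channel per grid cell. This is only an illustration under my own simplifications; the actual SOLO head uses stacked convolutions, CoordConv, and multiple FPN levels.

```python
import torch.nn as nn
import torch.nn.functional as F

class NaiveSOLOHead(nn.Module):
    """Schematic single-level SOLO-style head: an S x S grid of instance categories."""
    def __init__(self, in_ch, num_classes, grid=40):
        super().__init__()
        self.grid = grid
        self.cate = nn.Conv2d(in_ch, num_classes, 3, padding=1)  # category branch on the S x S grid
        self.mask = nn.Conv2d(in_ch, grid * grid, 1)             # one mask channel per grid cell

    def forward(self, feat):                                     # feat: (B, C, H, W)
        cate_in = F.interpolate(feat, size=(self.grid, self.grid),
                                mode='bilinear', align_corners=False)
        cate_pred = self.cate(cate_in)                           # (B, num_classes, S, S)
        mask_pred = self.mask(feat)                              # (B, S*S, H, W); cell (i, j) -> channel i*S + j
        return cate_pred, mask_pred
```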

The method achieves accuracy on par with Mask R-CNN and outperforms recent single-shot instance segmenters in accuracy.

· 14 min read
Gavin Gong

Daniel Bolya, Chong Zhou, Fanyi Xiao, Yong Jae Lee

We present a simple, fully-convolutional model for real-time instance segmentation that achieves 29.8 mAP on MS COCO at 33.5 fps evaluated on a single Titan Xp, which is significantly faster than any previous competitive approach. Moreover, we obtain this result after training on only one GPU. We accomplish this by breaking instance segmentation into two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients. Then we produce instance masks by linearly combining the prototypes with the mask coefficients. We find that because this process doesn't depend on repooling, this approach produces very high-quality masks and exhibits temporal stability for free. Furthermore, we analyze the emergent behavior of our prototypes and show they learn to localize instances on their own in a translation variant manner, despite being fully-convolutional. Finally, we also propose Fast NMS, a drop-in 12 ms faster replacement for standard NMS that only has a marginal performance penalty.

YOLACT is short for You Only Look At CoefficienTs, where the coefficients are one of the model's outputs; the naming style presumably pays tribute to the object detector YOLO.

[Figure: YOLACT network architecture]

The figure above shows YOLACT's network architecture. YOLACT's goal is to add a mask branch to an existing one-stage object detector. Personally I see it as something in between one-stage and two-stage methods. It counts as one-stage because the way it "adds a mask branch to an existing one-stage detector" mirrors what Mask R-CNN does to Faster R-CNN, but without explicit localization steps such as feature repooling or RoI align. In other words, the locate-classify-segment pipeline becomes segment-then-crop. A sketch of the mask assembly follows.
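
The core of the mask branch is just a linear combination of prototype masks with per-instance coefficients; a tiny sketch, with tensor shapes that are my own convention rather than the paper's:

```python
import torch

def assemble_masks(prototypes, coeffs):
    """Linearly combine prototypes with mask coefficients, as in YOLACT's mask assembly.

    prototypes: (H, W, k) prototype masks from the protonet
    coeffs:     (N, k) mask coefficients for N detected instances
    returns:    (H, W, N) soft masks, one per instance
    """
    masks = torch.sigmoid(prototypes @ coeffs.t())
    return masks  # in the full pipeline these are then cropped by the predicted box and thresholded
```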

According to the evaluation, YOLACT processes 550×550 images at 33 FPS; since most videos on the internet run at about 30 FPS, that is what "real-time" means here. As a fairly early single-stage work, this speed is acceptable even if not spectacular.

· 17 min read
Gavin Gong

Qiang Chen, Yingming Wang, Tong Yang, Xiangyu Zhang, Jian Cheng, Jian Sun

This paper revisits feature pyramids networks (FPN) for one-stage detectors and points out that the success of FPN is due to its divide-and-conquer solution to the optimization problem in object detection rather than multi-scale feature fusion. From the perspective of optimization, we introduce an alternative way to address the problem instead of adopting the complex feature pyramids - utilizing only one-level feature for detection. Based on the simple and efficient solution, we present You Only Look One-level Feature (YOLOF). In our method, two key components, Dilated Encoder and Uniform Matching, are proposed and bring considerable improvements. Extensive experiments on the COCO benchmark prove the effectiveness of the proposed model. Our YOLOF achieves comparable results with its feature pyramids counterpart RetinaNet while being 2.5× faster. Without transformer layers, YOLOF can match the performance of DETR in a single-level feature manner with 7× less training epochs. With an image size of 608×608, YOLOF achieves 44.3 mAP running at 60 fps on 2080Ti, which is 13% faster than YOLOv4. Code is available at this https URL.

This paper is referred to as YOLOF below. As of this writing, multi-scale feature fusion is widely used in state-of-the-art two-stage and one-stage detectors, and FPN has almost become a taken-for-granted component of these networks.

The authors revisit the FPN module and point out that its two benefits are a divide-and-conquer solution to the optimization problem and multi-scale feature fusion. They study these two benefits on a one-stage detector by decoupling them in experiments on RetinaNet, examining the contribution of each, and conclude that FPN's benefit from multi-scale feature fusion may be smaller than commonly assumed.

Finally, the authors propose YOLOF, an object detection network that does not use FPN. Its main innovations are:

  1. Dilated Encoder
  2. Uniform Matching

The network is 2.5× faster than RetinaNet while reaching comparable accuracy. A sketch of the Dilated Encoder idea is given below.
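
A hedged sketch of the Dilated Encoder idea: a stack of residual blocks whose 3×3 convolutions use increasing dilation rates, so the single C5 feature map covers object scales that FPN would otherwise handle with multiple levels. The channel widths and the exact dilation rates below are illustrative assumptions, not the paper's configuration.

```python
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Bottleneck residual block whose middle 3x3 conv is dilated."""
    def __init__(self, channels, dilation, mid=128):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.block(x)

def build_dilated_encoder(channels=512, dilations=(2, 4, 6, 8)):
    """Stack residual blocks with growing dilation over the single-level feature map."""
    return nn.Sequential(*[DilatedResidualBlock(channels, d) for d in dilations])
```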

· 14 min read
Gavin Gong

Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, Jian Sun

Fully convolutional networks (FCNs) have been proven very successful for semantic segmentation, but the FCN outputs are unaware of object instances. In this paper, we develop FCNs that are capable of proposing instance-level segment candidates. In contrast to the previous FCN that generates one score map, our FCN is designed to compute a small set of instance-sensitive score maps, each of which is the outcome of a pixel-wise classifier of a relative position to instances. On top of these instance-sensitive score maps, a simple assembling module is able to output instance candidate at each position. In contrast to the recent DeepMask method for segmenting instances, our method does not have any high-dimensional layer related to the mask resolution, but instead exploits image local coherence for estimating instances. We present competitive results of instance segment proposal on both PASCAL VOC and MS COCO.

This work is also known as InstanceFCN. In instance segmentation, because it is hard for a single network to perform classification and segmentation at the same time, two-stage pipelines became popular first: generate instance proposals from the input, then run dense prediction inside each proposal (i.e. box first, then segment). Judging by its title this is not an instance segmentation paper; it is about how to obtain instance-level segmentation masks from an FCN.

Before reading, a reminder: the results of this work are rather weak, as is to be expected of early work, but it is quite inspiring and worth reading. The later FCIS (Fully Convolutional Instance-aware Semantic Segmentation) borrows the instance-sensitive score maps proposed here (please do not confuse the two works). A major contribution of this paper is using instance-sensitive score maps to tell different instances apart. A toy version of the assembling step follows.
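
To illustrate what the instance-sensitive score maps do, here is a toy version of the assembling step: an m x m sliding window is divided into k x k cells, and cell (i, j) is copied from the score map responsible for the (i, j)-th relative position. The function name and the divisibility assumption are mine.

```python
import torch

def assemble_candidate(score_maps, y0, x0, m, k=3):
    """Assemble one instance candidate from k*k instance-sensitive score maps.

    score_maps: (k*k, H, W); map i*k + j scores "this pixel is at relative position (i, j) of an instance"
    (y0, x0):   top-left corner of an m x m sliding window (m assumed divisible by k)
    """
    cell = m // k
    out = torch.zeros(m, m)
    for i in range(k):
        for j in range(k):
            ys, xs = y0 + i * cell, x0 + j * cell
            out[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell] = \
                score_maps[i * k + j, ys:ys + cell, xs:xs + cell]
    return out  # an m x m candidate score map; high values everywhere suggest a full instance
```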

· 21 min read
Gavin Gong

Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, Fengbo Ren

Deep neural networks have achieved remarkable success in computer vision tasks. Existing neural networks mainly operate in the spatial domain with fixed input sizes. For practical applications, images are usually large and have to be downsampled to the predetermined input size of neural networks. Even though the downsampling operations reduce computation and the required communication bandwidth, it removes both redundant and salient information obliviously, which results in accuracy degradation. Inspired by digital signal processing theories, we analyze the spectral bias from the frequency perspective and propose a learning-based frequency selection method to identify the trivial frequency components which can be removed without accuracy loss. The proposed method of learning in the frequency domain leverages identical structures of the well-known neural networks, such as ResNet-50, MobileNetV2, and Mask R-CNN, while accepting the frequency-domain information as the input. Experiment results show that learning in the frequency domain with static channel selection can achieve higher accuracy than the conventional spatial downsampling approach and meanwhile further reduce the input data size. Specifically for ImageNet classification with the same input size, the proposed method achieves 1.41% and 0.66% top-1 accuracy improvements on ResNet-50 and MobileNetV2, respectively. Even with half input size, the proposed method still improves the top-1 accuracy on ResNet-50 by 1%. In addition, we observe a 0.8% average precision improvement on Mask R-CNN for instance segmentation on the COCO dataset.

Comments: Accepted to CVPR 2020
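
The input transform described in the abstract can be pictured as a block DCT whose coefficients are regrouped into per-frequency channels, from which a learned selection keeps a subset. Below is a minimal NumPy sketch of the regrouping step only; the block size, channel layout, and function name are my assumptions for illustration (the paper additionally works in YCbCr and selects channels with a learned gating module, both omitted here).

```python
import numpy as np

def block_dct_channels(img, block=8):
    """Rearrange an image into frequency channels with a block DCT (sketch of the input transform).

    img: (H, W, C) array, with H and W divisible by `block`
    returns: (H//block, W//block, C*block*block), one channel per (color, frequency) pair,
             ready for a learned channel-selection mask
    """
    n = block
    # orthonormal DCT-II basis
    D = np.array([[np.sqrt((1 if k == 0 else 2) / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
                   for i in range(n)] for k in range(n)])
    H, W, C = img.shape
    out = np.zeros((H // n, W // n, C * n * n))
    for c in range(C):
        for by in range(H // n):
            for bx in range(W // n):
                patch = img[by * n:(by + 1) * n, bx * n:(bx + 1) * n, c]
                out[by, bx, c * n * n:(c + 1) * n * n] = (D @ patch @ D.T).ravel()  # 2-D DCT of the block
    return out
```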