
11 posts tagged "backbone"


· 10 min read
Gavin Gong

The DeepLab series comprises three papers: DeepLab-v1, DeepLab-v2, and DeepLab-v3.

DeepLab-v1: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

DeepLab-v2: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

DeepLab-v3: Rethinking Atrous Convolution for Semantic Image Segmentation

Here we read the three papers together.

A follow-up appeared later as well:

DeepLab-v3+: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

but for now there is no plan to cover it here.
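The thread running through all three papers is atrous (dilated) convolution. As a quick illustration (not code from the papers), here is a minimal PyTorch sketch showing that a dilated 3×3 convolution enlarges the receptive field without adding parameters or changing the output size:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)

# A standard 3x3 convolution and a 3x3 convolution with dilation rate 2.
# Both have the same number of parameters; the dilated one covers a 5x5
# receptive field by inserting gaps between kernel taps.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
atrous = nn.Conv2d(3, 8, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape)    # torch.Size([1, 8, 64, 64])
print(atrous(x).shape)  # torch.Size([1, 8, 64, 64]), same spatial size
```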

· 11 min read
Gavin Gong

This note was written by VisualDust.

Original paper: Feature Pyramid Networks for Object Detection

This is the well-known FPN. It is a fairly early piece of work (CVPR 2017), and although it describes only one way of fusing multi-scale features, it was considered very advanced at the time and had many highlights. FPN mainly addresses the multi-scale problem in object detection: through simple changes to the network's connections, and with essentially no increase in the original model's computation, it substantially improves small-object detection.

Abstract

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.

This paper has had considerable influence on many later network designs, and I recommend reading the original. What follows are only rough reading notes.
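To make the top-down pathway and lateral connections concrete, here is a minimal PyTorch sketch of an FPN-style neck. Channel widths and names are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNNeck(nn.Module):
    """Minimal FPN-style top-down pathway with lateral connections."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs align backbone channels to a common width.
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels
        )
        # 3x3 convs smooth each merged map.
        self.smooths = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):  # feats: [C2, C3, C4, C5], high to low resolution
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Top-down: upsample the coarser map and add it to the lateral below.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return [s(lat) for s, lat in zip(self.smooths, laterals)]

feats = [torch.randn(1, c, 64 // 2**i, 64 // 2**i)
         for i, c in enumerate((256, 512, 1024, 2048))]
print([o.shape for o in FPNNeck()(feats)])  # four maps, 256 channels each
```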

· 17 min read
PommesPeter

This paper presents a lightweight backbone network. Original paper: MobileNetV2: Inverted Residuals and Linear Bottlenecks

  • The paper improves performance through three structural changes to a lightweight feature-extraction network.
  • Overall idea: extract sufficiently rich features from low-dimensional tensors.

Abstract:

In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3. The MobileNetV2 architecture is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, COCO object detection [2], VOC image segmentation [3]. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as actual latency, and the number of parameters.
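As a concrete reading of "inverted residual with linear bottleneck", here is a minimal PyTorch sketch (stride-1, same-width case only; the channel width and expansion factor are illustrative assumptions):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of an inverted residual block: expand with a 1x1 conv, filter
    with a 3x3 depthwise conv, then project back to a thin bottleneck with a
    *linear* 1x1 conv (no activation after the projection)."""

    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),   # expand
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),         # depthwise filter
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),   # linear projection
            nn.BatchNorm2d(channels),                     # note: no ReLU here
        )

    def forward(self, x):
        # The shortcut connects the thin bottlenecks, not the wide layers.
        return x + self.block(x)

x = torch.randn(1, 24, 56, 56)
print(InvertedResidual(24)(x).shape)  # torch.Size([1, 24, 56, 56])
```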

· 15 min read
Zerorains

This paper presents a fast semantic segmentation network. Paper: Fast-SCNN: Fast Semantic Segmentation Network

  • The network is designed around a two-branch (two-stream) architecture.
  • Overall idea: cut redundant convolution work to gain speed.

Abstract:

The encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024 × 2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our ‘learning to downsample’ module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
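To illustrate the "learning to downsample" idea, here is a minimal PyTorch sketch of a shared stem that reduces resolution by 8× before the branches split. Channel widths and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

def dsconv(cin, cout, stride):
    """Depthwise separable conv, the workhorse of fast segmentation models."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride, 1, groups=cin, bias=False),
        nn.BatchNorm2d(cin), nn.ReLU(inplace=True),
        nn.Conv2d(cin, cout, 1, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class LearningToDownsample(nn.Module):
    """Sketch of a shared 'learning to downsample' stem: three strided layers
    cut the resolution by 8x and feed both resolution branches at once."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
        )
        self.ds1 = dsconv(32, 48, stride=2)
        self.ds2 = dsconv(48, 64, stride=2)

    def forward(self, x):
        return self.ds2(self.ds1(self.conv(x)))

x = torch.randn(1, 3, 1024, 2048)  # Cityscapes-sized input
print(LearningToDownsample()(x).shape)  # torch.Size([1, 64, 128, 256])
```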

· 9 min read
PommesPeter

This paper presents a lightweight backbone network. Original paper: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

  • It proposes an efficient neural network for mobile and embedded devices.
  • It proposes the depthwise separable convolution (DSC), a convolution module with a much smaller operation count.

Abstract:

We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.
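A minimal PyTorch sketch of the parameter saving from factoring a standard convolution into a depthwise plus pointwise pair (channel counts are arbitrary):

```python
import torch.nn as nn

cin, cout, k = 64, 128, 3

# Standard convolution: every output channel looks at every input channel.
standard = nn.Conv2d(cin, cout, k, padding=1, bias=False)

# Depthwise separable convolution: a per-channel spatial filter (depthwise)
# followed by a 1x1 pointwise conv that mixes channels.
depthwise = nn.Conv2d(cin, cin, k, padding=1, groups=cin, bias=False)
pointwise = nn.Conv2d(cin, cout, 1, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                      # 73728 (= 64*128*3*3)
print(count(depthwise) + count(pointwise))  # 8768  (= 64*3*3 + 64*128)
```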

· 21 min read
PommesPeter

Paper: Deep Retinex Decomposition for Low-Light Enhancement

Authors: Chen Wei, Wenjing Wang, Wenhan Yang, Jiaying Liu

Code: https://github.com/weichen582/RetinexNet

This paper applies convolutional neural networks to low-light image enhancement.

  • It uses a decomposition network and an enhancement network: Retinex theory drives the decomposition, and the decomposed components are then enhanced.

Abstract

Retinex model is an effective tool for low-light image enhancement. It assumes that observed images can be decomposed into the reflectance and illumination. Most existing Retinex-based methods have carefully designed hand-crafted constraints and parameters for this highly ill-posed decomposition, which may be limited by model capacity when applied in various scenes. In this paper, we collect a LOw-Light dataset (LOL) containing low/normal-light image pairs and propose a deep Retinex-Net learned on this dataset, including a Decom-Net for decomposition and an Enhance-Net for illumination adjustment. In the training process for Decom-Net, there is no ground truth of decomposed reflectance and illumination. The network is learned with only key constraints including the consistent reflectance shared by paired low/normal-light images, and the smoothness of illumination. Based on the decomposition, subsequent lightness enhancement is conducted on illumination by an enhancement network called Enhance-Net, and for joint denoising there is a denoising operation on reflectance. The Retinex-Net is end-to-end trainable, so that the learned decomposition is by nature good for lightness adjustment. Extensive experiments demonstrate that our method not only achieves visually pleasing quality for low-light enhancement but also provides a good representation of image decomposition.

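As a sketch of how such a decomposition can be trained without ground-truth reflectance or illumination, the snippet below combines a reconstruction term (reflectance × illumination should reproduce each input) with a reflectance-consistency term across the low/normal-light pair. Function and variable names are mine; the loss weights and the illumination-smoothness term are omitted:

```python
import torch

def decomposition_loss(R_low, I_low, S_low, R_normal, I_normal, S_normal):
    """Toy Decom-Net-style objective: each image is reconstructed by its own
    reflectance x illumination, and the paired low/normal-light images are
    pushed to share one reflectance (reflectance is lighting-invariant)."""
    recon = (R_low * I_low - S_low).abs().mean() + \
            (R_normal * I_normal - S_normal).abs().mean()
    consistency = (R_low - R_normal).abs().mean()
    return recon + consistency

# Random stand-ins: 3-channel images/reflectance, 1-channel illumination.
S_low, S_normal = torch.rand(1, 3, 96, 96), torch.rand(1, 3, 96, 96)
R_low, R_normal = torch.rand(1, 3, 96, 96), torch.rand(1, 3, 96, 96)
I_low, I_normal = torch.rand(1, 1, 96, 96), torch.rand(1, 1, 96, 96)
print(decomposition_loss(R_low, I_low, S_low, R_normal, I_normal, S_normal))
```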

· 20 min read
Gavin Gong

Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, Lei Li

We present a new, embarrassingly simple approach to instance segmentation in images. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that have made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the 'detect-then-segment' strategy as used by Mask R-CNN, or predict category masks first then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance mask segmentation into a classification-solvable problem. Now instance segmentation is decomposed into two classification tasks. We demonstrate a much simpler and flexible instance segmentation framework with strong performance, achieving on par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this very simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation.

Compared with semantic segmentation, instance segmentation must not only predict a semantic class for every pixel but also decide which instance each pixel belongs to. Earlier two-stage approaches mainly took one of two routes:

  1. Detect then segment: e.g. Mask R-CNN, which first detects each instance and then semantically segments it, so that the segmented pixels belong to that instance.
  2. Segment then group: first predict a semantic class for every pixel of the image, then learn an embedding vector and use clustering to pull together pixels belonging to the same instance (the same entity).

Work on single-stage instance segmentation, influenced by single-stage object detection, also falls roughly into two camps: methods inspired by one-stage anchor-based detectors such as YOLO and RetinaNet, represented by YOLACT and SOLO; and methods inspired by anchor-free detectors such as FCOS, represented by PolarMask and AdaptIS. None of these approaches is particularly direct or simple. SOLO's starting point is to do instance segmentation in a simpler, more direct way.

From statistics on the MS COCO dataset, the authors observe that of the 36,780 objects in the validation subset, 98.3% of object pairs have center distances greater than 30 pixels. Among the remaining 1.7% of pairs, 40.5% have a size ratio greater than 1.5×. Rare cases such as two objects forming an X shape are not considered. In short, in most cases two instances in an image differ either in center location or in object size.

The authors therefore propose distinguishing instances by their location and shape in the image: within one image, objects with exactly the same location and shape are the same instance. Since shape has many aspects, the paper pragmatically uses size to describe it, as in the sketch below.
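A toy sketch of that idea: assign each instance to the grid cell containing its center, so "which instance" reduces to "which cell". The function name and grid size S are illustrative assumptions:

```python
def solo_grid_cell(center_xy, image_wh, S=12):
    """SOLO-style instance category: an instance is assigned to the grid cell
    containing its center, so two instances with different centers get
    different 'instance categories' even if they share a semantic class."""
    cx, cy = center_xy
    w, h = image_wh
    i = min(int(cx / w * S), S - 1)  # column index
    j = min(int(cy / h * S), S - 1)  # row index
    return j, i

# Two instances with different centers land in different cells.
print(solo_grid_cell((100, 150), (640, 480)))  # (3, 1)
print(solo_grid_cell((500, 150), (640, 480)))  # (3, 9)
```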

The method achieves accuracy on par with Mask R-CNN and outperforms recent single-shot instance segmenters.

· 14 min read
Gavin Gong

Daniel Bolya, Chong Zhou, Fanyi Xiao, Yong Jae Lee

We present a simple, fully-convolutional model for real-time instance segmentation that achieves 29.8 mAP on MS COCO at 33.5 fps evaluated on a single Titan Xp, which is significantly faster than any previous competitive approach. Moreover, we obtain this result after training on only one GPU. We accomplish this by breaking instance segmentation into two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients. Then we produce instance masks by linearly combining the prototypes with the mask coefficients. We find that because this process doesn't depend on repooling, this approach produces very high-quality masks and exhibits temporal stability for free. Furthermore, we analyze the emergent behavior of our prototypes and show they learn to localize instances on their own in a translation variant manner, despite being fully-convolutional. Finally, we also propose Fast NMS, a drop-in 12 ms faster replacement for standard NMS that only has a marginal performance penalty.

YOLACT is short for You Only Look At CoefficienTs; the coefficients are one of the model's outputs, and the naming style presumably pays homage to the object detector YOLO.

(Figure: YOLACT's network architecture.)

The figure above shows YOLACT's network architecture. YOLACT's goal is to add a mask branch to an existing one-stage object detector. Personally, I see it as sitting between one-stage and two-stage. It counts as one-stage because it adds the mask branch to a one-stage detector the way Mask R-CNN extends Faster R-CNN, but without explicit localization steps such as feature repooling or RoI align. In other words, locate-classify-segment becomes segment-then-crop.
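The mask assembly itself is just a linear combination of image-level prototype masks with per-instance coefficients, followed by a sigmoid. A minimal sketch with illustrative shapes (treat k = 32 prototypes at 138×138 and the 5 detections as assumptions):

```python
import torch

# The protonet emits k prototype masks for the whole image, and each detected
# instance carries k coefficients; an instance mask is their linear combination.
k, h, w = 32, 138, 138
prototypes = torch.randn(h, w, k)   # shared, image-level prototype masks
coefficients = torch.randn(5, k)    # one k-vector per detected instance

masks = torch.sigmoid(prototypes @ coefficients.t())  # (h, w, 5)
print(masks.shape)  # torch.Size([138, 138, 5]), one mask per instance
```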

According to the evaluation, YOLACT reaches 33 FPS on 550×550 images; since most videos on the internet play at around 30 FPS, that is what "real-time" means here. As a fairly early single-stage work, the speed is not spectacular but still respectable.

· 17 min read
Gavin Gong

Qiang Chen, Yingming Wang, Tong Yang, Xiangyu Zhang, Jian Cheng, Jian Sun

This paper revisits feature pyramids networks (FPN) for one-stage detectors and points out that the success of FPN is due to its divide-and-conquer solution to the optimization problem in object detection rather than multi-scale feature fusion. From the perspective of optimization, we introduce an alternative way to address the problem instead of adopting the complex feature pyramids - utilizing only one-level feature for detection. Based on the simple and efficient solution, we present You Only Look One-level Feature (YOLOF). In our method, two key components, Dilated Encoder and Uniform Matching, are proposed and bring considerable improvements. Extensive experiments on the COCO benchmark prove the effectiveness of the proposed model. Our YOLOF achieves comparable results with its feature pyramids counterpart RetinaNet while being 2.5× faster. Without transformer layers, YOLOF can match the performance of DETR in a single-level feature manner with 7× less training epochs. With an image size of 608×608, YOLOF achieves 44.3 mAP running at 60 fps on 2080Ti, which is 13% faster than YOLOv4. Code is available at this https URL.

We refer to this paper as YOLOF. As of this writing, multi-scale feature fusion is used widely in SOTA two-stage and one-stage object detectors, and FPN has almost become a taken-for-granted component of such networks.

The authors revisit the FPN module and identify its two supposed advantages: its divide-and-conquer solution and multi-scale feature fusion. Working with one-stage detectors, they run experiments on RetinaNet that decouple the two advantages to measure each one's contribution, and they conclude that the role FPN plays in multi-scale feature fusion may be smaller than commonly imagined.

Finally, the authors propose YOLOF, an object detection network that does without FPN. Its main innovations are:

  1. Dilated Encoder
  2. Uniform Matching

The network is 2.5× faster than RetinaNet at comparable accuracy.
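A minimal sketch of the Dilated Encoder idea: residual bottleneck blocks with increasing dilation rates, so that a single C5-level feature map covers several object scales. Channel widths and dilation rates here are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """Residual bottleneck with a dilated 3x3 conv in the middle; stacking
    blocks with growing dilation mixes receptive fields of several sizes."""

    def __init__(self, channels=512, mid=128, dilation=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # The residual path preserves the smaller receptive field too.
        return x + self.block(x)

encoder = nn.Sequential(*[DilatedBottleneck(dilation=d) for d in (2, 4, 6, 8)])
x = torch.randn(1, 512, 20, 20)  # a single C5-level feature map
print(encoder(x).shape)  # torch.Size([1, 512, 20, 20])
```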

· 14 min read
Gavin Gong

Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, Jian Sun

Fully convolutional networks (FCNs) have been proven very successful for semantic segmentation, but the FCN outputs are unaware of object instances. In this paper, we develop FCNs that are capable of proposing instance-level segment candidates. In contrast to the previous FCN that generates one score map, our FCN is designed to compute a small set of instance-sensitive score maps, each of which is the outcome of a pixel-wise classifier of a relative position to instances. On top of these instance-sensitive score maps, a simple assembling module is able to output instance candidate at each position. In contrast to the recent DeepMask method for segmenting instances, our method does not have any high-dimensional layer related to the mask resolution, but instead exploits image local coherence for estimating instances. We present competitive results of instance segment proposal on both PASCAL VOC and MS COCO.

This work is also known as InstanceFCN. In instance segmentation, because it is hard for one network to handle classification and segmentation at the same time, two-stage instance segmentation networks became popular first: find instance proposals in the input, then run dense prediction inside each proposal (box first, segment second). Judging by its title this is not an instance segmentation paper as such; it is about obtaining instance-level segmentation masks from an FCN.

A caveat before reading: the results of this work are rather weak, as one might expect of early work, but it is genuinely thought-provoking and worth reading. The later FCIS (Fully Convolutional Instance-aware Semantic Segmentation) borrows the instance-sensitive score maps proposed here (please do not confuse the two works). A major contribution of this paper is the idea of using instance-sensitive score maps to distinguish individual instances.
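A toy sketch of the assembling step: given k×k instance-sensitive score maps, a candidate at a sliding-window position is tiled cell by cell, with cell (i, j) copied from map i·k + j. All sizes and names below are illustrative assumptions:

```python
import torch

def assemble_instance(score_maps, y0, x0, win=21, k=3):
    """Assemble one instance candidate at window position (y0, x0) by copying
    each of the k*k window cells from its own instance-sensitive score map."""
    cell = win // k
    out = torch.empty(win, win)
    for i in range(k):          # row of the k x k tiling
        for j in range(k):      # column of the k x k tiling
            ys = slice(y0 + i * cell, y0 + (i + 1) * cell)
            xs = slice(x0 + j * cell, x0 + (j + 1) * cell)
            out[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell] = \
                score_maps[i * k + j][ys, xs]
    return out

score_maps = torch.randn(9, 64, 64)  # k*k = 9 instance-sensitive score maps
print(assemble_instance(score_maps, y0=10, x0=10).shape)  # torch.Size([21, 21])
```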