
· 15 min read
Gavin Gong

This is a walkthrough of a survey paper. Original paper: A Review on Deep Learning Techniques Applied to Semantic Segmentation.

Abstract:

Image semantic segmentation is more and more being of interest for computer vision and machine learning researchers. Many applications on the rise need accurate and efficient segmentation mechanisms: autonomous driving, indoor navigation, and even virtual or augmented reality systems to name a few. This demand coincides with the rise of deep learning approaches in almost every field or application target related to computer vision, including semantic segmentation or scene understanding. This paper provides a review on deep learning methods for semantic segmentation applied to various application areas. Firstly, we describe the terminology of this field as well as mandatory background concepts. Next, the main datasets and challenges are exposed to help researchers decide which are the ones that best suit their needs and their targets. Then, existing methods are reviewed, highlighting their contributions and their significance in the field. Finally, quantitative results are given for the described methods and the datasets in which they were evaluated, following up with a discussion of the results. At last, we point out a set of promising future works and draw our own conclusions about the state of the art of semantic segmentation using deep learning techniques.

I benefited a great deal from this survey; if you have time, please read the original. This post is only my rough notes from reading it.

· 17 min read
PommesPeter

This post covers a paper on a lightweight backbone network. Original paper: MobileNetV2: Inverted Residuals and Linear Bottlenecks.

  • The paper improves network performance through three structural modifications to a lightweight feature-extraction network.
  • Overall idea: obtain sufficiently rich features from low-dimensional tensors.

Abstract:

In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3. MobileNetV2 is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, COCO object detection [2], VOC image segmentation [3]. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as actual latency, and the number of parameters.
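
To make the inverted residual with linear bottleneck concrete, here is a minimal PyTorch sketch; the expansion factor of 6 and ReLU6 follow common descriptions of the paper, but treat the details as illustrative rather than the reference implementation:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of a MobileNetV2-style inverted residual block:
    1x1 expand -> 3x3 depthwise -> 1x1 linear projection (no ReLU)."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # expansion: lift to a higher-dimensional space
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # depthwise 3x3: cheap per-channel spatial filtering
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # linear bottleneck: project back down, deliberately no non-linearity
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        # shortcut connects the thin bottleneck layers, as the abstract says
        return x + out if self.use_residual else out

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 56, 56])
```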

· 15 min read
Zerorains

This post covers a paper on fast semantic segmentation. Paper: Fast-SCNN: Fast Semantic Segmentation Network.

  • The network is designed mainly around a two-branch architecture.
  • Overall idea: reduce redundant convolution to gain speed.

Abstract:

The encoder-decoder framework is state-of-the-art for offline semantic image segmentation. Since the rise in autonomous systems, real-time computation is increasingly desirable. In this paper, we introduce fast segmentation convolutional neural network (Fast-SCNN), an above real-time semantic segmentation model on high resolution image data (1024 × 2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our ‘learning to downsample’ module which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at lower resolution, yielding an accuracy of 68.0% mean intersection over union at 123.5 frames per second on Cityscapes. We also show that large scale pre-training is unnecessary. We thoroughly validate our metric in experiments with ImageNet pre-training and the coarse labeled data of Cityscapes. Finally, we show even faster computation with competitive results on subsampled inputs, without any network modifications.
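
The "learning to downsample" module is essentially a short shared stem whose output both branches reuse. Here is a minimal PyTorch sketch; the layer widths (32/48/64) are my assumption for illustration, not necessarily the paper's exact numbers:

```python
import torch
import torch.nn as nn

def ds_conv(in_ch, out_ch, stride):
    """Depthwise separable conv: depthwise 3x3 followed by pointwise 1x1."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class LearningToDownsample(nn.Module):
    """Sketch of the shared stem: three stride-2 stages yield a
    1/8-resolution feature map that both branches consume."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, 2, 1, bias=False),  # full-resolution RGB in
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            ds_conv(32, 48, 2),
            ds_conv(48, 64, 2),
        )

    def forward(self, x):
        return self.stem(x)

x = torch.randn(1, 3, 1024, 2048)  # Cityscapes-sized input
print(LearningToDownsample()(x).shape)  # torch.Size([1, 64, 128, 256])
```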

· 9 min read
PommesPeter

This post covers a paper on a lightweight backbone network. Original paper: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.

  • The paper proposes an efficient neural network for mobile and embedded devices.
  • It proposes a convolution module with a small operation count: the depthwise separable convolution (DSC, a sketch follows after the abstract).

Abstract:

We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.
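
For reference, a minimal PyTorch sketch of depthwise separable convolution. The cost intuition: a standard K×K convolution costs H·W·K²·C_in·C_out multiply-adds, while DSC factorizes this into H·W·K²·C_in (depthwise) plus H·W·C_in·C_out (pointwise):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """DSC = depthwise KxK conv (one filter per channel, groups=in_ch)
    followed by a pointwise 1x1 conv that mixes channels."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride, k // 2,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a standard 3x3 conv, 64 -> 128 channels:
std = nn.Conv2d(64, 128, 3, padding=1, bias=False)
dsc = DepthwiseSeparableConv(64, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), count(dsc))  # 73728 vs 64*9 + 64*128 = 8768
```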

· 10 min read

Paper: Gated Channel Transformation for Visual Recognition

Authors: Zongxin Yang, Linchao Zhu, Yu Wu, and Yi Yang

Code: https://github.com/z-x-yang/GCT

Abstract

  • GCT is a generally applicable gated transformation unit that can be optimized jointly with the network weights.
  • Unlike SENet, which learns implicitly through fully connected layers, GCT uses interpretable variables to explicitly model the relations between channels, deciding whether they compete or cooperate (see the sketch at the end of this excerpt).

Keywords: interpretability, explicit relations, gating

Introduction

  • A single convolutional layer operates only on the local neighborhood of each spatial position in the feature map, which can cause local ambiguity. There are usually two remedies: increase the network's depth, as in VGG and ResNet, or increase the network's width to gather more global information, as in GENet's heavy use of neighborhood embeddings and SENet's modeling of channel relations through global embeddings.
  • However, SENet's use of fc layers brings several problems:
    1. Because fc layers are parameter-hungry, the module cannot be used after every layer
    2. The fc parameters are complex and make it hard to analyze the relations between channels; this is effectively implicit learning
    3. Placing the module after certain layers causes problems
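
For contrast with SENet's fc layers, here is a minimal sketch of a GCT-style gate following the paper's formulation: an ℓ2 embedding per channel scaled by α, a channel normalization that makes the competition explicit, and a tanh gate with per-channel γ and β. This is an illustration only; see the linked repository for the official code:

```python
import torch
import torch.nn as nn

class GCT(nn.Module):
    """Gated Channel Transformation (sketch). Only 3C parameters:
    alpha (embedding scale), gamma/beta (gating), all per channel."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        # global context embedding: l2 norm over spatial positions, scaled by alpha
        embedding = self.alpha * x.pow(2).sum(dim=(2, 3), keepdim=True).add(self.eps).sqrt()
        # channel normalization: puts the channels in explicit competition
        norm = embedding * (embedding.pow(2).mean(dim=1, keepdim=True) + self.eps).rsqrt()
        # gating: tanh allows both competition (gate < 1) and cooperation (gate > 1);
        # gamma initialized to 0 makes the module start as an identity mapping
        return x * (1.0 + torch.tanh(self.gamma * norm + self.beta))

x = torch.randn(2, 64, 32, 32)
print(GCT(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```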

· 8 min read

Paper: CBAM: Convolutional Block Attention Module

Authors: Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, Korea Advanced Institute of Science and Technology, Daejeon, Korea

Abstract

  • CBAM (Convolutional Block Attention Module) is a simple yet effective attention module for feed-forward convolutional neural networks.
  • It is a mixed-domain attention mechanism: it infers attention maps sequentially along the channel dimension and then the spatial dimension (see the sketch at the end of this excerpt).
  • CBAM is a lightweight, general-purpose module that can be seamlessly integrated into any CNN.

Keywords: object recognition, attention mechanism, gated convolution

Introduction

  • Convolutional neural networks (CNNs) have markedly improved performance on vision tasks thanks to their rich representational power; current work mainly focuses on three important factors of a network: depth, width, and cardinality.
  • From LeNet to residual networks, networks have become deeper and more expressive; GoogLeNet showed that width is another important factor in model performance; Xception and ResNeXt increase the network's cardinality, saving parameters while obtaining stronger representational power than depth or width (as argued in the ResNeXt paper).
  • Beyond these factors, this paper examines a different aspect of architecture design: attention.
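
To make the "channel first, then spatial" ordering concrete, here is a minimal PyTorch sketch of the two CBAM sub-modules; the reduction ratio of 16 and the 7×7 kernel are common choices and should be read as illustrative:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Shared MLP over global avg- and max-pooled descriptors -> sigmoid gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)        # (B, C, 1, 1) channel gate

class SpatialAttention(nn.Module):
    """7x7 conv over channel-wise avg and max maps -> per-pixel sigmoid gate."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # channel attention first...
        return x * self.sa(x)   # ...then spatial attention

print(CBAM(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```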

· 19 min read

Paper: Boundary IoU: Improving Object-Centric Image Segmentation Evaluation

Authors: Bowen Cheng, Ross Girshick, Piotr Dollár, Alexander C. Berg, Alexander Kirillov

Code: https://github.com/bowenc0221/boundary-iou-api

A note up front:

As the name suggests, Boundary IoU is simply the IoU between boundary contours.

The key parts of the paper are Sections 3.4 and 5.1; the rest is mostly comparison experiments.

Abstract

  • Proposes a new segmentation evaluation metric based on boundary quality: Boundary IoU;
  • Boundary IoU is markedly more sensitive to boundary errors on large objects than the standard Mask IoU, and does not over-penalize errors on smaller objects;
  • It is better suited as a segmentation evaluation metric than the alternatives.
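
As a concrete reference, a minimal NumPy/SciPy sketch of the idea: take the band of pixels within distance d of each mask's contour (the mask minus its erosion) and compute plain IoU on those two bands. The handling of d is simplified here; the paper ties it to the image diagonal:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def boundary_region(mask, d):
    """Pixels of `mask` within roughly d pixels of its contour:
    the mask minus a d-step erosion of it."""
    return mask & ~binary_erosion(mask, iterations=d)

def boundary_iou(gt, pred, d=2):
    """Sketch of Boundary IoU: plain IoU computed on the two boundary bands.
    (The paper sets d relative to the image diagonal; a fixed pixel
    value is used here for simplicity.)"""
    g, p = boundary_region(gt, d), boundary_region(pred, d)
    inter = np.logical_and(g, p).sum()
    union = np.logical_or(g, p).sum()
    return inter / union if union else 1.0

gt = np.zeros((64, 64), dtype=bool); gt[8:56, 8:56] = True
pred = np.zeros((64, 64), dtype=bool); pred[10:58, 10:58] = True
print(round(boundary_iou(gt, pred), 3))
```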

· 6 min read

Paper: Involution: Inverting the Inherence of Convolution for Visual Recognition

Authors: Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, Qifeng Chen; The Hong Kong University of Science and Technology, ByteDance AI Lab, Peking University, Beijing University of Posts and Telecommunications

Convolution

  1. Spatial-agnostic: the same kernel for every position
    • Pros: parameter sharing, translation equivariance
    • Cons: the parameters cannot adapt per position, and the kernel cannot be made too large, so the receptive field can only be enlarged by stacking layers to capture long-range relations
  2. Channel-specific: different kernels for different channels
    • Pros: fully extracts the information in each channel
    • Cons: redundancy across channels

In this notation a convolution kernel has shape (B, C_out, C_in, K, K); in practice the same (C_out, C_in, K, K) kernel is shared across the batch.

Involution

Unlike convolution, involution has exactly the opposite properties:

  1. Spatial-specific: a kernel privatized for each position
  2. Channel-agnostic: the kernel is shared across different channels

An involution kernel has shape (B, G, K·K, H, W): one kernel per spatial position, shared by all channels within each of the G groups (see the sketch below).
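
A minimal PyTorch sketch of an involution layer (stride 1). The kernel-generation bottleneck (1×1 reduce, ReLU, 1×1 span) and the reduction ratio are illustrative choices; see the paper's official code for the exact design:

```python
import torch
import torch.nn as nn

class Involution(nn.Module):
    """Sketch of involution: a small bottleneck generates one KxK kernel per
    spatial position, shared by all C//G channels in each of the G groups."""
    def __init__(self, channels, k=7, groups=1, r=4):
        super().__init__()
        self.k, self.g = k, groups
        # kernel generation: (B, C, H, W) -> (B, G*K*K, H, W)
        self.reduce = nn.Conv2d(channels, channels // r, 1)
        self.span = nn.Conv2d(channels // r, groups * k * k, 1)
        self.unfold = nn.Unfold(k, padding=k // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # per-position kernels: (B, G, 1, K*K, H, W)
        kernel = self.span(torch.relu(self.reduce(x)))
        kernel = kernel.view(b, self.g, self.k * self.k, h, w).unsqueeze(2)
        # unfold input patches: (B, G, C//G, K*K, H, W)
        patches = self.unfold(x).view(b, self.g, c // self.g,
                                      self.k * self.k, h, w)
        # weight each patch by its position's kernel, sum over the KxK window
        return (kernel * patches).sum(dim=3).reshape(b, c, h, w)

x = torch.randn(2, 16, 32, 32)
print(Involution(16, k=7, groups=4)(x).shape)  # torch.Size([2, 16, 32, 32])
```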

· 18 min read
Gavin Gong


"If we want the predicted segmentation map to be more accurate at boundary regions, we should not sample uniformly; rather, we should bias sampling toward the image's boundary regions."

These are reading notes on a paper about improving boundary quality in image segmentation. The method "treats the segmentation problem as a rendering problem" and achieves good results. Original paper: PointRend: Image Segmentation as Rendering. Before reading these notes, please make sure you understand image segmentation first; for a brief overview of segmentation techniques, you can refer to another of my notes.

Abstract

We present a new method for efficient high-quality image segmentation of objects and scenes. By analogizing classical computer graphics methods for efficient rendering with over- and undersampling challenges faced in pixel labeling tasks, we develop a unique perspective of image segmentation as a rendering problem. From this vantage, we present the PointRend (Point-based Rendering) neural network module: a module that performs point-based segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm. PointRend can be flexibly applied to both instance and semantic segmentation tasks by building on top of existing state-of-the-art models. While many concrete implementations of the general idea are possible, we show that a simple design already achieves excellent results. Qualitatively, PointRend outputs crisp object boundaries in regions that are over-smoothed by previous methods. Quantitatively, PointRend yields significant gains on COCO and Cityscapes, for both instance and semantic segmentation. PointRend's efficiency enables output resolutions that are otherwise impractical in terms of memory or computation compared to existing approaches. Code has been made available at this https URL.
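
The "adaptively selected locations" in the abstract boil down to picking the points where the coarse mask is least certain, i.e. foreground probability closest to 0.5. A minimal sketch of just that selection step (the point head that re-predicts the selected points is omitted):

```python
import torch

def select_uncertain_points(probs, num_points):
    """Pick the `num_points` pixels whose foreground probability is closest
    to 0.5, i.e. where the coarse mask is least certain. Returns flat indices
    into the (H*W) grid. This is only the selection step of PointRend's
    subdivision loop; the point-wise re-prediction head is omitted."""
    uncertainty = -(probs - 0.5).abs()       # higher = less certain
    flat = uncertainty.flatten(start_dim=1)  # (B, H*W)
    return flat.topk(num_points, dim=1).indices

probs = torch.rand(1, 64, 64)                # coarse mask probabilities
idx = select_uncertain_points(probs, num_points=128)
print(idx.shape)  # torch.Size([1, 128])
```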

· 21 min read

Paper: Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

Code: https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py

Preface

Encoder-decoder models based on RNNs or CNNs dominate NLP, yet they are not without flaws:

  • RNN models such as LSTM and GRU are limited by their inherently sequential structure and cannot be parallelized; for long sequences the computation is especially inefficient. Although recent work such as factorization tricks [1] and conditional computation [2] improves efficiency and performance to some extent, the sequential-computation constraint remains;
  • CNN models such as the Extended Neural GPU [3], ByteNet [4], and ConvS2S [5] do allow parallel computation, but learning long-range dependencies between two arbitrary positions remains difficult, as their computational cost grows linearly or logarithmically with distance.

Google instead chose to abandon the structure these mainstream models share and proposed the Transformer, based entirely on the attention mechanism, with advantages the other models cannot match:

  • The Transformer can be trained efficiently in parallel and is therefore very fast: training took only 3.5 days on 8 GPUs;
  • For learning long-range dependencies, the Transformer reduces the cost to a constant number of operations, and uses multi-head attention to counteract the reduced effective resolution caused by averaging attention-weighted positions;
  • The Transformer is an auto-encoding model and can exploit left and right context simultaneously.

Overall structure

The Transformer's overall structure is an encoder-decoder. Auto-encoding models are mainly used for language understanding; for generation tasks, auto-regressive models still hold the advantage.

[Figure: the overall Transformer encoder-decoder architecture]

We can divide it into four parts: the input, the encoder blocks, the decoder blocks, and the output.

Next, let's go through the whole structure in order. Before reading on, please study the figure above carefully, and keep referring to it as you read.
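
Before walking through the blocks, it helps to have the core operation in hand: scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal PyTorch sketch (multi-head attention simply runs h of these in parallel on learned projections):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    q, k, v: (batch, seq_len, d_k); positions where mask is True are blocked."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 10, 64])
```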