
Paper Review

Object-Contextual Representations for Semantic Segmentation


Abstract

context aggregation problem in semantic segmentation

Assumption: the surrounding pixels should be considered as well

object-contextual representations → representation of the corresponding object class.

  1. learn object regions (supervision: ground-truth segmentation)
  2. compute the object region representation
    • compute the relation between each pixel and each object region
  • augment the representation of each pixel with the OCR

  • Transformer encoder-decoder framework: the OCR scheme can be rephrased in this framework

    • cross-attention module (decoder)

      1. object region learning (supervision: ground-truth segmentation)
      2. object region representation computation (aggregate the representations of the pixels in each object region)
      • output: object region representations
      • category queries: linear projection
    • cross-attention module (encoder)

      • compute the relation between each pixel and each object region
      • augment the representation of each pixel with the OCR (weighted aggregation of all object region representations)
      • keys, values = decoder output
      • queries = representation at each position

various benchmarks ⇒ competitive performance!

1. Introduction

Semantic segmentation: assign a class label to each pixel of an image

Contextual aggregation

  • previous studies: focus on the spatial scale of contexts
    • multi-scale context: ASPP, PPM
    • relational context: DANet, CFNet, OCNet (relations between a position and its contextual positions)

contextual representation scheme: relation between a position and its context

Goal

: augment the representation of each pixel by exploiting the representation of the object region of the corresponding class.

: when classifying a pixel, use both its representation and the object region it belongs to.


Ground-truth object regions) improve segmentation quality!

: assuming the soft regions do their job perfectly, i.e., when the ground truth is given, the model performs much better, which proves that the soft regions actually help!

Main steps

  1. divide the contextual pixels into a set of soft object regions (one per class)! → coarse soft segmentation
    • division: supervised by the GT segmentation
  2. aggregate the representations of the pixels → estimate the representation of each object region!
  3. augment the representation of each pixel with the object-contextual representation (OCR)!

The OCR is a weighted aggregation of all the object region representations.

  • weights: computed from the relations between the pixel and the object regions
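
In symbols (my shorthand for this aggregation; the exact transform functions are spelled out in Sec. 3.2): y_i = Σ_k w_ik · f_k, where f_k is the k-th object region representation and w_ik is the normalized relation between pixel p_i and region k.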

Differ from multi-scale context schemes

OCR

  • distinguishes contextual pixels of the same object class from contextual pixels of different object classes

    : i.e., it distinguishes car pixels from non-car pixels

multi-scale (ASPP, PPM)

  • only distinguishes pixels at different spatial positions


our OCR: the context is the set of pixels in the object (blue marks in the figure)

Differ from relational context schemes

OCR

  • structures the contextual pixels into object regions
  • uses the relations between the pixel and the object regions

relational

  • considers each contextual pixel individually
  • uses only the relations between the pixel and the contextual pixels
  • predicts the relations from the pixels alone, without considering regions

2. Related Work

Multi-scale context

  • PSPNet: pyramid pooling representations / captures multi-scale context (early work)
  • DeepLab series: adopts parallel dilated convolutions (with different dilation rates)
    • ASPP
  • recent works: DenseASPP, encoder-decoder structures

Relational context

  • DANet, CFNet, OCNet: augment each pixel representation

    consider the relations (similarities) between pixels based on a self-attention scheme

  • Double Attention, ACFNet

    group: the pixels → a set of regions

    augment the pixel representation ← aggregate the region representations while taking the contextual relations into account

  • Our approach

    a relational context approach, related to Double Attention and ACFNet

    Difference
    : region formation (learning) & pixel-region relation computation

    • this paper:

      supervised with the ground-truth segmentation

      relations: consider both the pixel and the region

    • previous works (except ACFNet):

      regions formed in an unsupervised way

      relations: consider only the pixels

Coarse-to-fine Segmentation

  • this paper:

    in some sense also a coarse-to-fine scheme → from the soft object regions (coarse) to progressively finer detail.

    BUT: the coarse segmentation map is used for generating a contextual representation.

Region-wise segmentation

  • Ours: the regions are not classified individually; they are used as additional information for better learning.

3. Approach

What is semantic segmentation?
The problem of assigning a label l_i to each pixel p_i of an image I (where l_i is one of K different classes)


3.1 Background

Multi-scale context

based on dilated convolutions / captures the context of multiple scales without losing resolution

  • ASPP (a rough sketch follows after this list)

    captures the multi-scale context info ← parallel dilated convolutions with different dilation rates

    output multi-scale contextual representation: the concatenation of the representation outputs of the parallel dilated convolutions

  • PSPNet - pyramid pooling module

    regular convolutions on representations of different scales

    captures the contexts of multiple scales
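
To make the ASPP idea concrete, here is a minimal PyTorch-style sketch (my own illustration, not the paper's code; channel sizes and dilation rates are placeholders):

```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Parallel dilated convolutions whose outputs are concatenated.

    The dilation rates (1, 12, 24, 36) and channel sizes are illustrative only.
    """
    def __init__(self, in_ch=2048, out_ch=256, rates=(1, 12, 24, 36)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * len(rates), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # run the parallel dilated convolutions and concatenate along channels
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```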

Relational context

computes the context for each pixel by considering the relations between the pixel and its contextual pixels

3.2 Formulation (meaning of each term)

(1) structure all the pixels in image I into K soft object regions

(2) represent each object region as f_k

(3) augment the representation of each pixel


  • take K region features.
  • transformation function: refines the features. 1x1 conv → BN → ReLU # same form, but different parameters each time it is used.
  • k: index of a soft object region

Soft object regions

take only the representations of the pixels corresponding to each class.

compute K object regions from an intermediate representation output of the backbone

During training: the object region generator is learned with the ground-truth segmentation as supervision & a cross-entropy loss (a rough sketch follows below)
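
A minimal sketch of how the soft object regions could be produced and supervised (hypothetical names, channel sizes, and ignore index; PyTorch assumed):

```python
import torch.nn as nn
import torch.nn.functional as F

num_classes = 19                                          # K object regions (e.g. Cityscapes)
region_head = nn.Conv2d(512, num_classes, kernel_size=1)  # the linear "object region generator"

def soft_object_regions(feats, gt=None):
    """feats: backbone features (B, 512, H, W); gt: label map (B, H, W) or None."""
    logits = region_head(feats)          # (B, K, H, W): coarse soft segmentation
    aux_loss = None
    if gt is not None:
        # supervision with a pixel-wise cross-entropy loss during training
        aux_loss = F.cross_entropy(logits, gt, ignore_index=255)
    return logits, aux_loss
```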

Object region representations

: aggregate the representations of all the pixels, weighted by the degree to which each pixel belongs to the k-th object region

  • the k-th object region representation:

    f_k = Σ_i m_ki · x_i   (m_ki: normalized degree to which pixel p_i belongs to the k-th object region)

  • x_i: representation of pixel p_i
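
In tensor form this aggregation is a single einsum; a toy sketch with illustrative shapes:

```python
import torch

B, K, C, H, W = 1, 19, 512, 64, 128
region_logits = torch.randn(B, K, H, W)   # soft object regions (coarse segmentation)
pixel_feats = torch.randn(B, C, H, W)     # pixel representations x_i

m = region_logits.flatten(2).softmax(dim=2)   # (B, K, N): normalize each region over the pixels
x = pixel_feats.flatten(2)                    # (B, C, N)
f = torch.einsum('bkn,bcn->bkc', m, x)        # (B, K, C): f_k = sum_i m_ki * x_i
```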

Object contextual representations

: based on the relation between each pixel and each object region

    w_ik = exp(κ(x_i, f_k)) / Σ_j exp(κ(x_i, f_j)),   y_i = ρ( Σ_k w_ik · δ(f_k) )
    (κ: unnormalized pixel-region relation; δ, ρ: transform functions)

Why compute it like a softmax?

  • ReLU outputs are unbounded, and if the weights have no fixed range they can blow up.
    → normalization is used to bring the range into 0~1
  • the exponential form widens the existing differences even further.

Augmented representations

: the final representation for pixel p_i

  • aggregation of 2 parts

    (1) the original representation x_i

    (2) the object-contextual representation y_i

    z_i = g([x_i ; y_i])   (concatenate x_i and y_i, then apply the transform g)

  • z_i: obtained with a transform function g that fuses (aggregates) x_i and y_i

  • 3.3 Segmentation Transformer: Rephrasing the OCR Method


    : the OCR pipeline rephrased as a Transformer encoder-decoder structure

    The 3 steps of the OCR pipeline

    1) soft object region extraction

    2) object region representation computation

    3) object-contextual representation computation for each position

    : mainly examines the cross-attention modules of the decoder and the encoder.


    Attention: computed as a scaled dot-product

    • attention weight a_ij: softmax normalization of the dot product between query q_i and key k_j
    • attention output for each query q_i = aggregation of the values weighted by the attention weights

    a_ij = softmax_j( q_i · k_j / √d ),   attention output_i = Σ_j a_ij · v_j

    Decoder cross-attention

    • 2 roles

      1) soft object region extraction ← K category queries

      2) object region representation computation

    Encoder cross-attention

    : aggregating the object region representations (this carries out the computation of y_i; a minimal sketch follows below)

    • queries: image features at each position
    • keys, values: decoder outputs
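
A minimal sketch of this cross-attention (single head, no learned projections; the dimensions are toy values, not the paper's):

```python
import torch

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: (N, d) image features at each position
    keys:    (K, d) decoder outputs (object region representations)
    values:  (K, d)
    """
    d = queries.size(-1)
    scores = queries @ keys.t() / d ** 0.5   # (N, K): pixel-region relations
    weights = scores.softmax(dim=-1)         # normalize over the K regions
    return weights @ values                  # (N, d): weighted aggregation of region reprs

# usage with toy sizes: 4096 positions, 19 region tokens, 256-dim features
out = cross_attention(torch.randn(4096, 256), torch.randn(19, 256), torch.randn(19, 256))
```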

    Connection to class embedding and class attention.

    category queries are close to the class embedding

    • an embedding for each class # learns an embedding for all the classes
    • the encoder and decoder architecture is close to self-attention over both the class embedding and the image features.

    Connection to OCNet and interlaced self-attention

    OCNet: uses self-attention

    • the self-attention unit is accelerated by the interlaced self-attention unit
    • local self-attention + global self-attention
    • the category queries are replaced by regularly sampled or adaptively pooled image features (not learned as model parameters)

3.3 Architecture

Backbone:

  • dilated ResNet-101 (with output stride 8)
    • two representations are used as input:
      • one for predicting the coarse segmentation (object regions)
      • the other passes through a 3x3 convolution (512 output channels) and then enters the OCR module
  • HRNet-W48 (with output stride 4)
    • only the final representation is used as input!

OCR module

  • the above formulation → implemented as the OCR module (a rough sketch follows after this list)

  • a linear function

    : predicts the coarse segmentation (the soft object regions)

    loss: pixel-wise cross-entropy

  • all the transform functions: 1x1 conv → BN → ReLU

  • the first three output 256 channels, the last two output 512 channels

  • predict the final segmentation from the final representation using a linear function

  • apply a pixel-wise cross-entropy loss on the final segmentation prediction
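
Putting the pieces together, a rough PyTorch sketch of an OCR-style module that follows the channel sizes noted above (my own reconstruction, not the authors' implementation; the coarse and final cross-entropy losses would be applied to the two returned outputs):

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # transform function: 1x1 conv -> BN -> ReLU
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class OCRModuleSketch(nn.Module):
    """Soft regions -> region representations -> pixel-region relations -> augmented representation."""

    def __init__(self, in_ch=512, key_ch=256, out_ch=512, num_classes=19):
        super().__init__()
        self.region_head = nn.Conv2d(in_ch, num_classes, 1)  # coarse segmentation (soft object regions)
        self.phi = conv_bn_relu(in_ch, key_ch)                # pixel transform (query)
        self.psi = conv_bn_relu(in_ch, key_ch)                # region transform (key)
        self.delta = conv_bn_relu(in_ch, key_ch)              # region transform (value)
        self.rho = conv_bn_relu(key_ch, out_ch)               # context transform
        self.g = conv_bn_relu(in_ch + out_ch, out_ch)         # fuse pixel + context
        self.cls_head = nn.Conv2d(out_ch, num_classes, 1)     # final segmentation

    def forward(self, x):
        b, c, h, w = x.shape
        coarse = self.region_head(x)                               # (B, K, H, W)
        m = coarse.flatten(2).softmax(dim=2)                       # each region normalized over pixels
        f = torch.einsum('bkn,bcn->bck', m, x.flatten(2))          # (B, C, K) region representations
        f = f.unsqueeze(-1)                                        # treat the K regions as a (K, 1) map
        q = self.phi(x).flatten(2)                                 # (B, key_ch, N)
        k = self.psi(f).squeeze(-1)                                # (B, key_ch, K)
        v = self.delta(f).squeeze(-1)                              # (B, key_ch, K)
        rel = torch.einsum('bdn,bdk->bnk', q, k).softmax(dim=-1)   # pixel-region relations w_ik
        y = torch.einsum('bnk,bdk->bdn', rel, v).view(b, -1, h, w)
        y = self.rho(y)                                            # object-contextual representation
        z = self.g(torch.cat([x, y], dim=1))                       # augmented representation z_i
        return self.cls_head(z), coarse

# usage: final_logits, coarse_logits = OCRModuleSketch()(torch.randn(2, 512, 64, 64))
```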

3.4 Empirical Analysis

Object region supervision:

We can see that the supervision for forming the object regions is crucial for the performance.

Pixel-region relations:

  • the effect of the object region supervision and of the pixel-region relation estimation scheme

    with supervision / shows that the paper's relation scheme is important for the performance


    Reason: both the pixel representation and the region representation are used to compute the relations!

    • the region representation is able to characterize the objects in the particular image → the relations for that image are more accurate than when only the pixel representations are used!

Ground-truth OCR:

: study the segmentation performance

using

  1. the ground-truth segmentation (to form the object regions)

  2. the ground-truth pixel-region relations (GT-OCR) (to justify our motivation)

  • Object region formation using ground-truth

    m_ki: confidence that pixel i belongs to the k-th object region

    m_ki = 1 if the ground-truth label l_i is k

    else: m_ki = 0

  • Pixel-region relation computation using ground-truth (a small sketch follows below)

    w_ik: pixel-region relation

    w_ik = 1 if the ground-truth label l_i is k

    else: w_ik = 0
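
A tiny sketch of what the ground-truth versions of m_ki and w_ik look like: both are the one-hot masks derived from the label map (ignore-label handling omitted; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

K, H, W = 19, 64, 128
gt = torch.randint(0, K, (H, W))                                  # toy ground-truth label map

one_hot = F.one_hot(gt, num_classes=K).permute(2, 0, 1).float()   # (K, H, W)
m = one_hot   # m_ki = 1 iff the ground-truth label of pixel i is k (region formation)
w = one_hot   # w_ik = 1 iff the ground-truth label of pixel i is k (pixel-region relation)
```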

4. Experiments: Semantic Segmentation

4.1 Datasets

Cityscapes: urban scene understanding

  • 30 classes & only 19 classes are used for evaluation

  • 5K high-quality, finely pixel-level annotated images

    → train: 2,975 / val: 500 / test: 1,525 images

  • 20K coarsely annotated images

ADE20K: used in the ImageNet scene parsing challenge 2016

  • 150 classes & diverse scenes with 1,038 image-level labels
  • → train: 20K / val: 2K / test: 3K images

LIP: used in the LIP challenge 2016 for the single human parsing task

  • 50K images with 20 classes (19 semantic human parts + 1 background)
  • → train: 30K / val: 10K / test: 10K images

PASCAL-Context: challenging scene parsing dataset

  • class: 59 semantic + 1 background
  • train: 4998/ test: 5105 images

COCO-Stuff: challenging scene parsing dataset

  • 171 semantic classes
  • train: 9K/ test: 1K

4.2 Implementation Details

Training settings # just notes on what they use

initialize the backbones

: the backbones use ImageNet pre-trained models; the OCR module → randomly initialized

perform the polynomial learning rate policy (a small helper is sketched below)

the weight on final loss ⇒ 1

the weight on the loss used to supervise the object region estimation ⇒ 0.4

InPlace-ABNsync → synchronizes the mean and standard deviation of BN across multiple GPUs // keeping BN in sync across GPUs is not a trivial thing to do.
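
The polynomial learning rate policy is typically implemented as below; the power of 0.9 is a common choice and an assumption on my part, not something stated in these notes:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Polynomial learning-rate decay: base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# e.g. the Cityscapes default: base lr 0.01 over 40K iterations
print(poly_lr(0.01, 20_000, 40_000))  # ~0.0054 at the halfway point
```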

Data augmentation

  • perform random horizontal flipping
  • random scaling in [0.5, 2]
  • random brightness jittering in [-10, 10]
  • perform the same training settings for the reproduced approaches (PPM, ASPP) → to ensure fairness (a rough sketch of these augmentations follows below)
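
A rough sketch of these augmentations (my own illustration; the exact pipeline, interpolation modes, and the cropping/padding to the crop size are not specified in the notes):

```python
import random
import torch
import torch.nn.functional as F

def augment(img, label):
    """img: (C, H, W) float tensor in [0, 255]; label: (H, W) long tensor."""
    if random.random() < 0.5:                        # random horizontal flip
        img, label = img.flip(-1), label.flip(-1)
    s = random.uniform(0.5, 2.0)                     # random scaling in [0.5, 2]
    img = F.interpolate(img[None], scale_factor=s, mode='bilinear',
                        align_corners=False)[0]
    label = F.interpolate(label[None, None].float(), scale_factor=s,
                          mode='nearest')[0, 0].long()
    img = (img + random.uniform(-10, 10)).clamp(0, 255)  # brightness jitter in [-10, 10]
    return img, label
```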

□ Cityscapes

Default: initial lr = 0.01/ weight decay = 0.0005/
crop size = 769x769/ batch size = 8

Experiments

  • val/test : training iterations = 40K/100K on train/train+val

  • augmented with extra data

    • coarse: first train the model on train+val for 100K iterations with initial learning rate = 0.01 → fine-tune the model on coarse for 50K iterations → fine-tune on train+val for 20K iterations with initial lr = 0.001

    • coarse + Mapillary

      • pre-train model on the Mapillary train

        • 500K iter, batch=16, ilr=0.01
      • fine-tune model on Cityscapes
        : train+val (100K iter) → coarse(50K iter) → train+val(20K iter)

        • ilr = 0.001, batch=8

□ ADE20K

  • (if not specified)
    ilr = 0.02/ weight decay = 0.0001/ crop size = 520x520/
    batch size = 16/ training iterations = 150K

□ LIP

  • (if not specified)
    ilr = 0.007/ weight decay = 0.0005/ crop size = 473x473/
    batch size=32/ training iterations = 100K

□ PASCAL-Context:

  • (if not specified)
    ilr=0.001/ weight decay=0.0001/ crop size=520x520/
    batch size=16/training iterations = 30K

□ COCO-Stuff:

  • (if not specified)
    ilr=0.001/ weight decay=0.0001/ crop size=520x520/
    batch size=16/ training iterations = 60K

4.3 Comparison with Existing Context Schemes

Experiments are performed with dilated ResNet-101 as the backbone

same training/testing settings → to ensure fairness

Multi-scale contexts

  • compare our OCR with the multi-scale context schemes (PPM, ASPP) on 3 benchmarks (Cityscapes test, ADE20K val, LIP val)
  • reproduced PPM/ASPP outperforms original
  • OCR outperforms both multi-scale context schemes

Relational contexts

  • compare our OCR with the relational context schemes (Self-Attention, Criss-Cross attention, DANet, Double Attention) on the 3 benchmarks
  • reproduced Double Attention: fine-tune # of regions (64)
  • OCR outperforms relational context schemes.

Complexity

  • much smaller complexity
  • efficiency comparison: OCR vs the multi-scale and relational context schemes
    • in terms of increased parameters, GPU memory, and computation complexity
  • the proposed OCR is superior!

⇒ considering the balance among performance, memory complexity, computation complexity, and runtime, the paper's OCR is a good choice!

4.4 Comparison with the State-of-the-Art

M: multi-scale/ R: relational context

(1) simple baseline

(2) advanced baseline

Cityscapes

final submission: "HRNet + OCR + SegFix"

  • directly applying PPM or ASPP → no performance improvement
  • our OCR → consistent performance improvements

ADE20K

  • our OCR: 45.28%, 45.66%

LIP

  • our OCR: 55.60%, 56.65%

PASCAL-Context

  • HRNet-W48 + OCR: 56.2%

COCO-Stuff

  • our: 39.5%(ResNet-101), 40.5%(HRNetV2-48)

5. Experiments: Panoptic Segmentation

: demonstrates generalization ability by applying OCR to the challenging panoptic segmentation task

  • panoptic segmentation = instance + semantic segmentation

Dataset

: COCO dataset / effectiveness

Training Details

: default training setup of the "COCO Panoptic Segmentation Baselines with Panoptic FPN" (3x learning schedule)

  • PQ performance improves with both ResNet-50 and ResNet-101

Results

: "Panoptic-FPN + OCR" is very competitive (compared with other recent methods).

6. Conclusion

object-contextual representations for semantic segmentation

Keys to success

  • the label of a pixel == the label of the object it lies in
  • the pixel representation is strengthened by characterizing each pixel with the corresponding object region representation

: shows consistent improvements on various benchmarks.
