Object-Contextual Representations for Semantic Segmentation
Abstract
context aggregation problem in semantic segmentation
Assumption: a pixel's label is informed by its surrounding pixels (its context)
Object-contextual representations → representation of the corresponding object class
- learn object regions (supervised by the ground-truth segmentation)
- compute the object region representation
- compute the relation between each pixel and each object region
- augment the representation of each pixel with the OCR
// Transformer encoder-decoder framework: rephrases the OCR scheme
Cross-attention module (decoder)
- object region learning (supervised by the ground-truth segmentation)
- computes the object region representations (aggregates the representations of the pixels in each object region)
- output: object region representations
- category queries: obtained via a linear projection
Cross-attention module (encoder)
- computes the relation between each pixel and each object region
  - keys, values = decoder output
  - queries = representation at each position
- augments the representation of each pixel with the OCR (weighted aggregation of all object region representations)
various benchmarks ⇒ competitive performance!
1. Introduction
Semantic segmentation: assign a class label to each pixel of an image I
Contextual aggregation
- previous work: focuses on the spatial scale of contexts
  - multi-scale context: ASPP, PPM
  - relational context: DANet, CFNet, OCNet (relations between a position and its contextual positions)
- contextual representation scheme: the relation between a position and its context
Goal
: augment the representation of each pixel by exploiting the representation of the object region of the corresponding class
: when classifying the pixels of an object region, use both the pixel representation and the object region representation
With ground-truth object regions, segmentation quality improves!
: assuming the soft regions play their role perfectly (i.e., the ground truth is given), the model performs noticeably better, which shows that the soft regions really do help
Main steps
- divide the contextual pixels into a set of soft object regions (one per class) → a coarse soft segmentation
  - the division is learned under supervision from the ground-truth segmentation
- aggregate the representations of the pixels in each region → estimate each object region's representation
- augment each pixel's representation with the object-contextual representation (OCR)
  - the OCR is a weighted aggregation of all object region representations
  - weights: computed from the relations between pixels and object regions
Differences from multi-scale context schemes
OCR
- distinguishes contextual pixels of the same object class from contextual pixels of different object classes
  : e.g., it separates car pixels from non-car pixels
Multi-scale (ASPP, PPM)
- only distinguishes pixels at different spatial positions
Our OCR: the context is the set of pixels in the object (marked blue in the paper's figure)
Differences from relational context schemes
OCR
- organizes the contextual pixels into object regions
- uses the relations between pixels and object regions
Relational context
- treats each contextual pixel individually
- only uses the relations between a pixel and its contextual pixels
- predicts the relations from pixels alone, without considering regions
2. Related Work
Multi-scale context
- PSPNet: pyramid pooling representations / captures multi-scale context (early work, less central here)
- DeepLab series: adopts parallel dilated convolutions with different dilation rates (widely used)
  - ASPP
- recent works: DenseASPP, encoder-decoder structures
Relational context
- DANet, CFNet, OCNet: augment each pixel's representation
  - consider the relations (similarities) between pixels, based on a self-attention scheme
- Double Attention, ACFNet
  - group the pixels into a set of regions
  - augment the pixel representations by aggregating the region representations according to the context relations
Our approach
- a relational context approach, related to Double Attention and ACFNet
Differences
- region formation (learning) & pixel-region relation computation
  - this paper: supervised with the ground-truth segmentation; relations consider both the pixel and the region
  - previous work (except ACFNet): regions formed unsupervisedly; relations consider the pixel only
Coarse-to-fine segmentation
- this paper is, in some sense, a coarse-to-fine scheme: it goes from soft object regions (coarse) to progressively finer predictions
- BUT: it uses the coarse segmentation map to generate a contextual representation, not as the final prediction
Region-wise segmentation
- ours: the regions are not classified directly; they serve as auxiliary information for learning better representations
3. Approach
What is semantic segmentation?
- the problem of assigning a label li to each pixel pi of an image I (li is one of K different classes)
3.1 Background
Multi-scale context
- based on dilated convolutions / captures the context of multiple scales without losing resolution
ASPP
- captures multi-scale context info ← parallel dilated convolutions with different dilation rates (see the sketch after this block)
- output multi-scale contextual representation: concatenation of the representation outputs of the parallel dilated convolutions
PSPNet - pyramid pooling module
- regular convolutions on representations of different scales
- captures the contexts of multiple scales
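A minimal ASPP-style sketch of the parallel dilated convolutions described above (assuming PyTorch; the channel counts and dilation rates here are illustrative, not the exact values of any particular model):

```python
import torch
import torch.nn as nn

class ASPPLite(nn.Module):
    """Minimal ASPP-style block: parallel dilated 3x3 convs with different
    dilation rates, outputs concatenated into one multi-scale representation."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(1, 12, 24, 36)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])

    def forward(self, x):
        # each branch sees a different receptive field; concatenation mixes the scales
        return torch.cat([b(x) for b in self.branches], dim=1)
```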
Relational context
- computes the context for each pixel by considering the relations between that pixel and the contextual pixels
3.2 Formulation (what each term means)
(1) structure all the pixels in image I into K soft object regions
(2) represent each object region as fk
(3) augment the representation of each pixel
- i.e., bring in the K region features
- transformation function: refines the features, 1x1 conv → BN → ReLU # all transforms share this shape but have different parameters (see the sketch below)
- k: the index of a soft object region
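A minimal sketch of that shared transform shape (assuming PyTorch; the function name is my own):

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    """1x1 conv -> BN -> ReLU: the shape shared by all transform functions
    (each instance keeps its own parameters)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```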
Soft object regions
- only gather the representations of the pixels that belong to each class
- compute the K object regions from an intermediate representation output by the backbone
- during training: the object region generator is learned with the ground-truth segmentation as supervision, using a cross-entropy loss
Object region representations
- aggregate the representations of all pixels, weighted by the degree to which each pixel belongs to the k-th object region (written out below)
- fk: the k-th object region representation
- xi: the representation of pixel pi
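Written out (following the paper's formulation; m̃ki is the spatial-softmax-normalized degree to which pixel pi belongs to the k-th soft object region):

```latex
\tilde{m}_{ki} = \frac{e^{m_{ki}}}{\sum_{i'} e^{m_{ki'}}},
\qquad
f_k = \sum_{i} \tilde{m}_{ki}\, x_i
```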
Object-contextual representations
- based on the relation between each pixel and each object region
Why normalize like a softmax?
- raw relation scores (e.g., after a ReLU) are unbounded; if the weights have no fixed range the aggregation can blow up
- → normalize the weights into the 0-1 range; the exponential form also widens the gaps between the original scores (see the formula below)
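The relation weights and the resulting OCR, written out (φ, ψ, δ, ρ denote the 1x1 conv → BN → ReLU transform functions from the formulation):

```latex
w_{ik} = \frac{e^{\kappa(x_i,\, f_k)}}{\sum_{j=1}^{K} e^{\kappa(x_i,\, f_j)}},
\quad
\kappa(x, f) = \phi(x)^{\top}\psi(f),
\qquad
y_i = \rho\!\Big(\sum_{k=1}^{K} w_{ik}\, \delta(f_k)\Big)
```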
Augmented representations
- the final representation for pixel pi, aggregated from 2 parts:
(1) the original representation xi
(2) the object-contextual representation yi
- zi: a transform function fuses (aggregates) xi and yi
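Putting the formulation together, a minimal PyTorch-style sketch of the OCR computation (module and variable names are mine; channel sizes are illustrative, and the transforms are plain projections rather than the full conv-BN-ReLU blocks):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCRSketch(nn.Module):
    """Minimal sketch of the OCR formulation: soft regions -> region
    representations f_k -> pixel-region weights w_ik -> contextual y_i -> z_i."""
    def __init__(self, in_ch=512, key_ch=256):
        super().__init__()
        self.phi = nn.Conv2d(in_ch, key_ch, 1)   # transform for pixel representations
        self.psi = nn.Linear(in_ch, key_ch)      # transform for region representations
        self.delta = nn.Linear(in_ch, in_ch)
        self.rho = nn.Conv2d(in_ch, in_ch, 1)
        self.g = nn.Conv2d(2 * in_ch, in_ch, 1)  # fuses x_i and y_i into z_i

    def forward(self, x, region_logits):
        # x: (B, C, H, W) pixel representations
        # region_logits: (B, K, H, W) coarse soft object region scores
        B, C, H, W = x.shape
        pixels = x.flatten(2)                                    # (B, C, HW)
        # spatial softmax -> soft membership of each pixel in each region
        m = F.softmax(region_logits.flatten(2), dim=-1)          # (B, K, HW)
        f = torch.bmm(m, pixels.transpose(1, 2))                 # (B, K, C) region reprs f_k
        # pixel-region relation w_ik: dot product, softmax over the K regions
        q = self.phi(x).flatten(2).transpose(1, 2)               # (B, HW, key_ch)
        k = self.psi(f)                                          # (B, K, key_ch)
        w = F.softmax(torch.bmm(q, k.transpose(1, 2)), dim=-1)   # (B, HW, K)
        # object-contextual representation y_i: weighted sum of delta(f_k)
        y = torch.bmm(w, self.delta(f))                          # (B, HW, C)
        y = self.rho(y.transpose(1, 2).reshape(B, C, H, W))      # (B, C, H, W)
        # augmented representation z_i = g([x_i, y_i])
        return self.g(torch.cat([x, y], dim=1))
```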
3.3 Segmentation Transformer: Rephrasing the OCR Method
: rephrases the OCR pipeline as a Transformer encoder-decoder architecture
The OCR pipeline has 3 steps
1) soft object region extraction
2) object region representation computation
3) object-contextual representation computation for each position
: the rephrasing mainly concerns the cross-attention modules of the decoder and the encoder
Attention: computed with scaled dot products
- attention weight aij: softmax normalization of the dot product between query qi and key kj
- attention output for each query qi = aggregation of the values weighted by the attention weights (see the formula below)
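In formula form (d is the key dimension):

```latex
a_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i^{\top} k_j}{\sqrt{d}}\right),
\qquad
\mathrm{out}_i = \sum_{j} a_{ij}\, v_j
```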
Decoder cross-attention
- 2 roles
  1) soft object region extraction ← K category queries
  2) object region representation computation
Encoder cross-attention
- aggregates the object region representations (this is what the yi formula computes)
- queries: image features at each position
- keys, values: decoder outputs
(a sketch of both modules follows)
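A minimal sketch of this transformer view of OCR (assuming PyTorch's nn.MultiheadAttention; treating the category queries as learned parameters is a simplification of the linear-projection description above):

```python
import torch
import torch.nn as nn

class SegTransformerSketch(nn.Module):
    """Minimal sketch of the OCR-as-transformer view: decoder cross-attention
    (category queries attend over pixels) then encoder cross-attention
    (pixels attend over the K region representations)."""
    def __init__(self, dim=512, num_classes=19, heads=1):
        super().__init__()
        # K category queries, kept as learned parameters here for simplicity
        self.category_queries = nn.Parameter(torch.randn(num_classes, dim))
        self.decoder_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.encoder_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):
        # feats: (B, N, dim) image features, one vector per position
        B = feats.shape[0]
        queries = self.category_queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, dim)
        # decoder: soft object region extraction + region representation computation
        regions, _ = self.decoder_attn(queries, feats, feats)           # (B, K, dim)
        # encoder: each position aggregates the region representations
        context, _ = self.encoder_attn(feats, regions, regions)         # (B, N, dim)
        return context
```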
Connection to class embedding and class attention
- the category queries are close to the class embeddings
  - one embedding is learned for each class (i.e., embeddings for all the classes)
- the encoder-decoder architecture is close to self-attention over both the class embeddings and the image features
Connection to OCNet and interlaced self-attention
- OCNet: uses self-attention
  - the self-attention unit is accelerated by the interlaced self-attention unit
  - local self-attention + global self-attention
- the category queries are replaced by regularly sampled or adaptively pooled image features (not learned as model parameters)
3.4 Architecture
Backbone
- dilated ResNet-101 (output stride 8)
  - 2 representation inputs
    - object regions: used to predict the coarse segmentation
    - pixel representations: passed through a 3x3 convolution (512 output channels), then fed into the OCR module
- HRNet-W48 (output stride 4)
  - only the final representation is used as input
OCR module
- the formulation above is implemented as the OCR module
- a linear function predicts the coarse segmentation (soft object regions)
  - loss: pixel-wise cross-entropy
- all transform functions: 1x1 conv → BN → ReLU
  - the first three output 256 channels, the last two output 512 channels
- the final segmentation is predicted from the final representation using a linear function
- a pixel-wise cross-entropy loss is applied to the final segmentation prediction
(a sketch of the head wiring follows)
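A minimal sketch of how the head could be wired, reusing the OCRSketch module from the earlier sketch. The 3x3-conv/512-channel and 1x1-classifier details follow the notes above; feeding both heads from the same backbone features is a simplification (the paper takes the coarse-segmentation input from an intermediate backbone representation):

```python
import torch.nn as nn

class OCRHeadSketch(nn.Module):
    """Minimal sketch of the head on top of dilated ResNet-101 features:
    an auxiliary 1x1 classifier predicts the coarse soft regions, a 3x3 conv
    (512 channels) produces the pixel representations for the OCR module, and
    a 1x1 classifier predicts the final segmentation. Both outputs receive a
    pixel-wise cross-entropy loss."""
    def __init__(self, in_ch=2048, mid_ch=512, num_classes=19):
        super().__init__()
        self.aux_head = nn.Conv2d(in_ch, num_classes, 1)   # coarse segmentation (soft object regions)
        self.pixel_head = nn.Sequential(                    # 3x3 conv, 512 output channels
            nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        self.ocr = OCRSketch(in_ch=mid_ch)                  # from the sketch in Section 3.2
        self.cls_head = nn.Conv2d(mid_ch, num_classes, 1)   # final linear classifier

    def forward(self, backbone_feats):
        region_logits = self.aux_head(backbone_feats)        # auxiliary output, supervised by GT
        x = self.pixel_head(backbone_feats)
        z = self.ocr(x, region_logits)
        return self.cls_head(z), region_logits
```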
3.5 Empirical Analysis
Object region supervision
- the supervision for forming the object regions is crucial for the performance
Pixel-region relations
- studies the effect of the object region supervision and of the pixel-region relation estimation scheme
- with supervision, the paper's relation scheme is clearly important for performance
- reason: the relations are computed using both the pixel representation and the region representation
  - the region representation is able to characterize the objects in the particular image → the relations are more accurate for that image than when only pixel representations are used
Ground-truth OCR
- studies the segmentation performance when using
  - the ground-truth segmentation to form the object regions
  - the ground-truth pixel-region relations (GT-OCR), to justify the motivation
Object region formation using ground truth
- mki: the confidence that pixel i belongs to the k-th object region
- if the ground-truth label li = k, then mki = 1; else mki = 0
Pixel-region relation computation using ground truth
- wik: the pixel-region relation
- if the ground-truth label li = k, then wik = 1; else wik = 0
(a tiny sketch of these ground-truth assignments follows)
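A tiny sketch of these ground-truth assignments (assuming PyTorch and that the label map contains only valid class indices, i.e., no ignore label):

```python
import torch
import torch.nn.functional as F

def gt_ocr_assignments(labels, num_classes):
    """Build the ground-truth versions of m_ki and w_ik: one-hot over classes,
    i.e. 1 where pixel i carries label k and 0 elsewhere."""
    # labels: (H, W) ground-truth class indices, all in [0, num_classes)
    one_hot = F.one_hot(labels.long(), num_classes)       # (H, W, K)
    m = one_hot.permute(2, 0, 1).float()                   # m_ki: (K, H, W)
    w = one_hot.reshape(-1, num_classes).float()           # w_ik: (H*W, K)
    return m, w
```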
4. Experiments: Semantic Segmentation
4.1 Datasets
Cityscapes: urban scene understanding
- 30 classes; only 19 classes are used in the evaluation
- 5K high-resolution, finely pixel-level annotated images
  - → train: 2,975 / val: 500 / test: 1,525 images
- plus 20K coarsely annotated images
ADE20K: used in the ImageNet scene parsing challenge 2016
- 150 classes & diverse scenes with 1,038 image-level labels
- → train: 20K / val: 2K / test: 3K images
LIP: used in the LIP challenge 2016 for the single human parsing task
- 50K images with 20 classes (19 semantic human parts + 1 background)
- → train: 30K / val: 10K / test: 10K images
PASCAL-Context: challenging scene parsing dataset
- classes: 59 semantic + 1 background
- train: 4,998 / test: 5,105 images
COCO-Stuff: challenging scene parsing dataset
- 171 semantic classes
- train: 9K / test: 1K images
4.2 Implementation Details
Training setting # just the standard recipe
- initialize the backbones with ImageNet pre-trained models; the OCR module is randomly initialized
- polynomial learning rate policy (see the sketch below)
- the weight on the final loss ⇒ 1
- the weight on the loss used to supervise the object region estimation ⇒ 0.4
- InPlace-ABNsync → synchronizes the BN mean and standard deviation across multiple GPUs // keeping GPUs in sync is non-trivial
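The polynomial learning rate policy mentioned above, as a small sketch (the power 0.9 is the value commonly used in such segmentation setups and is an assumption here):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Polynomial learning-rate policy: decays from base_lr toward 0 over training."""
    return base_lr * (1 - cur_iter / max_iter) ** power
```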
Data augmentation
- random horizontal flipping
- random scaling in [0.5, 2]
- random brightness jittering in [-10, 10]
- the same training settings are used for the reproduced approaches (PPM, ASPP) → ensures fairness
□ Cityscapes
Default: initial lr = 0.01/ weight decay = 0.0005/
crop size = 769x769/ batch size = 8
Experiments
- val/test: training iterations = 40K/100K on train/train+val, respectively
- augmented with extra data
  - coarse: first train the model on train+val for 100K iterations with initial lr = 0.01 → fine-tune the model on coarse for 50K iterations → fine-tune on train+val for 20K iterations with initial lr = 0.001
  - coarse + Mapillary
    - pre-train the model on the Mapillary train set
      - 500K iterations, batch size = 16, initial lr = 0.01
    - fine-tune the model on Cityscapes
      - train+val (100K iterations) → coarse (50K iterations) → train+val (20K iterations), initial lr = 0.001, batch size = 8
□ ADE20K
- (if not specified)
ilr = 0.02/ weight decay = 0.0001/ crop size = 520x520/
batch size = 16/ training iterations = 150K
□ LIP
- (if not specified)
ilr = 0.007/ weight decay = 0.0005/ crop size = 473x473/
batch size=32/ training iterations = 100K
□ PASCAL-Context:
- (if not specified)
ilr=0.001/ weight decay=0.0001/ crop size=520x520/
batch size=16/training iterations = 30K
□ COCO-Stuff:
- (if not specified)
ilr=0.001/ weight decay=0.0001/ crop size=520x520/
batch size=16/ training iterations = 60K
4.3 Comparison with Existing Context Schemes
Experiments use dilated ResNet-101 as the backbone
- same training/testing settings → to ensure fairness
Multi-scale contexts
- compares our OCR with multi-scale context schemes (PPM, ASPP) on 3 benchmarks (Cityscapes test, ADE20K val, LIP val)
- the reproduced PPM/ASPP outperform the original numbers
- OCR outperforms both multi-scale context schemes
Relational contexts
- compares our OCR with relational context schemes (Self-Attention, Criss-Cross attention, DANet, Double Attention) on the 3 benchmarks
- reproduced Double Attention: the number of regions is fine-tuned (64)
- OCR outperforms the relational context schemes
Complexity
- much smaller complexity
- efficiency comparison: OCR vs. the multi-scale and relational context schemes
  - in terms of increased parameters, GPU memory, and computation complexity
- the proposed OCR comes out ahead!
⇒ considering the balance of performance, memory complexity, computation complexity, and runtime, the paper's OCR is a good choice!
4.4 Comparison with the State-of-the-Art
M: multi-scale/ R: relational context
(1) simple baseline
(2) advanced baseline
Cityscapes
final submission "HRNet + OCR + SegFix"
- directly applying PPM or ASPP → does not improve performance
- our OCR → consistently improves performance
ADE20K
- our OCR: 45.28%, 45.66%
LIP
- our OCR: 55.60%, 56.65%
PASCAL-Context
- HRNet-W48 + OCR: 56.2%
COCO-Stuff
- ours: 39.5% (ResNet-101), 40.5% (HRNetV2-W48)
5. Experiments: Panoptic Segmentation
: applies OCR to the harder panoptic segmentation task to demonstrate its generalization ability
- panoptic segmentation = instance + semantic segmentation
Dataset
: COCO dataset, used to verify effectiveness
Training details
: default training setup of the "COCO Panoptic Segmentation Baselines with Panoptic FPN (3x learning schedule)"
- PQ performance improves with both ResNet-50 and ResNet-101
Results
: "Panoptic-FPN + OCR" is highly competitive compared with other recent methods
6. Conclusion
Object-contextual representations for semantic segmentation
Keys to success
- the label of a pixel == the label of the object the pixel lies in
- the pixel representation is strengthened by characterizing each pixel with the corresponding object region representation
: shows consistent improvements across various benchmarks