
Paper Review

Object-Contextual Representations for Semantic Segmentation


Abstract

context aggregation problem in semantic segmentation

Assumption: the surrounding pixels should be considered as well

object-contextual representations → representation of the corresponding object class.

  1. learn object regions (supervision: ground-truth segmentation)
  2. compute the object region representation
    • compute the relation between each pixel and each object region
  • augment the representation of each pixel with the OCR

  • Transformer encoder-decoder framework: the OCR scheme can be rephrased in this framework

    • cross-attention module (decoder)

      1. object region learning (supervision: ground-truth segmentation)
      2. object region representation computation (aggregate the representations of the pixels in each object region)
      • output: object region representations
      • category queries: linear projection
    • cross-attention module (encoder)

      • compute the relation between each pixel and each object region
      • augment the representation of each pixel with the OCR (weighted aggregation of all object region representations)
      • keys, values = decoder output
      • queries = representation at each position

various benchmarks ⇒ competitive performance!

1. Introduction

Semantic segmentation: assign a class label to each pixel of an image

Contextual aggregation

  • previous studies: focus on the spatial scale of contexts
    • multi-scale context: ASPP, PPM
    • relational context: DANet, CFNet, OCNet (relations between a position and its contextual positions)

contextual representation scheme: relation between a position and its context

Goal

: augment the representation of each pixel by exploiting the representation of the object region of the corresponding class.

: when classifying a pixel, use both its representation and the object region it belongs to.


Ground-truth object regions) improve segmentation quality!

: assuming the soft regions do their job perfectly, i.e., when the ground truth is given, the model performs much better, which proves that the soft regions actually help!

Main steps

  1. divide the contextual pixels into a set of soft object regions (one per class)! → coarse soft segmentation
    • division: supervised by the GT segmentation
  2. aggregate the representations of the pixels → estimate the representation of each object region!
  3. augment the representation of each pixel with the object-contextual representation (OCR)!

The OCR is a weighted aggregation of all the object region representations.

  • weights: computed from the relations between the pixel and the object regions
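
In symbols (my shorthand for this aggregation; the exact transform functions are spelled out in Sec. 3.2): y_i = Σ_k w_ik · f_k, where f_k is the k-th object region representation and w_ik is the normalized relation between pixel p_i and region k.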

Differ from multi-scale context schemes

OCR

  • distinguishes contextual pixels of the same object class from contextual pixels of different object classes

    : i.e., it distinguishes car pixels from non-car pixels

multi-scale (ASPP, PPM)

  • only distinguishes pixels at different spatial positions


our OCR: the context is the set of pixels in the object (blue marks in the figure)

Differ from relational context schemes

OCR

  • structures the contextual pixels into object regions
  • uses the relations between the pixel and the object regions

relational

  • considers each contextual pixel individually
  • uses only the relations between the pixel and the contextual pixels
  • predicts the relations from the pixels alone, without considering regions

2. Related Work

Multi-scale context

  • PSPNet: pyramid pooling representations / captures multi-scale context (early work)
  • DeepLab series: adopts parallel dilated convolutions (with different dilation rates)
    • ASPP
  • recent works: DenseASPP, encoder-decoder structures

Relational context

  • DANet, CFNet, OCNet: augment each pixel representation

    consider the relations (similarities) between pixels based on a self-attention scheme

  • Double Attention, ACFNet

    group: the pixels → a set of regions

    augment the pixel representation ← aggregate the region representations while taking the contextual relations into account

  • Our approach

    a relational context approach, related to Double Attention and ACFNet

    Difference
    : region formation (learning) & pixel-region relation computation

    • this paper:

      supervised with the ground-truth segmentation

      relations: consider both the pixel and the region

    • previous works (except ACFNet):

      regions formed in an unsupervised way

      relations: consider only the pixels

Coarse-to-fine Segmentation

  • this paper:

    in some sense also a coarse-to-fine scheme → from the soft object regions (coarse) to progressively finer detail.

    BUT: the coarse segmentation map is used for generating a contextual representation.

Region-wise segmentation

  • Ours: the regions are not classified individually; they are used as additional information for better learning.

3. Approach

What is semantic segmentation?
The problem of assigning a label l_i to each pixel p_i of an image I (where l_i is one of K different classes)


3.1 Background

Multi-scale context

based on dilated convolutions / captures the context of multiple scales without losing resolution

  • ASPP (a rough sketch follows after this list)

    captures the multi-scale context info ← parallel dilated convolutions with different dilation rates

    output multi-scale contextual representation: the concatenation of the representation outputs of the parallel dilated convolutions

  • PSPNet - pyramid pooling module

    regular convolutions on representations of different scales

    captures the contexts of multiple scales
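
To make the ASPP idea concrete, here is a minimal PyTorch-style sketch (my own illustration, not the paper's code; channel sizes and dilation rates are placeholders):

```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Parallel dilated convolutions whose outputs are concatenated.

    The dilation rates (1, 12, 24, 36) and channel sizes are illustrative only.
    """
    def __init__(self, in_ch=2048, out_ch=256, rates=(1, 12, 24, 36)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * len(rates), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # run the parallel dilated convolutions and concatenate along channels
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```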

Relational context

computes the context for each pixel by considering the relations between the pixel and its contextual pixels

3.2 Formulation (meaning of each term)

(1) structure all the pixels in image I into K soft object regions

(2) represent each object region as f_k

(3) augment the representation of each pixel


  • take K region features.
  • transformation function: refines the features. 1x1 conv → BN → ReLU # same form, but different parameters each time it is used.
  • k: index of a soft object region

Soft object regions

take only the representations of the pixels corresponding to each class.

compute K object regions from an intermediate representation output of the backbone

During training: the object region generator is learned with the ground-truth segmentation as supervision & a cross-entropy loss (a rough sketch follows below)
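
A minimal sketch of how the soft object regions could be produced and supervised (hypothetical names, channel sizes, and ignore index; PyTorch assumed):

```python
import torch.nn as nn
import torch.nn.functional as F

num_classes = 19                                          # K object regions (e.g. Cityscapes)
region_head = nn.Conv2d(512, num_classes, kernel_size=1)  # the linear "object region generator"

def soft_object_regions(feats, gt=None):
    """feats: backbone features (B, 512, H, W); gt: label map (B, H, W) or None."""
    logits = region_head(feats)          # (B, K, H, W): coarse soft segmentation
    aux_loss = None
    if gt is not None:
        # supervision with a pixel-wise cross-entropy loss during training
        aux_loss = F.cross_entropy(logits, gt, ignore_index=255)
    return logits, aux_loss
```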

Object region representations

: aggregate the representations of all the pixels, weighted by the degree to which each pixel belongs to the k-th object region

  • the k-th object region representation:

    f_k = Σ_i m_ki · x_i   (m_ki: normalized degree to which pixel p_i belongs to the k-th object region)

  • x_i: representation of pixel p_i
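
In tensor form this aggregation is a single einsum; a toy sketch with illustrative shapes:

```python
import torch

B, K, C, H, W = 1, 19, 512, 64, 128
region_logits = torch.randn(B, K, H, W)   # soft object regions (coarse segmentation)
pixel_feats = torch.randn(B, C, H, W)     # pixel representations x_i

m = region_logits.flatten(2).softmax(dim=2)   # (B, K, N): normalize each region over the pixels
x = pixel_feats.flatten(2)                    # (B, C, N)
f = torch.einsum('bkn,bcn->bkc', m, x)        # (B, K, C): f_k = sum_i m_ki * x_i
```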

Object contextual representations

: based on the relation between each pixel and each object region

    w_ik = exp(κ(x_i, f_k)) / Σ_j exp(κ(x_i, f_j)),   y_i = ρ( Σ_k w_ik · δ(f_k) )
    (κ: unnormalized pixel-region relation; δ, ρ: transform functions)

Why compute it like a softmax?

  • ReLU outputs are unbounded, and if the weights have no fixed range they can blow up.
    → normalization is used to bring the range into 0~1
  • the exponential form widens the existing differences even further.

Augmented representations

: the final representation for pixel p_i

  • aggregation of 2 parts

    (1) the original representation x_i

    (2) the object-contextual representation y_i

    z_i = g([x_i ; y_i])   (concatenate x_i and y_i, then apply the transform g)

  • z_i: obtained with a transform function g that fuses (aggregates) x_i and y_i

  • 3.3 Segmentation Transformer: Rephrasing the OCR Method


    : the OCR pipeline rephrased as a Transformer encoder-decoder structure

    The 3 steps of the OCR pipeline

    1) soft object region extraction

    2) object region representation computation

    3) object-contextual representation computation for each position

    : mainly examines the cross-attention modules of the decoder and the encoder.


    Attention: computed as a scaled dot-product

    • attention weight a_ij: softmax normalization of the dot product between query q_i and key k_j
    • attention output for each query q_i = aggregation of the values weighted by the attention weights

    a_ij = softmax_j( q_i · k_j / √d ),   attention output_i = Σ_j a_ij · v_j

    Decoder cross-attention

    • 2 roles

      1) soft object region extraction ← K category queries

      2) object region representation computation

    Encoder cross-attention

    : aggregating the object region representations (this carries out the computation of y_i; a minimal sketch follows below)

    • queries: image features at each position
    • keys, values: decoder outputs
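
A minimal sketch of this cross-attention (single head, no learned projections; the dimensions are toy values, not the paper's):

```python
import torch

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: (N, d) image features at each position
    keys:    (K, d) decoder outputs (object region representations)
    values:  (K, d)
    """
    d = queries.size(-1)
    scores = queries @ keys.t() / d ** 0.5   # (N, K): pixel-region relations
    weights = scores.softmax(dim=-1)         # normalize over the K regions
    return weights @ values                  # (N, d): weighted aggregation of region reprs

# usage with toy sizes: 4096 positions, 19 region tokens, 256-dim features
out = cross_attention(torch.randn(4096, 256), torch.randn(19, 256), torch.randn(19, 256))
```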

    Connection to class embedding and class attention.

    category queries are close to the class embedding

    • an embedding for each class # learns an embedding for all the classes
    • the encoder and decoder architecture is close to self-attention over both the class embedding and the image features.

    Connection to OCNet and interlaced self-attention

    OCNet: uses self-attention

    • the self-attention unit is accelerated by the interlaced self-attention unit
    • local self-attention + global self-attention
    • the category queries are replaced by regularly sampled or adaptively pooled image features (not learned as model parameters)

3.3 Architecture

Backbone:

  • dilated ResNet-101 (with output stride 8)
    • two representations are used as input:
      • one for predicting the coarse segmentation (object regions)
      • the other passes through a 3x3 convolution (512 output channels) and then enters the OCR module
  • HRNet-W48 (with output stride 4)
    • only the final representation is used as input!

OCR module

  • the above formulation → implemented as the OCR module (a rough sketch follows after this list)

  • a linear function

    : predicts the coarse segmentation (the soft object regions)

    loss: pixel-wise cross-entropy

  • all the transform functions: 1x1 conv → BN → ReLU

  • the first three output 256 channels, the last two output 512 channels

  • predict the final segmentation from the final representation using a linear function

  • apply a pixel-wise cross-entropy loss on the final segmentation prediction
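
Putting the pieces together, a rough PyTorch sketch of an OCR-style module that follows the channel sizes noted above (my own reconstruction, not the authors' implementation; the coarse and final cross-entropy losses would be applied to the two returned outputs):

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # transform function: 1x1 conv -> BN -> ReLU
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class OCRModuleSketch(nn.Module):
    """Soft regions -> region representations -> pixel-region relations -> augmented representation."""

    def __init__(self, in_ch=512, key_ch=256, out_ch=512, num_classes=19):
        super().__init__()
        self.region_head = nn.Conv2d(in_ch, num_classes, 1)  # coarse segmentation (soft object regions)
        self.phi = conv_bn_relu(in_ch, key_ch)                # pixel transform (query)
        self.psi = conv_bn_relu(in_ch, key_ch)                # region transform (key)
        self.delta = conv_bn_relu(in_ch, key_ch)              # region transform (value)
        self.rho = conv_bn_relu(key_ch, out_ch)               # context transform
        self.g = conv_bn_relu(in_ch + out_ch, out_ch)         # fuse pixel + context
        self.cls_head = nn.Conv2d(out_ch, num_classes, 1)     # final segmentation

    def forward(self, x):
        b, c, h, w = x.shape
        coarse = self.region_head(x)                               # (B, K, H, W)
        m = coarse.flatten(2).softmax(dim=2)                       # each region normalized over pixels
        f = torch.einsum('bkn,bcn->bck', m, x.flatten(2))          # (B, C, K) region representations
        f = f.unsqueeze(-1)                                        # treat the K regions as a (K, 1) map
        q = self.phi(x).flatten(2)                                 # (B, key_ch, N)
        k = self.psi(f).squeeze(-1)                                # (B, key_ch, K)
        v = self.delta(f).squeeze(-1)                              # (B, key_ch, K)
        rel = torch.einsum('bdn,bdk->bnk', q, k).softmax(dim=-1)   # pixel-region relations w_ik
        y = torch.einsum('bnk,bdk->bdn', rel, v).view(b, -1, h, w)
        y = self.rho(y)                                            # object-contextual representation
        z = self.g(torch.cat([x, y], dim=1))                       # augmented representation z_i
        return self.cls_head(z), coarse

# usage: final_logits, coarse_logits = OCRModuleSketch()(torch.randn(2, 512, 64, 64))
```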

3.4 Empirical Analysis

Object region supervision:

We can see that the supervision for forming the object regions is crucial for the performance.

Pixel-region relations:

  • the effect of the object region supervision and of the pixel-region relation estimation scheme

    with supervision / shows that the paper's relation scheme is important for the performance


    Reason: both the pixel representation and the region representation are used to compute the relations!

    • the region representation is able to characterize the objects in the particular image → the relations for that image are more accurate than when only the pixel representations are used!

Ground-truth OCR:

: study the segmentation performance

using

  1. the ground-truth segmentation (to form the object regions)

  2. the ground-truth pixel-region relations (GT-OCR) (to justify our motivation)

  • Object region formation using ground-truth

    m_ki: confidence that pixel i belongs to the k-th object region

    m_ki = 1 if the ground-truth label l_i is k

    else: m_ki = 0

  • Pixel-region relation computation using ground-truth (a small sketch follows below)

    w_ik: pixel-region relation

    w_ik = 1 if the ground-truth label l_i is k

    else: w_ik = 0
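
A tiny sketch of what the ground-truth versions of m_ki and w_ik look like: both are the one-hot masks derived from the label map (ignore-label handling omitted; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

K, H, W = 19, 64, 128
gt = torch.randint(0, K, (H, W))                                  # toy ground-truth label map

one_hot = F.one_hot(gt, num_classes=K).permute(2, 0, 1).float()   # (K, H, W)
m = one_hot   # m_ki = 1 iff the ground-truth label of pixel i is k (region formation)
w = one_hot   # w_ik = 1 iff the ground-truth label of pixel i is k (pixel-region relation)
```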

4. Experiments: Semantic Segmentation

4.1 Datasets

Cityscapes: urban scene understanding

  • 30 classes & only 19 classes are used for evaluation

  • 5K high-quality, finely pixel-level annotated images

    → train: 2,975 / val: 500 / test: 1,525 images

  • 20K coarsely annotated images

ADE20K: used in the ImageNet scene parsing challenge 2016

  • 150 classes & diverse scenes with 1,038 image-level labels
  • → train: 20K / val: 2K / test: 3K images

LIP: used in the LIP challenge 2016 for the single human parsing task

  • 50K images with 20 classes (19 semantic human parts + 1 background)
  • → train: 30K / val: 10K / test: 10K images

PASCAL-Context: challenging scene parsing dataset

  • class: 59 semantic + 1 background
  • train: 4998/ test: 5105 images

COCO-Stuff: challenging scene parsing dataset

  • 171 semantic classes
  • train: 9K/ test: 1K

4.2 Implementation Details

Training settings # just notes on what they use

initialize the backbones

: the backbones use ImageNet pre-trained models; the OCR module → randomly initialized

perform the polynomial learning rate policy (a small helper is sketched below)

the weight on final loss ⇒ 1

the weight on the loss used to supervise the object region estimation ⇒ 0.4

InPlace-ABNsync → synchronizes the mean and standard deviation of BN across multiple GPUs // keeping BN in sync across GPUs is not a trivial thing to do.
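
The polynomial learning rate policy is typically implemented as below; the power of 0.9 is a common choice and an assumption on my part, not something stated in these notes:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Polynomial learning-rate decay: base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# e.g. the Cityscapes default: base lr 0.01 over 40K iterations
print(poly_lr(0.01, 20_000, 40_000))  # ~0.0054 at the halfway point
```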

Data augmentation

  • perform random horizontal flipping
  • random scaling in [0.5, 2]
  • random brightness jittering in [-10, 10]
  • perform the same training settings for the reproduced approaches (PPM, ASPP) → to ensure fairness (a rough sketch of these augmentations follows below)
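
A rough sketch of these augmentations (my own illustration; the exact pipeline, interpolation modes, and the cropping/padding to the crop size are not specified in the notes):

```python
import random
import torch
import torch.nn.functional as F

def augment(img, label):
    """img: (C, H, W) float tensor in [0, 255]; label: (H, W) long tensor."""
    if random.random() < 0.5:                        # random horizontal flip
        img, label = img.flip(-1), label.flip(-1)
    s = random.uniform(0.5, 2.0)                     # random scaling in [0.5, 2]
    img = F.interpolate(img[None], scale_factor=s, mode='bilinear',
                        align_corners=False)[0]
    label = F.interpolate(label[None, None].float(), scale_factor=s,
                          mode='nearest')[0, 0].long()
    img = (img + random.uniform(-10, 10)).clamp(0, 255)  # brightness jitter in [-10, 10]
    return img, label
```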

□ Cityscapes

Default: initial lr = 0.01/ weight decay = 0.0005/
crop size = 769x769/ batch size = 8

Experiments

  • val/test : training iterations = 40K/100K on train/train+val

  • augmented with extra data

    • coarse: first train the model on train+val for 100K iterations with initial learning rate = 0.01 → fine-tune the model on coarse for 50K iterations → fine-tune on train+val for 20K iterations with initial lr = 0.001

    • coarse + Mapillary

      • pre-train model on the Mapillary train

        • 500K iter, batch=16, ilr=0.01
      • fine-tune model on Cityscapes
        : train+val (100K iter) → coarse(50K iter) → train+val(20K iter)

        • ilr = 0.001, batch=8

□ ADE20K

  • (if not specified)
    ilr = 0.02/ weight decay = 0.0001/ crop size = 520x520/
    batch size = 16/ training iterations = 150K

□ LIP

  • (if not specified)
    ilr = 0.007/ weight decay = 0.0005/ crop size = 473x473/
    batch size=32/ training iterations = 100K

□ PASCAL-Context:

  • (if not specified)
    ilr=0.001/ weight decay=0.0001/ crop size=520x520/
    batch size=16/training iterations = 30K

□ COCO-Stuff:

  • (if not specified)
    ilr=0.001/ weight decay=0.0001/ crop size=520x520/
    batch size=16/ training iterations = 60K

4.3 Comparison with Existing Context Schemes

Experiments are performed with dilated ResNet-101 as the backbone

same training/testing settings → to ensure fairness

Multi-scale contexts

  • compare our OCR with the multi-scale context schemes (PPM, ASPP) on 3 benchmarks (Cityscapes test, ADE20K val, LIP val)
  • reproduced PPM/ASPP outperforms original
  • OCR outperforms both multi-scale context schemes

Relational contexts

  • compare our OCR with the relational context schemes (Self-Attention, Criss-Cross attention, DANet, Double Attention) on the 3 benchmarks
  • reproduced Double Attention: fine-tune # of regions (64)
  • OCR outperforms relational context schemes.

Complexity

  • much smaller complexity
  • efficiency comparison: OCR vs the multi-scale and relational context schemes
    • in terms of increased parameters, GPU memory, and computation complexity
  • the proposed OCR is superior!

⇒ considering the balance among performance, memory complexity, computation complexity, and runtime, the paper's OCR is a good choice!

4.4 Comparison with the State-of-the-Art

M: multi-scale/ R: relational context

(1) simple baseline

(2) advanced baseline

Cityscapes

final submission: "HRNet + OCR + SegFix"

  • directly applying PPM or ASPP → no performance improvement
  • our OCR → consistent performance improvements

ADE20K

  • our OCR: 45.28%, 45.66%

LIP

  • our OCR: 55.60%, 56.65%

PASCAL-Context

  • HRNet-W48 + OCR: 56.2%

COCO-Stuff

  • our: 39.5%(ResNet-101), 40.5%(HRNetV2-48)

5. Experiments: Panoptic Segmentation

: demonstrates generalization ability by applying OCR to the challenging panoptic segmentation task

  • panoptic segmentation = instance + semantic segmentation

Dataset

: COCO dataset / effectiveness

Training Details

: default training setup of the "COCO Panoptic Segmentation Baselines with Panoptic FPN" (3x learning schedule)

  • PQ performance improves with both ResNet-50 and ResNet-101

Results

: "Panoptic-FPN + OCR" is very competitive (compared with other recent methods).

6. Conclusion

object-contextual representations for semantic segmentation

Keys to success

  • the label of a pixel == the label of the object it lies in
  • the pixel representation is strengthened by characterizing each pixel with the corresponding object region representation

: shows consistent improvements on various benchmarks.
