YOLOv4: Optimal Speed and Accuracy of Object Detection

2021. 10. 3. 17:41·📓 Papers

Abstract

a huge number of features are said to improve CNN accuracy

  • YOLO v4 - use new features and achieve state-of-the-art results.
    1) WRC (Weighted-Residual-Connections)
    2) CSP (Cross-Stage-Partial-Connections)
    3) CmBN (Cross mini-Batch Normalization)
    4) SAT (Self-Adversarial-Training)
    5) Mish Activation
    6) Mosaic Data Augmentation
    7) DropBlock Regularization
    8) CIoU Loss

⇒ achieve: 43.5% AP on the MS COCO dataset at ~65 FPS (real time)

 

1. Introduction

Problem

The most accurate modern neural networks

  • do not operate in real time
  • require a large number of GPUs for training with a large mini-batch size
  • Figure 1. YOLO v4 vs. state-of-the-art object detectors
    • YOLO v4 vs. EfficientDet: comparable performance, 2x faster
    • improves YOLO v3's AP by 10% and FPS by 12%

 

Contributions

  1. efficient and powerful object detection model
  2. verify the influence of BoF and BoS methods during training
  3. modify state-of-the-art methods to make them more efficient and suitable for single GPU training

 

2. Related Work

2.1 Object detection models

  • backbone: pre-trained on ImageNet
  • neck: collect feature maps from different stages
  • head: class + bounding box prediction
    • Dense Prediction(one-stage)
    • Sparse Prediction(two-stage)
      : class prediction and bounding box regression are separated

2.2 Bag of freebies(BOF)

: methods that only change training strategy or only increase training cost

⇒ better accuracy without increasing inference cost

  1. Data augmentation
    1. photometric or geometric distortions
    2. CutOut
    3. CutMix
  2. Regularization
    1. DropOut
    2. DropPath
    3. Spatial DropOut
    4. DropBlock
  3. Objective function of BBox Regression: Loss function
    1. MSE
    2. IoU
      • GIoU, CIoU, DIoU
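
To make the IoU-family losses above concrete, here is a minimal sketch of the CIoU loss in plain Python. It assumes boxes in `(x1, y1, x2, y2)` corner format with positive width and height. The function name `ciou_loss` and the `eps` guard are my own choices for illustration, not something taken from the paper's code.

```python
import math

def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # IoU term: intersection area over union area.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    # DIoU term: squared center distance over the squared diagonal
    # of the smallest box enclosing both boxes.
    center_dist2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
                 + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    enc_w = max(px2, tx2) - min(px1, tx1)
    enc_h = max(py2, ty2) - min(py1, ty1)
    diag2 = enc_w ** 2 + enc_h ** 2 + eps

    # CIoU term: aspect-ratio consistency v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (
        math.atan((tx2 - tx1) / (ty2 - ty1)) - math.atan((px2 - px1) / (py2 - py1))
    ) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + center_dist2 / diag2 + alpha * v
```

Dropping the `alpha * v` term gives the DIoU loss, and dropping the center-distance term as well leaves the plain IoU loss.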

2.3 Bag of Specials(BOS)

: plugin modules or post-processing methods that increase the inference cost only slightly but significantly improve the accuracy

  1. enhance receptive field
    : SPP, ASPP, RFB
  2. attention module
    : SE(Squeeze and Excitation), SAM(Spatial attention module)
  3. feature integration
    1. skip connection, hyper-column
    2. SFAM(SE), ASFF(softmax), BiFPN(multi-input weighted residual connections)
  4. good activation function
    1. ReLU
    2. LReLU, PReLU
    3. SELU
    4. Swish, hard-Swish
    5. Mish
  5. post-processing → no longer required in anchor-free
    1. NMS (Non-Maximum Suppression): filter out boxes that badly predict the same object and keep the candidate with the higher response
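
For reference, three of the activation functions listed above can be written as scalar functions. This is a minimal sketch using the standard definitions (Swish as x · sigmoid(x), Mish as x · tanh(softplus(x))). The helper names `sigmoid` and `softplus` are mine.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relu(x):
    return max(0.0, x)

def swish(x):
    return x * sigmoid(x)

def mish(x):
    return x * math.tanh(softplus(x))
```

Unlike ReLU, both Swish and Mish are smooth and let small negative values through instead of clipping them to zero.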

 

3. Methodology

fast operating speed of neural net + optimization for parallel computations

3.1 Selection of architecture

Objective

  1. optimal balance among the network resolution, convolution layer number, parameter number, layer outputs number
  2. select additional blocks for increasing the receptive field
    and best parameter aggregation

CSPNet

→ design and use a CSPNet based backbone

  • propose Cross Stage Partial Network structure
    : reduce extremely heavy inference cost and minimize accuracy loss
  • Figure: CSPNet based backbone architecture: the input feature map is divided into 2 parts; one part skips the stage's operations and is merged back into the output.
  • → reduce the inference cost, memory cost, etc.
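
The split-and-merge idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: the feature map is a plain list of channels, `transform` is a hypothetical callable standing in for the stage's heavy sub-network, and the transition convolutions of the real CSPNet are left out.

```python
def csp_stage(channels, transform):
    """Split channels in two, process only one half, then concatenate.

    channels:  list of per-channel feature maps (any objects)
    transform: hypothetical callable standing in for the stage's layers
    """
    half = len(channels) // 2
    bypass = channels[:half]        # this part skips the computation
    processed = transform(channels[half:])
    return bypass + processed       # concatenation along the channel axis


# Usage: only the second half of the channels is transformed.
out = csp_stage([1, 2, 3, 4], lambda part: [c * 10 for c in part])
```

Because `transform` sees only half of the channels, its computation and memory cost shrink accordingly.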

 

  • detector requires...
    • Problem with YOLO: it is weak on small objects. → A larger input resolution is used so that many small objects are detected well.
    • The number of layers is increased to physically enlarge the receptive field.
    • Detecting objects of many kinds and sizes in a single image at the same time needs high representational power, so the number of parameters is increased.
  • higher input network resolution
    : detect multiple small-sized objects
  • more layers
    : higher receptive field to cover the increased size of the input network
  • more parameters
    : to detect multiple objects of different sizes in a single image
  • larger receptive field, larger number of parameters → backbone.
    • CSPDarknet53: larger receptive field, larger number of parameters, fastest FPS
    • CSPDarknet53 → optimal backbone for a detector!

 

YOLOv4 > CSPDarknet53

  1. additional blocks: SPP block
    1. increase receptive field
    2. separate out the context feature
    3. no reduction of the network operating speed
  2. parameter aggregation: PAN
    1. YOLOv3: FPN
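
A rough sketch of what the SPP block computes: the YOLO-style SPP described in the paper concatenates max-pooling outputs with kernel sizes k = {1, 5, 9, 13} and stride 1. The code below does this for a single-channel map stored as a list of rows. Clipping the window at the borders is my simplification (it behaves like padding with negative infinity).

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling over a k x k window; output size equals input size."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [
                fmap[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            ]
            row.append(max(window))
        out.append(row)
    return out

def spp(fmap, kernel_sizes=(1, 5, 9, 13)):
    """Return one pooled map per kernel size (stands in for channel concatenation)."""
    return [max_pool_same(fmap, k) for k in kernel_sizes]
```

With stride 1 the spatial size is unchanged, so the pooled maps can be stacked directly. Each output location then sees context at several scales.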

 

 📌   Final architecture    

  1. backbone: CSPDarknet53
  2. neck:
    1. additional blocks: SPP(Spatial Pyramid Pooling)
    2. path-aggregation: PANet(Path Aggregation Network)
  3. head: YOLOv3

 

3.2 Selection of BoF and BoS

  1. BoF
    • Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
    • Data augmentation: CutOut, MixUp, CutMix
    • Regularization method: DropOut, DropPath, Spatial DropOut, DropBlock
  2. BoS
    • Activations: ReLU, leaky-ReLU, PReLU, ReLU6, SELU, Swish, Mish
      → ReLU6: designed for quantization networks (removed from the candidates)
      → PReLU, SELU: difficult to train (removed from the candidates)
    • Normalization of the network activations by their mean and variance
      : BN, CGBN (or SyncBN), FRN (Filter Response Normalization), CBN (Cross-Iteration Batch Normalization)
      → single-GPU training is the focus, so SyncBN is not considered
    • Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, Cross stage partial connections (CSP)

 

3.3 Additional improvements

: additional design changes that make the detector more suitable for training on a single GPU

  1. introduce new method of data augmentation
    • Mosaic, SAT
  2. select optimal hyper-parameters: genetic algorithms
  3. modify existing methods → suitable for efficient training and detection
    • modified SAM, modified PAN, CmBN

BOF

  1. Mosaic
    • : mix 4 training images
    • detect objects outside normal context
    • batch normalization
      : calculate activation statistics from 4 images on each layer
    • → reduce the need for large mini-batch size
  2. SAT (Self-Adversarial Training): 2 forward-backward stages
    1. alter the original image instead of the network weights
      • the network runs an adversarial attack on itself
      → create the deception that there is no desired object in the image
    2. train the neural network to detect an object on this modified image in the normal way
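
The Mosaic idea can be sketched as follows. This is a simplified illustration, not the actual augmentation: it tiles four equally sized images into a fixed 2x2 grid, whereas real implementations usually pick a random split point and rescale or crop each image. Images are lists of pixel rows, and boxes are `(x1, y1, x2, y2)` in pixels.

```python
def mosaic(images, boxes):
    """Tile four H x W images into one 2H x 2W image and shift their boxes.

    images: [top_left, top_right, bottom_left, bottom_right]
    boxes:  one list of (x1, y1, x2, y2) boxes per image
    """
    tl, tr, bl, br = images
    h, w = len(tl), len(tl[0])

    # Join rows side by side, then stack the two halves vertically.
    top = [a + b for a, b in zip(tl, tr)]
    bottom = [a + b for a, b in zip(bl, br)]
    canvas = top + bottom

    # Shift each image's boxes by the offset of its tile.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    merged = []
    for (dx, dy), img_boxes in zip(offsets, boxes):
        for x1, y1, x2, y2 in img_boxes:
            merged.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, merged
```

One mosaic sample carries objects and backgrounds from four images, which is why batch normalization sees statistics from four images per sample.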

BOS

  1. CmBN
    • Figure 4: Cross mini-Batch Normalization
    • collect statistics only between mini-batches within a single batch
  2. modified SAM
    • spatial-wise attention → point-wise attention
    • Figure 5: Modified SAM
  3. modified PAN
    • PAN's shortcut connection → concatenation (replace)
    • Figure 6: Modified PAN

 

3.4 YOLOv4

YOLOv4

  • Backbone: CSPDarknet53
  • Neck: SPP, PAN
  • Head: YOLOv3

BOF

  1. backbone
    • data augmentation: CutMix, Mosaic
    • imbalance sampling: Class label smoothing
    • Regularization: DropBlock
  2. detector
    • objective function: CIoU-loss
    • normalization of network activation: CmBN
    • regularization: DropBlock
    • data augmentation: Mosaic, SAT
    • hyper-parameters optimization: Genetic algorithms
    • learning rate scheduler: Cosine annealing scheduler
    • others:
      • eliminate grid sensitivity
      • use multiple anchors for a single ground truth
      • random training shapes
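
Two of the detector items above are easy to write down: the cosine annealing schedule and the "eliminate grid sensitivity" trick. The sketch below is hedged. The schedule is the standard cosine formula. For grid sensitivity the paper only says the sigmoid is multiplied by a factor exceeding 1.0, so the re-centering offset `0.5 * (scale - 1)` and the example value 1.2 are assumptions based on common implementations. All function and parameter names are mine.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along half a cosine."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

def decode_center(t, cell_index, scale=1.0):
    """Box-center coordinate (in grid units) from the raw network output t.

    scale=1.0 is the YOLOv3-style decoding: the center can only approach
    the cell borders asymptotically. With scale > 1.0 the sigmoid range is
    stretched (and re-centered, an assumption), so borders become reachable.
    """
    sig = 1.0 / (1.0 + math.exp(-t))
    return cell_index + scale * sig - 0.5 * (scale - 1.0)
```

For example, with `scale=1.0` an object centered exactly on a cell border would need an infinitely large `t`, while with `scale=1.2` a moderate `t` is enough.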

BOS

  1. backbone
    • activation: Mish
    • skip connections: CSP, MiWRC
  2. detector
    • activation: Mish
    • receptive field enhancement: SPP
    • attention: modified SAM
    • feature integration: modified PAN
    • post-processing: DIoU-NMS
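
As an illustration of the DIoU-NMS post-processing step, here is a minimal greedy sketch. It assumes boxes in `(x1, y1, x2, y2)` format. A box is suppressed when its IoU with an already kept box, minus the normalized center distance, reaches the threshold. The threshold of 0.5 and the function names are my choices for the example.

```python
def diou(a, b, eps=1e-9):
    """IoU minus the normalized squared distance between box centers."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / (union + eps)

    # Squared center distance and squared diagonal of the enclosing box.
    d2 = ((a[0] + a[2]) - (b[0] + b[2])) ** 2 / 4 + ((a[1] + a[3]) - (b[1] + b[3])) ** 2 / 4
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - d2 / (c2 + eps)

def diou_nms(boxes, scores, threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep the box only if it is not too close to any kept box.
        if all(diou(boxes[i], boxes[j]) < threshold for j in keep):
            keep.append(i)
    return keep
```

Compared with plain IoU-based NMS, two overlapping boxes whose centers are far apart are less likely to suppress each other, which helps when objects occlude one another.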

 

4. Experiments

4.1 Experimental SetUp

4.2 Influence of different features on Classifier training

features

  • Class label smoothing
  • data augmentation
    : bilateral blurring, MixUp, CutMix, Mosaic
  • activations
    : Leaky-ReLU(by default), Swish, Mish

Result

  • improve accuracy
  • BoF-backbone: CutMix, Mosaic, Class label smoothing
  • additional option: Mish
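
Class label smoothing is simple enough to show directly. This is a minimal sketch of the usual uniform formulation, where a one-hot target is mixed with a uniform distribution over the K classes. The value `epsilon=0.1` is a commonly used default, assumed here and not taken from the paper.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over its classes."""
    k = len(one_hot)
    return [(1 - epsilon) * y + epsilon / k for y in one_hot]


# Usage: the true class keeps most of the probability mass.
target = smooth_labels([0, 0, 1, 0])
```

The smoothed target still sums to 1, but the network is no longer pushed toward fully confident predictions.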


 

4.3 Influence of different features on Detector training

BOF

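  • abbreviations: S = eliminate grid sensitivity, M = Mosaic, IT = IoU threshold (multiple anchors for a single ground truth), GA = genetic algorithms, CBN = CmBN, CA = cosine annealing scheduler, OA = optimized anchors
  • 1) BOF
    • loss: fixed to MSE
    • M, GA, CBN, CA → good performance
  • 2) BOF: S, M, IT, GA
    • loss: GIoU, DIoU, CIoU
    • S, M, IT, GA + CIoU → improve performance
  • 3) OA(Optimized Anchors)
    • CIoU + S, M, IT, GA
    • OA → improve performance
  • 4) Loss
    • Loss: MSE, GIoU, CIoU
    • GIoU, CIoU → high performance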

BOS

  • backbone: CSPResNeXt50
  • features: PAN, RFB, SAM, Gaussian YOLO (G), ASFF
  • SPP + PAN + SAM → BEST

 

4.4 Influence of different backbones and pre-trained weightings on Detector training

: the influence of different backbone models on the detector accuracy

  • the model with the best classification accuracy is not always the best in terms of detector accuracy.
  1. CSPResNeXt50: better as a classifier
    • BoF, Mish + CSPResNeXt50
      : increases classifier accuracy, decreases detector accuracy
  2. CSPDarknet53: better as a detector
    • BoF, Mish + CSPDarknet53
      : increases both classifier and detector accuracy
    • more suitable for the detector

 

4.5 Influence of different mini-batch size on Detector training

: compare the results of models trained with different mini-batch sizes.

  • after adding BoF and BoS, the mini-batch size has almost no effect on the detector's performance
  • ⇒ with BoF and BoS, expensive GPUs are no longer needed for training

 

5. Results

  • Figure 8: comparison of the speed and accuracy of object detectors

YOLOv4

  • located on the Pareto optimality curve
  • superior in speed + accuracy
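
To spell out what "located on the Pareto optimality curve" means: a detector is on the curve if no other detector is at least as fast and at least as accurate while being strictly better in one of the two. Below is a small sketch. The detector names and numbers in the usage example are made-up placeholders, not results from the paper.

```python
def pareto_front(detectors):
    """Return names of detectors not dominated in both speed and accuracy.

    detectors: list of (name, fps, ap) tuples
    """
    front = []
    for name, fps, ap in detectors:
        dominated = any(
            o_fps >= fps and o_ap >= ap and (o_fps > fps or o_ap > ap)
            for _, o_fps, o_ap in detectors
        )
        if not dominated:
            front.append(name)
    return front


# Usage with placeholder values: C is dominated by A, and D by B.
front = pareto_front([("A", 30, 40), ("B", 60, 35), ("C", 25, 38), ("D", 60, 30)])
```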

 

6. Conclusion

  • faster (FPS) and more accurate (MS COCO AP50...95, AP50) detector
  • can be trained and used on a conventional GPU with 8 to 16 GB VRAM → broad use
  • demonstrates the viability of the one-stage anchor-based detector
  • verifies a large number of features and selects those that improve the accuracy of both the classifier and the detector