
Paper Review

YOLOv4: Optimal Speed and Accuracy of Object Detection


Abstract

a huge number of features are said to improve CNN accuracy

  • YOLO v4 uses new features and achieves state-of-the-art results:
    1) WRC (Weighted-Residual-Connections)
    2) CSP (Cross-Stage-Partial-Connections)
    3) CmBN (Cross mini-Batch Normalization)
    4) SAT (Self-Adversarial-Training)
    5) Mish Activation
    6) Mosaic Data Augmentation
    7) DropBlock Regularization
    8) CIoU Loss

⇒ achieves 43.5% AP on the MS COCO dataset at ~65 FPS (real time)

 

1. Introduction

Problem

The most accurate modern NN

  • do not operate in real time
  • require a large number of GPUs for training with a large mini-batch size
  • Figure 1. YOLO v4 vs. state-of-the-art object detectors
    • YOLO v4 vs. EfficientDet: comparable performance, x2 faster FPS
    • improves YOLOv3's AP by 10% and FPS by 12%

 

contributions

  1. efficient and powerful object detection model
  2. verify the influence of BoF and BoS methods during training
  3. modify state-of-the-art methods to be more efficient and suitable for single-GPU training

 

2. Related Work

2.1 Object detection models

  • backbone: pre-trained on ImageNet
  • neck: collects feature maps from different stages of the backbone
  • head: class + bounding box prediction
    • Dense Prediction(one-stage)
    • Sparse Prediction(two-stage)
      : the class prediction and bounding box regression parts are separated

2.2 Bag of freebies(BOF)

: methods that only change training strategy or only increase training cost

⇒ better accuracy without increasing inference cost

  1. Data augmentation
    1. photometric or geometric distortions
    2. CutOut
    3. CutMix
  2. Regularization
    1. DropOut
    2. DropPath
    3. Spatial DropOut
    4. DropBlock
  3. Objective function of BBox Regression: Loss function
    1. MSE
    2. IoU
      • GIoU, CIoU, DIoU
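For reference, the IoU-based regression losses above can be sketched in a few lines of plain Python (boxes as (x1, y1, x2, y2); helper names are mine, formulas follow the GIoU/DIoU/CIoU definitions):

```python
import math

def iou_terms(b1, b2):
    """Return IoU plus the geometric terms used by GIoU/DIoU/CIoU.

    Boxes are (x1, y1, x2, y2) with positive width and height.
    """
    # Intersection and union
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = area1 + area2 - inter
    iou = inter / union

    # Smallest enclosing box (used by GIoU/DIoU/CIoU)
    ex1, ey1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    ex2, ey2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    enclose_area = (ex2 - ex1) * (ey2 - ey1)
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2        # enclosing-box diagonal^2

    # Squared distance between box centers (used by DIoU/CIoU)
    rho2 = ((b1[0] + b1[2]) / 2 - (b2[0] + b2[2]) / 2) ** 2 + \
           ((b1[1] + b1[3]) / 2 - (b2[1] + b2[3]) / 2) ** 2
    return iou, union, enclose_area, rho2, c2

def giou_loss(b1, b2):
    iou, union, enclose_area, _, _ = iou_terms(b1, b2)
    return 1 - (iou - (enclose_area - union) / enclose_area)

def diou_loss(b1, b2):
    iou, _, _, rho2, c2 = iou_terms(b1, b2)
    return 1 - (iou - rho2 / c2)

def ciou_loss(b1, b2):
    iou, _, _, rho2, c2 = iou_terms(b1, b2)
    w1, h1 = b1[2] - b1[0], b1[3] - b1[1]
    w2, h2 = b2[2] - b2[0], b2[3] - b2[1]
    # Aspect-ratio consistency term from the CIoU formulation
    v = 4 / math.pi ** 2 * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - (iou - rho2 / c2 - alpha * v)
```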

2.3 Bag of Specials(BOS)

: plug-in modules or post-processing methods that slightly increase the inference cost but significantly improve accuracy

  1. enhance receptive field
    : SPP, ASPP, RFB
  2. attention module
    : SE(Squeeze and Excitation), SAM(Spatial attention module)
  3. feature integration
    1. skip connection, hyper-column
    2. SFAM(SE), ASFF(softmax), BiFPN(multi-input weighted residual connections)
  4. good activation function
    1. ReLU
    2. LReLU, PReLU
    3. SELU
    4. Swish, hard-Swish
    5. Mish (see the sketch after this list)
  5. post-processing → no longer required in anchor-free methods
    1. NMS (Non-Maximum Suppression): filters redundant boxes that predict the same object
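For reference, Mish from item 4 above is essentially a one-liner; a minimal PyTorch-style sketch (mine, not the paper's code):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x)))
    return x * torch.tanh(F.softplus(x))
```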

 

3. Methodology

basic aim: fast operating speed of the neural network + optimization for parallel computation

3.1 Selection of architecture

Objective

  1. optimal balance among the input network resolution, the number of convolutional layers, the number of parameters, and the number of layer outputs (filters)
  2. select additional blocks for increasing the receptive field
    and best parameter aggregation

CSPNet

→ design and use a CSPNet based backbone

  • propose Cross Stage Partial Network structure
    : reduce extremely heavy inference cost and minimize accuracy loss
  • Figure: CSPNet-based backbone architecture: the input feature map is divided into two parts; one part bypasses the computation and is then merged back into the output (see the sketch below)
  • → reduce the inference cost, memory cost, etc.
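To make the "split, bypass, merge" idea concrete, here is a minimal PyTorch-style sketch of a CSP-style block (a simplified illustration with my own module and argument names, not the exact CSPDarknet53 block; assumes an even channel count):

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Simplified CSP-style block: split channels, process one half, concat."""

    def __init__(self, channels: int, num_blocks: int = 2):
        super().__init__()
        half = channels // 2
        # The path that does the heavy convolutional computation
        self.transform = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(half),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for _ in range(num_blocks)
        ])
        # 1x1 transition conv after merging the two paths
        self.transition = nn.Conv2d(half * 2, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split the input feature map into two parts along the channel axis
        part1, part2 = torch.chunk(x, 2, dim=1)
        # Only one part goes through the convolutional path ...
        part2 = self.transform(part2)
        # ... the other part bypasses it and is merged back at the end
        return self.transition(torch.cat([part1, part2], dim=1))
```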

 

  • detector requires...
    • higher input network resolution
      : YOLO is weak on small objects, so a larger input resolution is used to detect multiple small-sized objects well
    • more layers
      : to physically enlarge the receptive field so that it covers the increased input size
    • more parameters
      : greater capacity is needed to detect multiple objects of different types and sizes in a single image
  • larger receptive field, larger number of parameters → backbone.
    • CSPDarknet53: larger receptive field, larger number of parameters, fastest FPS
    • CSPDarknet53 → optimal backbone for a detector!

 

YOLOv4 additions on top of CSPDarknet53

  1. additional block: SPP block (see the sketch after this list)
    1. increases the receptive field
    2. separates out the most significant context features
    3. causes almost no reduction of the network operating speed
  2. parameter aggregation: PANet
    1. YOLOv3 used FPN
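A minimal PyTorch-style sketch of the SPP block as used in YOLO-style detectors (my own simplified version, assuming the commonly used 5/9/13 stride-1 max-pool kernels):

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """SPP as used in YOLO-style detectors: parallel max-pools, then concat.

    Stride-1 pooling with 'same' padding keeps the spatial size, so the block
    only widens the receptive field and the channel dimension (x4 here).
    """

    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the identity branch with the pooled branches
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```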

 

 📌   Final architecture    

  1. backbone: CSPDarknet53
  2. neck:
    1. additional blocks: SPP (Spatial Pyramid Pooling)
    2. path-aggregation: PANet(Path Aggregation Network)
  3. head: YOLOv3
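Schematically, these three parts compose as below (an illustration only, with placeholder modules and interfaces of my own choosing; not the actual implementation):

```python
import torch.nn as nn

class YOLOv4Sketch(nn.Module):
    """High-level composition only: backbone -> neck (SPP + PAN) -> YOLOv3 head."""

    def __init__(self, backbone: nn.Module, spp: nn.Module,
                 pan: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone   # CSPDarknet53: multi-scale feature maps
        self.spp = spp             # SPP on the deepest feature map
        self.pan = pan             # PANet: top-down + bottom-up aggregation
        self.head = head           # YOLOv3 head: boxes + objectness + classes

    def forward(self, images):
        c3, c4, c5 = self.backbone(images)            # three feature levels
        p3, p4, p5 = self.pan([c3, c4, self.spp(c5)])
        return self.head([p3, p4, p5])                # predictions at 3 scales
```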

 

3.2 Selection of BoF and BoS

  1. BoF
    • Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
    • Data augmentation: CutOut, MixUp, CutMix
    • Regularization method: DropOut, DropPath, Spatial DropOut, DropBlock
  2. BoS
    • Activations: ReLU, leaky-ReLU, PReLU, ReLU6, SELU, Swish, Mish
      → ReLU6 is designed for quantized networks; PReLU and SELU are difficult to train, so these are excluded
    • Normalization of the network activations by their mean and variance: BN, CGBN (or SyncBN), FRN (Filter Response Normalization), CBN (Cross-Iteration Batch Normalization)
      → since the training strategy targets a single GPU, synchronized BN (CGBN/SyncBN) is not considered
    • Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, Cross-stage partial connections (CSP)

 

3.3 Additional improvements

: designed and improved the detector to be more suitable for training on a single GPU

  1. introduce new method of data augmentation
    • Mosaic, SAT
  2. select optimal hyper-parameters: genetic algorithms
  3. modify existing methods → suitable for efficient training and detection
    • modified SAM, modified PAN, CmBN

BOF

  1. Mosaic (see the sketch after this list)
    • : mixes 4 training images
    • detects objects outside their normal context
    • batch normalization
      : calculates activation statistics from 4 images on each layer
    • → reduces the need for a large mini-batch size
  2. SAT (Self-Adversarial Training)
    : new data augmentation technique operating in 2 forward-backward stages
    1. 1st stage: the network alters the original image instead of its weights
      • an adversarial attack on itself
      → creates the deception that there is no desired object in the image
    2. 2nd stage: the network is trained to detect an object on this modified image in the normal way
BOS

  1. CmBN
    • Figure 4: Cross mini-Batch Normalization
    • collect statistics only between mini-batches within a single batch
  2. modified SAM (see the sketch after this list)
    • spatial-wise attention → point-wise attention
    • Figure 5: Modified SAM
  3. modified PAN
    • PAN's shortcut connection → concatenation (replace)
    • Figure 6: Modified PAN
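A minimal PyTorch-style sketch of the modified SAM from item 2 above, based on my reading of Figure 5: the channel-pooling step of the original SAM is dropped and a convolution + sigmoid is applied directly to the feature map, giving a point-wise attention mask (the 1x1 kernel is my simplification):

```python
import torch
import torch.nn as nn

class ModifiedSAM(nn.Module):
    """YOLOv4-style SAM: point-wise attention without max/avg channel pooling."""

    def __init__(self, channels: int):
        super().__init__()
        # The original SAM pools over channels first; the modified version
        # convolves the feature map directly, producing one attention value
        # per position *and* channel (point-wise).
        self.conv = nn.Conv2d(channels, channels, kernel_size=1, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attention = torch.sigmoid(self.conv(x))
        return x * attention   # element-wise re-weighting of the input
```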

 

3.4 YOLOv4

YOLOv4

  • Backbone: CSPDarknet53
  • Neck: SPP, PAN
  • Head: YOLOv3

BOF

  1. backbone
    • data augmentation: CutMix, Mosaic
    • Class label smoothing
    • Regularization: DropBlock
  2. detector
    • objective function: CIoU-loss
    • normalization of network activation: CmBN
    • regularization: DropBlock
    • data augmentation: Mosaic, SAT
    • hyper-parameters optimization: Genetic algorithms
    • learning rate scheduler: Cosine annealing scheduler (see the sketch after this list)
    • Others:
      • eliminate grid sensitivity
      • use multiple anchors for a single ground truth
      • random training shapes
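For reference, the cosine annealing scheduler mentioned above follows the usual SGDR-style formula; a minimal sketch (parameter names are mine):

```python
import math

def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float, lr_min: float = 0.0) -> float:
    """Learning rate decays from lr_max to lr_min along a half cosine."""
    progress = step / max(1, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# e.g. cosine_annealing_lr(0, 100, 0.01) == 0.01,
#      cosine_annealing_lr(100, 100, 0.01) == 0.0
```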

BOS

  1. backbone
    • activation: Mish
    • skip connections: CSP, MiWRC
  2. detector
    • activation: Mish
    • receptive field enhancement: SPP
    • attention: modified SAM
    • feature integration: modified PAN
    • post-processing: DIoU-NMS
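A minimal sketch of DIoU-NMS (greedy NMS where the suppression score is IoU minus the normalized center-point distance; boxes as (x1, y1, x2, y2), names and the 0.45 threshold are my choices):

```python
def diou(b1, b2):
    """DIoU between two (x1, y1, x2, y2) boxes: IoU - rho^2 / c^2."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((b1[2] - b1[0]) * (b1[3] - b1[1]) +
             (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter)
    iou = inter / union
    # Diagonal^2 of the smallest enclosing box
    c2 = ((max(b1[2], b2[2]) - min(b1[0], b2[0])) ** 2 +
          (max(b1[3], b2[3]) - min(b1[1], b2[1])) ** 2)
    # Squared distance between the two box centers
    rho2 = (((b1[0] + b1[2]) - (b2[0] + b2[2])) ** 2 +
            ((b1[1] + b1[3]) - (b2[1] + b2[3])) ** 2) / 4
    return iou - rho2 / c2

def diou_nms(boxes, scores, threshold=0.45):
    """Greedy NMS that suppresses a box when DIoU with a kept box > threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(diou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```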

 

4. Experiments

4.1 Experimental SetUp

4.2 Influence of different features on Classifier training

features

  • Class label smoothing
  • data augmentation
    : bilateral blurring, MixUp, CutMix, Mosaic
  • activations
    : Leaky-ReLU(by default), Swish, Mish

Result

  • improve accuracy
  • BoF-backbone: CutMix, Mosaic, Class label smoothing
  • additional option: Mish


 

4.3 Influence of different features on Detector training

BOF

  • 1) BoF (loss fixed to MSE)
    • M, GA, CBN, CA → good performance
  • 2) BoF: S, M, IT, GA
    • loss: GIoU, DIoU, CIoU
    • S, M, IT, GA + CIoU → improved performance
  • 3) OA (Optimized Anchors)
    • CIoU + S, M, IT, GA
    • OA → improves performance
  • 4) Loss
    • loss: MSE, GIoU, CIoU
    • GIoU, CIoU → high performance
  • (abbreviations from the paper's ablation table: S = grid sensitivity elimination, M = Mosaic, IT = IoU threshold, GA = genetic algorithms, CBN = CmBN, CA = cosine annealing, OA = optimized anchors)

BOS

  • backbone: CSPResNeXt50
  • features: PAN, RFB, SAM, Gaussian YOLO (G), ASFF


  • SPP + PAN + SAM → BEST

 

4.4 Influence of different backbones and pre-trained weightings on Detector training

: the influence of different backbone models on the detector accuracy

  • best classification accuracy model is not always the best detector accuracy model.
  1. CSPResNeXt50: better as a classifier
    • BoF, Mish + CSPResNeXt50
      : increases classifier accuracy but decreases detector accuracy
  2. CSPDarknet53: better as a detector
    • BoF, Mish + CSPDarknet53
      : increases both classifier and detector accuracy
    • more suitable for the detector

 

4.5 Influence of different mini-batch size on Detector training

: compare the results of models trained with different mini-batch sizes.

  • after adding BoF and BoS, the mini-batch size has almost no effect on the detector's performance
  • ⇒ with BoF and BoS, expensive GPUs (needed for large mini-batches) are no longer required

 

5. Results

  • Figure 8: comparison of the speed and accuracy of object detectors

YOLOv4

  • located on the Pareto optimality curve
  • superior in speed + accuracy

 

6. Conclusion

  • faster(FPS), more accurate(MS COCO AP50...95, AP50) detector
  • be trained and used on a conventional GPU with 8 to 16 GB VRAM → broad use
  • demonstrated the viability of a one-stage anchor-based detector
  • verified and selected features that improve the accuracy of both the classifier and the detector