
Paper Review

YOLOv4: Optimal Speed and Accuracy of Object Detection


Abstract

a huge number of features are said to improve CNN accuracy

  • YOLO v4 uses new features and achieves state-of-the-art results:
    1) WRC (Weighted-Residual-Connections)
    2) CSP (Cross-Stage-Partial-Connections)
    3) CmBN (Cross mini-Batch Normalization)
    4) SAT (Self-Adversarial-Training)
    5) Mish Activation
    6) Mosaic Data Augmentation
    7) DropBlock Regularization
    8) CIoU Loss

⇒ achieves 43.5% AP on the MS COCO dataset at ~65 FPS (real time)

 

1. Introduction

Problem

The most accurate modern NN

  • do not operate in real time
  • require a large number of GPUs for training with a large mini-batch size
  • Figure 1. YOLO v4 vs. state-of-the-art object detectors
    • YOLO v4 vs. EfficientDet: comparable performance, x2 faster FPS
    • improves YOLOv3's AP by 10% and FPS by 12%

 

contributions

  1. efficient and powerful object detection model
  2. verify the influence of BoF and BoS methods during training
  3. modify state-of-the-art methods to be more efficient and suitable for single-GPU training

 

2. Related Work

2.1 Object detection models

  • backbone: pre-trained on ImageNet
  • neck: collects feature maps from different stages of the backbone
  • head: class + bounding box prediction
    • Dense Prediction(one-stage)
    • Sparse Prediction(two-stage)
      : the class prediction and bounding box regression parts are separated

2.2 Bag of freebies(BOF)

: methods that only change training strategy or only increase training cost

⇒ better accuracy without increasing inference cost

  1. Data augmentation
    1. photometric or geometric distortions
    2. CutOut
    3. CutMix
  2. Regularization
    1. DropOut
    2. DropPath
    3. Spatial DropOut
    4. DropBlock
  3. Objective function of BBox Regression: Loss function
    1. MSE
    2. IoU
      • GIoU, CIoU, DIoU
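For reference, the IoU-based regression losses above can be sketched in a few lines of plain Python (boxes as (x1, y1, x2, y2); helper names are mine, formulas follow the GIoU/DIoU/CIoU definitions):

```python
import math

def iou_terms(b1, b2):
    """Return IoU plus the geometric terms used by GIoU/DIoU/CIoU.

    Boxes are (x1, y1, x2, y2) with positive width and height.
    """
    # Intersection and union
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = area1 + area2 - inter
    iou = inter / union

    # Smallest enclosing box (used by GIoU/DIoU/CIoU)
    ex1, ey1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    ex2, ey2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    enclose_area = (ex2 - ex1) * (ey2 - ey1)
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2        # enclosing-box diagonal^2

    # Squared distance between box centers (used by DIoU/CIoU)
    rho2 = ((b1[0] + b1[2]) / 2 - (b2[0] + b2[2]) / 2) ** 2 + \
           ((b1[1] + b1[3]) / 2 - (b2[1] + b2[3]) / 2) ** 2
    return iou, union, enclose_area, rho2, c2

def giou_loss(b1, b2):
    iou, union, enclose_area, _, _ = iou_terms(b1, b2)
    return 1 - (iou - (enclose_area - union) / enclose_area)

def diou_loss(b1, b2):
    iou, _, _, rho2, c2 = iou_terms(b1, b2)
    return 1 - (iou - rho2 / c2)

def ciou_loss(b1, b2):
    iou, _, _, rho2, c2 = iou_terms(b1, b2)
    w1, h1 = b1[2] - b1[0], b1[3] - b1[1]
    w2, h2 = b2[2] - b2[0], b2[3] - b2[1]
    # Aspect-ratio consistency term from the CIoU formulation
    v = 4 / math.pi ** 2 * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - (iou - rho2 / c2 - alpha * v)
```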

2.3 Bag of Specials(BOS)

: plug-in modules or post-processing methods that slightly increase the inference cost but significantly improve accuracy

  1. enhance receptive field
    : SPP, ASPP, RFB
  2. attention module
    : SE(Squeeze and Excitation), SAM(Spatial attention module)
  3. feature integration
    1. skip connection, hyper-column
    2. SFAM(SE), ASFF(softmax), BiFPN(multi-input weighted residual connections)
  4. good activation function
    1. ReLU
    2. LReLU, PReLU
    3. SELU
    4. Swish, hard-Swish
    5. Mish (see the sketch after this list)
  5. post-processing → no longer required in anchor-free methods
    1. NMS (Non-Maximum Suppression): filters redundant boxes that predict the same object
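For reference, Mish from item 4 above is essentially a one-liner; a minimal PyTorch-style sketch (mine, not the paper's code):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x)))
    return x * torch.tanh(F.softplus(x))
```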

 

3. Methodology

basic aim: fast operating speed of the neural network + optimization for parallel computation

3.1 Selection of architecture

Objective

  1. optimal balance among the input network resolution, the number of convolutional layers, the number of parameters, and the number of layer outputs (filters)
  2. select additional blocks for increasing the receptive field
    and best parameter aggregation

CSPNet

→ design and use a CSPNet based backbone

  • propose Cross Stage Partial Network structure
    : reduce extremely heavy inference cost and minimize accuracy loss
  • Figure: CSPNet-based backbone architecture: the input feature map is divided into two parts; one part bypasses the computation and is then merged back into the output (see the sketch below)
  • → reduce the inference cost, memory cost, etc.
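To make the "split, bypass, merge" idea concrete, here is a minimal PyTorch-style sketch of a CSP-style block (a simplified illustration with my own module and argument names, not the exact CSPDarknet53 block; assumes an even channel count):

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Simplified CSP-style block: split channels, process one half, concat."""

    def __init__(self, channels: int, num_blocks: int = 2):
        super().__init__()
        half = channels // 2
        # The path that does the heavy convolutional computation
        self.transform = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(half),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for _ in range(num_blocks)
        ])
        # 1x1 transition conv after merging the two paths
        self.transition = nn.Conv2d(half * 2, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split the input feature map into two parts along the channel axis
        part1, part2 = torch.chunk(x, 2, dim=1)
        # Only one part goes through the convolutional path ...
        part2 = self.transform(part2)
        # ... the other part bypasses it and is merged back at the end
        return self.transition(torch.cat([part1, part2], dim=1))
```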

 

  • detector requires...
    • higher input network resolution
      : YOLO is weak on small objects, so a larger input resolution is used to detect multiple small-sized objects well
    • more layers
      : to physically enlarge the receptive field so that it covers the increased input size
    • more parameters
      : greater capacity is needed to detect multiple objects of different types and sizes in a single image
  • larger receptive field, larger number of parameters → backbone.
    • CSPDarknet53: larger receptive field, larger number of parameters, fastest FPS
    • CSPDarknet53 → optimal backbone for a detector!

 

YOLOv4 additions on top of CSPDarknet53

  1. additional block: SPP block (see the sketch after this list)
    1. increases the receptive field
    2. separates out the most significant context features
    3. causes almost no reduction of the network operating speed
  2. parameter aggregation: PANet
    1. YOLOv3 used FPN
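A minimal PyTorch-style sketch of the SPP block as used in YOLO-style detectors (my own simplified version, assuming the commonly used 5/9/13 stride-1 max-pool kernels):

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """SPP as used in YOLO-style detectors: parallel max-pools, then concat.

    Stride-1 pooling with 'same' padding keeps the spatial size, so the block
    only widens the receptive field and the channel dimension (x4 here).
    """

    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the identity branch with the pooled branches
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```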

 

 📌   Final architecture    

  1. backbone: CSPDarknet53
  2. neck:
    1. additional blocks: SPP (Spatial Pyramid Pooling)
    2. path-aggregation: PANet(Path Aggregation Network)
  3. head: YOLOv3
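Schematically, these three parts compose as below (an illustration only, with placeholder modules and interfaces of my own choosing; not the actual implementation):

```python
import torch.nn as nn

class YOLOv4Sketch(nn.Module):
    """High-level composition only: backbone -> neck (SPP + PAN) -> YOLOv3 head."""

    def __init__(self, backbone: nn.Module, spp: nn.Module,
                 pan: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone   # CSPDarknet53: multi-scale feature maps
        self.spp = spp             # SPP on the deepest feature map
        self.pan = pan             # PANet: top-down + bottom-up aggregation
        self.head = head           # YOLOv3 head: boxes + objectness + classes

    def forward(self, images):
        c3, c4, c5 = self.backbone(images)            # three feature levels
        p3, p4, p5 = self.pan([c3, c4, self.spp(c5)])
        return self.head([p3, p4, p5])                # predictions at 3 scales
```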

 

3.2 Selection of BoF and BoS

  1. BoF
    • Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
    • Data augmentation: CutOut, MixUp, CutMix
    • Regularization method: DropOut, DropPath, Spatial DropOut, DropBlock
  2. BoS
    • Activations: ReLU, leaky-ReLU, PReLU, ReLU6, SELU, Swish, Mish
      → ReLU6 is designed for quantized networks; PReLU and SELU are difficult to train, so these are excluded
    • Normalization of the network activations by their mean and variance: BN, CGBN (or SyncBN), FRN (Filter Response Normalization), CBN (Cross-Iteration Batch Normalization)
      → since the training strategy targets a single GPU, synchronized BN (CGBN/SyncBN) is not considered
    • Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, Cross-stage partial connections (CSP)

 

3.3 Additional improvements

: designed and improved the detector to be more suitable for training on a single GPU

  1. introduce new method of data augmentation
    • Mosaic, SAT
  2. select optimal hyper-parameters: genetic algorithms
  3. modify existing methods → suitable for efficient training and detection
    • modified SAM, modified PAN, CmBN

BOF

  1. Mosaic (see the sketch after this list)
    • : mixes 4 training images
    • detects objects outside their normal context
    • batch normalization
      : calculates activation statistics from 4 images on each layer
    • → reduces the need for a large mini-batch size
  2. SAT (Self-Adversarial Training)
    : new data augmentation technique operating in 2 forward-backward stages
    1. 1st stage: the network alters the original image instead of its weights
      • an adversarial attack on itself
      → creates the deception that there is no desired object in the image
    2. 2nd stage: the network is trained to detect an object on this modified image in the normal way
BOS

  1. CmBN
    • Figure 4: Cross mini-Batch Normalization
    • collect statistics only between mini-batches within a single batch
  2. modified SAM (see the sketch after this list)
    • spatial-wise attention → point-wise attention
    • Figure 5: Modified SAM
  3. modified PAN
    • PAN's shortcut connection → concatenation (replace)
    • Figure 6: Modified PAN
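A minimal PyTorch-style sketch of the modified SAM from item 2 above, based on my reading of Figure 5: the channel-pooling step of the original SAM is dropped and a convolution + sigmoid is applied directly to the feature map, giving a point-wise attention mask (the 1x1 kernel is my simplification):

```python
import torch
import torch.nn as nn

class ModifiedSAM(nn.Module):
    """YOLOv4-style SAM: point-wise attention without max/avg channel pooling."""

    def __init__(self, channels: int):
        super().__init__()
        # The original SAM pools over channels first; the modified version
        # convolves the feature map directly, producing one attention value
        # per position *and* channel (point-wise).
        self.conv = nn.Conv2d(channels, channels, kernel_size=1, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attention = torch.sigmoid(self.conv(x))
        return x * attention   # element-wise re-weighting of the input
```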

 

3.4 YOLOv4

YOLOv4

  • Backbone: CSPDarknet53
  • Neck: SPP, PAN
  • Head: YOLOv3

BOF

  1. backbone
    • data augmentation: CutMix, Mosaic
    • Class label smoothing
    • Regularization: DropBlock
  2. detector
    • objective function: CIoU-loss
    • normalization of network activation: CmBN
    • regularization: DropBlock
    • data augmentation: Mosaic, SAT
    • hyper-parameters optimization: Genetic algorithms
    • learning rate scheduler: Cosine annealing scheduler (see the sketch after this list)
    • Others:
      • eliminate grid sensitivity
      • use multiple anchors for a single ground truth
      • random training shapes
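For reference, the cosine annealing scheduler mentioned above follows the usual SGDR-style formula; a minimal sketch (parameter names are mine):

```python
import math

def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float, lr_min: float = 0.0) -> float:
    """Learning rate decays from lr_max to lr_min along a half cosine."""
    progress = step / max(1, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# e.g. cosine_annealing_lr(0, 100, 0.01) == 0.01,
#      cosine_annealing_lr(100, 100, 0.01) == 0.0
```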

BOS

  1. backbone
    • activation: Mish
    • skip connections: CSP, MiWRC
  2. detector
    • activation: Mish
    • receptive field enhancement: SPP
    • attention: modified SAM
    • feature integration: modified PAN
    • post-processing: DIoU-NMS
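A minimal sketch of DIoU-NMS (greedy NMS where the suppression score is IoU minus the normalized center-point distance; boxes as (x1, y1, x2, y2), names and the 0.45 threshold are my choices):

```python
def diou(b1, b2):
    """DIoU between two (x1, y1, x2, y2) boxes: IoU - rho^2 / c^2."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((b1[2] - b1[0]) * (b1[3] - b1[1]) +
             (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter)
    iou = inter / union
    # Diagonal^2 of the smallest enclosing box
    c2 = ((max(b1[2], b2[2]) - min(b1[0], b2[0])) ** 2 +
          (max(b1[3], b2[3]) - min(b1[1], b2[1])) ** 2)
    # Squared distance between the two box centers
    rho2 = (((b1[0] + b1[2]) - (b2[0] + b2[2])) ** 2 +
            ((b1[1] + b1[3]) - (b2[1] + b2[3])) ** 2) / 4
    return iou - rho2 / c2

def diou_nms(boxes, scores, threshold=0.45):
    """Greedy NMS that suppresses a box when DIoU with a kept box > threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(diou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```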

 

4. Experiments

4.1 Experimental SetUp

4.2 Influence of different features on Classifier training

features

  • Class label smoothing
  • data augmentation
    : bilateral blurring, MixUp, CutMix, Mosaic
  • activations
    : Leaky-ReLU(by default), Swish, Mish

Result

  • improve accuracy
  • BoF-backbone: CutMix, Mosaic, Class label smoothing
  • additional option: Mish


 

4.3 Influence of different features on Detector training

BOF

  • 1) BoF (loss fixed to MSE)
    • M, GA, CBN, CA → good performance
  • 2) BoF: S, M, IT, GA
    • loss: GIoU, DIoU, CIoU
    • S, M, IT, GA + CIoU → improved performance
  • 3) OA (Optimized Anchors)
    • CIoU + S, M, IT, GA
    • OA → improves performance
  • 4) Loss
    • loss: MSE, GIoU, CIoU
    • GIoU, CIoU → high performance
  • (abbreviations from the paper's ablation table: S = grid sensitivity elimination, M = Mosaic, IT = IoU threshold, GA = genetic algorithms, CBN = CmBN, CA = cosine annealing, OA = optimized anchors)

BOS

  • backbone: CSPResNeXt50
  • features: PAN, RFB, SAM, Gaussian YOLO (G), ASFF


  • SPP + PAN + SAM → BEST

 

4.4 Influence of different backbones and pre-trained weightings on Detector training

: the influence of different backbone models on the detector accuracy

  • best classification accuracy model is not always the best detector accuracy model.
  1. CSPResNeXt50: better as a classifier
    • BoF, Mish + CSPResNeXt50
      : increases classifier accuracy but decreases detector accuracy
  2. CSPDarknet53: better as a detector
    • BoF, Mish + CSPDarknet53
      : increases both classifier and detector accuracy
    • more suitable for the detector

 

4.5 Influence of different mini-batch size on Detector training

: compare the results of models trained with different mini-batch sizes.

  • after adding BoF and BoS, the mini-batch size has almost no effect on the detector's performance
  • ⇒ with BoF and BoS, expensive GPUs (needed for large mini-batches) are no longer required

 

5. Results

  • Figure 8: comparison of the speed and accuracy of object detectors

YOLOv4

  • located on the Pareto optimality curve
  • superior in speed + accuracy

 

6. Conclusion

  • faster(FPS), more accurate(MS COCO AP50...95, AP50) detector
  • be trained and used on a conventional GPU with 8 to 16 GB VRAM → broad use
  • demonstrated the viability of a one-stage anchor-based detector
  • verified and selected features that improve the accuracy of both the classifier and the detector