YOLOv3

Yolov3

YOLOv2 ์ดํ›„ ๋‚˜์˜จ ๋…ผ๋ฌธ์„ ์ ์šฉํ•ด Object Detection์˜ ์•ฝ์ ๋“ค์„ ํ•ด๊ฒฐํ•˜๋ ค๋Š” ์‹คํ—˜์„ ํ•ฉ๋‹ˆ๋‹ค.

์ •ํ™•์„ฑ์€ ๋†’์ง€๋งŒ ์—ฌ์ „ํžˆ ๋น ๋ฆ…๋‹ˆ๋‹ค!

  • SSD๋ณด๋‹ค 3๋ฐฐ ๋น ๋ฅด์ง€๋งŒ ์ •ํ™•๋„๋Š” ๋†’์Šต๋‹ˆ๋‹ค.

  • RetinaNet๊ณผ ์ •ํ™•๋„๊ฐ€ ์œ ์‚ฌํ•˜์ง€๋งŒ ๋น ๋ฆ…๋‹ˆ๋‹ค.

Bounding Box Prediction

  • YOLOv2๋Š” Anchor Box๋กœ Dimension cluster๋ฅผ ์‚ฌ์šฉํ•ด์„œ Bounding Box๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.

  • tx,ty,tw,tht_x, t_y, t_w, t_h๋ฅผ ์—์ธกํ•˜๊ณ  ์ขŒ์ƒ๋‹จ ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด cx,cyc_x, c_y ๋งŒํผ offset๋˜๊ณ  bounding box์˜ width, height๊ฐ€ pw,php_w, p_h์ธ ๊ฒฝ์šฐ ์ตœ์ข… bounding box๋Š” bx,by,bw,bhb_x, b_y, b_w, b_h์ž…๋‹ˆ๋‹ค.

  • L2 loss๋ฅผ ์‚ฌ์šฉํ•ด ํ•™์Šตํ–ˆ๊ณ  YOLOv3๋Š” ์ด ์‹์„ ๋’ค์ง‘์–ด์„œ ๋ฐ”๋กœ t^โˆ—โˆ’tโˆ—\hat{t}_* - t_*์„ ๊ณ„์‚ฐํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ground truth๋ฅผ tx,ty,tw,tht_x, t_y, t_w, t_h๋กœ ๋งŒ๋“ ๋‹ค๋Š” ์˜๋ฏธ ์ž…๋‹ˆ๋‹ค.

  • ๋งŒ์•ฝ bounding box๊ฐ€ ๋‹ค๋ฅธ box๋ณด๋‹ค ground truth์™€ ๋งŽ์ด ๊ฒน์น˜๋Š” ๊ฒฝ์šฐ IOU๋Š” 1์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋งŒ์•ฝ IOU๊ฐ€ ์ œ์ผ ์ข‹์€ ๊ฒƒ์ด ์•„๋‹ˆ๋ฉด์„œ ์ž„๊ณ„๊ฐ’ ์ด์ƒ์˜ IOU๋ฅผ ๊ฐ€์ง„๋‹ค๋ฉด ์˜ˆ์ธก์„ ๋ฌด์‹œํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ground truth์— 1๊ฐœ์˜ bounding box๋งŒ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค.

  • IOU ์ž„๊ณ„๊ฐ’์€ 0.5์ž…๋‹ˆ๋‹ค.

  • bounding box๊ฐ€ ground truth์— ํฌํ•จ๋˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ classification loss๋Š” ์—†๊ณ  objectness loss๋งŒ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

Class Prediction

  • ๊ฐ bounding box๋Š” multi-label classification์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

  • multi-label classification์€ softmax๊ฐ€ ์ข‹์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— binary cross-entropy loss๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Predictions Across Scales

  • YOLOv3๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์Šค์ผ€์ผ์„ ๊ฐ€์ง€๋Š” 3๊ฐ€์ง€ box๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.

  • feature pyramid networks์™€ ์œ ์‚ฌํ•œ ๋ฐฉ์‹์œผ๋กœ ํŠน์ง•์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.

  • ๋ช‡๊ฐœ์˜ convolutional layer๊ฐ€ ์ถ”๊ฐ€๋˜๊ณ  ์ถœ๋ ฅ์€ 3-d tensor ์ž…๋‹ˆ๋‹ค.

  • N x N x [3 * (4(bounding box offsets) + 1(objectness) + 80(class))]

  • ์ด์ „์˜ 2๋ฒˆ์งธ layer์—์„œ feature map์„ 2๋ฐฐ Upsampling ํ•ฉ๋‹ˆ๋‹ค.

  • ์ดˆ๊ธฐ๋ถ€ํ„ฐ feature map์„ ๊ฐ€์ ธ์™€ Upsampling๋œ feature map๊ณผ concatํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋ฉด ์˜๋ฏธ์žˆ๋Š” ์ •๋ณด(์ด์ „ layer)์™€ ์„ธ๋ถ„ํ™” ๋œ ์ •๋ณด(์ดˆ๊ธฐ layer)๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ๊ฒฐํ•ฉ ๋œ feature map์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด์„œ convolutional layer๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

  • ์ตœ์ข… scale์˜ box๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด์„œ ๊ฐ™์€ ๋””์ž์ธ์„ ํ•œ๋ฒˆ๋” ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ 3๋ฒˆ์งธ scale์˜ ์˜ˆ์ธก์€ ๋ชจ๋“  ์ด์ „ layer์™€ ์ดˆ๊ธฐ์˜ ์„ธ๋ถ„ํ™”๋˜๊ณ  ์˜๋ฏธ์žˆ๋Š” ์ •๋ณด๋ฅผ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.

  • k-means๋ฅผ ํ†ตํ•ด anchor box๋ฅผ clusteringํ•˜๊ณ  9๊ฐœ์˜ cluster์™€ 3๊ฐœ์˜ scale๋ฅผ ์ž„์˜๋กœ ์„ ํƒํ•ด cluster๋ฅผ ๊ท ๋“ฑํ•˜๊ฒŒ ๋‚˜๋ˆ•๋‹ˆ๋‹ค.

  • COCO์˜ ๊ฒฝ์šฐ (10 ร— 13), (16 ร— 30), (33 ร— 23), (30 ร— 61), (62 ร— 45), (59 ร— 119), (116 ร— 90) , (156 ร— 198), (373 ร— 326) ์ž…๋‹ˆ๋‹ค.

Feature Extractor

ํŠน์ง• ์ถ”์ถœ์„ ์œ„ํ•œ DarkNet53์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

DarkNet53์„ ๋‹ค๋ฅธ ๋ชจ๋ธ๊ณผ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ์…‹์€ ImageNet์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Training

  • mining๊ฐ™์€ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

  • multi-scale training, data augmentation, batch normalization ๋“ฑ ๋งŽ์€ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

How We Do

  • COCO์˜ ์ด์ƒํ•œ mAP๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด SSD ๋ณ€ํ˜•๊ณผ ๋™์ผํ•˜์ง€๋งŒ 3๋ฐฐ๋Š” ๋น ๋ฆ…๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Ÿฌํ•œ ์ธก์ •๋ฒ•์œผ๋กœ RetinaNet๊ณผ ๊ฐ™์€ ๋ชจ๋ธ๋ณด๋‹ค ์•ฝ๊ฐ„ ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

  • IOU = 0.5์—์„œ AP50๋ฅผ ๋ณผ๋•Œ YOLOv3๋Š” ๊ฐ•๋ ฅํ•ฉ๋‹ˆ๋‹ค.

  • IOU์˜ threshold๊ฐ€ ์ฆ๊ฐ€ํ•˜๋ฉด Object์™€ Box๋ฅผ ์™„๋ฒฝํžˆ ์ •๋ ฌํ•˜๋Š”๋ฐ ์–ด๋ ค์›€์„ ๊ฒช์–ด ์„ฑ๋Šฅ์ด ๊ธ‰๊ฒฉํžˆ ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.

  • ์ด์ „์— YOLO์˜ ์•ฝ์ ์ธ ์ž‘์€ ๋ฌผ์ฒด๋ฅผ ๊ฒ€์ถœํ•˜๋Š” ๊ฒƒ์ด ํ›จ์”ฌ ์ข‹์•„์กŒ์Šต๋‹ˆ๋‹ค.

Things We Tried That Didn't Work

  • anchor box์˜ x, y offset์„ ์˜ˆ์ธก : linear activation์„ ์‚ฌ์šฉํ•ด์„œ box์˜ width, height์˜ ๋ฐฐ์ˆ˜๋กœ์จ anchor box์˜ x, y๋ฅผ ์˜ˆ์ธก์„ ์‹œ๋„ํ–ˆ์ง€๋งŒ ์ข‹์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

  • Linear x, yt predictions instead of logistic : logistic activation๋Œ€์‹  linear activation์„ ์‚ฌ์šฉํ•ด x, y์˜ offset์„ ์˜ˆ์ธกํ•˜๋ ค ํ–ˆ์ง€๋งŒ ๋ช‡ ํฌ์ธํŠธ ์ •๋„์˜ mAP ์„ฑ๋Šฅ์„ ๋‚ฎ์ถฅ๋‹ˆ๋‹ค.

  • Focal Loss : mAP๊ฐ€ 2% ๋–จ์–ด์ง‘๋‹ˆ๋‹ค. ์ด๋ฏธ objectness, classification์ด ์ž˜๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๊ณ  ํ•˜์ง€๋งŒ ํ™•์‹ ํ•  ์ˆ˜ ์—†๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

  • Dual IOU thresholds and truth assignment : Faster RCNN์—์„œ ๊ณ ์•ˆ๋œ ๋ฐฉ๋ฒ•์œผ๋กœ ๋‘๊ฐœ์˜ IOU๊ฐ’์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ์ธก IOU๊ฐ€ 0.7์ด์ƒ์ด๋ฉด ๊ธ์ •์ ์ธ sample์ด๊ณ  0.3์ดํ•˜๋ฉด ๋ถ€์ •์ ์ธ sample์ž…๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” ์ข‹์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

What This All Means

YOLOv3๋Š” ์ •ํ™•ํ•˜๊ณ  ๋น ๋ฆ…๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ COCO metric(0.5 ~ 0.95๊นŒ์ง€ ์กฐ๊ธˆ์”ฉ ๋Š˜๋ฆฌ๋ฉด์„œ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐฉ๋ฒ•)์œผ๋กœ๋Š” ์ข‹์ง€ ์•Š์ง€๋งŒ AP50 metric์€ ์ข‹์Šต๋‹ˆ๋‹ค. Russakovsky et al.์€ ์‚ฌ๋žŒ๋“ค์—๊ฒŒ IOU๊ฐ€ 0.3, 0.5์ธ bounding box๋ฅผ ๊ตฌ๋ถ„ํ•˜๋„๋ก ํ•˜๊ฒŒ ํ–ˆ์ง€๋งŒ ๊ตฌ๋ถ„์„ ์ž˜ ๋ชปํ–ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ๋ง์€ ์ฆ‰์Šจ COCO metric์ฒ˜๋Ÿผ ์„ธ๋ฐ€ํ•œ ํ‰๊ฐ€ ๋ฐฉ๋ฒ•์ด ์ •๋ง ์ข‹์€์ง€์— ๋Œ€ํ•œ ์˜๊ฒฌ์„ ๋งํ•ฉ๋‹ˆ๋‹ค.

Rebutter๋Š” YOLO benchmarking์˜ ์œ„์น˜, COCO metric์ด ์•ฝํ•œ ์ด์œ ๋ฅผ ๋” ์„ธ๋ฐ€ํ•˜๊ฒŒ ํ’€์–ด๋‚ด์ง€๋งŒ ์ง์ ‘์ ์œผ๋กœ ๋‹ค๋ฃจ์ง€ ์•Š๊ฒ ์Šต๋‹ˆ๋‹ค.

Last updated

Was this helpful?