
[Vision] Semantic Segmentation VS. Instance Segmentation

DrawingProcess 2024. 1. 15. 17:11
💡 This post summarizes '[Vision] Semantic Segmentation VS. Instance Segmentation'.
Computer vision has four broad tasks: 1. Classification, 2. Object Detection, 3. Image Segmentation, and 4. Visual relationship. This post covers the third, Image Segmentation.

1. Classification / Object Detection / Image segmentation

Classification / Object Detection

Before getting to image segmentation, here is a brief look at the tasks that precede it: classification and object detection. Classification takes an image and decides what kind of picture it is, that is, which object it represents. So even if the cat in the image moves to a different position, classification must still label the image as a cat.

1) Classification: translation invariance

This property is called "translation invariance", and classification must satisfy it.

In a CNN, translation invariance means that the output keeps the same value even when the position of the input changes. The convolutional layers themselves, however, are translation equivariant (that is, translation variant): when a feature moves in the input, the filter response for that feature moves by the same amount in the output.

If convolution is translation equivariant, how does a classifier built from a CNN become translation invariant? It can be attributed to the following three factors.

  • 1. Max pooling
  • 2. Weight sharing and local feature learning, both characteristic of CNNs
  • 3. Computing class probabilities with softmax

For more detail, see the blog post 'translation invariance 설명 및 정리' ("Translation invariance: explanation and summary").
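The difference between equivariance and invariance can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the helper `correlate2d_valid`, the 2x2 pattern, and the image sizes are made up for this example, and a single global max stands in for the pooling stages of a real classifier.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image without padding (what a conv layer computes)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical "cat" pattern; the filter is tuned to the same pattern.
pattern = np.array([[1.0, 2.0], [3.0, 4.0]])
kernel = pattern.copy()

image_a = np.zeros((8, 8))
image_a[1:3, 1:3] = pattern          # pattern near the top-left
image_b = np.zeros((8, 8))
image_b[4:6, 3:5] = pattern          # same pattern shifted by (3, 2)

feat_a = correlate2d_valid(image_a, kernel)
feat_b = correlate2d_valid(image_b, kernel)

# Equivariance: the location of the strongest response moves with the input.
peak_a = np.unravel_index(np.argmax(feat_a), feat_a.shape)
peak_b = np.unravel_index(np.argmax(feat_b), feat_b.shape)
print("peak positions:", peak_a, peak_b)

# Invariance: a global max over the feature map discards the location.
print("global max:", feat_a.max(), feat_b.max())
```

The feature maps differ only by a shift, while the pooled value is identical for both images, which is the behavior a classifier needs.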

2) Localization: translation variance

Object detection must not only classify an object but also find where it is, using a bounding box.

In other words, it has to perform localization. Because it must find positions, unlike classification, it can no longer be "translation invariant". Object detection therefore has to satisfy "translation variance".

What, then, is image segmentation? Image segmentation divides an image into regions and groups those regions so that they match each object.

Semantic Segmentation / Instance Segmentation

The two representative forms of image segmentation are semantic segmentation and instance segmentation. Semantic segmentation segments objects, but objects of the same class are assigned to the same region (or drawn in the same color).

๋ฐ˜๋Œ€๋กœ Instance segmentation์€ ๊ฐ™์€ class์ด์—ฌ๋„ ์„œ๋กœ ๋‹ค๋ฅธ instance๋กœ ๊ตฌ๋ถ„ํ•ด์ฃผ๋Š” ๊ฒƒ์ด๋‹ค.

As a result, instance segmentation solves a problem that semantic segmentation has: when objects overlap, semantic segmentation cannot tell the individual objects apart.

2. Semantic segmentation vs Instance segmentation

As with object detection, image segmentation has to localize objects, so it too must satisfy translation variance.

1) Semantic segmentation

Semantic segmentation must produce a label for every pixel, saying which class that pixel belongs to.

To do so, it uses one-hot encoding and creates as many output channels as there are classes, one per class. An argmax across the channels then collapses them into a single output map.

One output channel per class, followed by argmax
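The channel-then-argmax step can be sketched in NumPy as follows. The class names and score values are invented for illustration; in practice the scores would come from the network's final layer.

```python
import numpy as np

# Hypothetical classes and per-pixel scores with shape (num_classes, H, W).
class_names = ["background", "dog", "person"]
scores = np.array([
    [[0.8, 0.1, 0.2], [0.7, 0.2, 0.1]],   # background channel
    [[0.1, 0.7, 0.1], [0.2, 0.6, 0.2]],   # dog channel
    [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],   # person channel
])

# argmax over the class axis gives one label per pixel, shape (H, W).
label_map = scores.argmax(axis=0)

# One-hot encoding of the label map: one binary channel per class, shape (C, H, W).
one_hot = np.moveaxis(np.eye(len(class_names), dtype=np.uint8)[label_map], -1, 0)

print(label_map)
print(one_hot[class_names.index("dog")])   # 1 where the pixel is "dog", else 0
```

Each pixel is 1 in exactly one channel of `one_hot`, which is the binary per-class view the post describes.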

Seen this way, semantic segmentation only asks, for each class, the binary question of whether a pixel belongs to that class or not.

That is, in the one-hot map the dog output channel records for every pixel whether it belongs to a dog, the person output channel records whether it belongs to a person, and so on, each as a binary value of 0 or 1.

The drawback of segmenting this way is that objects of the same class cannot be told apart. For the cars in the lower right of the figure, there is no way to know how many cars there are or how they overlap; all we know is that those pixels belong to the class car.

2) Instance Segmentation

๋ฐ˜๋Œ€๋กœ Instance segmentation์€ ๊ฐ ํ”ฝ์…€๋ณ„๋กœ ์–ด๋–ค ์นดํ…Œ๊ณ ๋ฆฌ์— ์†ํ•˜๋Š”์ง€ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ ๊ฐ ํ”ฝ์…€๋ณ„๋กœ object๊ฐ€ ์žˆ๋Š”์ง€ ์—†๋Š”์ง€ ์—ฌ๋ถ€๋งŒ ๊ณ„์‚ฐํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ํ•˜๋Š˜์ด๋‚˜ ๋„๋กœ ๋“ฑ  ์ •ํ•ด์ง„ ํ˜•ํƒœ๊ฐ€ ์—†๋Š” ๋ฌผ์ฒด์˜ ๊ฒฝ์šฐ๋Š” ํด๋ž˜์Šค ๋ผ๋ฒจ์„ ๋ถ€์—ฌํ•˜์ง€ ์•Š๋Š”๋‹ค.

Typically, a two-stage model such as Mask R-CNN first localizes objects with bounding boxes. Then, much like the per-class output channels described above, it predicts for each localized RoI as many binary masks (instance or not) as there are classes.

Unlike semantic segmentation, there is no set of image-sized output channels, one per class. Instead, each RoI has one output channel per class, and even among objects of the same class, only the pixels of the instance that the RoI focuses on take a value.
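A rough sketch of this per-RoI scheme is shown below. It is not the Mask R-CNN implementation: the function `paste_instance_masks`, the box format `[y0, x0, y1, x1]`, and the assumption that each mask already has the same size as its box (so no resizing is needed) are simplifications made for this example.

```python
import numpy as np

def paste_instance_masks(image_shape, boxes, class_ids, roi_masks, threshold=0.5):
    """Build an instance-ID map from per-RoI, per-class mask scores.

    boxes:     (N, 4) integer array of [y0, x0, y1, x1]
    class_ids: (N,) predicted class of each RoI
    roi_masks: list of N arrays shaped (num_classes, box_h, box_w)
    """
    instance_map = np.zeros(image_shape, dtype=np.int32)   # 0 = no instance
    for idx, (box, cls, masks) in enumerate(zip(boxes, class_ids, roi_masks), start=1):
        y0, x0, y1, x1 = box
        binary = masks[cls] >= threshold          # keep only the predicted class channel
        region = instance_map[y0:y1, x0:x1]       # view into the full-size map
        region[binary & (region == 0)] = idx      # earlier RoIs win where masks overlap
    return instance_map

CAR = 1                                            # hypothetical class index
boxes = np.array([[0, 0, 4, 3], [0, 2, 4, 6]])     # two overlapping car boxes
class_ids = np.array([CAR, CAR])
roi_masks = [
    np.stack([np.zeros((4, 3)), np.full((4, 3), 0.9)]),
    np.stack([np.zeros((4, 4)), np.full((4, 4), 0.8)]),
]

instance_map = paste_instance_masks((4, 6), boxes, class_ids, roi_masks)
semantic_car = instance_map > 0                    # what a semantic "car" channel would show
print(instance_map[0])
print(semantic_car[0])
```

The semantic view is one undivided block of car pixels, while the instance map keeps the two cars apart with IDs 1 and 2.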

3) Panoptic Segmentation

Panoptic segmentation combines the two approaches above: it predicts a class label for every pixel in the image and also assigns each pixel an arbitrary instance ID.

In this task, classes whose objects can be told apart individually (for example, cars or people) are called Thing classes, and classes whose objects cannot (for example, sky or road) are called Stuff classes. Thing classes are handled with instance segmentation and Stuff classes with semantic segmentation.
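One way to picture the combined output is to merge a semantic map and an instance map into a single ID per pixel. The sketch below is an assumption-laden illustration: the function `merge_panoptic`, the class indices, and the `class * 1000 + instance` encoding are chosen for this example and are not taken from the post.

```python
import numpy as np

def merge_panoptic(semantic, instance, thing_classes, divisor=1000):
    """Encode each pixel as class_id * divisor + instance_id.

    Stuff pixels always get instance_id 0, so all pixels of one Stuff class
    share a single ID; Thing pixels keep their per-object instance_id.
    """
    is_thing = np.isin(semantic, list(thing_classes))
    instance_id = np.where(is_thing, instance, 0)
    return semantic * divisor + instance_id

SKY, ROAD, CAR = 0, 1, 2                       # hypothetical class indices
semantic = np.array([
    [SKY,  SKY,  SKY,  SKY],
    [CAR,  CAR,  CAR,  CAR],
    [ROAD, ROAD, ROAD, ROAD],
])
instance = np.array([
    [0, 0, 0, 0],
    [1, 1, 2, 2],                              # two separate cars
    [7, 0, 0, 0],                              # stray ID on a Stuff pixel is ignored
])

panoptic = merge_panoptic(semantic, instance, thing_classes={CAR})
print(panoptic)
```

Every pixel ends up with a class label, and only the Thing class (car) is split into separate instances.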

3. Training schemes for semantic segmentation models

As with machine learning models in general, training schemes are distinguished by how the ground truth (GT) is used. This is an active research area that aims to ease the difficulty of producing GT for input images.

  • Supervised SS: trains the model on many (input image, GT) pairs
  • Semi-supervised SS: trains the model on a few such pairs plus many unlabeled input images
  • Weakly-supervised SS: trains the model on many input images paired with weak supervision instead of full GT
  • Unsupervised SS: trains the model on input images only

Weakly-supervised SS

The goal is to perform semantic segmentation using comparatively little annotation information.

  • This comparatively small amount of information is called weak supervision
  • Location information (bounding box, point) and category information are used as weak supervision

This approach is studied to ease the difficulty of producing GT for input images. A representative use of such information is the Segment-Anything (SAM) model, which takes it as input.
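To make the idea of box-level weak supervision concrete, here is a deliberately naive sketch that turns bounding boxes with class labels into a coarse pseudo-mask. It is a hypothetical baseline for illustration only, not a method described in the post and unrelated to how SAM works; the function `boxes_to_pseudo_mask`, the box format `[y0, x0, y1, x1]`, and the ignore value 255 are all assumptions.

```python
import numpy as np

def boxes_to_pseudo_mask(image_shape, boxes, class_ids, ignore_label=255):
    """Fill each box with its class; pixels claimed by different classes become 'ignore'."""
    pseudo = np.zeros(image_shape, dtype=np.uint8)      # 0 = background
    for (y0, x0, y1, x1), cls in zip(boxes, class_ids):
        region = pseudo[y0:y1, x0:x1]                   # view into the full mask
        conflict = (region != 0) & (region != cls)      # already labeled with another class
        region[region == 0] = cls
        region[conflict] = ignore_label
    return pseudo

# Two hypothetical weak annotations: a class-1 box and a class-2 box that overlap.
boxes = [(0, 0, 3, 3), (1, 2, 4, 5)]
class_ids = [1, 2]
pseudo_mask = boxes_to_pseudo_mask((4, 6), boxes, class_ids)
print(pseudo_mask)
```

The result is far coarser than a real GT mask, which is exactly the gap that weakly-supervised methods try to close.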

Conclusion

To summarize: when there are 10 class labels, semantic segmentation computes for every pixel, separately for each of the 10 classes, the binary answer to whether the pixel belongs to that class.

๋ฐ˜๋ฉด, Instance segmentation์€ ์ด๋ฏธ localization์„ ์ˆ˜ํ–‰ํ•œ ํ›„ ๊ทธ box๊ฐ€ focusํ•˜๊ณ  ์žˆ๋Š” instance์˜ pixel์ด ๊ถ๊ธˆํ•œ ๊ฒƒ์ด๋ฏ€๋กœ ๊ฐ box์— ๋Œ€ํ•ด image segmentation์„ ํ•˜๋Š”๋ฐ ์ด๋•Œ ๋™์ผ class์—ฌ๋„ ์„œ๋กœ ๋‹ค๋ฅธ instance์ด๋ฉด value๋ฅผ ๊ฐ–์ง€ ์•Š๋Š”๋‹ค.

The advantages of instance segmentation are therefore that objects can be distinguished from one another and that, because each object is segmented only after localization, it can be favorable in terms of computation and efficiency.
