
[Paper Review] PETR: Position Embedding Transformation for Multi-View 3D Object Detection (ECCV 2022)

DrawingProcess 2024. 7. 12. 05:39
💡 This post summarizes the paper 'PETR: Position Embedding Transformation for Multi-View 3D Object Detection (ECCV 2022)'.
The paper encodes 3D coordinate information into multi-view 2D image features as a position embedding, so that object queries can attend to 3D position-aware features and detect 3D objects end to end.
 - Paper: https://arxiv.org/abs/2203.05625
 - Github: https://github.com/megvii-research/PETR

Contribution

  • DETR-family detection models have been showing strong performance
  • DETR3D is a model for 3D object detection, but its 2D-3D transformation step causes problems
    • 1. The predicted coordinates of the reference points may be inaccurate
    • 2. Only the image features at the projected points are seen, which makes it hard to learn a global representation
  • PETR proposes a Position Embedding Transformation based on 3D coordinates
    • (a) DETR: the queries and the 2D Positional Embedding (PE) are fed in separately.
    • (b) DETR3D: reference points generated from the queries are projected onto the image features, and features are sampled at those locations.
    • (c) PETR: the 2D features and the 3D PE are encoded together into 3D-aware features, which are then decoded with the queries.

PETR overall Architecture

  1. Extract 2D features from the multi-view images with a backbone
  2. Generate a 3D meshgrid-shaped camera frustum space
  3. Transform the frustum into 3D world space coordinates
  4. Extract 3D position-aware features with the 3D Position Encoder
  5. Cross-attention with the object queries
  6. Predict the final result (3D box + class)

1. Extract 2D features from the multi-view images with a backbone (e.g., ResNet-50)

2. Generate a 3D meshgrid-shaped camera frustum space

Generating a frustum from the image

Process of generating the meshgrid-shaped frustum

  • A depth d_j is appended to the image coordinates to build each frustum point p_j^m = (u_j × d_j, v_j × d_j, d_j, 1)^T
    • (u_j, v_j) is the pixel coordinate in the image that corresponds to d_j
  • The frustum holds coordinates expressed in each image's own frame

To represent the frustums of multiple views in a shared 3D coordinate system, they must be transformed into 3D world space

  • Using the frustum's image-based coordinates directly in 3D would introduce distortion, so they need to be converted to 3D world coordinates
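The frustum construction described above can be sketched in NumPy as follows. This is a minimal illustration, not the official implementation: the function name `build_frustum`, the use of cell centers (the `+ 0.5` offset), and the linearly spaced depth bins are all assumptions made for the example.

```python
import numpy as np

def build_frustum(feat_h, feat_w, img_h, img_w, depth_bins):
    """Return frustum points of shape (D, H_F, W_F, 4) as (u*d, v*d, d, 1)."""
    # Pixel coordinates of each feature-map cell in the original image
    # (cell centers are an assumption of this sketch).
    us = (np.arange(feat_w) + 0.5) * img_w / feat_w
    vs = (np.arange(feat_h) + 0.5) * img_h / feat_h
    ds = np.asarray(depth_bins, dtype=np.float64)
    # "ij" indexing keeps the axis order (depth, row, column).
    d, v, u = np.meshgrid(ds, vs, us, indexing="ij")
    return np.stack([u * d, v * d, d, np.ones_like(d)], axis=-1)

# Hypothetical sizes: a 2x3 feature map over a 320x480 image, 4 depth bins.
frustum = build_frustum(feat_h=2, feat_w=3, img_h=320, img_w=480,
                        depth_bins=np.linspace(1.0, 61.0, 4))
print(frustum.shape)  # (4, 2, 3, 4)
```

Every feature-map cell gets one point per depth bin, so the same grid can be reused for all cameras before the per-camera transformation is applied.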

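3. Transform the frustum into 3D world space coordinates

Transformation to 3D World Space from Frustum

  • The frustum points are multiplied by a transformation matrix to obtain x, y, z coordinates in 3D world space, p_{i,j}^{3d} = K_i^{-1} p_j^m, and are then min-max normalized by the region-of-interest bounds, e.g. x = (x_{i,j} - x_min) / (x_max - x_min), which maps points inside the region to [0, 1]
    • K_i: the matrix that maps 3D world space to the i-th camera's frustum space (so its inverse is applied here)
    • x_min, x_max, y_min, y_max (and likewise z_min, z_max): the minimum/maximum extent of the region to be represented in 3D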
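A small sketch of this transformation step, under stated assumptions: `K` is taken to be a 4x4 matrix mapping homogeneous world points to frustum points, and normalization is min-max over a region of interest (so points inside the region land in [0, 1]). The camera values and the region bounds below are hypothetical.

```python
import numpy as np

def frustum_to_world(frustum, K, roi_min, roi_max):
    """Map homogeneous frustum points (..., 4) to normalized world coords (..., 3)."""
    # K maps world -> frustum, so its inverse maps frustum -> world.
    # For row vectors p, p @ A.T equals (A @ p) for each point.
    world = frustum @ np.linalg.inv(K).T
    xyz = world[..., :3]
    lo = np.asarray(roi_min, dtype=np.float64)
    hi = np.asarray(roi_max, dtype=np.float64)
    return (xyz - lo) / (hi - lo)

# Hypothetical camera: focal length 100 px, principal point (50, 40), identity pose.
K = np.array([[100.0, 0.0, 50.0, 0.0],
              [0.0, 100.0, 40.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
# One frustum point (u*d, v*d, d, 1) with u=70, v=30, d=10.
point = np.array([[700.0, 300.0, 10.0, 1.0]])
coords = frustum_to_world(point, K, roi_min=(-10, -10, 0), roi_max=(10, 10, 20))
print(coords)  # close to [[0.6, 0.45, 0.5]]
```

Here the frustum point corresponds to the world point (2, -1, 10), which sits at 60%, 45%, and 50% of the region along x, y, and z.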


4. Extract 3D position-aware features with the 3D Position Encoder

3D Position Encoder in Overall Architecture
3D Position Encoder. Takes the 2D features and the 3D coordinates as input and produces 3D position-aware features

  • The 2D features are encoded with a [1x1 Conv]
  • The 3D coordinates are encoded with [FC-ReLU-FC]
  • The two are added to obtain features that reflect 3D position, which are then flattened
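The encoder above can be sketched roughly as below. The channel sizes, parameter names, and random weights are illustrative assumptions only; the FC layers are written as per-pixel linear maps (equivalent to 1x1 convolutions) so that everything stays on the feature-map grid.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) map: a linear layer applied per pixel."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def position_encoder(feat2d, coords3d, params):
    """feat2d: (C, H, W); coords3d: (C_coord, H, W) -> (H*W, C_out) tokens."""
    f = conv1x1(feat2d, params["w_feat"], params["b_feat"])      # [1x1 Conv]
    hidden = np.maximum(
        conv1x1(coords3d, params["w1"], params["b1"]), 0.0)     # FC + ReLU
    pe = conv1x1(hidden, params["w2"], params["b2"])            # FC
    fused = f + pe                                               # add 3D PE
    return fused.reshape(fused.shape[0], -1).T                   # flatten

rng = np.random.default_rng(0)
c_in, c_coord, c_out, h, w = 16, 12, 8, 4, 5  # illustrative sizes only
params = {
    "w_feat": rng.normal(size=(c_out, c_in)), "b_feat": np.zeros(c_out),
    "w1": rng.normal(size=(c_out, c_coord)), "b1": np.zeros(c_out),
    "w2": rng.normal(size=(c_out, c_out)), "b2": np.zeros(c_out),
}
feat2d = rng.normal(size=(c_in, h, w))
coords3d = rng.uniform(size=(c_coord, h, w))
tokens = position_encoder(feat2d, coords3d, params)
print(tokens.shape)  # (20, 8)
```

Each row of `tokens` is one image location whose feature now carries its 3D position, ready to serve as key and value in the decoder.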

5. Cross-attention with the object queries

Query Generator in Overall Architecture

  • Query Generator: generates the object queries
  • First, learnable anchor points are initialized uniformly in 3D space
  • The anchors' 3D coordinates are passed through a small MLP to produce the object queries
  • The generated queries then cross-attend to the 3D-aware features
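As a rough sketch of the query generator, the snippet below draws anchor points uniformly in [0, 1]^3 and passes them through a two-layer MLP. The number of queries, the embedding size, and the random weights are assumptions for illustration; in a real model the anchors and MLP weights would be learned.

```python
import numpy as np

def make_queries(anchors, w1, b1, w2, b2):
    """anchors: (N, 3) in [0, 1] -> object queries (N, C) via a 2-layer MLP."""
    hidden = np.maximum(anchors @ w1 + b1, 0.0)  # linear + ReLU
    return hidden @ w2 + b2                      # second linear layer

rng = np.random.default_rng(0)
num_queries, embed_dim = 6, 8  # illustrative sizes only
# Uniform initialization of the anchor points (learnable in a real model).
anchors = rng.uniform(0.0, 1.0, size=(num_queries, 3))
w1, b1 = rng.normal(size=(3, embed_dim)), np.zeros(embed_dim)
w2, b2 = rng.normal(size=(embed_dim, embed_dim)), np.zeros(embed_dim)
queries = make_queries(anchors, w1, b1, w2, b2)
print(queries.shape)  # (6, 8)
```

Because each query is derived from a 3D anchor, it lives in the same coordinate frame as the position embedding added to the image features.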

6. The generated queries, together with the 3D position-aware features as key and value, are fed into a DETR-style transformer decoder to predict the final 3D boxes
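The core of this decoding step is cross-attention between queries and image tokens. The sketch below shows a single-head scaled dot-product version with the tokens used directly as key and value; it deliberately omits the learned projections, multiple heads, self-attention, and feed-forward layers of a full decoder, and the input sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, tokens):
    """queries: (N, C); tokens: (M, C) used as both key and value."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    attn = softmax(queries @ tokens.T * scale)  # (N, M), rows sum to 1
    return attn @ tokens, attn

rng = np.random.default_rng(0)
queries = rng.normal(size=(6, 8))   # hypothetical object queries
tokens = rng.normal(size=(20, 8))   # hypothetical 3D position-aware features
updated, attn = cross_attention(queries, tokens)
print(updated.shape, attn.shape)  # (6, 8) (6, 20)
```

Each query ends up as a weighted mix of tokens from all views at once, which is what lets a single query gather evidence across cameras.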

Positional Embedding Analysis

The red point is a randomly sampled point, and the six images on the right show its similarity to every other position. The more yellow a region is, the higher the similarity

  • The position embedding at a random location in the front view is compared with the embeddings in every view
  • Similarity is high in nearby regions
  • The positional embedding thus establishes correlations between different views
  • This suggests that the proposed positional embedding can support good performance
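A similarity map of this kind could be computed along the following lines. Cosine similarity is an assumption of this sketch (the post does not state the exact measure), and the random array stands in for real position embeddings.

```python
import numpy as np

def pe_similarity(pe, view, row, col):
    """pe: (V, H, W, C) position embeddings -> cosine similarity map (V, H, W)."""
    unit = pe / np.linalg.norm(pe, axis=-1, keepdims=True)
    # Dot product of every unit embedding with the selected one.
    return unit @ unit[view, row, col]

rng = np.random.default_rng(0)
pe = rng.normal(size=(6, 4, 5, 8))  # stand-in for 6 views of 4x5 embeddings
sim = pe_similarity(pe, view=0, row=1, col=2)
print(sim.shape)  # (6, 4, 5)
```

Plotting `sim[v]` as a heatmap for each view `v` would give one panel per camera, with the selected location itself at similarity 1.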

Experiments

The model is evaluated on the nuScenes dataset. It takes six multi-view images as input and detects the location and class of 3D objects.

  • ∗ - trained with external data, ‡ - test time augmentation
  • PETR achieves the best NDS, mAP, and mAVE among the compared state-of-the-art models
  • Outperforming DETR3D demonstrates the effectiveness of the proposed position embedding

Ablation Study

Using the 3D PE alone clearly outperforms using the 2D and MV PEs, and using all of them together gives the best performance

  • 2D – 2D Position Embedding
  • MV – Multi-view Position Embedding
  • 3D – 3D Position Embedding

  • Using a 1x1 conv has the same effect as using an MLP
  • A 3x3 conv is meant to aggregate features related to neighboring coordinate information, but its results are poor
  • This indicates that semantic information matters little for representing position; what matters is that each individual coordinate value is represented on its own.
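To make the 1x1 versus 3x3 distinction concrete, the toy experiment below perturbs the coordinates at a single pixel and counts how many output positions change. The naive convolution, the sizes, and the random weights are assumptions for illustration and say nothing about accuracy; they only show that a 1x1 conv keeps each position's embedding dependent on its own coordinates, while a 3x3 conv mixes in neighbors.

```python
import numpy as np

def conv2d_same(x, weight):
    """Naive 'same' convolution. x: (C_in, H, W); weight: (C_out, C_in, k, k), k odd."""
    c_out, _, k, _ = weight.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros((c_out, h, w))
    for i in range(h):
        for j in range(w):
            patch = xp[:, i:i + k, j:j + k]  # (C_in, k, k) window
            out[:, i, j] = np.tensordot(weight, patch, axes=3)
    return out

def changed_positions(x, x_perturbed, weight):
    """Count output pixels that differ after the input perturbation."""
    diff = conv2d_same(x_perturbed, weight) - conv2d_same(x, weight)
    return int(np.any(np.abs(diff) > 1e-9, axis=0).sum())

rng = np.random.default_rng(0)
coords = rng.uniform(size=(3, 5, 5))  # toy coordinate map
perturbed = coords.copy()
perturbed[:, 2, 2] += 1.0             # move one position only

n_1x1 = changed_positions(coords, perturbed, rng.normal(size=(4, 3, 1, 1)))
n_3x3 = changed_positions(coords, perturbed, rng.normal(size=(4, 3, 3, 3)))
print(n_1x1, n_3x3)  # 1 9
```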