
[Paper Review] D-NeRF: Neural Radiance Fields for Dynamic Scenes (CVPR 2021)

DrawingProcess 2024. 11. 26. 12:58
💡 This post summarizes the paper 'D-NeRF: Neural Radiance Fields for Dynamic Scenes (CVPR 2021)'.
The original NeRF was evaluated on static datasets. When it is trained on dynamic, moving objects, the renderings come out heavily blurred. D-NeRF addresses this weakness by adding a time variable t for each frame, so the model takes a 6D input instead of NeRF's 5D input. Unlike NeRF, it also adds a new network (the Deformation network) so that moving objects are represented more faithfully.

 - Project: https://www.albertpumarola.com/research/D-NeRF/index.html
 - Paper: https://arxiv.org/abs/2011.13961
 - Github: https://github.com/albertpumarola/D-NeRF
 - Dataset: https://www.dropbox.com/scl/fi/cdcmkufncwcikk1dzbgb4/data.zip

Introduction

The original NeRF was evaluated on static datasets, and training it on dynamic (moving) objects produces heavily blurred results. D-NeRF addresses this by using the frame time t as an extra variable, turning NeRF's 5D input (3D position plus 2D viewing direction) into a 6D input.

The 6D input is not fed directly into a single network. Instead, learning is split into two modules: the first learns the mapping between a point at time t and the corresponding point in the canonical scene configuration, and the second predicts radiance and volume density in the canonical scene.

** canonical scene configuration
The term 'canonical scene configuration' comes up often. It can be thought of as a reference scene, specifically the scene at t=0. In other words, the point mapping above measures how far a point at t=a has been transformed from its position at t=0.

According to the paper, splitting learning into the canonical scene and the scene flow lets D-NeRF render high-quality images while controlling both the camera view and the time component (= the dynamic state of the scene).

The main contributions of the paper can be summarized as follows.

  • Applies neural radiance fields to the dynamic domain.
  • Uses time as an additional input to render object movement.
  • Splits learning into a Deformation network and a Canonical network for efficient rendering.

Methods

Model network

D-NeRF model framework

Unlike NeRF, D-NeRF is not a single MLP; it consists of two MLP networks, the Deformation network and the Canonical network. Also unlike NeRF, the time variable t of each frame is used as an input. Let's look at each network in detail.

1. Deformation network

The network (Ψ_t in the paper's notation) that maps between the scene at time instant t and the canonical scene.

D-NeRF visualization

Before describing the Deformation network, let's define the canonical space. Put simply, the canonical space is the frame with the representative pose, that is, the frame that shows the object in its reference configuration. The paper sets it to the first frame (t=0).

The goal of the Deformation network is to predict how far the coordinates in the input frame at an arbitrary time t are displaced from the representative pose (t=0). As in the model architecture shown above, the Deformation network takes (x,y,z,t) as input and outputs the displacement (Δx,Δy,Δz).
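Written out, the Deformation network can be read as the piecewise map below. This is a sketch of the definition as I read it from the paper; the symbol Ψ_t and the explicit zero case at t=0 follow the paper's notation and are not shown anywhere in this post.

```latex
\Psi_t(\mathbf{x}, t) =
\begin{cases}
\Delta\mathbf{x}, & t \neq 0 \\
\mathbf{0}, & t = 0
\end{cases}
\qquad
\mathbf{x} = (x, y, z), \quad \Delta\mathbf{x} = (\Delta x, \Delta y, \Delta z)
```

The zero case simply says that a point already in the canonical frame needs no correction.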

The visualization above includes an x+Δx panel. Points with the same color correspond to the same point on the object, and the same colors appear at almost the same locations. In other words, even when the object moves, the Deformation network estimates the displacement well enough to map each point to the same point on the object regardless of the motion.


Deformation network model code

+) Looking at the code, positional embedding is applied to all of the Deformation network inputs (x,y,z,t). As with (x,y,z), L=10 is applied to t, which expands its dimension to 20 (21 if the raw input itself is included).
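The dimension count above can be checked with a small sketch. The function names here are hypothetical, the feature ordering is chosen for readability and is not the repository's layout, and the frequencies are plain powers of two (whether an extra factor of π is applied is an assumption left out here).

```python
import math


def positional_encoding(value, num_freqs=10, include_input=False):
    """Encode one scalar as [sin(2^k v), cos(2^k v)] for k = 0 .. num_freqs - 1."""
    features = [value] if include_input else []
    for k in range(num_freqs):
        freq = 2.0 ** k
        features.append(math.sin(freq * value))
        features.append(math.cos(freq * value))
    return features


def encode_vector(values, num_freqs=10, include_input=False):
    """Encode each component independently and concatenate the results."""
    encoded = []
    for v in values:
        encoded.extend(positional_encoding(v, num_freqs, include_input))
    return encoded


t_plain = positional_encoding(0.3)                         # 2 * 10 features
t_with_input = positional_encoding(0.3, include_input=True)
xyz_plain = encode_vector([0.1, 0.2, 0.3])                 # 3 * 2 * 10 features
print(len(t_plain), len(t_with_input), len(xyz_plain))
```

Each scalar contributes 2L features, so t alone gives 20 (21 with the raw value), and (x,y,z) gives 60 (63 with the raw values).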

The structure is otherwise the same as the original NeRF. The differences are in the intermediate density output and in the last layer, which is not 128 units wide.

2. Canonical network

The network (Ψ_x in the paper's notation) that represents the scene in its canonical configuration.

Deformation network and Canonical network

The Canonical network takes a point (x,y,z) that has been mapped into the canonical space (t=0) and outputs (rgb, density). It has the same structure as the NeRF model, with a small change to the input.

To map (x,y,z) to its t=0 coordinates, the predicted displacement is added to give (x+Δx, y+Δy, z+Δz), and this coordinate is fed in together with the viewing direction (θ, φ). The resulting output is then used to render the point (x,y,z).
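The way the two networks chain together can be sketched with toy stand-ins. Both functions below are hypothetical replacements for the trained MLPs: the "deformation" is a made-up drift along +x and the "canonical field" is a solid unit sphere, chosen only so the behavior is easy to follow.

```python
import math


def toy_deformation(point, t):
    """Hypothetical stand-in for the Deformation network: returns (dx, dy, dz)."""
    if t == 0.0:
        return (0.0, 0.0, 0.0)      # the canonical frame needs no correction
    # Made-up motion: the scene drifts along +x at speed 0.5, so the offset
    # back to the canonical position is -0.5 * t.
    return (-0.5 * t, 0.0, 0.0)


def toy_canonical_field(point, view_dir):
    """Hypothetical stand-in for the Canonical network: returns (rgb, density)."""
    x, y, z = point
    theta, phi = view_dir
    density = 10.0 if x * x + y * y + z * z <= 1.0 else 0.0
    # A mild view dependence, only to show where (theta, phi) enters.
    rgb = (0.5 + 0.5 * math.cos(theta), 0.5, 0.5 + 0.5 * math.sin(phi))
    return rgb, density


def query_dynamic_scene(point, view_dir, t):
    """Map the point to canonical space, then query color and density there."""
    dx, dy, dz = toy_deformation(point, t)
    canonical_point = (point[0] + dx, point[1] + dy, point[2] + dz)
    return toy_canonical_field(canonical_point, view_dir)


view = (0.0, 0.0)
_, density_t0 = query_dynamic_scene((1.2, 0.0, 0.0), view, t=0.0)
_, density_t1 = query_dynamic_scene((1.2, 0.0, 0.0), view, t=1.0)
print(density_t0, density_t1)
```

The same world-space point is empty at t=0 but occupied at t=1, because by then the toy sphere has drifted over it. Time never enters the canonical field itself; it only changes where the field is queried.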


Canonical network code

+) This is the Canonical network code, which is identical to the original NeRF code.

+) D-NeRF is an end-to-end model, so the Deformation network and the Canonical network are trained jointly. NeRF's coarse and fine networks are used as well: D-NeRF is evaluated once (coarse network), and points resampled from that result are fed into D-NeRF again (fine network).
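The resampling step between the coarse and fine passes can be sketched as inverse-CDF sampling over the coarse weights. This is a simplified, deterministic version with a hypothetical function name; NeRF-style code typically draws random or stratified samples instead of the evenly spaced ones used here.

```python
def sample_fine_depths(bin_edges, coarse_weights, num_fine):
    """Place fine samples along a ray in proportion to the coarse weights.

    bin_edges holds len(coarse_weights) + 1 depth values bounding each bin.
    """
    eps = 1e-5  # keeps the CDF well defined even if every weight is zero
    total = sum(w + eps for w in coarse_weights)
    cdf = [0.0]
    for w in coarse_weights:
        cdf.append(cdf[-1] + (w + eps) / total)

    fine_depths = []
    for i in range(num_fine):
        u = (i + 0.5) / num_fine  # evenly spaced in (0, 1), assumed for determinism
        b = 0
        # Find the bin whose CDF interval contains u.
        while b < len(coarse_weights) - 1 and cdf[b + 1] < u:
            b += 1
        frac = (u - cdf[b]) / (cdf[b + 1] - cdf[b])
        fine_depths.append(bin_edges[b] + frac * (bin_edges[b + 1] - bin_edges[b]))
    return fine_depths


# The coarse pass found all of the weight in the middle bin [3, 4].
depths = sample_fine_depths([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 0.0], num_fine=8)
print(depths)
```

All eight fine samples land inside the middle bin, which is where the second D-NeRF pass would then spend its queries.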

3. Volume rendering

Volume rendering works the same way as in the original NeRF, except that it uses the color and density obtained after each point on the camera ray has been mapped to the canonical space by the Deformation network.
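A minimal sketch of the NeRF-style discrete compositing along one ray follows. It assumes the colors and densities passed in were already obtained from the Canonical network after the deformation step; the function name and the large final interval are illustrative choices.

```python
import math


def composite_ray(depths, colors, sigmas):
    """Blend per-sample colors into one pixel color using alpha compositing."""
    pixel = [0.0, 0.0, 0.0]
    weights = []
    transmittance = 1.0  # fraction of light that has not been absorbed so far
    for i, (rgb, sigma) in enumerate(zip(colors, sigmas)):
        # Distance to the next sample; the last interval is treated as very long.
        delta = depths[i + 1] - depths[i] if i + 1 < len(depths) else 1e10
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return pixel, weights


red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
# Empty space, then an opaque green surface, then something hidden behind it.
pixel, weights = composite_ray([1.0, 2.0, 3.0], [red, green, blue], [0.0, 1e9, 5.0])
print(pixel, weights)
```

The pixel comes out pure green: the first sample is empty, the second absorbs everything, and the third is fully occluded. The per-sample weights are also what the coarse-to-fine resampling uses.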

Training minimizes the mean squared error between the rendered pixel colors and the ground-truth colors.
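In symbols, the loss can be written roughly as below. This is my reading of the paper's equation: N_s is the number of sampled pixels (rays) in a batch, Ĉ is the ground-truth color, and C' is the rendered color of the pixel p at time t. The per-sample index i is added here for clarity and is an assumption about the notation.

```latex
\mathcal{L} = \frac{1}{N_s} \sum_{i=1}^{N_s}
\left\lVert \hat{C}(p_i, t_i) - C'(p_i, t_i) \right\rVert_2^2
```

Since both networks sit inside the rendering of C', this single loss trains them jointly.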

Discussion

In the real world, scenes to be rendered are dynamic far more often than static, so representations that account for time seem bound to become more important.

However, this study was also carried out on a dataset captured in a fixed space with well-defined camera parameters and timestamps. If there is work that does 4D rendering from monocular video captured in real conditions, I plan to study it further.
