
[Paper Review] HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting (CVPR 2024)

DrawingProcess 2024. 11. 17. 05:55
💡 This post summarizes the paper 'HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting' (CVPR 2024).
The paper tackles text-driven 3D human generation, that is, producing a realistic 3D human from a text prompt. It builds on Gaussian Splatting rather than NeRF, and its distinguishing features are Structure-Aware SDS and Annealed Negative Prompt Guidance.

 - Project: https://alvinliu0.github.io/projects/HumanGaussian
 - Paper: https://arxiv.org/abs/2311.17061
 - Github: https://github.com/alvinliu0/HumanGaussian

Abstract

Generating realistic 3D humans from text prompts is still a challenging task. Existing methods optimize 3D representations such as meshes or neural fields via score distillation sampling (SDS), which suffers from inadequate fine details or excessive training time.

This paper proposes HumanGaussian, an efficient yet effective framework that generates high-quality 3D humans with fine-grained geometry and realistic appearance. The key insight is that 3D Gaussian Splatting is an efficient renderer with periodic Gaussian shrinkage or growth, and that this adaptive density control can be naturally guided by intrinsic human structures.

Specifically, 1) the authors first propose Structure-Aware SDS, which simultaneously optimizes human appearance and geometry. A multi-modal score function over both RGB and depth space is used to distill the Gaussian densification and pruning process. 2) They also devise Annealed Negative Prompt Guidance by decomposing SDS into a noisier generative score and a cleaner classifier score, which effectively addresses the over-saturation issue. Floating artifacts are further removed based on Gaussian size in a prune-only phase, which improves generation smoothness. Extensive experiments demonstrate the framework's superior efficiency and competitive quality, rendering vivid 3D humans across diverse scenarios.
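To make the "generative score plus classifier score" decomposition concrete, here is a minimal NumPy sketch. It assumes one common classifier-free guidance convention, in which the guided noise prediction is `eps_cond + s * (eps_cond - eps_uncond)`; the function name, array shapes, and guidance scale are hypothetical, and the random arrays merely stand in for the outputs of a real diffusion model. It is not the authors' implementation and omits the annealing and negative-prompt parts.

```python
import numpy as np

def decompose_sds_residual(eps_cond, eps_uncond, noise, guidance_scale):
    """Split the guided SDS residual into two terms (hypothetical helper)."""
    # Generative term: conditional prediction minus the injected noise (noisier).
    generative = eps_cond - noise
    # Classifier term: scaled conditional-vs-unconditional difference (cleaner).
    classifier = guidance_scale * (eps_cond - eps_uncond)
    return generative, classifier

rng = np.random.default_rng(0)
shape = (4, 8, 8)                        # assumed latent shape, for illustration only
noise = rng.standard_normal(shape)       # noise added to the rendered image
eps_cond = rng.standard_normal(shape)    # stand-in for the text-conditioned prediction
eps_uncond = rng.standard_normal(shape)  # stand-in for the unconditional prediction
scale = 7.5                              # assumed guidance scale

generative, classifier = decompose_sds_residual(eps_cond, eps_uncond, noise, scale)

# Under the assumed convention, the two terms add up to the usual SDS residual.
guided = eps_cond + scale * (eps_cond - eps_uncond)
residual = guided - noise
```

Because the two terms are separate arrays, they can be weighted differently during optimization, which is the degree of freedom the decomposition exposes.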

The main contributions of this paper can be summarized as follows.

  •  

Related Works

3D Neural Representations

Various 3D representations, such as voxels, point clouds, meshes, and neural fields, have been proposed for modeling spatial geometry and texture. Different representations are chosen depending on the problem setting, to balance training time, memory efficiency, rendering capability, and network compatibility:

  • 1) Voxel: a Euclidean representation that stores scene information in a grid. It is easy to use with CNNs [9, 46, 84], but its cubic computation cost limits rendering resolution.
  • 2) Point cloud: a discrete set of points sampled from a 3D surface. It is efficient to render [38, 57, 58], but its discontinuous nature prevents it from capturing fine-grained details.
  • 3) Mesh: a compact representation that expresses the connectivity between edges and faces. It inherits time efficiency from well-established graphics pipelines [16, 78, 81], but struggles to produce accurate topology.
  • 4) Neural field: an implicit function, parameterized by a neural network, of the properties at each 3D location. It can model complex structures at arbitrary resolution [41, 49, 53, 83, 85], but optimization and inference are slow. Recently, 3D Gaussian Splatting (3DGS) [31, 45] has shown impressive results in 3D reconstruction, surpassing previous representations with better quality and faster convergence.
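The cubic cost noted for voxels in the list above is easy to see with a little arithmetic. The sketch below is only an illustration: the helper name, the 4 channels per voxel, and the 4 bytes per value are assumptions, not figures from the paper.

```python
def dense_voxel_grid_bytes(resolution, channels=4, bytes_per_value=4):
    """Memory of a dense resolution^3 grid (hypothetical helper).

    channels=4 (e.g. RGB + density) and float32 storage are assumptions.
    """
    return resolution ** 3 * channels * bytes_per_value

for resolution in (128, 256, 512):
    mib = dense_voxel_grid_bytes(resolution) / 2**20
    print(f"{resolution}^3 grid: {mib:.0f} MiB")
```

Each doubling of the resolution multiplies the memory by eight, so under these assumptions the grid grows from 32 MiB at 128^3 to 2 GiB at 512^3.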

This work aims to unlock the potential of 3D Gaussian Splatting for the challenging task of text-driven 3D human generation.

Text-to-3D Generation

Recent diffusion-based text-to-3D work can be grouped into two types:

  • 1) 3D native pipelines: these directly capture the distribution of 3D data [28, 44, 51] or of reconstructed intermediate features [5, 8, 15, 52] in a specific domain. Some recent works [2, 21] scale up model capacity by training on large-scale 3D datasets such as Objaverse [10], but they are still limited to single objects.
  • 2) Optimization-based 2D lifting pipelines: these optimize a 3D scene representation on a per-prompt basis by distilling the rich prior learned in the 2D domain. For example, some early attempts use CLIP guidance [59] to enforce multi-view image-text alignment, while recent methods rely on score distillation sampling (SDS) to inherit the unprecedented rendering quality of state-of-the-art text-to-image models [6, 47, 56, 77, 80]. Notably, the heavy computational burden of NeRF requires long training times, which motivates concurrent works that adapt the Gaussian Splatting representation to text-to-3D generation [7, 74, 86].
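As a rough illustration of the SDS idea used by the 2D lifting pipelines above, the following toy sketch optimizes an "image" directly. Everything here is a simplifying assumption: the renderer is the identity, the weighting is 1, the noise level is fixed, and `toy_noise_predictor` is a hypothetical stand-in that already knows the target image, where a real pipeline would query a pretrained text-to-image diffusion model.

```python
import numpy as np

def toy_noise_predictor(x_t, target, alpha_bar):
    # Hypothetical stand-in for a diffusion model: it predicts the noise
    # as if the clean image were exactly `target`.
    return (x_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)

def sds_step(theta, target, rng, alpha_bar=0.5, lr=0.1):
    x = theta                                  # identity "renderer", so dx/dtheta = I
    noise = rng.standard_normal(x.shape)       # sample the injected noise
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * noise
    eps_hat = toy_noise_predictor(x_t, target, alpha_bar)
    grad = eps_hat - noise                     # SDS gradient with weight w(t) = 1
    return theta - lr * grad

rng = np.random.default_rng(0)
target = np.ones((8, 8))    # what the toy "prior" considers a good image
theta = np.zeros((8, 8))    # parameters being optimized

initial_error = np.abs(theta - target).mean()
for _ in range(100):
    theta = sds_step(theta, target, rng)
final_error = np.abs(theta - target).mean()
```

In this toy setting the residual `eps_hat - noise` is proportional to `theta - target`, so each step pulls the parameters toward what the prior prefers. With a real diffusion model and a real 3D renderer, the same residual is backpropagated through the renderer to the scene parameters instead.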

This work also chooses 3D Gaussians for their efficiency and effectiveness, but focuses on the text-driven 3D human domain, which demands both fine-grained detail capture and realistic texture generation.

Text-Driven 3D Human Generation

 

Methods

 
