[Paper Review] CLIP: Vision-Language Multimodal Dataset - Learning Transferable Visual Models From Natural Language Supervision

DrawingProcess 2024. 3. 31. 16:37
💡 This post summarizes the paper 'CLIP: Learning transferable visual models from natural language supervision'.
For reference, the work linked below embeds the language embeddings of a multimodal model such as CLIP into NeRF and shows that NeRF can be extended to multiple modalities.
 - Project: https://www.lerf.io/
 - Paper: https://arxiv.org/abs/2303.09553
 - Github: https://github.com/kerrj/lerf
 - Dataset: https://drive.google.com/drive/folders/1vh0mSl7v29yaGsxleadcj-LCZOE_WEWB

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. Adding another category requires additional data and labeling, so this restricted form of supervision limits generality and usability. Learning directly from raw text about images is therefore an alternative that offers a much broader source of supervision.

The main contributions of this paper can be summarized as follows.

  1. It builds a large-scale dataset (400M text-image pairs) of the kind previously used for text models, and uses it to train an image model efficiently.
  2. It uses contrastive learning to learn a meaningful text-to-image relationship. In other words, it makes multimodal learning possible.
  3. It dramatically improves zero-shot transfer performance on downstream tasks.

The additional contributions confirmed through the experiments are listed below. These are really sub contributions rather than main contributions.

  1. The linear probe (logistic regression) results show that representation learning worked reasonably well.
  2. The trained model is robust to domain shift.
  3. The dataset was not simply collected blindly: the authors also analyzed the dataset overlap problem and showed that the performance gains are not the result of overfitting.

0. What is modality?

Before introducing the paper, let us first pin down what 'multimodal' means by looking at the word 'modality'. The term can be read broadly or narrowly. In the context of probability distributions, a mode is a peak (local maximum) of a probability density function, and a multimodal distribution is one with two or more distinct peaks.

That, however, is only the mathematical reading under the assumption of a probability distribution. What we usually mean in deep learning is much closer to the senses (sensory modality).

Human senses come in several modalities (loosely, forms or shapes) such as hearing, touch, and vision, and the region where these senses interact and meet is what people use as the basis for reasoning. In other words, multimodality can be a way of communicating, and in evolutionary psychology it also serves as a criterion for particular behavior patterns.

For example, the verbal (language) modality has the lexicon, which describes the meaning of each word; syntax (grammar), which is part of speech and expresses how the words relate to one another; and pragmatics, which takes into account the many factors that affect meaning, such as the relationship between the people involved in a context and the time and place.

Beyond language, the visual modality includes gesture, body posture, and proxemics (body language defined through a person's relationship to their social and cultural environment and to space). Eye contact and facial expressions between people can also be seen as forms of this modality.

So far we have mostly looked at the bodily organs that gather information and the components related to how information is conveyed. A narrower definition of modality follows.

Modality in computational environment

A modality is the way or means by which something is experienced or happens. In a computing environment, a modality therefore refers to a particular form of information, and can be understood as the representative form in which information is stored. For example, when an image is stored on a computer, 'Image' itself can be regarded as a modality, or we can classify more finely and distinguish modalities by storage and compression format, such as 'jpeg', 'bmp', and 'png'.

In line with the sensory modality tied to human senses and the probability modality grounded in probability distributions seen above, a modality (or medium, media) means storing and communicating information through some method or instrument. That is, it can be understood as a system of communication.

The world contains social and cultural platforms recognized as communities of many kinds, and a modality is a classification of the information or ways of communicating that are defined on them. Natural language (spoken or written), visual (images, videos), and auditory (voice, sounds and music), the ones most relevant to deep learning, each belong to this classification, and multimodal here means using two or more modalities together to solve a task. This covers not only the familiar tasks that combine text and image (modalities of entirely different character), but anything where the modalities can be told apart (e.g. a human voice + an instrument performance).

1. Learning Transferable Visual Models From Natural Language Supervision

The reason for the long detour on multimodality before starting the review is that one of the main contributions of this paper is its focus on finding 'a meaningful relationship between text and image'. The paper runs to a full 48 pages and is so well known that many blogs have reviewed it, and a YouTube search turns up plenty of lecture videos. What I want to focus on in this post is the strengths and weaknesses of the paper and its methodology, and its contributions.

Progress in NLP tasks and the limitations of CV tasks

Deep learning for computer vision and for natural language processing (NLP) developed in different directions. NLP started out with recurrent neural networks, developed with CNNs (Convolutional Neural Networks) as the baseline, but the introduction of the attention mechanism and the transformer greatly improved performance on machine translation tasks, and the GPT and BERT models built on the transformer's encoder and decoder structures then greatly improved performance on many downstream tasks.

One of the biggest reasons such large language models could develop, unlike in computer vision, was the ease of collecting datasets. Solving an image task usually means defining a suitable domain, restricting the distribution for that domain, and then collecting and labeling data to match. For example, to classify dogs, you gather photos of the dogs you want to classify and match each photo to a dog breed.

This is where the limits of image dataset collection show. If there are thousands or tens of thousands of breeds to classify, each needs its own index label, and above all the consistency and quality of the dataset distribution cannot be guaranteed: the background may take up more of the photo than the dog (object size varies from photo to photo), or a photo may contain several dogs. Moreover, as in the figure above, after working hard to collect data and build a dataset with a matched distribution, if you additionally need to recognize 'Shiba Inu', you have to build the dataset for that class from scratch.

This is very different from NLP models, which can simply acquire large amounts of data from the web and use the acquired text as token-level supervision. In computer vision, the process of building a meaningful dataset that accounts for both quality and distribution is called 'gold labeling', and without such gold labels it is hard to expect good deep learning performance on computer vision tasks.
Another difference between NLP and CV is zero-shot and few-shot performance on new tasks. As described above, language models can build large datasets from the web, and unlike images, whose distribution is nearly unbounded (a continuous signal), language has most of its forms and variations contained within the vocabulary and sentences that humans use (a discrete signal). A large network trained on large-scale data can therefore adapt easily to other tasks.

Restricted form of supervision

The paper presents this limitation of SOTA computer vision systems (they can only learn from data organized by predefined object categories, and no other approach is available at the data collection stage) as the problem to be solved. Unless it is solved, CV will never obtain the generality (generalization performance) and usability (applicability to a variety of downstream tasks) that NLP models have.
The CLIP paper therefore uses raw text alongside images during training: rather than learning from image supervision alone, it connects the text representation and the image representation through their relationship with text prompts.
This also addresses the data collection limits of the existing computer vision approach. Instead of collecting images from the web and then assigning each to a class category, one can start from text prompts that are frequently searched or repeated on the web and retrieve images that depict them, which makes training possible without building a dataset around a domain distribution. For example, if 'cute cat' is one of the terms frequently mentioned on Wikipedia, all that is needed is to train the model to associate the images crawled for that text with the text 'cute cat'.

The figure above compares this with the existing approach. The step of defining a task drops out of dataset construction, and a step of choosing text prompts for joint image/text learning and collecting data based on them is added. Because the data is collected per text prompt, no extra labeling step is needed, unlike conventional task based dataset collection.

Task-robust representation learning with CLIP

์‚ฌ์‹ค ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ์‚ฌ์ด์˜ ์œ ๊ธฐ์ ์ธ ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜๊ณ ์ž ํ•œ ์—ฐ๊ตฌ๋Š” CLIP ์ด์ „์—๋„ ์กด์žฌํ–ˆ์—ˆ๋‹ค. ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์œผ๋กœ ์ ‘๊ทผํ–ˆ๋˜ ๋ฐฉ์‹์œผ๋กœ๋Š” content based image retrieval(ํŠน์ • ๋ฌธ์„œ์™€ ํ•จ๊ป˜ ์ฃผ์–ด์ง„ ์ด๋ฏธ์ง€๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ nouns, adjectives๋ฅผ ์˜ˆ์ธกํ•˜๋Š” task), caption prediction with manifold learning ๊ทธ๋ฆฌ๊ณ  multimodal deep Boltzmann machine์„ ํ™œ์šฉํ•œ low level image์™€ text tag feature์˜ ํ•™์Šต์ด ์žˆ๋‹ค. ์ดํ›„์—๋„ CNN based approach๋‚˜ Transformer based approach๋กœ ์ ‘๊ทผํ•œ deep learning ๊ธฐ๋ฐ˜์˜ ๋ฐฉ๋ฒ•๋“ค์ด ์ œ์‹œ๋˜์—ˆ์ง€๋งŒ, ์—ฌ์ „ํžˆ ‘image representation learning‘์ด๋ผ๋Š” ๊ด€์ ์—์„œ natural language ๋ฐฉ์‹๋“ค์€ zero-shot learning ์„ฑ๋Šฅ์„ ๋Œ์–ด์˜ฌ๋ฆฌ์ง€ ๋ชปํ•˜๊ณ  ์žˆ์—ˆ๋‹ค. ์ด๋ ‡๋“ฏ CLIP ์ด์ „์˜ deep learning ๋ฐฉ์‹๋“ค์€ ๋Œ€๋ถ€๋ถ„ JFT-300M dataset๊ณผ ๊ฐ™์ด ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ์‹์ด๋ผ๋˜์ง€, ๊ด€๋ จ๋œ prompt๋ฅผ instagram hashtag๋กœ ์‚ฌ์šฉํ•˜์—ฌ ๋Œ€์šฉ๋Ÿ‰์˜ dataset์œผ๋กœ ํ•™์Šตํ•˜๋Š” ๋“ฑ ์•ž์„œ ์„ค๋ช…ํ–ˆ๋˜ ๊ธฐ์กด computer vision ๋ฐฉ์‹์˜ gold-label๊ณผ unlimited amounts of text prompt ๊ธฐ๋ฐ˜์œผ๋กœ ์ˆ˜์ง‘ํ•œ ๋ฐ์ดํ„ฐ์…‹ ์ค‘๊ฐ„ ๊ณผ์ •์— ๋†“์ธ ์—ฐ๊ตฌ๋“ค์ด๋‹ค. ์—ฌ์ „ํžˆ ํ•œ์ •๋œ supervision์„ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ, ๊ณ ์ „์ ์ธ softmax classifier์— ์˜์กดํ•œ ํ•™์Šต ๋•Œ๋ฌธ์— representation์˜ ์œ ์—ฐํ•œ ํ•™์Šต๊ณผ zero-shot ์„ฑ๋Šฅ์„ ์ œํ•œํ•˜๋Š” ์š”์†Œ๊ฐ€ ๋˜์—ˆ๋‹ค.
์ด๋Ÿฌํ•œ weakly-supervised method์™€๋Š” ๋‹ค๋ฅด๊ฒŒ VirTexICMLM ๊ทธ๋ฆฌ๊ณ  CLIP ๋…ผ๋ฌธ์ด ๊ฐ€์žฅ ํฐ insight๋ฅผ ์–ป์—ˆ๋˜ ConVIRT์˜ ๊ฒฝ์šฐ์—๋Š” ์ด ๋…ผ๋ฌธ์˜ ๋ฐฉํ–ฅ์„ฑ๊ณผ ๋น„์Šทํ•˜๊ฒŒ language ์ •๋ณด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ image representation์„ ํ•™์Šตํ•œ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Ÿฌํ•œ ์—ฐ๊ตฌ๋“ค์—์„œ๋Š” ๋Œ€์šฉ๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด๋ณด์ง€ ์•Š์•˜๋‹ค๋Š” ์ ์—์„œ CLIP ์ €์ž๋“ค์ด ConVIRT(์ฐธ๊ณ ๋กœ ConVIRT๋Š” medical diagnosis์™€ ๊ด€๋ จ๋œ task๋ผ ๋ฐ์ดํ„ฐ์…‹์ด ํ•œ์ •์ ์ด๋‹ค)์˜ ์ปจ์…‰์„ ๊ทธ๋Œ€๋กœ ์ด์šฉํ•˜๋˜, ๋Œ€๋Ÿ‰์˜ WebImageText dataset์„ ํ™œ์šฉํ•˜์—ฌ() scratch๋ถ€ํ„ฐ contrastive representation learning์„ ์ง„ํ–‰ํ•œ ๊ฒƒ์ด๋‹ค.
ํ•™์Šต ๊ณผ์ •์—์„œ ์ €์ž๋“ค์€ ๊ฐ€์žฅ ํšจ์œจ์ ์ธ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•œ ๋ฐฉ๋ฒ•๋ก ์„ ํƒ์ƒ‰ํ–ˆ์œผ๋ฉฐ, ๊ฒฐ๋ก ์ ์œผ๋กœ๋Š” ์ž์—ฐ์–ด ๋ชจ๋ธ์ธ GPT ์‹œ๋ฆฌ์ฆˆ์™€ ๊ฐ™์ด ๋‹ค์–‘ํ•œ task์— ์ ์šฉ ๊ฐ€๋Šฅํ•œ ๋„คํŠธ์›Œํฌ๋ฅผ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ์—ˆ๋‹ค.

Image-captioning baseline์ธ transformer language ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค bag of words prediction(BoW)์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ๋” ํšจ์œจ์ ์ด๊ณ , bag of words prediction๊ณผ CLIP(constrastive learning)์„ ํ•จ๊ป˜ ํ™œ์šฉํ•œ ๊ฒƒ์ด zero-shot ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ๋ฐ์— ์žˆ์–ด ์•ฝ 12๋ฐฐ์˜ ํ•™์Šต ํšจ์œจ์ด ๋‚˜ํƒ€๋‚ฌ๋‹ค๊ณ  ํ•œ๋‹ค.

Several approaches

Much of this has been covered at length already, but the part of the paper that breaks the approach into components is, in my view, the best place to understand the main contributions, so I will briefly summarize each component.

NLP supervision

๊ธฐ์กด์˜ text์™€ image pair๋ฅผ ํ•™์Šตํ–ˆ๋˜ ์—ฐ๊ตฌ๋“ค์€ ๋Œ€๋ถ€๋ถ„ unsupervised, self-supervised, weakly-supervised ๋“ฑ์˜ ๋ฐฉ๋ฒ•์„ ํ† ๋Œ€๋กœ natural language supervision์„ ์‚ฌ์šฉํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— image representation ํ•™์Šต์— text ์ •๋ณด๋ฅผ ์ง์ ‘ ์—ฐ๊ด€์ง€์„ ์ˆœ ์žˆ์—ˆ์œผ๋‚˜ NLP๊ฐ€ ๊ฐ€์ง€๋Š” ์ •๋ณด๋ฅผ ์ œ๋Œ€๋กœ ํ™œ์šฉํ•˜์ง€ ๋ชปํ–ˆ๋‹ค. ํ•˜์ง€๋งŒ NLP๋Š” training ๊ณผ์ •์—์„œ  to  mapping์ด ํ•„์š”์—†์œผ๋ฉฐ, categorized๋œ labeling์— ๋น„ํ•ด ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์ •๋ณด๊ฐ€ ๋งŽ๋‹ค๋Š” ์žฅ์ ์ด ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ด ๋…ผ๋ฌธ์—์„œ ์ง‘์ค‘ํ•˜๊ณ ์ž ํ•œ ๋‚ด์šฉ์€ NLP๋ฅผ supervision์œผ๋กœ ํ™œ์šฉํ•ด์„œ image๋ฅผ ํ†ตํ•ด text๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ํ˜•ํƒœ๋กœ image/text multimodal learning์„ ์ง„ํ–‰ํ•˜์ž๋Š” ๊ธฐ์กด concept์—์„œ ๋ฒ—์–ด๋‚˜, NLP representation์— image representation์„ ์—ฐ๊ฒฐ๋งŒ ์‹œ์ผœ์ฃผ์ž๋Š” concept์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

Dataset ์ˆ˜์ง‘

MS-COCO, Visual Genome ๊ทธ๋ฆฌ๊ณ  YFCC100M๊ณผ ๊ฐ™์€ ๋ฐ์ดํ„ฐ์…‹์€ ์ˆ˜๊ฐ€ ํ˜„์ €ํžˆ ์ ๊ฑฐ๋‚˜, ๋Œ€์šฉ๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ์…‹์ด๋ผ ํ•˜๋”๋ผ๋„ image์˜ metadata(ex. image filename)์„ ์ด์šฉํ•œ ์ •๋ณด ์ถ”์ถœ์ด ์–ด๋ ค์› ๊ธฐ ๋•Œ๋ฌธ์— ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ์…‹ ๊ตฌ์ถ•์ด ํž˜๋“ค์—ˆ๋‹ค.
๋”ฐ๋ผ์„œ NLP supervision์—์„œ ์ˆ˜์ง‘ํ•˜๋Š” web based ๋ฐฉ์‹์—์„œ motivation์„ ์–ป์–ด, ์ธํ„ฐ๋„ท ์ƒ์—์„œ ๋งŽ์ด ์‚ฌ์šฉ(Wikipedia ์—์„œ ์ตœ์†Œํ•œ 100๋ฒˆ์€ ๋“ฑ์žฅ ํ•˜๋Š” ๋‹จ์–ด)๋˜๋Š” query(text prompt)๋ฅผ augmentationํ•˜์—ฌ  ๋งŒํผ ์ค€๋น„ํ–ˆ์œผ๋ฉฐ, ๊ฐ query์— ๋งž๋Š” (image, text) pair๋ฅผ ์ตœ๋Œ€  ์”ฉ ๊ตฌ์ถ•ํ•˜์˜€๋‹ค. ๊ฒฐ๊ตญ ๋ชฉํ‘œ๋กœ ํ–ˆ๋˜ GPT-2 training dataset์ธ ‘WebText‘์˜ ์ด ๊ฐฏ์ˆ˜์— ํ•„์ ํ•˜๋Š” ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ์…‹์ธ ‘WebImageText‘๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

Selecting an efficient pre-training method

๊ทธ๋Ÿฌ๋‚˜ ๋ฐ์ดํ„ฐ์…‹ ํฌ๊ธฐ๊ฐ€ ์ปค์ง„๋งŒํผ ํ•™์Šต์— ๊ฑธ๋ฆฌ๋Š” ์‹œ๊ฐ„ ๋˜ํ•œ ๊ธธ์–ด์กŒ๊ธฐ ๋•Œ๋ฌธ์—, ๋น ๋ฅธ ์ˆ˜๋ ด์„ ์œ„ํ•œ ํšจ์œจ์ ์ธ pre-training method๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ์ค‘์š”ํ–ˆ๋‹ค. ์ €์ž๋“ค์€ ์ด ๋ถ€๋ถ„์—์„œ ์‹œํ–‰์ฐฉ์˜ค๋ฅผ ๊ฒช์€ ์—ฌ๋Ÿฌ ๋ฐฉ๋ฒ•๋“ค์— ๋Œ€ํ•ด ์–ธ๊ธ‰ํ•œ๋‹ค.

1) Jointly trained an image CNN and text transformer from scratch

This is the approach taken in the VirTex paper. With this method, however, as in the graph seen earlier (transformer based learning), convergence is slower (by about 3x) than with the simple baseline of a ResNet and a BoW encoder.

2) Bag of Words with simpler baseline

The BoW approach and the transformer approach differ only in architecture (number of parameters, CNN and transformer), however. Both remain within the same natural language supervision concept of trying to find the exact words for each image.
The problem is that this restricts the way an image can be described to a single description. For example, consider an image like the following.

The text prompt describing this photo is not just 'an image with two cats'. It can be the whole set of representations that fall in a similar region of the text embedding space, such as 'two cats that like each other' or 'two cats standing in a meadow'. In the end, contrastive learning (raising the similarity to related text embeddings and lowering the similarity to unrelated ones) gave more efficient training, and this is the training method CLIP uses.

Contrastive learning

๋งŒ์•ฝ ๊ฐœ์˜ (image, text) pair์˜ batch๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•ด๋ณด์ž. CLIP์€ ์˜ ๊ฐ€๋Šฅํ•œ (image, text) pair์— ๋Œ€ํ•œ prediction์„ ์ง„ํ–‰ํ•œ๋‹ค. ์ •๋‹ต์ด ๋˜๋Š” text prompt๋Š” batch ๋‚ด์— image๋‹น ํ•˜๋‚˜์”ฉ ์ด ๊ฐœ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—,  prediction์—์„œ ์€ positive pair๊ฐ€ ๋˜๊ณ  ๋‚˜๋จธ์ง€ ์€ negative pair๊ฐ€ ๋œ๋‹ค.

์œ„์˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด ๊ฐ sample embedding์— ๋Œ€ํ•œ symmetric cosine similarity๋ฅผ matrix๋กœ ํ‘œํ˜„ ๊ฐ€๋Šฅํ•œ๋ฐ, ์ด๋ฅผ normalizeํ•œ cross-entropy loss๋ฅผ ์ตœ์ ํ™”ํ•œ๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋œ๋‹ค. Batch ๋‹จ์œ„์—์„œ similarity๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ cross-entropy loss๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ํ˜•ํƒœ๋Š” multi-class -pair loss, InfoNCE loss ๋“ฑ์˜ ์—ฐ๊ตฌ๋กœ๋ถ€ํ„ฐ ์†Œ๊ฐœ๋˜์—ˆ๊ณ , ์ด ๋…ผ๋ฌธ์—์„œ๋„ ๊ฐ™์€ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

The InfoNCE loss (contrastive loss) is structured as follows; representative papers that use it are SimCLR and MoCo.

Formula of the InfoNCE loss used for contrastive learning in SimCLR and MoCo. The similarity between z_i and z_j is pushed up, while z_i is pushed away from all the other z_k.

Looking at the form of this loss, it closely resembles the cross-entropy loss with a softmax activation.
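Written out, the SimCLR form of this loss (often called NT-Xent) for a positive pair (i, j) looks as follows. The notation is taken from the SimCLR paper rather than from CLIP: sim(·,·) is cosine similarity, τ is the temperature, and 2N is the number of views in the batch.

$$
\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}
$$

This is a softmax cross-entropy whose logits are sim(z_i, z_k)/τ and whose target class is j, which is why the two losses look so alike.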

Training from scratch

๋ฐ์ดํ„ฐ์…‹ ๊ฐœ์ˆ˜๊ฐ€ ์ด๋ฏ€๋กœ overfitting์„ ๊ฑฑ์ •ํ•  ํ•„์š”๊ฐ€ ์—†๋‹ค. ๋”ฐ๋ผ์„œ ๋„คํŠธ์›Œํฌ๋ฅผ pre-trainingํ•˜๋Š” ๊ณผ์ •์—์„œ ๋”ฐ๋กœ representation๊ณผ ๊ด€๋ จ๋œ initializing ์—†์ด ๋ฐ”๋กœ scratch๋ถ€ํ„ฐ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜์˜€์œผ๋ฉฐ crop ์ด์™ธ์—๋Š” data augmentation์„ ์ง„ํ–‰ํ•˜์ง€ ์•Š์•˜๋‹ค. ๋˜ํ•œ multi-modal embedding space๋กœ mappingํ•˜๋Š” ๊ณผ์ •์—์„œ non-linear projection์„ ์ง„ํ–‰ํ•˜์ง€ ์•Š์•˜๋Š”๋ฐ, ์ด๋Š” self-supervised learning์—์„œ ์‚ฌ์šฉ๋˜๋˜ ๋ฐฉ์‹๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ ์—ฌ๊ธฐ์—์„œ๋Š” ํ•ด๋‹น training efficiency์™€ ๊ด€๋ จ๋œ method๋ฅผ ์‚ฌ์šฉํ•  ํ•„์š”๊ฐ€ ์—†์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์—ฐ์‚ฐ๋Ÿ‰์„ ์ค„์ธ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค. ๊ทธ๋ฆฌ๊ณ  entropy ์กฐ์ ˆ์— ์‚ฌ์šฉ๋˜๋Š” temperature normalizing hyperparameter ๋Š” ์ง์ ‘ ์„ค์ •ํ•˜๊ฒŒ ๋˜๋ฉด ablation ์ง„ํ–‰์ด ์–ด๋ ค์› ๊ธฐ ๋•Œ๋ฌธ์— ํ•™์Šต ๊ฐ€๋Šฅํ•œ parameter๋กœ ์„ค์ •ํ•˜์˜€๋‹ค.

Network selection

The ResNet-50 used for the BoW baseline is also used in CLIP, with the various improvements proposed in the ResNet-D paper applied to the encoder. In addition, to match the dimension of the image embedding, the global average pooling at the end of feature extraction is replaced with an attention pooling layer.
Besides the ResNet architecture, a ViT (Vision Transformer) is also used, with the same structure as the original apart from an additional layer normalization applied to the combined patch and position embeddings.
The text encoder follows the GPT-2 architecture; the base size is a network with 12 layers and a channel width of 512 (with 8 attention heads).

Results

์‚ฌ์‹ค ์ด ๋…ผ๋ฌธ์ด ๋ฐฉ๋Œ€ํ•˜๊ฒŒ ๊ธธ์–ด์ง„ ์ด์œ  3๊ฐ€์ง€๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค

  1. Related works๊ฐ€ ๋„ˆ๋ฌด ๋งŽ๋‹ค.
  2. ์ €์ž๋“ค์ด ์ฃผ์žฅํ•˜๋Š” contribution์ด ๋ฐฉ๋Œ€ํ•˜๋‹ค.
  3. ์ปจ์…‰ ํ•˜๋‚˜์— ๋Œ€ํ•ด ์„ค๋ช…ํ•˜๋Š” ๊ณผ์ •(๋นŒ๋“œ์—…)์ด ๋„ˆ๋ฌด ํฌ๋‹ค.

๊ทธ๋Ÿฌ๋‹ค๋ณด๋‹ˆ ์‹คํ—˜ ๊ฒฐ๊ณผ๋„ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๊ธธ์–ด์ง€๊ฒŒ ๋˜์—ˆ๊ณ , ๋…ผ๋ฌธ์„ ๋ชจ๋‘ ์ฝ์œผ๋ฉด์„œ ํŒŒ์•…ํ•˜๊ธฐ์—๋Š” ๋‹ค์†Œ ๋ฌด๋ฆฌ๊ฐ€ ์žˆ์ง€ ์•Š๋‚˜ ์‹ถ๋‹ค. ๋”ฐ๋ผ์„œ ๋น„๊ต์  ์ฃผ์š”ํ•œ ๋‚ด์šฉ๋“ค๋งŒ ์œ„์ฃผ๋กœ figure๋ฅผ ์ฐธ๊ณ ํ•ด์„œ ์ž‘์„ฑํ•ด๋ณด์•˜๋‹ค.

Zero-shot with prompt engineering

The main thing CLIP sets out to show is that, just like language models, an image model whose representation learning is based on a sufficiently large dataset can also perform new tasks zero-shot. The procedure for carrying out a new downstream task (classification) is as follows.

Here the authors do not simply use the class name as the prompt. They report that prompt engineering of the form 'A photo of a {object}', adapted to the characteristics of each dataset, improved performance. Their main analysis is that the (image, text) pairs used for training consist of descriptions of the image scene rather than single words naming an object, so there is a gap between the training distribution and the distribution of class names.

The image embedding is compared with the text prompts of the N classes, giving N similarities, and the index with the highest similarity becomes the prediction for that image. Prompt engineering does in fact raise the (average) classification performance.

Zero-shot vs fully-supervised baseline

Because ResNet-50 was used in the baseline study, a comparison against a linear probe is possible. Across the 27 datasets in the table above, zero-shot CLIP performs better than the fully supervised result on 16 of them (positive Δ).

Here, a linear probe is a measure of a pre-trained model's downstream task performance: the pre-trained encoder is frozen and only a linear classifier is trained on its features. It therefore indicates how well CLIP extracted general features during pre-training.

Zero-shot vs Few-shot

Similarly, comparing the best-performing curves when several few-shot methods are given a fixed number of samples, zero-shot CLIP matches the 4-shot result, and existing methods such as BiT-M and SimCLRv2 struggle to exceed the accuracy of zero-shot CLIP even with 16 shots.

The table showing how many few-shot samples are needed to match CLIP's zero-shot performance indicates that an average of 20.8 samples is needed across the datasets, which illustrates the data efficiency and task robustness of the CLIP method.

Zero shot performance & Linear probe

The paper also uses the graph above to show that CLIP has not merely 'roughly connected text and images': the image encoder learns the same kind of image representation that a classifier-based linear probe setup can learn. Where zero-shot performance is poor, linear probe performance tends to be poor as well, so on tasks where classification is hard to learn, CLIP's performance also drops.

Zero shot performance with network scale

In line with the trend claimed by earlier work (e.g. EfficientNet), the zero-shot performance of the trained CLIP rises as the representation power of the image encoder increases (as the network scale grows).

CLIP is also a good representation learner

The graph above compares the average score against GFLOPS/Image. It does not compare zero-shot performance; it checks how much the representation that CLIP learned during pre-training helps linear probe performance. The CLIP network with the best performance on the graph is then compared on several datasets with EfficientNet-NoisyStudent, which had the best representation learning performance among the other networks.

Of course, since this CLIP was trained with ViT (L/14), I expected it to do better anyway, but in any case it showed better performance (positive Δ) on 21 of the 27 datasets.

Robustness to natural domain shift

The performance of supervised deep learning networks has become quite good, but they remain vulnerable to problems such as distribution shift. For example, a network trained only on the ImageNet dataset tends to fit the in-distribution of the training dataset rather than learning and recognizing meaningful features of the objects in each class. Validation was carried out on seven datasets with a natural distribution shift relative to ImageNet: ImageNetV2, ImageNet Sketch, Youtube-BB and ImageNet-Vid, ObjectNet, ImageNet Adversarial, and ImageNet Rendition.

Honestly, I do not quite understand why the paper gives no explanation of this figure, but since it appears in this section, my reading is that when CLIP's zero-shot models are transferred across domains via a linear probe, they retain their ImageNet performance better, so they cope well with domain shift. That figure is not the main one, though. The next figure gives the experimental evidence for why the CLIP model could emerge as a game changer on zero-shot and domain shift tasks.

In the ideal case, performance on the ImageNet dataset and performance on the domain shifted datasets should be identical even under domain shift (y = x). A model trained only on ImageNet loses roughly 20 to 30% of its performance, whereas zero-shot CLIP models shrink this robustness gap by up to about 75% on average over the seven natural domain shifted datasets.

A natural question is whether there is a meaningful difference between measuring performance zero-shot and using a logistic regression classifier fine-tuned specifically to a dataset. The results show that fine-tuning on ImageNet gives a notable 9.2% accuracy gain on the original dataset and about half that gain on ImageNetV2, whose collection procedure was similar to ImageNet's, while performance drops on the remaining datasets except the video datasets Youtube-BB and ImageNet-Vid.
If zero-shot classification had only been run with the CLIP model including its text encoder, one could conclude from the experiments above that 'the model makes domain-robust predictions because text provides guidance'. The surprising point is that accuracy under domain shift can be maintained even when the downstream task is fine-tuned through a classifier. The authors move on without clearly explaining the reason for this.
One further issue the authors examined is that the classes of the transfer datasets do not always match those of ImageNet, to which a 1000-way classifier is applied. The earlier procedure handled this by max-pooling the predictions over the ImageNet sub-classes that correspond to each class. This was not accurate, so the authors instead used what CLIP makes available: 'text embeddings of the class names turned into prompts'. As a result, performance improved on the video based datasets Youtube-BB and ImageNet-Vid, with only a small improvement on ObjectNet.

The figure above shows that as few-shot supervision is added to zero-shot CLIP, the robustness advantage it had over fully-supervised learning shrinks. That is, the more dataset samples are used for training, the more the model fits the in-distribution, and the less robust it becomes at the same ImageNet performance compared with the original zero-shot CLIP model.

Comparison to Human Performance

Measuring zero-shot performance on humans is very hard. Even so, there is a way to compare zero-shot or few-shot task performance: the Oxford-IIIT Pets dataset (identifying 37 dog or cat breeds). Even a human will not know every breed of dog or cat. A participant is shown an image and answers with the breed if they know it, or with 'I don't know' otherwise.

This is fairly intuitive: a person's classification accuracy for a breed usually goes up once they are given a single image (one-shot), but showing further images does not raise it any more. The phrase that captures this is 'humans know what they don't know': people are clear about what they do not know, so learning it once is enough.

Deep learning networks, unlike humans, fail to draw on prior knowledge to get meaningful gains in the few-shot setting, and the authors note that finding a way to do so would be an important improvement to CLIP. Also, as the graph above shows, when zero-shot performance on each breed is compared with that of humans, the breeds that each finds difficult are similar to some extent.

Data overlap analysis

As described earlier, the WebImageText dataset was built by collecting a large number of web images for training. The risk is that some of the datasets used for validation are themselves made of samples obtainable from the web, so validation samples may leak into the pre-training dataset. One way to prevent this is to remove every duplicated image (a sample identical to one in a validation set) before training the network, but this means checking against every dataset in advance, which is time-consuming.
Instead of this brute-force approach, the authors define how overlap is detected and measure it with the following procedure.

  1. For each validation dataset, run a duplicate detector (described in the paper's appendix) on every example. Then manually inspect the nearest neighbors found and set a threshold on the similarity metric between samples, splitting the data into Overlap (samples whose similarity exceeds the threshold) and Clean (samples whose similarity falls below it). To quantify how contaminated the data is, compute the ratio of Overlap to All (Overlap + Clean).
  2. Then compute the zero-shot accuracy of CLIP RN50x64 on All, Overlap, and Clean. (All accuracy) - (Clean accuracy) is the accuracy difference attributable to contamination, so a positive value indicates how much the model benefited from overfitting to the overlapping data.
  3. Because Overlap is often small, also run a binomial significance test. Take the Clean accuracy as the null hypothesis and compute the one-tailed (greater) p-value for the Overlap subset. In addition, compute 99.5% Clopper-Pearson confidence intervals on the Dirty (Overlap) subset.
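
Steps 1 and 2 of the procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Example` record, the `max_similarity` score, the threshold of 0.9, and the toy data are all hypothetical, and the real duplicate detector from the paper's appendix is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # Similarity to the nearest training example, as scored by some
    # duplicate detector (hypothetical; not the paper's detector).
    max_similarity: float
    # Whether the zero-shot prediction for this example was correct.
    correct: bool


def accuracy(examples):
    """Fraction of correct predictions, or None for an empty subset."""
    if not examples:
        return None
    return sum(e.correct for e in examples) / len(examples)


def overlap_report(examples, threshold):
    """Split All into Overlap / Clean and report the All - Clean metric."""
    if not examples:
        raise ValueError("need at least one example")
    overlap = [e for e in examples if e.max_similarity > threshold]
    clean = [e for e in examples if e.max_similarity <= threshold]
    acc_all = accuracy(examples)
    acc_clean = accuracy(clean)
    return {
        "contamination": len(overlap) / len(examples),  # |Overlap| / |All|
        "acc_all": acc_all,
        "acc_overlap": accuracy(overlap),
        "acc_clean": acc_clean,
        "all_minus_clean": None if acc_clean is None else acc_all - acc_clean,
    }


# Toy data, invented purely for illustration.
examples = [
    Example(0.95, True), Example(0.91, True),
    Example(0.40, True), Example(0.35, False),
    Example(0.20, True), Example(0.10, False),
    Example(0.30, True), Example(0.25, False),
]
report = overlap_report(examples, threshold=0.9)
print(report)
```

In this toy case the two near-duplicates are both classified correctly, so All accuracy exceeds Clean accuracy and `all_minus_clean` comes out positive, which is the pattern the procedure is designed to flag.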

Binomial distribution test

Before looking at the results, it helps to write the binomial distribution test out as equations. Start from a test hypothesis $H_0: \pi = \pi_0$, which we call the null hypothesis. In this test the null hypothesis is the Clean accuracy, that is, the zero-shot validation accuracy measured after excluding the small overlapping subset.

Accuracy takes a value between 0 and 1, so treat $\pi_0$ as the probability that the CLIP model classifies a given sample correctly, measured on the non-overlapping data.
If the Overlap subset has $n$ samples, the probability that exactly $k$ of them are classified correctly by the CLIP model under the null hypothesis can be written as follows.

$$P(X = k) = \binom{n}{k} \pi_0^{k} (1 - \pi_0)^{n-k}$$

We want the $p$-value for the success probability $\pi$ inside the Overlap subset being larger than $\pi_0$, because only then can we suspect overfitting to the overlapping data.

$$p = P(X \ge k) = \sum_{i=k}^{n} \binom{n}{i} \pi_0^{i} (1 - \pi_0)^{n-i}$$

The one-tailed test that yields this $p$ is given above; the observed number of successes $k$ serves as the cutoff.
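
The one-tailed binomial test, and the Clopper-Pearson interval mentioned in the procedure, can be computed with the Python standard library alone. The sketch below is an assumption-laden illustration rather than the paper's code: the function names are made up here, the interval is found by simple bisection on the binomial tail probabilities, and the numbers in the usage lines are invented.

```python
from math import comb


def binom_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def one_tailed_p_value(k, n, p0):
    """P(X >= k) when the true success rate equals the null value p0."""
    return sum(binom_pmf(i, n, p0) for i in range(k, n + 1))


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


def _bisect(f, target, increasing, tol=1e-12):
    """Solve f(p) = target on [0, 1] for a monotone function f."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies to the left of mid
    return (lo + hi) / 2


def clopper_pearson(k, n, confidence=0.995):
    """Exact two-sided interval for a binomial proportion."""
    alpha = 1 - confidence
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: one_tailed_p_value(k, n, p), alpha / 2, increasing=True)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper


# Invented numbers: 46 of 50 Overlap samples correct, Clean accuracy 0.80.
p_value = one_tailed_p_value(k=46, n=50, p0=0.80)
interval = clopper_pearson(k=46, n=50)
print(p_value, interval)
```

A small `p_value` means the Overlap accuracy is higher than the Clean accuracy would plausibly produce by chance, and the interval gives a second check on the same question.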

The resulting plot is shown above. Across the 35 datasets studied, 9 had no detected overlap at all. For most datasets the overlap was small, and overall accuracy shifted by more than the 0.1% threshold on only 7 datasets. Of these, 2 stood out as significant under Bonferroni correction (Birdsnap, Country211).
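
For readers unfamiliar with Bonferroni correction: when several datasets are tested at once, each p-value is compared against the significance level divided by the number of tests. A minimal sketch follows, where the significance level of 0.05, the dataset names, and the p-values are all hypothetical and not taken from the paper.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose p-value stays below alpha / (number of tests)."""
    corrected_threshold = alpha / len(p_values)
    return {name: p < corrected_threshold for name, p in p_values.items()}


# Hypothetical p-values for four datasets.
p_values = {
    "dataset_a": 0.0001,
    "dataset_b": 0.01,
    "dataset_c": 0.2,
    "dataset_d": 0.03,
}
flags = bonferroni_significant(p_values)
print(flags)
```

Here `dataset_d` would pass an uncorrected 0.05 test but fails the corrected threshold of 0.0125, which is how the correction trims the list of datasets with a significant shift.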

Conclusion

The paper goes on to discuss its limitations and, in the appendix, implementation details that the main text could not cover. The material summarized above alone already runs to nearly 20 pages, which really drives home how excessively long this paper is.
In any case, the main contributions of this paper can be summarized as follows.

  1. It built a large-scale dataset of the kind previously used for text models and used it to train an image model efficiently.
  2. It learned meaningful text-to-image relationships, which is what makes the model multimodal.
  3. It dramatically improved zero-shot transfer learning performance.

The additional contributions confirmed through the experiments are listed below. These are sub-contributions rather than main contributions.

  1. The linear probe (logistic regression) results show that representation learning worked reasonably well.
  2. The trained model is robust to domain shift.
  3. Rather than simply collecting data indiscriminately, the authors also analyzed dataset overlap and used it to show that the performance gains are not the result of overfitting.

That said, once again, the paper is too long, and honestly a few of the analyses could well have been left to follow-up work.
