
[Paper Review] DINO: Emerging Properties in Self-Supervised Vision Transformers (ICCV 2021)

DrawingProcess 2024. 11. 21. 14:05
๋ฐ˜์‘ํ˜•
๐Ÿ’ก This post summarizes the paper 'DINO: Emerging Properties in Self-Supervised Vision Transformers' (ICCV 2021).
The paper proposes DINO, a self-supervised learning method built on a self-distillation architecture, and at the same time analyzes the properties that emerge when self-supervised learning is combined with ViT, presenting interesting experimental results. Particularly striking is that self-supervised ViT features carry information about segmentation masks: a supervised ViT cannot expose object-boundary information directly and needs extra machinery such as attention flow, whereas a self-supervised ViT makes this information directly accessible in the attention heads of the last layer (though why this happens remains somewhat of an open question). An enjoyable read, and recommended.

 - Paper: https://arxiv.org/abs/2104.14294
 - Github: https://github.com/facebookresearch/dino
 - Video: https://www.youtube.com/watch?v=h3ij3F3cPIk

Abstract

Beyond showing that self-supervised methods work well when applied to ViT, the authors make the following observations.

  • self-supervised ViT features contain explicit information about the semantic segmentation of an image (which does not emerge with supervised ViTs or convolutional networks (convnets))
  • these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT.
  • the study also underlines the importance of the momentum encoder, multi-crop training, and the use of small patches with ViTs.

์œ„์˜ finding์„ DINO (self-distillation with no labels)๋ผ๋Š” simple self-supervised method๋ฅผ ํ†ตํ•ด implementํ•จ. ViT-Base ๊ตฌ์กฐ๋กœ Top-1 linear probing accuracy๋ฅผ 80.1% ๋‹ฌ์„ฑํ•˜๋ฉด์„œ DINO์™€ ViT ๊ฐ„ ์‹œ๋„ˆ์ง€๋ฅผ ๋ณด์—ฌ์คŒ.

์ด ๋…ผ๋ฌธ์ด ๊ฐ€์ง€๋Š” contribution ์ค‘ ๊ฐ€์žฅ ๋ฉ”์ธ์ด ๋˜๋Š” ๋ถ€๋ถ„์„ ์ •๋ฆฌํ•ด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  •  

Introduction

Transformers have recently emerged as an alternative to convnets (convolutional neural networks) for visual recognition. Inspired by NLP, the standard strategy is to pretrain on large amounts of data and then finetune on a target dataset. The resulting ViTs are competitive but do not yet offer a clear advantage over convnets: they demand more compute, require more training data, and their features do not exhibit unique properties.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” vision ๋ถ„์•ผ์—์„œ Transformer์˜ ์กฐ์šฉํ•œ ์„ฑ๊ณต์ด ์‚ฌ์ „ ํ•™์Šต์—์„œ supervision์„ ์‚ฌ์šฉํ•˜์—ฌ ์„ค๋ช…๋  ์ˆ˜ ์žˆ๋Š”์ง€ ์งˆ๋ฌธํ•œ๋‹ค. ์ €์ž๋“ค์˜ ๋™๊ธฐ๋Š” NLP์—์„œ Transformer์˜ ์„ฑ๊ณต์„ ์œ„ํ•œ ์ฃผ์š” ์š”์†Œ ์ค‘ ํ•˜๋‚˜๊ฐ€ BERT์˜ close procedure ๋˜๋Š” GPT์˜ ์–ธ์–ด ๋ชจ๋ธ๋ง ํ˜•ํƒœ๋กœ self-supervised pretraining์„ ์‚ฌ์šฉํ–ˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋Ÿฌํ•œ self-supervised pretraining์˜ ๋ชฉ์  ํ•จ์ˆ˜๋Š” ๋ฌธ์žฅ์˜ ๋‹จ์–ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ์žฅ๋‹น ๋‹จ์ผ ๋ ˆ์ด๋ธ”์„ ์˜ˆ์ธกํ•˜๋Š” ์ง€๋„ ํ•™์Šต์˜ ๋ชฉ์  ํ•จ์ˆ˜๋ณด๋‹ค ๋” ํ’๋ถ€ํ•œ ํ•™์Šต ์‹ ํ˜ธ๋ฅผ ์ œ๊ณตํ•˜๋Š” pretext task๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์ด๋ฏธ์ง€์—์„œ supervision์€ ์ข…์ข… ์ด๋ฏธ์ง€์— ํฌํ•จ๋œ ํ’๋ถ€ํ•œ ์‹œ๊ฐ์  ์ •๋ณด๋ฅผ ๋ฏธ๋ฆฌ ์ •์˜๋œ ์ˆ˜์ฒœ ๊ฐœ์˜ ์นดํ…Œ๊ณ ๋ฆฌ ์ง‘ํ•ฉ์—์„œ ์„ ํƒ๋œ ๋‹จ์ผ ๊ฐœ๋…์œผ๋กœ ์ถ•์†Œํ•œ๋‹ค.

NLP์—์„œ ์‚ฌ์šฉํ•˜๋Š” self-supervised pretext task๊ฐ€ ํ…์ŠคํŠธ์— ํ•œ์ •๋˜์ง€๋งŒ ๋‹ค์–‘ํ•œ self-supervised ๋ฐฉ๋ฒ•๋“ค์€ ์ด๋ฏธ์ง€์—์„œ ์ž ์žฌ๋ ฅ์ด ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•๋“ค์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์œ ์‚ฌํ•œ ๊ตฌ์กฐ๋ฅผ ๊ณต์œ ํ•˜์ง€๋งŒ ์ž๋ช…ํ•ด(collapse)๋ฅผ ํ”ผํ•˜๊ฑฐ๋‚˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์„ค๊ณ„๋œ ๋‹ค๋ฅธ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„ ViT feature๋“ค์—์„œ self-supervised pretraining์˜ ์˜ํ–ฅ์„ ์—ฐ๊ตฌํ•˜์˜€๋‹ค.

ํŠนํžˆ ํฅ๋ฏธ๋กญ๊ฒŒ๋„ ์ €์ž๋“ค์€ supervised ViT๋‚˜ convnet์—์„œ ๋‚˜ํƒ€๋‚˜์ง€ ์•Š๋Š” ๋ช‡ ๊ฐ€์ง€ ํฅ๋ฏธ๋กœ์šด ์†์„ฑ์„ ํ™•์ธํ–ˆ๋‹ค.

  1. Self-supervised ViT features explicitly contain the scene layout and, in particular, object boundaries, as in the figure above. This information is directly accessible in the self-attention modules of the last block.
  2. Self-supervised ViT features perform particularly well with a basic k-NN classifier, without any finetuning, linear classifier, or data augmentation, achieving 78.3% top-1 accuracy on ImageNet.

Self-supervised ๋ฐฉ๋ฒ•์—์„œ๋Š” segmentation mask์˜ ์ถœํ˜„์ด ๊ณตํ†ต ์†์„ฑ์œผ๋กœ ๋‚˜ํƒ€๋‚˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ k-NN์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์€ momentum encoder์™€ multi-crop augmentation๊ณผ ๊ฐ™์€ ํŠน์ • ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•  ๋•Œ๋งŒ ๋‚˜ํƒ€๋‚œ๋‹ค. ๋˜ ๋‹ค๋ฅธ ๋ฐœ๊ฒฌ ์ค‘ ํ•˜๋‚˜๋Š” ViT์™€ ๊ฒฐ๊ณผ feature์˜ ํ’ˆ์งˆ์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด ์ž‘์€ ํŒจ์น˜๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

These findings about the importance of individual components led the authors to design a simple self-supervised method that can be interpreted as a form of knowledge distillation with no labels. The resulting framework, DINO, simplifies self-supervised training by directly predicting the output of a teacher network, built with a momentum encoder, using a standard cross-entropy loss. Interestingly, the method can work with only centering and sharpening of the teacher outputs to avoid collapse. Importantly, the framework is flexible and works on both convnets and ViTs without the need to modify the architecture or adapt internal normalizations.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ž‘์€ ํŒจ์น˜๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ViT-Base๋กœ 80.1%์˜ top-1 accuracy๋กœ ImageNet linear classification ๋ฒค์น˜๋งˆํฌ์—์„œ ์ด์ „์˜ self-supervised feature๋“ค์„ ๋Šฅ๊ฐ€ํ•จ์œผ๋กœ์จ DINO์™€ ViT๊ฐ„์˜ ์‹œ๋„ˆ์ง€ ํšจ๊ณผ๋ฅผ ์ถ”๊ฐ€๋กœ ๊ฒ€์ฆํ•œ๋‹ค. ๋˜ํ•œ DINO๋ฅผ state-of-the-art ResNet-50 ์•„ํ‚คํ…์ฒ˜์— ์‚ฌ์šฉํ•˜์—ฌ DINO๊ฐ€ convnet๊ณผ ํ•จ๊ป˜ ์ž‘๋™ํ•จ์„ ํ™•์ธํ•œ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ ๊ณ„์‚ฐ ๋ฐ ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰์ด ์ œํ•œ๋œ ๊ฒฝ์šฐ ViT์™€ ํ•จ๊ป˜ DINO๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋‹ค์–‘ํ•œ ์‹œ๋‚˜๋ฆฌ์˜ค์— ๋Œ€ํ•ด ๋…ผ์˜ํ•œ๋‹ค. ํŠนํžˆ, ViT๋กœ DINO๋ฅผ ๊ต์œกํ•˜๋Š” ๋ฐ 3์ผ ๋™์•ˆ ๋‹จ 2๊ฐœ์˜ 8-GPU ์„œ๋ฒ„๋งŒ ์žˆ์œผ๋ฉด ImageNet linear classification ๋ฒค์น˜๋งˆํฌ์—์„œ 76.1%๋ฅผ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค.

Approach

1. SSL with Knowledge Distillation

DINO shares its overall structure with recent self-supervised methods but also resembles knowledge distillation, and the authors present DINO from the knowledge-distillation perspective. DINO can be illustrated as in the following figure.


DINO์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ pseudo-code๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.


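Since the figure with the paper's PyTorch pseudo-code is not reproduced here, the core update can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: `g_s` and `g_t` stand in for the student and teacher networks, and the EMA update of the teacher parameters is left abstract.

```python
import numpy as np

def softmax(z, temp):
    """Softmax over the last axis with temperature temp."""
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def H(p_t, p_s, eps=1e-12):
    """Cross-entropy H(a, b) = -sum a log b, averaged over the batch."""
    return -np.mean(np.sum(p_t * np.log(p_s + eps), axis=-1))

def dino_step(views, g_s, g_t, center, tau_s=0.1, tau_t=0.04, m=0.9):
    """One DINO loss computation; the first two views are the global views."""
    s_out = [g_s(v) for v in views]       # student sees every crop
    t_out = [g_t(v) for v in views[:2]]   # teacher sees only global views
    # centering + sharpening of the teacher outputs to avoid collapse
    p_t = [softmax(o - center, tau_t) for o in t_out]
    p_s = [softmax(o, tau_s) for o in s_out]
    loss, n = 0.0, 0
    for it, pt in enumerate(p_t):
        for iv, ps in enumerate(p_s):
            if iv == it:                  # skip identical view pairs
                continue
            loss += H(pt, ps)
            n += 1
    # EMA update of the center (teacher parameter EMA omitted in this sketch)
    new_center = m * center + (1 - m) * np.concatenate(t_out).mean(axis=0)
    return loss / n, new_center
```

In the real implementation the loss is backpropagated only through the student, and the teacher parameters are then updated as an EMA of the student's.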
Knowledge distillation is a learning paradigm in which a student network g_θs is trained to match the output of a given teacher network g_θt, parameterized by θs and θt respectively. Given an input image x, both networks output probability distributions over K dimensions, denoted Ps and Pt. The probability P is obtained by normalizing the output of the network g with a softmax function:

Ps(x)^(i) = exp(g_θs(x)^(i) / τs) / Σ_{k=1..K} exp(g_θs(x)^(k) / τs)

and analogously for Pt with temperature τt. Here τs and τt are temperature parameters that control the sharpness of the output distributions. Given a fixed teacher network g_θt, the output distributions of the two networks are matched by minimizing the cross-entropy loss with respect to the student parameters θs:

min_θs H(Pt(x), Ps(x)),

where H(a, b) = −a log b.

๋‹ค์Œ์€ ์–ด๋–ป๊ฒŒ ์ €์ž๋“ค์ด ์œ„ ๋ฐฉ๋ฒ•์„ self-supervised learning์— ์ ์šฉํ•˜์˜€๋Š” ์ง€์— ๋Œ€ํ•œ ๋””ํ…Œ์ผํ•œ ๋ถ€๋ถ„์ด๋‹ค. ๋จผ์ €, ์ €์ž๋“ค์€ multi-crop strategy๋กœ ์ด๋ฏธ์ง€์˜ ๋‹ค์–‘ํ•œ distorted view๋‚˜ crop์„ ๊ตฌ์„ฑํ•˜์˜€๋‹ค. ๋ณด๋‹ค ์ •ํ™•ํ•˜๊ฒŒ๋Š” ์ฃผ์–ด์ง„ ์ด๋ฏธ์ง€์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ view์˜ set V๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. ์ด set์€ ๋‘ global view x1g์™€ x2g์™€ ์ž‘์€ ํ•ด์ƒ๋„์—์„œ์˜ ๋ช‡๋ช‡ local view๋ฅผ ํฌํ•จํ•œ๋‹ค. ๋ชจ๋“  crop ์ด๋ฏธ์ง€๊ฐ€ student๋ฅผ ํ†ต๊ณผํ•  ์ˆ˜ ์žˆ์ง€๋งŒ teacher๋Š” global view๋งŒ ํ†ต๊ณผํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋ฅผ ํ†ตํ•ด “local-to-global” ๋Œ€์‘์„ ์žฅ๋ คํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ๋‹ค์Œ loss๋ฅผ ์ตœ์†Œํ™”ํ•œ๋‹ค.

This loss is general and can be used with any number of views, not just two. Following the standard multi-crop setting, the authors use 2 global views at 224² resolution covering a large area of the original image and several local views at 96² resolution covering only small areas. This is the basic configuration of DINO.
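Which (teacher view, student view) pairs actually enter the multi-crop loss can be enumerated explicitly. A small sketch, with hypothetical view names `g1, g2` for global and `l1…` for local crops:

```python
def loss_pairs(n_global=2, n_local=6):
    """List the (teacher view, student view) pairs summed in the multi-crop loss."""
    views = [f"g{i+1}" for i in range(n_global)] + [f"l{i+1}" for i in range(n_local)]
    pairs = []
    for t in views[:n_global]:   # teacher only processes the global views
        for s in views:          # student processes every view
            if s != t:           # identical view pairs are skipped
                pairs.append((t, s))
    return pairs
```

With 2 global and 6 local views this gives 2 × 7 = 14 cross-entropy terms, most of which are "local-to-global" comparisons.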

๋‘ ์‹ ๊ฒฝ๋ง ๋ชจ๋‘ ๊ฐ™์€ ์•„ํ‚คํ…์ฒ˜ ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ ์„œ๋กœ ๋‹ค๋ฅธ ํŒŒ๋ผ๋ฏธํ„ฐ ์™€ θt๋ฅผ ๊ฐ€์ง„๋‹ค. θs๋Š” ์œ„ loss ์‹์„ SGD๋กœ ์ตœ์†Œํ™”ํ•˜์—ฌ ํ•™์Šต์‹œํ‚จ๋‹ค.

Teacher network

Unlike knowledge distillation, there is no teacher network g_θt given a priori, so the teacher is built from past iterations of the student. The authors experimented with several update rules for the teacher: freezing the teacher network over an epoch worked surprisingly well, while copying the student weights into the teacher failed to converge. A momentum encoder, i.e. an exponential moving average (EMA) of the student weights, was found to fit the framework particularly well.

์—…๋ฐ์ดํŠธ ๊ทœ์น™์€

์ด๋ฉฐ, ํ•™์Šต ์ค‘์— ๋Š” 0.996์—์„œ 1๋กœ ์ฆ๊ฐ€ํ•˜๋Š” cosine schedule์„ ๋”ฐ๋ฅธ๋‹ค. ์›๋ž˜ momentum encoder๊ฐ€ contrastive learning์˜ ํ์—์„œ ์‚ฌ์šฉ๋˜์—ˆ์ง€๋งŒ, DINO์—๋Š” ํ๋‚˜ contrastive loss๊ฐ€ ์—†์œผ๋ฏ€๋กœ ์—ญํ• ์ด ๋‹ค๋ฅด๊ณ  self-training์— ์‚ฌ์šฉํ•˜๋Š” mean teacher์˜ ์—ญํ• ์„ ํ•œ๋‹ค. ํ•™์Šต ์ค‘์—๋Š” teacher๊ฐ€ student๋ณด๋‹ค ๋” ์„ฑ๋Šฅ์ด ์ข‹์œผ๋ฉฐ, teacher๊ฐ€ target feature๋“ค์„ ๊ณ ํ’ˆ์งˆ๋กœ ์ œ๊ณตํ•˜์—ฌ student์˜ ํ•™์Šต์„ guideํ•œ๋‹ค.
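The EMA update and its cosine schedule can be sketched as follows; the exact shape of the schedule is an assumption (a standard cosine ramp consistent with "from 0.996 to 1").

```python
import math

def teacher_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule for lambda: starts at 0.996, reaches 1 at the end of training."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return final - (final - base) * cos

def ema_update(theta_t, theta_s, lam):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, applied parameter-wise."""
    return [lam * t + (1.0 - lam) * s for t, s in zip(theta_t, theta_s)]
```

As λ approaches 1, the teacher changes more and more slowly, stabilizing the targets late in training.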

Network architecture

์‹ ๊ฒฝ๋ง g๋Š” ViT๋‚˜ ResNet backbone f์™€ projection head h๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค. (g=hโˆ˜f) Projection head๋Š” layer 3๊ฐœ์˜ MLP, l2 ์ •๊ทœํ™”, ๊ฐ€์ค‘์น˜๊ฐ€ ์ •๊ทœํ™”๋œ FC layer๋กœ ๊ตฌ์„ฑ๋œ๋‹ค. ์ €์ž๋“ค์€ ๋‹ค๋ฅธ projection head๋“ค๋„ ์‹คํ—˜ํ•ด๋ณด์•˜์ง€๋งŒ ์•ž์„œ ์„ค๋ช…ํ•œ ๋””์ž์ธ์ด DINO์— ๊ฐ€์žฅ ์ ํ•ฉํ•˜์˜€๋‹ค. ํŠนํžˆ ํฅ๋ฏธ๋กœ์šด ์ ์€ ํ‘œ์ค€ convnet๊ณผ ๋‹ฌ๋ฆฌ ViT ์•„ํ‚คํ…์ฒ˜๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ batch ์ •๊ทœํ™”(BN)๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ ViT์— DINO๋ฅผ ์ ์šฉํ•  ๋•Œ๋Š” projection head์—์„œ BN์„ ์ œ๊ฑฐํ•˜์—ฌ ์ „์ฒด ์‹œ์Šคํ…œ์— BN์ด ์—†๋„๋ก ํ•˜์˜€๋‹ค.

Avoiding collapse

Different self-supervised methods try to avoid collapse in different ways: contrastive losses, clustering constraints, predictors, BN, and so on. While DINO can be stabilized with multiple normalizations, it can also work with only centering and sharpening of the momentum teacher outputs to avoid collapse.

Centering์€ ํ•œ ์ฐจ์›์ด ์ง€๋ฐฐํ•˜๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜์ง€๋งŒ uniform ๋ถ„ํฌ๋กœ์˜ collapse๋ฅผ ์กฐ์žฅํ•˜๋Š” ๋ฐ˜๋ฉด, sharpening์€ ๋ฐ˜๋Œ€ ํšจ๊ณผ๋ฅผ ๋‚ธ๋‹ค. ๋‘ ์—ฐ์‚ฐ์„ ๋ชจ๋‘ ์ ์šฉํ•˜์—ฌ collpase๋ฅผ ํ”ผํ•˜๊ธฐ์— ์ถฉ๋ถ„ํ•˜๋„๋ก ๊ฐ ํšจ๊ณผ์˜ ๊ท ํ˜•์„ ๋งž์ถ˜๋‹ค. Collapse๋ฅผ ํ”ผํ•˜๊ธฐ ์œ„ํ•˜์—ฌ centering์„ ์‚ฌ์šฉํ•˜๋ฉด batch์— ๋Œ€ํ•œ ์˜์กด๋„๋ฅผ ๋‚ฎ์ถ”๊ธฐ ์œ„ํ•ด ์•ˆ์ •์„ฑ์„ ๋‚ฎ์•„์ง„๋‹ค. ์ด๋Š” centering ์—ฐ์‚ฐ์ด 1์ฐจ batch ํ†ต๊ณ„์—๋งŒ ์˜์กดํ•˜๋ฉฐ teacher์— bias ํ•ญ์„ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ์œผ๋กœ ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋‹ค.

The center c is updated with an EMA, which allows the approach to work well across different batch sizes:

c ← m c + (1 − m) (1/B) Σ_{i=1..B} g_θt(x_i),

where m is a rate parameter and B is the batch size. Sharpening is obtained by using a low value for the temperature τt in the teacher softmax normalization.

2. Implementation and evaluation protocols

Vision Transformer

์ €์ž๋“ค์€ DeiT์˜ implementation์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉํ•œ ๋ชจ๋ธ๋“ค์˜ ์„ค์ •์€ ๋‹ค์Œ ํ‘œ์™€ ๊ฐ™๋‹ค.


ViT ์•„ํ‚คํ…์ฒ˜๋Š” ๊ฒน์น˜์น˜ ์•Š๋Š” ์—ฐ์†์ ์ธ ์˜ ์ด๋ฏธ์ง€ ํŒจ์น˜๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๋Š”๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ  (“/16”)์ด๋‚˜  (“/8”)์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ด ํŒจ์น˜๋“ค์€ linear layer์„ ํ†ต๊ณผํ•˜์—ฌ ์ž„๋ฒ ๋”ฉ์ด ๋œ๋‹ค.

์ €์ž๋“ค์€ ์ถ”๊ฐ€ ํ•™์Šต๊ฐ€๋Šฅํ•œ ํ† ํฐ์„ ์‹œํ€€์Šค์— ๋„ฃ์–ด ์ „์ฒด ์‹œํ€€์Šค์˜ ์ •๋ณด๋ฅผ ์ง‘๊ณ„ํ•˜๋„๋ก ํ•˜์˜€์œผ๋ฉฐ, ์ถœ๋ ฅ์— projection head h๋ฅผ ์—ฐ๊ฒฐํ•˜์˜€๋‹ค. ์ด ํ† ํฐ์€ ์–ด๋– ํ•œ ๋ ˆ์ด๋ธ”์ด๋‚˜ supervision์— ์—ฐ๊ฒฐ๋˜์ง€๋Š” ์•Š์ง€๋งŒ ๊ธฐ์กด ์—ฐ๊ตฌ๋“ค๊ณผ์˜ ์ผ๊ด€์„ฑ์„ ์œ„ํ•ด ํด๋ž˜์Šค ํ† ํฐ [CLS]์ด๋ผ ๋ถ€๋ฅธ๋‹ค. ํŒจ์น˜ ํ† ํฐ๊ณผ [CLS] ํ† ํฐ์€ pre-norm layer normalization์„ ๊ฐ€์ง„ ํ‘œ์ค€ Transformer network์— ์ž…๋ ฅ๋œ๋‹ค.

The Transformer is a sequence of self-attention and feed-forward layers, paralleled with skip connections. The self-attention layers update each token representation by looking at the other token representations with an attention mechanism.

Implementation details

  • ๋ฐ์ดํ„ฐ์…‹: ImageNet ๋ฐ์ดํ„ฐ์…‹์— ๋ ˆ์ด๋ธ” ์—†์ด ์‚ฌ์ „ ํ•™์Šต
  • batch size 1024, adamw optimizer, 16 GPUs
  • learning rate๋Š” ์ฒ˜์Œ 10 epoch๋งŒ 0.005×batchsize/256๊นŒ์ง€ warmup ํ›„ cosine schedule๋กœ decay
  • weight decay: cosine schedule๋กœ 0.04์—์„œ 0.4
  • τs=0.1τt๋Š” 0.04์—์„œ 0.07๋กœ ์ดˆ๋ฐ˜ 30 epoch๋™์•ˆ linear-warmup
  • BYOL์˜ data augmentation (color jittering, Gaussian blur and solarization)๊ณผ multi-crop์„ ์‚ฌ์šฉ

Evaluation protocols

Self-supervised learning์„ ํ‰๊ฐ€ํ•˜๋Š” ํ‘œ์ค€ ํ”„๋กœํ† ์ฝœ์€ ๊ณ ์ •๋œ feature๋“ค์„ linear classifer๋กœ ํ•™์Šต์‹œํ‚ค๊ฑฐ๋‚˜ feature์„ downstream task์—์„œ finetuneํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

Linear evaluation์„ ์œ„ํ•ด์„œ random resize crop๊ณผ horizontal flips augmentation์„ ํ•™์Šต์— ์‚ฌ์šฉํ•˜๊ณ  central crop์— ๋Œ€ํ•œ accuracy๋ฅผ ์ธก์ •ํ•˜์˜€๋‹ค. Finetuning evaluation์„ ์œ„ํ•ด์„œ ์‚ฌ์ „ ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜๋กœ ์‹ ๊ฒฝ๋ง์„ ์ดˆ๊ธฐํ™”ํ•˜๊ณ  ํ•™์Šต ๋‹จ๊ณ„์—์„œ ์ ์‘์‹œ์ผฐ๋‹ค.

However, both evaluations are sensitive to hyperparameters, and the authors observed that accuracy varied greatly between runs when the learning rate was changed. They therefore also measure the quality of features with a simple weighted k-NN classifier. The pretrained model is frozen, and the features of the downstream data are computed and stored. The k-NN classifier then matches the feature of a test image against the k nearest stored features, which vote for the label.

์ €์ž๋“ค์€ ๋‹ค์–‘ํ•œ k์— ๋Œ€ํ•˜์—ฌ ์‹คํ—˜์„ ํ•œ ๊ฒฐ๊ณผ 20์œผ๋กœ ๋‘๋Š” ๊ฒƒ์ด ์ „์ฒด์ ์œผ๋กœ ์„ฑ๋Šฅ์ด ์ œ์ผ ์ข‹์•˜๋‹ค๊ณ  ํ•œ๋‹ค. ์ด evaluation ๋ฐฉ๋ฒ•์€ ์ถ”๊ฐ€ hyperparameter tuning์ด๋‚˜ data augmentation์ด ํ•„์š” ์—†์œผ๋ฉฐ ํ•˜์œ„ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•˜์—ฌ 1๋ฒˆ๋งŒ ์‹คํ–‰ํ•˜๋ฉด ๋˜๊ธฐ ๋•Œ๋ฌธ์— feautre evaluation์„ ๊ต‰์žฅํžˆ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค.

Main Results

1. Comparing with SSL frameworks on ImageNet

๋‹ค์Œ์€ ๋‹ค์–‘ํ•œ self-supervised ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ImageNet์—์„œ์˜ linear๊ณผ k-NN evaluation์˜ top-1 accuracy์ด๋‹ค. Throughput (im/s)์€ NVIDIA V100 GPU์—์„œ ํ•œ ๋ฒˆ์— 128๊ฐœ์˜ ์ƒ˜ํ”Œ์„ ์ถœ๋ ฅํ•  ๋•Œ ์ธก์ •ํ•œ ๊ฐ’์ด๋‹ค.

2. Properties of ViT trained with SSL

Nearest neighbor retrieval with DINO ViT

๋‹ค์Œ์€ ์ด๋ฏธ์ง€ retrieval์— ๋Œ€ํ•œ ์„ฑ๋Šฅ ๋น„๊ต์ด๋‹ค. Supervision(Sup.)์ด๋‚˜ DINO๋กœ ImageNet์ด๋‚˜ Google Landmarks v2(GLDv2) ๋ฐ์ดํ„ฐ์…‹์—์„œ ์‚ฌ์ „ ํ•™์Šต๋œ feature์˜ retrieval ์„ฑ๋Šฅ์„ ๋น„๊ตํ•œ๋‹ค.


๋‹ค์Œ์€ copy detection task์— ๋Œ€ํ•œ ์„ฑ๋Šฅ ๋น„๊ต์ด๋‹ค. INRIA Copydays ๋ฐ์ดํ„ฐ์…‹์˜ “๊ฐ•ํ•œ” ๋ถ€๋ถ„์ง‘ํ•ฉ์— ๋Œ€ํ•˜์—ฌ mean average precision (mAP)๋ฅผ ์ธก์ •ํ•˜์˜€๋‹ค. Copy detection task์€ blur, insertions, print and scan ๋“ฑ์œผ๋กœ ์™œ๊ณก๋œ ์ด๋ฏธ์ง€๋ฅผ ์ธ์‹ํ•˜๋Š” task์ด๋‹ค.

Discovering the semantic layout of scenes

The table below evaluates the output patch tokens on the DAVIS-2017 video instance segmentation benchmark. Jm is the mean region similarity and Fm is the mean contour-based accuracy. The image resolution is 480p.


๋‹ค์Œ์€ ์„œ๋กœ ๋‹ค๋ฅธ head๋“ค์ด ๋‹ค๋ฅธ semantic region์— ์ฐธ์—ฌํ•˜๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ทธ๋ฆผ์ด๋‹ค.


๋‹ค์Œ์€ supervised์™€ DINO์˜ segmentation ๊ฒฐ๊ณผ์ด๋‹ค. Self-attention map์— ์ž„๊ณ„๊ฐ’์„ ์ฃผ์–ด mask๋ฅผ ์–ป์–ด ์‹œ๊ฐํ™”ํ•œ ๊ฒƒ์ด๋‹ค.

Transfer learning on downstream tasks

๋‹ค์Œ์€ ์—ฌ๋Ÿฌ ํ•˜์œ„ task์—์„œ DINO๋กœ ์‚ฌ์ „ ํ•™์Šต๋œ feature์˜ ํ’ˆ์งˆ์„ ํ‰๊ฐ€ํ•œ ํ‘œ์ด๋‹ค.

Ablation Study of DINO

1. Importance of the Different Components

๋‹ค์Œ์€ ์š”์†Œ๋ฅผ ์ถ”๊ฐ€ํ•˜๊ฑฐ๋‚˜ ์ œ๊ฑฐํ•œ ๋ณ€ํ˜• ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•œ ๊ฒƒ์ด๋‹ค.


๋‹ค์Œ์€ ๋‹ค์–‘ํ•œ ํŒจ์น˜ ํฌ๊ธฐ์—์„œ ViT-S ๋ชจ๋ธ์˜ k-NN classification ์„ฑ๋Šฅ์„ ๋น„๊ตํ•œ ๊ฒƒ์ด๋‹ค.

2. Impact of the choice of Teacher Network

๋‹ค์Œ์€ ImageNet์— ๋Œ€ํ•œ k-NN classifer๋กœ ์ธก์ •ํ•œ top-1 accuracy์ด๋‹ค.


์™ผ์ชฝ์€ ํ•™์Šต ์ค‘์˜ momentum teacher๊ณผ student์˜ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•œ ๊ฒƒ์ด๋‹ค. ์˜ค๋ฅธ์ชฝ์€ ๋‹ค์–‘ํ•œ teacher network์— ๋Œ€ํ•œ ์„ฑ๋Šฅ ๋น„๊ต์ด๋‹ค.

3. Avoiding collapse

๋‹ค์Œ์€ centering๊ณผ sharpening์ด collapse๋ฅผ ํ”ผํ•˜๋Š” ๋ฐ ์ˆ˜ํ–‰ํ•˜๋Š” ์—ญํ• ์„ ์—ฐ๊ตฌํ•œ ๊ฒƒ์ด๋‹ค.

Collapse์—๋Š” ๋‘ ๊ฐ€์ง€ ํ˜•ํƒœ๊ฐ€ ์žˆ๋‹ค. ํ•˜๋‚˜๋Š” ์ž…๋ ฅ์„ ๋ฌด์‹œํ•˜๊ณ  ๋ชจ๋ธ์˜ ์ถœ๋ ฅ์ด ๋ชจ๋“  ์ฐจ์›์—์„œ ๊ท ์ผํ•œ ๊ฒƒ์ด๊ณ , ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” ํ•œ ์ฐจ์›์ด ์ง€๋ฐฐ์ ์ธ ๊ฒƒ์ด๋‹ค. Centering์€ ํ•œ ์ฐจ์›์ด ์ง€๋ฐฐ์ ์ธ collapse๋ฅผ ํ”ผํ•˜์ง€๋งŒ ๊ท ์ผํ•œ ์ถœ๋ ฅ์„ ์œ ๋„ํ•˜๋ฉฐ, sharpening์€ ๋ฐ˜๋Œ€ ํšจ๊ณผ๊ฐ€ ๋‚˜ํƒ€๋‚œ๋‹ค. ์ด๋Ÿฌํ•œ ํŠน์„ฑ์€ cross-entropy H๋ฅผ entropy h์™€ KL divergence DKL๋กœ ๋‚˜๋ˆ„์–ด ๋ณด๋ฉด ์•Œ ์ˆ˜ ์žˆ๋‹ค.

H(Pt, Ps) = h(Pt) + DKL(Pt ‖ Ps)

A KL divergence of 0 indicates a constant output, i.e. collapse. The table below measures the entropy and KL divergence during training with and without centering and sharpening.
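The decomposition H = h + DKL can be checked numerically with any pair of distributions; a small NumPy sketch with illustrative values:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy h(p) = -sum p log p."""
    return -np.sum(p * np.log(p + eps))

def kl(p, q, eps=1e-12):
    """KL divergence DKL(p || q) = sum p log(p/q)."""
    return np.sum(p * np.log((p + eps) / (q + eps)))

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) = -sum p log q."""
    return -np.sum(p * np.log(q + eps))

p_t = np.array([0.7, 0.2, 0.1])  # illustrative teacher distribution
p_s = np.array([0.5, 0.3, 0.2])  # illustrative student distribution
```

When the student matches the teacher exactly, DKL is 0 and the cross-entropy reduces to the teacher's entropy; if the output is constant regardless of the input, DKL stays at 0, which is the collapse signal the table tracks.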

4. Compute requirements

๋‹ค์Œ ํ‘œ๋Š” 2๊ฐœ์˜ 8-GPU machine์—์„œ ViT-S/16 DINO์„ ์‹คํ–‰ํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ์ด ์‹œ๊ฐ„๊ณผ GPU๋‹น ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.


Multi-crop์ด ์ •ํ™•๋„์™€ ์‹คํ–‰์‹œ๊ฐ„ tradeoff๋ฅผ ๊ฐœ์„ ํ•˜๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.

5. Training with small batches

๋‹ค์Œ์€ batch size์— ๋Œ€ํ•œ ํšจ๊ณผ๋ฅผ ๋‚˜ํƒ€๋‚ธ ํ‘œ์ด๋‹ค.

๋ฐ˜์‘ํ˜•