[Paper] Multi-Concept Customization of Text-to-Image Diffusion


Summary

A method for customizing a text-to-image diffusion model

Multi-Concept Customization of Text-to-Image Diffusion Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, Jun-Yan Zhu

One unusual thing about this paper: it opens with a big figure right from the start

Figure 1. Given a few images of a new concept, our method augments a pre-trained text-to-image diffusion model, enabling new generations of the concept in unseen contexts

With just a few new photos, the existing pre-trained text-to-image diffusion model generates the new concept in contexts it has not seen before

Furthermore, we propose a method for composing multiple new concepts together, for example, V ∗ dog wearing sunglasses in front of a moongate. We denote personal categories with a new modifier token V ∗

They also propose a way to compose multiple new concepts together and, as in the example above, use V* (a modifier token) so that personal categories can be used as well

Abstract

~, a user often wishes to synthesize instantiations of their own concepts (for example, their family, pets, or items).
Can we teach a model to quickly acquire a new concept, given a few examples?
Furthermore, can we compose multiple new concepts together?

The paper seems to be tackling two things

Quickly teach Their "Own Concept" (New Concept, e.g. their own dog)
- Only a small number of parameters are optimized, and fine-tuning reportedly takes about 6 minutes
Jointly train for multiple concepts or combine multiple fine-tuned models
- The goal is to generate not one but two concepts together
  - "Draw a running dog" (one concept) -> "Draw a dog running in a sea that is being tinged with red"
  - "A dog running in a sea tinged with red"

The second case is something we can try ourselves with the Stable Diffusion demo

https://stablediffusionweb.com/#demo


When two concepts overlap like this, the model gets confused about whether to paint the sea or the dog red

This paper can be seen as an attempt to solve that

Introduction

Existing text-to-image models

By simply querying a text prompt, users are able to generate images of unprecedented quality

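Simply by querying a text prompt, users can generate images of unprecedented quality

Here, "querying a text prompt" means

giving the model a text description (a prompt) and having it generate an image based on that text

A prompt here is the input given to a generative model to get an output (the same term is used for the input to an LLM)

<Limitations>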

users often wish to synthesize specific concepts from their own personal lives
...
As these concepts are by nature personal, they are unseen during large-scale model training

Users want to add specific concepts that come from their own personal lives

But such data is never seen during large-scale pretraining, so these are concepts the model has not learned

Model Customization

This poses a few challenges – first, the model tends to forget [14, 38, 55] or change [37, 43] the meanings of existing concepts: e.g., the meaning of “moon” being lost when adding the “moongate” concept.
Secondly, the model is prone to overfit the few training samples and reduce sampling variations.
...
compositional fine-tuning: the ability to extend beyond tuning for a single, individual concept and compose multiple concepts together

<Problems>

Adding the ability to "synthesize specific concepts" mentioned above runs into several problems

Existing concepts get altered or disappear
- e.g. after learning "moongate", the moon itself starts being drawn like a gate
The model overfits to the specific concept we inject (there are only a few input images to begin with)
compositional fine-tuning: going beyond a single concept and composing multiple concepts together (e.g. the original concept "moon" coexisting with the new concept "moongate")
- It is hard to keep each individual concept intact while also combining multiple concepts in one image
- Concept mixing and concept omission occur

<Solution>

To overcome the above-mentioned challenges, we identify a small subset of model weights, namely the key and value mapping from text to latent features in the cross-attention layers.
Fine-tuning these is sufficient to update the model with the new concept

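To solve the problems above, they identify the key and value projection matrices in the cross-attention layers, which map text to latent features

(The paper calls this a "small subset of model weights")

Updating only these key and value matrices is said to be sufficient for the model to learn a new concept

* Preventing concept forgetting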

To prevent model forgetting, we use a small set of real images with similar captions as the target images.

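To keep concepts from being forgotten during training, the new images are used together with a small set of real images whose captions are similar to the target's

In this way the "synthesize specific concepts" ability mentioned above can be learned

Besides learning a single new personal concept, the paper also covers learning multiple concepts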

To inject multiple concepts, our method supports training on both simultaneously or training them separately and then merging

There are two ways

- Train on both concepts simultaneously

- Train each concept separately, then merge them

We build our method on Stable Diffusion and experiment on various datasets with as few as four training images. For adding single concepts, our method shows better text alignment and visual similarity to the target images than concurrent works and baselines. More importantly, our method can compose multiple new concepts efficiently, whereas concurrent methods struggle and often omit one. Finally, our method only requires storing a small subset of parameters (3% of the model weights) and reduces the fine-tuning time (6 minutes on 2 A100 GPUs, 2 − 4× faster compared to concurrent works).

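In concrete terms, this is how it was done

model : built on Stable Diffusion
Data : experiments on various datasets with as few as four training images
method
- method 1 : adding a single specific concept (For adding single concepts)
  - Better text alignment and visual similarity to the target images than concurrent works and baselines
- method 2 : multiple concepts
  - Other methods often omit one of the concepts, whereas this method can compose multiple new concepts
  - (of course, nobody says their own method is bad)
Training time : fine-tuning takes about 6 minutes on 2 A100 GPUs, which is 2 to 4 times faster than concurrent works
The claimed advantage is that instead of fine-tuning all the model weights, only a small subset (3% of the model weights) has to be fine-tuned and stored
- DreamBooth, for comparison, fine-tunes all of the model weights

<Supplement: Cross Attention>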

Attention (query / key / value)

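1. The value the user wants to look up (the query) is given as input.
2. The similarity between the query and every key stored in the dictionary is computed.
3. The similarities are converted into probabilities.
4. The values are combined in a weighted sum using the probabilities from step 3, and that sum is returned as the final result.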
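The four steps can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper: the function name `attention`, the toy keys and values, and the scaling by the square root of the feature size (the usual scaled dot-product convention) are all assumptions.

```python
import numpy as np

def attention(query, keys, values):
    """Toy single-query attention following the four steps above."""
    # Step 2: similarity between the query and every key (scaled dot product, an assumption).
    scores = keys @ query / np.sqrt(query.shape[-1])
    # Step 3: turn similarities into probabilities with a softmax.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Step 4: weighted sum of the values.
    return weights @ values, weights

# Step 1: the query is the thing we are looking up.
query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])

output, weights = attention(query, keys, values)
```

Keys 0 and 2 have the same dot product with the query, so they receive the same weight, and key 1 (orthogonal to the query) receives the smallest one.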

Cross attention
- an attention mechanism in Transformer architecture that mixes two different embedding sequences
- the two sequences can be of different modalities (e.g. text, image, sound)
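- An attention operation where key and value come from the same source but the query comes from a different one (i.e. query ≠ key = value)
  - The attention used in Seq2Seq + attention models
  - The attention used in the Transformer decoder
Self attention
- An attention operation where query, key, and value all come from the same source (i.e. query = key = value)
  - The attention used in the Transformer encoder

In other words, cross attention is a way of mixing two different embedding sequences (which is why the query comes from somewhere else)

My guess is that in this paper the two embedding sequences are the image and the text:

given the text feature c and the latent image feature f,

$Q = W^q f, \quad K = W^k c, \quad V = W^v c$

and they seem to have chosen to update only the weights that produce K and V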

Related work

(I organized this part on my own, separately from the paper, so the content may differ slightly)

Deep generative models

Generative models aim to synthesize samples from a data distribution, given a set of training examples.

Generative models aim to synthesize samples from a data distribution

Examples include GANs, VAEs, autoregressive models, flow-based models, and diffusion models

recent text-to-image models, trained on extremely large-scale data, have demonstrated remarkable generalization ability. However, such models are by nature generalists and struggle to generate specific instances

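Earlier text-to-image models are trained on massive amounts of data and therefore generalize very well,

but precisely because they are generalists trained on such broad data, they are weak on personal or rare words, which is the point being stressed here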

Image and model editing

... a user often wishes to edit a single, specific image.
Several works aim at leveraging the capabilities of generative models, such as GANs [2–4, 52, 89] or diffusion models [11, 33, 47] towards editing.
... A closely related line of work edits a generative model directly. Whereas these methods aim to customize GANs, our focus is on text-to-image models.

People have long wanted to edit a single, specific image, so earlier attempts do exist,

and there are methods that edit the generative model itself, but those customize GANs, whereas this paper focuses on text-to-image models

Transfer learning

A method of efficiently producing a whole distribution of images is leveraging a pretrained model and then using transfer learning

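* leveraging : making use of

An efficient way to produce a whole distribution of images is to use a pre-trained model together with transfer learning

Here, transfer learning is

a machine-learning approach that reuses a pre-trained model as the starting point for a model on a new task.

For example,

it means taking a network that was used to solve an image classification problem and applying it to a different dataset or a different task (another example: taking a model trained to classify animals and further training it on human faces to build a model that estimates age)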

Different from these works, which focus on tuning whole models to single domains, we wish to acquire multiple new concepts without catastrophic forgetting
... we can synthesize the new concepts in composition with these existing concepts
... we adapt a small number of existing parameters and do not require additional parameters

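However, in the example above (taking a model trained to classify animals and further training it on human faces to estimate age), the re-trained model no longer classifies animals well (it might, but this is an example for intuition)

This happens because the existing concepts were forgotten, i.e. because the whole model was tuned to a single domain. To prevent this, the new concept is made to compose with the existing concepts (on top of that, only a small number of parameters are changed and no additional parameters are needed)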

Adapting text-to-image models

Similar to our goals, two concurrent works, DreamBooth [62] and Textual Inversion [20]

DreamBooth and Textual Inversion are the two works that attempted something similar

The differences among the three can be summarized as follows

< Differences >

DreamBooth

A text-to-image generative model that can be personalized (devised to re-synthesize the subject of existing images)
To express personalization, a unique identifier [V] is attached to the subject, ex) a [V] dog
Fine-tuning all the parameters

Textual Inversion

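Introducing and optimizing a word vector for the new concept (a word vector for the new concept is injected into the existing model and optimized)
It finds a new embedding in the pre-trained model's text embedding space that represents the specific concept (describes the images well) and links this embedding to a new word (pseudo-word: an input word the model does not know), thereby adding a new word
In short, given images of a concept, it finds the embedding that best matches the images and ties it to a pseudo-word standing for the concept, which is how the new concept is added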

Custom Diffusion

Compositional fine-tuning of multiple concepts
An effort to make multiple concepts composable as well
Only fine-tune a subset of cross-attention layer parameters which significantly reduces the fine-tuning time
Fine-tuning only a subset of the cross-attention layers cuts the training time considerably

Designing an Encoder for Fast Personalization of Text-to-Image Models

Supplement : [DreamBooth]

https://dreambooth.github.io/

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

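DreamBooth fine-tunes the entire model

However, it uses a loss that preserves the specific prior the model already had

It can generate images of, say, the pet you own

ex) the subject is marked with an identifier, as in "[V] dog"

The figure below shows the successful cases

The limitations were clear, though

Looking at the three failure cases below

(a) Photos of something on the moon or inside the ISS (International Space Station) are scarce in the data, so the model is weak on such contexts

(b) The input subject and the prompt content get tangled together (entanglement), so the shape or color of the subject changes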

Supplement : [Textual Inversion]

https://arxiv.org/abs/2208.01618

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion


It can be tried via this link

https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Textual-Inversion

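The idea is to represent the concept you want to teach as a single word

and to let the model learn that word by optimizing its token embedding

This looks pretty hard, but I did find a blog post that explains it clearly

Attaching its images here is awkward, so I only link it as a reference at the end of this section and explain as best I can

Using the model figure above, the explanation is quite simple

First, set up a sentence such as "A Photo of S*"

and convert it into tokens

It then becomes [508, 701, 73, *] (in the actual implementation, * is reportedly encoded as 338, 265)

This detail appears only in the implementation

Before these tokens go into CLIP, a start token (42604) is placed at the front and the end token (42605) is appended until the sequence length reaches 77

The result is [42604, 508, 701, 73, 338, 265, 42605, 42605, ..., 42605]

This sequence is fed to the CLIP Transformer model and converted into an embedding of fixed size 77 x 768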
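The padding step described above can be sketched as follows. This is only an illustration: `pad_tokens` is a hypothetical helper, not a function from CLIP or any library, and the token IDs are simply the ones quoted in this post, not values checked against a real tokenizer.

```python
def pad_tokens(prompt_ids, start_id, end_id, length=77):
    """Wrap prompt tokens with a start token and fill up to `length` with end tokens."""
    ids = [start_id] + list(prompt_ids) + [end_id]
    if len(ids) > length:
        raise ValueError("prompt is too long for the fixed context length")
    # Pad with the end token until the sequence has the fixed length.
    return ids + [end_id] * (length - len(ids))

# IDs as quoted in the post (illustrative values only).
START_ID, END_ID = 42604, 42605
prompt_ids = [508, 701, 73, 338, 265]  # "A photo of S*", with S* split into two tokens

ids = pad_tokens(prompt_ids, START_ID, END_ID)
```

The text encoder then maps each of the 77 positions to a 768-dimensional vector, which gives the 77 x 768 embedding mentioned above.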

In the figure, everything except v* has a lock on it, while v* is marked with a "?"

We optimize the embedding vector v∗ associated with our pseudo-word S∗, using a reconstruction objective

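In other words, a new "embedding v" is added, and only this "?" part is trained while everything else is left untouched

The resulting 77 x 768 embedding is fed to the diffusion model, which generates an image from noise, and the loss comes from comparing it with the original image

That loss is applied to the embedding to update the embedding vectors, but all other tokens are ignored (frozen) and only S* changes

(That is, only the S* embedding is passed to the optimizer, so only S* is updated)

The effect is similar to training on S* alone

This causes a problem

I found the example that explains it best, and I'll attach it with as much mosaic (blurring) as I can

Suppose a picture like the following

is trained with the sentence "1 girl, solo, portrait, looking at viewer, art of S*"

The optimizer is applied only to S*, so when you prompt the trained model with S*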

https://arca.live/b/hypernetworks/62511421?p=1

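you get a picture like the following

The reason the picture is completely unrelated, not even "1 girl", is that

every feature other than "1 girl, solo, portrait, looking at viewer" was absorbed into S*

In short, with a sentence of the form "A photo of S*", the goal is to find a single word embedding that leads to reconstructing the images in the given small dataset. In addition, the goal is to find a new pseudo-word in the text encoder's embedding space that captures both high-level semantics and fine visual detail.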

(https://devocean.sk.com/blog/techBoardDetail.do?page=&query=&ID=164320&boardType=writer&searchData=sam56903&subIndex=&idList=&pnwriterID=sam56903)

Method

Building the regularization dataset (left of the figure) / fine-tuning only a subset of the cross-attention layers (right of the figure)

The method uses the following two components

Regularization Dataset

- To prevent language drift

Fine-Tuning a Subset of Cross-attention

- Faster convergence, prevent overfitting, only 5% of all parameters

Single-Concept Fine-tuning

Given a pretrained text-to-image diffusion model, we aim to embed a new concept in the model given as few as four images and the corresponding text description

The goal is to embed a new concept into a pretrained text-to-image diffusion model using as few as four images and their text descriptions,

while also keeping the knowledge (concepts) the model already has

This can be challenging as the updated text-to-image mapping might easily overfit the few available images.

Even at a glance this is hard: the model is very powerful but there are only four training images, so it can easily overfit

The experimental setup is as follows

Backbone : Stable Diffusion (built on top of the Latent Diffusion Model)

-> Stable Diffusion is one instance of a latent diffusion model

1. The LDM first encodes images into a latent representation using hybrid objectives of VAE, Patch-GAN, and LPIPS

(Because of this step, running the encoder and then the decoder can recover the input image)

2. Then a diffusion model is trained on the latent representation, with the text condition injected into the model through cross-attention

The training objective is as follows

Learning Objective of Diffusion Models

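Diffusion models [28, 68] are a class of generative models that aim to approximate the original data distribution q(x0) with pθ(x0):
$$p_\theta(x_0) = \int \Big[ p_\theta(x_T) \prod_t p_\theta^t(x_{t-1} \mid x_t) \Big] \, dx_{1:T}$$
where x1 to xT are latent variables of a forward Markov chain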

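A diffusion model is a generative model that aims to approximate the real data distribution q(x0) with pθ(x0), where x1 to xT are latent variables of a forward Markov chain satisfying

$x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon$

The model is trained to learn the reverse process of a fixed-length Markov chain (usually 1000 steps)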
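The forward (noising) step x_t = √α_t x_0 + √(1 − α_t) ε can be sketched in NumPy. This is a toy illustration under assumptions: `add_noise` is a hypothetical helper, α_t is passed in directly rather than taken from a real noise schedule, and a 4 x 4 array of ones stands in for an image or latent.

```python
import numpy as np

def add_noise(x0, alpha_t, rng):
    """Sample x_t from x_0 in one shot: sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps."""
    eps = rng.standard_normal(x0.shape)  # Gaussian noise with the same shape as x0
    x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = np.ones((4, 4))  # stand-in for a clean image or latent

x_clean, _ = add_noise(x0, alpha_t=1.0, rng=rng)    # no noise added yet
x_mid, _ = add_noise(x0, alpha_t=0.5, rng=rng)      # half signal, half noise (in variance)
x_noise, eps = add_noise(x0, alpha_t=0.0, rng=rng)  # pure Gaussian noise
```

With α_t = 1 the sample equals x_0, and with α_t = 0 it is pure noise, which matches the idea that the chain runs from the data toward a Gaussian.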

Given noisy image xt at timestep t, the model learns to denoise the input image to obtain xt−1

That is, given the noisy image xt at timestep t, the model learns to denoise the input image to obtain xt-1

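The training objective of the diffusion model can be simplified to:
$$\mathbb{E}_{\epsilon, x, c, t}\big[\, w_t \,\lVert \epsilon - \epsilon_\theta(x_t, c, t) \rVert \,\big]$$
where εθ is the model prediction and wt is a time-dependent weight on the loss

εθ : the model prediction

wt : the time-dependent loss weight at timestep t

c : the condition, given as text

t : the diffusion timestep

xt : the noisy image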
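For a single sample, the quantity inside the expectation is a weighted norm of the gap between the true noise and the model's prediction. A minimal sketch, assuming a hypothetical helper `denoising_loss` and hand-picked toy vectors in place of a real network output:

```python
import numpy as np

def denoising_loss(eps, eps_pred, w_t=1.0):
    """Per-sample term w_t * ||eps - eps_pred||; the full objective averages this."""
    return w_t * np.linalg.norm(eps - eps_pred)

eps = np.array([3.0, 4.0])        # the noise that was actually added
perfect = denoising_loss(eps, eps)  # a perfect prediction gives zero loss
off = denoising_loss(eps, np.zeros(2), w_t=2.0)  # predicting "no noise" is penalized
```

In practice εθ(x_t, c, t) would be the U-Net output for the noisy latent, the text condition, and the timestep; here it is replaced by fixed arrays so the arithmetic is easy to follow.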

During inference, a random Gaussian image (or latent) xT is denoised for fixed timesteps using the model

At inference time, a random Gaussian image xT is denoised over a fixed number of timesteps

Naive baseline for the goal of fine-tuning is to update all layers to minimize the loss in Eqn. 2 for the given text-image pairs

But updating all layers like this is

computationally inefficient, and

fine-tuning the whole model on just a few images leads to overfitting

Rate of change of weights

Following Li et al. [39], we analyze the change in parameters for each layer in the finetuned model on the target dataset with the loss in Eqn. 2, i.e., $\Delta_l = \lVert \theta'_l - \theta_l \rVert / \lVert \theta_l \rVert$, where $\theta'_l$ and $\theta_l$ are the updated and pretrained model parameters of layer l.

When a concept such as moongate is trained by fine-tuning every layer,

they wanted to see how much each layer changes,

so Δ_l is defined as above

(the norm of the change relative to the norm of the pretrained weights)

θ' is the updated model parameter and θ is the pretrained model parameter
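The per-layer measure Δ_l = ||θ'_l − θ_l|| / ||θ_l|| is straightforward to compute. The sketch below uses made-up layer names and made-up weight values purely for illustration; they are not measurements from the paper.

```python
import numpy as np

def relative_change(theta_pretrained, theta_updated):
    """Delta_l: norm of the weight change divided by the norm of the pretrained weight."""
    return np.linalg.norm(theta_updated - theta_pretrained) / np.linalg.norm(theta_pretrained)

# Hypothetical layers with toy weights (not real model values).
pretrained = {
    "cross_attn.to_k": np.ones((2, 2)),
    "conv_block": np.ones((2, 2)),
}
finetuned = {
    "cross_attn.to_k": np.ones((2, 2)) * 1.5,   # changes a lot
    "conv_block": np.ones((2, 2)) * 1.01,       # barely moves
}

delta = {name: relative_change(pretrained[name], finetuned[name]) for name in pretrained}
```

Averaging such values per layer type (cross-attention, self-attention, the rest) is one way to reproduce the kind of comparison the paper describes.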

The parameters are grouped into the following three types

cross-attention (between the text and image)

self-attention (within the image itself)

rest of parameters (conv blocks, norm layers in the diffusion model U-Net)

With this grouping, the weights of the cross-attention layers (between text and image) changed much more than those of the other layers

This suggests it plays a significant role during fine-tuning, and we leverage that in our method.

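(The cross-attention layers make up only 5% of the model, so the assumption is that fine-tuning only them should still give good performance)

The reasoning seems to be: if they change this much while being only 5% of the model parameters, they must be critical to performance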

Model Fine-tuning

As mentioned above, they work on the cross-attention block

Cross-attention block modifies the latent features of the network according to the condition features, (i.e., text features in the case of text-to-image diffusion models)

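The cross-attention block modifies the network's latent features according to the condition features (text features, in the case of text-to-image diffusion)

<Single-head Cross Attention>

Given the text feature c and the latent image feature f,

the query projection weight is applied only to the image feature,

and the key and value projections are applied only to the text feature c:

$Q = W^q f, \quad K = W^k c, \quad V = W^v c$

With this setup, the familiar attention formula is used,

$\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{Q K^\top}{\sqrt{d'}}\right) V$

where d' is the output dimension of the key and query features

Since the parts tied to the text are Wk and Wv,

the experiments update only these two parameters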

Latent image feature f and text feature c are projected into query Q, key K, and value V . Output is a weighted sum of values, weighted by the similarity between the query and key features. We highlight the updated parameters Wk and Wv in our method

The figure above also shows that Wq is frozen
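A single-head cross-attention layer with Wq frozen and only Wk and Wv marked as trainable can be sketched as below. This is a toy NumPy illustration, not the authors' code: the shapes, the parameter dictionary, and the `trainable` set are assumptions, and rows (rather than columns) are used as feature vectors, so the projections appear as `f @ Wq.T`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f, c, Wq, Wk, Wv):
    """Query from image features f, key and value from text features c."""
    Q = f @ Wq.T  # (num_pixels, d')
    K = c @ Wk.T  # (num_tokens, d')
    V = c @ Wv.T  # (num_tokens, d')
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return attn @ V, attn

rng = np.random.default_rng(0)
num_pixels, img_dim, num_tokens, txt_dim, d_out = 4, 6, 3, 5, 8
f = rng.normal(size=(num_pixels, img_dim))  # latent image features
c = rng.normal(size=(num_tokens, txt_dim))  # text features

params = {
    "Wq": rng.normal(size=(d_out, img_dim)),
    "Wk": rng.normal(size=(d_out, txt_dim)),
    "Wv": rng.normal(size=(d_out, txt_dim)),
}
# Only the text-side projections would be handed to the optimizer; Wq stays frozen.
trainable = {name for name in params if name in ("Wk", "Wv")}

out, attn = cross_attention(f, c, **params)
```

Each row of `attn` says how strongly one image location attends to each text token, which is why changing only Wk and Wv is enough to change how text is mapped onto the image features.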

If the authors use "latent feature" to mean a feature from a latent space that describes the observed objects well, then it occurred to me that they may be focusing on describing a picture well rather than on recognizing it

A passage that supports this is

The task of fine-tuning aims at updating the mapping from given text to image distribution, and the text features are only input to Wk and Wv projection matrix in the cross-attention block
...
Therefore, we propose to only update Wk and Wv parameters of the diffusion model during the fine-tuning process

The stated goal of fine-tuning is to update the mapping from the given text to the image distribution

As shown in our experiments, this is sufficient to update the model with a new text-image paired concept.

They say that updating only these parameters is sufficient to teach the model a new text-image paired concept

Text Encoding

Given target concept images, we require a text caption as well. If there exists a text description, e.g., moongate, we use that as a text caption.

If a text description already exists (e.g. moongate : moon + gate), it is used as the text caption as-is

For personalization-related use-case where the target concept is a unique instance of a general category, e.g., pet dog, we introduce a new modifier token embedding, i.e., V ∗ dog.

If the target is a unique instance of a general category (e.g. pet dog : the user's own pet),

a new modifier token embedding is introduced (e.g. V* dog) -> I suspect this part was influenced by DreamBooth

During training, V ∗ is initialized with a rare occurring token embedding and optimized along with cross-attention parameters. An example text caption used during training is, photo of a V∗ dog

The V* token is initialized from a rarely occurring token embedding and optimized together with the cross-attention parameters (the same kind of "?" training shown above)

Regularization Dataset

In addition, something called a regularization dataset is used when training the model

Fine-tuning on the target concept and text caption pair can lead to the issue of language drift

Suppose the word moongate is simply a compound the user made by joining moon and gate

If the model is fine-tuned on a few samples of moongate, it loses the concepts of moon and gate

(Middle image, w/o Reg -> without Regularization : the moon looks odd, it has turned into something like a gate)

To prevent this, we select a set of 200 regularization images from the LAION-400M [66] dataset with corresponding captions that have a high similarity with the target text prompt, above threshold 0.85 in CLIP [54] text encoder feature space.

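This phenomenon is called language drift, and to prevent it

a set of 200 regularization images is selected and used together during training

These 200 images are the ones whose captions have a similarity with the target text prompt above the threshold 0.85 in the CLIP text encoder feature space

(taken from the LAION-400M dataset)

The reason is that the pretrained diffusion model used in this work is Stable Diffusion,

and Stable Diffusion was trained on the LAION-5B dataset, so LAION-400M can be viewed as roughly a subset of that data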
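The selection rule (captions whose CLIP text features have similarity above 0.85 with the target prompt, up to 200 items) can be sketched with cosine similarity. The 2-D vectors below are toy stand-ins for CLIP text features, `select_regularization` is a hypothetical helper, and keeping the most similar captions first is an assumption; the source only states the threshold and the count.

```python
import numpy as np

def select_regularization(target_feat, caption_feats, threshold=0.85, max_items=200):
    """Return indices of captions whose cosine similarity to the target exceeds the threshold."""
    t = target_feat / np.linalg.norm(target_feat)
    c = caption_feats / np.linalg.norm(caption_feats, axis=1, keepdims=True)
    sims = c @ t
    keep = np.flatnonzero(sims > threshold)
    # Assumption: prefer the most similar captions when more than max_items qualify.
    keep = keep[np.argsort(-sims[keep])][:max_items]
    return keep, sims

target = np.array([1.0, 0.0])          # toy feature of the target prompt
captions = np.array([
    [1.0, 0.1],   # very similar
    [0.0, 1.0],   # unrelated
    [0.9, 0.5],   # similar enough
    [1.0, 0.0],   # identical direction
])

selected, sims = select_regularization(target, captions)
```

The images paired with the selected captions would then be mixed into training alongside the few target images.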

Multiple-Concept Compositional Fine-tuning

Two approaches are used

Joint training on Multiple concepts
Constrained optimization to merge concepts

Joint training on Multiple concepts

For fine-tuning with multiple concepts, we combine the training datasets for each individual concept and train them jointly with our method
To denote the target concepts, we use different modifier tokens, Vi∗, ...

- Combine the training datasets of the individual concepts, then train them jointly with the single-concept method

- Use a different modifier token for each concept (V*, U*, W*, ...) during training
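Building the joint training set amounts to pooling the per-concept datasets and giving each concept its own modifier token in the caption. In this sketch the token strings `<new1>` and `<new2>`, the file names, and the helper `build_joint_dataset` are all hypothetical; only the caption pattern "photo of a V* dog" comes from the text above.

```python
def build_joint_dataset(concepts):
    """Pool (image, caption) pairs from several concepts, one modifier token per concept."""
    samples = []
    for modifier, (category, images) in concepts.items():
        caption = f"photo of a {modifier} {category}"
        samples.extend((image, caption) for image in images)
    return samples

# Hypothetical concepts: modifier token -> (general category, image files).
concepts = {
    "<new1>": ("cat", ["cat_0.jpg", "cat_1.jpg"]),
    "<new2>": ("chair", ["chair_0.jpg"]),
}

joint = build_joint_dataset(concepts)
```

The pooled list is then trained on with the same single-concept procedure, so each modifier token ends up tied to its own concept.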

restricting the weight update to cross-attention key and value parameters leads to significantly better results for composing two concepts compared to methods like DreamBooth, which fine-tune all the weights.

They claim this works better than DreamBooth. The key is that DreamBooth updates all the weights during fine-tuning, whereas this method changes only the cross-attention key and value parameters, which they say is better for composing two concepts

Multi-concept fine-tuning results.
First row: our method has higher visual similarity with the personal cat and chair images shown in the first column while following the text condition.
Second row: DreamBooth omits the cat in 3 out of 4 images, whereas our method generates both cats and wooden pots.
Third row: our method generates the target flower in the wooden pot while maintaining the visual similarity to the target images.
Fourth row: generating the target table and chair together in a garden.
For all settings, our optimization-based approach and joint training perform better than DreamBooth, and joint training performs better than the optimization-based method.

Constrained optimization to merge concepts

As our method only updates the key and value projection matrices corresponding to the text features, we can subsequently merge them to allow generation with multiple fine-tuned concepts

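- The separately fine-tuned concepts are merged afterwards (subsequently)

This is possible because only the key and value projection matrices are updated, so the two can be put together

Let's look at the reasoning in a bit more detail

Set up the matrices as follows

$\{W^k_{0,l}, W^v_{0,l}\}_{l=1}^{L}$

This set denotes the key and value matrices of all L cross-attention layers in the pretrained model

$\{W^k_{n,l}, W^v_{n,l}\}_{l=1}^{L}$

This set denotes the corresponding updated matrices for each added concept n ∈ {1 · · · N}

As our subsequent optimization applies to all layers and key-value matrices, we will omit superscripts {k, v} and layer l for notational clarity.

Since the optimization that follows applies to all layers and to both key and value matrices, the superscripts {k, v} and the layer index l are dropped (to avoid confusion) -> though honestly I find this more confusing

- Formulate the composition objective as the following constrained least squares problems:

$$\hat{W} = \underset{W}{\arg\min}\; \lVert W C_{\text{reg}}^\top - W_0 C_{\text{reg}}^\top \rVert_F \quad \text{s.t.} \quad W C^\top = V, \quad \text{where } C = [c_1 \cdots c_N]^\top \text{ and } V = [W_1 c_1^\top \cdots W_N c_N^\top]$$

The goal is pursued through this "constrained least squares problem"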

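C : the text features, of dimension d

Creg : the text features of roughly 1000 randomly sampled captions, used for regularization

c1 ~ cN : the text features of each individual concept

W0 : the key or value weight of the original pretrained model (one symbol covers both, since the superscripts were dropped)

W : the key or value weight of the updated model

C is compiled of s target words across all N concepts, with all captions of each concept flattened and concatenated

(so s seems to be the total number of words across the target captions)

The argmin picks the W that changes the outputs on the regularization captions as little as possible compared with W0

In other words, when re-training (fine-tuning), the weights are updated to the ones that stay closest to the meaning the model already had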

Intuitively, the above formulation aims to update the matrices in the original model, such that the words in target captions in C are mapped consistently to the values obtained from finetuned concept matrices

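Intuitively, the matrices of the original model are updated so that the words of the target captions in C are mapped consistently to the values obtained from the fine-tuned concept matrices

Assuming Creg is non-degenerate and the solution exists,

the objective can be solved in closed form using Lagrange multipliers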
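The constrained least squares problem (stay close to W0 on the regularization features C_reg, subject to W Cᵀ = V) can be checked numerically. Working through the Lagrangian gives Ŵ = W0 + (V − W0 Cᵀ)(d Cᵀ)⁻¹ d with d = C (C_regᵀ C_reg)⁻¹; the sketch below implements that expression as my own derivation from the stated objective, not as code taken from the authors. The function name `merge_concepts`, the random toy matrices, and all sizes are assumptions.

```python
import numpy as np

def merge_concepts(W0, C_reg, C, V):
    """Closed-form merge: minimal change on C_reg while enforcing W @ C.T == V."""
    A = C_reg.T @ C_reg              # (dim, dim); invertible when C_reg is non-degenerate
    d = np.linalg.solve(A, C.T).T    # equals C @ inv(A) because A is symmetric
    residual = V - W0 @ C.T          # how far W0 is from the fine-tuned targets
    v_T = residual @ np.linalg.inv(d @ C.T)
    return W0 + v_T @ d

rng = np.random.default_rng(0)
dim, out_dim, num_reg = 16, 8, 64
W0 = rng.normal(size=(out_dim, dim))               # pretrained key (or value) matrix
W1 = W0 + 0.1 * rng.normal(size=(out_dim, dim))    # fine-tuned for concept 1
W2 = W0 + 0.1 * rng.normal(size=(out_dim, dim))    # fine-tuned for concept 2
c1 = rng.normal(size=(2, dim))                     # text features of concept 1 words
c2 = rng.normal(size=(3, dim))                     # text features of concept 2 words
C_reg = rng.normal(size=(num_reg, dim))            # regularization caption features

C = np.vstack([c1, c2])
V = np.hstack([W1 @ c1.T, W2 @ c2.T])              # targets from each fine-tuned matrix
W_merged = merge_concepts(W0, C_reg, C, V)
```

The merged matrix reproduces each fine-tuned model's outputs on its own concept words, while the objective keeps its behavior on the regularization captions as close to W0 as the constraint allows.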

The benefit of doing it this way is

Our proposed methods lead to the coherent generation of two new concepts in a single scene, as shown in Section 4.2.

It lets two new concepts be generated coherently in a single scene

(My guess is that something like a single representation holding both concepts gets mapped, and that is what is used at generation time)

Continuing the reasoning from this explanation and the results

representing the concept in an artistic style of watercolor paintings. Our method can also generate the mountains in the background,

For "A watercolor painting of V* tortoise plushy on a mountain", V* first picks out the specific tortoise,

and the concept "a watercolor painting of a tortoise plushy on a mountain"

is, I think, the combination of the concept "a tortoise plushy on a mountain" with the concept "watercolor painting"

DreamBooth seems to overfit because it adjusts all the parameters, and Textual Inversion shows its critical weakness of failing to preserve the context

Applying Textual Inversion to multiple concepts (this is also claimed to have problems, as follows)

Multi-concept composition using Textual Inversion

We observe that Textual Inversion struggles with the composition of two fine-tuned objects as shown in the above sample generations as well.

Training details

We train the models with our method for 250 steps in single-concept and 500 steps in two-concept joint training, on a batch size of 8 and learning rate 8×10^(−5).

250 steps are used for single-concept training and 500 steps for two-concept joint training

During training the target images are also randomly resized by 0.4 to 1.4×, and depending on the resize ratio, "very small", "far away", "zoomed in", or "close up" is added to the prompt.
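The resize-dependent prompt augmentation can be sketched as below. The 0.4 to 1.4 range and the four phrases come from the text; the cut-off ratios (`small_cutoff`, `large_cutoff`), the choice to prepend the phrase, and the helper name `augment_prompt` are assumptions for illustration.

```python
import random

def augment_prompt(prompt, ratio, rng, small_cutoff=0.6, large_cutoff=1.0):
    """Add a size hint to the prompt depending on how the image was resized."""
    if ratio < small_cutoff:   # image shrunk a lot (cutoff is an assumption)
        return f"{rng.choice(['very small', 'far away'])} {prompt}"
    if ratio > large_cutoff:   # image enlarged (cutoff is an assumption)
        return f"{rng.choice(['zoomed in', 'close up'])} {prompt}"
    return prompt

rng = random.Random(0)
ratio = rng.uniform(0.4, 1.4)  # random resize ratio for one training image

base = "photo of a V* dog"
unchanged = augment_prompt(base, 0.8, rng)
shrunk = augment_prompt(base, 0.4, rng)
enlarged = augment_prompt(base, 1.3, rng)
```

The same ratio would be used to resize the image itself, so the caption and the picture stay consistent.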

All datasets used in the paper were reportedly downloaded from Unsplash (except moongate)

For selecting the rare token as the modifier token V ∗ , we count the occurrence of the total 49408 tokens in 200K captions sampled from the LAION-400M dataset. We then select the token with ∼ 5−10 occurrences, with alphabetic representation, and not a substring of another token

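To choose a modifier token such as V*, they counted the occurrences of all 49408 tokens in 200K captions sampled from LAION-400M, then selected a token that occurs about 5 to 10 times, is written in alphabetic characters, and is not a substring of another token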
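The three selection criteria (about 5 to 10 occurrences, alphabetic, not a substring of another token) can be sketched with a simple counter. Whitespace splitting stands in for the real CLIP tokenizer here, and the captions, the tiny vocabulary, and the helper `pick_rare_tokens` are made up for illustration.

```python
from collections import Counter

def pick_rare_tokens(captions, vocab, low=5, high=10):
    """Return vocabulary tokens that are rare, alphabetic, and not inside another token."""
    counts = Counter(token for caption in captions for token in caption.split())
    rare = []
    for token in vocab:
        if not (low <= counts[token] <= high):
            continue  # too frequent or too rare
        if not token.isalpha():
            continue  # must have an alphabetic representation
        if any(token != other and token in other for other in vocab):
            continue  # must not be a substring of another token
        rare.append(token)
    return rare

# Toy data: each "caption" is a single token repeated a chosen number of times.
captions = ["ktn"] * 6 + ["dog"] * 50 + ["ab"] * 7 + ["abc"] * 7 + ["x1"] * 6
vocab = ["ktn", "dog", "ab", "abc", "x1"]

rare_tokens = pick_rare_tokens(captions, vocab)
```

Here "dog" is too frequent, "ab" is a substring of "abc", and "x1" is not alphabetic, so only "ktn" and "abc" remain as candidates.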

Experiments

Evaluation metrics

-Image alignment : Similarity in CLIP image feature space

-Text alignment : Text-image similarity in CLIP feature space

-KID : validation set of 500 real images from LAION-400M (similar concept retrieval) - used to look at image quality

-Human preference
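The two CLIP-based alignment scores boil down to cosine similarity between feature vectors. A minimal sketch, assuming the features have already been extracted: the 2-D vectors are toy stand-ins for CLIP features, and averaging over all generated/target pairs is an assumption about how the scores are aggregated.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def image_alignment(generated_feats, target_feats):
    """Mean pairwise cosine similarity between generated and target image features."""
    return float((normalize(generated_feats) @ normalize(target_feats).T).mean())

def text_alignment(generated_feats, text_feat):
    """Mean cosine similarity between generated image features and the prompt feature."""
    return float((normalize(generated_feats) @ normalize(text_feat)).mean())

generated = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy features of two generated images
targets = np.array([[2.0, 0.0]])                # toy feature of one target image
prompt = np.array([3.0, 0.0])                   # toy feature of the text prompt

img_score = image_alignment(generated, targets)
txt_score = text_alignment(generated, prompt)
```

One generated image matches perfectly and the other is orthogonal, so both toy scores come out halfway between 0 and 1.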

Discussion and Limitations

Limitations of our multi-concept fine-tuning.

As shown in Figure 11 in the paper, our method fails at difficult compositions like generating personal cat and dog in the same scene. We observe that the pretrained model also struggles with such compositions and hypothesize that our model inherits these limitations. Here, we analyze the attention map of each token on the latent image features in Figure 16. The “dog” and “cat” token attention maps are largely overlapping for both our and pretrained models, which might lead to worse composition.

My Thinking

The experiments update only the two parameters (Wk and Wv)

Latent image feature f and text feature c are projected into query Q, key K, and value V . Output is a weighted sum of values, weighted by the similarity between the query and key features. We highlight the updated parameters Wk and Wv in our method

The figure above also shows that Wq is frozen

A passage that supports this is

The task of fine-tuning aims at updating the mapping from given text to image distribution, and the text features are only input to Wk and Wv projection matrix in the cross-attention block
...
Therefore, we propose to only update Wk and Wv parameters of the diffusion model during the fine-tuning process

The stated goal of fine-tuning is to update the mapping from the given text to the image distribution

As shown in our experiments, this is sufficient to update the model with a new text-image paired concept.

But the tone feels a bit like "you saw our experiments, right? This is enough~", so I think there may be room for improvement here

The same goes for the Text Encoding step

Given target concept images, we require a text caption as well. If there exists a text description, e.g., moongate, we use that as a text caption.

If a text description already exists (e.g. moongate : moon + gate), it is used as the text caption as-is

I suspect that taking an existing description as-is could cause problems, but I can't come up with a concrete example

Paper link

https://arxiv.org/abs/2212.04488


Github

https://github.com/adobe-research/custom-diffusion

GitHub - adobe-research/custom-diffusion: Custom Diffusion: Multi-Concept Customization of Text-to-Image Diffusion (CVPR 2023)

