TokenFlow's Implementation Details: Everything That We Used
by @kinetograph

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Method

4.1 Key Sample and Joint Editing

4.2 Edit Propagation Via TokenFlow

5 Results

5.1 Qualitative Evaluation and 5.2 Quantitative Evaluation

5.3 Ablation Study

6 Discussion

7 Acknowledgement and References

A Implementation Details

A IMPLEMENTATION DETAILS

Stable Diffusion. We use Stable Diffusion as our pre-trained text-to-image model, specifically the StableDiffusion-v-2-1 checkpoint provided via the official HuggingFace webpage.


DDIM inversion. In all of our experiments, we use DDIM deterministic sampling with 50 steps. For inverting the video, we follow Tumanyan et al. (2023) and use DDIM inversion with a classifier-free guidance scale of 1 and 1000 forward steps. We extract the self-attention input tokens from this process, similarly to Qi et al. (2023).
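
The deterministic DDIM update and its inversion can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the linear beta schedule, the function names (`make_alphas`, `ddim_move`, `ddim_invert`, `ddim_sample`, `cfg_combine`), and the toy noise predictor are all assumptions made for illustration, and a real pipeline would call the Stable Diffusion U-Net on latents instead.

```python
import math

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance; with scale == 1 this returns eps_cond."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def make_alphas(num_steps, num_train=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative alpha products on a linear beta schedule (an assumption),
    subsampled to num_steps evenly spaced noise levels."""
    prod, cumulative = 1.0, []
    for i in range(num_train):
        beta = beta_start + (beta_end - beta_start) * i / (num_train - 1)
        prod *= 1.0 - beta
        cumulative.append(prod)
    stride = num_train // num_steps
    return cumulative[stride - 1::stride]

def ddim_move(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two noise levels.
    alpha_to < alpha_from adds noise (inversion); the reverse denoises."""
    out = []
    for xi, ei in zip(x, eps):
        x0_pred = (xi - math.sqrt(1.0 - alpha_from) * ei) / math.sqrt(alpha_from)
        out.append(math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * ei)
    return out

def ddim_invert(x, eps_fn, alphas):
    """Walk from the clean sample (alpha = 1) to the noisiest level."""
    levels = [1.0] + list(alphas)
    for a_from, a_to in zip(levels[:-1], levels[1:]):
        x = ddim_move(x, eps_fn(x), a_from, a_to)
    return x

def ddim_sample(x, eps_fn, alphas):
    """Walk back from the noisiest level to the clean sample."""
    levels = [1.0] + list(alphas)
    for a_from, a_to in zip(reversed(levels[1:]), reversed(levels[:-1])):
        x = ddim_move(x, eps_fn(x), a_from, a_to)
    return x

# Hypothetical stand-in for the noise-prediction network.
toy_eps = lambda x: [0.5, -0.25]

alphas = make_alphas(50)                 # 50 steps, as used for sampling
x_clean = [1.0, -2.0]
x_noisy = ddim_invert(x_clean, toy_eps, alphas)
x_back = ddim_sample(x_noisy, toy_eps, alphas)
```

With this constant toy predictor, the round trip recovers `x_clean` up to floating-point error. With a real network the noise estimate is evaluated at slightly different points during inversion and sampling, so the inversion is only approximate, which is one plausible reason a larger number of inversion steps can help.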


Runtime. Since we do not compute the attention module on most video frames (i.e., we compute the self-attention output only on the keyframes), our method is efficient at run-time: sampling the video takes 20% less time than per-frame editing. The inversion process with 1000 steps is the main run-time bottleneck of our method; in many cases, a significantly smaller number of steps (e.g., 50) is sufficient. Table 3 reports runtime comparisons using 50 steps in all methods. Notably, our sampling time is indeed faster than that of per-frame editing (PnP).



Baselines. For the Tune-A-Video baseline (Wu et al., 2022), we used the official repository. For Gen-1 (Esser et al., 2023), we used the platform on the Runway website. This platform outputs a video whose length and frame rate differ from those of the input video; therefore, we could not compute the warping error on its results. For Text2Video-Zero (Khachatryan et al., 2023b), we used the official repository with its depth conditioning configuration. For FateZero (Qi et al., 2023), we used the official repository and verified the run configurations with the authors.


This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Michal Geyer, Weizmann Institute of Science (equal contribution);

(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);

(3) Shai Bagon, Weizmann Institute of Science;

(4) Tali Dekel, Weizmann Institute of Science.