---
language:
- en
- de
- es
- fr
- ja
- ko
- zh
- it
- pt
library_name: diffusers
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
pipeline_tag: image-to-video
arxiv: 2601.03233
tags:
- image-to-video
- text-to-video
- video-to-video
- image-text-to-video
- audio-to-video
- text-to-audio
- video-to-audio
- audio-to-audio
- text-to-audio-video
- image-to-audio-video
- image-text-to-audio-video
- ltx-2
- ltx-video
- ltxv
- lightricks
pinned: true
demo: https://app.ltx.studio/ltx-2-playground/i2v
---

# LTX-2 Model Card

This model card focuses on the LTX-2 model, as presented in the paper [LTX-2: Efficient Joint Audio-Visual Foundation Model](https://huggingface.co/papers/2601.03233). The codebase is available [here](https://github.com/Lightricks/LTX-2).

LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

[![LTX-2 Open Source](https://img.youtube.com/vi/8fWAJXZJbRA/maxresdefault.jpg)](https://www.youtube.com/watch?v=8fWAJXZJbRA)

# Model Checkpoints

| Name | Notes |
|--------------------------------|----------------------------------------------------------------------------------------------------------------|
| ltx-2-19b-dev | The full model, flexible and trainable, in bf16 |
| ltx-2-19b-dev-fp8 | The full model in fp8 quantization |
| ltx-2-19b-dev-fp4 | The full model in nvfp4 quantization |
| ltx-2-19b-distilled | The distilled version of the full model, 8 steps, CFG=1 |
| ltx-2-19b-distilled-lora-384 | A LoRA version of the distilled model, applicable to the full model |
| ltx-2-spatial-upscaler-x2-1.0 | A 2x spatial upscaler for the LTX-2 latents, used in multi-stage (multiscale) pipelines for higher resolution |
| ltx-2-temporal-upscaler-x2-1.0 | A 2x temporal upscaler for the LTX-2 latents, used in multi-stage (multiscale) pipelines for higher FPS |

## Model Details
- **Developed by:** Lightricks
- **Model type:** Diffusion-based audio-video foundation model
- **Language(s):** English

# Online demo
LTX-2 is accessible right away via the following links:
- [LTX-Studio text-to-video](https://app.ltx.studio/ltx-2-playground/t2v)
- [LTX-Studio image-to-video](https://app.ltx.studio/ltx-2-playground/i2v)

# Run locally

## Direct use license
You can use the models (full, distilled, upscalers, and any derivatives of them) for the purposes permitted under the [license](./LICENSE).

## ComfyUI
We recommend using the built-in LTXVideo nodes, available through the ComfyUI Manager.
For manual installation instructions, please refer to our [documentation site](https://docs.ltx.video/open-source-model/integration-tools/comfy-ui).

## PyTorch codebase

The [LTX-2 codebase](https://github.com/Lightricks/LTX-2) is a monorepo with several packages, from the model definition in `ltx-core` to pipelines in `ltx-pipelines` and training capabilities in `ltx-trainer`.
The codebase was tested with Python >= 3.12 and CUDA > 12.7, and supports PyTorch ~= 2.7.

### Installation

```bash
git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2

# From the repository root
uv sync
source .venv/bin/activate
```

### Inference

To use our model, please follow the instructions in our [ltx-pipelines](https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/README.md) package.

## Diffusers 🧨

LTX-2 is supported in the [Diffusers Python library](https://huggingface.co/docs/diffusers/main/en/index) for text- and image-to-video generation.
Read more about LTX-2 in diffusers [here](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx2#diffusers.LTX2Pipeline.__call__.example).
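
Support for the LTX-2 pipelines in diffusers is relatively recent; the snippet below is an illustrative check (not an official one) that your installed diffusers build already ships them. If the import fails, upgrade diffusers, e.g. to the latest release or the main branch.

```python
# Illustrative check: verify that the installed diffusers build includes the
# LTX-2 pipelines used in the example below.
import diffusers
from diffusers.pipelines.ltx2 import LTX2Pipeline  # raises ImportError on older versions

print(diffusers.__version__)
```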

### Use with diffusers
To achieve production-quality generation, it is recommended to use the two-stage generation pipeline: Stage 1 generates video and audio latents at the base resolution, the latent upsampler then doubles the spatial resolution, and Stage 2 refines the upscaled latents in a few steps using the distilled LoRA.
Example of two-stage text-to-video inference:
```python
import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video

device = "cuda:0"
width = 768
height = 512

pipe = LTX2Pipeline.from_pretrained(
    "Lightricks/LTX-2", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload(device=device)
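# Note: sequential CPU offload minimizes VRAM usage by keeping only the submodule
# currently being executed on the GPU, at the cost of slower inference. With enough
# GPU memory, pipe.enable_model_cpu_offload(device=device) or pipe.to(device) is faster.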

prompt = "A beautiful sunset over the ocean"
negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."

# Stage 1 default (non-distilled) inference
frame_rate = 24.0
video_latent, audio_latent = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    sigmas=None,
    guidance_scale=4.0,
    output_type="latent",
    return_dict=False,
)

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    "Lightricks/LTX-2",
    subfolder="latent_upsampler",
    torch_dtype=torch.bfloat16,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

# Load the Stage 2 distilled LoRA
pipe.load_lora_weights(
    "Lightricks/LTX-2", adapter_name="stage_2_distilled", weight_name="ltx-2-19b-distilled-lora-384.safetensors"
)
pipe.set_adapters("stage_2_distilled", 1.0)
# VAE tiling is usually necessary to avoid OOM errors during VAE decoding
pipe.vae.enable_tiling()
# Change the scheduler to use the Stage 2 distilled sigmas as-is
new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
)
pipe.scheduler = new_scheduler
# Stage 2 inference with the distilled LoRA and sigmas
video, audio = pipe(
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=3,
    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],  # renoise with the first sigma value, see https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/ti2vid_two_stages.py#L218
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_lora_distilled_sample.mp4",
)
```
For more inference examples, including generation with the distilled checkpoint, see the [diffusers LTX-2 documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx2#diffusers.LTX2Pipeline.__call__.example).

## General tips
* Width and height must be divisible by 32, and the number of frames must be a multiple of 8 plus 1 (e.g., 121 or 161); see the sketch after this list.
* If the resolution or number of frames does not satisfy these constraints, the input should be padded with -1 up to the next valid size and then cropped back to the desired resolution and number of frames.
* For tips on writing effective prompts, please visit our [prompting guide](https://ltx.video/blog/how-to-prompt-for-ltx-2).

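The helper below is an illustrative sketch (the function names are ours, not part of the LTX-2 or diffusers packages) showing one way to snap a requested size and frame count to the nearest supported values before calling the pipeline:

```python
# Illustrative helpers, not part of the LTX-2 packages: snap requested
# dimensions and frame counts to values the model accepts.
def snap_dimension(value: int, multiple: int = 32) -> int:
    """Round down to the nearest multiple of `multiple` (for width and height)."""
    return max(multiple, (value // multiple) * multiple)

def snap_num_frames(num_frames: int) -> int:
    """Round down to the nearest valid frame count of the form 8 * k + 1."""
    return max(9, ((num_frames - 1) // 8) * 8 + 1)

print(snap_dimension(720))    # 704
print(snap_num_frames(120))   # 113
print(snap_num_frames(121))   # 121
```
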
### Limitations
- This model is not intended or able to provide factual information.
- As a statistical model, this checkpoint might amplify existing societal biases.
- The model may fail to generate videos that match the prompt perfectly.
- Prompt following is heavily influenced by the prompting style.
- The model may generate content that is inappropriate or offensive.
- When generating audio without speech, the audio may be of lower quality.

# Train the model

The base (dev) model is fully trainable.

It's straightforward to reproduce the LoRAs and IC-LoRAs we publish with the model by following the instructions in the [LTX-2 Trainer README](https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-trainer/README.md).

Training for motion, style, or likeness (sound and appearance) can take less than an hour in many settings.

## Citation

```bibtex
@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and Bitterman, Yaki and Kvochko, Andrew and Berkowitz, Avishai and Shalem, Daniel and Lifschitz, Daphna and Moshe, Dudu and Porat, Eitan and Richardson, Eitan and Shiran, Guy and Chachy, Itay and Chetboun, Jonathan and Finkelson, Michael and Kupchick, Michael and Zabari, Nir and Guetta, Nitzan and Kotler, Noa and Bibi, Ofir and Gordon, Ori and Panet, Poriya and Benita, Roi and Armon, Shahar and Kulikov, Victor and Inger, Yaron and Shiftan, Yonatan and Melumian, Zeev and Farbman, Zeev},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}
```