---
base_model: Lightricks/LTX-2.3
language:
- en
- de
- es
- fr
- ja
- ko
- zh
- it
- pt
library_name: ggml
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
pipeline_tag: image-to-video
arxiv: 2601.03233
tags:
- image-to-video
- gguf
- unsloth
- text-to-video
- video-to-video
- image-text-to-video
- audio-to-video
- text-to-audio
- video-to-audio
- audio-to-audio
- text-to-audio-video
- image-to-audio-video
- image-text-to-audio-video
- ltx-2
- ltx-2-3
- ltx-video
- ltxv
- lightricks
pinned: true
demo: https://app.ltx.studio/ltx-2-playground/i2v
widget:
- text: florist
  output:
    url: unsloth_flowers.mp4
---

This is a GGUF quantized version of [LTX-2.3](https://huggingface.co/Lightricks/LTX-2.3). <br>
unsloth/LTX-2.3-GGUF uses the [Unsloth Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs) methodology for SOTA performance.
- Important layers are upcast to higher precision.
- Uses tooling from [ComfyUI-GGUF](https://github.com/city96/ComfyUI-GGUF) by city96.

Two sets of GGUFs are published: one for the dev model and one for the distilled model. The distilled model is optimized for few-step generation (roughly 4-8 steps); dev needs more steps (at least 20) but produces better outputs. The distilled variant is useful as a drafting or refining model.

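If you want a different quantization level than the Q4_K_M used in the workflow below, you can list everything available in the repo first. A minimal sketch using the `huggingface_hub` Python API (installed as part of the setup below):

```python
from huggingface_hub import list_repo_files

# List every GGUF file (dev and distilled, all quant levels) in the repo
files = [f for f in list_repo_files("unsloth/LTX-2.3-GGUF") if f.endswith(".gguf")]
print("\n".join(sorted(files)))
```
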
In fact, the workflow published below uses the distilled LoRA on top of the dev model to refine the initial output.
<div>
<div style="display: flex; gap: 5px; align-items: center; ">
  <a href="https://github.com/unslothai/unsloth/">
    <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
  </a>
  <a href="https://discord.gg/unsloth">
    <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
  </a>
  <a href="https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs">
    <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
  </a>
</div>
</div>

# Workflow

Download the mp4 in the repo and open it with ComfyUI. The workflow used to produce the video is embedded in the file.
<Gallery />

To install ComfyUI:
```bash
# Create and activate an isolated environment
python3 -m venv .diffusion
source .diffusion/bin/activate

# ComfyUI itself, plus the Hugging Face hub client for downloads
git clone https://github.com/Comfy-Org/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
pip install huggingface_hub

# Custom nodes: the GGUF loader and KJNodes
cd custom_nodes/
git clone https://github.com/city96/ComfyUI-GGUF.git
cd ComfyUI-GGUF/
pip install -r requirements.txt
cd ..
git clone https://github.com/kijai/ComfyUI-KJNodes.git
cd ComfyUI-KJNodes/
pip install -r requirements.txt
cd ../../models
```

To download the model files used:
```bash
# Quantized DiT, video/audio VAEs, and text-encoder connectors
ln -s "$(hf download unsloth/LTX-2.3-GGUF ltx-2.3-22b-dev-Q4_K_M.gguf --quiet)" unet/.
ln -s "$(hf download unsloth/LTX-2.3-GGUF vae/ltx-2.3-22b-dev_video_vae.safetensors --quiet)" vae/.
ln -s "$(hf download unsloth/LTX-2.3-GGUF vae/ltx-2.3-22b-dev_audio_vae.safetensors --quiet)" vae/.
ln -s "$(hf download unsloth/LTX-2.3-GGUF text_encoders/ltx-2.3-22b-dev_embeddings_connectors.safetensors --quiet)" text_encoders/.

# Distilled LoRA, spatial upscaler, and the Gemma text encoder
ln -s "$(hf download Lightricks/LTX-2.3 ltx-2.3-22b-distilled-lora-384.safetensors --quiet)" loras/.
ln -s "$(hf download Lightricks/LTX-2.3 ltx-2.3-spatial-upscaler-x2-1.0.safetensors --quiet)" latent_upscale_models/.
ln -s "$(hf download unsloth/gemma-3-12b-it-qat-GGUF gemma-3-12b-it-qat-UD-Q4_K_XL.gguf --quiet)" text_encoders/.
ln -s "$(hf download unsloth/gemma-3-12b-it-qat-GGUF mmproj-BF16.gguf --quiet)" text_encoders/.
```
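
The symlink commands above assume the target subdirectories already exist under `models/`. Stock ComfyUI creates most of them, but if any are missing (for example `latent_upscale_models`), create them first:

```bash
# Run from inside ComfyUI/models; creates any missing target directories
mkdir -p unet vae text_encoders loras latent_upscale_models
```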

Then launch ComfyUI. Make sure you're using an up-to-date version of ComfyUI and all custom nodes.
```bash
cd ..
python main.py
```
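
By default the server binds to 127.0.0.1 on port 8188. If ComfyUI runs on a remote machine, its standard `--listen` and `--port` flags let you expose it, for example:

```bash
# Bind to all interfaces on the default port (remote access)
python main.py --listen 0.0.0.0 --port 8188
```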

---
# LTX-2.3 Model Card

This model card focuses on the LTX-2.3 model, a significant update to the [LTX-2 model](https://huggingface.co/Lightricks/LTX-2) with improved audio and visual quality as well as enhanced prompt adherence.
LTX-2 was presented in the paper [LTX-2: Efficient Joint Audio-Visual Foundation Model](https://huggingface.co/papers/2601.03233).

💻💻 **If you want to dive right into the code, it is available [here](https://github.com/Lightricks/LTX-2).** 💾💾

LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

[![LTX-2 Open Source](ltx2.3-open.png)](https://youtu.be/o-7us-BR_gQ)

# Model Checkpoints

| Name | Notes |
|------|-------|
| ltx-2.3-22b-dev | The full model, flexible and trainable, in bf16 |
| ltx-2.3-22b-distilled | The distilled version of the full model; 8 steps, CFG=1 |
| ltx-2.3-22b-distilled-lora-384 | A LoRA version of the distilled model, applicable to the full model |
| ltx-2.3-spatial-upscaler-x2-1.0 | An x2 spatial upscaler for the ltx-2.3 latents, used in multi-stage (multiscale) pipelines for higher resolution |
| ltx-2.3-spatial-upscaler-x1.5-1.0 | An x1.5 spatial upscaler for the ltx-2.3 latents, used in multi-stage (multiscale) pipelines for higher resolution |
| ltx-2.3-temporal-upscaler-x2-1.0 | An x2 temporal upscaler for the ltx-2.3 latents, used in multi-stage (multiscale) pipelines for higher FPS |

## Model Details
- **Developed by:** Lightricks
- **Model type:** Diffusion-based audio-video foundation model
- **Language(s):** English

# Online demo
LTX-2.3 is accessible right away via the [API Playground](https://console.ltx.video/playground/).

# Run locally

## Direct use license
You can use the models (full, distilled, upscalers, and any derivatives of the models) for purposes permitted under the [license](./LICENSE).

## ComfyUI
We recommend using the built-in LTXVideo nodes, which can be found in the ComfyUI Manager.
For manual installation information, please refer to our [documentation site](https://docs.ltx.video/open-source-model/integration-tools/comfy-ui).

## PyTorch codebase

The [LTX-2 codebase](https://github.com/Lightricks/LTX-2) is a monorepo with several packages, from the model definition in `ltx-core` to pipelines in `ltx-pipelines` and training capabilities in `ltx-trainer`.
The codebase was tested with Python >= 3.12, CUDA > 12.7, and PyTorch ~= 2.7.

### Installation

```bash
git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2

# From the repository root
uv sync
source .venv/bin/activate
```

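To confirm the environment matches the tested versions above, a quick sanity check:

```python
# Print the interpreter, PyTorch, and CUDA versions for comparison
# against the tested configuration (Python >= 3.12, PyTorch ~= 2.7, CUDA > 12.7)
import sys
import torch

print("python:", sys.version.split()[0])
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
```
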
### Inference

To use our model, please follow the instructions in our [ltx-pipelines](https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/README.md) package.

## Diffusers 🧨

LTX-2.3 support in the [Diffusers Python library](https://huggingface.co/docs/diffusers/main/en/index) is coming soon!

## General tips
* Width and height must be divisible by 32, and the frame count must be a multiple of 8 plus 1 (8k + 1, e.g. 121); see the sketch after this list.
* If the resolution or frame count doesn't satisfy these constraints, pad the input with -1 and then crop to the desired resolution and frame count.
* For tips on writing effective prompts, please visit our [Prompting guide](https://ltx.video/blog/how-to-prompt-for-ltx-2).

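A small helper, as an illustrative sketch (the function names here are ours, not part of any LTX-2 API), for snapping a requested size and frame count to the nearest valid values:

```python
# Width/height must be multiples of 32; frame count must be 8*k + 1.
def valid_resolution(width: int, height: int) -> tuple[int, int]:
    return round(width / 32) * 32, round(height / 32) * 32

def valid_frame_count(frames: int) -> int:
    return round((frames - 1) / 8) * 8 + 1

print(valid_resolution(1280, 720))  # (1280, 704)
print(valid_frame_count(120))       # 121
```
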
### Limitations
- This model is not intended or able to provide factual information.
- As a statistical model, this checkpoint might amplify existing societal biases.
- The model may fail to generate videos that match the prompts perfectly.
- Prompt following is heavily influenced by the prompting style.
- The model may generate content that is inappropriate or offensive.
- When generating audio without speech, the audio may be of lower quality.

# Train the model

The base (dev) model is fully trainable.

It's extremely easy to reproduce the LoRAs and IC-LoRAs we publish with the model by following the instructions in the [LTX-2 Trainer Readme](https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-trainer/README.md).

Training for motion, style, or likeness (sound + appearance) can take less than an hour in many settings.

## Citation

```bibtex
@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and Bitterman, Yaki and Kvochko, Andrew and Berkowitz, Avishai and Shalem, Daniel and Lifschitz, Daphna and Moshe, Dudu and Porat, Eitan and Richardson, Eitan and Shiran, Guy and Chachy, Itay and Chetboun, Jonathan and Finkelson, Michael and Kupchick, Michael and Zabari, Nir and Guetta, Nitzan and Kotler, Noa and Bibi, Ofir and Gordon, Ori and Panet, Poriya and Benita, Roi and Armon, Shahar and Kulikov, Victor and Inger, Yaron and Shiftan, Yonatan and Melumian, Zeev and Farbman, Zeev},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}
```