---
license: cc-by-nc-4.0
tags:
- text-to-video
duplicated_from: diffusers/text-to-video-ms-1.7b
---

# Text-to-video-synthesis Model in Open Domain

This model is based on a multi-stage text-to-video generation diffusion model, which takes a text description as input and returns a video that matches it. Only English input is supported.

**We Are Hiring!** (Based in Beijing / Hangzhou, China.)

If you're looking for an exciting challenge and the opportunity to work with cutting-edge technologies in AIGC and large-scale pretraining, then we are the place for you. We are looking for talented, motivated and creative individuals to join our team. If you are interested, please send us your CV.

EMAIL: yingya.zyy@alibaba-inc.com

## Model description

The text-to-video generation diffusion model consists of three sub-networks: a text feature extraction model, a text-feature-to-video latent space diffusion model, and a video latent space to video visual space model. The overall model has about 1.7 billion parameters. Currently, it only supports English input. The diffusion model adopts a UNet3D structure and generates video through an iterative denoising process starting from pure Gaussian noise.

This model is meant for research purposes. Please look at the [model limitations and biases](#model-limitations-and-biases) and [misuse, malicious use and excessive use](#misuse-malicious-use-and-excessive-use) sections.

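For orientation, these three sub-networks correspond to components of the `diffusers` pipeline used in the Usage section below. A minimal sketch for inspecting them (the attribute names are standard `diffusers` pipeline components, not something this card documents):

```python
from diffusers import DiffusionPipeline

# load the pipeline and inspect the three sub-networks
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b")

print(type(pipe.text_encoder).__name__)  # text feature extraction (CLIP text encoder)
print(type(pipe.unet).__name__)          # text-feature-to-video latent diffusion (3D UNet)
print(type(pipe.vae).__name__)           # video latent space to video visual space (VAE)
```
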
## Model Details

- **Developed by:** [ModelScope](https://modelscope.cn/)
- **Model type:** Diffusion-based text-to-video generation model
- **Language(s):** English
- **License:** [CC-BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)
- **Resources for more information:** [ModelScope GitHub Repository](https://github.com/modelscope/modelscope), [Summary](https://modelscope.cn/models/damo/text-to-video-synthesis/summary).
- **Cite as:** See the [Citation](#citation) section below.

## Use cases

This model has a wide range of applications and can generate videos from arbitrary English text descriptions.

## Usage

Let's first install the required libraries:

```bash
$ pip install diffusers transformers accelerate torch
```

Now, generate a video:

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# load the pipeline in half precision and offload submodules to the CPU when idle
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# run inference and export the generated frames to an mp4 file
prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)
```

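If you need reproducible results, you can pass a seeded `torch.Generator` to the pipeline. A minimal sketch, reusing `pipe` and `export_to_video` from above; the `output_video_path` argument and the file name `spiderman_surfing.mp4` are assumptions for illustration (check your `diffusers` version):

```python
import torch

# fix the random seed so repeated runs produce the same video
generator = torch.Generator(device="cpu").manual_seed(42)
video_frames = pipe("Spiderman is surfing", num_inference_steps=25, generator=generator).frames

# assumption: export_to_video accepts an explicit output path in your diffusers version
video_path = export_to_video(video_frames, output_video_path="spiderman_surfing.mp4")
```
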
Here are some results:

<table>
    <tr>
    <td><center>
    An astronaut riding a horse.
    <br>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astr.gif"
    alt="An astronaut riding a horse."
    style="width: 300px;" />
    </center></td>
    <td><center>
    Darth Vader surfing in waves.
    <br>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vader.gif"
    alt="Darth Vader surfing in waves."
    style="width: 300px;" />
    </center></td>
    </tr>
</table>

## Long Video Generation

You can optimize for memory usage by enabling attention and VAE slicing and by using PyTorch 2.0. This should allow you to generate videos of up to 25 seconds (roughly 200 frames at about 8 fps) on less than 16GB of GPU VRAM.

```bash
$ pip install git+https://github.com/huggingface/diffusers transformers accelerate
```

```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# load pipeline
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# optimize for GPU memory: offload submodules to the CPU and slice the attention and VAE computations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# generate
prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames

# convert to video
video_path = export_to_video(video_frames)
```

## View results

The above code prints the save path of the output video. The output mp4 file can be played with [VLC media player](https://www.videolan.org/vlc/); some other media players may not play it correctly.

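If you prefer a quick preview without a separate player, you can also save the frames as an animated GIF. A minimal sketch, assuming `video_frames` is the list of `uint8` HxWx3 numpy arrays returned by the pipeline above:

```python
from PIL import Image

# convert each frame to a PIL image and write an animated GIF
# duration=125 ms per frame corresponds to about 8 fps
images = [Image.fromarray(frame) for frame in video_frames]
images[0].save("preview.gif", save_all=True, append_images=images[1:], duration=125, loop=0)
```
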
## Model limitations and biases

* The model is trained on public datasets such as WebVid, and the generated results may exhibit deviations related to the distribution of the training data.
* This model cannot achieve perfect film and television quality generation.
* The model cannot generate clear, legible text.
* The model is mainly trained on an English corpus and does not support other languages at the moment.
* The performance of this model needs to be improved on complex compositional generation tasks.

## Misuse, Malicious Use and Excessive Use

* The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.
* It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
* It is prohibited to generate pornographic, violent, or gory content.
* It is prohibited to generate erroneous or false information.

## Training data

The training data includes [LAION5B](https://huggingface.co/datasets/laion/laion2B-en), [ImageNet](https://www.image-net.org/), [WebVid](https://m-bain.github.io/webvid-dataset/) and other public datasets. After pre-training, image and video filtering is performed, including aesthetic scoring, watermark scoring, and deduplication.

_(Part of this model card has been taken from [here](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis))_

## Citation

```bibtex
@article{wang2023modelscope,
  title={Modelscope text-to-video technical report},
  author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
  journal={arXiv preprint arXiv:2308.06571},
  year={2023}
}

@InProceedings{VideoFusion,
  author    = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
  title     = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2023}
}
```