<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>MoDA: Multi-modal Diffusion Architecture for Talking Head Generation</title>
<link rel="stylesheet" href="static/css/bulma.min.css">
<link rel="stylesheet" href="static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="static/css/bulma-slider.min.css">
<link rel="stylesheet" href="static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
<link href="./asserts/style.css" rel="stylesheet">
<script async src="//busuanzi.ibruce.info/busuanzi/2.3/busuanzi.pure.mini.js"></script>
</head>
<body>
<div class="content">
<h1><strong>MoDA: Multi-modal Diffusion Architecture for Talking Head Generation</strong></h1>
<div align="center">
Xinyang Li<sup>1,2</sup>,
Gen Li<sup>2</sup>,
Zhihui Lin<sup>1,3</sup>,
Yichen Qian<sup>1,3 †</sup>,
Gongxin Yao<sup>2</sup>,
Weinan Jia<sup>1</sup>,
Weihua Chen<sup>1,3</sup>,
Fan Wang<sup>1,3</sup><br><br>
<sup>1</sup>Xunguang Team, DAMO Academy, Alibaba Group
<sup>2</sup>Zhejiang University
<sup>3</sup>Hupan Lab <br><br>
<sup>†</sup>Corresponding authors:
<a href="mailto:yichen.qyc@alibaba-inc.com">yichen.qyc@alibaba-inc.com</a>,
<a href="mailto:l_xyang@zju.edu.cn">l_xyang@zju.edu.cn</a>
</div>
<br>
<div align="center">
<a href="https://github.com/lixinyyang/MoDA">
<img src="https://img.shields.io/badge/Github-Code-blue" alt="Github">
</a>
<a href="https://arxiv.org/abs/2507.03256">
<img src="https://img.shields.io/badge/Paper-Arxiv-red" alt="Paper on Arxiv">
</a>
</div>
<div class="row" style="border: 1px solid #a3a3a3; border-radius: 4px; margin-top: 20px;">
<video style="width: 100%; object-fit: cover;" controls>
<source src="moda/moda.mp4" type="video/mp4">
</video>
</div>
</div>
<div class="content">
<h2 style="text-align:center"><strong>Abstract</strong></h2>
<div id="teasers">
<img src="asserts/frameworks.png" alt="Overview of the MoDA multi-modal diffusion framework" style="width: 100%;">
</div>
<p style="line-height: 30px;">
Talking head generation with arbitrary identities and speech audio remains a crucial problem in
the virtual metaverse. Diffusion models have recently become a popular generative
technique in this field owing to their strong generation capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual
artifacts caused by the implicit latent space of Variational Auto-Encoders (VAEs), which complicates
the diffusion process; 2) a lack of authentic facial expressions and head movements due to inadequate
multi-modal information fusion. MoDA addresses these challenges by: 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture that
models the interaction among noisy motion, audio, and auxiliary conditions, enhancing overall facial expressiveness. In addition, a coarse-to-fine fusion strategy progressively integrates the different modalities, ensuring effective feature fusion. Experimental results
demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications.
</p>
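<p style="line-height: 30px;">
To make the flow-matching piece above concrete, the sketch below shows one training step for a
motion denoiser operating in the joint parameter space: sample a noise endpoint, interpolate along
a straight path to the clean motion, and regress the network's output to the path's constant
velocity. It is a minimal illustration under assumed shapes and an assumed model signature
(<code>model</code>, <code>audio_feat</code>, and <code>cond</code> are hypothetical names), not
the released MoDA implementation.
</p>
<pre style="background:#f5f5f5; padding:14px; white-space: pre-wrap;">
<code>
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, audio_feat, cond):
    """One flow-matching training step (rectified-flow style).

    x1: clean motion parameters in the joint parameter space, shape (B, T, D).
    audio_feat, cond: audio and auxiliary condition features (assumed inputs).
    """
    x0 = torch.randn_like(x1)                           # Gaussian noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # uniform time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                        # point on the straight path
    v_target = x1 - x0                                  # constant velocity of that path
    v_pred = model(xt, t.flatten(), audio_feat, cond)   # network predicts the velocity
    return F.mse_loss(v_pred, v_target)                 # plain regression, no ELBO needed
</code>
</pre>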
</div>
<div class="content">
<h2 style="text-align: center;"><strong>Gallery</strong></h2>
<h3>Talking Head Generation in Complex Scenarios.</h3>
<div class="gallery">
<div class="row">
<video style="width: 33%; object-fit: cover;" controls>
<source src="moda/Complex Scenarios/1.mp4" type="video/mp4">
</video>
<video style="width: 33%; object-fit: cover;" controls>
<source src="moda/Complex Scenarios/2.mp4" type="video/mp4">
</video>
<video style="width: 33%; object-fit: cover;" controls>
<source src="moda/Complex Scenarios/3.mp4" type="video/mp4">
</video>
</div>
<div class="row">
<video style="width: 33%; object-fit: cover;" controls>
<source src="moda/Complex Scenarios/4.mp4" type="video/mp4">
</video>
<video style="width: 33%; object-fit: cover;" controls>
<source src="moda/Complex Scenarios/5.mp4" type="video/mp4">
</video>
<video style="width: 33%; object-fit: cover;" controls>
<source src="moda/Complex Scenarios/6.mp4" type="video/mp4">
</video>
</div>
<div class="row">
<video style="width: 33%; object-fit: cover;" controls>
<source src="moda/Complex Scenarios/7.mp4" type="video/mp4">
</video>
<video style="width: 33%; object-fit: cover;" controls>
<source src="moda/Complex Scenarios/8.mp4" type="video/mp4">
</video>
<video style="width: 33%; object-fit: cover;" controls>
<source src="moda/Complex Scenarios/9.mp4" type="video/mp4">
</video>
</div>
</div>
<h3>Fine-grained Emotion Control.</h3>
<div class="gallery">
<div class="row">
<div style="width: 50%; padding: 0 5px; box-sizing: border-box;">
<p style="text-align: center;"><strong>Happy</strong></p>
<video style="width: 100%; object-fit: cover;" controls>
<source src="moda/Emotion Control/1-1.mp4" type="video/mp4">
</video>
</div>
<div style="width: 50%; padding: 0 5px; box-sizing: border-box;">
<p style="text-align: center;"><strong>Sad</strong></p>
<video style="width: 100%; object-fit: cover;" controls>
<source src="moda/Emotion Control/1-2.mp4" type="video/mp4">
</video>
</div>
</div>
<div class="row">
<div style="width: 50%; padding: 0 5px; box-sizing: border-box;">
<p style="text-align: center;"><strong>Happy</strong></p>
<video style="width: 100%; object-fit: cover;" controls>
<source src="moda/Emotion Control/2-1.mp4" type="video/mp4">
</video>
</div>
<div style="width: 50%; padding: 0 5px; box-sizing: border-box;">
<p style="text-align: center;"><strong>Sad</strong></p>
<video style="width: 100%; object-fit: cover;" controls>
<source src="moda/Emotion Control/2-2.mp4" type="video/mp4">
</video>
</div>
</div>
</div>
<h3>Long Video Generation.</h3>
<div class="gallery">
<div class="row">
<video style="width: 50%; object-fit: cover;" controls>
<source src="moda/Long Videos Generation/1.mp4" type="video/mp4">
</video>
<video style="width: 50%; object-fit: cover;" controls>
<source src="moda/Long Videos Generation/2.mp4" type="video/mp4">
</video>
</div>
</div>
<!-- Citation Section -->
<div class="gallery">
<h2 class="title is-3 has-text-centered">Citation</h2>
<div class="citation-block">
<pre style="background:#f5f5f5; padding:14px; margin: 0; white-space: pre-wrap;">
<code>
@article{li2025moda,
title = {MoDA: Multi-modal Diffusion Architecture for Talking Head Generation},
author = {Li, Xinyang and Li, Gen and Lin, Zhihui and Qian, Yichen and
Yao, Gongxin and Jia, Weinan and Chen, Weihua and Wang, Fan},
journal = {arXiv preprint arXiv:2507.03256},
year = {2025}
}
</code>
</pre>
</div>
</div>
</div>
<footer style="text-align: center; font-size: medium; color: blueviolet;">
<span id="busuanzi_container_page_pv">Page Views: <span id="busuanzi_value_page_pv"></span></span>
</footer>
</body>
</html>