
Cogvideox_fun_inp pipeline#13331

Open
satani99 wants to merge 4 commits into huggingface:main from satani99:cogvideo_fun

Conversation

@satani99
Contributor

Added CogvideoxFunInpaintPipeline

Before submitting

Who can review?

@sayakpaul

@sayakpaul sayakpaul requested a review from dg845 March 25, 2026 05:11
EXAMPLE_DOC_STRING = """
Examples:
```python
>>> import torch

Suggested change
- >>> import torch
+ >>> import PIL.Image
+ >>> import torch

As we use PIL.Image below when creating mask_video.

>>> video = load_video(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
... )
>>> mask_video = [Image.new("L", frame.size, 255) for frame in video]

Suggested change
- >>> mask_video = [Image.new("L", frame.size, 255) for frame in video]
+ >>> mask_video = [PIL.Image.new("L", frame.size, 255) for frame in video]

Follow-up to #13331 (comment).

>>> mask_video = [Image.new("L", frame.size, 255) for frame in video]
>>> prompt = "A cinematic mountain hike with dramatic lighting."
>>> output = pipe(prompt=prompt, video=video, mask_video=mask_video, output_type="pt").frames[0]

Suggested change
- >>> output = pipe(prompt=prompt, video=video, mask_video=mask_video, output_type="pt").frames[0]
+ >>> output = pipe(prompt=prompt, video=video, mask_video=mask_video, output_type="np").frames[0]

export_to_video only accepts list[np.ndarray] or list[PIL.Image.Image] video arguments.


return prompt_embeds, negative_prompt_embeds

def _preprocess_mask_video(self, mask_video, height: int, width: int) -> torch.Tensor:

Suggested change
- def _preprocess_mask_video(self, mask_video, height: int, width: int) -> torch.Tensor:
+ @staticmethod
+ def _preprocess_mask_video(mask_video, height: int, width: int) -> torch.Tensor:

Since _preprocess_mask_video does not use the self argument, it can be a @staticmethod.

Comment on lines +423 to +429
videos = []
for i in range(video.size(0)):
video_bs = video[i : i + 1]
video_bs = self.vae.encode(video_bs)[0]
video_bs = video_bs.sample()
videos.append(video_bs)
video = torch.cat(videos, dim=0)

I think we should encode the video once instead of batch element by batch element, as this would better respect sliced/tiled encoding settings on the VAE. Users can always specify tiled encoding via pipe.vae.enable_tiling.
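This suggestion can be illustrated with a toy stand-in for the VAE. The StubVAE class below is hypothetical and exists only to contrast call counts between the two paths; a real diffusers VAE returns a distribution whose .sample() yields the latents:

```python
class StubVAE:
    """Hypothetical stand-in for self.vae, used only to count encode() calls."""

    def __init__(self):
        self.calls = 0

    def encode(self, batch):
        self.calls += 1
        # Pretend "latent" transform; a real VAE returns a distribution.
        return [x * 2 for x in batch]


video = [1.0, 2.0, 3.0]

# PR-style path: one encode() call per batch element.
vae_a = StubVAE()
per_element = [vae_a.encode(video[i : i + 1])[0] for i in range(len(video))]

# Suggested path: a single encode() call over the whole batch, so that
# slicing/tiling enabled via vae.enable_tiling() applies uniformly.
vae_b = StubVAE()
batched = vae_b.encode(video)
```

Both paths produce the same latents, but the batched one issues a single call, which lets the VAE's own slicing/tiling logic manage memory across the whole batch.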

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +446 to +450
if return_noise:
outputs += (noise,)

if return_video_latents:
outputs += (video_latents,)

I think we should remove the return_noise and return_video_latents arguments because these extra outputs are not used by CogVideoXFunInpaintPipeline.

Comment on lines +477 to +483
mask_pixel_values = []
for i in range(masked_image.size(0)):
mask_pixel_value = masked_image[i].unsqueeze(0)
mask_pixel_value = self.vae.encode(mask_pixel_value)[0]
mask_pixel_value = mask_pixel_value.mode()
mask_pixel_values.append(mask_pixel_value)
masked_image_latents = torch.cat(mask_pixel_values, dim=0)

Similar to #13331 (comment), I think we should encode masked_image once here to respect any slicing/tiling enabled on the VAE.

else:
masked_image_latents = None

return mask, masked_image_latents

Suggested change
- return mask, masked_image_latents
+ return masked_image_latents

I think we should remove the mask return value here as it is not used in __call__.

if video is not None and latents is not None:
raise ValueError("Only one of `video` or `latents` should be provided.")

def fuse_qkv_projections(self) -> None:

I think we should remove the fuse_qkv_projections/unfuse_qkv_projections pipeline methods in favor of calling the corresponding methods directly on the transformer (e.g. pipe.transformer.fuse_qkv_projections()), similar to what we do for VAE slicing/tiling methods. CC @yiyixuxu

def interrupt(self):
return self._interrupt

def get_timesteps(self, num_inference_steps, timesteps, strength, device):

I think the code would be more clear if we inlined the logic in the get_timesteps method in __call__. We generally prefer inlining smaller methods into the main pipeline code.
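For reference, the logic to be inlined typically looks like the strength handling shared by diffusers img2img/inpaint pipelines. This is a minimal sketch with assumed names (including the scheduler_order default), not the PR's exact code:

```python
def get_timesteps(num_inference_steps, timesteps, strength, scheduler_order=1):
    """Trim the schedule so denoising starts `strength` of the way in.

    Mirrors the strength handling common to diffusers img2img/inpaint
    pipelines; names and the scheduler_order default are assumptions.
    """
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return timesteps[t_start * scheduler_order:], num_inference_steps - t_start


# Example: a 10-step schedule with strength=0.6 keeps the last 6 steps.
schedule = list(range(999, -1, -100))  # [999, 899, ..., 99]
trimmed, steps = get_timesteps(10, schedule, 0.6)
```

Because the body is only a few lines, inlining it into __call__ keeps the strength logic visible where the timesteps are actually consumed.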

width = width or self.transformer.config.sample_width * self.vae_scale_factor_spatial
num_frames = num_frames or self.transformer.config.sample_frames

num_videos_per_prompt = 1

I think we should support num_videos_per_prompt > 1 unless there is a strong reason not to.
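Supporting it is mostly a matter of repeating the prompt embeddings (and expanding the latent batch) per prompt. A toy list-based sketch of the repeat pattern follows; tensor code would use repeat or repeat_interleave, and the helper name here is made up for illustration:

```python
def duplicate_for_videos(prompt_embeds, num_videos_per_prompt):
    """Repeat each prompt's embedding so the effective batch size becomes
    len(prompt_embeds) * num_videos_per_prompt.

    Sketch only: lists stand in for tensors, and the grouping (all copies
    of prompt 0 first, then prompt 1, ...) matches repeat_interleave.
    """
    return [e for e in prompt_embeds for _ in range(num_videos_per_prompt)]


# Two prompts, two videos each: batch of four, grouped per prompt.
expanded = duplicate_for_videos(["cat", "dog"], 2)
```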

@dg845 dg845 left a comment


Thanks for the PR! I left an initial review :). When I run the tests locally with

pytest tests/pipelines/cogvideo/test_cogvideox_fun_inpaint.py

a lot of the tests fail with the following error:

FAILED tests/pipelines/cogvideo/test_cogvideox_fun_inpaint.py::CogVideoXFunInpaintPipelineFastTests::test_inference_batch_consistent - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Can you look into this?
