
Cogvideox_fun_inp pipeline#13331

Open
satani99 wants to merge 4 commits into huggingface:main from satani99:cogvideo_fun

Conversation

@satani99
Contributor

Added CogvideoxFunInpaintPipeline

Before submitting

Who can review?

@sayakpaul

@sayakpaul sayakpaul requested a review from dg845 March 25, 2026 05:11
EXAMPLE_DOC_STRING = """
Examples:
```python
>>> import torch

Suggested change
- >>> import torch
+ >>> import PIL.Image
+ >>> import torch

As we use PIL.Image below when creating mask_video.

>>> video = load_video(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
... )
>>> mask_video = [Image.new("L", frame.size, 255) for frame in video]

Suggested change
- >>> mask_video = [Image.new("L", frame.size, 255) for frame in video]
+ >>> mask_video = [PIL.Image.new("L", frame.size, 255) for frame in video]

Follow-up to #13331 (comment).

>>> mask_video = [Image.new("L", frame.size, 255) for frame in video]
>>> prompt = "A cinematic mountain hike with dramatic lighting."
>>> output = pipe(prompt=prompt, video=video, mask_video=mask_video, output_type="pt").frames[0]

Suggested change
- >>> output = pipe(prompt=prompt, video=video, mask_video=mask_video, output_type="pt").frames[0]
+ >>> output = pipe(prompt=prompt, video=video, mask_video=mask_video, output_type="np").frames[0]

export_to_video only accepts list[np.ndarray] or list[PIL.Image.Image] video arguments.


return prompt_embeds, negative_prompt_embeds

def _preprocess_mask_video(self, mask_video, height: int, width: int) -> torch.Tensor:

Suggested change
- def _preprocess_mask_video(self, mask_video, height: int, width: int) -> torch.Tensor:
+ @staticmethod
+ def _preprocess_mask_video(mask_video, height: int, width: int) -> torch.Tensor:

Since _preprocess_mask_video does not use the self argument, it can be a @staticmethod.

Comment on lines +423 to +429
videos = []
for i in range(video.size(0)):
video_bs = video[i : i + 1]
video_bs = self.vae.encode(video_bs)[0]
video_bs = video_bs.sample()
videos.append(video_bs)
video = torch.cat(videos, dim=0)

I think we should encode the video once instead of batch element by batch element, as this would better respect sliced/tiled encoding settings on the VAE. Users can always specify tiled encoding via pipe.vae.enable_tiling.
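This suggestion can be illustrated with a toy stand-in for the VAE. The StubVAE class below is hypothetical and exists only to contrast call counts between the two paths; a real diffusers VAE returns a distribution whose .sample() yields the latents:

```python
class StubVAE:
    """Hypothetical stand-in for self.vae, used only to count encode() calls."""

    def __init__(self):
        self.calls = 0

    def encode(self, batch):
        self.calls += 1
        # Pretend "latent" transform; a real VAE returns a distribution.
        return [x * 2 for x in batch]


video = [1.0, 2.0, 3.0]

# PR-style path: one encode() call per batch element.
vae_a = StubVAE()
per_element = [vae_a.encode(video[i : i + 1])[0] for i in range(len(video))]

# Suggested path: a single encode() call over the whole batch, so that
# slicing/tiling enabled via vae.enable_tiling() applies uniformly.
vae_b = StubVAE()
batched = vae_b.encode(video)
```

Both paths produce the same latents, but the batched one issues a single call, which lets the VAE's own slicing/tiling logic manage memory across the whole batch.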

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +446 to +450
if return_noise:
outputs += (noise,)

if return_video_latents:
outputs += (video_latents,)

I think we should remove the return_noise and return_video_latents arguments because these extra outputs are not used by CogVideoXFunInpaintPipeline.

Comment on lines +477 to +483
mask_pixel_values = []
for i in range(masked_image.size(0)):
mask_pixel_value = masked_image[i].unsqueeze(0)
mask_pixel_value = self.vae.encode(mask_pixel_value)[0]
mask_pixel_value = mask_pixel_value.mode()
mask_pixel_values.append(mask_pixel_value)
masked_image_latents = torch.cat(mask_pixel_values, dim=0)

Similar to #13331 (comment), I think we should encode masked_image once here to respect any slicing/tiling enabled on the VAE.

else:
masked_image_latents = None

return mask, masked_image_latents

Suggested change
- return mask, masked_image_latents
+ return masked_image_latents

I think we should remove the mask return value here as it is not used in __call__.

if video is not None and latents is not None:
raise ValueError("Only one of `video` or `latents` should be provided.")

def fuse_qkv_projections(self) -> None:

I think we should remove the fuse_qkv_projections/unfuse_qkv_projections pipeline methods in favor of calling the corresponding methods directly on the transformer (e.g. pipe.transformer.fuse_qkv_projections()), similar to what we do for VAE slicing/tiling methods. CC @yiyixuxu

def interrupt(self):
return self._interrupt

def get_timesteps(self, num_inference_steps, timesteps, strength, device):

I think the code would be more clear if we inlined the logic in the get_timesteps method in __call__. We generally prefer inlining smaller methods into the main pipeline code.
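For reference, the logic to be inlined typically looks like the strength handling shared by diffusers img2img/inpaint pipelines. This is a minimal sketch with assumed names (including the scheduler_order default), not the PR's exact code:

```python
def get_timesteps(num_inference_steps, timesteps, strength, scheduler_order=1):
    """Trim the schedule so denoising starts `strength` of the way in.

    Mirrors the strength handling common to diffusers img2img/inpaint
    pipelines; names and the scheduler_order default are assumptions.
    """
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return timesteps[t_start * scheduler_order:], num_inference_steps - t_start


# Example: a 10-step schedule with strength=0.6 keeps the last 6 steps.
schedule = list(range(999, -1, -100))  # [999, 899, ..., 99]
trimmed, steps = get_timesteps(10, schedule, 0.6)
```

Because the body is only a few lines, inlining it into __call__ keeps the strength logic visible where the timesteps are actually consumed.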

width = width or self.transformer.config.sample_width * self.vae_scale_factor_spatial
num_frames = num_frames or self.transformer.config.sample_frames

num_videos_per_prompt = 1

I think we should support num_videos_per_prompt > 1 unless there is a strong reason not to.
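Supporting it is mostly a matter of repeating the prompt embeddings (and expanding the latent batch) per prompt. A toy list-based sketch of the repeat pattern follows; tensor code would use repeat or repeat_interleave, and the helper name here is made up for illustration:

```python
def duplicate_for_videos(prompt_embeds, num_videos_per_prompt):
    """Repeat each prompt's embedding so the effective batch size becomes
    len(prompt_embeds) * num_videos_per_prompt.

    Sketch only: lists stand in for tensors, and the grouping (all copies
    of prompt 0 first, then prompt 1, ...) matches repeat_interleave.
    """
    return [e for e in prompt_embeds for _ in range(num_videos_per_prompt)]


# Two prompts, two videos each: batch of four, grouped per prompt.
expanded = duplicate_for_videos(["cat", "dog"], 2)
```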

@dg845 dg845 left a comment


Thanks for the PR! I left an initial review :). When I run the tests locally with

pytest tests/pipelines/cogvideo/test_cogvideox_fun_inpaint.py

a lot of the tests fail with the following error:

FAILED tests/pipelines/cogvideo/test_cogvideox_fun_inpaint.py::CogVideoXFunInpaintPipelineFastTests::test_inference_batch_consistent - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Can you look into this?
