Context:
I am reviewing the implementation of the forward pass in the Flow-based Custom Transformer for the Singing Voice Synthesis (SVS) task. While comparing the current codebase with the provided paper description, I noticed a potential discrepancy regarding how the content embedding $z_c$ is integrated into the model.
Discrepancy Details:
# Combine midi and phoneme embeddings
content = midi + ph
content = self.final_proj(content.transpose(1, 2)).transpose(1, 2)
# ... (x_combined is defined as concat of prompt and x)
# Current injection method: Gated Addition (Line 85-87)
gate = torch.sigmoid(self.gate_content(content))
x_combined += content * gate
As shown in the snippet above, in the forward function (https://github.com/AaronZ345/TCSinger2/blob/main/ldm/modules/diffusionmodules/tcsinger2.py#L403-L404) the content embedding $z_c$ (derived from MIDI and phonemes) is added to the combined latent variable through this gated mechanism.
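For reference, the gated-addition injection can be sketched as a self-contained module. This is a minimal illustration of the mechanism only: the module name, layer choice (a Linear gate over the channel dimension), and tensor shapes are assumptions, not the repo's actual code.

```python
import torch
import torch.nn as nn

class GatedContentInjection(nn.Module):
    """Sketch of gated addition: x_combined += content * sigmoid(W content).

    Hypothetical reimplementation for illustration; shapes assume
    (batch, time, channels) tensors with matching channel width.
    """
    def __init__(self, dim):
        super().__init__()
        # mirrors self.gate_content in the quoted snippet (assumed Linear)
        self.gate_content = nn.Linear(dim, dim)

    def forward(self, x_combined, content):
        # per-element gate in (0, 1), computed from the content embedding
        gate = torch.sigmoid(self.gate_content(content))
        # gated residual addition of the content into the latent
        return x_combined + content * gate

# usage: batch of 2, 50 frames, 256 channels
inj = GatedContentInjection(256)
x_combined = torch.randn(2, 50, 256)
z_c = torch.randn(2, 50, 256)
out = inj(x_combined, z_c)
print(out.shape)  # torch.Size([2, 50, 256])
```

Because the gate output lies strictly in (0, 1), the injected content is always a damped copy of $z_c$; this differs from plain addition or channel-wise concatenation, which is the crux of the discrepancy being raised.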