Context:
I am reviewing the implementation of the forward pass in the Flow-based Custom Transformer for the Singing Voice Synthesis (SVS) task. While comparing the current codebase with the provided paper description, I noticed a potential discrepancy regarding how the content embedding $z_c$ is integrated into the model.
Discrepancy Details:
# Combine midi and phoneme embeddings
content = midi + ph
content = self.final_proj(content.transpose(1, 2)).transpose(1, 2)
# ... (x_combined is defined as concat of prompt and x)
# Current injection method: Gated Addition (Line 85-87)
gate = torch.sigmoid(self.gate_content(content))
x_combined += content * gate
As shown in the snippet above, in the forward function (https://github.com/AaronZ345/TCSinger2/blob/main/ldm/modules/diffusionmodules/tcsinger2.py#L403-L404) the content embedding $z_c$ (derived from MIDI and phonemes) is added to the combined latent variable through this gated mechanism.
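For reference, the gated-addition injection can be sketched as a self-contained module. This is a minimal illustration of the mechanism only: the module name, layer choice (a Linear gate over the channel dimension), and tensor shapes are assumptions, not the repo's actual code.

```python
import torch
import torch.nn as nn

class GatedContentInjection(nn.Module):
    """Sketch of gated addition: x_combined += content * sigmoid(W content).

    Hypothetical reimplementation for illustration; shapes assume
    (batch, time, channels) tensors with matching channel width.
    """
    def __init__(self, dim):
        super().__init__()
        # mirrors self.gate_content in the quoted snippet (assumed Linear)
        self.gate_content = nn.Linear(dim, dim)

    def forward(self, x_combined, content):
        # per-element gate in (0, 1), computed from the content embedding
        gate = torch.sigmoid(self.gate_content(content))
        # gated residual addition of the content into the latent
        return x_combined + content * gate

# usage: batch of 2, 50 frames, 256 channels
inj = GatedContentInjection(256)
x_combined = torch.randn(2, 50, 256)
z_c = torch.randn(2, 50, 256)
out = inj(x_combined, z_c)
print(out.shape)  # torch.Size([2, 50, 256])
```

Because the gate output lies strictly in (0, 1), the injected content is always a damped copy of $z_c$; this differs from plain addition or channel-wise concatenation, which is the crux of the discrepancy being raised.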