feat: Add Snappy support #466

Open
Erouan50 wants to merge 9 commits into Nullus157:main from Erouan50:feat/snappy

Conversation

@Erouan50

Hello there!

This pull request introduces support for a framed Snappy decoder and encoder. I'm not an expert in compression/decompression, but I feel the Snappy format is quite different from the others. My work was inspired by the gzip decoding (for the state machine), the work done in the snap crate on the framing reader and writer, and of course the actual spec. One big difference is that the whole frame must be buffered before compressing or decompressing it.

I have a few questions/notes:

  • The "buffering" can be skipped if the input and/or output buffers are the right size. Should I implement that in this pull request?
  • One test is failing (snappy::tokio::bufread::decompress::trailer), but I don't know how to fix it. If I understand the test correctly, it checks whether the decoder can stop decoding when there is trailing data in the buffer. But with the framing format, I don't know how to handle that. Should I stop decoding when the chunk type is invalid, or should I stop and fail (the current behavior)?
  • Should I add more tests? Because the test data ([1, 2, 3...]) is not compressible with Snappy, the encoder and decoder never actually compress or decompress anything. Or is that the job of the proptests, to create some random test data?
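For context on the buffering point above: the framing format caps the uncompressed payload of a chunk at 65536 bytes, so "buffer the whole frame" is a bounded cost. A minimal sketch of splitting input into frame-sized pieces (the helper name is illustrative, not from this PR):

```rust
// Maximum uncompressed payload of one frame chunk, per the Snappy framing spec.
const MAX_FRAME_INPUT: usize = 65_536;

// Split an input stream into frame-sized pieces; each piece is what the
// encoder would buffer before compressing it into one chunk.
fn frame_sizes(total: usize) -> Vec<usize> {
    let mut sizes = Vec::new();
    let mut left = total;
    while left > 0 {
        let take = left.min(MAX_FRAME_INPUT);
        sizes.push(take);
        left -= take;
    }
    sizes
}

fn main() {
    assert_eq!(frame_sizes(150_000), vec![65_536, 65_536, 18_928]);
    assert_eq!(frame_sizes(10), vec![10]);
    println!("ok");
}
```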

Erouan50 and others added 4 commits March 27, 2026 09:53
Implement the Snappy framing format as a new compression algorithm,
following the same pattern as the existing codecs (zstd, lz4, etc.).

The encoder produces Snappy frames with stream identifiers,
compressed/uncompressed chunks (falling back to uncompressed when
compression ratio is poor), and masked CRC32C checksums. The decoder
is a state machine that parses frame headers, buffers chunk data, and
verifies checksums before emitting decompressed output.

New dependencies: `snap` for raw Snappy compression and `crc32c` for
checksum computation. Both are optional behind the `snappy` feature
flag.
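The frame layout described in the commit message can be sketched as follows. The constants come from the Snappy framing format spec (every chunk is a 1-byte type plus a 3-byte little-endian length); the helper name is illustrative, not code from this PR:

```rust
// Sketch of the Snappy framing layout: a 1-byte chunk type followed by a
// 3-byte little-endian payload length.
fn chunk_header(chunk_type: u8, len: usize) -> [u8; 4] {
    [
        chunk_type,
        (len & 0xFF) as u8,
        ((len >> 8) & 0xFF) as u8,
        ((len >> 16) & 0xFF) as u8,
    ]
}

fn main() {
    // The stream identifier is a chunk of type 0xFF whose 6-byte payload is
    // the magic string "sNaPpY"; note it carries no total-length field.
    let mut stream_id = chunk_header(0xFF, 6).to_vec();
    stream_id.extend_from_slice(b"sNaPpY");
    assert_eq!(
        stream_id,
        [0xFF, 0x06, 0x00, 0x00, 0x73, 0x4E, 0x61, 0x50, 0x70, 0x59]
    );

    // Compressed (0x00) and uncompressed (0x01) data chunks use the same
    // header; the payload is a 4-byte masked CRC32C of the uncompressed
    // data followed by the data itself, so a 5-byte body needs length 4 + 5.
    assert_eq!(chunk_header(0x01, 4 + 5), [0x01, 0x09, 0x00, 0x00]);
    println!("ok");
}
```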
@NobodyXu
Collaborator

> The "buffering" can be skipped if the input and/or output buffers are the right size. Should I implement that in this pull request?

I'm OK with either, though based on the size of the PR and the fact that its CI is failing, I think getting a baseline implementation merged is a more sensible decision.

@Erouan50
Author

> The "buffering" can be skipped if the input and/or output buffers are the right size. Should I implement that in this pull request?
>
> I'm OK with either, though based on the size of the PR and the fact that its CI is failing, I think getting a baseline implementation merged is a more sensible decision.

That's fair. Let me fix the proptests. I forgot about them...

@Erouan50
Author

Erouan50 commented Mar 27, 2026

> The "buffering" can be skipped if the input and/or output buffers are the right size. Should I implement that in this pull request?
>
> I'm OK with either, though based on the size of the PR and the fact that its CI is failing, I think getting a baseline implementation merged is a more sensible decision.

@NobodyXu I agree to implement that in another PR. I fixed the tests except the one I mentioned in the description:

     Summary [  41.303s] 884 tests run: 882 passed, 2 failed, 0 skipped
        FAIL [   0.013s] async-compression::snappy snappy::futures::bufread::decompress::trailer
        FAIL [   0.008s] async-compression::snappy snappy::tokio::bufread::decompress::trailer

Do you have any opinion on how I should fix it? Or should I just skip this test for snappy?

Collaborator

@NobodyXu NobodyXu left a comment


Thank you! I'm really unhappy with all the repeated vec creation, which results in repeated copying and heap allocation.

Comment on lines +30 to +35
0xFF => Self::Stream,
0x00 => Self::Compressed,
0x01 => Self::Uncompressed,
0xFE => Self::Padding,
0x02..=0x7f => Self::ReservedUnskippable(value),
0x80..=0xFD => Self::ReservedSkippable(value),
Collaborator


Do we have to hardcode it here?

Ideally the ChunkType should be in snap

Author


It is, but the enum is not public.

Collaborator


I think it'd be better if snap made it public under the raw module.

}

fn mask_crc(crc: u32) -> u32 {
(crc.wrapping_shr(15) | crc.wrapping_shl(17)).wrapping_add(0xA282EAD8)
Collaborator


Another magic value that would be better in snap; I assume this is duplicated code?

Author


Yes, it's duplicated code. The function is not public.

Collaborator


Can we have it public under the snap::raw module?

Hardcoding it here really feels wrong.
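For reference, the masking in the diff above is a 15-bit right rotation followed by a wrapping addition of a constant, as specified by the Snappy framing format. A self-contained sketch with its inverse (the unmask helper is illustrative, not part of this PR):

```rust
// The CRC masking used by the Snappy framing format: rotate the CRC32C
// right by 15 bits, then add a constant with wrapping arithmetic.
fn mask_crc(crc: u32) -> u32 {
    (crc.wrapping_shr(15) | crc.wrapping_shl(17)).wrapping_add(0xA282EAD8)
}

// Illustrative inverse, shown only to demonstrate the round trip:
// subtract the constant, then rotate left by 15 bits.
fn unmask_crc(masked: u32) -> u32 {
    let x = masked.wrapping_sub(0xA282EAD8);
    x.wrapping_shl(15) | x.wrapping_shr(17)
}

fn main() {
    // Rotating zero is still zero, so masking zero yields just the constant.
    assert_eq!(mask_crc(0), 0xA282EAD8);
    // Rotation and wrapping addition are both bijective on u32, so the
    // round trip holds for every input.
    for crc in [0u32, 1, 0xDEAD_BEEF, u32::MAX] {
        assert_eq!(unmask_crc(mask_crc(crc)), crc);
    }
    println!("ok");
}
```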

Collaborator

@NobodyXu NobodyXu left a comment


A question regarding why the encoder is designed with extra buffering.

I can understand that the decoder needs it, as we cannot control how the chunks are created.

@NobodyXu
Collaborator

> Do you have any opinion on how I should fix it? Or should I just skip this test for snappy?

Does snappy have a "file header" that records the size of the entire stream?

If we have it, then we can configure the decoder to stop decoding once that limit is reached; the only way to continue would be reinit.

Otherwise we cannot fix it.

@Erouan50
Author

> Do you have any opinion on how I should fix it? Or should I just skip this test for snappy?
>
> Does snappy have a "file header" that records the size of the entire stream?
>
> If we have it, then we can configure the decoder to stop decoding once that limit is reached; the only way to continue would be reinit.
>
> Otherwise we cannot fix it.

No, the first "stream" chunk is just there to mark the start of the file, but it doesn't include the length of the whole stream.

@Erouan50
Author

> Thank you! I'm really unhappy with all the repeated vec creation, which results in repeated copying and heap allocation.

You're right, I apologize for that... I tried to apply your suggestions. It's not perfect (I don't like the reset() + clear() + resize() in the decoder, for example), but I hope I'm going in the right direction?

Collaborator

@NobodyXu NobodyXu left a comment


Thanks!

My feedback is that the copying around of uncompressed sections is unnecessary; they should be copied directly into the output.

Also, snap::raw should have more public items to prevent duplication, and it needs to support uninitialised buffers for maximum efficiency.

let mut decoder = snap::raw::Decoder::new();
decoder.decompress(data, out_buf)?;
}
ChunkType::Uncompressed => out_buf.extend_from_slice(data),
Collaborator


We should skip this copy by putting the chunk type into a new variant UncompressedCopy that copies straight from data, along with an offset initially set to 4 to skip the header.

Using that offset we can construct a PartialBuffer.

Comment on lines +43 to +45
out_buf.resize(uncompress_length, 0);
let mut decoder = snap::raw::Decoder::new();
decoder.decompress(data, out_buf)?;
Collaborator


It's a shame that snap does not support writing to an uninitialised buffer; zeroing the whole vec does take quite some time.


impl DecodeV2 for SnappyDecoder {
fn reinit(&mut self) -> std::io::Result<()> {
*self = Self::new();
Collaborator


Suggested change
*self = Self::new();
self.state = State::default();
self.in_buf.clear();
self.out_buf.reset();

}
}

pub fn read_u24_le(slice: &[u8; 3]) -> u32 {
Collaborator


Suggested change
pub fn read_u24_le(slice: &[u8; 3]) -> u32 {
fn read_u24_le(slice: &[u8; 3]) -> u32 {

This function is used by snappy submodules and doesn't have to be public
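A minimal sketch of such a helper, assembling the 3-byte little-endian length used in Snappy chunk headers:

```rust
// Read a 3-byte little-endian integer, as used for the chunk length in
// Snappy framing headers.
fn read_u24_le(slice: &[u8; 3]) -> u32 {
    u32::from(slice[0]) | (u32::from(slice[1]) << 8) | (u32::from(slice[2]) << 16)
}

fn main() {
    // Least significant byte first: [0x01, 0x02, 0x03] -> 0x030201.
    assert_eq!(read_u24_le(&[0x01, 0x02, 0x03]), 0x0003_0201);
    // Maximum representable chunk length is 2^24 - 1.
    assert_eq!(read_u24_le(&[0xFF, 0xFF, 0xFF]), 0x00FF_FFFF);
    println!("ok");
}
```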

))?;

let chunk_type = ChunkType::from(header_part[0]);
// SAFETY: header_part is guaranteed to have at least 4 bytes due to split_first_chunk
Collaborator


Suggested change
// SAFETY: header_part is guaranteed to have at least 4 bytes due to split_first_chunk
// header_part is guaranteed to have at least 4 bytes due to split_first_chunk

SAFETY comments are reserved for unsafe blocks and functions.

Comment on lines +70 to +72
Skipping(usize),
Buffering {
remaining: usize,
Collaborator


Suggested change
Skipping(usize),
Buffering {
remaining: usize,
Skipping(NonZeroUsize),
Buffering {
remaining: NonZeroUsize,

Using NonZeroUsize would make the invariant of the state clear.
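A sketch of what that could look like. The variant names follow the suggestion above; the transition helper is illustrative, not code from this PR:

```rust
use std::num::NonZeroUsize;

// Encoding "bytes still to skip" as NonZeroUsize makes the invariant
// explicit: a Skipping state always has at least one byte left to skip.
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Skipping(NonZeroUsize),
    Done,
}

// Consume up to `n` bytes; once nothing remains, leave the Skipping state.
fn consume(state: State, n: usize) -> State {
    match state {
        State::Skipping(remaining) => {
            match NonZeroUsize::new(remaining.get().saturating_sub(n)) {
                Some(left) => State::Skipping(left),
                None => State::Done,
            }
        }
        State::Done => State::Done,
    }
}

fn main() {
    let s = State::Skipping(NonZeroUsize::new(10).unwrap());
    let s = consume(s, 4);
    assert_eq!(s, State::Skipping(NonZeroUsize::new(6).unwrap()));
    assert_eq!(consume(s, 6), State::Done);
    println!("ok");
}
```

The payoff is that an impossible value like `Skipping(0)` simply cannot be constructed, so the state machine never has to handle it.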

out_buf.resize(max_compress_size + 8, 0);

let mut encoder = snap::raw::Encoder::new();
let compress_data = encoder.compress(in_buffer, &mut out_buf[8..])?;
Collaborator


It's a shame snap::raw doesn't support uninitialised buffers.

Comment on lines +42 to +45
let chunk_type = if compress_data >= in_buffer.len() - (in_buffer.len() / 8) {
out_buf.clear();
out_buf.resize(in_buffer.len() + 8, 0);
out_buf[8..].copy_from_slice(in_buffer);
Collaborator


We should have a new state UncompressedCopy instead of doing this intermediate copy.
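The condition in the diff above stores a chunk uncompressed whenever compression saves less than one eighth (12.5%) of the input. A standalone sketch of that heuristic (the function name is illustrative):

```rust
// Fall back to an uncompressed chunk when compression saves less than
// 1/8 (12.5%) of the input, mirroring the condition in the diff above.
fn store_uncompressed(compressed_len: usize, input_len: usize) -> bool {
    compressed_len >= input_len - input_len / 8
}

fn main() {
    // For a 100-byte input the threshold is 100 - 12 = 88 bytes.
    assert!(!store_uncompressed(87, 100)); // saved 13% -> keep compressed
    assert!(store_uncompressed(88, 100)); // saved only 12% -> store uncompressed
    println!("ok");
}
```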

@NobodyXu
Collaborator

> It's not perfect (I don't like the reset() + clear() + resize() in the decoder, for example), but I hope I'm going in the right direction?

Yeah, it's the right direction. It looks ugly because snap::raw does not support uninitialised buffers; speaking of which, {de, en}code_vec is also very inefficient for the same reason.
