Conversation
Implement the Snappy framing format as a new compression algorithm, following the same pattern as the existing codecs (zstd, lz4, etc.). The encoder produces Snappy frames with stream identifiers, compressed/uncompressed chunks (falling back to uncompressed when compression ratio is poor), and masked CRC32C checksums. The decoder is a state machine that parses frame headers, buffers chunk data, and verifies checksums before emitting decompressed output. New dependencies: `snap` for raw Snappy compression and `crc32c` for checksum computation. Both are optional behind the `snappy` feature flag.
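For orientation, the frame layout the encoder emits can be sketched from the Snappy framing spec: each chunk starts with a one-byte chunk type followed by a little-endian u24 length. The helper name below is illustrative, not from the PR.

```rust
// Sketch of writing a Snappy frame chunk header, per the framing format:
// byte 0 is the chunk type (0xFF = stream identifier, 0x00 = compressed,
// 0x01 = uncompressed, 0xFE = padding); bytes 1..=3 are the chunk length
// as a little-endian u24. `write_chunk_header` is a made-up helper name.
fn write_chunk_header(buf: &mut Vec<u8>, chunk_type: u8, len: u32) {
    buf.push(chunk_type);
    buf.extend_from_slice(&len.to_le_bytes()[..3]);
}

fn main() {
    let mut buf = Vec::new();
    // A stream identifier chunk has length 6 (the magic bytes "sNaPpY").
    write_chunk_header(&mut buf, 0xFF, 6);
    buf.extend_from_slice(b"sNaPpY");
    assert_eq!(&buf[..4], &[0xFF, 6, 0, 0]);
    println!("{:02X?}", buf);
}
```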
I'm ok with either, though based on the size of the PR and that its CI is failing, I think getting a baseline implementation merged is a more sensible decision.
That's fair. Let me fix the …
@NobodyXu I agree to implement that in another PR. I fixed the tests except the one I mentioned in the description: do you have any opinion on how I should fix it? Or should I just skip this test for snappy?
NobodyXu left a comment
Thank you! I'm really unhappy with all the repeated Vec creation, which results in repeated copying and heap allocation.
```rust
0xFF => Self::Stream,
0x00 => Self::Compressed,
0x01 => Self::Uncompressed,
0xFE => Self::Padding,
0x02..=0x7f => Self::ReservedUnskippable(value),
0x80..=0xFD => Self::ReservedSkippable(value),
```
Do we have to hardcode it here?
Ideally the ChunkType should be in snap
It is, but the enum is not public.
I think it'd be better if snap makes it public under the raw module
```rust
fn mask_crc(crc: u32) -> u32 {
    (crc.wrapping_shr(15) | crc.wrapping_shl(17)).wrapping_add(0xA282EAD8)
}
```
Another magic value that would be better in snap; I assume this is duplicate code?
Yes it's duplicate code. The function is not public.
Can we have it public under snap::raw module?
Hardcoding it here really feels wrong
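For reference, the masking discussed above rotates the CRC right by 15 bits and adds a constant (per the framing spec, so that data containing embedded CRCs doesn't produce pathological checksums). It is invertible; this std-only sketch with my own helper names shows the roundtrip:

```rust
// Mask a CRC32C as the Snappy framing format requires:
// rotate right by 15, then add the spec's constant.
fn mask_crc(crc: u32) -> u32 {
    (crc.wrapping_shr(15) | crc.wrapping_shl(17)).wrapping_add(0xA282EAD8)
}

// The inverse (illustrative helper, not in the PR): subtract the
// constant, then rotate left by 15 to undo the rotation.
fn unmask_crc(masked: u32) -> u32 {
    let x = masked.wrapping_sub(0xA282EAD8);
    x.wrapping_shr(17) | x.wrapping_shl(15)
}

fn main() {
    let crc = 0xDEAD_BEEFu32;
    assert_eq!(unmask_crc(mask_crc(crc)), crc);
    println!("roundtrip ok");
}
```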
NobodyXu left a comment
Question regarding why the encoder is designed with extra buffering.
I can understand that the decoder needs it, as we cannot control how the chunks are created.
Does snappy have a "file header" that records the size of the entire stream? If we have it, then we can configure the decoder to stop decoding once that limit is reached, and the only way to continue would be … Otherwise we cannot fix it.
No, the first "stream" chunk is just there to mark the start of the file, but it doesn't include the length of the whole stream.
You're right, I apologize for that... I tried to apply your suggestions. It's not perfect (I don't like the …
NobodyXu left a comment
Thanks!
My feedback is that the copying around of uncompressed sections is unnecessary; they should be copied directly into the output.
Also, snap::raw should expose more public items to prevent duplication, and it needs to support uninitialised buffers for maximum efficiency.
```rust
        let mut decoder = snap::raw::Decoder::new();
        decoder.decompress(data, out_buf)?;
    }
    ChunkType::Uncompressed => out_buf.extend_from_slice(data),
```
We should skip this copy by putting the chunk type into a new variant, UncompressedCopy, so that it copies straight from data, along with an offset initially set to 4 to skip the header.
Using that offset we can construct a PartialBuffer.
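A minimal std-only sketch of the suggestion, assuming the shape of the state machine; the names (State, UncompressedCopy, copy_uncompressed) are illustrative, not the PR's actual types:

```rust
// Hypothetical state variant: instead of staging uncompressed chunk
// bytes through an intermediate Vec, remember an offset into the chunk
// and copy straight into the caller's output buffer.
#[allow(dead_code)]
enum State {
    Header,
    // offset starts at 4 to skip the 4-byte masked CRC32C at chunk start
    UncompressedCopy { offset: usize },
}

// Copy as much of `chunk` (from `offset`) as fits into `out`,
// advancing the offset; returns the number of bytes written.
fn copy_uncompressed(chunk: &[u8], offset: &mut usize, out: &mut [u8]) -> usize {
    let src = &chunk[*offset..];
    let n = src.len().min(out.len());
    out[..n].copy_from_slice(&src[..n]);
    *offset += n;
    n
}

fn main() {
    // 4-byte CRC placeholder followed by the payload "hello".
    let chunk = [0u8, 0, 0, 0, b'h', b'e', b'l', b'l', b'o'];
    let mut offset = 4; // skip the masked CRC
    let mut out = [0u8; 3];
    let n = copy_uncompressed(&chunk, &mut offset, &mut out);
    assert_eq!((n, &out[..]), (3, &b"hel"[..]));
    let mut out2 = [0u8; 8];
    let n2 = copy_uncompressed(&chunk, &mut offset, &mut out2);
    assert_eq!(&out2[..n2], b"lo"); // resumes where the last call stopped
    println!("ok");
}
```

The offset doubles as the resume point when the output buffer fills mid-chunk, which is what makes the variant work with a PartialBuffer-style API.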
```rust
out_buf.resize(uncompress_length, 0);
let mut decoder = snap::raw::Decoder::new();
decoder.decompress(data, out_buf)?;
```
It's a shame that snap does not support writing to an uninitialised buffer; zeroing the whole Vec would indeed take quite some time.
```rust
impl DecodeV2 for SnappyDecoder {
    fn reinit(&mut self) -> std::io::Result<()> {
        *self = Self::new();
```
```diff
- *self = Self::new();
+ self.state = State::default();
+ self.in_buf.clear();
+ self.out_buf.reset();
```
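A std-only sketch of why resetting fields beats `*self = Self::new()`: clearing a Vec keeps its capacity, so the next decode reuses the existing allocation instead of dropping and reallocating it. Struct and field names here are illustrative, not the PR's exact types.

```rust
// Simplified stand-in for the decoder's buffers.
struct Decoder {
    in_buf: Vec<u8>,
    out_buf: Vec<u8>,
}

impl Decoder {
    // In-place reset: len -> 0, but capacity (the heap allocation)
    // is retained for the next stream.
    fn reinit(&mut self) {
        self.in_buf.clear();
        self.out_buf.clear();
    }
}

fn main() {
    let mut d = Decoder {
        in_buf: Vec::with_capacity(4096),
        out_buf: Vec::new(),
    };
    d.in_buf.extend_from_slice(&[1, 2, 3]);
    let cap = d.in_buf.capacity();
    d.reinit();
    assert_eq!(d.in_buf.len(), 0);
    assert_eq!(d.in_buf.capacity(), cap); // allocation reused
    println!("capacity retained: {}", cap);
}
```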
```rust
pub fn read_u24_le(slice: &[u8; 3]) -> u32 {
```
```diff
- pub fn read_u24_le(slice: &[u8; 3]) -> u32 {
+ fn read_u24_le(slice: &[u8; 3]) -> u32 {
```
This function is only used by snappy submodules and doesn't need to be public.
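A std-only sketch of what a function with this signature presumably does (the PR's body isn't quoted here): interpret three bytes as a little-endian 24-bit value, which is how the framing format encodes chunk lengths.

```rust
// Read a little-endian u24 (e.g. the chunk-length field of a Snappy
// frame header) by zero-extending it to a u32.
fn read_u24_le(slice: &[u8; 3]) -> u32 {
    u32::from_le_bytes([slice[0], slice[1], slice[2], 0])
}

fn main() {
    assert_eq!(read_u24_le(&[0x01, 0x02, 0x03]), 0x0003_0201);
    assert_eq!(read_u24_le(&[0xFF, 0xFF, 0xFF]), 0x00FF_FFFF);
    println!("ok");
}
```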
```rust
))?;

let chunk_type = ChunkType::from(header_part[0]);
// SAFETY: header_part is guaranteed to have at least 4 bytes due to split_first_chunk
```
```diff
- // SAFETY: header_part is guaranteed to have at least 4 bytes due to split_first_chunk
+ // header_part is guaranteed to have at least 4 bytes due to split_first_chunk
```
SAFETY comments are reserved for unsafe blocks and functions.
```rust
Skipping(usize),
Buffering {
    remaining: usize,
```
```diff
- Skipping(usize),
- Buffering {
-     remaining: usize,
+ Skipping(NonZeroUsize),
+ Buffering {
+     remaining: NonZeroUsize,
```
Using NonZeroUsize would make the invariant of the state clear.
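A sketch of the point being made, with illustrative names: NonZeroUsize makes "zero bytes remaining" unrepresentable in these states, so every transition out of Skipping/Buffering is forced through an Option instead of silently holding a stale zero.

```rust
use std::num::NonZeroUsize;

// Illustrative state machine fragment; the real enum has more variants.
#[allow(dead_code)]
enum State {
    Skipping(NonZeroUsize),
    Buffering { remaining: NonZeroUsize },
}

// Consume up to `n` bytes from the state's counter.
// `None` means the state is finished; the caller must transition,
// because a zero counter cannot be stored back into the state.
fn consume(remaining: NonZeroUsize, n: usize) -> Option<NonZeroUsize> {
    NonZeroUsize::new(remaining.get().saturating_sub(n))
}

fn main() {
    let r = NonZeroUsize::new(10).unwrap();
    assert_eq!(consume(r, 4).map(NonZeroUsize::get), Some(6));
    assert_eq!(consume(r, 10), None); // exhausted: must leave this state
    println!("ok");
}
```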
```rust
out_buf.resize(max_compress_size + 8, 0);

let mut encoder = snap::raw::Encoder::new();
let compress_data = encoder.compress(in_buffer, &mut out_buf[8..])?;
```
It's a shame snap::raw doesn't support uninitialised buffer
```rust
let chunk_type = if compress_data >= in_buffer.len() - (in_buffer.len() / 8) {
    out_buf.clear();
    out_buf.resize(in_buffer.len() + 8, 0);
    out_buf[8..].copy_from_slice(in_buffer);
```
We should have a new state, UncompressedCopy, instead of doing this intermediate copy.
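The condition in the quoted hunk stores the chunk uncompressed when compression saves less than 1/8 of the input. Extracted as a std-only helper (my own name) to make the threshold visible:

```rust
// Fallback test matching the check in the PR hunk above: if the
// compressed size is within 12.5% of the input size, the compression
// ratio is considered too poor and the chunk is stored uncompressed.
fn should_store_uncompressed(compressed_len: usize, input_len: usize) -> bool {
    compressed_len >= input_len - input_len / 8
}

fn main() {
    assert!(should_store_uncompressed(95, 100));  // only 5% saved: store raw
    assert!(!should_store_uncompressed(50, 100)); // 50% saved: keep compressed
    println!("ok");
}
```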
Yeah, it's the right direction; it looks ugly because …
Hello there!
This pull request introduces support for a framed snappy decoder and encoder. I'm not an expert in compression/decompression, but I feel the snappy one is quite different from the others. My work was inspired by the gzip decoding (for the state machine), the work done in the snap crate in the framing reader and writer, and of course the actual spec. One big difference is that the whole frame must be buffered before compressing or decompressing it.
I have a few questions/notes:
- One test fails (snappy::tokio::bufread::decompress::trailer), but I don't know how to fix it. If I understand the test correctly, it checks whether the decoder can stop decoding with some trailing data in the buffer. But with the framing format, I don't know how to address that. Should I stop decoding when the chunk type is invalid, or should I stop and fail (current behavior)?
- Because the proptest inputs ([1, 2, 3...]) are not compressible in snappy, the encoder and decoder never compress or decompress anything. But is it the job of the proptests to create some random test data?