Talk:LZ4 (compression algorithm)


The link https://www.openhub.net/p/lz4/ in the external links section is obsolete: LZ4 has been deleted on OpenHub. Serge31416 (talk) 08:26, 28 September 2017 (UTC)[reply]

Removals

[Copied from User talk:Intgr#Deletion from LZ4]

This is a small matter. As it happens, I don't support your removal of the statement "In a worst-case scenario, incompressible data gets increased by 0.4%." (along with its admittedly sketchy source) from the LZ4 article.

By design, quoting the entire incompressible byte stream adds an overhead of one 0xFF length byte for every 255 bytes of uncompressed data. Saying so is only slightly more OR than making an unsourced claim that wrapping quote marks around an N-character string increases its representation to N+2 characters.
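For illustration only (my own back-of-the-envelope arithmetic, not taken from any source): one extra length byte per 255 input bytes works out to 1/255 ≈ 0.392%, which is presumably where the quoted 0.4% figure comes from.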

By taking this statement out, you are catering to the common misunderstanding that there is such a thing as a compression algorithm which only ever makes the object smaller. By far this is the more severe of the two evils.

I also don't support this removal (one-time-only IP editor):

https://en.wikipedia.org/w/index.php?title=LZ4_%28compression_algorithm%29&diff=prev&oldid=653088035

Your call on both issues. I'm not going to wade in with my own edits. — MaxEnt 03:13, 20 April 2015 (UTC)[reply]

@MaxEnt: I copied this discussion here from my talk page so other people interested in the article can chime in and/or improve it.
The first edit that MaxEnt is talking about is this. Admittedly I didn't do a good job at explaining that change in the edit comment.
I mainly object to the material removed in both edits on sourcing grounds; if there were good sources supporting the claims, I would be all for adding them back. But usually blogs are not considered reliable sources.
I also tried repeating this on my own and could not reproduce the 0.4% overhead. I tried tons of times, but always arrived at the same result:
% dd if=/dev/urandom bs=100000 count=1 |lz4 |wc -c
1+0 records in
1+0 records out
100000 bytes (100 kB) copied, 0.00602624 s, 16.6 MB/s
100019
Am I missing something? That's a 0.02% overhead, and it's even lower for larger input sizes. -- intgr [talk] 08:53, 20 April 2015 (UTC)[reply]
Dear intgr, what you are missing is that you measured 0.02% for one specific case. That specific case is not the "worst-case scenario". --DavidCary (talk) 09:13, 30 May 2015 (UTC)[reply]
@DavidCary: Random data is the worst-case scenario for data compression algorithms. And while there is a small chance of random data containing runs of compressible patterns, if that were the case I should have seen some variance in output sizes, since different random streams would have slightly different compression ratios. Thus I conclude that LZ4 wasn't able to compress these streams at all.
And while I understand that this is WP:OR, my point is that the original claim came from an unreliable source and runs contrary to the behavior I observed, so that's two strikes against making this claim and using this source. -- intgr [talk] 13:40, 30 May 2015 (UTC)[reply]
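For anyone who wants to repeat this without the command-line tool, here is a minimal sketch against the liblz4 C API (LZ4_compress_default and LZ4_compressBound); rand() is only a crude stand-in for /dev/urandom, and the library's raw block output will not match the lz4 command's framed output byte for byte:
#include <stdio.h>
#include <stdlib.h>
#include <lz4.h>

int main(void)
{
    const int n = 100000;                     /* same input size as the dd test above */
    const int cap = LZ4_compressBound(n);     /* worst-case output size for the block format */
    char *src = malloc(n);
    char *dst = malloc(cap);
    for (int trial = 0; trial < 5; trial++) {
        for (int i = 0; i < n; i++)
            src[i] = (char)rand();            /* crude pseudo-random filler */
        int out = LZ4_compress_default(src, dst, n, cap);
        printf("trial %d: in=%d out=%d bound=%d\n", trial, n, out, cap);
    }
    free(src);
    free(dst);
    return 0;
}
Compile with something like cc test.c -llz4; on my understanding, incompressible inputs should produce output sizes close to (but below) the bound, with little variance between trials.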
Great! Perhaps I will learn something new!
I am imagining that an intelligent and malicious adversary, Mallory, could somehow pick specific bytes that would give the worst possible "compression" (i.e., the maximum expansion), generating Mallory's file.
I am also imagining that Bob sticks his hand in an urn of all 256 possible bytes and picks a series of random bytes (with replacement), generating Bob's file.
I am aware of many proofs that "random data cannot be compressed", so applying LZ4 compression (or any other data compression algorithm that makes some files smaller) to Bob's file results, for the vast majority of possible files Bob might produce, in a "compressed" file that is slightly larger than Bob's uncompressed file.
I see that the lz4.h source code and lz4.c source code (primary sources, and therefore not preferred for Wikipedia references)
claim that 'the maximum size that LZ4 compression may output in a "worst case" scenario (input data not compressible)' is, for raw input data 'isize' bytes in length,
(isize) + ((isize)/255) + 16
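As a sanity check (my own sketch, not from a source), current releases expose this bound as the LZ4_COMPRESSBOUND macro and the LZ4_compressBound() function; plugging in the 100 kB input from the test above gives:
#include <stdio.h>

/* The worst-case bound quoted from lz4.h, written out as a function. */
static long worst_case_bound(long isize)
{
    return isize + isize / 255 + 16;
}

int main(void)
{
    printf("%ld\n", worst_case_bound(100000));   /* prints 100408 */
    return 0;
}
Note that this bound applies to the raw block format; the lz4 command-line tool used in the test above emits the LZ4 frame format, which adds its own framing overhead (headers, block sizes, checksums) instead, so the 100019-byte result above is not directly comparable to this figure.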
It's not clear to me if there is some known worst-case input string that actually forces this amount of expansion, or if (like many people writing data compression algorithms) the author deliberately chose a quick-to-calculate approximation that always gives values larger than the actual worst case.
It seems to me that, after Mallory reads the source code of one specific implementation of some specific "data compression algorithm", Mallory could deliberately design a sequence of bytes that would give the worst possible compression.
Wouldn't the worst-case file carefully crafted by Mallory have worse expansion than the typical file randomly generated by Bob?
I would be surprised and delighted if that were not the case. --DavidCary (talk) 23:16, 14 June 2015 (UTC)[reply]