jbosh 16 hours ago [-]
I love it. So much in computers is trade-offs, and this was a fun read exploring that.
It would be interesting to see some economics of what it takes for an 8,000% increase in encoding time to make that money back in terms of storage or bandwidth. I also wonder how brotli/lzma would compare here. Are there some obscene modes on those that had similar results?
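A rough way to frame that break-even question, as a sketch only. Every number below (CPU price, egress price, file size, bytes saved) is a made-up placeholder, not a figure from the article; `break_even_downloads` is a hypothetical helper name.

```python
def break_even_downloads(extra_cpu_seconds: float,
                         cpu_cost_per_second: float,
                         bytes_saved: float,
                         egress_cost_per_byte: float) -> float:
    """Downloads needed before bandwidth savings repay the extra encode time."""
    extra_cost = extra_cpu_seconds * cpu_cost_per_second
    saving_per_download = bytes_saved * egress_cost_per_byte
    if saving_per_download <= 0:
        return float("inf")  # no savings, so the extra effort never pays off
    return extra_cost / saving_per_download


# Assumed inputs: a 1 s baseline encode, so an 8,000% increase adds 80 s.
extra_seconds = 80.0
cpu_price = 0.04 / 3600        # assumed $0.04 per core-hour
saved = 3_000_000              # assumed 3 MB saved on a 100 MB file
egress_price = 0.05 / 1e9      # assumed $0.05 per GB transferred

print(break_even_downloads(extra_seconds, cpu_price, saved, egress_price))
```

With these placeholder prices the extra encode pays for itself after a handful of downloads; storage cost can be folded in the same way by adding a per-byte-month term to the savings.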
userbinator 15 hours ago [-]
I also wonder how brotli/lzma would compare here.
Far better, just like anything else based on arithmetic coding. The main distinction here is that the output can still be decompressed with a standard Inflate implementation.
atiedebee 5 hours ago [-]
Except that brotli uses Huffman coding. Its main claims to fame are using higher-order statistics to select a Huffman table and its built-in dictionary.
This class of compression programs sees larger differences due to the way the data is modelled instead of the specific entropy coder used.
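To illustrate why the model can matter more than the entropy coder, here is a small sketch comparing the empirical entropy of the same bytes under an order-0 model (one global table) and an order-1 model (a table chosen by the previous byte). It is not how brotli works internally, it ignores the cost of transmitting the tables, and the sample input is contrived.

```python
import math
from collections import Counter, defaultdict


def order0_entropy(data: bytes) -> float:
    """Bits per symbol when every byte is coded from one global frequency table."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def order1_entropy(data: bytes) -> float:
    """Bits per symbol when the table is selected by the previous byte."""
    contexts = defaultdict(Counter)
    for prev, cur in zip(data, data[1:]):
        contexts[prev][cur] += 1
    total_bits = 0.0
    for counts in contexts.values():
        ctx_n = sum(counts.values())
        for c in counts.values():
            total_bits -= c * math.log2(c / ctx_n)
    return total_bits / (len(data) - 1)


sample = b"ab" * 512  # contrived input: each byte is fully determined by the previous one
print(order0_entropy(sample), order1_entropy(sample))
```

Any entropy coder fed the order-0 statistics needs about one bit per symbol here, while the order-1 model leaves almost nothing to code, whichever coder sits behind it.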
a_t48 16 hours ago [-]
zstd has higher level modes. Default is -3. I saw a good tradeoff between compression speed and ratio up to -9 or so. From -20 to -22 it will use much more memory and IIRC can have downstream effects on decompression speed. I'm using -9 for my container registry and plan to recompress at a higher level for commonly accessed base layers, as well as give customers a button that lets them pay a bit more to do it themselves.
loeg 14 hours ago [-]
To be a little pedantic, the usual zstd levels are positive integers (1-22, default 3). The negative integers denote "fast" modes with worse compression (there are only a few of these).
edflsafoiewq 12 hours ago [-]
I think those are CLI options, not negative signs. I.e., you call zstd -3 for compression level 3.
a_t48 13 hours ago [-]
Whoops! You're right, and it's too late to edit.
chickenbig 8 hours ago [-]
Even compression level 1 or 2 is pretty good.
I once used https://github.com/google/riegeli and a low zstd compression level to store large quantities of protobuf data in an efficient manner (in terms of CPU, RAM and streaming to disk). Shame Riegeli is not well known, not well documented and does not have many tests.
Zenst 12 hours ago [-]
Process-intensive, but higher compression has clear strategic value. Distant satellites such as Voyager, where bandwidth is severely limited, could transmit more data using such capabilities. Equally, for long-term archival storage, improved compression would allow far greater volumes of data to be preserved on durable, life-long media formats.
XorNot 11 hours ago [-]
Distant space probes are power constrained though.
It's entirely possible that, with their RTG power sources degrading, doing the compression would be more expensive than just sending the data as is.
lstodd 11 hours ago [-]
RTGs degrade no matter what you do with the resulting heat. It doesn't matter whether you compress stuff or just let the CPUs idle.
XorNot 7 hours ago [-]
That's the point: you're going to spend a lot more time compressing when you could've just been sending data.
And you're eating into a limited overall power and weight budget to do it, rather than, say, running the science on the probe.
There is also zopfli and its descendant ECT, which allow for more extreme tradeoffs.
blobbers 13 hours ago [-]
As someone currently exploring grid searches of encoding + compressor combos, and looking at neural compressors that reduce size to almost half that of a traditional compressor yet take on the order of minutes rather than milliseconds to operate in either direction, I appreciate a good compression post!
userbinator 14 hours ago [-]
It's interesting to see just how far Deflate can be taken, and to know that even after decades there is still some (admittedly tiny) room for improvement.
Optimal LZ is well-known, and so is static Huffman, but their combination creates some additional inefficiencies (opportunities).
...and of course it's written by someone with a Russian name, and has that characteristic style common to many other articles about data compression.
"OpenZL delivers high compression ratios while preserving high speed, a level of performance that is out of reach for generic compressors. OpenZL takes a description of your data and builds from it a specialized compressor optimized for your specific format."
Retro_Dev 11 hours ago [-]
OpenZL is nice, but it's often less useful than you think - it requires that you know the structure of your data, and don't care about inspecting that data outside of your program. I've extracted one too many png files from a word document (by renaming .docx to .zip) to desire OpenZL everywhere... It might be better as a short-term "data in transit" compression than for long term storage.
pella 10 hours ago [-]
Please check the OpenZL v0.2 + Silesia corpus benchmark.
"OpenZL to offer 10% faster compression speed and 70% faster decompression speed compared to Zstandard level 1 on the Silesia corpus in our benchmarks."
"OpenZL now ships its own LZ codec, exposed as ZL_GRAPH_LZ, and the serial profile in zli. It is still being actively developed to expand its feature set and improve performance on small inputs."
The future may be ~ AI-assisted format detection + OpenZL
(~ OpenZL-AI-LLM recognises the data structure, then guides OpenZL toward the best lossless compression path )
pella 5 hours ago [-]
"The unreasonable effectiveness of our first foray into training leads us to believe that the graph model is uniquely positioned to facilitate ML-guided generation of compressors. We are tempted to view this as “the next big thing” in production-scale compression. Whereas compression research has up to now eluded those without domain expertise, we believe the future of application-specific compressors will be unlocked via investment in automated learning methods."
1. Upload data using conventional compression method (or uncompressed)
2. Spend orders of magnitude (literally) more on compute to run the LLM on the data than any compression algorithm would ever take.
Someone 14 hours ago [-]
So, what’s the effect on memory usage?
And for decompression, the effect on memory usage and timings?
lifthrasiir 14 hours ago [-]
For decompression, nothing changes because DEFLATE is asymmetric; the compressor can spend however much time it likes optimizing the compressed stream, independently of the decompressor.
masklinn 10 hours ago [-]
Deflate also has a fixed 32K window so even with indexes and parallelism there’s only so much you can blow up memory use.
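A minimal sketch of that asymmetry using Python's standard `zlib` module. Levels 1 and 9 are only stand-ins for a cheap versus an expensive encoder (this is not the encoder from the article), and the sample data is made up; `deflate` is a hypothetical helper name.

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog. " * 2000


def deflate(payload: bytes, level: int) -> bytes:
    """Compress with the given effort level and the maximum DEFLATE window."""
    # wbits=15 means a 2**15 = 32 KiB window, the largest DEFLATE permits.
    comp = zlib.compressobj(level=level, wbits=zlib.MAX_WBITS)
    return comp.compress(payload) + comp.flush()


fast = deflate(data, 1)   # low encoder effort
best = deflate(data, 9)   # high encoder effort

# The same standard inflate call decodes both streams; the decoder neither
# knows nor cares how much time the encoder spent.
assert zlib.decompress(fast) == data
assert zlib.decompress(best) == data

print(len(data), len(fast), len(best))
```

The decoder's window buffer is bounded by the same 32 KiB limit in both cases, which is why extra encoder effort does not raise decompression memory.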
https://trac.ffmpeg.org/wiki/Encode/H.264#FAQ