Archive Compression Schemes

Is this the best format for data archival? I'd ask in the Questions thread, but maybe we can have some general archive discussion or something as well.

Use 7zip and compress with LZMA2 on ultra.

I've heard that LZMA2's a shitty format. Not to mention, you're not going to get to keep any filesystem data if you don't tar it first.

What does that even mean? Tar is used specifically because it can store stuff like xattrs, sometimes (to copy a directory with scp or netcat while preserving really everything).

xz is probably better, simply because it's more used (thus, important bugs are probably already fixed). Please don't post that lzip rant, I know of it.
Something like a faster paq8px would be pretty cool, though; dictionary based methods (LZ*) are a bit boring, now.

ck.kolivas.org/apps/lrzip/
kolivas did nothing wrong, he is a victim of the international Red Hat EEE conspiracy

Why in the fuck are people still using a format that was designed for tape drives?

in case anyone was wondering: nongnu.org/lzip/xz_inadequate.html


wise guy, eh?


Why does it matter if it was designed for tape drives or IPFS? The central goal of storing data and metadata remains the same.

If there's some popular criticism and you already know about it, why not address it?
Especially since the link posted seems to raise valid concerns.

Media-specific optimizations are a thing, not sure if they are relevant here though.

OP, here's a useful chart I made (for images, I'll do text and audio later):
> uname -a
Linux gentoo 4.18.15-gentoo #1 SMP Thu Oct 18 20:27:49 CEST 2018 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux
> pax -v sort -t, -k2,2 -n compression-benchmark.csv | csv-viewer -l -
┌──────────────────┬────────────────────┬───────────────────────┬─────────────────────────┐
│ Command          │ Compression ratio  │ Compression time (s)  │ Decompression time (s)  │
├──────────────────┼────────────────────┼───────────────────────┼─────────────────────────┤
│ paq8px -3        │ 0.1710             │ 8753.8900             │ 0.0000                  │
│ 7z -m0=ppmd -mx9 │ 0.3270             │ 7.8200                │ 8.4700                  │
│ bzip2 -9k        │ 0.3440             │ 3.5400                │ 1.4700                  │
│ lzip -9k         │ 0.3560             │ 18.0600               │ 0.9800                  │
│ plzip -n8 -9k    │ 0.3560             │ 18.4800               │ 1.0900                  │
│ 7z -mx9          │ 0.3590             │ 8.3100                │ 0.7700                  │
│ xz -T0 -9k       │ 0.3590             │ 14.9900               │ 0.9000                  │
│ zstd -q -T0 -19  │ 0.3930             │ 8.9700                │ 0.1400                  │
│ gzip -9k         │ 0.4930             │ 3.2000                │ 0.2800                  │
└──────────────────┴────────────────────┴───────────────────────┴─────────────────────────┘


It's abandoned, full of bugs, and the internal zpaq is really outdated. It also doesn't have a gzip-style interface, which is a big no-no, and zstd has had a long-match mode for some time now.

Fortunately, it changed. Tar has had an index allowing you to extract files without reading the whole archive for a long time now. The real problem is the fragmentation and horrendous interface, though. POSIX was useful (for once) and fixed it with pax, but it's not very used, sadly.

The problem is that while xz is indeed full of problems, lzip is used by absolutely nobody, which makes me pretty nervous about how thoroughly it has been tested. Also, xz has a busybox applet, which is pretty useful.

Since it's for archival, consider using something that is guaranteed to work. A lot of the formats with the highest compression ratios are not backwards compatible across new releases, if new versions are even released. If you want to be able to open the archive 40 000 years from now, just use ZIP (native Windows support) or tar with gzip or bzip2. ZPAQ is a safe bet too, because it offers good compression and “All versions of zpaq can read archives produced by older versions back to version 1.00 (March 2009)”.

Speaking of compression… I archive YouTube channels quite often and no matter what compression format I use, I barely squeeze out any difference between compressed and uncompressed file sizes. Does anyone have any suggestions?

Attached: technologic.jpg (692x300, 85.25K)

The only way to make lzip more popular is to use it.

Can't you just use 7zip via Wine?
Pretty sure 7zip is infinitely more used by end users than either xz or lzip, and by default it uses LZMA2, while also offering LZMA, bzip2, deflate, and a few more standards.

Well obviously, I posted paq8px just as a comparison, since it's one of the best (barring cmix, which requires too much RAM and time).
Are you retarded? h264/vp9 are already compressed using very advanced lossy techniques, you won't improve on them by trying further compression.
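
A rough way to see this for yourself: encoder output looks almost like random bytes, and general-purpose compressors can't shrink that. A minimal Python sketch, assuming `os.urandom` is a fair stand-in for already-compressed video (real files aren't perfectly random) and using a made-up log line as the compressible case:

```python
import lzma
import os
import zlib


def ratio(data: bytes, compress) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(compress(data)) / len(data)


# Stand-in for h264/vp9 output: high-entropy bytes (an assumption, not real video).
video_like = os.urandom(1_000_000)
# Made-up repetitive text, the kind of input these tools are good at.
log_like = b"Oct 18 20:27:49 gentoo kernel: usb 1-1: new device found\n" * 20_000

for name, data in (("video-like", video_like), ("log-like", log_like)):
    print(
        name,
        "zlib -9: %.4f" % ratio(data, lambda d: zlib.compress(d, 9)),
        "xz: %.4f" % ratio(data, lzma.compress),
    )
```

The video-like input should come out at a ratio of about 1 (no gain, sometimes slightly bigger), while the log-like input shrinks to a tiny fraction. Re-encoding at a lower resolution or with a newer codec is the only real lever for video.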

Yeah-yeah, I know. But still, it'd be nice to squeeze 500 videos.

Limiting the resolution to 720p or even 540p would give you pretty good savings, honestly. Or wait for av1 to be deployed on Jewtube globally.

Yeah, I hope they won't deploy DRM with it.

zstd is the best overall, I wish more FOSS projects would use it by default.

lzip > xz
I use either lzip or gzip for pretty much everything.

Why use 7zip from Wine? p7zip (p7zip.sourceforge.net/) exists and works well. I either use 7zip or zip for any archives that need to be read by botnet users.

Like what? Access rights? Those are computer specific and not even worth archiving.
It's a compression algorithm.
The OP says Archive Compression Schemes which makes me think this thread is about long term archival.
In that case it's best to use something common that compresses really well.
LZMA2 does that. With multiple cores!

Why is LZMA not listed? It tends to beat ppmd.

I just noticed it is listed, as "lzip", but also make an entry for 7zip using LZMA on ultra.

zstd is mediocre in all dimensions. It will sometimes be best depending on the situation. In particular, suppose you want to send data over a wire as fast as possible. The time taken is d + s/b + c (d: decompression time, c: compression time, b: byte rate, s: compressed size). Graphing that:
b(B/s)  1.00E+03  1.00E+06  1.00E+09
paq     1.48E+04  8.76E+03  8.75E+03
7z1     1.16E+04  2.79E+01  1.63E+01
bz      1.22E+04  1.72E+01  5.02E+00
lz      1.26E+04  3.17E+01  1.91E+01
plz     1.26E+04  3.22E+01  1.96E+01
7z2     1.27E+04  2.18E+01  9.09E+00
xz      1.27E+04  2.86E+01  1.59E+01
zstd    1.39E+04  2.30E+01  9.12E+00
gz      1.75E+04  2.09E+01  3.50E+00
none    3.54E+04  3.54E+01  3.54E-02
At 1k/s 7z -m0=ppmd is best. At 1M/s bz wins. At 1G/s, no compression is best; second best is gz. On this graph, I see nowhere that zstd is best, but let's add another one assuming that compression happens once, but downloading + decompression happens n times (i.e. c+(s/b+d)*n) (also standardize on b=1e6):
n       1.00E+01  1.00E+03  1.00E+05  1.00E+07
paq     8.81E+03  1.48E+04  6.15E+05  6.06E+07
7z1     2.08E+02  2.01E+04  2.01E+06  2.01E+08
bz      1.40E+02  1.37E+04  1.37E+06  1.37E+08
lz      1.54E+02  1.36E+04  1.36E+06  1.36E+08
plz     1.56E+02  1.37E+04  1.37E+06  1.37E+08
7z2     1.43E+02  1.35E+04  1.35E+06  1.35E+08
xz      1.51E+02  1.36E+04  1.36E+06  1.36E+08
zstd    1.50E+02  1.41E+04  1.41E+06  1.41E+08
gz      1.81E+02  1.78E+04  1.77E+06  1.77E+08
none    3.54E+02  3.54E+04  3.54E+06  3.54E+08
For ten downloads bz is best. At 1000 they're all nearly identical lol, but 7z2 is best. At 100,000 paq starts throwing its weight around; the rest are all nearly identical again, same for 10,000,000. The lesson basically is that compression time matters very little; of primary importance (for downloads) are ratio and decompression time.
>done in excel
For archival, literally the only thing that matters is ratio (and reliability). You can run it over night if it's too slow, along with your indexer and defragger.
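
For anyone who wants to redo this without Excel, here is a minimal Python sketch of the same model. The ratios and times are copied from the image benchmark chart earlier in the thread. The uncompressed size of 3.54e7 bytes is an assumption inferred from the "none" row (3.54E+04 s at 1e3 B/s). The repeated-download cost is written as c + (s/b + d)*n, which is what "compression happens once" describes and what the posted numbers match. `BENCH`, `one_shot`, `repeated` and `best` are names made up for this sketch.

```python
# (compression ratio, compression time in s, decompression time in s)
BENCH = {
    "paq8px -3": (0.1710, 8753.89, 0.00),
    "7z -m0=ppmd -mx9": (0.3270, 7.82, 8.47),
    "bzip2 -9k": (0.3440, 3.54, 1.47),
    "lzip -9k": (0.3560, 18.06, 0.98),
    "plzip -n8 -9k": (0.3560, 18.48, 1.09),
    "7z -mx9": (0.3590, 8.31, 0.77),
    "xz -T0 -9k": (0.3590, 14.99, 0.90),
    "zstd -q -T0 -19": (0.3930, 8.97, 0.14),
    "gzip -9k": (0.4930, 3.20, 0.28),
    "none": (1.0, 0.0, 0.0),
}
SIZE = 3.54e7  # uncompressed bytes; assumption inferred from the "none" row


def one_shot(name, b):
    """Compress once, send once, decompress once: d + s/b + c."""
    r, c, d = BENCH[name]
    return d + r * SIZE / b + c


def repeated(name, n, b=1e6):
    """Compress once, then n downloads and decompressions: c + (s/b + d) * n."""
    r, c, d = BENCH[name]
    return c + (r * SIZE / b + d) * n


def best(cost, *args):
    """Name of the benchmark entry with the lowest cost."""
    return min(BENCH, key=lambda name: cost(name, *args))


for b in (1e3, 1e6, 1e9):
    print("b = %.0e B/s -> %s" % (b, best(one_shot, b)))
for n in (10, 1000, 100_000):
    print("n = %d -> %s" % (n, best(repeated, n)))
```

Swapping in your own measurements only means editing `BENCH` and `SIZE`; the winners will move with the hardware and the test file.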

...

More charts:
* text: a ~4.4M syslog
* photo:
> du -h -- "Honda CR-X front-left.ppm"
18M Honda CR-X front-left.ppm
> file -b -- "Honda CR-X front-left.ppm"
Netpbm image data, size = 3008 x 2000, rawbits, pixmap
* flat image:
> du -h -- "Satanichia smug.ppm"
4.1M Satanichia smug.ppm
> file -b -- "Satanichia smug.ppm"
Netpbm image data, size = 1400 x 1000, rawbits, pixmap
* audio:
> du -h -- "10. Song of Sirens.wav"
27M 10. Song of Sirens.wav
> file -b -- "10. Song of Sirens.wav"
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz
Some remarks, ffv1 looks very good as a quite slow PNG replacement (considering it's probably not as optimized) and jpeg2000 is pretty impressive for a "forgotten" codec.

Attached: flat_image.png (916x1300 9.47 KB, 9.7K)

Looks like the photo one didn't upload.

Attached: photo.png (916x1300, 12.66K)

Fyi, you can upload your scripts, results, and test suites ITT as ".lzma.pdf".
Your images really mean nothing if we can't replicate your results, esp. on different hardware.

p7zip has not been updated in 2 years, while 7zip is in active development and its last release was 6 months ago.
As you can see at
7-zip.org/history.txt
there were quite a few bugfixes after 17.00beta

How the fuck is jpeg2000 forgotten when the entire movie industry uses it when releasing to theaters? If you want to include obscure compression formats, swap FLAC for TAK: it compresses better, but practically nobody uses it.

So does that mean it doesn't work any more?
I'd better alert all my 7z archives that they're old and broken.

m8, I can't even post with Tor on this shitty imageboard (I dided yesterday because the onion service was down). Anyway, I could at least post the images if someone was interested but nobody is, that's why I didn't bother.

1) I'm obviously talking about jpeg2000 lossless.
2) Even if M-JPEG2000 is "used" (no support in ffmpeg means it's dead, for me), the image format itself isn't.
3) The differences between flac, tak, optimfrog, wavpack, etc... are so tiny they're meaningless these days. Flac, at least, is fast, compresses well and uses a sane way of tagging.

Does 0x0.st allow tor posting?

Isn't that only for uncompressed archives?

I poke around in firmware/ROM images as a hobby, and I've recently had to come to terms with the fact that you have to treat most of them as if they were an archive format designed by a dozen assholes that hate each other.

It means you probably shouldn't rely on an old version for long term data storage, smartass.
It also means the old version might stop working/break in weird ways as the systems around it change, making it harder to access the data when you need it.

Well, isn't that obvious? Not tar's fault.

my.mixtape.moe/abkdqn.tar.xz
Here's the stuff, if you want it. You'll need a POSIX compliant time (read: not your shell builtin), flac and imagemagick.

If I mentioned it, it's because I want to be able to replicate results on different hardware. It is inexcusable that to this day attachments aren't allowed on anonymous services when even Ron agreed these should be the norm[1], but don't bullshit me when you've already risked image attaching ITT. You can always externally host&link:

Yes

Thanks, will verify when I get home. Didn't know mixtape allowed nonjs uploads, and on TOR.
Strange you chose .tar.xz when .xz is unreliable and a non-container format.
I'll presume you're inexperienced with the CIA triad, since you did not give me checksums of the files and the archive itself.
[1]twitter.com/infinitechan/status/977460936427544576
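
For the checksum part, a minimal Python sketch of what publishing and checking a digest could look like. The choice of SHA-256 is an assumption (nobody ITT named an algorithm), and `sha256_of` and `verify` are names made up for this sketch. A checksum posted next to the download link only covers integrity: anyone able to swap the file can swap the hash too.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large archives fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """True if the file's digest matches the one published by the uploader."""
    return sha256_of(path) == expected_hex.strip().lower()
```

The uploader would run `sha256_of` on the archive and on each file inside it and post the output; the downloader would call `verify` with those strings.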

Nigger, the CIA has better to do than spoofing a dumb image compression test upload.
Even if they wanted to get you specifically, they would just deploy some zero day on your favourite weird porn site.

This isn't an autism competition, you know.

It allows Tor, but yeah, you need JS. If you know a pomf alternative that doesn't, tell me.

Filter yourself from this website:
en.wikipedia.org/wiki/Information_security#Key_concepts

Well, it's a "QED via peer review"

You've been linked to HTTPS://0x0.st twice already. Your dismissal is starting to read as disingenuous.

I agree not to rely on it for long term storage. That's what the ubiquitous tar and zip are for.

For my purposes though I look at it like this...
#1 I highly doubt newer versions of Windows 7zip are going to magically stop reading archives from 7zip 16.02.
#2 If I encounter a brand new 7z file I can't open with 16.02 then maybe I'll consider using wine to run the newest 7z if there isn't a native solution. I suspect I'll be able to find an alternative by that time as projects like libarchive and unarr exist.
#3 I mostly use it for unimportant stuff like gaymez and junk for botnet users. I probably should switch to tar.{xz,bzip2,gz} in the future for this as I see now 7zip boasts that it can fully support those archives.


But the thread is mostly about compression and bare tar files are much less common to download than compressed tar files. It's very annoying to have to completely decompress a 2GB tar.xz file just to extract a 4KB file embedded at the end.
The solution would be to encase individually compressed files in a bare tar archive of course, but in my experience that's annoying and error prone. Maybe I just don't know what I'm doing though.

It's 404. Stop being an idiot and just use 0x0.st

Probably ZPAQ. It has built-in support for incremental backups and encryption, and the fastest PAQ compressor around if you need it. It's amazing for text with a lot of duplication. 7-Zip and tar.xz can't compete. However, it is not for system backups on U*nx because the format does not store owner/group.
mattmahoney.net/dc/zpaq.html

Attached: zpaq 10gb.png (651x396, 54.49K)

1) Look at benchmarks, it's "okay" but it's symmetrical (encoding as long as decoding), sadly. LZMA -9 (be it xz, lzip or 7z) is generally almost as good for a fraction of the cost.
2) The fucking tool can't even decompress/compress to/from stdout, it's just shit for Windows lusers. If it had a gzip interface (which is now the standard for POSIX compression tools), it'd be a lot more interesting.

On my old junk computers zpaq -m3 compresses faster than 7z -mx=9 or xz -9 with better ratios, but the decompression time is roughly symmetrical like you say; -m1 and -m2 decompress faster than they compress. ZPAQ is definitely Windowsy. You can extract a file from an archive to a named pipe, but that's it.

Name your price
phda9 1.6 15,040,647 117,039,346 41,911 xd 117,081,257 84713 88401 4996 CM 83
cmix v16 14,955,482 116,912,035 226,121 s 117,138,156 613898 658679 27708 CM 83
paq8pxd_v47 -s15 16,080,717 127,404,715 139,841 s 127,544,556 75022 75611 27500 CM 81
durilca'kingsize -m13000 -o40 -t2 16,209,167 127,377,411 407,477 xd 127,784,888 1398 1797 13000 PPM 31
cmve 0.2.0 -m2,3,0x7fed7dfd 16,424,248 129,876,858 307,787 x 130,184,645 1140801 19963 CM 81
paq8hp12any -8 16,230,028 132,045,026 330,700 x 132,375,726 37660 37584 1850 CM 41
drt|emma 1.23 16,523,517 134,164,521 1,358,251 xd 135,522,772 73006 67097 3800 CM 81
zpaq 6.42 -m s10.0.5fmax6 17,855,729 142,252,605 4,760 sd 142,257,365 6699 14739 14000 CM 61
drt|lpaq9m 9 17,964,751 143,943,759 110,579 x 144,054,338 868 898 1542 CM 41
mcm 0.83 -x11 18,233,295 144,854,575 79,574 s 144,934,149 394 281 5961 CM 72
nanozip 0.09a -cc -m32g -p1 -t1 -nm 18,594,163 148,545,179 783,642 x 149,328,821 1149 1141 32000 CM 74
xwrt 3.2 -l14 -b255 -m96 -s -e40000 -f200 18,679,742 151,171,364 52,569 s 151,223,933 2537 2328 1691 CM
fp8 v3 -8 18,438,169 153,188,176 50,068 s 153,238,244 20605 22593 1192 CM 26
WinRK 3.03 pwcm +td 800MB SFX 18,612,453 156,291,924 99,665 xd 156,391,589 68555 800 CM 10
ppmonstr J -m1700 -o16 19,055,092 157,007,383 42,019 x 157,049,402 3574 ~3600 1700 PPM
zcm 0.93 -m8 -t1 19,572,089 159,135,549 227,659 x 159,363,208 421 411 3100 CM 48
slim 23d -m1700 -o12 19,077,276 159,772,839 69,453 x 159,842,292 5232 ~5400 1700 PPM
bwmonstr 0.02 20,307,295 160,468,597 69,401 x 160,537,998 331801 156147 590 BWT 30
nanozipltcb 0.09 20,537,902 161,581,290 133,784 x 161,715,074 64 30 3350 BWT 40
M03 1.1b 1000000000 20,710,197 163,667,431 50,468 x 163,717,899 457 406 5735 BWT 52
glza 0.10.1 -x -p3 20,356,097 163,768,203 69,935 s 163,838,138 8184 11.9 8205 Dict 67
bcm 0.14 c1000 20,736,614 163,885,873 74,569 x 163,960,442 162 153 5000 BWT 60
bsc 2.00 -b1000p 20,789,147 163,888,465 122,581 s 164,011,046 237 199 5095 BWT 39
bbb m1000 20,847,290 164,032,650 11,227 s 164,043,877 4524 2619 1401 BWT
pcompress 3.1 -c libbsc -l14 -s1000m 20,769,968 163,391,884 1,370,611 x 164,762,495 359 74 3300 BWT 48
paq9a -9 19,974,112 165,193,368 13,749 s 165,207,117 3997 4021 1585 CM
uda 0.300 19,393,460 166,272,261 11,264 x 166,283,525 25282 25174 180 CM
BWTmix v1 c10000 20,608,793 167,852,106 9,565 x 167,861,671 1794 690 5000 BWT 49
lrzip 0.612 -z -L 9 -p 1 19,847,690 169,318,794 99,363 x 169,418,157 2987 2929 2700 CM 33
cm4_ext 20,188,048 170,566,799 204,782 x 170,771,581 4123 4130 1906 CM 26
M1x2 v0.6 7 enwik7.txt 20,723,056 172,212,773 38,467 s 172,251,240 711 715 1051 CM 26
cmm4 v0.1e 96 20,569,034 172,669,955 31,314 x 172,701,269 2052 2056 1321 CM
ccmx 1.30 7 20,857,925 174,142,092 15,014 x 174,157,106 1313 1338 1332 CM
bit 0.7 -p=5 20,823,204 174,425,039 62,493 x 174,487,532 2050 2100 663 CM 26
mcomp 2.00 -mw -M320m 21,103,670 174,388,351 172,531 x 174,560,882 473 399 1643 BWT 26
epmopt|epm r9 -m800 -n20 --fixedorder:12 19,713,502 174,817,424 141,101 x 174,958,525 3179 3376 800 PPM
WinUDA 2.91 mode 3 (194 MB) 20,332,366 174,975,730 17,203 x 174,992,933 23610 23473 194 CM
lstm-compress 20,494,577 174,868,709 157,238 s 175,025,947 114764 114908 9 LSTM 83
dark 0.51 -b333mf 21,169,819 175,471,417 34,797 x 175,506,214 533 453 1692 BWT
FreeArc 0.40pre-4 -mppmd:1012m:o13:r1 20,931,605 175,254,732 748,202 x 176,002,934 1175 1216 1046 PPM

Thanks for wasting my time. I'll disable Tor for today, but don't talk to me or my wife's son again.
0x0.st/s6yf.xz

The last official release of zpaq was in 2016 whereas the last commit that touched lrzip's copy was from this year. Enjoy your CVEs

The bundled version of zpaq is 5.0, nigger.

Is this what uninformed meme-induced paranoia looks like?

I thought people used tar because it preserved owner and file permissions. There would be no point in tar zipped files otherwise.

Even ZIP can store *nix permissions (but not Windows permissions, curiously). It's really all about the ownership.
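
A minimal Python sketch of where those permissions live, assuming the usual Info-ZIP convention that the upper 16 bits of a member's external attributes hold the Unix st_mode. `add_with_mode` and `stored_mode` are names made up for this sketch. Nothing here records owner or group.

```python
import io
import stat
import zipfile


def add_with_mode(zf: zipfile.ZipFile, name: str, data: bytes, mode: int) -> None:
    """Store data under name with Unix permission bits in the external attributes."""
    info = zipfile.ZipInfo(name)
    info.create_system = 3  # 3 means "made on Unix" in the ZIP format
    info.external_attr = (stat.S_IFREG | mode) << 16  # upper 16 bits: st_mode
    zf.writestr(info, data)


def stored_mode(info: zipfile.ZipInfo) -> int:
    """Permission bits recorded for a member (no uid/gid is kept in this field)."""
    return stat.S_IMODE(info.external_attr >> 16)


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    add_with_mode(zf, "run.sh", b"#!/bin/sh\necho hi\n", 0o755)

with zipfile.ZipFile(buf) as zf:
    print(oct(stored_mode(zf.getinfo("run.sh"))))
```

Whether an extractor actually applies the stored mode depends on the tool; Python's own `ZipFile.extract` does not restore it, as far as I know.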

Please list the CVEs for zpaq's reference implementation.

Oh yeah, there aren't any...

What about xattrs?

Go back to reddit. Tar deals with files so other programs can deal with only compressing a stream instead of rewriting tar every time.

Tar files with individually compressed members -- why isn't this done in practice?
nongnu.org/lzip/tarlz.html implements this but it's the only place I've ever seen something like it mentioned.
Sure you can roll your own but that's annoying and prone to errors.
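
For reference, the roll-your-own variant can be sketched in a few lines of Python: a bare tar whose members are each xz-compressed on their own. This is NOT tarlz's on-disk format, just an illustration of the idea. `pack` and `extract_one` are names made up for this sketch, and it drops the original mode, owner and timestamps for brevity.

```python
import io
import lzma
import tarfile


def pack(files: dict, out) -> None:
    """Write an uncompressed tar to out; each member is compressed separately."""
    with tarfile.open(fileobj=out, mode="w") as tar:
        for name, data in files.items():
            blob = lzma.compress(data)
            info = tarfile.TarInfo(name + ".xz")
            info.size = len(blob)
            tar.addfile(info, io.BytesIO(blob))


def extract_one(archive, name: str) -> bytes:
    """Decompress a single member; other members are skipped, never decompressed."""
    with tarfile.open(fileobj=archive, mode="r:") as tar:
        member = tar.extractfile(name + ".xz")
        return lzma.decompress(member.read())


buf = io.BytesIO()
pack({"big.log": b"spam " * 200_000, "small.txt": b"the 4KB file at the end"}, buf)
buf.seek(0)
print(extract_one(buf, "small.txt"))
```

The costs are the ones mentioned ITT: the ratio gets worse on lots of small files because nothing is shared between members, and standard tooling only sees a pile of .xz files, so the unpacking step has to be scripted.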

I believe this is what zip does.

p7zip is even available on debian-based distros via apt-get.

You're looking at the decompression times, right? Makes sense... Hadn't heard of zstd.

True, as do many other archive formats. Zip doesn't support the super kewl LZMA or LZMA2. The best you can get out of it is bzip2.
I'm talking about just bare tar files with individually compressed members. Something that can be operated on with standard old tooling, but that newer -- or alternative -- tar implementations can deal with (mostly) transparently.

I believe PeaZip has its own implementation of 7zip, but I can't find anything so it most likely uses p7zip, too.

In the readme.txt
source package and please see precompiled program's packages to know what third
parts executables (7z, arc, paq...) are needed by PeaZip.