Compression

Do people actually still use compression tools? It seems like they kinda went out of fashion.

Also

Attached: packed.png (247x78, 2.94K)

Other urls found in this thread:

nongnu.org/lzip/xz_inadequate.html
ijdc.net/article/view/151/224

feels bad man

Every archiver from CompactPro to 7-Zip could do that, it's just slow, since it has to decompress and recompress basically the entire archive. Unless you mean the eunuchs style of compressing a TAR file, which is retarded.

inb4 tar isn't portable, use the POSIX format

Nope, it doesn't work. Try it yourself.
Make a solid 7z archive of a file, then open the archive in the 7z GUI and drop in a copy of the same original file with a different name. It will double the size of the archive.
And even considering that, I'm not convinced it isn't possible to do it without decompressing and recompressing everything, at least if you're willing to lose some compression ratio. After all, compression is based on a dictionary; if you can keep the same dictionary used for the original files, you should be able to compress just the later data without recompressing the original data.
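For what it's worth, here's a rough sketch of that dictionary idea using zlib's preset-dictionary API in Python. It's only an illustration of the concept, not how 7z/LZMA actually lay out an archive, and the data is made up:

import os, zlib

# Pretend this blob is a file already stored in the archive.
original = os.urandom(16 * 1024)
duplicate = original                          # same content, different file name

# Prime the compressor with the already-stored bytes as a preset dictionary,
# then compress only the newly added data against it.
comp = zlib.compressobj(zdict=original)
extra = comp.compress(duplicate) + comp.flush()

# The decoder has to be primed with the exact same dictionary.
decomp = zlib.decompressobj(zdict=original)
assert decomp.decompress(extra) == duplicate
print(len(duplicate), "->", len(extra))       # 16384 -> well under 1 KB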

I use winrar for that too, but only for encryption. Shit is basically an open standard and can be opened by any archiving program, which makes it comfy without compromising security.

aes256 + 60 char password. Doesn't matter if it's PGP or WinRAR.

Attached: Cszs-dhWAAA16kL.jpg (675x800, 199.38K)

Normies are too dumb for compressors, and with the bandwidth people have available slowly rising, the balance shifted: it became more important to make tards able to handle the files than to let them download faster.

I don't think you understand what a solid archive is

They have not. Windows is still in the stone age in terms of applied compression usage. Zstd and LZ4 are pretty exciting these days and you still can't beat XZ.

Seriously guys, add something to NTFS or your new ReFS.


Ark on Linux is my comfy choice. I wish winrar would at least make their rar format visible-source, as it's god tier in terms of ability to resist bit destruction of the underlying files thanks to its added redundancy flags. No DAR+PAR2 needed. I don't like using a filesystem / compression algo unless the source is out somewhere...
But no, they're anal about their century old compression algo getting loose.

So fuck 'em I guess.

Attached: 1546575233155.jpg (640x774, 54.19K)

Sure I understand. You turn everything into a byte stream and run a compression algorithm over it. But I'm not convinced it isn't possible to append data at the end of the compressed stream and still get some compression.
After all, compression is based on a dictionary; if you can keep the same dictionary used for the original byte stream, you should be able to compress just the additional data using the dictionary built for the original archive, without recompressing the original data.
Now, if you modified the original dictionary, then yes, you're going to have to recompress everything, unless you just append more entries at the end of the existing dictionary.
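Some formats do let you append without touching the old bytes, just without sharing the dictionary between old and new data. A gzip file, for example, can be a series of concatenated members, and decompressors output their concatenation. Quick sketch in Python, with made-up data:

import gzip, io

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"original data\n")
frozen = buf.getvalue()                       # these bytes never get rewritten

# Later: append a brand new member at the end; the old data stays untouched.
with gzip.GzipFile(fileobj=buf, mode="ab") as f:
    f.write(b"appended data\n")

assert buf.getvalue().startswith(frozen)
assert gzip.decompress(buf.getvalue()) == b"original data\nappended data\n"

The catch is exactly what you'd expect: the new member starts with an empty window, so you get append-without-recompress but no cross-references into the old data.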

nongnu.org/lzip/xz_inadequate.html
TL;DR: it's amateur software, and you'll be bitten in the ass eventually.

Christ.

Maybe I should just write my own graphical and/or terminal addon for Ark to utilize PAR2 with whatever you want to throw at it.

Attached: 1546503782278.png (577x382, 271.99K)

It literally is not possible unless you have a compression format with some very specific and unusual properties.
In most formats, adding data does not translate into appending bytes at the end of the compressed file, as the new data will almost certainly change length headers, CRCs, and other redundant encoding.
The dictionary not changing can easily mean the end of the old data concatenating with the start of the new data to create bogus results.
Imagine if "PINGAS" decompresses to the navy seal copypasta in an archive ending with "PIN", and then you attach data starting with "GAS".
The only way to fix that is to decompress and recompress everything, at least for normal formats.
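You can see the "appending raw bytes does nothing useful" part with zlib directly: a finished stream has an explicit end marker, and whatever you tack on after it just dangles there (toy data below):

import zlib

stream = zlib.compress(b"PIN")
d = zlib.decompressobj()
out = d.decompress(stream + b"GAS")   # raw bytes appended after the end-of-stream marker

print(out)             # b'PIN' -- decoding stops at the original stream's end
print(d.unused_data)   # b'GAS' -- never interpreted as compressed symbols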

I don't think you understand what is going on in the software, or the design decisions necessary.
If you expect putting in an identical file under a different name not to take up double the space, then the software would need to check every file in the archive against every other file in the archive for this potential saving. The number of calculations goes up dramatically, and so does the time you have to wait.
You would have to separate the filenames from the content and hash/checksum the content separately.
When a user creates an archive it is normal to assume the user would like the archive created as quickly as possible.

True, for most formats you will have to re-read the past data to recalculate the checksum. But CPU- and memory-wise that's going to be less expensive than recompressing everything. And considering most compressors can only work single-threaded, this is likely to speed things up a lot.
Or, if there's a format which only checksums small blocks, you can just fix the last block.
Not really, I think. Compression formats which work like that must have a way to specify how long the sequence of bytes for each dictionary reference is, and whether something is a literal or a dictionary reference; otherwise, how could you decode anything at all? So the decompressor will know the symbol in the compressed stream ends at "N", and that "G" is the beginning of a new symbol. Otherwise you couldn't compress anything even in normal operation, since at any point in the process you wouldn't know whether to stop and treat the past n bytes as a reference or keep appending bytes to the reference.
But yes, it is true that to re-compress an archive you would have to re-create the dictionary first by reading from the existing archive, but considering dictionaries are generally at most a few gigs so as to fit in memory, for large archives that will be a small portion of the overall compression time.
Think about it. If you take the time to prime the encoder as it would originally be at the end of the original archive, and begin processing and appending more data, it's just as if the original byte stream never ended in the first place.
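On the checksum half of that: CRC-32 at least can be extended from a saved running value without re-reading the old data, assuming the archiver kept that value around (whether a given format actually exposes it is another question). Quick zlib illustration:

import zlib

old_crc = zlib.crc32(b"data already in the archive")
# Extend the checksum with only the appended bytes, no re-read of the old data.
new_crc = zlib.crc32(b"appended data", old_crc)
assert new_crc == zlib.crc32(b"data already in the archive" + b"appended data")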


t. webdev

Nope. If the goal was to create the "archive" as quickly as possible then he wouldn't use compression at all. He would just use a regular folder, or if the goal was to have everything in a single file as quickly as possible, he would use the "Store" (i.e. zero compression) method in the archiver.
When using compression, time is obviously not the main goal.
When you compress two copies of the same file together in either winrar or 7zip in one go and enable "solid archive", the result is an archive taking less space than one individual copy of the file. The problem here is adding files to that archive. Deduplicating data when making an archive in one go is a solved problem.
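Easy to check that solid-archive effect from a Python prompt, e.g. with plain LZMA (not 7z/winrar themselves) and a random, incompressible blob; the second copy ends up as basically one long back-reference:

import os, lzma

blob = os.urandom(1 << 20)                    # 1 MiB of random data
one = len(lzma.compress(blob))                # roughly the size of the blob itself
two = len(lzma.compress(blob + blob))         # "solid": both copies in one stream
print(one, two)                               # the second copy adds only a tiny fraction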

Yes. I use transparent compression on pretty much every drive I own. I have compressed almost 3x as much data as all of my drives combined could hold on their own. So I'd say it's still very much fashionable. For everyone who isn't a poo in the loo, that is.

Well it is.
If you tried to compress 3 text files and got a dialog saying "ETA 15 hours", you'd quickly find that statement is incorrect.

No shit Sherlock, that's what I was complaining about in the first place. Unless you're saying it's inherently impossible, in which case, prove it.

I use SHADOW compression which can bring dozens of terabytes down to a few KB.

gnu tar will include facebook's zstandard in future releases, what do you think about it?

I use fusecompress for email.

If CPU and memory are a concern, why use solid archives?
More importantly, why use solid archives if you expect to modify them later?
Nope, there are way better methods to do it, but they are not append-friendly as they reserve unused sequences as keys for the dictionary.
It's essentially a very fancy weighted encoding.

t.DRM fag
So you blame the fact that you don't know how to use 7zip on its design? Interesting.
PeaZip is also a thing btw.

7-Zip masterrace fagit!

Best way to archive in 2019:

Attached: smugpepe.gif (499x499, 44.58K)

Because solid archives compress duplicate files adequately, which could be cool for instance for backups.
And not having to recompress not only saves you CPU and memory, but also saves you from needing double the disk size of the uncompressed data and possibly getting data corruption from moving around a lot of data instead of it remaining statically on the disk.
What do you mean "unused sequences"? You mean sequences which are not used as literals? Again, ANY compression format must have a way to determine the length of each sequence as being either a key or a literal, since otherwise any given cut in the compressed stream could be a literal or a dictionary key.

7zip doesn't have any redundancy functionality, nor does it have any way to automatically repair the archives.
RAR can repair an archive after both the beginning and end headers have been nuked, and if you add 5% data redundancy it can resist making multiple holes filled with gibberish across the archive, and the data decompresses just fine. With 7z if you damage any part of the archive, good luck getting those files back... RAR also supports pre-set profiles from the file manager's context menu which 7z does not.

You can download a license file to get rid of the begging.

That doesn't save you from errors which might have happened to the data from the moment it's saved to the disk as raw data, fetched, stored in memory, processed and then saved as compressed data.
I think the only way to prevent bit rot is to have ECC memory and either software RAID 5 or a filesystem with error correction like ZFS. Software RAID because I believe it's easier for the data to get corrupted traveling from the CPU to the disk controller through the PCI interface than just being processed locally on the CPU.
Then yeah, you can confidently store your shit with parity data and be reasonably sure that it's going to be decompressed alright. Just make sure not to use PAR1, because it's not very resilient to corruption in the redundancy data.

Backup files are not supposed to be modified, if you are doing incremental backups you should not compress those.
Those are unreasonable concerns: you should never have that little free space, and you should absolutely not use an archive format that cannot notice and correct an error when processing an archive, nor a system without some built-in ECC (SATA has it, so you're probably set).
Sure, but that mostly happens in an implicit way, such as in the "expand these sequences as soon as you meet them" encoding, compared to the very simple and explicit "0 byte as end word" variable-length encoding.
Unsurprisingly, most implicit methods (and a few explicit ones) are not append-friendly.

Why not? It'd be basically like ZFS or btrfs deduplication, except more effective (because it can handle arbitrary offsets) and simpler to use, with less of the complexity that's required for an actual filesystem. Not to mention it'd also work with disk images.
All current incremental schemes are either wasteful (copy the same file again even for 1 changed bit, or break the deduplication even for 1 additional byte in the middle of a file which offsets all the following data) or clunky, buggy pieces of shit.
Why not?
Besides, if we're talking about incremental backups with almost identical copies of the same files, the raw data could easily be 10 times the size of the compressed archive or more. One possible solution would be to decompress to a pipe and then compress again (I think currently GUI programs don't do that though) but you'd still need twice the space.
How would an archiver be able to detect bogus data being received by the disk or coming from the SATA controller?
Sure, SATA has some error correction built into it, but in real life there is silent data corruption, orders of magnitude more than would be expected just from CRC collisions.
>The CERN study used a program that wrote large files into CERN’s various data stores, which represent a broad range of state-of-the-art enterprise storage systems (mostly RAID arrays), and checked them over a period of six months. A total of about 9.7 × 10^16 bytes was written and about 1.92 × 10^8 bytes was found to have suffered silent corruption, of which about 2/3 was persistent; re-reading did not return good data. ijdc.net/article/view/151/224
Other than doing the same operation twice and comparing hashes, the only way to be more or less safe is to use RAID or ZFS with ECC ram. Not even parchive can solve the issue of data being silently damaged while traveling from the disk to the CPU.
Not being friendly is not the same as being impossible.
The only issue I can see is if you would have to modify the data (some kind of checksum or folder structure I guess) that's stored in the compressed stream, in that case yeah, it probably would be impossible.
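The "decompress to a pipe and then compress again" part mentioned above is at least easy to stream, so the raw data never has to exist on disk in full. Rough sketch with made-up file names; this only covers the streaming half, not splicing the new files in:

import lzma, shutil

with lzma.open("old_backup.xz", "rb") as src, lzma.open("new_backup.xz", "wb") as dst:
    # Decompress from the old archive and recompress into the new one in chunks.
    shutil.copyfileobj(src, dst)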

Backups are a last defense against data loss.
Incremental backups are backups focused on preventing data loss in a short term: they should only include a very small portion of your files (the ones you work with in that short term) and they should be fast to create and fast to roll back to.
Compressing incremental backups makes them slower to create/verify/roll back, adds a small risk of data loss, and only saves you a bit of dirt cheap storage space: at that point, you may as well skip incremental backups completely.
Because storage is dirt cheap, while data loss is not.
Then use some version control software to deduplicate and backup those files there.
Didn't mean that, I meant that the SATA controller is going to notice most read errors and either autocorrect them or retry the read, and then your archiver is going to notice most errors in the archive itself via CRCs and other verifications.
I have to doubt the stats in the CERN study, as they imply roughly one permanent write corruption per 10^9 bytes over 6 months.
That means I should see bogus hashes on many (a quick calc says 60%+) 1GB+ files within 6 months of them being written to disk, and my use case includes having a lot of large files that are validated via QuickSFV and stay around for several months.
Yet, I have seen but a single of those files fail hash verification in several years, on a system without ECC nor RAID and in general very far from the state of the art.
More importantly, no hash mismatches on 10+ GB files, which should have a 99.99+% chance of corruption.
Let's not even consider system images and backups themselves, even a simple windows install .iso should go bad more than half of the time.
I guess there was some issue at CERN, such as a disk sector going bad, because the 1 in 10^9 over 6 months figure is completely and utterly unbelievable.
In the way I used it, it's close enough.
It means those systems cannot guarantee that appending data will produce valid data as a result, let alone guarantee it will be the data you want.
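Back-of-envelope check of that estimate, assuming corruptions hit bytes independently (Poisson-style) and using the numbers from the CERN quote:

import math

# ~1.92e8 corrupted bytes out of ~9.7e16 written, ~2/3 persistent, over six months.
rate = (2 / 3) * 1.92e8 / 9.7e16              # persistent corruptions per byte written
for gb in (1, 10):
    n = gb * 10**9
    p = 1 - math.exp(-rate * n)               # chance a file of n bytes is hit at least once
    print(f"{gb} GB file: {p:.4%}")
# Roughly 73% for 1 GB and 99.9998% for 10 GB, in line with the figures above.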

A lot of networking protocols use it under the hood.

A lot of package management systems use it under the hood.

Self-extracting installers use it under the hood.

All the videos, audio and pictures you see on the internet use it under the hood.

They aren't out of fashion at all. What is out of fashion is asking the user to do a separate manual step to decompress files when the step can be automated.

OP is a faggot

i just use tar if i need to put multiple files in an archive.. not for compression really but it's easier and faster to move multiple small files when they're in one archive.

Attached: ClipboardImage.png (480x360, 104.31K)

what fucking problem are you trying to solve? why do I care whether 4KB of my source code disappears vs the entire thing? i use backups for that. and any other type of data, like movies or music, already turns to shit (e.g. you lose an entire region covered by an i-frame and now you have youtube-quality shit) the moment a single bit is lost

so how do I save my boobie vids without the nipples getting all corrupted and shit

Dunno what CERN did in that one infamous study, but bitrot in modern hardware is not really common at all and it's much more likely that your storage hardware simply fails. There's such a heavy amount of CRC and ECC stuff going on in even the cheapest modern drives (which all use the same groups of controllers, firmware implementations etc. anyway, so no real chance to save money by fucking things up there anymore) that the chance you silently get presented with a corrupted dataset is very, very low. Much more likely that the drive reports a read failure because data failed checksumming. This is a problem imagined by software guys who don't know how the hardware side works. More likely are transmission errors because of dodgy NICs (*cough*realtek*cough*) and defective RAM, but no filesystem will protect you from that.

...

What's the alternative except for Intel?

many should exist. just look at the linux kernel driver list