Just recently I needed to somehow generate a .tar.gz from a .raw ext4 image and, surprisingly, there's still no better option than actually mounting it and then creating an archive.
I managed to "isolate" it a bit with guestfish's tar-out, but still it's pretty slow as it needs to seek around the image (in my case over NBD) to get the actual files.
It lets you mount .tar files as a read only filesystem.
It’s cool because you basically get random access to the tarball without paying any decompression costs. (It builds an index saying exactly where so-and-so is for every file.)
SquashFS or cramps or such have less tooling, which makes the usage for generating, inspecting, ... more complex.
Otherwise, you can randomly access any file in a .tar as long as: - the file is seekable/range-addressible - you scan through it and build the file index first, either at runtime or in advance.
Uncompressed .tar is a reasonable choice for this application because the tools to read/write tar files are very standard, the file format is simple and well-documented, and it incurs no computational overhead.
Yes, uncompressed tar (with transfer compression, which is offered in HTTP) is an option for some amount of data.
Till the point where it isn't. zip has similar benefits as tar(+transfer compression) but a later point where it fails for such a scenario.
I don’t see how that plus a small index of offsets would be notably more or less work to do from using a zip file.
I had need to embed noVNC into an app recently in Golang. Serving files via net/http from the zip file is practically a one-liner (then just a Gorilla websocket to take the place of websockify).
The problem they’re solving is literally right there in the article you didn’t read.
It uses IndexedDB for the filesystem.
Rather Dumbly it is loading the files from a tar archive that is encoded into a PNG because tar files are one of the forbidden file formats.
The gzip-random-access problem one is a lot more difficult because the gzip has internal state. But in any case, solutions exist! Apparently the internal state is only 32kB, so if you save this at 1MB offsets, you can reduce the amount of data you need to decompress for one file access to a constant. https://github.com/mxmlnkn/ratarmount does this, apparently using https://github.com/pauldmccarthy/indexed_gzip internally. zlib even has an example of this method in its own source tree: https://github.com/gcc-mirror/gcc/blob/master/zlib/examples/...
All depends on the use case of course. Seems like the author here has a pretty specific one - though I still don't see what the advantage of this is vs extracting in JS and adding all files individually to memfs. "Without any copying" doesn't really make sense because the only difference is copying ONE 1MB tar blob into a Uint8Array vs 1000 1kB file blobs
One very valid constraint the author makes is not being able to touch the source file. If you can do that, there's of course a thousand better solutions to all this - like using zip, which compresses each file individually and always has a central index at the end.
Exactly. And often this state is either highly compressible or non-compressible but only sparsely used. The latter can then be made compressible by replacing the unused bytes with zeros.
Ratarmount uses indexed_gzip, and when parallelization makes sense, it also uses rapidgzip. Rapidgzip implements the sparsity analysis to increase compressibility and then simply uses the gztool index format, i.e., compresses each 32 KiB using gzip itself, with unused bytes replaced with zeros where possible.
indexed_gzip, gztool, and rapidgzip all support seeking in gzip streams, but all have some trade-offs, e.g., rapidgzip is parallelized but will have much higher memory usage because of that than indexed_gzip or gztool. It might be possible to compile either of these to WebAssembly if there is demand.
> Apparently the internal state is only 32kB, so if you save this at 1MB offsets, you can reduce the amount of data you need to decompress for one file access to a constant.
You may need to revisit the definition of a constant. A 1/32 additional data is small but it still grows the more data you’re trying to process (we call that O(n) not O(1)). Specifically it’s 3% and so you generally want to target 1% for this kind of stuff (once every 3mib)
And the process still has to read though the enter gzip once to build that index
1. Load the file index - this one scales with the number of files unless you do something else smart and get it down to O(log(n)). This gives you an offset into the file. *That same offset/32 is an offset into your gzip index.*
2. take that offset, load 32kB into memory (constant - does not change by number of files, total size, or anything else apart from the actual file you are looking at)
3. decompress a 1MB chunk (or more if necessary)
So yes, it's a constant.
This is how seeking can work in encrypted data btw without the ancillary data - you just increment the IV every N bytes so there’s a guaranteed mapping for how to derive the IV for a block so you’re bounded by how much extra you need to encrypt/decrypt to do a random byte range access of the plaintext.
But none of this is unique to gzip. You can always do this for any compression algorithm provided you can snapshot the state - the state at a seek point is always going to be fairly small.