Seeking / streaming from compressed files


Hi everybody, I hope you can point me in the right direction with the following problem/question: I am currently redesigning my resource loading architecture, and I would like to be able to read assets from multiple types of storage (a simple directory on an SSD, a compressed file containing multiple assets located in different directories within the compressed file, etc.).

To be able to stream assets into my engine I need some kind of seek(..) and read(position, length) functionality. This is of course trivially easy with standard C++ streams reading from the OS filesystem, but I am now searching for a solution for reading compressed files with the same seek(..) and read(position, length) functionality, without decompressing the whole file into memory (which defeats the purpose of streaming from storage). Any hints on how I can achieve this, or pointers to libraries that offer such functionality, would be greatly appreciated!

PS: I found vfspp (https://github.com/yevgeniy-logachev/vfspp), but unfortunately it seems to decompress the complete file into memory before allowing seeking and reading.

EDIT: clarified what vfspp is doing


I'm 99% certain that this can't be done, at least not natively, without prior decompression. When you compress an archive that contains multiple files, everything is usually compressed together as one stream, so there is no "file A starts at X and file B starts at Y" anymore. You could create a file format where each file is compressed individually and the compressed files are appended one after another, with a small dictionary in a header recording where each file starts; then you can look files up via that dictionary. But this of course reduces the compression ratio you get.
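For illustration, a minimal C++ sketch of such a header dictionary might look like the following (all struct names and fields are made up for the example):

#include <cstdint>
#include <string>
#include <vector>

struct TocEntry {
    std::string path;             // virtual path of the asset inside the archive
    uint64_t    compressedOffset; // where this asset's compressed bytes start
    uint64_t    compressedSize;   // how many bytes to read from disk
    uint64_t    uncompressedSize; // size after decompression
};

struct ArchiveToc {
    std::vector<TocEntry> entries;

    // Linear lookup for brevity; a real implementation would likely
    // use a hash map keyed by path.
    const TocEntry* find(const std::string& path) const {
        for (const TocEntry& e : entries)
            if (e.path == path) return &e;
        return nullptr;
    }
};

To load one asset, you would seek to compressedOffset, read compressedSize bytes, and decompress just that asset, leaving the rest of the archive untouched.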

Thanks a lot Juliean for your answer. One follow-up question: with libraries like minizip (https://github.com/zlib-ng/minizip-ng) it is possible to decompress specific files from a zip archive containing multiple files (this is how the vfspp library I mentioned above allows "seeking" in files). So, in my naive understanding: if it is possible to decompress one specific file, shouldn't it also be possible to decompress only part of that file (perhaps by decompressing the file up to the point I am interested in)?

This can be done by compressing the data in blocks of a fixed size (e.g. 32 KB), writing those compressed blocks in order into the file, and then writing a table at the end of the file that maps uncompressed file offsets to compressed block locations. To seek somewhere in the middle of the file, you first load that table into memory, use it to figure out which block contains the target offset and where that block starts, and then decompress the block up to the desired offset within its uncompressed data.
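As a rough sketch of the read path, assuming zlib's one-shot uncompress() and a table storing the file offset of each compressed block (the names and layout are illustrative, not a fixed format):

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <vector>
#include <zlib.h>

// Uncompressed size of each block; the writer compresses the data in
// chunks of this size and records where each compressed chunk landed.
constexpr uint64_t kBlockSize = 32 * 1024;

struct BlockTable {
    // blockOffsets[i] is the file offset of compressed block i. One
    // extra entry points just past the last block, so the compressed
    // size of block i is blockOffsets[i + 1] - blockOffsets[i].
    std::vector<uint64_t> blockOffsets;
};

// Read `length` uncompressed bytes starting at uncompressed offset `pos`.
bool readAt(std::ifstream& file, const BlockTable& table,
            uint64_t pos, char* out, uint64_t length)
{
    std::vector<char> compressed;
    std::vector<char> block(kBlockSize);

    while (length > 0) {
        const uint64_t blockIndex  = pos / kBlockSize;
        const uint64_t withinBlock = pos % kBlockSize;
        if (blockIndex + 1 >= table.blockOffsets.size())
            return false; // read past the end of the stream

        // Fetch the compressed bytes of this one block only.
        const uint64_t begin = table.blockOffsets[blockIndex];
        const uint64_t end   = table.blockOffsets[blockIndex + 1];
        compressed.resize(end - begin);
        file.seekg(static_cast<std::streamoff>(begin));
        if (!file.read(compressed.data(), static_cast<std::streamsize>(compressed.size())))
            return false;

        // Decompress the whole block, then copy out the slice we need.
        uLongf destLen = kBlockSize;
        if (uncompress(reinterpret_cast<Bytef*>(block.data()), &destLen,
                       reinterpret_cast<const Bytef*>(compressed.data()),
                       static_cast<uLong>(compressed.size())) != Z_OK)
            return false;
        if (withinBlock >= destLen)
            return false; // offset lies beyond the end of the data

        const uint64_t n = std::min<uint64_t>(length, destLen - withinBlock);
        std::memcpy(out, block.data() + withinBlock, n);
        out += n; pos += n; length -= n;
    }
    return true;
}

A real implementation would also cache the most recently decompressed block, since sequential reads would otherwise decompress the same block over and over.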

Aressera said:

This can be done by compressing the data in blocks of a fixed size (e.g. 32 KB), writing those compressed blocks in order into the file, and then writing a table at the end of the file that maps uncompressed file offsets to compressed block locations. To seek somewhere in the middle of the file, you first load that table into memory, use it to figure out which block contains the target offset and where that block starts, and then decompress the block up to the desired offset within its uncompressed data.

Thanks! The downside of this solution is lower total compression efficiency, correct?

dosmarder said:
The downside of this solution is lower total compression efficiency, correct?

Yes, it's a tradeoff between a bigger block size (better compression, but slower seeking) and a smaller block size (worse compression, but faster seeking). 32 KB is a good starting point, since that matches the size of zlib's deflate window, so the compression ratio will not suffer very much.

Aressera said:

Yes, it's a tradeoff between a bigger block size (better compression, but slower seeking) and a smaller block size (worse compression, but faster seeking). 32 KB is a good starting point, since that matches the size of zlib's deflate window, so the compression ratio will not suffer very much.

Thanks a lot, I will look into that a little more. One last question: do you know how "bigger games" (in terms of development effort and budget) solve the problem of streaming assets from disk while simultaneously keeping them compressed? Is it the approach you outlined here?

dosmarder said:
do you know how "bigger games" (in terms of development effort and budget) solve the problem of streaming assets from disk while simultaneously keeping them compressed? Is it the approach you outlined here?

I have never worked on any big games, but I imagine they either do something like I described above (whole-file block compression) or compress only some parts of the data (where the biggest gains are). For instance, in my pack format I compress certain assets with zlib (e.g. meshes), while other assets like audio and images use a special-purpose format (Ogg Vorbis, FLAC, PNG, or DDS, for example); the special-purpose algorithms usually beat general-purpose compression. Other small assets (entities/scenes/components) are not compressed at all; I could add compression for those data types later if they become too large. This keeps loading and seeking faster than applying compression to the whole archive. Games also often use a less efficient but faster compression format (e.g. LZ4) rather than zlib/deflate.
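As an illustration of that per-asset choice, a pack entry might simply carry a codec tag that the loader dispatches on (the enum values and struct layout below are made up for the example, not taken from any particular engine):

#include <cstdint>

enum class Codec : uint8_t {
    None,      // small entities/scenes/components: stored raw
    Zlib,      // general data such as meshes
    OggVorbis, // audio keeps its special-purpose format
    Dds,       // GPU-ready textures
    Lz4,       // fast path for load-time-critical data
};

struct PackEntry {
    uint64_t offset;           // where the payload starts in the pack
    uint64_t storedSize;       // bytes on disk
    uint64_t uncompressedSize; // bytes after decoding (== storedSize for None)
    Codec    codec;            // how to decode the payload
};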

Juliean said:

I'm 99% certain that this can't be done.

Sure it can. See https://libzip.org/documentation/libzip.html

There are several libraries for reading and writing ZIP files. Reading is cheap. You can look up a file within the ZIP file by pathname, then start reading it without decompressing the whole archive. For this application, read-only mode is all you need.
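For example, streaming a single entry out of an archive with libzip might look roughly like this (the archive and entry names are placeholders, and error handling is trimmed for brevity):

#include <vector>
#include <zip.h>

int main() {
    int err = 0;
    zip_t* archive = zip_open("assets.zip", ZIP_RDONLY, &err);
    if (!archive) return 1;

    // Open a single entry by its path inside the archive.
    zip_file_t* entry = zip_fopen(archive, "textures/grass.png", 0);
    if (!entry) { zip_close(archive); return 1; }

    // Stream the entry in fixed-size chunks; only this entry is
    // decompressed, and only as far as we read.
    std::vector<char> buffer(64 * 1024);
    zip_int64_t n;
    while ((n = zip_fread(entry, buffer.data(), buffer.size())) > 0) {
        // ... hand buffer[0..n) to the asset loader ...
    }

    zip_fclose(entry);
    zip_close(archive);
    return 0;
}

Note that within a deflated entry libzip still decompresses sequentially, so random access inside one compressed file means reading forward to the target offset, as discussed earlier in the thread.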

Aressera said:

dosmarder said:
do you know how "bigger games" (in terms of development effort and budget) solve the problem of streaming assets from disk while simultaneously keeping them compressed? Is it the approach you outlined here?

I have never worked on any big games, but I imagine they either do something like I described above (whole-file block compression) or compress only some parts of the data (where the biggest gains are). For instance, in my pack format I compress certain assets with zlib (e.g. meshes), while other assets like audio and images use a special-purpose format (Ogg Vorbis, FLAC, PNG, or DDS, for example); the special-purpose algorithms usually beat general-purpose compression. Other small assets (entities/scenes/components) are not compressed at all; I could add compression for those data types later if they become too large. This keeps loading and seeking faster than applying compression to the whole archive. Games also often use a less efficient but faster compression format (e.g. LZ4) rather than zlib/deflate.

Actually, games care more about decompression speed than about size, or even the time it takes to compress things (that is all offline time anyway). Most games nowadays use Oodle (http://www.radgametools.com/oodle.htm), although I have also seen zlib used directly in a game I worked on. Most file formats in games are set up so that you can more or less do a memcpy from the file buffer to your object, with some minor fixing up of things in memory; file I/O needs to be fast and avoid unnecessary operations for these types of games (parsing is very much something you want to avoid).
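As a hypothetical illustration of that memcpy-and-fix-up style (not any particular engine's format):

#include <cstdint>

// On disk, `vertices` and `indices` hold offsets relative to the start
// of the blob; after loading they are patched into real pointers. This
// relies on the file and the runtime agreeing on field widths and
// endianness, which real engines pin down explicitly.
struct MeshRecord {
    uint32_t  vertexCount;
    uint32_t  indexCount;
    float*    vertices; // offset on disk -> pointer after fix-up
    uint32_t* indices;  // offset on disk -> pointer after fix-up
};

// Patch one record in place. `blob` is the file buffer that was read
// (or memory-mapped) in a single operation; nothing is parsed per field.
inline void fixUp(MeshRecord& m, char* blob) {
    m.vertices = reinterpret_cast<float*>(blob + reinterpret_cast<uintptr_t>(m.vertices));
    m.indices  = reinterpret_cast<uint32_t*>(blob + reinterpret_cast<uintptr_t>(m.indices));
}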

DirectStorage and NVMe drives might change all this in the future, because it's much harder to get all the bandwidth out of these drives with the current techniques. The NVMe protocol really likes lots and lots of simultaneous requests, something HDDs and optical media really didn't like. So we might see the packaging of game data into large archives disappear, in favor of just throwing more requests at the hardware to sustain the bandwidth to RAM.

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

