Python: Decompressing large files using the bz2 library



Using the bz2 library to decompress large files


In the introductory article (here), we saw basic use of the bz2 library for data (de)compression. However, the data is often very large (~500 MB - 1 GB) and cannot be loaded into RAM at once; attempting to do so may crash the program or exhaust system memory.

For this reason, the bz2 library provides the BZ2Compressor() and BZ2Decompressor() classes for handling large data files incrementally.

Let's look at the methods of the BZ2Decompressor() class now:
class bz2.BZ2Decompressor()

Creating an instance of this class:
>>> import bz2
>>> obj = bz2.BZ2Decompressor()
>>> obj
<_bz2.BZ2Decompressor object at 0x00000238485F2DF0>
Now, obj can be used to decompress data incrementally. We can also choose the maximum number of bytes to decompress on each call.

Individual methods and attributes of BZ2Decompressor:


decompress(data, max_length=-1)

Takes compressed bytes and returns the decompressed data, also as bytes. max_length limits the number of decompressed bytes returned per call.

As we did in the case of incremental compression(here), the outputs have to be concatenated to get the final data back.
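
As a minimal sketch of this, feeding a compressed payload to decompress() in small pieces and concatenating the partial outputs recovers the original data (the sample payload and the 64-byte feed size below are arbitrary choices for illustration):

```python
import bz2

# create some sample compressed data to work with
original = b'incremental decompression example ' * 100
compressed = bz2.compress(original)

decomp = bz2.BZ2Decompressor()
result = b''
# feed the compressed bytes in small pieces and
# concatenate each partial output
for i in range(0, len(compressed), 64):
    result += decomp.decompress(compressed[i:i + 64])

# the concatenated output equals the original data
assert result == original
```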

eof

True if the end-of-stream marker has been reached.


unused_data

Data found after the end of the compressed stream.

If this attribute is accessed before the end of the stream has been reached, its value will be b''.

needs_input

False if the decompress() method can provide more decompressed data before requiring new uncompressed input.


Various cases arise depending on the data size and the max_length parameter:

  1. The decompressed output fits within max_length (or max_length == -1): all the data is decompressed and needs_input is set to True.
  2. The decompressed output exceeds max_length: only max_length bytes are returned and needs_input is set to False. To get the data left in the internal buffer, pass b'' in the next call to decompress().
  3. The end-of-stream marker is found: data up to the end of the stream is returned and any remaining bytes are stored in unused_data.
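
Case 2 can be sketched as follows: after limiting the output with max_length, the internal buffer is drained by passing b'' on subsequent calls (the sample payload and the 100-byte limit are arbitrary choices for illustration):

```python
import bz2

original = b'abcdefgh' * 1000
compressed = bz2.compress(original)

decomp = bz2.BZ2Decompressor()
# feed everything, but ask for at most 100 bytes back
part = decomp.decompress(compressed, max_length=100)

# more output is buffered internally, so needs_input is False;
# pass b'' to drain the buffer, 100 bytes at a time
rest = b''
while not decomp.needs_input and not decomp.eof:
    rest += decomp.decompress(b'', max_length=100)

assert part + rest == original
```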

Caution: Sometimes, the compressed data may have been created incrementally and contain multiple end-of-stream markers (several concatenated streams). In that case, we need to detect when eof becomes True and create a new decompressor object to continue. Any bytes found after the end-of-stream marker (available in unused_data) must be passed to the new object along with the remaining compressed data.
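
A minimal demonstration of this multi-stream case, with two streams concatenated in memory (the payloads are made up for illustration):

```python
import bz2

# two independently compressed streams concatenated together
stream = bz2.compress(b'first stream ') + bz2.compress(b'second stream')

decomp = bz2.BZ2Decompressor()
out = decomp.decompress(stream)
# the first stream's end-of-stream marker has been reached
assert decomp.eof

# bytes belonging to the second stream end up in unused_data
# and must be fed to a fresh decompressor
leftover = decomp.unused_data
decomp = bz2.BZ2Decompressor()
out += decomp.decompress(leftover)

assert out == b'first stream second stream'
```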

Putting it all together to decompress a large .bz2 compressed file:


import sys
import bz2

class DecompressData:
    def __init__(self, size):
        self.decomp = bz2.BZ2Decompressor()
        self.chunk_size = size

    def decompress(self, iterable, output):
        for chunk in iterable:
            data = chunk
            # a single chunk may span several compressed streams
            while data:
                output.write(self.decomp.decompress(data))
                if not self.decomp.eof:
                    break
                # end of one stream reached: keep the leftover bytes
                # and start a new decompressor for the next stream
                data = self.decomp.unused_data
                self.reset()

    def reset(self):
        self.decomp = bz2.BZ2Decompressor()

def main():
    # suppose sample.bz2 is a large file and
    # cannot be decompressed in one go
    file_path = 'sample.bz2'

    # size in bytes to read per chunk
    size = 1024

    obj = DecompressData(size)

    # here, we are writing the decompressed output to the terminal
    with open(file_path, 'rb') as f:
        iterable = iter(lambda: f.read(size), b'')
        obj.decompress(iterable, sys.stdout.buffer)

if __name__ == '__main__':
    main()


Thank you!

