Python Compressing large size files using bz2

Using the bz2 library to compress large files

In the introductory article (here), we saw the basic use of the bz2 library for data (de)compression. However, the data is often very large (~500 MB - 1 GB), and loading all of it into RAM at once may not be practical. Attempting to do so can exhaust the available memory and cause the program to fail or the system to become unresponsive.

For this reason, the bz2 library provides the BZ2Compressor() and BZ2Decompressor() classes, which process large data incrementally, one chunk at a time.

Let's look at the BZ2Compressor class first:
class bz2.BZ2Compressor(compresslevel=9)
Instantiating this class returns a compressor object with a few methods.
compresslevel controls how strongly the data is compressed. It must be an integer between 1 and 9; the default, 9, gives the highest compression.
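
As a rough illustration of what compresslevel changes, the sketch below compresses the same bytes at levels 1 and 9 and prints the size of each result. The helper name compressed_size and the sample data are made up for this example; the actual sizes depend on the input, so no particular numbers are claimed here.

```python
import bz2


def compressed_size(data: bytes, compresslevel: int = 9) -> int:
    """Return how many bytes a bz2 stream of `data` takes at the given level."""
    comp = bz2.BZ2Compressor(compresslevel)
    out = comp.compress(data) + comp.flush()
    return len(out)


# Hypothetical sample: roughly 2 MB of repetitive text
sample = b"The quick brown fox jumps over the lazy dog.\n" * 45000

for level in (1, 9):
    print(level, compressed_size(sample, level))
```

Lower levels use smaller internal blocks and need less memory, so it can be worth measuring a few levels on representative data before settling on one.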

Individual methods in BZ2Compressor:


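compress(data) method

Provides data to the compressor object.
Returns - a chunk of compressed data as a byte string if one is available, or an empty byte string if the input has only been buffered internally so far. An empty result is therefore not an error.
>>> import bz2
>>> obj = bz2.BZ2Compressor()
>>> res = obj.compress(b"Hi, there!")
Once we are done supplying data, we must call flush() on the compressor to finish the compression process and retrieve the data still held in its internal buffer, much like closing a file.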
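
The interplay between compress() and flush() can be sketched as follows. The helper compress_chunks is a name chosen for this example and is not part of bz2; it simply joins whatever each compress() call returns with the final flush() result to form one complete stream.

```python
import bz2


def compress_chunks(chunks):
    """Compress an iterable of bytes objects into a single bz2 stream."""
    comp = bz2.BZ2Compressor()
    parts = []
    for chunk in chunks:
        # compress() may return b"" while the data is still buffered internally
        parts.append(comp.compress(chunk))
    # flush() returns whatever is still held in the internal buffer
    parts.append(comp.flush())
    return b"".join(parts)


chunks = [b"Hi, ", b"there! ", b"How are you?"]
stream = compress_chunks(chunks)
print(bz2.decompress(stream))  # b'Hi, there! How are you?'
```

Decompressing the joined result gives back the concatenation of the input chunks, which shows that the pieces only form a valid stream once the flush() output is appended.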

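flush() method

Finishes the compression process and returns the compressed data left in the internal buffer.
Caution: After flush() has been called, the object cannot be used for further compression; calling compress() again raises an error. In the session below, a fresh compressor is flushed right away, so the returned bytes are just an empty bz2 stream:
>>> obj = bz2.BZ2Compressor()
>>> obj.flush()
b'BZh9\x17rE8P\x90\x00\x00\x00\x00'
>>> obj.compress(b"Hi, there!")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Compressor has been flushed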
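
Because a flushed compressor cannot be reused, one simple way to produce several independent streams is to create a new object for each of them. The sketch below assumes that approach; compress_each and is_reusable_after_flush are hypothetical helper names introduced only for this example.

```python
import bz2


def compress_each(payloads):
    """Compress every payload as its own bz2 stream."""
    streams = []
    for payload in payloads:
        # fresh object per stream: a flushed compressor cannot be reused
        comp = bz2.BZ2Compressor()
        streams.append(comp.compress(payload) + comp.flush())
    return streams


def is_reusable_after_flush():
    """Return False if compress() raises ValueError after flush()."""
    comp = bz2.BZ2Compressor()
    comp.compress(b"first stream")
    comp.flush()
    try:
        comp.compress(b"second stream")
    except ValueError:
        return False
    return True


streams = compress_each([b"alpha", b"beta"])
print([bz2.decompress(s) for s in streams])  # [b'alpha', b'beta']
print(is_reusable_after_flush())  # False
```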


Putting it all together to incrementally compress a large .txt file: 


import bz2


# wrapper class that compresses strings incrementally
class compress_data:

    def __init__(self, compress_level=9):
        self.comp = bz2.BZ2Compressor(compress_level)

    def compress_chunk(self, chunk):
        # the compressor works on bytes, so encode the string first
        chunk = bytes(chunk, 'utf-8')
        return self.comp.compress(chunk)

    def close(self):
        # finish the stream and return the data left in the buffer
        return self.comp.flush()


def main():
    # suppose sample.txt is a large file and
    # cannot be compressed in one go
    file_path = 'sample.txt'

    # to store compressed output
    res = b''

    # instantiate a compressor object
    obj = compress_data()

    # iterate over the file object so that only one line is held
    # in memory at a time (readlines() would load all of them)
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            res = res + obj.compress_chunk(line)
        res = res + obj.close()

    print(res)


if __name__ == '__main__':
    main()
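
The example above keeps the entire compressed result in the res variable, which is fine for a demonstration but can itself become large. A possible variation, sketched below, reads the input in fixed-size binary chunks and writes each compressed piece straight to an output file. The function name compress_file, the 1 MiB chunk size, and the output path used in the note that follows are assumptions made for this sketch.

```python
import bz2

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def compress_file(src_path, dst_path, chunk_size=CHUNK_SIZE, compresslevel=9):
    """Stream src_path into a bz2-compressed dst_path chunk by chunk."""
    comp = bz2.BZ2Compressor(compresslevel)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                # an empty read means the end of the input file
                break
            dst.write(comp.compress(chunk))
        # write the data still held in the compressor's buffer
        dst.write(comp.flush())
```

Calling compress_file('sample.txt', 'sample.txt.bz2') would then produce a compressed copy of the file while holding only about one chunk of input in memory at a time. Reading in binary mode also avoids having to decode and re-encode the text.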



In the next article, we will cover the BZ2Decompressor() class.


