Using the bz2 library to compress large files
In the introductory article, we saw the basic use of the bz2 library for data (de)compression. Often, however, the data is too large (roughly 500 MB to 1 GB or more) to be loaded into RAM at once. Attempting to load it in one go may exhaust memory and crash the program or even the system.
For such cases, the bz2 library provides the BZ2Compressor and BZ2Decompressor classes, which process data incrementally, one chunk at a time.
Let's look at the methods of the BZ2Compressor class first:
class bz2.BZ2Compressor(compresslevel=9)
Instantiating this class returns a compressor object that exposes a few methods.
compresslevel controls how hard the data is compressed. It must be an integer from 1 to 9: 1 gives the least compression and 9 (the default) the most.
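To see the effect of compresslevel, here is a minimal sketch that compresses the same data at levels 1 and 9. The sample data and the helper name compress_at_level are made up for illustration; the actual sizes depend on the input, so none are claimed here.

```python
import bz2

# Hypothetical sample data: repetitive text compresses well.
data = b"bz2 works well on repetitive text. " * 2000


def compress_at_level(payload, level):
    """Compress payload as one complete bz2 stream at the given level."""
    comp = bz2.BZ2Compressor(level)
    return comp.compress(payload) + comp.flush()


fast = compress_at_level(data, 1)
best = compress_at_level(data, 9)

# Print the sizes so the two levels can be compared for this input.
print(len(data), len(fast), len(best))
```

Both results are valid bz2 streams and decompress back to the same data; only the effort spent on compression differs.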
Individual methods in BZ2Compressor:
compress(data) method
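Feeds data (a bytes object) to the compressor object.
Returns - a chunk of compressed data if one is ready, or an empty byte string otherwise. The compressor buffers its input internally, so an empty result is normal and does not indicate an error.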
>>> import bz2
>>> obj = bz2.BZ2Compressor()
>>> res = obj.compress(b"Hi, there!")
Once all the data has been fed in, we must call flush() on the compressor to finish the stream and retrieve whatever is still held in its internal buffer, much like closing a file.
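As a rough sketch of this compress-then-flush pattern: the chunks below are made-up sample data, and the lengths are printed rather than assumed, since how much each compress() call returns depends on the compressor's internal buffering.

```python
import bz2

comp = bz2.BZ2Compressor()

# Hypothetical sample chunks standing in for pieces of a larger input.
chunks = [b"Hi, there! ", b"How are you? ", b"Goodbye."]

parts = []
for chunk in chunks:
    out = comp.compress(chunk)
    # out may be empty while the compressor is still buffering input
    print(len(chunk), "bytes in,", len(out), "bytes out")
    parts.append(out)

# flush() finishes the stream and returns the remaining compressed data
parts.append(comp.flush())
compressed = b"".join(parts)
```

Only the concatenation of every compress() result plus the flush() result forms a complete stream that bz2.decompress() can read back.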
flush() method
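Finishes the compression process and returns the compressed data left in the internal buffers.
Caution: Once flushed, a compressor object cannot be used for further compression. Calling compress() on it again raises a ValueError: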
>>> obj = bz2.BZ2Compressor()  # fresh compressor, no data fed yet
>>> obj.flush()
b'BZh9\x17rE8P\x90\x00\x00\x00\x00'
>>> obj.compress(b"Hi, there!")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Compressor has been flushed
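Since a flushed compressor is finished, each new stream needs a new object. A minimal sketch of handling this, with made-up payloads and an illustrative reused flag:

```python
import bz2

comp = bz2.BZ2Compressor()
first_stream = comp.compress(b"first payload") + comp.flush()

try:
    # comp was flushed above, so this call is expected to fail
    comp.compress(b"second payload")
    reused = True
except ValueError:
    # start over with a fresh compressor for the next stream
    reused = False
    comp = bz2.BZ2Compressor()

second_stream = comp.compress(b"second payload") + comp.flush()
```

Each of the two results is an independent, complete bz2 stream.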
Putting it all together to incrementally compress a large .txt file:
import bz2

# wrapper class to compress text incrementally
class CompressData:
    def __init__(self, compress_level=9):
        self.comp = bz2.BZ2Compressor(compress_level)

    def compress_chunk(self, chunk):
        # the compressor only accepts bytes, so encode the text first
        return self.comp.compress(chunk.encode('utf-8'))

    def close(self):
        # finish the stream and return any data still buffered
        return self.comp.flush()

def main():
    # suppose sample.txt is a large file and
    # cannot be compressed in one go
    file_path = 'sample.txt'
    # to store compressed output
    res = b''
    # instantiate a compressor object
    obj = CompressData()
    # iterate over the file lazily, one line at a time;
    # f.readlines() would load the whole file into memory
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            res = res + obj.compress_chunk(line)
    # the with block has already closed the file
    res = res + obj.close()
    print(res)

if __name__ == '__main__':
    main()
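The example above still collects all the compressed output in res. For very large inputs it may be preferable to write each compressed chunk straight to an output file. The following is one possible sketch, not the only way to do it; the function name compress_file, its parameters, and the 1 MB default chunk size are assumptions chosen for illustration.

```python
import bz2


def compress_file(src_path, dst_path, chunk_size=1024 * 1024, compresslevel=9):
    """Compress src_path into dst_path, reading chunk_size bytes at a time."""
    comp = bz2.BZ2Compressor(compresslevel)
    # binary mode: the compressor works on bytes, so no encoding step is needed
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            dst.write(comp.compress(chunk))
        # write whatever is still buffered and finish the stream
        dst.write(comp.flush())
```

A call such as compress_file('sample.txt', 'sample.txt.bz2') would then hold roughly one chunk of input in memory at a time, along with the compressor's own internal buffers.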
In the next article, we will cover the BZ2Decompressor class.