gzip – Read and write GNU zip files

Purpose: Read and write gzip files.
Python Version: 1.5.2 and later

The gzip module provides a file-like interface to GNU zip files, using zlib to compress and uncompress the data.

Writing Compressed Files

The module-level function open() creates an instance of the file-like class GzipFile, which provides the usual methods for reading and writing data. To write data into a compressed file, open the file with mode 'w' (the examples here use 'wb' to make the binary mode explicit).

import gzip
import os

outfilename = 'example.txt.gz'
output = gzip.open(outfilename, 'wb')
try:
    output.write('Contents of the example file go here.\n')
finally:
    output.close()

print outfilename, 'contains', os.stat(outfilename).st_size, 'bytes of compressed data'
os.system('file -b --mime %s' % outfilename)

$ python gzip_write.py
application/x-gzip
example.txt.gz contains 68 bytes of compressed data
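
The try/finally pattern above can also be written with a with statement. The following is a minimal sketch, assuming Python 3 (where the data written in binary mode must be bytes); the write_gzip() helper and the temporary directory are illustrative names, not part of the original example.

```python
import gzip
import os
import tempfile


def write_gzip(path, payload):
    """Write payload (bytes) to a gzip file and return its size on disk."""
    # The file object returned by gzip.open() is a context manager,
    # so it is closed even if write() raises an exception.
    with gzip.open(path, 'wb') as output:
        output.write(payload)
    return os.stat(path).st_size


# Hypothetical scratch directory so the sketch does not touch real files.
scratch_dir = tempfile.mkdtemp()
outfilename = os.path.join(scratch_dir, 'example.txt.gz')

size = write_gzip(outfilename, b'Contents of the example file go here.\n')
print(outfilename, 'contains', size, 'bytes of compressed data')
```

The exact size reported depends on the header contents (file name and timestamp), so it may not match the 68 bytes shown above.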

Different compression levels can be used by passing a compresslevel argument. Valid values range from 1 to 9, inclusive, and the default is 9. Lower values are faster and result in less compression. Higher values are slower and compress more, up to a point. Newer versions of Python also accept 0, which stores the data without compressing it.

import gzip
import os
import hashlib

def get_hash(data):
    return hashlib.md5(data).hexdigest()

data = open('lorem.txt', 'r').read() * 1024
cksum = get_hash(data)

print 'Level  Size        Checksum'
print '-----  ----------  ---------------------------------'
print 'data   %10d  %s' % (len(data), cksum)

for i in xrange(1, 10):
    filename = 'compress-level-%s.gz' % i
    output = gzip.open(filename, 'wb', compresslevel=i)
    try:
        output.write(data)
    finally:
        output.close()
    size = os.stat(filename).st_size
    cksum = get_hash(open(filename, 'rb').read())
    print '%5d  %10d  %s' % (i, size, cksum)

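The center column of numbers in the output of the script is the size in bytes of the files produced. For this input data, higher compression values do not necessarily pay off in decreased storage space: levels 4 through 9 all produce files of 4160 bytes. The checksums still differ between those files because each gzip header records its own file name and timestamp. Results will vary, depending on the input data.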

$ python gzip_compresslevel.py
Level  Size        Checksum
-----  ----------  ---------------------------------
data       754688  e4c0f9433723971563f08a458715119c
    1        9839  892138dfbf549b01e22e77420bd3e500
    2        8260  7d7557f6fadc0b462c394826dfa69a77
    3        8221  81e74a6d6d942288a66848c4b316e9fc
    4        4160  784ef2f70f26a42fb403cf9fb2aed070
    5        4160  ed455327a153429372b19a0f46bdee85
    6        4160  0af3cb52ce259a209138d03cb3afeeb7
    7        4160  050e78f8158f29c8791875635a6bb5c0
    8        4160  b50d5853e3adcacea128df15824cab39
    9        4160  e759e368e396c1bd02aa048c02835ed9
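
The same comparison can be made without writing any files. This is a rough sketch, assuming Python 3.2 or later, where gzip.compress() is available; the compressed_sizes() helper and the sample data are made up for illustration and stand in for lorem.txt.

```python
import gzip


def compressed_sizes(data, levels=range(0, 10)):
    """Return a dict mapping each compression level to the compressed size."""
    return {
        level: len(gzip.compress(data, compresslevel=level))
        for level in levels
    }


# Highly repetitive sample data, used here in place of lorem.txt.
sample = b'The same line, over and over.\n' * 1024

sizes = compressed_sizes(sample)
print('Level  Size')
print('-----  ----------')
print('data   %10d' % len(sample))
for level in sorted(sizes):
    print('%5d  %10d' % (level, sizes[level]))
```

Level 0 stores the data uncompressed, so its result is slightly larger than the input because of the gzip header and trailer.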

A GzipFile instance also includes a writelines() method that can be used to write a sequence of strings.

import gzip
import itertools
import os

output = gzip.open('example_lines.txt.gz', 'wb')
try:
    output.writelines(itertools.repeat('The same line, over and over.\n', 10))
finally:
    output.close()

os.system('gzcat example_lines.txt.gz')

$ python gzip_writelines.py
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
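
In Python 3, writelines() on a file opened in binary mode expects bytes. One way to keep working with ordinary strings is text mode, sketched below under the assumption of Python 3.3 or later, where gzip.open() accepts 'wt' and 'rt'. The temporary path is a placeholder.

```python
import gzip
import itertools
import os
import tempfile

# Hypothetical scratch location for the example file.
path = os.path.join(tempfile.mkdtemp(), 'example_lines.txt.gz')

# 'wt' adds a text layer on top of the GzipFile, so str lines are
# encoded before they are compressed.
with gzip.open(path, 'wt', encoding='utf-8') as output:
    output.writelines(
        itertools.repeat('The same line, over and over.\n', 10)
    )

# Iterating over the file yields one decoded line at a time.
with gzip.open(path, 'rt', encoding='utf-8') as input_file:
    lines = [line for line in input_file]

print(len(lines), 'lines read back')
```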

Reading Compressed Data

To read data back from previously compressed files, open the file with mode 'r' (written as 'rb' in the examples here).

import gzip

input_file = gzip.open('example.txt.gz', 'rb')
try:
    print input_file.read()
finally:
    input_file.close()

This example reads the file written by gzip_write.py from the previous section.

$ python gzip_read.py
Contents of the example file go here.
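
Calling read() with no arguments loads the entire uncompressed contents into memory. For larger files it may be preferable to read in fixed-size pieces. The sketch below assumes Python 3; iter_chunks(), the chunk size, and the sample file are all illustrative choices rather than anything defined by the gzip module.

```python
import gzip
import os
import tempfile


def iter_chunks(path, chunk_size=4096):
    """Yield the uncompressed contents of a gzip file in pieces."""
    with gzip.open(path, 'rb') as input_file:
        while True:
            chunk = input_file.read(chunk_size)
            if not chunk:
                # An empty result means the end of the data was reached.
                break
            yield chunk


# Build a sample file to read back (hypothetical name and contents).
payload = b'The same line, over and over.\n' * 1000
path = os.path.join(tempfile.mkdtemp(), 'chunked.txt.gz')
with gzip.open(path, 'wb') as output:
    output.write(payload)

chunks = list(iter_chunks(path, chunk_size=1024))
total = sum(len(chunk) for chunk in chunks)
print(total, 'bytes read in', len(chunks), 'chunks')
```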

While reading a file, it is also possible to seek and read only part of the data.

import gzip

input_file = gzip.open('example.txt.gz', 'rb')
try:
    print 'Entire file:'
    all_data = input_file.read()
    print all_data
    
    expected = all_data[5:15]
    
    # rewind to beginning
    input_file.seek(0)
    
    # move ahead 5 bytes
    input_file.seek(5)
    print 'Starting at position 5 for 10 bytes:'
    partial = input_file.read(10)
    print partial
    
    print
    print expected == partial
finally:
    input_file.close()

The seek() position is relative to the uncompressed data, so the caller does not even need to know that the data file is compressed.

$ python gzip_seek.py
Entire file:
Contents of the example file go here.

Starting at position 5 for 10 bytes:
nts of the

True
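
The same behavior can be checked in memory, along with tell(), which also reports positions in terms of the uncompressed data. This is a small sketch assuming Python 3.2 or later (for gzip.compress() and io.BytesIO); the payload simply repeats the text used in the earlier example.

```python
import gzip
import io

payload = b'Contents of the example file go here.\n'
compressed = gzip.compress(payload)

with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode='rb') as input_file:
    # Offsets refer to the uncompressed stream, not the compressed bytes.
    input_file.seek(5)
    partial = input_file.read(10)
    position = input_file.tell()

print(partial)
print('Position after read:', position)
print(partial == payload[5:15])
```

Seeking backward in a file opened for reading requires the module to start over and decompress from the beginning, so it can be slow on large files.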

Working with Streams

It is possible to use the GzipFile class directly to compress or uncompress a data stream, instead of an entire file. This is useful for working with data being transmitted over a socket or from an existing (open) file handle. A StringIO buffer can also be used.

import gzip
from cStringIO import StringIO
import binascii

uncompressed_data = 'The same line, over and over.\n' * 10
print 'UNCOMPRESSED:', len(uncompressed_data)
print uncompressed_data

buf = StringIO()
f = gzip.GzipFile(mode='wb', fileobj=buf)
try:
    f.write(uncompressed_data)
finally:
    f.close()

compressed_data = buf.getvalue()
print 'COMPRESSED:', len(compressed_data)
print binascii.hexlify(compressed_data)

inbuffer = StringIO(compressed_data)
f = gzip.GzipFile(mode='rb', fileobj=inbuffer)
try:
    reread_data = f.read(len(uncompressed_data))
finally:
    f.close()

print
print 'RE-READ:', len(reread_data)
print reread_data

Note

When re-reading the previously compressed data, I pass an explicit length to read(). Leaving the length off resulted in a CRC error, possibly because StringIO returned an empty string before reporting EOF. If you are working with streams of compressed data, you may want to prefix the data with an integer representing the actual amount of data to be read.

$ python gzip_StringIO.py
UNCOMPRESSED: 300
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.

COMPRESSED: 51
1f8b0800aab4934b02ff0bc94855284ecc4d55c8c9cc4bd551c82f4b2d5248cc4b0133f4b8424665916401d3e717802c010000

RE-READ: 300
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
The same line, over and over.
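
For comparison, here is a sketch of the same round trip assuming Python 3, where io.BytesIO takes the place of cStringIO and the data must be bytes. In this version read() is called without an explicit length; that works on current Python 3 releases, though it has not been checked against the older interpreter described in the note above.

```python
import gzip
import io

uncompressed_data = b'The same line, over and over.\n' * 10
print('UNCOMPRESSED:', len(uncompressed_data))

# Compress into an in-memory buffer. Closing the GzipFile flushes the
# gzip trailer but leaves the underlying buffer open.
buf = io.BytesIO()
with gzip.GzipFile(mode='wb', fileobj=buf) as f:
    f.write(uncompressed_data)

compressed_data = buf.getvalue()
print('COMPRESSED:', len(compressed_data))

# Decompress from a second in-memory buffer.
with gzip.GzipFile(mode='rb', fileobj=io.BytesIO(compressed_data)) as f:
    reread_data = f.read()

print('RE-READ:', len(reread_data))
print(reread_data == uncompressed_data)
```

When the complete compressed data is already in memory, gzip.compress() and gzip.decompress() (Python 3.2 and later) are shorter alternatives to managing the buffers by hand.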

See also

gzip
The standard library documentation for this module.
zlib
The zlib module is a lower-level interface to gzip compression.
zipfile
The zipfile module gives access to ZIP archives.
bz2
The bz2 module uses the bzip2 compression format.
tarfile
The tarfile module includes built-in support for reading compressed tar archives.