arrow-user mailing list archives

From Rollo Konig-Brock <roll...@gmail.com>
Subject [pyarrow] How can one handle parquet encoding memory bombs
Date Wed, 29 Jan 2020 15:45:41 GMT
Dear Arrow Developers,

I’m having memory issues with certain Parquet files. Parquet uses run-length encoding for
certain columns, which means an array of int64s can take up a couple of thousand bytes on
disk but expand to a few hundred megabytes when loaded into a pyarrow.Table. I’m a little
surprised there’s no option to keep the underlying Arrow array as encoded data.

Here is an example with a file I’ve created (attached), which is just 1,000,000 repetitions
of a single int64 value:

import psutil
import os
import gc
import pyarrow.parquet

suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
current_thread = psutil.Process(os.getpid())

# Convert a raw byte count into a human-readable string.
def human_memory_size(nbytes):
    i = 0
    while nbytes >= 1024 and i < len(suffixes)-1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])


def log_memory_usage(msg):
    print(msg, human_memory_size(current_thread.memory_info().rss))

log_memory_usage('Initial Memory usage')

print('Size of parquet file', human_memory_size(os.stat('rle_bomb.parquet').st_size))

# Opening the file only reads the footer metadata and schema.
pf = pyarrow.parquet.ParquetFile('rle_bomb.parquet')

log_memory_usage('Loaded schema')

# Reading the file decodes every value into plain Arrow arrays.
table = pf.read()

log_memory_usage('Loaded memory usage')

This will produce the following output:

Initial Memory usage 27.11 MB
Size of parquet file 3.62 KB
Loaded schema 27.71 MB
Loaded memory usage 997.9 MB

This memory explosion is a real problem when running this code on servers, as there doesn’t
seem to be any way of preventing it through the PyArrow API. I’m at a bit of a loss as to how
to control for this: there doesn’t appear to be a method to iterate over the Parquet data in
set chunks whose decoded size could be calculated accurately up front.
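
The closest I have come up with is reading the file one row group at a time and guessing the
decoded footprint from the metadata, but that is only a partial workaround: the byte widths
below are my own assumption for plain fixed-width physical types, and it only helps if the
writer actually produced more than one row group, which a file like the attached one may not
have. A rough sketch of what I mean:

import pyarrow.parquet

# Bytes per value for fixed-width Parquet physical types -- my own assumption;
# BYTE_ARRAY columns cannot be bounded from the metadata alone.
FIXED_WIDTHS = {
    'BOOLEAN': 1, 'INT32': 4, 'FLOAT': 4, 'INT64': 8, 'DOUBLE': 8, 'INT96': 12,
}

pf = pyarrow.parquet.ParquetFile('rle_bomb.parquet')

for i in range(pf.num_row_groups):
    rg = pf.metadata.row_group(i)
    estimate = 0
    for j in range(rg.num_columns):
        width = FIXED_WIDTHS.get(rg.column(j).physical_type)
        if width is None:
            estimate = None  # variable-width column, no reliable bound
            break
        estimate += rg.num_rows * width
    print('row group', i, 'estimated decoded bytes', estimate)
    # On a server I would want to check this estimate against a memory budget
    # here before decoding anything.
    batch = pf.read_row_group(i)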

All the best,
Rollo Konig-Brock


