arrow-user mailing list archives

From: Wes McKinney <wesmck...@gmail.com>
Subject: Re: [pyarrow] How can one handle parquet encoding memory bombs
Date: Wed, 29 Jan 2020 16:56:21 GMT
hi Rollo,

Two quick points:

* There is already a C++ API providing chunk-based reads
("GetRecordBatchReader"). It is not yet exposed in Python in
pyarrow.parquet simply because no one has volunteered to do that work
yet; a rough workaround sketch using the Python API that exists today
follows below
* Several people are actively working on the "Datasets API" project in
C++ (with bindings in Python and R), which will provide a holistic
approach to chunked (and therefore more memory-constrained) dataset
reading across arbitrary file formats (not just Parquet). This is what
I see as the long-term solution to the problem you're describing. See
[1] and the illustrative sketch after the link
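
As a stopgap with the Python API that exists today, you can read a file one
row group at a time through pyarrow.parquet.ParquetFile. The sketch below is
only a rough illustration (the filename is a placeholder), and it only helps
if the file was written with more than one row group; a single huge row group
still decodes in full:

import pyarrow.parquet as pq

# Minimal sketch: 'data.parquet' is a placeholder filename. Each
# read_row_group() call decodes only that row group, so peak memory is
# bounded by the largest row group rather than by the whole file.
pf = pq.ParquetFile('data.parquet')

for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i)
    # process `chunk` here, then drop the reference so the decoded
    # memory can be released before the next iteration
    print(i, chunk.num_rows)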

We'd welcome your contributions to this work. If you have the
financial resources to sponsor developers to increase efforts on this,
I'd be happy to speak with you about that offline.

- Wes

[1]: https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit
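
For a sense of where [1] is heading, batch-at-a-time reading through the
Datasets API looks roughly like the sketch below. The pyarrow.dataset module
and the ds.dataset() / to_batches() names are taken from pyarrow releases
newer than this thread and are shown only as an illustration; the directory
path is a placeholder:

import pyarrow.dataset as ds

# Illustration only: pyarrow.dataset and these names come from later
# pyarrow releases; 'some_directory/' is a placeholder path.
dataset = ds.dataset('some_directory/', format='parquet')

# to_batches() yields RecordBatch objects incrementally instead of
# materializing the whole dataset as one Table, keeping memory bounded.
for batch in dataset.to_batches():
    print(batch.num_rows)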

On Wed, Jan 29, 2020 at 9:45 AM Rollo Konig-Brock <rollokb@gmail.com> wrote:
>
> Dear Arrow Developers,
>
> I’m having memory issues with certain Parquet files. Parquet uses run length encoding
> for certain columns, meaning that an array of int64s could take up a couple thousand
> bytes on disk, but then a few hundred megabytes when loaded into a pyarrow.Table. I’m
> a little surprised there’s no option to keep the underlying Arrow array as encoded data.
>
> An example of this with a file I’ve created (attached), which is just 1,000,000
> repetitions of a single int64 value:
>
> import psutil
> import os
> import gc
> import pyarrow.parquet
>
> suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
> current_thread = psutil.Process(os.getpid())
>
> def human_memory_size(nbytes):
>     i = 0
>     while nbytes >= 1024 and i < len(suffixes)-1:
>         nbytes /= 1024.
>         i += 1
>     f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
>     return '%s %s' % (f, suffixes[i])
>
> def log_memory_usage(msg):
>     print(msg, human_memory_size(current_thread.memory_info().rss))
>
> log_memory_usage('Initial Memory usage')
> print('Size of parquet file', human_memory_size(os.stat('rle_bomb.parquet').st_size))
>
> pf = pyarrow.parquet.ParquetFile('rle_bomb.parquet')
> table = pf.read()
>
> log_memory_usage('Loaded memory usage')
>
> This will produce the following output:
>
> Initial Memory usage 27.11 MB
> Size of parquet file 3.62 KB
> Loaded schema 27.71 MB
> Loaded memory usage 997.9 MB
>
> This poses a bit of a problem, particularly when running this code on servers, as there
> doesn’t seem to be a way of preventing a memory explosion with the PyArrow API. I’m at
> a bit of a loss as to how to control for this: there does not seem to be a method to do
> something like iterating over the Parquet columns in set chunks (where the size could be
> calculated accurately).
>
> All the best,
>
> Rollo Konig-Brock
