arrow-user mailing list archives

From: Micah Kornfield <emkornfi...@gmail.com>
Subject: Re: [C++] - Squeeze more out of parquet write(table) operation.
Date: Sat, 27 Mar 2021 04:57:00 GMT
I think there is probably room for improvement in the entire
pipeline. Doing some more in-depth profiling might inform which areas to
target for optimization and/or parallelization, but I don't have any
particular user-configurable options to suggest.  For the schema in
question, some of the comments about future improvements for def/rep
level generation [1] might apply.

-Micah

[1]
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/path_internal.cc#L20
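
As a starting point for that profiling, one way to separate encoding and
compression cost from filesystem latency is to time the same WriteTable
call against an in-memory sink. A minimal sketch, for illustration only
(the `table` and `row_group_size` values are assumed to come from the
surrounding application):

#include <chrono>
#include <iostream>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

// Time WriteTable against an in-memory sink to isolate encoding and
// compression cost from ADLS latency. `table` and `row_group_size`
// are assumed to exist in the caller's context.
arrow::Status TimeEncodeOnly(const std::shared_ptr<arrow::Table>& table,
                             int64_t row_group_size) {
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
  const auto start = std::chrono::steady_clock::now();
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), sink, row_group_size));
  const auto secs = std::chrono::duration_cast<std::chrono::seconds>(
      std::chrono::steady_clock::now() - start);
  std::cout << "encode-only WriteTable: " << secs.count() << "s\n";
  return sink->Close();
}

If the in-memory time comes out close to the ~65 sec reported below, the
bottleneck is encoding rather than the ADLS stream itself.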

On Fri, Mar 26, 2021 at 9:47 PM Weston Pace <weston.pace@gmail.com> wrote:

> I'm fairly certain there is room for improvement in the C++
> implementation for writing single files to ADLFS.  Others can correct
> me if I'm wrong, but we don't do any kind of pipelined writes.  I'd
> guess this is partly because there isn't much benefit when writing to
> local disk (writes are typically synchronous), but also because it's
> much easier to write multiple files.
>
> Is writing multiple files an option for you?  I would guess that a
> dataset write producing multiple files would be significantly more
> efficient than one large single-file write on ADLFS; a rough sketch of
> such a write follows below.
>
> -Weston
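
What such a multi-file dataset write might look like in Arrow C++, as a
minimal sketch (API names per the datasets module in recent Arrow
versions; `table`, `fs`, and the output directory are assumptions, not
anything from this thread):

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;

// Write one in-memory table as a multi-file Parquet dataset under
// `base_dir`. `fs` could be any arrow::fs::FileSystem implementation.
arrow::Status WriteAsDataset(std::shared_ptr<arrow::Table> table,
                             std::shared_ptr<arrow::fs::FileSystem> fs,
                             const std::string& base_dir) {
  // Expose the table through the dataset scan interface.
  auto dataset = std::make_shared<ds::InMemoryDataset>(std::move(table));
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

  auto format = std::make_shared<ds::ParquetFileFormat>();

  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = std::move(fs);
  write_options.base_dir = base_dir;
  write_options.partitioning = ds::Partitioning::Default();
  write_options.basename_template = "part-{i}.parquet";
  return ds::FileSystemDataset::Write(write_options, scanner);
}

Each output file can then be encoded independently, which is where any
parallelism over a single WriteTable call would have to come from.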
>
> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <yeshsriram@icloud.com>
> wrote:
> >
> > Hello,
> >
> > Thank you again for the earlier help on improving overall ADLFS read
> > latency using multiple threads, which has worked out really well.
> >
> > I’ve incorporated buffering in the ADLS writer implementation (up to
> > 64 MB). What I’m noticing is that the parquet_writer->WriteTable(table)
> > latency dominates everything else in the output phase of the job
> > (~65 sec vs. ~1.2 min). I could use multiple threads (like io/s3fs),
> > but I’m not sure that would have any effect on the Parquet WriteTable
> > operation.
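
For reference, a minimal sketch of what that buffered write path might
look like (the 64 MB figure matches the buffer size above; `table` and
the ADLS-backed `raw_stream` are assumptions, and the writer properties
simply make the Snappy setting explicit):

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

// Wrap the ADLS output stream in Arrow's BufferedOutputStream so that
// many small Parquet writes coalesce into 64 MB appends, then write the
// table with Snappy compression. `raw_stream` stands in for the real
// ADLS-backed arrow::io::OutputStream.
arrow::Status WriteBuffered(
    const std::shared_ptr<arrow::Table>& table,
    std::shared_ptr<arrow::io::OutputStream> raw_stream) {
  ARROW_ASSIGN_OR_RAISE(
      auto buffered,
      arrow::io::BufferedOutputStream::Create(
          64 * 1024 * 1024, arrow::default_memory_pool(),
          std::move(raw_stream)));
  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder()
          .compression(parquet::Compression::SNAPPY)
          ->build();
  // chunk_size controls rows per row group (~8 million in this job).
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), buffered,
      /*chunk_size=*/8 * 1000 * 1000, props));
  return buffered->Close();  // flushes the remaining buffer
}

The main user-visible knobs live on parquet::WriterProperties (codec,
dictionary encoding) plus the chunk_size passed to WriteTable.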
> >
> > Question: Is there anything else I can leverage inside the Parquet
> > writer subsystem to improve the core WriteTable latency?
> >
> >
> > schema:
> >   map<key, array<struct<…>>>
> >   struct<…>
> >   map<key, map<key, map<key, struct<…>>>>
> >   struct<…>
> >   binary
> > num_row_groups: 6
> > num_rows_per_row_group: ~8 million
> > write buffer size: 64 * 1024 * 1024 (64 MB)
> > write compression: Snappy
> > total write latency per row group: ~1.2 min
> >   adls append/flush latency (a minor factor)
> > Azure VM: ESv3, RAM: 256 GB, cores: 8
> >
> > Yesh
>
