arrow-user mailing list archives

From Antoine Pitrou <anto...@python.org>
Subject Re: [C++] - Squeeze more out of parquet write(table) operation.
Date Sat, 27 Mar 2021 09:12:27 GMT
On Fri, 26 Mar 2021 18:47:26 -1000
Weston Pace <weston.pace@gmail.com> wrote:
> I'm fairly certain there is room for improvement in the C++
> implementation for writing single files to ADLFS.  Others can correct
> me if I'm wrong, but we don't do any kind of pipelined writes.  I'd
> guess this is partly because there isn't much benefit when writing to
> local disk (writes are typically synchronous) but also because it's
> much easier to write multiple files.

Writes should be asynchronous most of the time.  I don't know anything
about ADLFS, though.

Regards

Antoine.
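
For reference, a minimal sketch of the single-file write path under discussion, assuming `adlfs_stream` stands in for whatever arrow::io::OutputStream the custom ADLFS writer exposes, and that the 64 MiB buffer size and row-group chunk size are placeholder values. Note that parquet::arrow::WriteTable encodes and compresses columns synchronously on the calling thread, so a buffered sink smooths out the remote appends but does not overlap them with encoding.

    // Sketch only: coalesce Parquet's many small page writes into a few
    // large appends before they reach the remote store.  `adlfs_stream`
    // is a stand-in for the ADLFS output stream described in this thread.
    #include <arrow/api.h>
    #include <arrow/io/buffered.h>
    #include <parquet/arrow/writer.h>
    #include <parquet/properties.h>

    arrow::Status WriteSingleFile(const std::shared_ptr<arrow::Table>& table,
                                  std::shared_ptr<arrow::io::OutputStream> adlfs_stream) {
      // 64 MiB staging buffer (assumed value); encoded pages land here first.
      ARROW_ASSIGN_OR_RAISE(
          auto sink, arrow::io::BufferedOutputStream::Create(
                         64 * 1024 * 1024, arrow::default_memory_pool(), adlfs_stream));

      auto props = parquet::WriterProperties::Builder()
                       .compression(parquet::Compression::SNAPPY)
                       ->build();

      // Encoding/compression happens synchronously inside WriteTable; only the
      // resulting bytes pass through the buffered sink to ADLFS.
      ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
          *table, arrow::default_memory_pool(), sink,
          /*chunk_size=*/1 << 20, props));

      return sink->Close();  // flush remaining buffered bytes, then close the raw stream
    }
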


> 
> Is writing multiple files a choice for you?  I would guess using a
> dataset write with multiple files would be significantly more
> efficient than one large single file write on ADLFS.
> 
> -Weston
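
A rough sketch of the multi-file dataset write suggested above, assuming an Arrow C++ build with the datasets module enabled; `fs`, the base_dir, and the basename template are placeholders for an arrow::fs::FileSystem bound to the ADLS account and the target location.

    // Sketch only: write one in-memory table out as a multi-file Parquet dataset.
    // `fs` and "container/output" are assumptions; substitute your ADLS-backed
    // arrow::fs::FileSystem and target path.
    #include <arrow/api.h>
    #include <arrow/dataset/api.h>
    #include <arrow/filesystem/api.h>

    namespace ds = arrow::dataset;

    arrow::Status WriteAsDataset(const std::shared_ptr<arrow::Table>& table,
                                 std::shared_ptr<arrow::fs::FileSystem> fs) {
      // Scan the in-memory table; UseThreads lets fragments be processed
      // (and written) in parallel.
      auto dataset = std::make_shared<ds::InMemoryDataset>(table);
      ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
      ARROW_RETURN_NOT_OK(scanner_builder->UseThreads(true));
      ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

      auto format = std::make_shared<ds::ParquetFileFormat>();

      ds::FileSystemDatasetWriteOptions write_options;
      write_options.file_write_options = format->DefaultWriteOptions();
      write_options.filesystem = fs;
      write_options.base_dir = "container/output";          // assumed target path
      write_options.basename_template = "part-{i}.parquet"; // one file per fragment
      // No partition columns: all files land directly under base_dir.
      write_options.partitioning =
          std::make_shared<ds::DirectoryPartitioning>(arrow::schema({}));

      return ds::FileSystemDataset::Write(write_options, scanner);
    }

With one file per fragment, encoding and the remote appends for different files can proceed on different threads, which is where the expected advantage over a single large file comes from.
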
> 
> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <yeshsriram@icloud.com> wrote:
> >
> > Hello,
> >
> > Thank you again for the earlier help on improving overall ADLFS read latency
> > using multiple threads, which has worked out really well.
> >
> > I’ve incorporated buffering in the adls/writer implementation (up to 64 MB).
> > What I’m noticing is that the parquet_writer->WriteTable(table) latency dominates
> > everything else in the output phase of the job (~65 sec vs. ~1.2 min). I could use
> > multiple threads (like io/s3fs), but I’m not sure whether that would have any
> > effect on the Parquet WriteTable operation.
> >
> > Question: Is there anything else I can leverage inside the parquet/writer
> > subsystem to improve the core Parquet WriteTable latency?
> >
> >
> > schema:
> >   map<key,array<struct<…>>>
> >   struct<...>
> >   map<key,map<key,map<key, struct<…>>>>
> >   struct<…>
> >   binary
> > num_row_groups: 6
> > num_rows_per_row_group: ~8 million
> > write buffer size: 64 * 1024 * 1024 (~64 MB)
> > write compression: snappy
> > total write latency per row group: ~1.2 min
> >   adls append/flush latency (minor factor)
> > Azure: ESv3 / RAM: 256 GB / Cores: 8
> >
> > Yesh  
> 
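
As a general illustration of the knobs the Parquet writer itself exposes for the question above, the sketch below shows commonly tuned parquet::WriterProperties settings; the specific values are placeholders, not recommendations from this thread, and whether any of them (row-group length, page size, dictionary encoding, statistics) actually helps for this nested schema would have to be measured.

    // Sketch of commonly tuned parquet::WriterProperties knobs; the values
    // below are placeholders, not advice from this thread.
    #include <memory>
    #include <parquet/properties.h>

    std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
      parquet::WriterProperties::Builder builder;
      builder.compression(parquet::Compression::SNAPPY);  // matches the setup above
      builder.max_row_group_length(1 * 1024 * 1024);       // rows per row group (placeholder)
      builder.data_pagesize(1 * 1024 * 1024);              // target data page size in bytes (placeholder)
      builder.disable_dictionary();                        // dictionary encoding can be costly for
                                                           // high-cardinality nested keys (assumption)
      builder.disable_statistics();                        // skip per-page statistics if unused downstream
      return builder.build();
    }
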



