arrow-user mailing list archives

From jonathan mercier <>
Subject Learning pyarrow and optimize row groups size
Date Wed, 18 Mar 2020 10:30:41 GMT

I am learning the pyarrow API and Arrow technology, so I would first like
to thank you for your work.

From my understanding, pyarrow.Array and pyarrow.RecordBatch are
write-once (immutable) structures: we cannot append data to them.
1/ Is that correct?

I wrote a little script to write data into a Parquet file. The data is a
2D list (a list of rows, each of which is a list of column values, e.g.
[['a','b','c'], ['d','e','f']]).
Script is here:

To achieve this, I kept all intermediate pyarrow structures in memory
in order to create a table (schema and list of pyarrow

2/ Is it possible to achieve the same thing with a stream, so as not to
waste memory and to be able to handle terabytes of data?

I read these interesting articles:

which recommend large row groups (512 MB - 1 GB).
3/ How can I manage row groups so that each one is approximately 1 GB in size?

4/ When using pyarrow, should data ultimately be stored on disk as a Parquet
file, or does pyarrow provide its own generic file format as a common data layer?

Thanks a lot for your help and your work on Arrow.

Best regards

