arrow-user mailing list archives

From Joris Van den Bossche <jorisvandenboss...@gmail.com>
Subject Re: How to specify number of partitions?
Date Thu, 09 Jul 2020 07:49:45 GMT
Hi Yash,

Currently, there is the `parquet.write_to_dataset` function for
something like that, but it requires you to specify a column by which to
split the single pyarrow Table.
I don't think we have an automatic function that just splits one table
into regular chunks and writes them to multiple files in a single
directory (you could slice the table in a loop and write each subset with
`write_table`).

You can also control the row group size (partitioning within a single
Parquet file), using the `row_group_size` argument of `write_table`.

Best,
Joris


On Wed, 8 Jul 2020 at 20:44, Yash Ganthe <yashgt@gmail.com> wrote:
>
> Hi,
>
> parquet_writer.write_table(table)
>
> This line writes a single file.
> The documentation says:
> This creates a single Parquet file. In practice, a Parquet dataset may
> consist of many files in many directories. We can read a single file back
> with read_table:
>
> Is there a way for PyArrow to create a parquet file in the form of a
> directory with multiple part files in it such as :
>
> ls -lrt permit-inspections-recent.parquet
> ...  14:53 part-00001-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet
> ...  14:53 part-00000-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet
>
> Regards,
> Yash