arrow-user mailing list archives

From Ira
Subject [Python] How to know which partitions dataset.write_dataset will affect when writing?
Date Thu, 25 Mar 2021 11:41:08 GMT

I am trying to overwrite partitions when writing a table to HDFS using
pyarrow. What is the recommended way to figure out which directories I
should clear before writing the dataset?

My current approach is to convert the pyarrow.Table to a pandas DataFrame,
group by the partitioning columns, and from that work out which
directories will be affected. However, I'd like to avoid the conversion to
pandas if possible. Since pyarrow works out where to write the data quite
quickly, I hope I can somehow reuse the way it determines the output
paths.
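
For illustration, the group-by idea described above could be sketched
without pandas roughly as follows. This is only a sketch under several
assumptions: the dataset uses Hive-style "column=value" directories, null
keys go to a "__HIVE_DEFAULT_PARTITION__" directory, no partition value
needs escaping, and the installed pyarrow provides Table.group_by (7.0 or
later). The helper names hive_partition_dirs and affected_dirs are
hypothetical and not part of pyarrow.

```python
import posixpath

# Assumption: directory name used for null partition keys in Hive-style layouts.
NULL_FALLBACK = "__HIVE_DEFAULT_PARTITION__"


def hive_partition_dirs(base_dir, partition_cols, rows):
    """Return the distinct Hive-style directories for the given key tuples.

    Hypothetical helper: `rows` is an iterable of tuples holding one value
    per partition column, in the same order as `partition_cols`.
    """
    dirs = []
    seen = set()
    for row in rows:
        segments = [
            f"{col}={NULL_FALLBACK if value is None else value}"
            for col, value in zip(partition_cols, row)
        ]
        # HDFS paths use forward slashes, so join with posixpath.
        path = posixpath.join(base_dir, *segments)
        if path not in seen:
            seen.add(path)
            dirs.append(path)
    return dirs


def affected_dirs(table, base_dir, partition_cols):
    """Hypothetical wrapper: `table` is assumed to be a pyarrow.Table."""
    # Grouping with an empty aggregation list keeps only the distinct
    # combinations of the key columns (needs pyarrow >= 7.0).
    distinct = table.group_by(partition_cols).aggregate([])
    columns = [distinct.column(col).to_pylist() for col in partition_cols]
    return hive_partition_dirs(base_dir, partition_cols, zip(*columns))
```

Calling affected_dirs(table, "/data/events", ["year", "month"]) would then
list the directories to clear before the write. Only the distinct key
combinations are converted to Python objects, not the whole table. Newer
pyarrow releases also document an existing_data_behavior="delete_matching"
option for write_dataset, which may make the manual clearing unnecessary.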

Thank you!

Best regards,

