arrow-user mailing list archives

From Charlton Callender <>
Subject [R][Dataset] how to speed up creating FileSystemDatasetFactory from a large partitioned dataset?
Date Wed, 23 Dec 2020 05:05:56 GMT

I am starting to use arrow in a workflow where I have a dataset partitioned by a couple of variables
(like location and year), which leads to > 100,000 Parquet files.

I have been using `arrow::open_dataset(sources = FILEPATH, unify_schemas = FALSE)` but found
that it takes a couple of minutes to run. I can see that almost all of the time is spent on the
line that creates the FileSystemDatasetFactory.

In my use case I already know all the partition file paths, and I know the schema (and that it is
consistent across partitions). Is there any way to use that information to build the Dataset
object more quickly for a highly partitioned dataset?

I found a section in the Python docs about creating a dataset from a list of file paths — is this
possible from R?

Thank you! I’ve been finding arrow/parquet really useful as an alternative to hdf5 and csv.