arrow-user mailing list archives

From Xander Dunn <xan...@xander.ai>
Subject Re: How to Compress Dataset Writes
Date Sun, 23 May 2021 04:22:41 GMT
Alright, I got it working:

    parquet::WriterProperties::Builder file_writer_options_builder;
    file_writer_options_builder.compression(arrow::Compression::BROTLI);
    // file_writer_options_builder.compression(arrow::Compression::UNCOMPRESSED);
    std::shared_ptr<parquet::WriterProperties> props =
        file_writer_options_builder.build();

    std::shared_ptr<ds::FileWriteOptions> file_write_options =
        format->DefaultWriteOptions();
    auto parquet_options =
        arrow::internal::checked_pointer_cast<ds::ParquetFileWriteOptions>(
            file_write_options);
    parquet_options->writer_properties = props;

    arrow::dataset::FileSystemDatasetWriteOptions write_options;
    write_options.file_write_options = parquet_options;

But surely a call to arrow::internal is not the intended usage?

On Sat, May 22, 2021 at 8:52 PM Xander Dunn <xander@xander.ai> wrote:

> I see how to compress writes to a particular file using
> arrow::io::CompressedOutputStream::Make, but I’m having difficulty
> figuring out how to make Dataset writes compressed. I have my code set up
> similar to the CreateExampleParquetHivePartitionedDataset example here
> <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/dataset_documentation_example.cc#L113>.
>
>
> I suspect there is some option on the FileSystemDatasetWriteOptions to
> specify compression, but I haven’t been able to uncover it:
>
> ds::FileSystemDatasetWriteOptions write_options;
> write_options.file_write_options = format->DefaultWriteOptions();
> write_options.filesystem = filesystem;
> write_options.base_dir = base_path;
> write_options.partitioning = partitioning;
> write_options.basename_template = "part{i}.parquet";
> ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));
>
> FileSystemDatasetWriteOptions is defined here
> <https://github.com/apache/arrow/blob/602a76ac58bc8de60a353648f02cf11891563e77/cpp/src/arrow/dataset/file_base.h#L331>
> and doesn’t have a compression option.
>
> The file_write_options property is a ParquetFileWriteOptions, which is
> defined here
> <https://github.com/apache/arrow/blob/8b4942728e7347dc921a2d423e996fea5f9e2102/cpp/src/arrow/dataset/file_parquet.h#L222>
> and has a parquet::WriterProperties and parquet::ArrowWriterProperties.
> It’s created here:
>
> std::shared_ptr<FileWriteOptions> ParquetFileFormat::DefaultWriteOptions() {
>   std::shared_ptr<ParquetFileWriteOptions> options(
>       new ParquetFileWriteOptions(shared_from_this()));
>   options->writer_properties = parquet::default_writer_properties();
>   options->arrow_writer_properties = parquet::default_arrow_writer_properties();
>   return options;
> }
>
> parquet::WriterProperties can be created with a compression specified
> like this:
>
>     parquet::WriterProperties::Builder file_writer_options_builder;
>     file_writer_options_builder.compression(arrow::Compression::BROTLI);
>     std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();
>
> However, I have been unable to create a FileWriteOptions which includes
> this WriterProperties. What is shared_from_this()? Creating a
> FileWriteOptions with std::make_shared<> doesn’t compile. Any pointers on
> creating a FileWriteOptions in my project, or a better way to specify the
> compression type on a dataset write?
>
