arrow-user mailing list archives

From Xander Dunn <xan...@xander.ai>
Subject Re: How to Compress Dataset Writes
Date Wed, 26 May 2021 00:13:07 GMT
 Thanks Weston!


On Tue, May 25 2021 at 17:55, Weston Pace <weston.pace@gmail.com> wrote:

> One minor note is that specifying compression in
> parquet::WriterProperties will result in a slightly different file than
> one created with arrow::io::CompressedOutputStream::Make. The former tells
> parquet the default compression to use for column data
> (you could even specify a per-column compression scheme if desired). It is
> unique to parquet. The latter applies compression to the entire file. It
> could be used on any output format.
>
> What you have should be fine. There is currently no way (that I am aware of) to
> specify file-wide compression on dataset writes. This will probably be a
> more essential feature once CSV support (or some other format that doesn't
> natively handle compression) is added for dataset writes.
>
> On Sat, May 22, 2021 at 9:17 PM Micah Kornfield <emkornfield@gmail.com>
> wrote:
>
> internal::checked_pointer_cast isn't really anything special. It simply
> switches between std::static_pointer_cast<T> and
> std::dynamic_pointer_cast<T> depending on debug/release compilation. So you
> can choose one or the other depending on how confident you are in the type
> you are casting.
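
To make the point above concrete, here is a minimal sketch of a cast helper that switches on the build type. It assumes the release/debug switch is the standard NDEBUG macro, with the checked dynamic cast used in debug builds. The names BaseOptions, DerivedOptions, and checked_pointer_cast_sketch are hypothetical stand-ins for illustration, not Arrow's actual source.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-ins for a base options type and a format-specific subclass.
struct BaseOptions {
  virtual ~BaseOptions() = default;
};
struct DerivedOptions : BaseOptions {
  int compression_level = 0;
};

// Sketch of the described behaviour: an unchecked static cast when NDEBUG is
// defined (release), and a checked dynamic cast otherwise (debug).
template <typename To, typename From>
std::shared_ptr<To> checked_pointer_cast_sketch(std::shared_ptr<From> value) {
#ifdef NDEBUG
  return std::static_pointer_cast<To>(std::move(value));
#else
  return std::dynamic_pointer_cast<To>(std::move(value));
#endif
}

// Public-API alternative when the dynamic type is uncertain: always use the
// checked cast and let the caller handle a null result.
inline std::shared_ptr<DerivedOptions> as_derived_or_null(
    const std::shared_ptr<BaseOptions>& value) {
  return std::dynamic_pointer_cast<DerivedOptions>(value);
}
```

If the options object came from a Parquet format's DefaultWriteOptions(), the static cast is safe; if the format could be anything, the dynamic cast plus a null check is the more defensive choice.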
>
> On Sat, May 22, 2021 at 9:23 PM Xander Dunn <xander@xander.ai> wrote:
>
> Alright, I got it working:
>
> parquet::WriterProperties::Builder file_writer_options_builder;
> file_writer_options_builder.compression(arrow::Compression::BROTLI);
> //file_writer_options_builder.compression(arrow::Compression::UNCOMPRESSED);
>
> std::shared_ptr<parquet::WriterProperties> props =
> file_writer_options_builder.build();
>
> std::shared_ptr<ds::FileWriteOptions> file_write_options =
> format->DefaultWriteOptions();
> auto parquet_options =
> arrow::internal::checked_pointer_cast<ds::ParquetFileWriteOptions>(file_write_options);
>
> parquet_options->writer_properties = props;
> arrow::dataset::FileSystemDatasetWriteOptions write_options;
> write_options.file_write_options = parquet_options;
>
> But surely a call to arrow::internal is not the intended usage?
>
> On Sat, May 22, 2021 at 8:52 PM Xander Dunn <xander@xander.ai> wrote:
>
> I see how to compress writes to a particular file using
> arrow::io::CompressedOutputStream::Make, but I’m having difficulty figuring
> out how to make Dataset writes compressed. I have my code set up similar to
> the CreateExampleParquetHivePartitionedDataset example here.
>
> I suspect there is some option on the FileSystemDatasetWriteOptions to
> specify compression, but I haven’t been able to uncover it:
>
> ds::FileSystemDatasetWriteOptions write_options;
> write_options.file_write_options = format->DefaultWriteOptions();
> write_options.filesystem = filesystem;
> write_options.base_dir = base_path;
> write_options.partitioning = partitioning;
> write_options.basename_template = "part{i}.parquet";
> ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));
>
> FileSystemDatasetWriteOptions is defined here and doesn’t have a
> compression option.
>
> The file_write_options property is a ParquetFileWriteOptions, which is
> defined here and has a parquet::WriterProperties and
> parquet::ArrowWriterProperties. It’s created here:
>
> std::shared_ptr<FileWriteOptions> ParquetFileFormat::DefaultWriteOptions() {
>   std::shared_ptr<ParquetFileWriteOptions> options(
>       new ParquetFileWriteOptions(shared_from_this()));
>   options->writer_properties = parquet::default_writer_properties();
>   options->arrow_writer_properties = parquet::default_arrow_writer_properties();
>   return options;
> }
>
> parquet::WriterProperties can be created with a compression specified like
> this:
>
> parquet::WriterProperties::Builder file_writer_options_builder;
> file_writer_options_builder.compression(arrow::Compression::BROTLI);
> std::shared_ptr<parquet::WriterProperties> props =
> file_writer_options_builder.build();
>
> However, I have been unable to create a FileWriteOptions which includes
> this WriterProperties. What is shared_from_this()? Creating a
> FileWriteOptions with std::make_shared<> doesn’t compile. Any pointers on
> creating a FileWriteOptions in my project, or a better way to specify the
> compression type on a dataset write?
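
On the shared_from_this() question: it comes from std::enable_shared_from_this and lets an object that is already owned by a std::shared_ptr hand out another shared_ptr to itself. The sketch below uses hypothetical Format and Options classes, not Arrow's types, to show the pattern in the DefaultWriteOptions() snippet quoted above. It also shows one common reason std::make_shared<> fails to compile, namely a non-public constructor; that this is the cause in the Arrow case is an assumption here, not something confirmed in this thread.

```cpp
#include <cassert>
#include <memory>
#include <utility>

class Format;  // forward declaration

// Hypothetical options object that keeps the format that created it alive.
class Options {
 public:
  std::shared_ptr<Format> format() const { return format_; }

 private:
  friend class Format;  // only Format may construct Options
  explicit Options(std::shared_ptr<Format> format) : format_(std::move(format)) {}
  std::shared_ptr<Format> format_;
};

// Hypothetical format type. Inheriting from enable_shared_from_this is what
// makes shared_from_this() available inside member functions.
class Format : public std::enable_shared_from_this<Format> {
 public:
  std::shared_ptr<Options> DefaultWriteOptions() {
    // std::make_shared<Options>(...) would not compile here or anywhere else,
    // because make_shared itself cannot reach the private constructor.
    // Calling `new` directly works because Format is a friend.
    return std::shared_ptr<Options>(new Options(shared_from_this()));
  }
};
```

Note that shared_from_this() is only valid when the object is already managed by a shared_ptr, so the Format above has to be created with std::make_shared<Format>() or equivalent. With this design the caller does not construct the options at all; it asks the format for them and then mutates the returned object, which is what the working code earlier in the thread does.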
>
>
