arrow-user mailing list archives

From Weston Pace <weston.p...@gmail.com>
Subject Re: How to Compress Dataset Writes
Date Tue, 25 May 2021 21:55:36 GMT
One minor note is that specifying compression in
parquet::WriterProperties will result in a slightly different file
than one created with arrow::io::CompressedOutputStream::Make.  The
former tells parquet the default compression to use for column data
(you could even specify a per-column compression scheme if desired).
It is unique to parquet.  The latter applies compression to the entire
file.  It could be used on any output format.

What you have should be fine.  There is currently no way (that I am
aware of) to specify file-wide compression on dataset writes.  This will
probably be a more essential feature once CSV support (or some other
format that doesn't natively handle compression) is added for dataset
writes.

On Sat, May 22, 2021 at 9:17 PM Micah Kornfield <emkornfield@gmail.com> wrote:
>
> internal::checked_pointer_cast isn't really anything special.  It simply switches between
> std::static_pointer_cast<T> (release builds) and std::dynamic_pointer_cast<T> (debug builds).
> So you can use one or the other directly, depending on how confident you are in the
> type you are casting.
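
The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not Arrow's source: the two structs are hypothetical stand-ins for ds::FileWriteOptions and ds::ParquetFileWriteOptions, and checked_pointer_cast_sketch and as_parquet_options are names invented here.

```cpp
#include <memory>

// Hypothetical stand-ins for ds::FileWriteOptions / ds::ParquetFileWriteOptions.
struct FileWriteOptions {
  virtual ~FileWriteOptions() = default;  // polymorphic base, needed for dynamic casts
};
struct ParquetFileWriteOptions : FileWriteOptions {
  int compression_level = 0;  // illustrative field only
};

// Sketch of a debug/release switching cast (name is an assumption):
// type-checked in debug builds, unchecked in builds compiled with NDEBUG.
template <typename T, typename U>
std::shared_ptr<T> checked_pointer_cast_sketch(const std::shared_ptr<U>& ptr) {
#ifdef NDEBUG
  return std::static_pointer_cast<T>(ptr);
#else
  return std::dynamic_pointer_cast<T>(ptr);
#endif
}

// Alternative that avoids the internal namespace entirely: always use a
// dynamic cast; the result is null when the runtime type does not match.
inline std::shared_ptr<ParquetFileWriteOptions> as_parquet_options(
    const std::shared_ptr<FileWriteOptions>& options) {
  return std::dynamic_pointer_cast<ParquetFileWriteOptions>(options);
}
```

If the options object is known to come from a Parquet format, std::static_pointer_cast is enough; otherwise the dynamic cast plus a null check is the safer choice.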
>
>
> On Sat, May 22, 2021 at 9:23 PM Xander Dunn <xander@xander.ai> wrote:
>>
>> Alright, I got it working:
>>
>>     parquet::WriterProperties::Builder file_writer_options_builder;
>>     file_writer_options_builder.compression(arrow::Compression::BROTLI);
>>     //file_writer_options_builder.compression(arrow::Compression::UNCOMPRESSED);
>>     std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();
>>
>>     std::shared_ptr<ds::FileWriteOptions> file_write_options = format->DefaultWriteOptions();
>>     auto parquet_options = arrow::internal::checked_pointer_cast<ds::ParquetFileWriteOptions>(file_write_options);
>>     parquet_options->writer_properties = props;
>>     arrow::dataset::FileSystemDatasetWriteOptions write_options;
>>     write_options.file_write_options = parquet_options;
>>
>> But surely a call to arrow::internal is not the intended usage?
>>
>>
>> On Sat, May 22, 2021 at 8:52 PM Xander Dunn <xander@xander.ai> wrote:
>>>
>>> I see how to compress writes to a particular file using arrow::io::CompressedOutputStream::Make,
>>> but I’m having difficulty figuring out how to make Dataset writes compressed. I have my
>>> code set up similar to the CreateExampleParquetHivePartitionedDataset example here.
>>>
>>> I suspect there is some option on the FileSystemDatasetWriteOptions to specify
>>> compression, but I haven’t been able to uncover it:
>>>
>>>     ds::FileSystemDatasetWriteOptions write_options;
>>>     write_options.file_write_options = format->DefaultWriteOptions();
>>>     write_options.filesystem = filesystem;
>>>     write_options.base_dir = base_path;
>>>     write_options.partitioning = partitioning;
>>>     write_options.basename_template = "part{i}.parquet";
>>>     ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));
>>>
>>> FileSystemDatasetWriteOptions is defined here and doesn’t have a compression
>>> option.
>>>
>>> The file_write_options property is a ParquetFileWriteOptions, which is defined
>>> here and has a parquet::WriterProperties and parquet::ArrowWriterProperties. It’s created
>>> here:
>>>
>>> std::shared_ptr<FileWriteOptions> ParquetFileFormat::DefaultWriteOptions() {
>>>   std::shared_ptr<ParquetFileWriteOptions> options(
>>>       new ParquetFileWriteOptions(shared_from_this()));
>>>   options->writer_properties = parquet::default_writer_properties();
>>>   options->arrow_writer_properties = parquet::default_arrow_writer_properties();
>>>   return options;
>>> }
>>>
>>> parquet::WriterProperties can be created with a compression specified like this:
>>>
>>>     parquet::WriterProperties::Builder file_writer_options_builder;
>>>     file_writer_options_builder.compression(arrow::Compression::BROTLI);
>>>     std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();
>>>
>>> However, I have been unable to create a FileWriteOptions which includes this
>>> WriterProperties. What is shared_from_this()? Creating a FileWriteOptions with std::make_shared<>
>>> doesn’t compile. Any pointers on creating a FileWriteOptions in my project, or a better
>>> way to specify the compression type on a dataset write?
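
On the shared_from_this() question: it is a member inherited from std::enable_shared_from_this, and it returns a std::shared_ptr that shares ownership with the shared_ptr already managing the object. The sketch below mirrors the shape of the DefaultWriteOptions() snippet quoted above using only the standard library. Format and WriteOptions are hypothetical stand-ins, not the Arrow classes.

```cpp
#include <memory>
#include <utility>

struct Format;

// Hypothetical stand-in for a write-options object that keeps its format alive.
struct WriteOptions {
  explicit WriteOptions(std::shared_ptr<Format> f) : format(std::move(f)) {}
  std::shared_ptr<Format> format;
};

// Hypothetical stand-in for a file format that hands out its own options.
struct Format : std::enable_shared_from_this<Format> {
  std::shared_ptr<WriteOptions> DefaultWriteOptions() {
    // shared_from_this() yields a shared_ptr to *this that shares the
    // existing control block, so no second, independent owner is created.
    // It requires that *this is already owned by a shared_ptr.
    return std::make_shared<WriteOptions>(shared_from_this());
  }
};
```

The practical consequence is that the format object must itself be held in a std::shared_ptr before DefaultWriteOptions() is called, which is already the case when the options are obtained through format->DefaultWriteOptions() as in the code above.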
