arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: How to Compress Dataset Writes
Date Sun, 23 May 2021 07:16:10 GMT
internal::checked_pointer_cast isn't really anything special. It simply
switches between std::static_pointer_cast<T> (release builds) and
std::dynamic_pointer_cast<T> (debug builds). So you can call one or the
other directly, depending on how confident you are in the type you are
casting to.
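
To make the trade-off concrete, here is a minimal sketch that uses only the standard library. BaseOptions, ParquetOptions and CsvOptions are hypothetical stand-ins for ds::FileWriteOptions and its subclasses, not Arrow types, and the helper names are made up for illustration.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for ds::FileWriteOptions and its subclasses.
struct BaseOptions {
  virtual ~BaseOptions() = default;  // polymorphic base, required for dynamic casts
};
struct ParquetOptions : BaseOptions {
  int compression_level = 0;
};
struct CsvOptions : BaseOptions {};

// Checked downcast: returns nullptr when the dynamic type does not match.
std::shared_ptr<ParquetOptions> AsParquetChecked(
    const std::shared_ptr<BaseOptions>& base) {
  return std::dynamic_pointer_cast<ParquetOptions>(base);
}

// Unchecked downcast: no runtime type check in release builds, so it is
// undefined behavior if the object is not actually a ParquetOptions.
// The assert catches a wrong type in debug builds only.
std::shared_ptr<ParquetOptions> AsParquetUnchecked(
    const std::shared_ptr<BaseOptions>& base) {
  assert(std::dynamic_pointer_cast<ParquetOptions>(base) != nullptr);
  return std::static_pointer_cast<ParquetOptions>(base);
}
```

In the snippet quoted below, that would mean replacing the arrow::internal::checked_pointer_cast call with std::dynamic_pointer_cast<ds::ParquetFileWriteOptions>(file_write_options) and checking the result for nullptr, or with std::static_pointer_cast if you are sure the format is Parquet.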


On Sat, May 22, 2021 at 9:23 PM Xander Dunn <xander@xander.ai> wrote:

> Alright, I got it working:
>
>     parquet::WriterProperties::Builder file_writer_options_builder;
>     file_writer_options_builder.compression(arrow::Compression::BROTLI);
>     //file_writer_options_builder.compression(arrow::Compression::UNCOMPRESSED);
>     std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();
>
>     std::shared_ptr<ds::FileWriteOptions> file_write_options = format->DefaultWriteOptions();
>     auto parquet_options = arrow::internal::checked_pointer_cast<ds::ParquetFileWriteOptions>(file_write_options);
>     parquet_options->writer_properties = props;
>     arrow::dataset::FileSystemDatasetWriteOptions write_options;
>     write_options.file_write_options = parquet_options;
>
> But surely a call to arrow::internal is not the intended usage?
>
> On Sat, May 22, 2021 at 8:52 PM Xander Dunn <xander@xander.ai> wrote:
>
>> I see how to compress writes to a particular file using
>> arrow::io::CompressedOutputStream::Make, but I’m having difficulty
>> figuring out how to make Dataset writes compressed. I have my code set
>> up similar to the CreateExampleParquetHivePartitionedDataset example here
>> <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/dataset_documentation_example.cc#L113>.
>>
>>
>> I suspect there is some option on the FileSystemDatasetWriteOptions to
>> specify compression, but I haven’t been able to uncover it:
>>
>>   ds::FileSystemDatasetWriteOptions write_options;
>>   write_options.file_write_options = format->DefaultWriteOptions();
>>   write_options.filesystem = filesystem;
>>   write_options.base_dir = base_path;
>>   write_options.partitioning = partitioning;
>>   write_options.basename_template = "part{i}.parquet";
>>   ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));
>>
>> FileSystemDatasetWriteOptions is defined here
>> <https://github.com/apache/arrow/blob/602a76ac58bc8de60a353648f02cf11891563e77/cpp/src/arrow/dataset/file_base.h#L331>
>> and doesn’t have a compression option.
>>
>> The file_write_options property is a ParquetFileWriteOptions, which is
>> defined here
>> <https://github.com/apache/arrow/blob/8b4942728e7347dc921a2d423e996fea5f9e2102/cpp/src/arrow/dataset/file_parquet.h#L222>
>> and has a parquet::WriterProperties and parquet::ArrowWriterProperties.
>> It’s created here:
>>
>> std::shared_ptr<FileWriteOptions> ParquetFileFormat::DefaultWriteOptions() {
>>   std::shared_ptr<ParquetFileWriteOptions> options(
>>       new ParquetFileWriteOptions(shared_from_this()));
>>   options->writer_properties = parquet::default_writer_properties();
>>   options->arrow_writer_properties = parquet::default_arrow_writer_properties();
>>   return options;
>> }
>>
>> parquet::WriterProperties can be created with a compression specified
>> like this:
>>
>>     parquet::WriterProperties::Builder file_writer_options_builder;
>>     file_writer_options_builder.compression(arrow::Compression::BROTLI);
>>     std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();
>>
>> However, I have been unable to create a FileWriteOptions which includes
>> this WriterProperties. What is shared_from_this()? Creating a
>> FileWriteOptions with std::make_shared<> doesn’t compile. Any pointers
>> on creating a FileWriteOptions in my project, or a better way to specify
>> the compression type on a dataset write?
>>
>
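
On the shared_from_this() question quoted above: it comes from std::enable_shared_from_this and lets an object that is already owned by a shared_ptr hand out another shared_ptr to itself. The sketch below shows the general pattern with hypothetical Format and WriteOptions classes. It is not Arrow's code, and it assumes (I have not checked this here) that the options constructor is not public, which would be one plausible reason std::make_shared fails to compile.

```cpp
#include <memory>
#include <utility>

class Format;

// Hypothetical options type; only Format is allowed to construct it.
class WriteOptions {
 public:
  std::shared_ptr<Format> format() const { return format_; }
  int compression = 0;  // hypothetical setting that callers may change

 private:
  friend class Format;
  // Private constructor: std::make_shared<WriteOptions>(...) would not
  // compile from user code because make_shared cannot access it.
  explicit WriteOptions(std::shared_ptr<Format> format)
      : format_(std::move(format)) {}
  std::shared_ptr<Format> format_;
};

class Format : public std::enable_shared_from_this<Format> {
 public:
  std::shared_ptr<WriteOptions> DefaultWriteOptions() {
    // shared_from_this() returns a shared_ptr that shares ownership with the
    // shared_ptr already managing *this. The Format object must therefore
    // have been created through a shared_ptr (e.g. std::make_shared<Format>()).
    return std::shared_ptr<WriteOptions>(new WriteOptions(shared_from_this()));
  }
};
```

So the usual route is the one you found: ask the format for its default options, then modify the returned object rather than constructing one yourself.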
