From: Xander Dunn <xander@xander.ai>
Date: Sat, 22 May 2021 21:22:41 -0700
Subject: Re: How to Compress Dataset Writes
To: user@arrow.apache.org

Alright, I got it working:

    parquet::WriterProperties::Builder file_writer_options_builder;
    file_writer_options_builder.compression(arrow::Compression::BROTLI);
    //file_writer_options_builder.compression(arrow::Compression::UNCOMPRESSED);
    std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();

    std::shared_ptr<ds::FileWriteOptions> file_write_options = format->DefaultWriteOptions();
    auto parquet_options = arrow::internal::checked_pointer_cast<ds::ParquetFileWriteOptions>(file_write_options);
    parquet_options->writer_properties = props;
    arrow::dataset::FileSystemDatasetWriteOptions write_options;
    write_options.file_write_options = parquet_options;

But surely a call to arrow::internal is not the intended usage?
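
One possible way around it, as a minimal sketch rather than a confirmed recommendation: the same downcast can be written with the standard library's pointer casts. The types below are simplified stand-ins, not the Arrow classes, and the compression field and the helper names AsParquetOptions and SetCompression are hypothetical. The sketch assumes the pointer returned by DefaultWriteOptions() really points at the Parquet subclass, which the DefaultWriteOptions() source quoted further down suggests.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Simplified stand-ins for the Arrow types; NOT the real classes.
struct FileWriteOptions {
  virtual ~FileWriteOptions() = default;
};

struct ParquetFileWriteOptions : FileWriteOptions {
  std::string compression = "UNCOMPRESSED";  // hypothetical field
};

// Stand-in for format->DefaultWriteOptions(): returns a base-class pointer
// that actually owns the Parquet subclass.
std::shared_ptr<FileWriteOptions> DefaultWriteOptions() {
  return std::make_shared<ParquetFileWriteOptions>();
}

// Downcast with std::dynamic_pointer_cast; yields nullptr on a type mismatch.
std::shared_ptr<ParquetFileWriteOptions> AsParquetOptions(
    const std::shared_ptr<FileWriteOptions>& options) {
  return std::dynamic_pointer_cast<ParquetFileWriteOptions>(options);
}

// Overrides the compression through the downcast pointer and reads it back
// through the original base pointer.
std::string SetCompression(const std::string& codec) {
  std::shared_ptr<FileWriteOptions> base = DefaultWriteOptions();
  auto parquet_options = AsParquetOptions(base);
  assert(parquet_options != nullptr);
  parquet_options->compression = codec;
  // Both pointers share one object, so the change is visible via base.
  return std::static_pointer_cast<ParquetFileWriteOptions>(base)->compression;
}
```

If the assumption holds for the real types, std::static_pointer_cast<ds::ParquetFileWriteOptions>(format->DefaultWriteOptions()) should be usable in place of the arrow::internal call, since it is a plain standard-library cast.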


On Sat, May 22, 2021 at 8:52 PM Xander Dunn <xander@xander.ai> wrote:

I see how to compress writes to a particular file using arrow::io::CompressedOutputStream::Make, but I’m having difficulty figuring out how to make Dataset writes compressed. I have my code set up similar to the CreateExampleParquetHivePartitionedDataset example here <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/dataset_documentation_example.cc#L113>.

I suspect there is some option on the FileSystemDatasetWriteOptions to specify compression, but I haven’t been able to uncover it:

  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));

FileSystemDatasetWriteOptions is defined here and doesn’t have a compression option.

The file_write_options property is a ParquetFileWriteOptions, which is defined here and has a parquet::WriterProperties and parquet::ArrowWriterProperties. It’s created here:

std::shared_ptr<FileWriteOptions> ParquetFileFormat::DefaultWriteOptions() {
  std::shared_ptr<ParquetFileWriteOptions> options(
      new ParquetFileWriteOptions(shared_from_this()));
  options->writer_properties = parquet::default_writer_properties();
  options->arrow_writer_properties = parquet::default_arrow_writer_properties();
  return options;
}

parquet::WriterProperties can be created with a compression specified like this:

    parquet::WriterProperties::Builder file_writer_options_builder;
    file_writer_options_builder.compression(arrow::Compression::BROTLI);
    std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();

However, I have been unable to create a FileWriteOptions which includes this WriterProperties. What is shared_from_this()? Creating a FileWriteOptions with std::make_shared<> doesn’t compile. Any pointers on creating a FileWriteOptions in my project, or a better way to specify the compression type on a dataset write?
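
On the shared_from_this() question, a minimal sketch with stand-in types (Format, WriteOptions, MakeOptions and compression_level are all hypothetical, not Arrow names): shared_from_this() comes from std::enable_shared_from_this and returns a shared_ptr to the object it is called on, so a format can pass itself to the options it creates. The sketch assumes the options constructor is non-public with the format as a friend, which would explain why std::make_shared fails to compile from outside; that has not been checked against the Arrow headers.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

class Format;

// Stand-in for the write options; the constructor is private on purpose.
class WriteOptions {
 public:
  std::shared_ptr<Format> format() const { return format_; }
  int compression_level = 0;  // hypothetical setting

 private:
  friend class Format;  // only Format may construct WriteOptions
  explicit WriteOptions(std::shared_ptr<Format> format)
      : format_(std::move(format)) {}
  std::shared_ptr<Format> format_;
};

// Stand-in for the file format. enable_shared_from_this lets a member
// function obtain a shared_ptr to *this, provided the object is already
// owned by a shared_ptr.
class Format : public std::enable_shared_from_this<Format> {
 public:
  std::shared_ptr<WriteOptions> DefaultWriteOptions() {
    // std::make_shared cannot reach the private constructor, hence plain new.
    return std::shared_ptr<WriteOptions>(new WriteOptions(shared_from_this()));
  }
};

// Code outside Format cannot construct WriteOptions directly...
static_assert(
    !std::is_constructible<WriteOptions, std::shared_ptr<Format>>::value,
    "WriteOptions is only constructible by its friend Format");

// ...so it asks the format for defaults and then edits the public fields.
std::shared_ptr<WriteOptions> MakeOptions(const std::shared_ptr<Format>& format,
                                          int level) {
  assert(format != nullptr);
  std::shared_ptr<WriteOptions> options = format->DefaultWriteOptions();
  options->compression_level = level;
  return options;
}
```

Under that assumption, the intended pattern would be to take the defaults from the format and overwrite individual fields rather than construct the options object directly.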
