From: Micah Kornfield
Reply-To: emkornfield@gmail.com
Date: Sun, 23 May 2021 00:16:10 -0700
Subject: Re: How to Compress Dataset Writes
To: user@arrow.apache.org

internal::checked_pointer_cast isn't really anything special. It simply switches between std::static_pointer_cast<T> and std::dynamic_pointer_cast<T> depending on debug/release compilation (dynamic in debug builds, static in release builds). So you can choose one or the other depending on how confident you are in the type you are casting.
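
To make the debug/release switch concrete, here is a minimal sketch using only the standard library. The types `BaseOptions`, `ParquetLikeOptions` and `OtherOptions`, and the helpers `checked_pointer_cast_sketch` and `as_parquet_like`, are hypothetical stand-ins invented for illustration; they are not Arrow's types or its actual implementation.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-ins for a base options type and a format-specific
// subclass; these are NOT Arrow types.
struct BaseOptions {
  virtual ~BaseOptions() = default;
};
struct ParquetLikeOptions : BaseOptions {
  int compression_level = 0;
};
struct OtherOptions : BaseOptions {};

// Sketch of a "checked" cast: verified (dynamic) in debug builds,
// unverified (static) when NDEBUG is defined, i.e. in release builds.
template <class T, class U>
std::shared_ptr<T> checked_pointer_cast_sketch(std::shared_ptr<U> p) noexcept {
#ifdef NDEBUG
  return std::static_pointer_cast<T>(std::move(p));
#else
  return std::dynamic_pointer_cast<T>(std::move(p));
#endif
}

// Standard-library-only alternative for when the dynamic type is uncertain:
// returns nullptr on a mismatch instead of invoking undefined behaviour.
inline std::shared_ptr<ParquetLikeOptions> as_parquet_like(
    const std::shared_ptr<BaseOptions>& options) {
  return std::dynamic_pointer_cast<ParquetLikeOptions>(options);
}
```

If the pointer is known to refer to the derived type, std::static_pointer_cast is enough; if there is any doubt, std::dynamic_pointer_cast plus a null check is the safer choice.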

On Sat, May 22, 2021 at 9:23 PM Xander Dunn <xander@xander.ai> wrote:

Alright, I got it working:

    parquet::WriterProperties::Builder file_writer_options_builder;
    file_writer_options_builder.compression(arrow::Compression::BROTLI);
    //file_writer_options_builder.compression(arrow::Compression::UNCOMPRESSED);
    std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();

    std::shared_ptr<ds::FileWriteOptions> file_write_options = format->DefaultWriteOptions();
    auto parquet_options = arrow::internal::checked_pointer_cast<ds::ParquetFileWriteOptions>(file_write_options);
    parquet_options->writer_properties = props;
    arrow::dataset::FileSystemDatasetWriteOptions write_options;
    write_options.file_write_options = parquet_options;

But surely a call to arrow::internal is not the intended usage?


On Sat, May 22, 2021 at 8:52 PM Xander Dunn <xander@xander.ai> wrote:

I see how to compress writes to a particular file using arrow::io::CompressedOutputStream::Make, but I’m having difficulty figuring out how to make Dataset writes compressed. I have my code set up similar to the CreateExampleParquetHivePartitionedDataset example here.

I suspect there is some option on the FileSystemDatasetWriteOptions to specify compression, but I haven’t been able to uncover it:

    ds::FileSystemDatasetWriteOptions write_options;
    write_options.file_write_options = format->DefaultWriteOptions();
    write_options.filesystem = filesystem;
    write_options.base_dir = base_path;
    write_options.partitioning = partitioning;
    write_options.basename_template = "part{i}.parquet";
    ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));

FileSystemDatasetWriteOptions is defined here and doesn’t have a compression option.

The file_write_options property is a ParquetFileWriteOptions, which is defined here and has a parquet::WriterProperties and parquet::ArrowWriterProperties. It’s created here:

    std::shared_ptr<FileWriteOptions> ParquetFileFormat::DefaultWriteOptions() {
      std::shared_ptr<ParquetFileWriteOptions> options(
          new ParquetFileWriteOptions(shared_from_this()));
      options->writer_properties = parquet::default_writer_properties();
      options->arrow_writer_properties = parquet::default_arrow_writer_properties();
      return options;
    }
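
For readers unfamiliar with the shared_from_this() call in the snippet above: it comes from std::enable_shared_from_this and lets a member function obtain a std::shared_ptr that shares ownership with the one already managing the object. The sketch below mirrors only the shape of DefaultWriteOptions(); `Format` and `WriteOptions` are hypothetical stand-ins, and Arrow's real class hierarchy may differ.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Format;  // forward declaration so WriteOptions can refer to it

// Hypothetical options type that keeps its originating format alive.
struct WriteOptions {
  explicit WriteOptions(std::shared_ptr<Format> f) : format(std::move(f)) {}
  std::shared_ptr<Format> format;
};

// Hypothetical format type. Inheriting from std::enable_shared_from_this
// lets member functions obtain a shared_ptr that shares ownership with the
// shared_ptr already managing *this.
struct Format : std::enable_shared_from_this<Format> {
  std::shared_ptr<WriteOptions> DefaultWriteOptions() {
    // Only valid when *this is already owned by a std::shared_ptr.
    return std::make_shared<WriteOptions>(shared_from_this());
  }
};
```

The practical consequence is that the format object has to be held in a std::shared_ptr before a member like this is called, which is why the options are obtained through the format rather than constructed directly.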

parquet::WriterProperties can be created with a compression specified like this:

    parquet::WriterProperties::Builder file_writer_options_builder;
    file_writer_options_builder.compression(arrow::Compression::BROTLI);
    std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();

However, I have been unable to create a FileWriteOptions which includes this WriterProperties. What is shared_from_this()? Creating a FileWriteOptions with std::make_shared<> doesn’t compile. Any pointers on creating a FileWriteOptions in my project, or a better way to specify the compression type on a dataset write?
