From: Weston Pace <weston@ursacomputing.com>
Date: Wed, 26 May 2021 12:07:12 -1000
Subject: Re: Long-Running Continuous Data Saving to File
To: user@arrow.apache.org

Elad's advice is very helpful. This is not a problem that Arrow solves
today (to the best of my knowledge). It is a topic that comes up
periodically [1][2][3]. If a crash happens while your Parquet stream
writer is open, the most likely outcome is that you will be missing the
footer (this gets written on close) and be unable to read the file
(although the data could presumably be recovered). The Parquet format
may be able to support an append mode, but readers don't typically
support it.

I believe a common approach to this problem is to dump out lots of
small files as the data arrives and then periodically batch them
together. Kafka is a great way to do this, but it could be done with a
single process as well. If you go very far down this path you will
likely run into concerns like durability and schema evolution, so I
don't mean to imply that it is trivial :)

[1] https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file
[2] https://issues.apache.org/jira/browse/PARQUET-1154
[3] https://lists.apache.org/thread.html/r7efad314abec0219016886eaddc7ba79a451087b6324531bdeede1af%40%3Cdev.arrow.apache.org%3E
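As a rough, untested C++ sketch of that "one small Parquet file per
batch" pattern (the function name, file path, and chunk size below are
placeholders, not anything the Parquet API requires):

    // Flush one in-memory batch to its own small, self-contained Parquet file.
    // The footer is written when the table write finishes and the file is
    // closed, so a crash only loses the batch currently being written, not
    // previously flushed files.
    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>

    arrow::Status FlushBatchToFile(const std::shared_ptr<arrow::Table>& batch,
                                   const std::string& path) {
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
      ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
          *batch, arrow::default_memory_pool(), sink,
          /*chunk_size=*/1 << 20));  // illustrative row-group size
      return sink->Close();
    }

Each call produces a complete file with its own footer, which is what
makes the many-small-files approach crash-tolerant in the first place.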
On Wed, May 26, 2021 at 7:39 AM Elad Rosenheim <elad@dynamicyield.com> wrote:
> Hi,
>
> While I'm not using the C++ version of Arrow, the issue you're talking
> about is a very common concern.
>
> There are a few points to discuss here:
>
> 1. Generally, Parquet files cannot be appended to. You could of course
> load the file into memory, add more information, and re-save, but
> that's not really what you're looking for... tools like `parquet-tools`
> can concatenate files together by creating a new file with two (or
> more) row groups, but that's not a very good solution either. Having
> multiple row groups in a single file is sometimes desirable, but in
> this case it would most probably just create a less compressed file.
>
> 2. The other concern is reliability - having a process that holds a big
> batch in memory and then spills it to disk every X minutes/rows/bytes
> is bound to have issues when things crash, get stuck, or need to go
> down for maintenance. You probably want guarantees as close to "exactly
> once" as possible (the holy grail...). One common solution for this is
> to write to Kafka and have a consumer that periodically reads a batch
> of messages and stores them to a file. This is nowadays provided by
> Kafka Connect, thankfully. Anyway, the "exactly once" part stops at
> this point, and for anything that happens downstream you'd need
>
> 3. Then, you're back to the question of many, many files per day...
> there is no magical solution to this. You may need a scheduled task
> that reads files every X hours (or every day?) and re-partitions the
> data in the way that makes the most sense for processing/querying
> later - perhaps by date, perhaps by customer, or both. There are
> various tools that help with this.
>
> Elad
>
> On Wed, May 26, 2021 at 7:32 PM Xander Dunn <xander@xander.ai> wrote:
>> I have a very long-running (months) program that is streaming in data
>> continually, processing it, and saving it to file using Arrow. My
>> current solution is to buffer several million rows and write them to a
>> new .parquet file each time. This works, but produces 1000+ files
>> every day.
>>
>> If I could, I would just append to the same file for each day. I see
>> an `arrow::fs::FileSystem::OpenAppendStream` - what file formats does
>> this work with? Can I append to .parquet or .feather files? Googling
>> seems to indicate these formats can't be appended to.
>>
>> Using the `parquet::StreamWriter`, could I continually stream rows to
>> a single file throughout the day? What happens if the program is
>> unexpectedly terminated? Would everything in the currently open
>> monolithic file be lost? I would be streaming rows to a single
>> .parquet file for 24 hours.
>>
>> Thanks,
>> Xander
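For the periodic compaction / re-partitioning step discussed above, a
rough, untested C++ sketch might look like the following. The function
name and paths are placeholders, and it assumes every small input file
shares a compatible schema:

    // Read a day's worth of small Parquet files back and rewrite them as one
    // consolidated file. Inputs should only be deleted after this succeeds.
    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/arrow/writer.h>

    #include <memory>
    #include <string>
    #include <vector>

    arrow::Status CompactFiles(const std::vector<std::string>& input_paths,
                               const std::string& output_path) {
      std::vector<std::shared_ptr<arrow::Table>> tables;
      for (const auto& path : input_paths) {
        ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
        std::unique_ptr<parquet::arrow::FileReader> reader;
        ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
            infile, arrow::default_memory_pool(), &reader));
        std::shared_ptr<arrow::Table> table;
        ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
        tables.push_back(std::move(table));
      }
      // All inputs must have a compatible schema; schema evolution needs
      // extra care, as noted above.
      ARROW_ASSIGN_OR_RAISE(auto combined, arrow::ConcatenateTables(tables));
      ARROW_ASSIGN_OR_RAISE(auto sink,
                            arrow::io::FileOutputStream::Open(output_path));
      ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
          *combined, arrow::default_memory_pool(), sink,
          /*chunk_size=*/1 << 20));
      return sink->Close();
    }

Running something like this on a schedule keeps the file count bounded,
and provided the small inputs are only removed after the consolidated
file has been fully written and closed, a crash mid-compaction should
not lose any data.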