From: Elad Rosenheim <elad@dynamicyield.com>
Date: Wed, 26 May 2021 20:39:24 +0300
Subject: Re: Long-Running Continuous Data Saving to File
To: user@arrow.apache.org

Hi,

While I'm not using the C++ version of Arrow, the issue you're talking about is a very common concern.

There are a few points to discuss here:

1. Generally, Parquet files cannot be appended to. You could of course load the file into memory, add more information and re-save, but that's not really what you're looking for... tools like `parquet-tools` can concatenate files together by creating a new file with two (or more) row groups, but that's not a very good solution either. Having multiple row groups in a single file is sometimes desirable, but in this case it would most probably just create a less compressed file.
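
To illustrate what "load, add, re-save" means in practice, here's a rough, untested sketch against the Arrow C++ API (I'm not using the C++ version myself, so treat names and signatures as approximate - they also shift a bit between Arrow versions):

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/arrow/writer.h>

    // Rough sketch of the "read it back, add rows, re-save" workaround:
    // the whole existing file is loaded, the new rows are stacked after it,
    // and everything is rewritten - there is no in-place append for Parquet.
    arrow::Status RewriteWithNewRows(const std::string& path,
                                     const std::shared_ptr<arrow::Table>& new_rows) {
      // Read the existing file back into a table.
      ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
      std::unique_ptr<parquet::arrow::FileReader> reader;
      ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));
      std::shared_ptr<arrow::Table> existing;
      ARROW_RETURN_NOT_OK(reader->ReadTable(&existing));

      // Stack the new rows after the existing ones (schemas must match).
      ARROW_ASSIGN_OR_RAISE(auto combined, arrow::ConcatenateTables({existing, new_rows}));

      // Rewrite the whole file from scratch (in real life, write to a
      // temporary name and rename - see the sketch under point 2).
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
      return parquet::arrow::WriteTable(*combined, arrow::default_memory_pool(), sink,
                                        /*chunk_size=*/1 << 20);
    }

The cost of doing this grows with the file, which is why it isn't a real answer for a continuous stream.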

2. The other concern is reliability - a process that holds a big batch in memory and then spills it to disk every X minutes/rows/bytes is bound to have issues when things crash/get stuck/need to go down for maintenance. You probably want guarantees that are as close to "exactly once" as possible (the holy grail...). One common solution is to write to Kafka, and have a consumer that periodically reads a batch of messages and stores them to file. This is nowadays provided by Kafka Connect, thankfully. Anyway, the "exactly once" part stops at this point; for anything that happens downstream you'd need to provide your own guarantees (e.g. by making the downstream jobs idempotent).
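
Even without Kafka, you can at least bound what a crash can lose by rotating files and only "publishing" a file once it was fully written - e.g. write each batch to a temporary name and rename it on a clean close. Again a rough, untested C++ sketch (signatures may differ between Arrow versions); note that a Parquet file is only readable once its footer has been written, which is also why keeping a single file open for 24 hours is risky:

    #include <filesystem>
    #include <string>
    #include <system_error>

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/writer.h>

    // Rough sketch: flush the current in-memory batch to its own small Parquet
    // file, writing to a temporary name first and renaming only after a clean
    // close. Readers never see a half-written file, and a crash loses at most
    // the rows that were still buffered in memory.
    arrow::Status FlushBatch(const std::shared_ptr<arrow::Table>& batch,
                             const std::string& final_path) {
      const std::string tmp_path = final_path + ".inprogress";
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(tmp_path));
      // WriteTable writes a complete file, including the Parquet footer;
      // without the footer the file isn't readable by normal readers.
      ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(*batch, arrow::default_memory_pool(),
                                                     sink, /*chunk_size=*/1 << 20));
      ARROW_RETURN_NOT_OK(sink->Close());
      // Publish the finished file via rename (assumes both paths are on the
      // same filesystem, so the rename is atomic).
      std::error_code ec;
      std::filesystem::rename(tmp_path, final_path, ec);
      if (ec) {
        return arrow::Status::IOError("rename failed: ", ec.message());
      }
      return arrow::Status::OK();
    }

A durable queue like Kafka then takes care of replaying the rows that were only in memory when the process died.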

3. Then, you're back to the question of many, many files per day... there is no magical solution to this. You may need a scheduled task that reads the files every X hours (or every day?) and re-partitions the data in whatever way makes the most sense for processing/querying later - perhaps by date, perhaps by customer, or both. There are various tools that help with this.
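
The compaction pass itself can be pretty simple if the files share a schema: read the day's small files, merge, and rewrite as one file with large row groups. One more rough, untested C++ sketch under the same caveats:

    #include <string>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/arrow/writer.h>

    // Rough sketch of a periodic compaction pass: merge many small Parquet
    // files (all with the same schema) into one file with large row groups,
    // which typically compresses and scans much better than 1000+ tiny files.
    arrow::Status Compact(const std::vector<std::string>& inputs,
                          const std::string& output) {
      std::vector<std::shared_ptr<arrow::Table>> tables;
      for (const auto& path : inputs) {
        ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
        std::unique_ptr<parquet::arrow::FileReader> reader;
        ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));
        std::shared_ptr<arrow::Table> table;
        ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
        tables.push_back(std::move(table));
      }
      ARROW_ASSIGN_OR_RAISE(auto merged, arrow::ConcatenateTables(tables));
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(output));
      // ~1M rows per row group; tune for your row width and query patterns.
      return parquet::arrow::WriteTable(*merged, arrow::default_memory_pool(), sink,
                                        /*chunk_size=*/1 << 20);
    }

Splitting the output by date/customer instead of writing one file is the same idea, just with one writer per partition (or use the Arrow datasets API, which can do partitioned writes for you).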

Elad

On Wed, May 26, 2021 at 7:32 PM Xander Dunn <xander@xander.ai> wrote:
> I have a very long-running (months) program that is streaming in data continually, processing it, and saving it to file using Arrow. My current solution is to buffer several million rows and write them to a new .parquet file each time. This works, but produces 1000+ files every day.
>
> If I could, I would just append to the same file for each day. I see an `arrow::fs::FileSystem::OpenAppendStream` - what file formats does this work with? Can I append to .parquet or .feather files? Googling seems to indicate these formats can't be appended to.
>
> Using the `parquet::StreamWriter`, could I continually stream rows to a single file throughout the day? What happens if the program is unexpectedly terminated? Would everything in the currently open monolithic file be lost? I would be streaming rows to a single .parquet file for 24 hours.
>
> Thanks,
> Xander
