arrow-user mailing list archives

From Elad Rosenheim <e...@dynamicyield.com>
Subject Re: Long-Running Continuous Data Saving to File
Date Wed, 26 May 2021 17:39:24 GMT
Hi,

While I'm not using the C++ version of Arrow, the issue you're talking
about is a very common concern.

There are a few points to discuss here:

1. Generally, Parquet files cannot be appended to. You could of course load
the file into memory, add the new rows and re-save it, but that's not really
what you're looking for (a rough sketch of that rewrite approach is below,
after point 3). Tools like `parquet-tools` can concatenate files by creating
a new file with two (or more) row groups, but that's not a very good solution
either: having multiple row groups in a single file is sometimes desirable,
but here it would most likely just give you a less well-compressed file.

2. The other concern is reliability: a process that holds a big batch in
memory and spills it to disk every X minutes/rows/bytes is bound to have
issues when things crash, get stuck, or need to go down for maintenance. You
probably want guarantees as close to "exactly once" as possible (the holy
grail...). One common solution is to write to Kafka and have a consumer that
periodically reads a batch of messages and stores them to a file. This is
nowadays provided by Kafka Connect
<https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once/>,
thankfully (an illustrative sink configuration is sketched below). Note that
the "exactly once" part stops at that point; anything that happens further
downstream needs its own guarantees.

3. Then you're back to the question of many, many files per day... there is
no magic solution to this. You may need a scheduled task that reads the files
every X hours (or once a day?) and re-partitions the data in whatever way
makes the most sense for processing/querying later - perhaps by date, perhaps
by customer, or both. There are various tools that help with this; a rough
re-partitioning sketch using the Arrow datasets API is below as well.
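
To make point 1 concrete, here is a minimal sketch of the "load, add more,
re-save" approach in C++. It is only an illustration - the function name, the
path, the `new_rows` table and the chunk size are placeholders, and error
handling is kept to the bare minimum:

```cpp
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

// "Append" to a Parquet file by reading it back, concatenating the new rows
// and writing a fresh file next to it.
arrow::Status AppendByRewrite(const std::string& path,
                              const std::shared_ptr<arrow::Table>& new_rows) {
  // Read the existing file fully into memory.
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> existing;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&existing));

  // Concatenate old + new (the schemas must match).
  ARROW_ASSIGN_OR_RAISE(auto combined,
                        arrow::ConcatenateTables({existing, new_rows}));

  // Re-save to a temporary path so a failed write doesn't clobber the
  // original; rename it over `path` afterwards (e.g. std::filesystem::rename).
  ARROW_ASSIGN_OR_RAISE(auto outfile,
                        arrow::io::FileOutputStream::Open(path + ".tmp"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *combined, arrow::default_memory_pool(), outfile,
      /*chunk_size=*/1 << 20));
  return outfile->Close();
}
```

Note that this rewrites the whole file on every "append", so it is only
reasonable for fairly small files - which is exactly why it's not what you're
looking for here.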
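
For point 2, a Kafka Connect S3 sink configuration could look roughly like
the following. Treat it as an illustration only - the topic, bucket and sizes
are made up, and the exactly-once behaviour depends on a deterministic
partitioner and rotation settings, as explained in the Confluent post linked
above:

```properties
# Illustrative Kafka Connect S3 sink config (placeholder names and sizes).
name=events-to-s3
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=events
s3.region=us-east-1
s3.bucket.name=my-landing-bucket
storage.class=io.confluent.connect.s3.storage.S3Storage
# ParquetFormat needs schema'd records (e.g. Avro + Schema Registry).
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
# Deterministic flushing/rotation is what makes exactly-once possible.
flush.size=100000
rotate.interval.ms=600000
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='date'=YYYY-MM-dd/'hour'=HH
locale=en-US
timezone=UTC
timestamp.extractor=Record
```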
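
And for point 3, here is what a periodic compaction/re-partitioning pass
could look like with the Arrow C++ datasets API (assuming a reasonably recent
Arrow, 4.0 or later; the directories and the "customer" partition column are
hypothetical):

```cpp
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

// Read all the small files written during one day and rewrite them as a
// Hive-partitioned dataset (here partitioned by a "customer" column).
arrow::Status CompactDay(const std::string& in_dir, const std::string& out_dir) {
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();

  // Discover the day's files.
  arrow::fs::FileSelector selector;
  selector.base_dir = in_dir;
  selector.recursive = true;
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      arrow::dataset::FileSystemDatasetFactory::Make(
          fs, selector, format, arrow::dataset::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());

  // Rewrite, partitioned by the column you query on most.
  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = fs;
  write_options.base_dir = out_dir;
  write_options.partitioning = std::make_shared<arrow::dataset::HivePartitioning>(
      arrow::schema({arrow::field("customer", arrow::utf8())}));
  write_options.basename_template = "part-{i}.parquet";
  return arrow::dataset::FileSystemDataset::Write(write_options, scanner);
}
```

Run something like that once a day over the previous day's directory and you
go from 1000+ small files to a handful of larger, well-partitioned ones.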

Elad

On Wed, May 26, 2021 at 7:32 PM Xander Dunn <xander@xander.ai> wrote:

> I have a very long-running (months) program that is streaming in data
> continually, processing it, and saving it to file using Arrow. My current
> solution is to buffer several million rows and write them to a new .parquet
> file each time. This works, but produces 1000+ files every day.
>
> If I could, I would just append to the same file for each day. I see an
> `arrow::fs::FileSystem::OpenAppendStream` - what file formats does this
> work with? Can I append to .parquet or .feather files? Googling seems to
> indicate these formats can't be appended to.
>
> Using the `parquet::StreamWriter
> <https://arrow.apache.org/docs/cpp/parquet.html?highlight=writetable#writetable>`,
> could I continually stream rows to a single file throughout the day? What
> happens if the program is unexpectedly terminated? Would everything in the
> currently open monolithic file be lost? I would be streaming rows to a
> single .parquet file for 24 hours.
>
> Thanks,
> Xander
>
>
