arrow-user mailing list archives

From Xander Dunn <xan...@xander.ai>
Subject Re: Long-Running Continuous Data Saving to File
Date Sat, 29 May 2021 01:05:43 GMT
Thanks to both of you, this is helpful.


On Wed, May 26, 2021 at 6:07 PM, Weston Pace <weston@ursacomputing.com>
wrote:

> Elad's advice is very helpful.  This is not a problem that Arrow solves
> today (to the best of my knowledge).  It is a topic that comes up
> periodically[1][2][3].  If a crash happens while your parquet stream writer
> is open then the most likely outcome is that you will be missing the footer
> (this gets written on close) and be unable to read the file (although it
> could presumably be recovered).  The parquet format may be able to support
> an append mode but readers don't typically support it.
>
> I believe a common approach to this problem is to dump out lots of small
> files as the data arrives and then periodically batch them together.  Kafka
> is a great way to do this but it could be done with a single process as
> well.  If you go very far down this path you will likely run into concerns
> like durability and schema evolution so I don't mean to imply that it is
> trivial :)
>
> [1] https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file
> [2] https://issues.apache.org/jira/browse/PARQUET-1154
> [3] https://lists.apache.org/thread.html/r7efad314abec0219016886eaddc7ba79a451087b6324531bdeede1af%40%3Cdev.arrow.apache.org%3E
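>
> To make the "lots of small files" idea concrete, a minimal (untested) C++
> sketch might look like the following; the helper name, file naming, and
> chunk size are just placeholders, not anything Arrow prescribes:
>
>   #include <arrow/api.h>
>   #include <arrow/io/file.h>
>   #include <parquet/arrow/writer.h>
>
>   // Flush one buffered batch to its own small Parquet file. WriteTable
>   // opens the sink, writes the data, and finalizes the footer in a single
>   // call, so a crash costs at most the batch currently being written
>   // rather than a whole day's worth of data.
>   arrow::Status FlushBatchToFile(const std::shared_ptr<arrow::Table>& batch,
>                                  const std::string& path) {
>     ARROW_ASSIGN_OR_RAISE(auto outfile,
>                           arrow::io::FileOutputStream::Open(path));
>     return parquet::arrow::WriteTable(*batch, arrow::default_memory_pool(),
>                                       outfile, /*chunk_size=*/1 << 20);
>   }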
>
> On Wed, May 26, 2021 at 7:39 AM Elad Rosenheim <elad@dynamicyield.com>
> wrote:
>
> Hi,
>
> While I'm not using the C++ version of Arrow, the issue you're talking
> about is a very common concern.
>
> There are a few points to discuss here:
>
> 1. Generally, Parquet files cannot be appended to. You could of course
> load the file into memory, add more data, and re-save it, but that's not
> really what you're looking for... Tools like `parquet-tools` can
> concatenate files by creating a new file with two (or more) row groups,
> but that's not a very good solution either. Having multiple row groups in
> a single file is sometimes desirable, but in this case it would most
> probably just produce a less well-compressed file.
>
> 2. The other concern is reliability - a process that holds a big batch in
> memory and then spills it to disk every X minutes/rows/bytes is bound to
> have issues when things crash, get stuck, or need to go down for
> maintenance. You probably want guarantees as close to "exactly once" as
> possible (the holy grail...). One common solution is to write to Kafka and
> have a consumer that periodically reads a batch of messages and stores
> them to a file. This is nowadays provided by Kafka Connect
> <https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once/>,
> thankfully. Anyway, the "exactly once" guarantee stops at this point; for
> anything that happens downstream you'd need to handle it yourself.
>
> 3. Then, you're back to the question of many, many files per day... there
> is no magical solution to this. You may need a scheduled task that reads
> the files every X hours (or once a day?) and re-partitions the data in
> whatever way makes the most sense for processing/querying later - perhaps
> by date, perhaps by customer, or both. There are various tools that help
> with this (a rough sketch of such a compaction pass is below).
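>
> For that compaction task, a rough (untested) C++ sketch of reading a set
> of small Parquet files and rewriting them as one larger file could look
> like this; the function name and paths are invented for illustration:
>
>   #include <arrow/api.h>
>   #include <arrow/io/file.h>
>   #include <parquet/arrow/reader.h>
>   #include <parquet/arrow/writer.h>
>
>   // Read each small file into a Table, concatenate, and write one file.
>   arrow::Status CompactFiles(const std::vector<std::string>& inputs,
>                              const std::string& output_path) {
>     std::vector<std::shared_ptr<arrow::Table>> tables;
>     for (const auto& path : inputs) {
>       ARROW_ASSIGN_OR_RAISE(auto infile,
>                             arrow::io::ReadableFile::Open(path));
>       std::unique_ptr<parquet::arrow::FileReader> reader;
>       ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
>           infile, arrow::default_memory_pool(), &reader));
>       std::shared_ptr<arrow::Table> table;
>       ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
>       tables.push_back(std::move(table));
>     }
>     // All inputs must share the same schema for a plain concatenation.
>     ARROW_ASSIGN_OR_RAISE(auto combined, arrow::ConcatenateTables(tables));
>     ARROW_ASSIGN_OR_RAISE(auto outfile,
>                           arrow::io::FileOutputStream::Open(output_path));
>     return parquet::arrow::WriteTable(*combined, arrow::default_memory_pool(),
>                                       outfile, /*chunk_size=*/1 << 20);
>   }
>
> Run on a schedule (cron, Airflow, etc.), this replaces the day's pile of
> small files with a single, better-compressed one.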
>
> Elad
>
> On Wed, May 26, 2021 at 7:32 PM Xander Dunn <xander@xander.ai> wrote:
>
> I have a very long-running (months) program that is streaming in data
> continually, processing it, and saving it to file using Arrow. My current
> solution is to buffer several million rows and write them to a new .parquet
> file each time. This works, but produces 1000+ files every day.
>
> If I could, I would just append to the same file for each day. I see an
> `arrow::fs::FileSystem::OpenAppendStream` - what file formats does this
> work with? Can I append to .parquet or .feather files? Googling seems to
> indicate these formats can't be appended to.
>
> Using the `parquet::StreamWriter
> <https://arrow.apache.org/docs/cpp/parquet.html?highlight=writetable#writetable>`,
> could I continually stream rows to a single file throughout the day? What
> happens if the program is unexpectedly terminated? Would everything in the
> currently open monolithic file be lost? I would be streaming rows to a
> single .parquet file for 24 hours.
>
> Thanks,
> Xander
>
>
