arrow-user mailing list archives

From Xander Dunn <xan...@xander.ai>
Subject Long-Running Continuous Data Saving to File
Date Wed, 26 May 2021 16:32:00 GMT
I have a very long-running (months) program that is streaming in data
continually, processing it, and saving it to disk using Arrow. My current
solution is to buffer several million rows and write them to a new .parquet
file each time. This works, but produces 1000+ files every day.
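
A minimal sketch of this buffer-and-rotate pattern, using only the C++ standard library. `Row`, `RotatingBatchWriter`, and the file-naming scheme are hypothetical names chosen for illustration and are not from the original program; the `write` callback stands in for whatever Arrow/Parquet call actually produces each file.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type; the real schema is not given in this message.
struct Row {
    long long timestamp;
    double value;
};

// Buffers rows in memory and hands each full batch to a caller-supplied
// writer together with a fresh file name. The writer is a placeholder for
// the code that actually produces the .parquet file.
class RotatingBatchWriter {
public:
    using WriteFn =
        std::function<void(const std::string&, const std::vector<Row>&)>;

    RotatingBatchWriter(std::string prefix, std::size_t rows_per_file,
                        WriteFn write)
        : prefix_(std::move(prefix)),
          rows_per_file_(rows_per_file),
          write_(std::move(write)) {
        assert(rows_per_file_ > 0);
    }

    // Adds one row and starts a new file once the buffer is full.
    void Append(const Row& row) {
        buffer_.push_back(row);
        if (buffer_.size() >= rows_per_file_) Flush();
    }

    // Writes any buffered rows to a new file; a no-op on an empty buffer.
    void Flush() {
        if (buffer_.empty()) return;
        write_(prefix_ + "-" + std::to_string(next_index_++) + ".parquet",
               buffer_);
        buffer_.clear();
    }

    std::size_t files_written() const { return next_index_; }

private:
    std::string prefix_;
    std::size_t rows_per_file_;
    WriteFn write_;
    std::vector<Row> buffer_;
    std::size_t next_index_ = 0;
};
```

In this sketch, rows that are still in the in-memory buffer when the process dies are lost, while every file already handed to the callback is unaffected. A smaller `rows_per_file` reduces that exposure at the cost of more files.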

If I could, I would just append to the same file for each day. I see an
`arrow::fs::FileSystem::OpenAppendStream` - what file formats does this
work with? Can I append to .parquet or .feather files? Googling seems to
indicate that these formats can't be appended to.

Using the `parquet::StreamWriter`
(<https://arrow.apache.org/docs/cpp/parquet.html?highlight=writetable#writetable>),
could I continually stream rows to a single file throughout the day? What
happens if the program is unexpectedly terminated? Would everything in the
currently open monolithic file be lost? I would be streaming rows to a
single .parquet file for 24 hours.

Thanks,
Xander
