(I should preface this by saying that I am extremely new to parquet.)

I have an application (written in rust) that logs high-frequency data to an sqlite database. Our analysts would prefer to move the data to parquet.

I have written a simple proof of concept based on the 0.16.0 release of parquet, and am getting quite poor write performance. I would like to verify that I am approaching the problem correctly and using the tooling properly.

My data is entirely flat and contains ~1200 columns. Something like:

message my_data {
  required INT32 data1;
  required INT32 data2;
  ...
  required INT32 data1200;
}
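
For concreteness, a schema of that shape can be generated rather than written out by hand and then parsed with the crate's parse_message_type. The sketch below is only an illustration: the build_schema helper is made up, and I am assuming the Rc-based pointer type the 0.16.0 docs use (swap in Arc if your build expects it).

use parquet::errors::ParquetError;
use parquet::schema::parser::parse_message_type;
use parquet::schema::types::Type;
use std::rc::Rc;

// Build the ~1200-column message type programmatically instead of
// spelling every field out by hand, then parse it into a parquet Type.
// Assumes the Rc-based TypePtr alias from the 0.16.0 release.
fn build_schema(num_columns: usize) -> Result<Rc<Type>, ParquetError> {
    let mut message = String::from("message my_data {\n");
    for i in 1..=num_columns {
        message.push_str(&format!("  REQUIRED INT32 data{};\n", i));
    }
    message.push_str("}");
    Ok(Rc::new(parse_message_type(&message)?))
}

That Rc<Type>, together with an Rc<WriterProperties>, is what I hand to SerializedFileWriter::new.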

The program flow simply mirrors the example shown here, and is as follows -- I open a file with a SerializedFileWriter and get a RowGroupWriter from it. From that I get a typed ColumnWriter for each column and call write_batch with its new data. (My supposition is that this effectively creates a transaction for each column on each update, opening and closing the file and making many small writes.)
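
Stripped down, that write path looks roughly like the sketch below. It is simplified: write_update is a hypothetical helper handling a single incoming update, and the real code deals with ~1200 columns and proper error handling.

use parquet::column::writer::ColumnWriter;
use parquet::errors::ParquetError;
use parquet::file::writer::{FileWriter, SerializedFileWriter};
use std::fs::File;

// One call per incoming update: `row` holds the latest INT32 value for
// each column, in schema order. Every update ends up in its own row
// group, so each column writer only ever sees a single value.
fn write_update(
    writer: &mut SerializedFileWriter<File>,
    row: &[i32],
) -> Result<(), ParquetError> {
    let mut row_group_writer = writer.next_row_group()?;
    let mut i = 0;
    while let Some(mut col_writer) = row_group_writer.next_column()? {
        if let ColumnWriter::Int32ColumnWriter(ref mut typed) = col_writer {
            // All columns are flat and required, so no definition or
            // repetition levels are passed.
            typed.write_batch(&[row[i]], None, None)?;
        }
        row_group_writer.close_column(col_writer)?;
        i += 1;
    }
    writer.close_row_group(row_group_writer)?;
    Ok(())
}

As far as I can tell, next_column only hands each column out once per row group, so every update has to become its own row group, which I suspect is where the overhead comes from.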

In a parallel effort, another developer wrote a second proof of concept using the cpp variant of parquet. This version is many, many times faster. They describe their flow as follows -- I use a ParquetFileWriter to create an AppendBufferedRowGroup. From that I get a writer for the specific type of data that I want to write and call its WriteBatch method. After I have written N rows (default N = 1000), I flush the FileOutputStream that the ParquetFileWriter is using, and finally I close the ParquetFileWriter. I do that for each batch of N.

So my questions come in multiple parts -

  1. Is my rust workflow "correct"? I recognize that the reference implementation I am mirroring involves nested data structures, whereas my use-case has none (no repetition or definition levels).

  2. Is there a way to get the workflow outlined in the cpp example, but using the rust API? I recognize that the rust parquet writer is a WIP. (I sketch what I have in mind after this list.)

  3. If the API does not support this buffered functionality (yet?), is there a timeline for when it will?
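
To make question 2 concrete: the closest approximation I can see with the current Rust API is to do the batching myself -- buffer N updates per column in memory and write each batch as one row group, so each column gets a single write_batch call per N rows. A rough sketch follows; flush_buffered and the buffer layout are my own invention, not anything the crate provides.

use parquet::column::writer::ColumnWriter;
use parquet::errors::ParquetError;
use parquet::file::writer::{FileWriter, SerializedFileWriter};
use std::fs::File;

// Hypothetical workaround: the caller has accumulated N values per
// column; emit them all as a single row group so each column gets one
// write_batch call per N rows instead of one per row.
fn flush_buffered(
    writer: &mut SerializedFileWriter<File>,
    buffers: &[Vec<i32>], // one buffer per column, each holding N values
) -> Result<(), ParquetError> {
    let mut row_group_writer = writer.next_row_group()?;
    let mut col = 0;
    while let Some(mut col_writer) = row_group_writer.next_column()? {
        if let ColumnWriter::Int32ColumnWriter(ref mut typed) = col_writer {
            typed.write_batch(&buffers[col], None, None)?;
        }
        row_group_writer.close_column(col_writer)?;
        col += 1;
    }
    writer.close_row_group(row_group_writer)?;
    Ok(())
}

Even if that is the right approach, all of the buffering lives in my code rather than in the writer, which is what prompts questions 2 and 3.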


- Sam