Hi Chris,
Unfortunately, I don't think there is anything built into the writers that handles multiple dictionaries well.  The PR that added unit tests that write out replacement and delta dictionaries can be found here [1]; it appears to write directly to an output stream, bypassing the writer classes.  It probably makes sense to expose an additional "writeDictionary" method on the writers to make this easier, or to rethink the overall API with respect to dictionaries a little bit.


[1] https://github.com/apache/arrow/pull/5945/files
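
To make the delta-dictionary idea concrete, here is a minimal sketch in plain Java. It is a conceptual model only and does not use Arrow's writer classes: the class DeltaDictionarySketch and its methods encode and takeDelta are hypothetical names invented for illustration, not part of any Arrow API. It assumes the behaviour described in the Arrow format, where a delta dictionary batch carries only new entries, which the reader appends to the dictionary it already holds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Conceptual model of a growing dictionary; NOT Arrow's Java API. */
public class DeltaDictionarySketch {
    // All dictionary values seen so far, in index order.
    private final List<String> values = new ArrayList<>();
    // Reverse lookup from value to its dictionary index.
    private final Map<String, Integer> indexOf = new HashMap<>();
    // Number of entries already handed out by takeDelta().
    private int emitted = 0;

    /** Encodes one batch as indices, adding unseen values to the dictionary. */
    public int[] encode(List<String> batch) {
        int[] indices = new int[batch.size()];
        for (int i = 0; i < batch.size(); i++) {
            String value = batch.get(i);
            Integer index = indexOf.get(value);
            if (index == null) {
                index = values.size();
                values.add(value);
                indexOf.put(value, index);
            }
            indices[i] = index;
        }
        return indices;
    }

    /** Returns only the entries added since the previous call (the "delta"). */
    public List<String> takeDelta() {
        List<String> delta = new ArrayList<>(values.subList(emitted, values.size()));
        emitted = values.size();
        return delta;
    }

    public static void main(String[] args) {
        DeltaDictionarySketch dict = new DeltaDictionarySketch();

        // First batch: every value is new, so the delta is the whole dictionary.
        int[] first = dict.encode(Arrays.asList("a", "b", "a"));
        System.out.println("delta=" + dict.takeDelta() + " indices=" + Arrays.toString(first));

        // Second batch: only "c" is new, so only "c" would need to be sent.
        int[] second = dict.encode(Arrays.asList("b", "c"));
        System.out.println("delta=" + dict.takeDelta() + " indices=" + Arrays.toString(second));
    }
}
```

In a real stream, each non-empty delta would be written as a dictionary batch ahead of the record batch whose indices refer to it. A hypothetical "writeDictionary" method on the writers would be the natural place for that step.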

On Mon, Jul 27, 2020 at 5:29 AM Chris Nuernberger <chris@techascent.com> wrote:
Hi Micah,

I had seen that page, yes, and my specific question was around delta dictionaries:


There doesn't seem to be a way to access this functionality via Java and the above stream example contains one batch and one dictionary batch.

On Sun, Jul 26, 2020 at 10:54 PM Micah Kornfield <emkornfield@gmail.com> wrote:
Hi Chris,
Have you read through the "reading and writing streaming format" docs [1]?  If that doesn't work or you have something different in mind, some code samples of what you are currently doing might help.

I'll add that I think the dictionary APIs in Java aren't the most ergonomic, so if you have ideas on improving them, feel free to propose something.


On Sat, Jul 25, 2020 at 5:49 AM Chris Nuernberger <chris@techascent.com> wrote:

Using the Java API for serialization, it is not clear to me how to use the per-batch dictionary functionality of the Arrow binary format.  Specifically, the stream writer class expects the dictionaries to be defined when it loads the schema, so it isn't clear how it would handle assigning a dictionary to a provider when saving a batch.

Is there an example that clarifies this use case?

Thanks for any input or feedback,