arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Per-batch dictionary example in Java
Date Tue, 28 Jul 2020 03:58:19 GMT
Hi Chris,
Unfortunately, I don't think there is anything built into the writers that
handles multiple dictionaries well.  The PR that added unit tests that
write out replacement and delta dictionaries can be found here [1], and it
appears to write directly to an output stream, bypassing the writer
classes.  It probably makes sense to expose an additional "writeDictionary"
method on the writers to make this easier, or to rethink the overall API with
respect to dictionaries a little bit.

-Micah

[1] https://github.com/apache/arrow/pull/5945/files
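
For readers unfamiliar with delta dictionaries, here is a minimal conceptual
sketch of what a writer has to track per batch. It is plain Java and does not
use any Arrow classes; the names DeltaDictionarySketch, EncodedBatch, encode
and decode are hypothetical and exist only for illustration. The assumption,
taken from the dictionary-messages section of the format docs, is that a delta
dictionary batch carries only the entries that are new since the previous
batch, and that the reader appends them to the dictionary it already holds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Conceptual sketch only: plain Java, no Arrow classes or Arrow IPC involved. */
final class DeltaDictionarySketch {

    /** One encoded batch: entries first seen in this batch, plus the index column. */
    static final class EncodedBatch {
        final List<String> deltaEntries; // what a delta dictionary batch would carry
        final int[] indices;             // what the record batch column would carry

        EncodedBatch(List<String> deltaEntries, int[] indices) {
            this.deltaEntries = deltaEntries;
            this.indices = indices;
        }
    }

    // Writer-side dictionary: value -> index, growing across batches.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    /** Encodes one batch, collecting values not seen in any earlier batch as the delta. */
    EncodedBatch encode(List<String> values) {
        List<String> delta = new ArrayList<>();
        int[] indices = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            String value = values.get(i);
            Integer index = dictionary.get(value);
            if (index == null) {
                index = dictionary.size(); // next free slot at the end
                dictionary.put(value, index);
                delta.add(value);
            }
            indices[i] = index;
        }
        return new EncodedBatch(delta, indices);
    }

    /** Reader side: append the delta to the existing dictionary, then resolve indices. */
    static List<String> decode(List<String> readerDictionary, EncodedBatch batch) {
        readerDictionary.addAll(batch.deltaEntries);
        List<String> decoded = new ArrayList<>();
        for (int index : batch.indices) {
            decoded.add(readerDictionary.get(index));
        }
        return decoded;
    }
}
```

In a real stream, the delta entries would have to be written as a dictionary
batch ahead of the record batch that references them, which is the step the
current writer classes do not appear to expose.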


On Mon, Jul 27, 2020 at 5:29 AM Chris Nuernberger <chris@techascent.com>
wrote:

> Hi Micah,
>
> I had seen that page, yes, and my specific question was around delta
> dictionaries:
>
> https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages
>
> There doesn't seem to be a way to access this functionality via Java, and
> the above stream example contains only one batch and one dictionary batch.
>
> On Sun, Jul 26, 2020 at 10:54 PM Micah Kornfield <emkornfield@gmail.com>
> wrote:
>
>> Hi Chris,
>> Have you read through the "reading and writing streaming format" docs
>> [1]?  If this doesn't work or you have something different in mind, some
>> code samples of what you are currently doing might help.
>>
>> I'll add that I think the dictionary APIs in Java aren't the most
>> ergonomic, so if you have ideas on improving them, feel free to
>> propose something.
>>
>> Thanks,
>> Micah
>>
>>
>> [1]
>> https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format
>>
>> On Sat, Jul 25, 2020 at 5:49 AM Chris Nuernberger <chris@techascent.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Using the Java API for serialization, it is not clear to me how to
>>> utilize the per-batch dictionary functionality of the Arrow binary format.
>>> Specifically, the stream writer class expects the dictionaries to be defined
>>> when it loads the schema, so it isn't clear how it will handle assigning a
>>> dictionary to a provider when saving a batch.
>>>
>>> Is there an example that clarifies this use case?
>>>
>>> Thanks for any input or feedback,
>>>
>>> Chris
>>>
>>
