arrow-user mailing list archives

From Wes McKinney
Subject Re: Batch writing/reading tables with varying dictionary (in v0.14.1)
Date Mon, 14 Oct 2019 20:55:41 GMT
hi Thomas,

The stream writer class currently only supports a constant dictionary.
The work in ARROW-3144 moved the dictionary out of the schema and into
the DictionaryArray data structure, which was a necessary first step
toward allowing dictionaries to change within a stream.

To support your use case, we either need dictionary deltas or
dictionary replacements to be implemented. These are provided for in
the format, but have not been implemented yet in C++.
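
To illustrate the difference between the two options, here is a rough
sketch in plain Python. It does not use the Arrow API, and the class and
function names (DeltaDictionaryEncoder, encode_with_replacement) are made
up for this example. With deltas, each batch only ships the dictionary
entries not seen before, and indices refer to the accumulated dictionary;
with replacements, each batch ships a complete new dictionary.

```python
# Conceptual sketch only: plain Python, not the Arrow IPC implementation.
from typing import Dict, List, Tuple


class DeltaDictionaryEncoder:
    """Hypothetical helper that grows one dictionary across batches."""

    def __init__(self) -> None:
        self._index_of: Dict[str, int] = {}
        self.dictionary: List[str] = []  # accumulated dictionary so far

    def encode(self, values: List[str]) -> Tuple[List[str], List[int]]:
        """Return (delta, indices) for one batch.

        delta holds only the values not seen in earlier batches; indices
        point into the accumulated dictionary, so older indices stay valid.
        """
        delta: List[str] = []
        indices: List[int] = []
        for value in values:
            idx = self._index_of.get(value)
            if idx is None:
                idx = len(self.dictionary)
                self._index_of[value] = idx
                self.dictionary.append(value)
                delta.append(value)
            indices.append(idx)
        return delta, indices


def encode_with_replacement(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replacement: the batch carries its own complete dictionary."""
    dictionary = sorted(set(values))
    index_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index_of[v] for v in values]


encoder = DeltaDictionaryEncoder()
delta1, idx1 = encoder.encode(["a", "b", "a"])  # first batch: everything is new
delta2, idx2 = encoder.encode(["b", "c"])       # second batch: only "c" is new
repl_dict, repl_idx = encode_with_replacement(["b", "c"])
```

In the delta case a reader has to apply every delta in order to rebuild
the dictionary; in the replacement case each batch is only valid against
the most recent dictionary it was sent with.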

Note there's a mailing list thread on dev@ going on right now about
finalizing low-level details of dictionary encoding in the columnar
format specification.

I just opened since I
didn't see another issue covering this

- Wes

On Mon, Oct 14, 2019 at 8:41 AM Thomas Buhrmann wrote:
> Hi,
> My use case involves processing large datasets in batches (of rows), each batch resulting
> in a DataFrame that I'm serializing to a single file on disk via RecordBatchStreamWriter (to
> end up with a file that can in turn be read in batches). My problem is that some columns are
> pandas categorical types, for which I can't know ahead of time all the possible categories.
> And since the RecordBatchStreamWriter accepts only a single schema, I can't seem to find a
> way to update the Arrow dictionary, or write a new schema for each RecordBatch. This results
> in an invalid stream/file with dictionary indices that don't match the schema. Is there currently
> a way to do this using the high-level APIs? Or would I have to manually construct the stream
> using each batch's schema etc.?
> It seems that this may be related to the open issues in ARROW-3144 (ARROW-5279, ARROW-5336)
> and the discussion in PR-3165, from which I understand that this may be supported already
> when writing to parquet, but not in IPC? Is there any other workaround I could use right now?
> Many thanks,
> T
