FWIW, in the Java client it's https://github.com/apache/arrow/blob/apache-arrow-3.0.0/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowStreamReader.java#L131 that's causing the aforementioned StackOverflowError when reading lots of dictionaries from a stream.
i.e. the recursive construct:

    public boolean loadNextBatch() throws IOException {
      ..
      if (..) return true;
      else {
        ..
        return loadNextBatch();
      }
    }

Not sure if that qualifies as a bug, as I think the recursion only overflows at a depth of multiple thousands, but perhaps it's of interest.
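
If it were ever changed, replacing the tail call with a loop would keep the stack depth constant no matter how many dictionary batches precede the record batch. A rough sketch of that shape below - Message/readNextMessage/isRecordBatch/loadRecordBatch/loadDictionary are hypothetical stand-ins here, not the actual Arrow Java API:

    public boolean loadNextBatch() throws IOException {
      while (true) {
        Message msg = readNextMessage();   // hypothetical: read the next IPC message
        if (msg == null) {
          return false;                    // end of stream
        }
        if (isRecordBatch(msg)) {
          loadRecordBatch(msg);            // hypothetical: populate the VectorSchemaRoot
          return true;
        }
        loadDictionary(msg);               // dictionary batch: load it and loop for the next message
      }
    }

Same behaviour, but thousands of dictionary batches would only cost loop iterations rather than stack frames.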


On Thu, Feb 25, 2021 at 8:11 PM Wes McKinney <wesmckinn@gmail.com> wrote:
I'm not sure if it's possible at the moment, but it SHOULD be made
possible. See ARROW-5340

On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters
<joris.mg.peeters@gmail.com> wrote:
>
> Hello,
>
> I have a pandas DataFrame with many string columns (>30,000), and they share a low-cardinality set of values (e.g. size 100). I'd like to convert this to an Arrow table of dictionary encoded columns (let's say int16 for the index cols), but with just one shared dictionary of strings.
> This is to avoid ending up with >30,000 tiny dictionaries on the wire, which doesn't even load in e.g. Java (due to a stack overflow error).
>
> Despite my efforts, I haven't really been able to achieve this with the public APIs I could find. Does anyone have an idea? I'm using pyarrow 3.0.0.
>
> For a mickey mouse example, I'm looking at e.g.
>
> df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
>
> and would like a Table with dictionary-encoded columns a and b, both nullable, that both refer to the same dictionary with id=0 (or whatever id) containing ['foo', 'bar', 'quux'].
>
> Thanks,
> -Joris.