arrow-user mailing list archives

From: Wes McKinney <wesmck...@gmail.com>
Subject: Re: pyarrow: write table where columns share the same dictionary
Date: Thu, 25 Feb 2021 20:11:04 GMT

I'm not sure if it's possible at the moment, but it SHOULD be made
possible. See ARROW-5340
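
For reference, the in-memory half of the request can be sketched as below. This is only an illustration under stated assumptions, not a confirmed solution: the helper names `encode_with_shared_dictionary` and `to_arrow_table` are hypothetical, and the sketch takes a plain dict of Python lists rather than a DataFrame. It builds one dictionary plus per-column integer codes, then wraps them with `pa.DictionaryArray.from_arrays` so every column refers to the same dictionary array. Whether the IPC writer then emits a single dictionary on the wire, rather than one per field, is exactly the open question above.

```python
def encode_with_shared_dictionary(columns):
    """Return (dictionary, codes) where all columns index one dictionary.

    `columns` maps column name -> list of str or None. None is kept as a
    null code. Dictionary order is first-seen order, column by column.
    """
    dictionary = []   # shared list of distinct values
    position = {}     # value -> index into `dictionary`
    codes = {}        # column name -> list of int or None
    for name, values in columns.items():
        column_codes = []
        for value in values:
            if value is None:
                column_codes.append(None)
                continue
            if value not in position:
                position[value] = len(dictionary)
                dictionary.append(value)
            column_codes.append(position[value])
        codes[name] = column_codes
    return dictionary, codes


def to_arrow_table(dictionary, codes):
    """Wrap the codes as dictionary-encoded Arrow columns (needs pyarrow).

    Assumption: int16 indices, so the dictionary must have < 32768 entries.
    """
    import pyarrow as pa  # imported lazily so the encoder works without it

    shared = pa.array(dictionary, type=pa.string())
    arrays = [
        pa.DictionaryArray.from_arrays(pa.array(idx, type=pa.int16()), shared)
        for idx in codes.values()
    ]
    return pa.Table.from_arrays(arrays, names=list(codes))


# The toy example from the quoted message, as plain lists.
dictionary, codes = encode_with_shared_dictionary(
    {"a": ["foo", None, "bar"], "b": [None, "quux", "foo"]}
)
```

Calling `to_arrow_table(dictionary, codes)` should give a table whose columns `a` and `b` are both nullable dictionary columns over the same values; it does not by itself guarantee a shared dictionary id in the serialized stream.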

On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters
<joris.mg.peeters@gmail.com> wrote:
>
> Hello,
>
> I have a pandas DataFrame with many string columns (>30,000), and they
> share a low-cardinality set of values (e.g. size 100). I'd like to convert
> this to an Arrow table of dictionary-encoded columns (let's say int16 for
> the index cols), but with just one shared dictionary of strings.
> This is to avoid ending up with >30,000 tiny dictionaries on the wire,
> which doesn't even load in e.g. Java (due to a stack overflow error).
>
> Despite my efforts, I haven't really been able to achieve this with the
> public APIs I could find. Does anyone have an idea? I'm using pyarrow 3.0.0.
>
> For a mickey mouse example, I'm looking at e.g.
>
> df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
>
> and would like a Table with dictionary-encoded columns a and b, both
> nullable, that both refer to the same dictionary with id=0 (or whatever id)
> containing ['foo', 'bar', 'quux'].
>
> Thanks,
> -Joris.
