arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Indexing, encoding, transformations and processing with PyArrow - GitHub 6284
Date Mon, 27 Jan 2020 16:25:51 GMT
hi Athanassios,

I asked to move this discussion here because we use the dev@ and user@
mailing lists for discussions (this is explained in the GitHub issue
template https://github.com/apache/arrow/blob/master/.github/ISSUE_TEMPLATE.md).

In the issue you cited inconsistent behavior with dictionary_encode.
We don't consider this to be inconsistent; see this Jupyter notebook:

https://gist.github.com/wesm/2e29b7724571d5251051189846bfa99c

NumPy coerces None to NaN in numpy.array (when a floating-point dtype
is requested). In pyarrow.array, None becomes null for all data types.
However, NaN is not a null sentinel in Apache Arrow like it is in
pandas, so it is treated as a valid floating point value in algorithms
like dictionary_encode. Given that, if you need null and NaN to be
handled equivalently in your system, you may indeed need to maintain
some custom code if there isn't anything in the project that does
precisely what you need.

- Wes


On Mon, Jan 27, 2020 at 8:55 AM Athanassios I. Hatzis
<athanassios@healis.eu> wrote:
>
> Hi, recently I have started experimenting with PyArrow for the needs
> of my TRIADB project. Kudos to Wes and his team on leading one of the
> best open-source IT projects in data engineering. Definitely a wise
> decision to continue the success story of Pandas on the right track!
>
> At this stage I am trying to make a new release of TRIADB that will
> handle metadata management and fast ingestion of data in memory for
> transformations and basic query operations.
>
> Secondary indexes, dictionary encoding and adjacency lists are a core
> part of the TRIADB project, which is why I posted the issue about the
> Array.dictionary_encode method
> (https://github.com/apache/arrow/issues/6284). Aren't my example and
> description clear? What exactly would you like me to elaborate on?
>
> I also noticed that there is NumPy integration and you can convert
> easily from NumPy to Arrow, but the reverse direction has several
> limitations. For example, I cannot create a view for a StringArray
> (NotImplementedError: NumPy array view is only supported for primitive
> types). But string() (utf8) is in the list of your primitive types.
> Any plans for supporting this type with NumPy soon?
>
> Kind regards
> Athanassios
>
>
