arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Attn: Wes, Re: Masked Arrays
Date Mon, 30 Mar 2020 16:37:33 GMT
Social and technical reasons I guess. Empirically it's just not used much.

You can see my comments about numpy.ma in my 2010 paper about pandas

https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf

At least in 2010, there were notable performance problems when using
MaskedArray for computations. From the paper:

"We chose to use NaN as opposed to using NumPy MaskedArrays for
performance reasons (which are beyond the scope of this paper), as NaN
propagates in floating-point operations in a natural way and can be
easily detected in algorithms."
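
To illustrate what that passage describes, here is a minimal sketch in
plain Python. It is not taken from the paper; the variable names are
made up for illustration, and an explicit boolean list stands in for
the mask that numpy.ma would carry alongside the data.

```python
import math

# Missing value encoded in-band as NaN (illustrative data).
values = [1.0, math.nan, 3.0]

# NaN propagates through ordinary floating-point arithmetic:
# no special handling is needed for the missing slot.
doubled = [v * 2.0 for v in values]

# Missing entries are detected with a single predicate.
missing = [math.isnan(v) for v in doubled]

# An aggregate that skips missing entries filters on that predicate.
total = sum(v for v in doubled if not math.isnan(v))

# The mask-based alternative keeps a separate boolean sequence
# (True = missing) that every operation has to carry and consult.
data = [1.0, 0.0, 3.0]          # placeholder 0.0 in the masked slot
mask = [False, True, False]
masked_total = sum(d * 2.0 for d, m in zip(data, mask) if not m)
```

Both approaches give the same total here; the difference is that the
NaN version needs no second array to be kept in sync with the data.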

On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <nugend@gmail.com> wrote:
>
> Thanks! Since I'm just using it to jump to Arrow, I think I'll stick with it.
>
> Do you have any feelings about why NumPy's masked arrays didn't gain
> favor when many data representation formats explicitly support nullity
> (including Arrow)? Is it just that not carrying nulls in computations
> forward is preferable (that is, early filtering/value filling was
> easier)?
>
> -Dan Nugent
>
>
> On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <nugend@gmail.com> wrote:
>> >
>> > Didn’t want to follow up on this on the Jira issue earlier since
>> > it's sort of tangential to that bug and more of a usage question.
>> > You said:
>> >
>> > > I wouldn't recommend building applications based on them nowadays
>> > > since the level of support / compatibility in other projects is low.
>> >
>> > In my case, I am using them since it seemed like a straightforward
>> > representation of my data that has nulls, the format I’m converting
>> > from has zero-cost numpy representations, and converting from an
>> > internal format into Arrow in-memory structures appears zero-cost (or
>> > close to it) as well. I guess I can just provide the mask as an
>> > explicit argument, but my original desire to use it came from being
>> > able to exploit numpy.ma.concatenate in a way that saved some
>> > complexity in implementation.
>> >
>> > Since Arrow itself supports masking values with a bitfield, is there
>> > something intrinsic to the notion of array masks that is not well
>> > supported? Or do you just mean the specific numpy MaskedArray class?
>> >
>>
>> I mean just the numpy.ma module. Not many Python computing projects
>> nowadays treat MaskedArray objects as first class citizens. Depending
>> on what you need it may or may not be a problem. pyarrow supports
>> ingesting from MaskedArray as a convenience, but it would not be
>> common in my experience for a library's APIs to return MaskedArrays.
>>
>> > If this is too much of a numpy question rather than an arrow
>> > question, could you point me to where I can read up on masked array
>> > support, or maybe to the right place to ask the numpy community
>> > whether what I'm doing is appropriate?
>> >
>> > Thanks,
>> >
>> >
>> > -Dan Nugent
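
The quoted thread mentions two ways of carrying nulls from NumPy: a
MaskedArray built with numpy.ma.concatenate, or a plain array plus an
explicit mask. A minimal sketch of how the two relate, assuming only
NumPy and using made-up data (the arrays `a` and `b` are hypothetical):

```python
import numpy as np

# Two illustrative masked chunks; True in the mask means "null".
a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
b = np.ma.masked_array([4.0, 5.0], mask=[True, False])

# numpy.ma.concatenate joins the data and the masks in one call.
combined = np.ma.concatenate([a, b])

# Split the result into a plain ndarray and a full boolean mask.
# getmaskarray always returns an array, even when nothing is masked.
values = combined.data
mask = np.ma.getmaskarray(combined)

# The same pair built without a MaskedArray result: concatenate the
# data and the masks separately and keep them aligned by hand.
values2 = np.concatenate([a.data, b.data])
mask2 = np.concatenate([np.ma.getmaskarray(a), np.ma.getmaskarray(b)])

# Either pair could then be handed to an API that takes the mask as a
# separate argument (for example the `mask` parameter of pyarrow.array,
# not called here so the sketch depends only on NumPy).
```

The second form is the "explicit argument" route: slightly more
bookkeeping, but it does not rely on other libraries understanding
MaskedArray objects.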
