arrow-user mailing list archives

From Daniel Nugent <nug...@gmail.com>
Subject Re: Attn: Wes, Re: Masked Arrays
Date Mon, 30 Mar 2020 16:57:01 GMT
Ok. That actually aligns closely to what I'm familiar with. Good to know.

Thanks again for taking the time to respond,

-Dan Nugent


On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> Social and technical reasons I guess. Empirically it's just not used much.
>
> You can see my comments about numpy.ma in my 2010 paper about pandas:
>
> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
>
> At least in 2010, there were notable performance problems when using
> MaskedArray for computations:
>
> "We chose to use NaN as opposed to using NumPy MaskedArrays for
> performance reasons (which are beyond the scope of this paper), as NaN
> propagates in floating-point operations in a natural way and can be
> easily detected in algorithms."
>
> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <nugend@gmail.com> wrote:
> >
> > Thanks! Since I'm just using it to jump to Arrow, I think I'll stick
> > with it.
> >
> > Do you have any feelings about why Numpy's masked arrays didn't gain
> > favor when many data representation formats explicitly support nullity
> > (including Arrow)? Is it just that not carrying nulls in computations
> > forward is preferable (that is, early filtering/value filling was easier)?
> >
> > -Dan Nugent
> >
> >
> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <wesmckinn@gmail.com> wrote:
> >>
> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <nugend@gmail.com> wrote:
> >> >
> >> > Didn’t want to follow up on this on the Jira issue earlier since it's
> >> > sort of tangential to that bug and more of a usage question. You said:
> >> >
> >> > > I wouldn't recommend building applications based on them nowadays
> >> > > since the level of support / compatibility in other projects is low.
> >> >
> >> > In my case, I am using them since it seemed like a straightforward
> >> > representation of my data that has nulls, the format I’m converting
> >> > from has zero-cost numpy representations, and converting from an
> >> > internal format into Arrow in-memory structures appears zero cost (or
> >> > close to it) as well. I guess I can just provide the mask as an
> >> > explicit argument, but my original desire to use it came from being
> >> > able to exploit numpy.ma.concatenate in a way that saved some
> >> > complexity in implementation.
> >> >
> >> > Since Arrow itself supports masking values with a bitfield, is there
> >> > something intrinsic to the notion of array masks that is not well
> >> > supported? Or do you just mean the specific numpy MaskedArray class?
> >> >
> >>
> >> I mean just the numpy.ma module. Not many Python computing projects
> >> nowadays treat MaskedArray objects as first class citizens. Depending
> >> on what you need it may or may not be a problem. pyarrow supports
> >> ingesting from MaskedArray as a convenience, but it would not be
> >> common in my experience for a library's APIs to return MaskedArrays.
> >>
> >> > If this is too much of a numpy question rather than an arrow
> >> > question, could you point me to where I can read up on masked array
> >> > support, or to the right place to ask the numpy community whether
> >> > what I'm doing is appropriate?
> >> >
> >> > Thanks,
> >> >
> >> >
> >> > -Dan Nugent
>
