arrow-user mailing list archives

From Felix Benning <felix.benn...@gmail.com>
Subject Re: Attn: Wes, Re: Masked Arrays
Date Sun, 05 Apr 2020 16:34:29 GMT
Awesome, that was exactly what I was looking for, thank you!

On Sun, 5 Apr 2020 at 00:40, Wes McKinney <wesmckinn@gmail.com> wrote:

> I wrote a blog post a couple of years ago about this
>
> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>
> Pasha Stetsenko did a follow-up analysis that showed that my
> "sentinel" code could be significantly improved, see:
>
> https://github.com/st-pasha/microbench-nas/blob/master/README.md
>
> Generally speaking in Apache Arrow we've been happy to have a uniform
> representation of nullness across all types, both primitive (booleans,
> numbers, or strings) and nested (lists, structs, unions, etc.). Many
> computational operations (like elementwise functions) need not concern
> themselves with the nulls at all, for example, since the bitmap from
> the input array can be passed along (with zero copy even) to the
> output array.
>
> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <felix.benning@gmail.com>
> wrote:
> >
> > Does anyone have an opinion (or links) about Bitpattern vs Masked Arrays
> for NA implementations? There seems to have been a discussion about that in
> the numpy community in 2012
> https://numpy.org/neps/nep-0026-missing-data-summary.html without an
> apparent result.
> >
> > Summary of the Summary:
> > - The Bitpattern approach reserves one bitpattern of any type as na, the
> only type not having spare bitpatterns are integers which means this
> decreases their range by one. This approach is taken by R and was regarded
> as more performant in 2012.
> > - The Mask approach was deemed more flexible, since it would allow
> "degrees of missingness", and also cleaner/easier implementation.
> >
> > Since bitpattern checks would probably disrupt SIMD, I feel like some
> calculations (e.g. mean) would actually benefit more, from setting na
> values to zero, proceeding as if they were not there, and using the number
> of nas in the metadata to adjust the result. This of course does not work
> if two columns are used (e.g. scalar product), which is probably more
> important.
> >
> > Was using Bitmasks in Arrow a conscious performance decision? Or was the
> decision only based on the fact, that R and Bitpattern implementations in
> general are a niche, which means that Bitmasks are more compatible with
> other languages?
> >
> > I am curious about this topic, since the "lack of proper na support" was
> cited as the reason, why Python would never replace R in statistics.
> >
> > Thanks,
> >
> > Felix
> >
> >
> > On 31.03.20 14:52, Joris Van den Bossche wrote:
> >
> > Note that pandas is starting to use a notion of "masked arrays" as well,
> for example for its nullable integer data type, but also not using the
> np.ma masked array, but a custom implementation (for technical reasons in
> pandas this was easier).
> >
> > Also, there has been quite some discussion last year in numpy about a
> possible re-implementation of a MaskedArray, but using numpy's protocols
> (`__array_ufunc__`, `__array_function__` etc), instead of being a subclass
> like np.ma now is. See eg
> https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html.
> >
> > Joris
> >
> > On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <nugend@gmail.com> wrote:
> >>
> >> Ok. That actually aligns closely to what I'm familiar with. Good to
> know.
> >>
> >> Thanks again for taking the time to respond,
> >>
> >> -Dan Nugent
> >>
> >>
> >> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <wesmckinn@gmail.com>
> wrote:
> >>>
> >>> Social and technical reasons I guess. Empirically it's just not used
> much.
> >>>
> >>> You can see my comments about numpy.ma in my 2010 paper about pandas
> >>>
> >>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
> >>>
> >>> At least in 2010, there were notable performance problems when using
> >>> MaskedArray for computations
> >>>
> >>> "We chose to use NaN as opposed to using NumPy MaskedArrays for
> >>> performance reasons (which are beyond the scope of this paper), as NaN
> >>> propagates in floating-point operations in a natural way and can be
> >>> easily detected in algorithms."
> >>>
> >>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <nugend@gmail.com>
> wrote:
> >>> >
> >>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll stick
> with it.
> >>> >
> >>> > Do you have any feelings about why Numpy's masked arrays didn't gain
> favor when many data representation formats explicitly support nullity
> (including Arrow)? Is it just that not carrying nulls in computations
> forward is preferable (that is, early filtering/value filling was easier)?
> >>> >
> >>> > -Dan Nugent
> >>> >
> >>> >
> >>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <wesmckinn@gmail.com>
> wrote:
> >>> >>
> >>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <nugend@gmail.com>
> wrote:
> >>> >> >
> >>> >> > Didn’t want to follow up on this on the Jira issue earlier since it's sort of tangential to that bug and more of a usage question. You said:
> >>> >> >
> >>> >> > > I wouldn't recommend building applications based on them
> nowadays since the level of support / compatibility in other projects is
> low.
> >>> >> >
> >>> >> > In my case, I am using them since it seemed like a
> straightforward representation of my data that has nulls, the format I’m
> converting from has zero cost numpy representations, and converting from an
> internal format into Arrow in memory structures appears zero cost (or close
> to it) as well. I guess I can just provide the mask as an explicit
> argument, but my original desire to use it came from being able to exploit
> numpy.ma.concatenate in a way that saved some complexity in implementation.
> >>> >> >
> >>> >> > Since Arrow itself supports masking values with a bitfield, is there something intrinsic to the notion of array masks that is not well supported? Or do you just mean the specific numpy MaskedArray class?
> >>> >> >
> >>> >>
> >>> >> I mean just the numpy.ma module. Not many Python computing projects
> >>> >> nowadays treat MaskedArray objects as first class citizens.
> Depending
> >>> >> on what you need it may or may not be a problem. pyarrow supports
> >>> >> ingesting from MaskedArray as a convenience, but it would not be
> >>> >> common in my experience for a library's APIs to return MaskedArrays.
> >>> >>
> >>> >> > If this is too much of a numpy question rather than an arrow
> question, could you point me to where I can read up on masked array support
> or maybe what the right place to ask the numpy community about whether what
> I'm doing is appropriate or not.
> >>> >> >
> >>> >> > Thanks,
> >>> >> >
> >>> >> >
> >>> >> > -Dan Nugent
>
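
A minimal sketch of the point in the thread that elementwise functions need not look at the nulls, because the input's validity bitmap can be handed to the output unchanged. This is plain Python for illustration, not Arrow's implementation: `NullableArray`, `pack_validity` and `negate` are hypothetical names. The one detail taken from Arrow is the bit order, where bit i of the bitmap (least-significant bit first within each byte) is set when element i is valid.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NullableArray:
    """Hypothetical stand-in for a nullable column: values plus a validity bitmap."""
    values: List[float]  # one slot per element, including null slots
    validity: bytes      # bit i set => element i is valid (LSB first in each byte)

    def is_valid(self, i: int) -> bool:
        return bool((self.validity[i // 8] >> (i % 8)) & 1)

    def to_list(self) -> List[Optional[float]]:
        return [v if self.is_valid(i) else None for i, v in enumerate(self.values)]


def pack_validity(flags: List[bool]) -> bytes:
    """Pack one boolean per element into a bitmap, LSB first."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def negate(arr: NullableArray) -> NullableArray:
    """Elementwise kernel: computes on every slot and never inspects validity.

    The output reuses the very same bitmap object, so nothing is copied.
    Whatever ends up in a null slot is irrelevant because it stays masked out.
    """
    return NullableArray([-v for v in arr.values], arr.validity)


a = NullableArray([1.0, 0.0, 3.0], pack_validity([True, False, True]))
b = negate(a)
print(b.to_list())
```

The kernel body contains no branch on nullness, which is what makes the uniform bitmap representation convenient for this class of operations.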

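Felix's suggestion for the mean (set the NA slots to zero, sum everything without branching, then correct the denominator using the null count kept in the metadata) can be sketched as follows. The helper names `zero_fill` and `mean_ignoring_nulls` are hypothetical, and the sketch assumes the null count is already known alongside the data.

```python
from typing import List, Optional, Tuple


def zero_fill(values: List[Optional[float]]) -> Tuple[List[float], int]:
    """Replace each null with 0.0 and report how many nulls there were."""
    filled = [0.0 if v is None else v for v in values]
    null_count = sum(v is None for v in values)
    return filled, null_count


def mean_ignoring_nulls(filled: List[float], null_count: int) -> Optional[float]:
    """Mean over the valid entries only.

    The sum runs over every slot with no per-element null check; the zeros in
    the null slots contribute nothing, so only the divisor needs adjusting.
    """
    n_valid = len(filled) - null_count
    if n_valid == 0:
        return None  # no valid entries: the mean is itself null
    return sum(filled) / n_valid


filled, null_count = zero_fill([1.0, None, 3.0, None])
print(mean_ignoring_nulls(filled, null_count))
```

The branch-free summation loop is the part that would be a candidate for SIMD in a compiled implementation; the Python version only shows the arithmetic.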