arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Attn: Wes, Re: Masked Arrays
Date Sun, 05 Apr 2020 20:31:00 GMT
As I recall the contents "underneath" have been discussed before and
the consensus was that the contents are not specified. If you'd like
to make a proposal to change something I would suggest raising it on
dev@arrow.apache.org

On Sun, Apr 5, 2020 at 1:56 PM Felix Benning <felix.benning@gmail.com> wrote:
>
> Follow up: Do you think it would make sense to have an `na_are_zero` flag?
> It appears that the baseline (naively assuming there are no null values)
> is still a bit faster than equally optimized null value handling
> algorithms. So you might want to assume that all null values are set to
> zero in the array (instead of undefined). This would allow for very fast
> means, scalar products and thus matrix multiplication which ignore NAs.
> And in the case of matrix multiplication, you might prefer sacrificing an
> O(n^2) effort to set all null entries to zero before multiplying. And
> assuming you do not overwrite this data, you would be able to reuse that
> assumption in later computations with such a flag.
> In some use cases, you might even be able to utilize unused computing
> resources for this task, i.e. clean up the nulls while the computer is not
> used, preparing for the next query.
>
>
> On Sun, 5 Apr 2020 at 18:34, Felix Benning <felix.benning@gmail.com> wrote:
>>
>> Awesome, that was exactly what I was looking for, thank you!
>>
>> On Sun, 5 Apr 2020 at 00:40, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>
>>> I wrote a blog post a couple of years ago about this
>>>
>>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>>>
>>> Pasha Stetsenko did a follow-up analysis that showed that my
>>> "sentinel" code could be significantly improved, see:
>>>
>>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
>>>
>>> Generally speaking in Apache Arrow we've been happy to have a uniform
>>> representation of nullness across all types, both primitive (booleans,
>>> numbers, or strings) and nested (lists, structs, unions, etc.). Many
>>> computational operations (like elementwise functions) need not concern
>>> themselves with the nulls at all, for example, since the bitmap from
>>> the input array can be passed along (with zero copy even) to the
>>> output array.
>>>
>>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <felix.benning@gmail.com> wrote:
>>> >
>>> > Does anyone have an opinion (or links) about Bitpattern vs Masked
>>> > Arrays for NA implementations? There seems to have been a discussion
>>> > about that in the numpy community in 2012
>>> > https://numpy.org/neps/nep-0026-missing-data-summary.html without an
>>> > apparent result.
>>> >
>>> > Summary of the Summary:
>>> > - The Bitpattern approach reserves one bit pattern of each type as NA.
>>> >   The only types without spare bit patterns are integers, so this
>>> >   reduces their range by one value. This approach is taken by R and was
>>> >   regarded as more performant in 2012.
>>> > - The Mask approach was deemed more flexible, since it would allow
>>> >   "degrees of missingness", and also cleaner/easier to implement.
>>> >
>>> > Since bitpattern checks would probably disrupt SIMD, I feel like some
>>> > calculations (e.g. mean) would actually benefit more from setting NA
>>> > values to zero, proceeding as if they were not there, and using the
>>> > number of NAs in the metadata to adjust the result. This of course does
>>> > not work if two columns are used (e.g. scalar product), which is
>>> > probably more important.
>>> >
>>> > Was using Bitmasks in Arrow a conscious performance decision? Or was
>>> > the decision only based on the fact that R and Bitpattern
>>> > implementations in general are a niche, which means that Bitmasks are
>>> > more compatible with other languages?
>>> >
>>> > I am curious about this topic, since the "lack of proper na support"
>>> > was cited as the reason why Python would never replace R in statistics.
>>> >
>>> > Thanks,
>>> >
>>> > Felix
>>> >
>>> >
>>> > On 31.03.20 14:52, Joris Van den Bossche wrote:
>>> >
>>> > Note that pandas is starting to use a notion of "masked arrays" as
>>> > well, for example for its nullable integer data type, though not using
>>> > the np.ma masked array but a custom implementation (for technical
>>> > reasons in pandas this was easier).
>>> >
>>> > Also, there has been quite some discussion last year in numpy about a
>>> > possible re-implementation of a MaskedArray, but using numpy's
>>> > protocols (`__array_ufunc__`, `__array_function__` etc.), instead of
>>> > being a subclass like np.ma now is. See e.g.
>>> > https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html.
>>> >
>>> > Joris
>>> >
>>> > On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <nugend@gmail.com> wrote:
>>> >>
>>> >> Ok. That actually aligns closely to what I'm familiar with. Good to know.
>>> >>
>>> >> Thanks again for taking the time to respond,
>>> >>
>>> >> -Dan Nugent
>>> >>
>>> >>
>>> >> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>>> >>>
>>> >>> Social and technical reasons I guess. Empirically it's just not used much.
>>> >>>
>>> >>> You can see my comments about numpy.ma in my 2010 paper about pandas
>>> >>>
>>> >>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
>>> >>>
>>> >>> At least in 2010, there were notable performance problems when using
>>> >>> MaskedArray for computations
>>> >>>
>>> >>> "We chose to use NaN as opposed to using NumPy MaskedArrays for
>>> >>> performance reasons (which are beyond the scope of this paper), as
>>> >>> NaN propagates in floating-point operations in a natural way and can
>>> >>> be easily detected in algorithms."
>>> >>>
>>> >>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <nugend@gmail.com> wrote:
>>> >>> >
>>> >>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll stick
>>> >>> > with it.
>>> >>> >
>>> >>> > Do you have any feelings about why Numpy's masked arrays didn't gain
>>> >>> > favor when many data representation formats explicitly support
>>> >>> > nullity (including Arrow)? Is it just that not carrying nulls in
>>> >>> > computations forward is preferable (that is, early filtering/value
>>> >>> > filling was easier)?
>>> >>> >
>>> >>> > -Dan Nugent
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <wesmckinn@gmail.com> wrote:
>>> >>> >>
>>> >>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <nugend@gmail.com> wrote:
>>> >>> >> >
>>> >>> >> > Didn't want to follow up on this on the Jira issue earlier since
>>> >>> >> > it's sort of tangential to that bug and more of a usage question.
>>> >>> >> > You said:
>>> >>> >> >
>>> >>> >> > > I wouldn't recommend building applications based on them
>>> >>> >> > > nowadays since the level of support / compatibility in other
>>> >>> >> > > projects is low.
>>> >>> >> >
>>> >>> >> > In my case, I am using them since it seemed like a straightforward
>>> >>> >> > representation of my data that has nulls, the format I'm
>>> >>> >> > converting from has zero cost numpy representations, and
>>> >>> >> > converting from an internal format into Arrow in-memory structures
>>> >>> >> > appears zero cost (or close to it) as well. I guess I can just
>>> >>> >> > provide the mask as an explicit argument, but my original desire
>>> >>> >> > to use it came from being able to exploit numpy.ma.concatenate in
>>> >>> >> > a way that saved some complexity in implementation.
>>> >>> >> >
>>> >>> >> > Since Arrow itself supports masking values with a bitfield, is
>>> >>> >> > there something intrinsic to the notion of array masks that is not
>>> >>> >> > well supported? Or do you just mean the specific numpy MaskedArray
>>> >>> >> > class?
>>> >>> >> >
>>> >>> >>
>>> >>> >> I mean just the numpy.ma module. Not many Python computing projects
>>> >>> >> nowadays treat MaskedArray objects as first class citizens. Depending
>>> >>> >> on what you need it may or may not be a problem. pyarrow supports
>>> >>> >> ingesting from MaskedArray as a convenience, but it would not be
>>> >>> >> common in my experience for a library's APIs to return MaskedArrays.
>>> >>> >>
>>> >>> >> > If this is too much of a numpy question rather than an arrow
>>> >>> >> > question, could you point me to where I can read up on masked array
>>> >>> >> > support, or to the right place to ask the numpy community whether
>>> >>> >> > what I'm doing is appropriate?
>>> >>> >> >
>>> >>> >> > Thanks,
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > -Dan Nugent
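
Illustrative note on the `na_are_zero` idea raised in this thread: the
following is only a minimal sketch in plain Python, not Arrow's or pyarrow's
implementation. It assumes a column stored as a values list plus a separate
validity mask, with every null slot explicitly set to zero (which, per the
reply above, Arrow itself does not specify). The helper names `zero_fill`,
`mean_ignoring_nulls` and `dot_ignoring_nulls` are hypothetical.

```python
from typing import List, Optional, Tuple


def zero_fill(values: List[Optional[float]]) -> Tuple[List[float], List[bool]]:
    """Split a nullable column into zero-filled data and a validity mask.

    Assumption for this sketch: null slots are written as 0.0 so that
    later kernels can run without checking the mask element by element.
    """
    valid = [v is not None for v in values]
    data = [v if v is not None else 0.0 for v in values]
    return data, valid


def mean_ignoring_nulls(data: List[float], valid: List[bool]) -> Optional[float]:
    """Mean over non-null entries: a plain sum, corrected by the valid count."""
    n_valid = sum(valid)
    if n_valid == 0:
        return None  # no valid entries, so the mean is undefined
    # Zeroed nulls add nothing to the sum, so no per-element branch is needed.
    return sum(data) / n_valid


def dot_ignoring_nulls(x: List[float], y: List[float]) -> float:
    """Scalar product that skips pairs where either side is null.

    A zeroed null on either side makes that product term zero, so the
    mask is not consulted at all.
    """
    return sum(a * b for a, b in zip(x, y))


x_data, x_valid = zero_fill([1.0, None, 3.0, 4.0])
y_data, y_valid = zero_fill([2.0, 5.0, None, 1.0])

x_mean = mean_ignoring_nulls(x_data, x_valid)  # (1 + 3 + 4) / 3
xy_dot = dot_ignoring_nulls(x_data, y_data)    # 1*2 + 4*1
```

The mean only needs the count of valid entries, which could live in
metadata, while the scalar product relies on both inputs having zeroed
nulls. Whether the cost of the zero-filling pass pays off depends on how
often the data is reused.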
