arrow-dev mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [Discuss][C++] Hashing floating point numbers
Date Tue, 05 Mar 2019 06:16:02 GMT
OK, to summarize my understanding of the thoughts expressed:
1.  People really shouldn't be trying to do things like grouping and
joining on double-valued columns (but they do).
2.  The consensus (though not unanimous) is:
   * Canonicalize NaNs and assume NaN == NaN for group-by/unique kernels.
   * Assume -0.0 == 0.0.

I can update the JIRA with these conclusions unless someone strongly
disagrees.
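
To make the two conclusions concrete, here is a minimal sketch of what
canonicalizing before hashing could look like. It is not Arrow's actual
kernel code; the names Canonicalize, CanonicalDoubleHash,
CanonicalDoubleEqual and CountUnique are hypothetical and exist only for
illustration. The point is that the hash and the equality relation both go
through the same canonical form, so they cannot drift apart.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <limits>
#include <unordered_set>
#include <vector>

// Hypothetical helper: collapse every NaN payload to a single quiet NaN
// and fold -0.0 into +0.0. All other values pass through unchanged.
inline double Canonicalize(double v) {
  if (std::isnan(v)) return std::numeric_limits<double>::quiet_NaN();
  if (v == 0.0) return 0.0;  // the comparison is true for both +0.0 and -0.0
  return v;
}

// Reinterpret the canonical value as raw bits so NaN can be compared.
inline std::uint64_t CanonicalBits(double v) {
  const double c = Canonicalize(v);
  std::uint64_t bits;
  std::memcpy(&bits, &c, sizeof(bits));
  return bits;
}

// Hash and equality are both defined on the canonical bit pattern,
// which keeps the two aligned by construction.
struct CanonicalDoubleHash {
  std::size_t operator()(double v) const {
    return std::hash<std::uint64_t>{}(CanonicalBits(v));
  }
};

struct CanonicalDoubleEqual {
  bool operator()(double a, double b) const {
    return CanonicalBits(a) == CanonicalBits(b);
  }
};

// Hypothetical stand-in for a "unique" kernel: counts distinct values
// under the NaN == NaN and -0.0 == 0.0 convention.
inline std::size_t CountUnique(const std::vector<double>& values) {
  std::unordered_set<double, CanonicalDoubleHash, CanonicalDoubleEqual> seen;
  for (double v : values) seen.insert(v);
  return seen.size();
}
```

With these definitions, a column holding 0.0, -0.0, two NaNs with
different payloads or signs, and 1.5 would count as three distinct values
rather than five.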

Thanks,
Micah

On Tue, Feb 26, 2019 at 11:54 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> In an analytics setting my prior is that -0/+0 and all types of NaNs
> should respectively be considered semantically to all be "the same
> value". It would be confusing (and likely "wrong" in a practical
> setting) to obtain two kinds of zeros as the output of an algorithm
> involving a hash table, like Unique or ValueCounts. However: hashing
> of floats should not be encouraged in general, but sometimes people
> will hash the results of some operation that happens to yield floats.
>
> On Tue, Feb 26, 2019 at 1:49 PM Antoine Pitrou <solipsis@pitrou.net>
> wrote:
> >
> > On Tue, 26 Feb 2019 09:59:54 -0800
> > Tim Armstrong <tarmstrong@cloudera.com.INVALID> wrote:
> > > It's not a database thing, it's a floating point
> > > number thing. If you're doing floating point arithmetic you can end up
> > > with -0/+0 from expressions that should be equivalent.
> >
> > But we are not exactly dealing with arithmetic here...  I'm not sure
> > the IEEE FP standard was designed with database joins in mind.
> >
> > Granted, float hashing and float equality may be of dubious utility.
> > I'm curious about the use cases.
> >
> > > You end up in a world of pain if your equality relation and your hash
> > > function implementation are not aligned.
> >
> > This is not what I am suggesting.
> >
> > > So it's really a question of how you want to define equality (and
> > > whether you want to have multiple definitions of equality for
> > > different purposes).
> >
> > I think this is the goal of this discussion.
> >
> > Regards
> >
> > Antoine.
> >
> >
>
