incubator-crunch-dev mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: min/max/sort APIs not in sync
Date Sat, 15 Sep 2012 08:09:47 GMT
Now that I understand the full situation (sorry for not getting it
sooner, Rahul), I see that this is indeed a bit of an issue.

The min and max methods act the way that would be most logically
expected (in my mind), meaning the semantics of the sort are not
exactly what would most logically be expected (again, in my mind).

For example, if I have a custom class that is Comparable and is being
serialized via Reflection with Avro, I would expect that the compareTo
method on the class would be used for min/max and sort. Although this
is how things work (really well) already for the min/max methods, I
think that making sort work with this might be a pain and a lot
slower.
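
To make the mismatch concrete, here is a minimal sketch in plain Java (no Crunch involved). The names Score, OrderingMismatch and serialize are hypothetical, and the decimal-text encoding is only an assumed stand-in for a serialized form; it is not how Writables or Avro actually encode integers. It only shows that a minimum taken via compareTo and the first element of a sort over serialized bytes can disagree.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical user type, not part of Crunch.
final class Score implements Comparable<Score> {
    final int value;

    Score(int value) { this.value = value; }

    // The ordering the user expects: plain numeric order.
    @Override
    public int compareTo(Score other) {
        return Integer.compare(value, other.value);
    }

    // Assumed stand-in for a serialized form: decimal text as UTF-8 bytes.
    byte[] serialize() {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public String toString() { return "Score(" + value + ")"; }
}

final class OrderingMismatch {
    // Lexicographic comparison of unsigned bytes, the way a raw
    // byte-level comparator would order serialized records.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // "min" semantics: uses the user's compareTo.
    static Score minByCompareTo(List<Score> scores) {
        return Collections.min(scores);
    }

    // "sort" semantics: orders by serialized bytes, then takes the head.
    static Score firstBySerializedOrder(List<Score> scores) {
        List<Score> sorted = new ArrayList<>(scores);
        sorted.sort((p, q) -> compareBytes(p.serialize(), q.serialize()));
        return sorted.get(0);
    }

    public static void main(String[] args) {
        List<Score> scores =
            Arrays.asList(new Score(9), new Score(10), new Score(100));
        System.out.println(minByCompareTo(scores));         // Score(9)
        System.out.println(firstBySerializedOrder(scores)); // Score(10)
    }
}
```

With the inputs 9, 10 and 100, compareTo picks 9 as the minimum, while the byte-wise order puts "10" first because the byte for '1' sorts before the byte for '9'.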

On the other hand, I'm not at all a user of either the sort or the
min/max methods, and it seems most likely to me that they would all
be used on numerical types that are built-in and will just work
anyhow, so maybe this is a non-issue. Are there any use-cases with
these methods on non-numerical data?

- Gabriel


On Fri, Sep 14, 2012 at 6:22 PM, Josh Wills <jwills@cloudera.com> wrote:
> So it seems reasonable to me for users to expect that min/max/sort give
> consistent answers, even though the backing implementations of each are
> different.
>
> I *think* that the right way to do this is to have some sort of notion of a
> PType that knows whether the type it refers to is Comparable and has a
> method that consistently cmps elements of that type, either as Java types
> or as serialized (Writable/Avro) types, and then we make the sort and
> min/max APIs make use of that functionality. Then the assumption would be
> that the min/max/sort APIs would be consistent _assuming_ that the
> PCollection had the same associated PType when the min/max/sort method was
> called.
>
> J
>
> On Fri, Sep 14, 2012 at 6:33 AM, Rahul <rsharma@xebia.com> wrote:
>
>> Hi all,
>>
>> We have min/max/sort APIs in Crunch. The min and max rely on S(user type)
>> being comparable while the Sort API relies on the corresponding writable
>> type being comparable, i.e. WritableComparable.  To me the min and max APIs
>> are special cases of Sort API and the three should be in sync with each
>> other.  If this is not the case then at least theoretically we could have
>> cases where sorting produces results that are different from min/max
>> functions. We could adopt the Sort approach for all three but there are
>> some issues in that api like if the Writable is not comparable then the
>> error will not be that clear,  S could have a comparator that is different
>> from the Writable then the results are not as expected by user etc. Or
>> maybe we can use comparable S in Sort api, I am not sure, but I think we
>> would not be able to use hadoop shuffle and sort then.  I do not have
>> complete idea how we could make the three in sync. Any thoughts on this?
>> But I would like to ask first: should we even try to do that? Or am I
>> just cooking up some theory that has no practical use case? There has been
>> some discussion on this in the CRUNCH-57
>> <https://issues.apache.org/jira/browse/CRUNCH-57>
>> issue. Let me know what you think.
>>
>> regards,
>> Rahul
>>
>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
