incubator-crunch-dev mailing list archives

From Josh Wills
Subject Re: min/max/sort APIs not in sync
Date Sun, 16 Sep 2012 00:36:08 GMT
On Sat, Sep 15, 2012 at 6:39 AM, Rahul wrote:

> I agree that it is hard to find use cases for this beyond the standard
> types. In our application, which runs on Hadoop MapReduce and not on
> Crunch, we sort on strings, but examples for min/max are hard to
> find. If I were to move my application to Crunch, sorting on a custom type
> is the only thing we would require. I think I should be able to bypass
> that constraint by using PTable<String, CustomObject>.
> @Josh, I also think that PType could be the starting point. Maybe we can
> embed this information in the Converter that is used there. As long as the
> user uses the same PType, results would remain in sync.
> But again it comes back to the point Gabriel mentioned: are there use cases
> that go beyond standard types?

Maybe something involving composites of numeric types (e.g., Tuple3<Double,
Double, Double> or some such thing) but that's the only non-primitive type
example I could come up with. +1 that max/min of a string type isn't all
that common in my experience. So it feels like this is something that is
nice-to-have-at-some-point but not a critical feature for the next release.
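
To make the composite case concrete, here is a minimal sketch in plain Java of what "consistent" would mean for a triple of doubles. It uses no Crunch APIs: `Triple`, `ConsistentOrdering`, and `ORDER` are hypothetical names invented for this illustration, and the lexicographic ordering is just one assumed choice. The point is only that min, max, and sort agree when they all go through one shared comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for Tuple3<Double, Double, Double>; not a Crunch class.
final class Triple {
    final double a, b, c;

    Triple(double a, double b, double c) {
        this.a = a;
        this.b = b;
        this.c = c;
    }
}

final class ConsistentOrdering {
    // A single ordering shared by min, max, and sort:
    // lexicographic over the three fields (an assumed choice).
    static final Comparator<Triple> ORDER =
        Comparator.<Triple>comparingDouble(t -> t.a)
                  .thenComparingDouble(t -> t.b)
                  .thenComparingDouble(t -> t.c);

    static Triple min(List<Triple> xs) {
        return Collections.min(xs, ORDER);
    }

    static Triple max(List<Triple> xs) {
        return Collections.max(xs, ORDER);
    }

    // Returns a sorted copy and leaves the input untouched.
    static List<Triple> sort(List<Triple> xs) {
        List<Triple> copy = new ArrayList<>(xs);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<Triple> data = Arrays.asList(
            new Triple(2.0, 1.0, 0.0),
            new Triple(1.0, 5.0, 3.0),
            new Triple(1.0, 2.0, 9.0));
        List<Triple> sorted = sort(data);
        // With one shared comparator, min is the first sorted element
        // and max is the last one.
        System.out.println(min(data) == sorted.get(0));
        System.out.println(max(data) == sorted.get(sorted.size() - 1));
    }
}
```

Because every operation reads the same `ORDER`, the three results cannot drift apart; that is the property the PType-based idea quoted below is after.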

> regards
> Rahul
> On 15-09-2012 13:39, Gabriel Reid wrote:
>> Now that I understand the full situation (sorry for not getting it
>> sooner Rahul), I see that this is indeed a bit of an issue.
>> The min and max methods act the way that would be most logically
>> expected (in my mind), meaning the semantics of the sort are not
>> exactly what would most logically be expected (again, in my mind).
>> For example, if I have a custom class that is Comparable and is being
>> serialized via Reflection with Avro, I would expect that the compareTo
>> method on the class would be used for min/max and sort. Although this
>> is how things work (really well) already for the min/max methods, I
>> think that making sort work with this might be a pain and a lot
>> slower.
>> On the other hand, I'm not at all a user of either the sort or the
>> min/max methods, and it seems most likely to me that they would all
>> be used on numerical types that are built-in and will just work
>> anyhow, so maybe this is a non-issue. Are there any use-cases with
>> these methods on non-numerical data?
>> - Gabriel
>> On Fri, Sep 14, 2012 at 6:22 PM, Josh Wills wrote:
>>> So it seems reasonable to me for users to expect that min/max/sort give
>>> consistent answers, even though the backing implementations of each are
>>> different.
>>> I *think* that the right way to do this is to have some sort of notion
>>> of a PType that knows whether the type it refers to is Comparable and
>>> has a method that consistently compares elements of that type, either as
>>> Java types or as serialized (Writable/Avro) types, and then we make the
>>> sort and min/max APIs make use of that functionality. Then the
>>> assumption would be that the min/max/sort APIs would be consistent
>>> _assuming_ that the PCollection had the same associated PType when the
>>> min/max/sort method was called.
>>> J
>>> On Fri, Sep 14, 2012 at 6:33 AM, Rahul wrote:
>>>> Hi all,
>>>> We have min/max/sort APIs in Crunch. The min and max APIs rely on S (the
>>>> user type) being Comparable, while the Sort API relies on the
>>>> corresponding Writable type being comparable, i.e. WritableComparable.
>>>> To me, min and max are special cases of the Sort API, and the three
>>>> should be in sync with each other. If they are not, then at least
>>>> theoretically we could have cases where sorting produces results that
>>>> differ from the min/max functions. We could adopt the Sort approach for
>>>> all three, but that API has some issues: if the Writable is not
>>>> comparable, the error will not be very clear, and S could have a
>>>> comparator that differs from the Writable's, in which case the results
>>>> are not what the user expects. Or maybe we can use the Comparable S in
>>>> the Sort API; I am not sure, but I think we would then not be able to use
>>>> the Hadoop shuffle and sort. I do not have a complete idea of how we
>>>> could make the three in sync. Any thoughts? But I would like to ask
>>>> first: should we even try to do that, or am I just cooking up some
>>>> theory that has no practical use case? There has been some discussion on
>>>> this in the CRUNCH-57 issue. Let me know what you think.
>>>> regards,
>>>> Rahul
>>> --
>>> Director of Data Science
>>> Cloudera
>>> Twitter: @josh_wills

Director of Data Science
Cloudera
Twitter: @josh_wills
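
As an illustration of the mismatch Rahul describes at the start of the thread, the following plain-Java sketch shows how a type's compareTo and the ordering of its serialized form can disagree. Nothing here is Crunch or Hadoop code: `Version`, its `serialize` method, and `compareBytes` are hypothetical, and the decimal-text encoding is an assumption chosen to make the divergence obvious. It is not a claim about how any built-in Writable behaves.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical user type whose compareTo orders values numerically.
final class Version implements Comparable<Version> {
    final int number;

    Version(int number) {
        this.number = number;
    }

    @Override
    public int compareTo(Version other) {
        return Integer.compare(number, other.number);
    }

    // Assumed serialized form: the decimal text encoded as UTF-8 bytes.
    byte[] serialize() {
        return Integer.toString(number).getBytes(StandardCharsets.UTF_8);
    }
}

final class OrderingMismatch {
    // Unsigned lexicographic comparison of two byte arrays, in the style
    // of a raw comparator that orders serialized keys.
    static int compareBytes(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    // "min" in the style that relies on the user type being Comparable.
    static Version minByCompareTo(List<Version> xs) {
        return Collections.min(xs);
    }

    // First element after sorting on the serialized bytes.
    static Version firstBySerializedOrder(List<Version> xs) {
        List<Version> copy = new ArrayList<>(xs);
        copy.sort((p, q) -> compareBytes(p.serialize(), q.serialize()));
        return copy.get(0);
    }

    public static void main(String[] args) {
        List<Version> data = Arrays.asList(
            new Version(9), new Version(10), new Version(2));
        System.out.println(minByCompareTo(data).number);         // prints 2
        System.out.println(firstBySerializedOrder(data).number); // prints 10
    }
}
```

Here the text "10" sorts before "2" and "9" byte-wise, so the head of the sorted output is not the minimum under compareTo. Routing both paths through one comparison would remove the discrepancy.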
