incubator-crunch-dev mailing list archives

From Rahul
Subject Re: min/max/sort APIs not in sync
Date Sat, 15 Sep 2012 13:39:34 GMT
I agree that it is hard to find use cases for this beyond the standard
types. In our application, which runs on Hadoop MapReduce rather than
on Crunch, we sort on strings, but examples for min/max are hard to
find. If I were to move my application to Crunch, sorting on a custom
type is the only thing that we would require. I think I should be
able to bypass that constraint by making use of PTable<String,

@Josh, I also think that PType could be the starting point. Maybe we
can embed this information in the Converter that is used there. As long as
the user uses the same PType, the results would remain in sync.
But again, it comes back to the point Gabriel mentioned: are there use
cases that go beyond the standard types?
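
To make the concern concrete, here is a minimal sketch in plain Java (not Crunch code) of how min/max based on compareTo can disagree with a sort that orders values by their serialized form. The class name OrderingMismatchDemo, the serialize helper, and the 4-byte big-endian encoding with unsigned byte comparison are all assumptions made for illustration; they stand in for whatever ordering a serialized type defines and are not how Crunch itself compares values.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; none of these names exist in Crunch.
class OrderingMismatchDemo {

    // Assumed serialized form: a 4-byte big-endian int.
    static byte[] serialize(int value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeInt(value);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Unsigned, lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] left, byte[] right) {
        int n = Math.min(left.length, right.length);
        for (int i = 0; i < n; i++) {
            int l = left[i] & 0xff;
            int r = right[i] & 0xff;
            if (l != r) {
                return Integer.compare(l, r);
            }
        }
        return Integer.compare(left.length, right.length);
    }

    // "Sort" path: orders values by their serialized bytes.
    static List<Integer> sortBySerializedForm(List<Integer> values) {
        List<Integer> sorted = new ArrayList<>(values);
        sorted.sort((a, b) -> compareBytes(serialize(a), serialize(b)));
        return sorted;
    }

    // "min/max" path: relies on the Java type's own compareTo.
    static int minByCompareTo(List<Integer> values) {
        return Collections.min(values);
    }

    static int maxByCompareTo(List<Integer> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(-1, 2, 10);
        System.out.println("min via compareTo: " + minByCompareTo(values));
        System.out.println("max via compareTo: " + maxByCompareTo(values));
        System.out.println("sorted by serialized bytes: " + sortBySerializedForm(values));
    }
}
```

With the values -1, 2 and 10, min and max via compareTo are -1 and 10, while the byte-ordered sort yields [2, 10, -1], so the first and last elements of the sorted output match neither the min nor the max. Whether a real custom type hits this depends entirely on how its serialized ordering relates to its compareTo.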


On 15-09-2012 13:39, Gabriel Reid wrote:
> Now that I understand the full situation (sorry for not getting it
> sooner, Rahul), I see that this is indeed a bit of an issue.
> The min and max methods act the way that would be most logically
> expected (in my mind), meaning the semantics of the sort are not
> exactly what would most logically be expected (again, in my mind).
> For example, if I have a custom class that is Comparable and is being
> serialized via Reflection with Avro, I would expect that the compareTo
> method on the class would be used for min/max and sort. Although this
> is how things work (really well) already for the min/max methods, I
> think that making sort work with this might be a pain and a lot
> slower.
> On the other hand, I'm not at all a user of either the sort or the
> min/max methods, and it seems most likely to me that they would all
> be used on numerical types that are built-in and will just work
> anyhow, so maybe this is a non-issue. Are there any use cases with
> these methods on non-numerical data?
> - Gabriel
> On Fri, Sep 14, 2012 at 6:22 PM, Josh Wills wrote:
>> So it seems reasonable to me for users to expect that min/max/sort give
>> consistent answers, even though the backing implementations of each are
>> different.
>> I *think* that the right way to do this is to have some sort of notion of a
>> PType that knows whether the type it refers to is Comparable and has a
>> method that consistently compares elements of that type, either as Java types
>> or as serialized (Writable/Avro) types, and then we make the sort and
>> min/max APIs make use of that functionality. Then the assumption would be
>> that the min/max/sort APIs would be consistent _assuming_ that the
>> PCollection had the same associated PType when the min/max/sort method was
>> called.
>> J
>> On Fri, Sep 14, 2012 at 6:33 AM, Rahul wrote:
>>> Hi all,
>>> We have min/max/sort APIs in Crunch. The min and max rely on S (the user type)
>>> being Comparable, while the Sort API relies on the corresponding Writable
>>> type being comparable, i.e. WritableComparable. To me the min and max APIs
>>> are special cases of the Sort API, and the three should be in sync with each
>>> other. If this is not the case then, at least theoretically, we could have
>>> cases where sorting produces results that differ from the min/max
>>> functions. We could adopt the Sort approach for all three, but there are
>>> some issues in that API: if the Writable is not comparable, the
>>> error will not be very clear, and S could have a comparator that differs
>>> from the Writable's, so the results are not what the user expects. Or
>>> maybe we can use the Comparable S in the Sort API. I am not sure, but I think we
>>> would not be able to use the Hadoop shuffle and sort then. I do not have a
>>> complete idea of how we could make the three in sync. Any thoughts on this?
>>> But I would like to ask first: should we even try to do that? Or am I
>>> just cooking up some theory that has no practical use case? There has been
>>> some discussion on this in the CRUNCH-57
>>> issue. Let me know what you think.
>>> regards,
>>> Rahul
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills
