flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chesnay Schepler <chesnay.schep...@fu-berlin.de>
Subject Re: Changing how TypeComparators Work
Date Thu, 28 Aug 2014 08:22:21 GMT
@aljoscha : yes, the tuple comparators don't work on binary data.

On 28.8.2014 0:10, Ufuk Celebi wrote:
> I agree with Aljoscha that we should have further numbers on this
> (independently of what we settle on with the comparators).
>
> @Chesnay: do you feel like giving it a try or is the Python mystery eating
> up all your time? ;-)
>
>
> On Wed, Aug 27, 2014 at 10:27 PM, Aljoscha Krettek <aljoscha@apache.org>
> wrote:
>
>> Ok, but that's micro benchmarks. How much time is really spent on
>> those comparisons when a real job is executed? Does anyone know? Also,
>> the tuple pair comparators never work on binary representation, right?
>>
>> On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler
>> <chesnay.schepler@fu-berlin.de> wrote:
>>> i cant recall definitely what the numbers were so I'll just quote myself
>>> from the PR:
>>>
>>> measurements were done using System.nanoTime()
>>> time necessary for the comparison
>>> Strings consisted of 90 characters
>>>
>>> difference in the beginning of the string
>>> New 4259
>>> Old 23431
>>>
>>> difference in the middle of the string
>>> New 13560
>>> Old 30757
>>>
>>> difference at the end of the string
>>> New 29385
>>> Old 28293
>>>
>>> for smaller strings (<10 i think) it was roughly 50% faster.
>>>
>>>
>>>
>>> On 27.8.2014 22:10, Aljoscha Krettek wrote:
>>>> Ok, then we have to find some other solution for interop between
>>>> comparators. What's the performance increase from that pull request?
>>>>
>>>> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <sewen@apache.org> wrote:
>>>>> A lot of comparisons (and that is what makes it fast) can happen on the
>>>>> binary data, without turning them into a comparable.
>>>>>
>>>>> The latest pull request in the String representation for example
>> improved
>>>>> String-keyed sorting quite a bit:
>>>>>
>>>>> https://github.com/apache/incubator-flink/pull/4
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <aljoscha@apache.org
>>>>> wrote:
>>>>>
>>>>>> It would work on binary data, for example for tuples it would become:
>>>>>>
>>>>>> public Comparable extractKeys(DataInputView in) {
>>>>>>       Object[] fields = ...
>>>>>>       // extract only relevant fields from binary input and save
in
>>>>>> fields
>>>>>>       return Tuple(fields) // something like that
>>>>>> }
>>>>>>
>>>>>> And for normalized keys something similar can be done.
>>>>>>
>>>>>> Aljoscha
>>>>>>
>>>>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <sewen@apache.org>
>> wrote:
>>>>>>> The design of the comparators so far was to make them work on
the
>>>>>>> binary
>>>>>>> data. That we need to retain, in my opinion, otherwise there
is no
>>>>>>> way to get good performance out of working on serialized data.
>>>>>>>
>>>>>>> I personally think that creating a tuple2 (key/value pair) when
using
>>>>>>> selector functions is actually good:
>>>>>>> The key type (being treated by its dedicated comparator) benefits
>> from
>>>>>> all
>>>>>>> the optimizations implemented for that type (bin copying, normalized
>>>>>> keys,
>>>>>>> ...)
>>>>>>> That would be very hard to incorporate into any comparator that
just
>>>>>>> deserializes some comparable.
>>>>>>>
>>>>>>> Also, the key extractor can contain sort of heavy magic (such
as to
>>>>>>> block
>>>>>>> keys), whatever a user put in there. If we put that into the
>>>>>>> comparator,
>>>>>> it
>>>>>>> gets called for
>>>>>>> every comparison!
>>>>>>>
>>>>>>> I do agree, though, that we need to come up with a better interface
>>>>>>> that
>>>>>>> seamlessly allows working on binary versions and on objects,
without
>>>>>>> duplicating too much code.
>>>>>>>
>>>>>>>   From your suggestion, I am not sure I got everything. Could
you
>> post a
>>>>>>> concrete example or code?
>>>>>>>
>>>>>>> Stephan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <
>> aljoscha@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Guys,
>>>>>>>> while porting the Java API to Scala I'm noticing how complicated
>>>>>>>> things are because of how our TypeComparators work: 1) There
is only
>>>>>>>> one type of comparator per TypeInformation which is created
by the
>>>>>>>> TypeInformation. Therefore, our KeySelectors are not actually
>>>>>>>> implemented as comparators but as generated mappers that
emit a
>>>>>>>> Tuple2, because you wouldn't for example be able to generate
a
>>>>>>>> SelectorFunctionComparator for a TupleTypeInfo.  (There's
also a lot
>>>>>>>> of magic going on with wrapping and unwrapping those tuples
in
>> Reduce,
>>>>>>>> Join, and CoGroup.) 2) Comparators cannot really interoperate,
there
>>>>>>>> is special case code for the combinations that work. This
will only
>>>>>>>> get worse when we properly introduce POJO types, which should
work
>>>>>>>> well with tuple comparators and the other comparators.
>>>>>>>>
>>>>>>>> My proposal is this: No more TypeComparator on a per type
basis.
>> Just
>>>>>>>> a generic comparator and PairComparator that work on Comparable.
>> What
>>>>>>>> used to be TypeComparators become SelectionExtractors that
return a
>>>>>>>> Comparable. Make Tuple comparable or add new ComparableTuple.
 The
>>>>>>>> TupleSelectionExtractor would return a comparable tuple of
the
>>>>>>>> appropriate length (same for POJOs). For Tuple extractors
that
>> operate
>>>>>>>> on only one field they would immediately return that field,
without
>>>>>>>> wrapping it in a tuple. This would directly support our existing
>>>>>>>> KeySelector functions since the already return Comparable,
when
>>>>>>>> returning a tuple in a key selector function this would be
>> compatible
>>>>>>>> with a TupleSelectionExtractor (on the other join side, for
>> example).
>>>>>>>> That's my idea. What do you think? I think the current state
is not
>>>>>>>> maintainable, so we should do something quickly. :D
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Aljoscha
>>>>>>>>


Mime
View raw message