pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1295) Binary comparator for secondary sort
Date Thu, 12 Aug 2010 21:50:17 GMT

    [ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897958#action_12897958

Thejas M Nair commented on PIG-1295:

bq. Conceptually comparator is in the logic of Tuple. 
This comparator is part of only the *default* tuple implementation used internally within
pig. So the class that is the source of truth for the default internal tuple implementation
seems a good place to have this function. A tuple returned by a loadfunction has nothing to
do with the comparator logic. 

bq. Ideally it should be a static method of Tuple, however Tuple interface do not allow me
do that.
Yes, a static method can't be overridden. Since this is supposed to return only one value
per pig query, the singleton TupleFactory is a better place.

bq. For backward compatibility, first, we will break either Tuple or TupleFactory, the impact
is equivalent;
No. TupleFactory is an abstract class, while Tuple is an interface. Users will not be forced
to change their implementation if we add a function to TupleFactory. Also, users are more
likely to have custom Tuple than custom TupleFactory - because they might implement different
tuples as part of their load function implementation, and are unlikely to change the default
Tuple implementation used in internally in pig.

bq. second, in both PigSecondaryKeyComparator and PigTupleSortComparator, we will check if
Tuple does not implement the new method, we fall back to the default serialize version. 
If Tuple interface is going to have this function, i think we should add in the javadoc that
it makes sense to implement the function only if it is going to be used as the default internal
tuple implementation. And that
null value can be returned if user chooses to not implement it.

> Binary comparator for secondary sort
> ------------------------------------
>                 Key: PIG-1295
>                 URL: https://issues.apache.org/jira/browse/PIG-1295
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>             Fix For: 0.8.0
>         Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch,
PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch,
PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch
> When hadoop framework doing the sorting, it will try to use binary version of comparator
if available. The benefit of binary comparator is we do not need to instantiate the object
before we compare. We see a ~30% speedup after we switch to binary comparator. Currently,
Pig use binary comparator in following case:
> 1. When semantics of order doesn't matter. For example, in distinct, we need to do a
sort in order to filter out duplicate values; however, we do not care how comparator sort
keys. Groupby also share this character. In this case, we rely on hadoop's default binary
> 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation
for simple types, such as integer, long, float, chararray, databytearray, string
> However, if the key is a tuple and the sort semantics matters, we do not have a binary
comparator implementation. This especially matters when we switch to use secondary sort. In
secondary sort, we convert the inner sort of nested foreach into the secondary key and rely
on hadoop to sorting on both main key and secondary key. The sorting key will become a two
items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting
semantics matters. It turns out we do not have binary comparator once we use secondary sort,
and we see a significant slow down.
> Binary comparator for tuple should be doable once we understand the binary structure
of the serialized tuple. We can focus on most common use cases first, which is "group by"
followed by a nested sort. In this case, we will use secondary sort. Semantics of the first
key does not matter but semantics of secondary key matters. We need to identify the boundary
of main key and secondary key in the binary tuple buffer without instantiate tuple itself.
Then if the first key equals, we use a binary comparator to compare secondary key. Secondary
key can also be a complex data type, but for the first step, we focus on simple secondary
key, which is the most common use case.
> We mark this issue to be a candidate project for "Google summer of code 2010" program.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message