pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-1295) Binary comparator for secondary sort
Date Sat, 13 Mar 2010 00:49:27 GMT
Binary comparator for secondary sort

                 Key: PIG-1295
                 URL: https://issues.apache.org/jira/browse/PIG-1295
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.7.0
            Reporter: Daniel Dai

When hadoop framework doing the sorting, it will try to use binary version of comparator if
available. The benefit of binary comparator is we do not need to instantiate the object before
we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use
binary comparator in following case:

1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort
in order to filter out duplicate values; however, we do not care how comparator sort keys.
Groupby also share this character. In this case, we rely on hadoop's default binary comparator
2. Semantics of order matter, but the key is of simple type. In this case, we have implementation
for simple types, such as integer, long, float, chararray, databytearray, string

However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator
implementation. This especially matters when we switch to use secondary sort. In secondary
sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop
to sorting on both main key and secondary key. The sorting key will become a two items tuple.
Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters.
It turns out we do not have binary comparator once we use secondary sort, and we see a significant
slow down.

Binary comparator for tuple should be doable once we understand the binary structure of the
serialized tuple. We can focus on most common use cases first, which is "group by" followed
by a nested sort. In this case, we will use secondary sort. Semantics of the first key does
not matter but semantics of secondary key matters. We need to identify the boundary of main
key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if
the first key equals, we use a binary comparator to compare secondary key. Secondary key can
also be a complex data type, but for the first step, we focus on simple secondary key, which
is the most common use case.

We mark this issue to be a candidate project for "Google summer of code 2010" program. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message