hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Patterson <j...@cloudera.com>
Subject Re: RawComparator
Date Tue, 10 Aug 2010 19:21:50 GMT

On Tue, Aug 10, 2010 at 4:01 AM, Dennis <arsenepark@yahoo.com.cn> wrote:

> Hi, guys,
> I am using hadoop 0.20.2, and I am trying to run the "SecondarySort"
> exmaple. The following is the "FirstGroupingComparator" class, and I just
> cannot figure out how "WritableComparator.compareBytes(b1, s1,
> Integer.SIZE / 8, b2, s2, Integer.SIZE / 8)" works. There are really few
> javadocs of this class or  this method.
> 1. Why it is "Integer.SIZE / 8"?

That says "take the size of the integer in bits on this system and divide it
by 8" --- which in java on 32 and 64 bit systems should give you 32 / 8 == 4
as afaik the integer bit width doesnt change based on the architecture with
java. So its saying here "compare the first 4 bytes of each byte array" (the
width, in bytes, of the first integer in the composite key) ,whereas
Integer.SIZE gives the number of bits in the datatype.

WritableComparators are useful in the shuffle phase of hadoop; we are
constantly comparing and sorting WritableComparables, and the secondary
sorting mechanics allow us to have a group of data for a key arrive at the
reducer in a certain order (example: time series data, where we want a range
of timestamps in one group, but we also want them in order when they are
processed inside the reducer)

> 2. If I want to compare two "String" here, how should I write to code?
>     public static class FirstGroupingComparator implements
>             RawComparator<IntPair> {
>         @Override
>         public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
> int l2) {
>             int ret = WritableComparator.compareBytes(b1, s1, Integer.SIZE
> / 8,
>                     b2, s2, Integer.SIZE / 8);
>             return ret;
>         }
>         @Override
>         public int compare(IntPair o1, IntPair o2) {
>             int l = o1.getFirst();
>             int r = o2.getFirst();
>             return l == r ? 0 : (l < r ? -1 : 1);
>         }
>     }

In the case of the comparison of strings, lets say for example you have a
"composite key" that has two String or Text object members (k1, k2); We
"group by" the first part of the key k1 and we sort by this key as well (we
block ranges together). This is very similar to the example above. Since
with a RawComparator we are looking to only deserialize a portion of the
data to do the comparison, you'll need a strategy for the compare() function
that takes into account that the strings are variable length (which means we
are unable to simply read 4 bytes as in the case of the integer). The
challenge here is to only deserialize the portion of the composite key that
contains the string/text that you want to compare against, which is going to
be a variable number of bytes each time. A good place to start looking at
for ideas would be the Text class in Hadoop and also WritableUtils.

Josh Patterson

> Thanks.
> Dennis

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message