hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: What is the runtime efficiency of secondary sorting?
Date Tue, 04 Jan 2011 00:10:01 GMT
On Mon, Jan 3, 2011 at 4:00 PM, W.P. McNeill <billmcn@gmail.com> wrote:

> ... If I write a combiner like this, is there any advantage to also doing a
> secondary sort?

The definitive answer is that it depends.

> As for deserialization, the value in my actual application is a Java object
> with a floating point rank field, and I will be sorting these objects by
> this rank.  Does this make deserialization relatively costly?  (I'm guessing
> it does, because it's not as simple as a single number.)

If you can define a binary comparator for your value that extracts this
single number and compares it, then the framework can sort your data items
without (fully) deserializing them.  This can be a big, big performance win,
if only because the serialized form is often smaller, so more data fits into
a given amount of memory and the merge sort needs fewer passes.
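
To make that concrete, here is a minimal sketch of such a comparison in plain
Java.  The byte layout is an assumption for illustration, not something from
the original question: the rank is taken to be the first 4 bytes of the
serialized value, written as a big-endian float.  The class name and the
serialize helper are hypothetical.  The compare method has the same parameter
shape as Hadoop's RawComparator.compare (buffer, start offset, length for each
side), so its body could plausibly be moved into a real comparator.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: order serialized values by a leading float rank. */
public class RankBytesComparator {

    // Assumed layout (illustration only): 4 bytes of big-endian float rank,
    // followed by the rest of the object's fields as opaque bytes.
    static byte[] serialize(float rank, String payload) {
        byte[] rest = payload.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + rest.length)
                .putFloat(rank)
                .put(rest)
                .array();
    }

    // Same parameter shape as RawComparator.compare: each record is a slice
    // (start offset s, length l) of a larger buffer b. Only the 4 rank bytes
    // are decoded; the remaining fields are never touched.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        float r1 = ByteBuffer.wrap(b1, s1, 4).getFloat();
        float r2 = ByteBuffer.wrap(b2, s2, 4).getFloat();
        return Float.compare(r1, r2);
    }
}
```

The point of the design is that the cost of a comparison no longer depends on
how large or complicated the rest of the object is.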

Without a binary comparator, the framework must deserialize your object
completely and then compare it. That can't help but be slower than avoiding
the deserialization.

Some serialization frameworks like Avro even allow some values to be sorted
without any deserialization.  This is even better, of course, than partial
deserialization.
