hadoop-common-user mailing list archives

From Alex Kozlov <ale...@cloudera.com>
Subject Re: [HADOOP] Terasort for numbers
Date Mon, 02 Aug 2010 17:41:41 GMT
Hi Teodor,

I see the problem now: there is no simple binary comparator for doubles.
So you can do 2 things:

1. Convert your doubles to ints (or bytes). For example, if the precision is
always 2 decimal places, represent each number as 100 x double. The problem
is then reduced to sorting integers.

2. Use DoubleWritable as the key and payload as value.  You can use generic
does not use tries.  You can also just use a generic MR job with
DoubleWritable keys: with an identity mapper and identity reducer, MR will
sort the keys for you.
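
A minimal plain-JDK sketch of option 1, assuming every key has at most 2 decimal places. The class name ScaledKey and its methods are hypothetical and not part of Hadoop or Terasort. It scales each double to a long, flips the sign bit, and writes the result as 8 big-endian bytes, so that an unsigned byte-by-byte comparison gives numeric order:

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: fixed-point keys whose raw bytes sort in numeric order. */
final class ScaledKey {
    // Assumption: keys never carry more than two decimal places.
    static final long SCALE = 100L;

    /** Encodes a value as 8 big-endian bytes that compare correctly as unsigned bytes. */
    static byte[] encode(double value) {
        long scaled = Math.round(value * SCALE);
        // Flipping the sign bit places negative values below positive ones
        // when the bytes are compared as unsigned.
        long biased = scaled ^ Long.MIN_VALUE;
        return ByteBuffer.allocate(Long.BYTES).putLong(biased).array();
    }

    /** Recovers the original value from an encoded key. */
    static double decode(byte[] key) {
        long biased = ByteBuffer.wrap(key).getLong();
        return (double) (biased ^ Long.MIN_VALUE) / SCALE;
    }

    /** Unsigned lexicographic comparison, the way a raw byte comparator orders keys. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

The 8-byte key would still have to be padded or otherwise fitted to whatever key width the sorting job expects; that part is not shown here.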

Option 2 is slightly less efficient since the code will need to call
Double.longBitsToDouble each time, but I don't see an easy way to avoid this
with the IEEE 754 encoding.
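
To illustrate the point about IEEE 754, here is a small plain-JDK sketch (the class DoubleKeyCompare is hypothetical, not Hadoop code). It compares the serialized bytes of two doubles directly, which gives the wrong order once negative numbers are involved, and then compares them after decoding, which is the extra per-comparison work mentioned above:

```java
import java.nio.ByteBuffer;

/** Hypothetical helper contrasting raw-byte and decoded comparison of doubles. */
final class DoubleKeyCompare {
    /** Serializes a double as the 8 big-endian bytes of its IEEE 754 bit pattern. */
    static byte[] toBytes(double value) {
        return ByteBuffer.allocate(Double.BYTES)
                .putLong(Double.doubleToLongBits(value))
                .array();
    }

    /** Naive comparison of the raw bytes as unsigned values. */
    static int compareRaw(byte[] a, byte[] b) {
        for (int i = 0; i < Double.BYTES; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }

    /** Decodes both keys before comparing them numerically. */
    static int compareDecoded(byte[] a, byte[] b) {
        double x = Double.longBitsToDouble(ByteBuffer.wrap(a).getLong());
        double y = Double.longBitsToDouble(ByteBuffer.wrap(b).getLong());
        return Double.compare(x, y);
    }
}
```

With the sample keys from the thread, compareRaw ranks -34.56 above 10.25 because the sign bit is the most significant bit, while compareDecoded orders them correctly.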

Alex K

On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas <teodor.macicas@epfl.ch> wrote:

> Hi Alex,
> Thank you for your quick reply and sorry for not being so clear.
> The job I want to do is simply to sort data having numbers [doubles] as
> keys [0]. I noticed that Terasort uses a 10-byte char key. How can I use
> it for my particular job?
> Do I need to change Terasort?
> [0] example of workload:
> 123.45    payload1
> -34.56     payload2
> 752.10    payload3
> 10.25      payload4
> ....
> Does this make sense now?
> Regards,
> Teodor
> On 08/02/2010 12:14 AM, Alex Kozlov wrote:
>> Hi Teodor,
>> I am not clear what you call 'real numbers'.  Terasort does work on bytes
>> (10 bytes key and 90 bytes payload).  The actual 'meaning' of the bytes
>> really does not matter as Hadoop uses binary comparators on the raw value.
>> Total order partitioning should also work with any  WritableComparable key
>> (if it doesn't, it's a bug).
>> My guess is that your problem is converting a char trie to
>> WritableComparable.  Can you provide more background?  Are the strings of
>> fixed length?
>> Alex K
>> On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas <teodor.macicas@epfl.ch> wrote:
>>> Hi all,
>>> I am using Hadoop 0.20.2 and I want to sort a huge amount of data. I've
>>> read about Terasort [from examples], but it uses 10-byte char keys.
>>> Changing keys from char to integer wasn't a good solution, as Terasort
>>> builds a trie for creating total order partitions. I got stuck when I
>>> tried to change the char trie to one suitable for number keys.
>>> Then I gave Sort [also from examples] a try and it did work for
>>> integer keys, but without total order partitioning. At the end of the
>>> day, the final result cannot be created just by putting together all
>>> reducers' outputs. Each reducer sorts only a subset of the data and no
>>> merging occurs between two reducers.
>>> Please can anyone advise me what to use, and how, in order to sort a huge
>>> amount of real numbers?
>>> Looking forward to your replies.
>>> Thank you.
>>> Best,
>>> Teodor
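
For readers following the total order partitioning discussion in the quoted messages, the idea can be sketched without a trie. Assuming a sorted array of split points (hypothetical values, typically obtained by sampling the input), a binary search assigns each key to a range, so every key sent to reducer i is smaller than every key sent to reducer i + 1 and the reducer outputs can simply be concatenated. The class RangePartitionSketch is illustrative only and is not the Hadoop partitioner:

```java
import java.util.Arrays;

/** Hypothetical range partitioner over double keys; not a Hadoop class. */
final class RangePartitionSketch {
    /**
     * Returns the reducer index for a key, given sorted split points.
     * With k split points there are k + 1 partitions; a key equal to a
     * split point goes to the partition above it.
     */
    static int partitionFor(double key, double[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

For example, with split points {0.0, 100.0} there are three partitions: negative keys, keys in [0, 100), and keys of 100 or more.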
