hadoop-common-user mailing list archives

From Teodor Macicas <teodor.maci...@epfl.ch>
Subject Re: [HADOOP] Terasort for numbers
Date Mon, 02 Aug 2010 21:43:25 GMT
Hi Alex,

Thank you again.
Yes, I'm also considering your first suggestion. But that only reduces
the problem from floating-point numbers to integers, and I don't know
how to use Terasort with integer keys either!
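
A minimal sketch of the first suggestion (scaling the keys to integers),
assuming every key is a decimal string with at most two decimal places.
The class and method names (ScaledKey, toScaled, fromScaled) are
hypothetical and are not part of Hadoop or Terasort; the sketch covers
only the key conversion, not the changes Terasort itself would need.

```java
import java.math.BigDecimal;

// Hypothetical helper, not part of Hadoop: maps keys such as "123.45"
// to the integer 12345 and back, assuming at most two decimal places.
final class ScaledKey {
    private static final int DECIMALS = 2; // assumed fixed precision

    // Parses the text key exactly (no binary floating-point rounding) and
    // shifts the decimal point right. Throws ArithmeticException if the
    // key has more than DECIMALS decimal places.
    static long toScaled(String key) {
        return new BigDecimal(key.trim()).movePointRight(DECIMALS).longValueExact();
    }

    // Inverse mapping, used when writing the sorted output.
    static String fromScaled(long scaled) {
        return BigDecimal.valueOf(scaled, DECIMALS).toPlainString();
    }

    public static void main(String[] args) {
        for (String key : new String[] {"123.45", "-34.56", "752.10", "10.25"}) {
            long scaled = toScaled(key);
            System.out.println(key + " -> " + scaled + " -> " + fromScaled(scaled));
        }
    }
}
```

Parsing the text with BigDecimal rather than multiplying a double by 100
avoids rounding surprises for values that have no exact binary
representation. The resulting long could then be carried in an integer
key type such as LongWritable.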

I tried using the generic TotalOrderPartitioner instead of the one
nested in the Terasort class, but I got a lot of errors [0]. I also
tried modifying TeraInputFormat and TeraOutputFormat (and all their
nested classes), but I kept getting errors.

Now, it's not clear to me what I have to change to make your second
solution work. Moreover, I was unable to find a generic MR in my
Hadoop 0.20.2 installation.
I'd prefer the first solution, so could you please give me some tips
on how to use Terasort with integers?

P.S.: I used a trick with fixed-length char keys, and the program
worked for this kind of workload [1]. I think using integer keys
instead of this trick would be faster.

[0] java.io.IOException: wrong key class: 
org.apache.hadoop.io.DoubleWritable is not class org.apache.hadoop.io.Text

[1] it worked for this:
0000123.45 payload1
0005120.55 payload2
0000003.77 payload3
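
The zero-padded keys in [1] compare correctly as raw bytes only while
all keys are non-negative; negative keys such as -34.56 would sort in
reversed order among themselves. A sketch of one possible alternative,
assuming keys are compared as unsigned bytes: encode each double as 8
big-endian bytes whose byte order matches numeric order. The class and
method names (SortableDoubleKey, encode, decode, compareRaw) are
hypothetical, and this is a commonly used bit trick for IEEE 754
values, not something Terasort provides.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop: order-preserving byte keys for doubles.
final class SortableDoubleKey {

    // Encodes a double as 8 big-endian bytes whose unsigned byte-wise
    // order matches Double.compare order.
    static byte[] encode(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative values: flip every bit, so larger magnitudes sort first.
        // Non-negative values: flip only the sign bit, so they sort after negatives.
        bits ^= (bits >> 63) | Long.MIN_VALUE;
        return ByteBuffer.allocate(8).putLong(bits).array();
    }

    // Reverses encode().
    static double decode(byte[] key) {
        long bits = ByteBuffer.wrap(key).getLong();
        // A set top bit means the original value was non-negative.
        bits ^= (~bits >> 63) | Long.MIN_VALUE;
        return Double.longBitsToDouble(bits);
    }

    // Unsigned lexicographic comparison, standing in for a raw byte comparator.
    static int compareRaw(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        double[] workload = {123.45, -34.56, 752.10, 10.25};
        byte[][] keys = new byte[workload.length][];
        for (int i = 0; i < workload.length; i++) {
            keys[i] = encode(workload[i]);
        }
        Arrays.sort(keys, SortableDoubleKey::compareRaw);
        for (byte[] key : keys) {
            System.out.println(decode(key)); // ascending numeric order
        }
    }
}
```

Because the encoding is a reversible bit transformation, decoding
returns exactly the original double. Fitting such a key into
Terasort's 10-byte key field (for example by padding) is left as an
assumption and is not shown here.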


On 08/02/2010 07:41 PM, Alex Kozlov wrote:
> Hi Teodor,
> I see the problem now:  There is no simple binary comparator for
> DoubleWritable<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>.
> So you can do 2 things:
> 1. Convert your doubles to ints (or bytes), say if the precision is always 2
> decimal points, represent the number as 100 x double:  The problem is
> reduced to sorting integers then.
> 2. Use DoubleWritable as the key and payload as value.  You can use generic
> TotalOrderPartitioner<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>
> which does not use tries.  You also can just use a generic MR with
> DoubleWritable keys: MR will sort the key for you with identity mapper and
> identity reducer.
> Option 2 is slightly less efficient since the code will need to call
> Double.longBitsToDouble each time, but I don't see an easy way to avoid this
> with the IEEE 754 encoding.
> Alex K
> On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas <teodor.macicas@epfl.ch> wrote:
>> Hi Alex,
>> Thank you for your quick reply and sorry for not being so clear.
>> The job I want to do is simply to sort data having numbers [doubles] as
>> keys [0]. I noticed that Terasort uses a 10-byte char key. How can I use
>> this for my particular job?
>> Do I need to change Terasort?
>> [0] example of workload:
>> 123.45    payload1
>> -34.56     payload2
>> 752.10    payload3
>> 10.25      payload4
>> ....
>> Does this make sense now ?
>> Regards,
>> Teodor
>> On 08/02/2010 12:14 AM, Alex Kozlov wrote:
>>> Hi Teodor,
>>> I am not clear what you call 'real numbers'.  Terasort does work on bytes
>>> (10 bytes key and 90 bytes payload).  The actual 'meaning' of the bytes
>>> really does not matter as Hadoop uses binary comparators on the raw value.
>>> Total order partitioning should also work with any  WritableComparable key
>>> (if it doesn't, it's a bug).
>>> My guess is that your problem is converting a char trie to
>>> WritableComparable.  Can you provide more background?  Are the strings
>>> of fixed length?
>>> Alex K
>>> On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas <teodor.macicas@epfl.ch> wrote:
>>>> Hi all,
>>>> I am using Hadoop 0.20.2 and I want to sort a huge amount of data. I've
>>>> read about Terasort [from examples], but it uses 10-byte char keys.
>>>> Changing the keys from char to integer wasn't a good solution, as
>>>> Terasort builds a trie for creating total order partitions. I got stuck
>>>> when I tried to change the char trie to one suitable for number keys.
>>>> Then I gave Sort [also from examples] a try, and it did work for
>>>> integer keys, but without total order partitioning. At the end of the
>>>> day, the final result cannot be created just by putting together all
>>>> the reducers' outputs. Each reducer sorts only a subset of the data and
>>>> no merging occurs between two reducers.
>>>> Could anyone please advise me on what to use, and how, to sort a huge
>>>> amount of real numbers?
>>>> Looking forward to your replies.
>>>> Thank you.
>>>> Best,
>>>> Teodor
