Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <4C574986.5080903@epfl.ch>
Date: Tue, 03 Aug 2010 00:41:10 +0200
From: Teodor Macicas <teodor.macicas@epfl.ch>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US;
 rv:1.9.1.9) Gecko/20100423 Thunderbird/3.0.4
MIME-Version: 1.0
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Subject: Re: [HADOOP] Terasort for numbers
References: <4C55E5DE.5090100@epfl.ch>
	<AANLkTikdmDU_bkoRjsu6znOxYyRQD1rzTENEo5q4zQ8O@mail.gmail.com>
	<4C568F1E.9030401@epfl.ch>
	<AANLkTinz-oiowX5LgBZXgSTtD0DW779RVDeWxkdU3c2U@mail.gmail.com>
	<4C573BFD.4000503@epfl.ch>
 <AANLkTinNx673tDEjmU5QEoMdChActn-XWeQLJXLxMxRU@mail.gmail.com>
In-Reply-To: <AANLkTinNx673tDEjmU5QEoMdChActn-XWeQLJXLxMxRU@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hi Alex,

Why do you suggest using SequenceFiles? That implies changing the
TeraInputFormat class, right?

Your second approach is similar to the Sort example from Hadoop. Its
disadvantage is that I don't get total order partitioning, so more
operations are necessary to create the final result.

Regards,
Teodor

On 08/03/2010 12:21 AM, Alex Kozlov wrote:
> Hi Teodor,
>
> Certainly org.apache.hadoop.io.DoubleWritable and org.apache.hadoop.io.Text
> are different classes.  For approach (1), which I suggested, you just need
> to construct a byte[10] array from an integer, create a new Text(byte[])
> from it, and write it together with the value to a sequence file.
>
> Since TeraSort was created specifically for benchmarking, I think it might
> make sense for you to start with approach (2).  Just create a
> SequenceFile<DoubleWritable,Text> file with your <key,value> data and run a
> simple MR job with an identity mapper and identity reducer.  I can send you
> an example of MR code, but there are plenty out there
> <http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html>.
> One of them is TeraSort.java:run() itself, but you may want to use the new
> mapreduce API
> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html>.
> Once you are comfortable with the MR framework, you can optimize it further.
>
> Another good source of information is Tom White's 'Hadoop: The Definitive
> Guide', particularly on the TotalOrderPartitioner.
>
> Let me know if you have any further questions.
>
> Alex K
>
> On Mon, Aug 2, 2010 at 2:43 PM, Teodor Macicas <teodor.macicas@epfl.ch> wrote:
>
>    
>> Hi Alex,
>>
>> Thank you again.
>> Yes, I'm also considering your first suggestion. But that only reduces
>> the problem from floating-point numbers to integers, and I do not know
>> how to use Terasort with integer keys either!
>>
>> I tried to use the generic TotalOrderPartitioner instead of the one
>> nested in the Terasort class, but I got a lot of errors [0]. I also tried
>> modifying TeraInputFormat and TeraOutputFormat (and all nested classes),
>> and I kept getting errors.
>>
>> Now it's not clear to me what I have to change to make your second
>> solution work. Moreover, I was unable to find a generic MR in my
>> hadoop 0.20.2 version.
>> I'd prefer the first solution, so can you please give me some tips on how
>> to use Terasort for integers?
>>
>> P.S.: I used a trick with fixed-length char keys, and the program worked
>> for this kind of workload [1]. I think using integer keys instead of this
>> trick would be faster.
>>
>> [0] java.io.IOException: wrong key class:
>> org.apache.hadoop.io.DoubleWritable is not class org.apache.hadoop.io.Text
>>
>> [1] it worked for this:
>> 0000123.45 payload1
>> 0005120.55 payload2
>> 0000003.77 payload3
>> ...
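
For illustration, keys shaped like the ones in [1] could be produced as
sketched below. This is only a guess at the trick, not the code that was
actually used; the class name PaddedDecimalKey and its method are
hypothetical. It zero-pads to 10 characters with two decimals, which sorts
correctly as text only for non-negative values, so negatives are rejected.

```java
import java.util.Locale;

final class PaddedDecimalKey {
    /** Formats a non-negative value as a 10-character, zero-padded key with two decimals. */
    static String format(double value) {
        if (!(value >= 0)) { // also rejects NaN
            throw new IllegalArgumentException("only non-negative values sort correctly: " + value);
        }
        // Locale.ROOT keeps '.' as the decimal separator on every machine.
        String key = String.format(Locale.ROOT, "%010.2f", value);
        if (key.length() != 10) {
            throw new IllegalArgumentException("value does not fit in 10 characters: " + value);
        }
        return key;
    }

    public static void main(String[] args) {
        System.out.println(format(123.45) + " payload1");
        System.out.println(format(5120.55) + " payload2");
        System.out.println(format(3.77) + " payload3");
    }
}
```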
>>
>> Best,
>> Teodor
>>
>>
>> On 08/02/2010 07:41 PM, Alex Kozlov wrote:
>>
>>      
>>> Hi Teodor,
>>>
>>> I see the problem now:  There is no simple binary comparator for
>>> DoubleWritable
>>> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>.
>>>
>>> So you can do 2 things:
>>>
>>> 1. Convert your doubles to ints (or bytes): say, if the precision is
>>> always 2 decimal places, represent the number as 100 x double.  The
>>> problem is then reduced to sorting integers.
>>>
>>> 2. Use DoubleWritable as the key and the payload as the value.  You can
>>> use the generic TotalOrderPartitioner
>>> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>,
>>> which does not use tries.  You can also just use a generic MR job with
>>> DoubleWritable keys: MR will sort the keys for you with an identity
>>> mapper and identity reducer.
>>>
>>> Option 2 is slightly less efficient since the code will need to call
>>> Double.longBitsToDouble each time, but I don't see an easy way to avoid
>>> this with the IEEE 754 encoding.
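
One bit-level transform that is sometimes used to make IEEE 754 doubles
comparable as raw bytes is sketched below. It was not proposed in this
thread, and whether it would actually pay off in a Hadoop job is an
assumption that would need measuring. The class name SortableDouble and its
methods are hypothetical. The idea: for negative values flip the 63
non-sign bits, then flip the sign bit and write the result big-endian, so
that unsigned byte-wise comparison agrees with numeric order.

```java
import java.nio.ByteBuffer;

final class SortableDouble {
    /** Maps a double to a long whose signed order matches the order of the doubles. */
    static long toSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // (bits >> 63) is all ones for negatives and zero otherwise, so only
        // negatives get their 63 non-sign bits inverted.
        return bits ^ ((bits >> 63) & Long.MAX_VALUE);
    }

    /** Reverses toSortableLong(); the transform is its own inverse. */
    static double fromSortableLong(long sortable) {
        return Double.longBitsToDouble(sortable ^ ((sortable >> 63) & Long.MAX_VALUE));
    }

    /** Eight big-endian bytes whose unsigned byte-wise order matches numeric order. */
    static byte[] toSortableBytes(double value) {
        // Flipping the sign bit turns signed long order into unsigned order.
        return ByteBuffer.allocate(8).putLong(toSortableLong(value) ^ Long.MIN_VALUE).array();
    }

    /** Unsigned lexicographic comparison, the way a raw byte comparator would do it. */
    static int compareRaw(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] low = toSortableBytes(-34.56);
        byte[] high = toSortableBytes(10.25);
        System.out.println(compareRaw(low, high) < 0); // true: -34.56 sorts before 10.25
    }
}
```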
>>>
>>> Alex K
>>>
>>> On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas <teodor.macicas@epfl.ch>
>>> wrote:
>>>
>>>> Hi Alex,
>>>>
>>>> Thank you for your quick reply, and sorry for not being clearer.
>>>> The job I want to do is simply to sort data that has numbers [doubles]
>>>> as keys [0]. I noticed that Terasort uses a 10-byte char key. How can I
>>>> use it for my particular job?
>>>> Do I need to change Terasort?
>>>>
>>>> [0] example of workload:
>>>> 123.45    payload1
>>>> -34.56     payload2
>>>> 752.10    payload3
>>>> 10.25      payload4
>>>> ....
>>>>
>>>> Does this make sense now ?
>>>>
>>>> Regards,
>>>> Teodor
>>>>
>>>>
>>>> On 08/02/2010 12:14 AM, Alex Kozlov wrote:
>>>>
>>>>
>>>>
>>>>          
>>>>> Hi Teodor,
>>>>>
>>>>> I am not clear what you call 'real numbers'.  Terasort does work on
>>>>> bytes (10-byte key and 90-byte payload).  The actual 'meaning' of the
>>>>> bytes really does not matter, as Hadoop uses binary comparators on the
>>>>> raw value.
>>>>>
>>>>> Total order partitioning should also work with any WritableComparable
>>>>> key (if it doesn't, it's a bug).
>>>>>
>>>>> My guess is that your problem is converting a char trie to
>>>>> WritableComparable.  Can you provide more background?  Are the strings
>>>>> of fixed length?
>>>>>
>>>>> Alex K
>>>>>
>>>>> On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas <teodor.macicas@epfl.ch>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>>
>>>>>> I am using hadoop 0.20.2 and I want to sort a huge amount of data.
>>>>>> I've read about Terasort [from examples], but it uses 10-byte char
>>>>>> keys. Changing the keys from char to integer wasn't a good solution,
>>>>>> as Terasort builds a trie for creating total order partitions. I got
>>>>>> stuck when I tried to change the char trie to one suitable for number
>>>>>> keys.
>>>>>>
>>>>>> Then I gave Sort [also from examples] a try, and it did work for
>>>>>> integer keys, but without total order partitioning. At the end of the
>>>>>> day, the final result cannot be created just by putting together all
>>>>>> the reducers' outputs. Each reducer sorts only a subset of the data,
>>>>>> and no merging occurs between reducers.
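
To illustrate what total order partitioning adds here, below is a rough
sketch of range partitioning in plain Java. It is not Hadoop's
TotalOrderPartitioner code; the class name RangePartitionSketch and its
methods are hypothetical, and it assumes a non-empty sample of the keys is
available. Split points are taken from the sorted sample, and each key goes
to the partition whose range contains it, so every key in partition i is
less than or equal to every key in partition i+1 and the sorted reducer
outputs can simply be concatenated.

```java
import java.util.Arrays;

final class RangePartitionSketch {
    /** Picks numPartitions - 1 roughly evenly spaced split points from a key sample. */
    static double[] splitPoints(double[] sample, int numPartitions) {
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        double[] splits = new double[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return splits;
    }

    /** Returns the partition for a key: the number of split points that are <= key. */
    static int partition(double key, double[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // A non-negative result is an exact hit; otherwise binarySearch
        // returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        double[] keys = {123.45, -34.56, 752.10, 10.25};
        double[] splits = splitPoints(keys, 2);
        for (double key : keys) {
            System.out.println(key + " -> partition " + partition(key, splits));
        }
    }
}
```

With hash partitioning, by contrast, neighbouring keys can land in different
reducers, which is why the outputs of the Sort example need an extra merge.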
>>>>>>
>>>>>> Can anyone please advise me on what to use, and how, in order to sort
>>>>>> a huge amount of real numbers?
>>>>>> Looking forward to your replies.
>>>>>>
>>>>>>
>>>>>> Thank you.
>>>>>> Best,
>>>>>> Teodor