From: Teodor Macicas
Date: Tue, 03 Aug 2010 00:41:10 +0200
To: "common-user@hadoop.apache.org"
Subject: Re: [HADOOP] Terasort for numbers
Message-ID: <4C574986.5080903@epfl.ch>
References: <4C55E5DE.5090100@epfl.ch> <4C568F1E.9030401@epfl.ch> <4C573BFD.4000503@epfl.ch>

Hi Alex,

Why are you suggesting
using SequenceFiles? That implies changing the TeraInputFormat class, right?
Your second approach is similar to the Sort example from Hadoop. The
disadvantage of using it is that I don't get total order partitioning, and
thus more operations are necessary to create the final result.

Regards,
Teodor

On 08/03/2010 12:21 AM, Alex Kozlov wrote:
> Hi Teodor,
>
> Certainly org.apache.hadoop.io.DoubleWritable and org.apache.hadoop.io.Text
> are different classes. For the approach (1) I suggested, you just need to
> construct a byte[10] array from an integer, create a new Text(byte[]), and
> write it together with the value to a sequence file.
>
> Since TeraSort was created specifically for benchmarking purposes, I
> think it might make sense for you to start with approach (2). Just
> create a SequenceFile file with your data and do a simple MR job with an
> identity mapper and identity reducer. I can send you an example of MR code,
> but there are plenty out there.
> One of them is TeraSort.java:run() itself, but you may want to use the new
> mapreduce API.
> Once you are comfortable with the MR framework, you can optimize it further.
>
> Another good source of information is Tom White's 'Hadoop: The Definitive
> Guide', particularly on the TotalOrderPartitioner.
>
> Let me know if you have any further questions.
>
> Alex K
>
> On Mon, Aug 2, 2010 at 2:43 PM, Teodor Macicas wrote:
>
>> Hi Alex,
>>
>> Thank you again.
>> Yes, I'm also thinking of your first suggestion. But that would only help
>> me 'reduce' the problem from floating point numbers to integers. But I also
>> do not know how to use Terasort for integer keys!
>>
>> I've tried to use the generic TotalOrderPartitioner instead of the one
>> nested in the Terasort class, but I received a lot of errors [0]. I tried to
>> modify TeraInputFormat and TeraOutputFormat (and all nested classes), and
>> I continued getting errors.
>>
>> Now, it's not clear to me what I have to change in order to make your
>> second solution work. Moreover, I was unable to find a generic MR in my
>> hadoop 0.20.2 version.
>> I'd prefer the first solution, so can you please give me some tips on how
>> to use Terasort for integers?
>>
>> p.s.: I used a trick with fixed-length char keys, and the program worked
>> for this kind of workload [1]. I think using integer keys instead of this
>> trick would be faster.
>>
>> [0] java.io.IOException: wrong key class:
>> org.apache.hadoop.io.DoubleWritable is not class org.apache.hadoop.io.Text
>>
>> [1] it worked for this:
>> 0000123.45 payload1
>> 0005120.55 payload2
>> 0000003.77 payload3
>> ...
>>
>> Best,
>> Teodor
>>
>>
>> On 08/02/2010 07:41 PM, Alex Kozlov wrote:
>>
>>> Hi Teodor,
>>>
>>> I see the problem now: There is no simple binary comparator for
>>> DoubleWritable
>>> <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>.
>>>
>>> So you can do 2 things:
>>>
>>> 1. Convert your doubles to ints (or bytes). Say, if the precision is
>>>    always 2 decimal points, represent the number as 100 x double: the
>>>    problem is then reduced to sorting integers.
>>>
>>> 2. Use DoubleWritable as the key and the payload as the value. You can use
>>>    the generic TotalOrderPartitioner
>>>    <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>,
>>>    which does not use tries. You can also just use a generic MR with
>>>    DoubleWritable keys: MR will sort the keys for you with an identity
>>>    mapper and identity reducer.
>>>
>>> Option 2 is slightly less efficient since the code will need to call
>>> Double.longBitsToDouble each time, but I don't see an easy way to avoid
>>> this with the IEEE 754 encoding.
>>>
>>> Alex K
>>>
>>> On Mon, Aug 2, 2010 at 2:25 AM, Teodor Macicas wrote:
>>>
>>>> Hi Alex,
>>>>
>>>> Thank you for your quick reply and sorry for not being so clear.
>>>> The job I want to do is simply to sort data having numbers [doubles] as
>>>> keys [0]. I noticed that Terasort uses a 10-byte char key. How can I use
>>>> this for my particular job?
>>>> Do I need to change Terasort?
>>>>
>>>> [0] example of workload:
>>>> 123.45 payload1
>>>> -34.56 payload2
>>>> 752.10 payload3
>>>> 10.25 payload4
>>>> ....
>>>>
>>>> Does this make sense now?
>>>>
>>>> Regards,
>>>> Teodor
>>>>
>>>>
>>>> On 08/02/2010 12:14 AM, Alex Kozlov wrote:
>>>>
>>>>> Hi Teodor,
>>>>>
>>>>> I am not clear what you call 'real numbers'. Terasort does work on
>>>>> bytes (10 bytes key and 90 bytes payload). The actual 'meaning' of the
>>>>> bytes really does not matter, as Hadoop uses binary comparators on the
>>>>> raw value.
>>>>>
>>>>> Total order partitioning should also work with any WritableComparable
>>>>> key (if it doesn't, it's a bug).
>>>>>
>>>>> My guess is that your problem is converting a char trie to
>>>>> WritableComparable. Can you provide more background? Are the strings of
>>>>> fixed length?
>>>>>
>>>>> Alex K
>>>>>
>>>>> On Sun, Aug 1, 2010 at 2:23 PM, Teodor Macicas wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>>
>>>>>> I am using hadoop 0.20.2 and I want to sort a huge amount of data.
>>>>>> I've read about Terasort [from examples], but it uses 10-byte char
>>>>>> keys.
>>>>>> Changing keys from char to integer wasn't a good solution, as Terasort
>>>>>> builds a trie for creating total order partitions. I got stuck when I
>>>>>> tried to change the char trie to one suitable for number keys.
>>>>>>
>>>>>> Then I gave Sort [also from examples] a try, and it did work for
>>>>>> integer keys, but without total order partitioning.
>>>>>> In the end, the final result cannot be created just by putting
>>>>>> together all reducers' outputs. Each reducer sorts only a subset of
>>>>>> the data, and no merging occurs between two reducers.
>>>>>>
>>>>>> Can anyone please advise me on what to use, and how, in order to sort
>>>>>> a huge amount of real numbers?
>>>>>> Looking forward to your replies.
>>>>>>
>>>>>>
>>>>>> Thank you.
>>>>>> Best,
>>>>>> Teodor
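
A note on the comparator problem discussed in this thread: one possible way
to get double keys that sort correctly under a raw byte comparator is to
remap the IEEE 754 bits of each double so that unsigned byte order matches
numeric order, including for negative keys such as -34.56. The sketch below
is not from the thread and uses only the JDK. The class name
SortableDoubleKey and its methods are hypothetical, and wiring the resulting
8-byte keys into TeraSort or a TotalOrderPartitioner is assumed to be done
separately.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper (not part of Hadoop): order-preserving byte encoding for doubles.
final class SortableDoubleKey {

    // Encodes a double into 8 big-endian bytes whose unsigned lexicographic
    // order matches the numeric order of the original values.
    static byte[] encode(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative values: flip every bit, so larger magnitudes sort first.
        // Non-negative values: flip only the sign bit, so they sort after negatives.
        bits ^= (bits >> 63) | Long.MIN_VALUE;
        return ByteBuffer.allocate(Long.BYTES).putLong(bits).array();
    }

    // Inverse of encode(): recovers the original double from the 8-byte key.
    static double decode(byte[] key) {
        long bits = ByteBuffer.wrap(key).getLong();
        // A set top bit means the value was non-negative: flip the sign bit back.
        // A clear top bit means it was negative: flip every bit back.
        bits ^= (~bits >> 63) | Long.MIN_VALUE;
        return Double.longBitsToDouble(bits);
    }

    // Unsigned lexicographic comparison, the ordering a raw byte comparator applies.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Sample keys taken from the workload quoted in the thread.
        double[] values = {123.45, -34.56, 752.10, 10.25};
        byte[][] keys = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            keys[i] = encode(values[i]);
        }
        // Sort on the raw bytes only, never looking at the doubles themselves.
        Arrays.sort(keys, SortableDoubleKey::compareUnsigned);
        double[] sorted = new double[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = decode(keys[i]);
        }
        System.out.println(Arrays.toString(sorted));
    }
}
```

With keys encoded this way, the comparison itself never has to call
Double.longBitsToDouble; decoding is only needed when the numeric value is
read back. Whether this is faster in practice than DoubleWritable keys would
need to be measured on the actual workload.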