Subject: RE: Stackoverflow
Date: Tue, 3 Jun 2008 12:00:49 -0700
From: "Runping Qi"

Chris,

Your version will use LongWritable as the map output key type, which
changes the nature of the job completely. You should use

${hadoop} jar hadoop-0.17-examples.jar sort -m \
  -r 88 \
  -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \

instead.

Runping

> -----Original Message-----
> From: Chris Douglas [mailto:chrisdo@yahoo-inc.com]
> Sent: Tuesday, June 03, 2008 11:35 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Stackoverflow
>
> >> By "not exactly small," do you mean each line is long or that there
> >> are many records?
> >
> > Well, not small in the sense that even if I could get my boss to
> > allow me to give you the data, transferring it might be painful.
> > (E.g. the job that aborted had about 12M lines with ~2.6GB data =>
> > the lines are not really long, but longer than 80 chars)
>
> Ah, I see. Would it be possible to run the Java sort example over
> your data? It would be helpful to verify that this is not specific to
> streaming.
>
> ${hadoop} jar hadoop-0.17-examples.jar sort -m \
>   -r 88 \
>   -inFormat org.apache.hadoop.mapred.TextInputFormat \
>   -outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
>   -outKey org.apache.hadoop.io.LongWritable \
>   -outValue org.apache.hadoop.io.Text \
>
> This should be close to streaming with cat as the mapper.
>
> >> util.QuickSort is only used on the map side, so this shouldn't have
> >> anything to do with the reduce. Is it always and only the *last* map
> >
> > Nope, although sometimes it happens earlier.
>
> Is it always the same splits when you re-run your job? Though
> distributing the full dataset may not be feasible, if there are
> splits that fail consistently then we might be able to work from that.
>
> >> task that fails? If I sent you a patch that would print a trace with
> >> the partitions, would you mind running it? Do you have any other
> >> settings that differ from the defaults? -C
> >
> > If you tell me how to apply it, I'm happy to. (I'm not the biggest
> > Java hotshot on this planet, I'm just using the provided 0.17.0 jars.
> > Guess I would have to patch the source and run ant. On all nodes or
> > just the control?)
>
> Unfortunately, it would need to be deployed to all the TaskTrackers,
> and it would be pretty invasive (i.e. I was planning on logging all
> the offsets from the sort as the stack unwinds from the exception).
> I'll test something and send it to you, and if it's not too much
> trouble you can try it.
>
> > My hadoop-site.xml:
> > [snip]
>
> Nothing suspect, there. -C
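
The key-type difference between the two sort commands in this thread comes from how each input format turns a line of text into a record. The following is a minimal Python sketch, not Hadoop code: the function names are hypothetical, and it only mimics the documented behavior of TextInputFormat (key is the byte offset of the line, value is the line) and KeyValueTextInputFormat (line split at the first tab, assuming the default separator).

```python
def text_input_records(data: bytes):
    """Mimic TextInputFormat: key = byte offset of the line, value = the line."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # the next key is the offset where the next line starts


def key_value_text_records(data: bytes, sep: str = "\t"):
    """Mimic KeyValueTextInputFormat: split each line at the first separator.

    A line without the separator becomes the key, with an empty value.
    """
    for raw in data.splitlines():
        key, _, value = raw.decode("utf-8").partition(sep)
        yield key, value


if __name__ == "__main__":
    sample = b"apple\t1\nbanana\t2\ncherry\n"
    # Numeric offsets as keys: the sort would compare numbers, not text.
    print(list(text_input_records(sample)))
    # Text keys: the sort would compare the line prefixes themselves.
    print(list(key_value_text_records(sample)))
```

With offset keys the map output is already in ascending key order within a split, so the sort sees a very different workload than it does with text keys taken from the data.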