From: Jason Wang <jason.j.wang@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 18 Oct 2012 00:24:44 -0500
Subject: Re: hadoop streaming with custom RecordReader class

1. I did try using NLineInputFormat, but this causes "stream.map.input.ignoreKey" to no longer work. As per the streaming documentation: "The configuration parameter is valid only if stream.map.input.writer.class is org.apache.hadoop.streaming.io.TextInputWriter.class."

My mapper prefers the streaming stdin to not have the key as part of the input. I could obviously parse that out in the mapper, but the mapper belongs to a third party. This is why I tried the RecordReader route.

2. Yes, I did export the classpath before running.

3. This may be the problem:

bash-3.2$ jar -tf NLineRecordReader.jar
META-INF/
META-INF/MANIFEST.MF
NLineRecordReader.class

I have specified "package mypackage;" at the top of the java file, though. Then compiled using "javac" and then "jar cf".

4. The class is public.

On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <harsh@cloudera.com> wrote:
> Hi Jason,
>
> A few questions (in order):
>
> 1. Does Hadoop's own NLineInputFormat not suffice?
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>
> 2. Do you make sure to pass your jar into the front-end too?
>
> $ export HADOOP_CLASSPATH=/path/to/your/jar
> $ command…
>
> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>
> 4. Is your class marked public?
>
> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <jason.j.wang@gmail.com> wrote:
> > Hi all,
> > I'm experimenting with hadoop streaming on build 1.0.3.
> >
> > To give background info, I'm streaming a text file into a mapper written in C.
> > Using the default settings, streaming uses TextInputFormat, which creates one
> > record from each line. The problem I am having is that I need record
> > boundaries to be every 4 lines. When the splitter breaks up the input into
> > the mappers, I have partial records on the boundaries due to this. To
> > address this, my approach was to write a new RecordReader class in Java
> > that is almost identical to LineRecordReader, but with a modified
> > next() method that reads 4 lines instead of one.
> >
> > I then compiled the new class and created a jar. I wanted to import this at
> > run time using the -libjars argument, like so:
> >
> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
> > NLineRecordReader.jar -files test_stream.sh -inputreader
> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
> >
> > Unfortunately, I keep getting the following error:
> > -inputreader: class not found: mypackage.NLineRecordReader
> >
> > My question is twofold. Am I using the right approach to handle the 4-line
> > records with the custom RecordReader implementation? And why isn't -libjars
> > working to include my class in hadoop streaming at runtime?
> >
> > Thanks,
> > Jason
>
> --
> Harsh J
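[Archive note] The 4-lines-per-record shape the thread is after can be prototyped outside Hadoop with coreutils, which is handy for exercising the C mapper while the custom RecordReader is being debugged. A sketch (the tab-joined record format is an illustrative choice, not what LineRecordReader emits):

```shell
# paste with four "-" arguments consumes stdin four lines at a time,
# joining each group into one tab-delimited record on a single line.
printf 'a\nb\nc\nd\ne\nf\ng\nh\n' | paste -d '\t' - - - -
# prints two records:
# a	b	c	d
# e	f	g	h
```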