From: Harsh J <harsh@cloudera.com>
Date: Thu, 18 Oct 2012 10:23:08 +0530
Subject: Re: hadoop streaming with custom RecordReader class
To: user@hadoop.apache.org

Hi Jason,

A few questions (in order):

1. Does Hadoop's own NLineInputFormat not suffice? (Command sketch after
this list.)
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html

2. Do you make sure to pass your jar into the front-end too?

$ export HADOOP_CLASSPATH=/path/to/your/jar
$ command…

3. Does jar -tf carry a proper mypackage.NLineRecordReader? (A quick check
is sketched below.)

4. Is your class marked public?
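For (1), a rough sketch of what that could look like with streaming. I am
assuming the old-API knob mapred.line.input.format.linespermap here, and
that the -D generic option goes before the streaming options; please verify
both against your 1.0.3 build:

$ hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar \
    -D mapred.line.input.format.linespermap=4 \
    -files test_stream.sh \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -input /Users/hadoop/test/test.txt \
    -output /Users/hadoop/test/output \
    -mapper test_stream.sh \
    -reducer NONE

Each map task then sees exactly 4 lines on stdin, so a record can never
straddle a split. The trade-off is that linespermap=4 also means one map
task per record, which may be far too many tasks for a large input.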
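For (3), a quick check, assuming the jar name from your command below:

$ jar -tf NLineRecordReader.jar | grep NLineRecordReader

You should see an entry like mypackage/NLineRecordReader.class; if the path
in the jar differs, the package declaration and the name you pass to
-inputreader don't match.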
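On (4) and the reader approach generally, here is a minimal sketch of the
idea against the old mapred API — my guess at the shape, not your code. Two
caveats: streaming's -inputreader machinery may additionally expect the
class to extend org.apache.hadoop.streaming.StreamBaseRecordReader, the way
StreamXmlRecordReader does (worth checking in the streaming sources), and
grouping lines inside the reader alone does not guarantee the groups of
four stay aligned across split boundaries — which is why (1) may be the
simpler route:

package mypackage;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

// Must be public, or the framework cannot instantiate it reflectively.
public class NLineRecordReader implements RecordReader<LongWritable, Text> {

  private final LineRecordReader lines;
  private final Text line = new Text();

  public NLineRecordReader(JobConf conf, FileSplit split) throws IOException {
    lines = new LineRecordReader(conf, split);
  }

  // Concatenate up to four underlying lines into one record value.
  // The key ends up at the byte offset of the last line in the group.
  public boolean next(LongWritable key, Text value) throws IOException {
    value.clear();
    for (int i = 0; i < 4; i++) {
      if (!lines.next(key, line)) {
        return i > 0; // emit a short final record rather than dropping it
      }
      if (i > 0) {
        value.append("\n".getBytes(), 0, 1);
      }
      value.append(line.getBytes(), 0, line.getLength());
    }
    return true;
  }

  public LongWritable createKey() { return lines.createKey(); }
  public Text createValue() { return lines.createValue(); }
  public long getPos() throws IOException { return lines.getPos(); }
  public float getProgress() throws IOException { return lines.getProgress(); }
  public void close() throws IOException { lines.close(); }
}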
On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang wrote:
> Hi all,
> I'm experimenting with hadoop streaming on build 1.0.3.
>
> To give background info, I'm streaming a text file into a mapper written
> in C. Using the default settings, streaming uses TextInputFormat, which
> creates one record from each line. The problem I am having is that I need
> record boundaries to fall every 4 lines. When the splitter breaks up the
> input into the mappers, I get partial records at the boundaries because of
> this. To address this, my approach was to write a new RecordReader class
> in Java that is almost identical to LineRecordReader, but with a modified
> next() method that reads 4 lines instead of one.
>
> I then compiled the new class and created a jar. I wanted to import this
> at run time using the -libjars argument, like so:
>
> hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
> NLineRecordReader.jar -files test_stream.sh -inputreader
> mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
> /Users/hadoop/test/output -mapper “test_stream.sh” -reducer NONE
>
> Unfortunately, I keep getting the following error:
> -inputreader: class not found: mypackage.NLineRecordReader
>
> My question is twofold. Am I using the right approach to handle the
> 4-line records with the custom RecordReader implementation? And why isn't
> -libjars working to include my class in hadoop streaming at runtime?
>
> Thanks,
> Jason

-- 
Harsh J