hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: hadoop streaming with custom RecordReader class
Date Thu, 18 Oct 2012 05:58:41 GMT
Also, consider using Maven for this kind of development; it helps build
sane jars automatically :)

On Thu, Oct 18, 2012 at 11:28 AM, Harsh J <harsh@cloudera.com> wrote:
> (3)'s your problem for sure.
>
> Try this:
>
> mkdir mypackage
> mv <class file> mypackage/
> jar cvf NLineRecordReader.jar mypackage
> [Use this jar]
>
> On Thu, Oct 18, 2012 at 10:54 AM, Jason Wang <jason.j.wang@gmail.com> wrote:
>> 1. I did try using NLineInputFormat, but this causes the
>> "stream.map.input.ignoreKey" to no longer work.  As per the streaming
>> documentation:
>>
>> "The configuration parameter is valid only if stream.map.input.writer.class
>> is org.apache.hadoop.streaming.io.TextInputWriter.class."
>>
>> My mapper prefers the streaming stdin to not have the key as part of the
>> input.  I could obviously parse that out in the mapper, but the mapper
>> belongs to a 3rd party. This is why I tried to do the RecordReader route.
>>
>> 2. Yes - I did export the classpath before running.
>>
>> 3. This may be the problem:
>>
>> bash-3.2$ jar -tf NLineRecordReader.jar
>> META-INF/
>> META-INF/MANIFEST.MF
>> NLineRecordReader.class
>>
>> I have specified "package mypackage;" at the top of the java file though.
>> Then compiled using "javac" and then "jar cf".
>>
>> 4. The class is public.
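
The `jar -tf` listing in point 3 shows the class at the archive root, but a class declared `package mypackage;` must live at `mypackage/NLineRecordReader.class` inside the jar for `mypackage.NLineRecordReader` to resolve. A minimal stdlib sketch of that layout rule (the class `JarLayoutCheck` and its method names are illustrative, not part of this thread) that builds a sample jar with the correct entry path and lists its entries, equivalent to `jar -tf`:

```java
import java.io.*;
import java.util.*;
import java.util.jar.*;

public class JarLayoutCheck {
    // Create a throwaway jar whose single entry mirrors the package path.
    static File buildSample() throws IOException {
        File jar = File.createTempFile("NLineRecordReader", ".jar");
        jar.deleteOnExit();
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar))) {
            // A class declared "package mypackage;" must sit at this path
            // inside the jar for mypackage.NLineRecordReader to resolve.
            out.putNextEntry(new JarEntry("mypackage/NLineRecordReader.class"));
            out.closeEntry();
        }
        return jar;
    }

    // Equivalent of `jar -tf`: list every entry name in the jar.
    static List<String> entries(File jar) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jf = new JarFile(jar)) {
            for (Enumeration<JarEntry> e = jf.entries(); e.hasMoreElements(); ) {
                names.add(e.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : entries(buildSample())) {
            System.out.println(name);
        }
    }
}
```

Compiling with `javac -d <outdir>` creates the `mypackage/` directory automatically, so jarring `<outdir>` produces the right entry paths without moving class files by hand.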
>>
>>
>>
>> On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <harsh@cloudera.com> wrote:
>>>
>>> Hi Jason,
>>>
>>> A few questions (in order):
>>>
>>> 1. Does Hadoop's own NLineInputFormat not suffice?
>>>
>>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>>
>>> 2. Do you make sure to pass your jar into the front-end too?
>>>
>>> $ export HADOOP_CLASSPATH=/path/to/your/jar
>>> $ command…
>>>
>>> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>>>
>>> 4. Is your class marked public?
>>>
>>> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <jason.j.wang@gmail.com>
>>> wrote:
>>> > Hi all,
>>> > I'm experimenting with hadoop streaming on build 1.0.3.
>>> >
>>> > To give background info, I'm streaming a text file into a mapper
>>> > written in C.
>>> > Using the default settings, streaming uses TextInputFormat which creates
>>> > one
>>> > record from each line.  The problem I am having is that I need record
>>> > boundaries to be every 4 lines.  When the splitter breaks up the input
>>> > into
>>> > the mappers, I have partial records on the boundaries due to this.  To
>>> > address this, my approach was to write a new RecordReader class in
>>> > Java that is almost identical to LineRecordReader, but with a
>>> > modified next() method that reads 4 lines instead of one.
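
The grouping logic of the modified next() described above can be sketched with plain stdlib I/O. This shows only the 4-lines-per-record idea, not the actual Hadoop RecordReader interface; the class and method names here are hypothetical:

```java
import java.io.*;

public class FourLineGrouper {
    // Read up to 4 lines from the reader and join them into one record,
    // mirroring a next() that consumes 4 lines instead of one.
    static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            String line = in.readLine();
            if (line == null) break;          // end of input (or split)
            if (record.length() > 0) record.append('\n');
            record.append(line);
        }
        return record.length() == 0 ? null : record.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
            new StringReader("a\nb\nc\nd\ne\nf\ng\nh"));
        String rec;
        while ((rec = nextRecord(in)) != null) {
            // Print each 4-line record on one line for readability.
            System.out.println(rec.replace('\n', '|'));
        }
    }
}
```

A real RecordReader would also need to handle split boundaries so a 4-line record is never cut in half; that bookkeeping is omitted here.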
>>> >
>>> > I then compiled the new class and created a jar.  I wanted to load
>>> > this at run time using the -libjars argument, like so:
>>> >
>>> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
>>> > NLineRecordReader.jar -files test_stream.sh -inputreader
>>> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
>>> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
>>> >
>>> > Unfortunately, I keep getting the following error:
>>> > -inputreader: class not found: mypackage.NLineRecordReader
>>> >
>>> > My question is twofold.  Am I using the right approach to handle the
>>> > 4-line records with the custom RecordReader implementation?  And why
>>> > isn't -libjars working to include my class in hadoop streaming at
>>> > runtime?
>>> >
>>> > Thanks,
>>> > Jason
>>>
>>>
>>>
>>> --
>>> Harsh J
>>
>>
>
>
>
> --
> Harsh J



-- 
Harsh J
