hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Hadoop-2438
Date Tue, 22 Jan 2008 22:04:41 GMT


Streaming has some real conceptual confusions awaiting the unwary.

For instance, if you want to count the occurrences of each distinct line, a
correct implementation is this:

    stream -mapper cat -reducer 'uniq -c'

(stream is an alias I use to avoid typing hadoop jar ....)
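For anyone without such an alias, the full invocation looks roughly like
this (the streaming jar location and the input/output paths are placeholders
that vary by installation and version):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input /user/me/lines -output /user/me/line-counts \
        -mapper cat -reducer 'uniq -c'

This works because the framework sorts the map output before handing it to
the reducer, so identical lines arrive adjacent and uniq -c can count them.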

It is tempting, though very dangerous, to do

    stream -mapper 'sort | uniq -c' -reducer '...add up counts...'

But this doesn't work right, because the mapper isn't supposed to produce
output after the last input line.  (It also tends not to work due to quoting
issues, but we can ignore that for the moment.)  A similar confusion occurs
when the mapper exits, even normally.  Take the following program:

    stream -mapper 'head -10' -reducer '...whatever...'

Here the mapper acts like the identity mapper for the first ten input
records and then exits.  According to the implicit contract, it should
instead stick around, accept all subsequent inputs, and produce no further
output.
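If you really do want head-like behavior, one workaround (a sketch under the
assumptions above, not something battle-tested) is to wrap the command so
that it honors the contract by draining the rest of its input:

    #!/bin/sh
    # hypothetical wrapper to pass as -mapper: emit the first ten
    # records, then keep consuming (and discarding) the remaining
    # input so the Java side never writes into a closed pipe
    head -10
    cat > /dev/null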

The need for a fairly deep understanding of how Hadoop works, and of how
normal shell-processing idioms have to be modified, makes streaming a pretty
tricky thing to use, especially for the map-reduce novice.

I don't think that this problem can be easily corrected since it is due to a
fairly fundamental mismatch between shell programming tradition and what a
mapper or reducer is.


On 1/22/08 8:48 AM, "Joydeep Sen Sarma" <jssarma@facebook.com> wrote:

>> My guess is that this is something to do with caching / buffering, since I
>> presume that when the Stream mapper has real work to do, the associated Java
>> streamer buffers input until the Mapper signals that it can process more
>> data.  If the Mapper is busy, then a lot of data would get cached, causing
>> some internal buffer to overflow.
> 
> Unlikely.  The Java buffer would be a fixed size; it would write to a Unix
> pipe periodically.  If the streaming mapper is not consuming data, the Java
> side would quickly become blocked writing to this pipe.
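> 
> You can see the same failure mode locally (a bash sketch; 141 is 128 plus
> SIGPIPE):
> 
>     yes | head -1           # head exits after one line of output
>     echo ${PIPESTATUS[0]}   # bash: the writer died of SIGPIPE -> 141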
> 
> The broken-pipe case is extremely common and just tells you that the mapper
> died.  The best thing to do is to find the stderr log for the task (from the
> jobtracker UI) and see whether the mapper left anything there before dying.
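> 
> If you would rather look on the node itself, the per-task logs usually sit
> on the tasktracker's local disk; the exact layout and task id below are
> illustrative and vary by version:
> 
>     less $HADOOP_LOG_DIR/userlogs/task_200801221412_0001_m_000000_0/stderr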
> 
> 
> If any streaming gurus are reading this, I am curious about one unrelated
> thing: the Java map task does a flush() on the buffered output stream to the
> streaming mapper after every input line.  That seemed like unnecessary
> overhead to me; I was curious why (there must be some rationale).
> 
> 
> 
> -----Original Message-----
> From: milesosb@gmail.com on behalf of Miles Osborne
> Sent: Tue 1/22/2008 6:26 AM
> To: hadoop-user@lucene.apache.org
> Subject: Hadoop-2438
>  
> Has there been any progress / a work-around for this?
> 
> Currently I'm experimenting with Streaming and I've encountered what looks
> like the same problem as described here:
> 
> https://issues.apache.org/jira/browse/HADOOP-2438
> 
> So, I get much the same errors (see below).
> 
> For this particular task, when I replace the mappers and reducers with the
> identity operation (i.e. just pass the data through), all is well.  When
> instead I try to do something more taxing (in this case, gathering together
> all ngrams with the same prefix), I get these errors.
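> 
> A minimal sketch of the kind of mapper I mean (hypothetical, not the exact
> job): emit each ngram's first token as the key, so the shuffle delivers all
> ngrams sharing that prefix to a single reducer:
> 
>     #!/bin/sh
>     # map.sh: key each ngram line by its first token
>     awk '{ print $1 "\t" $0 }'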
> 
> My guess is that this is something to do with caching / buffering, since I
> presume that when the Stream mapper has real work to do, the associated Java
> streamer buffers input until the Mapper signals that it can process more
> data.  If the Mapper is busy, then a lot of data would get cached, causing
> some internal buffer to overflow.
> 
> Miles
> 
> 
> Date: Tue Jan 22 14:12:28 GMT 2008
> java.io.IOException: Broken pipe
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:260)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> at java.io.DataOutputStream.flush(DataOutputStream.java:106)
> at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> 
> 
> at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> 
> java.io.IOException: MROutput/MRErrThread
> failed:java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2786)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
> at java.io.DataOutputStream.write(DataOutputStream.java:90)
> at org.apache.hadoop.io.Text.write(Text.java:243)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
> at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:344)
> 
> at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:76)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> 
> java.io.IOException: MROutput/MRErrThread
> failed:java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2786)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
> at java.io.DataOutputStream.write(DataOutputStream.java:90)
> at org.apache.hadoop.io.Text.write(Text.java:243)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
> at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:344)
> 
> at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:76)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> 

