hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Hadoop-2438
Date Tue, 22 Jan 2008 22:44:05 GMT

I would love to talk more off-line about our efforts in this regard.

I will send you email.


On 1/22/08 2:21 PM, "Miles Osborne" <miles@inf.ed.ac.uk> wrote:

> In my case, I'm using actual mappers and reducers, rather than shell script
> commands.  I've also used Map-Reduce at Google when I was on sabbatical
> there in 2006.
> 
> That aside, I do take your point -- you need to have a good grip on what Map
> Reduce does to understand some of the challenges.  Here at Edinburgh I'm
> leading a little push to start doing some of our core research within this
> environment.  As a starter, I'm looking at the simple task of estimating
> large n-gram based language models using M-R (think 5-grams and upwards from
> lots of web data).  We are also about to look at core machine learning,
> such as EM, within this framework.  So, lots of fun and games ... and for
> me, it is quite nice doing this kind of thing.  A good break from the
> usual research.
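> 
> (For concreteness, a minimal sketch of the counting step, borrowing Ted's
> "stream" alias from below; the mapper script is hypothetical and assumes
> whitespace-tokenised text:)
> 
>     #!/bin/sh
>     # fivegrams.sh (hypothetical mapper): emit one 5-gram per output line
>     awk '{ for (i = 1; i + 4 <= NF; i++) {
>              g = $i
>              for (j = 1; j <= 4; j++) g = g " " $(i + j)
>              print g
>            } }'
> 
>     # the shuffle sorts the 5-grams, so the reducer just counts runs;
>     # -file ships the script to the task nodes
>     stream -file fivegrams.sh -mapper fivegrams.sh -reducer 'uniq -c'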
> 
> Miles
> 
> On 22/01/2008, Ted Dunning <tdunning@veoh.com> wrote:
>> 
>> 
>> 
>> Streaming has some real conceptual confusions awaiting the unwary.
>> 
>> For instance, if you implement line counting (one count per distinct
>> line), a correct implementation is this:
>> 
>>     stream -mapper cat -reducer 'uniq -c'
>> 
>> (stream is an alias I use to avoid typing hadoop jar ....)
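>> 
>> (That alias stands in for an invocation roughly like the one below; the
>> streaming jar path and the input/output paths are illustrative and vary
>> by version and install:)
>> 
>>     hadoop jar $HADOOP_HOME/contrib/hadoop-streaming.jar \
>>         -input /user/me/in -output /user/me/out \
>>         -mapper cat -reducer 'uniq -c'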
>> 
>> It is tempting, though very dangerous, to do
>> 
>>     stream -mapper 'sort | uniq -c' -reducer '...add up counts...'
>> 
>> But this doesn't work right, because the mapper isn't supposed to produce
>> output after the last input line.  (It also tends not to work due to
>> quoting issues, but we can ignore that for the moment.)  A similar
>> confusion occurs when the mapper exits, even normally.  Take the
>> following program:
>> 
>>     stream -mapper 'head -10' -reducer '...whatever...'
>> 
>> Here the mapper acts like the identity mapper for the first ten input
>> records and then exits.  According to the implicit contract, it should
>> instead stick around, accept all subsequent inputs, and produce no
>> further output.
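>> 
>> A version that honours the contract drains the rest of its input instead
>> of exiting early (a sketch only, and subject to the quoting issues noted
>> above):
>> 
>>     stream -mapper 'sh -c "head -10; cat > /dev/null"' -reducer '...whatever...'
>> 
>> Here "cat > /dev/null" swallows every record after the tenth, so the
>> mapper stays alive to the end of its input and emits nothing further.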
>> 
>> The need for a fairly deep understanding of how hadoop works, and of how
>> normal shell processing idioms need to be modified, makes streaming a
>> pretty tricky thing to use, especially for the map-reduce novice.
>> 
>> I don't think that this problem can be easily corrected, since it is due
>> to a fairly fundamental mismatch between shell programming tradition and
>> what a mapper or reducer is.
>> 
>> 
>> On 1/22/08 8:48 AM, "Joydeep Sen Sarma" <jssarma@facebook.com> wrote:
>> 
>>>> My guess is that this is something to do with caching / buffering,
>>>> since I presume that when the Stream mapper has real work to do, the
>>>> associated Java streamer buffers input until the Mapper signals that
>>>> it can process more data.  If the Mapper is busy, then a lot of data
>>>> would get cached, causing some internal buffer to overflow.
>>> 
>>> unlikely. the java buffer would be fixed size. it would write to a unix
>>> pipe periodically. if the streaming mapper is not consuming data - the
>>> java side would quickly become blocked writing to this pipe.
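>>> 
>>> (a unix pipe holds only a small fixed amount - 64k on a modern linux -
>>> so the backpressure is easy to see at the shell:
>>> 
>>>     seq 1000000 | { read x; sleep 60; }   # seq fills the pipe, then blocks
>>> 
>>> the writer stalls almost immediately once the reader stops draining.)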
>>> 
>>> the broken pipe case is extremely common and just tells you that the
>>> mapper died. the best thing to do is find the stderr log for the task
>>> (from the jobtracker ui) and see if the mapper left anything there
>>> before dying.
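>>> 
>>> (the broken pipe itself is ordinary unix epipe behaviour; for example
>>> 
>>>     yes | head -1
>>> 
>>> prints one line, and 'yes' dies of SIGPIPE on its next write after
>>> 'head' exits.)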
>>> 
>>> 
>>> if streaming gurus are reading this - i am curious about one unrelated
>>> thing - the java map task does a 'flush()' on the buffered output
>>> stream to the streaming mapper after every input line. seemed like
>>> unnecessary overhead to me. was curious why (must be some rationale).
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: milesosb@gmail.com on behalf of Miles Osborne
>>> Sent: Tue 1/22/2008 6:26 AM
>>> To: hadoop-user@lucene.apache.org
>>> Subject: Hadoop-2438
>>> 
>>> Has there been any progress / a work-around for this?
>>> 
>>> Currently I'm experimenting with Streaming and I've encountered what
>>> looks like the same problem as described here:
>>> 
>>> https://issues.apache.org/jira/browse/HADOOP-2438
>>> 
>>> So, I get much the same errors (see below).
>>> 
>>> For this particular task, when I replace the mappers and reducers with
>>> the identity operation (i.e. just pass the data through), all is well.
>>> When instead I try to do something more taxing (in this case, gathering
>>> together all ngrams with the same prefix), I get these errors.
>>> 
>>> My guess is that this is something to do with caching / buffering,
>>> since I presume that when the Stream mapper has real work to do, the
>>> associated Java streamer buffers input until the Mapper signals that it
>>> can process more data.  If the Mapper is busy, then a lot of data would
>>> get cached, causing some internal buffer to overflow.
>>> 
>>> Miles
>>> 
>>> 
>>> Date: Tue Jan 22 14:12:28 GMT 2008
>>> java.io.IOException: Broken pipe
>>>     at java.io.FileOutputStream.writeBytes(Native Method)
>>>     at java.io.FileOutputStream.write(FileOutputStream.java:260)
>>>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>>>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
>>>     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>>>     at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>> 
>>>     at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>> 
>>> java.io.IOException: MROutput/MRErrThread failed: java.lang.OutOfMemoryError: Java heap space
>>>     at java.util.Arrays.copyOf(Arrays.java:2786)
>>>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>>>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>     at org.apache.hadoop.io.Text.write(Text.java:243)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
>>>     at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:344)
>>> 
>>>     at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:76)
>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>> 
>> 
>> 

