hadoop-common-user mailing list archives

From Vadim Zaliva <kroko...@gmail.com>
Subject Re: Hadoop-2438
Date Tue, 22 Jan 2008 22:48:05 GMT
On Jan 22, 2008, at 14:44, Ted Dunning wrote:

I am also very interested in machine learning applications of MapReduce,
collaborative filtering in particular. If there are any lists, groups, or
publications related to this subject, I would appreciate pointers.

Sincerely,
Vadim

>
> I would love to talk more off-line about our efforts in this regard.
>
> I will send you email.
>
>
> On 1/22/08 2:21 PM, "Miles Osborne" <miles@inf.ed.ac.uk> wrote:
>
>> In my case, I'm using actual mappers and reducers, rather than shell
>> script commands.  I've also used Map-Reduce at Google when I was on
>> sabbatical there in 2006.
>>
>> That aside, I do take your point -- you need to have a good grip on
>> what Map-Reduce does to understand some of the challenges.  Here at
>> Edinburgh I'm leading a little push to start doing some of our core
>> research within this environment.  As a starter, I'm looking at the
>> simple task of estimating large n-gram based language models using M-R
>> (think 5-grams and upwards from lots of web data).  We are also about
>> to look at core machine learning, such as EM, within this framework.
>> So, lots of fun and games ... and for me, it is quite nice doing this
>> kind of thing.  A good break from the usual research.
>>
>> Miles
>>
>> On 22/01/2008, Ted Dunning <tdunning@veoh.com> wrote:
>>>
>>> Streaming has some real conceptual confusions awaiting the unwary.
>>>
>>> For instance, if you implement line counting, a correct
>>> implementation is this:
>>>
>>>    stream -mapper cat -reducer 'uniq -c'
>>>
>>> (stream is an alias I use to avoid typing hadoop jar ....)
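>>>
>>> (Why this is correct: the framework sorts the map output before it
>>> reaches the reducer, so 'uniq -c' sees identical lines adjacent.  For
>>> example, with map input lines a, b, a, the reducer receives a, a, b
>>> and emits "2 a" and "1 b" -- the distributed analogue of
>>> sort | uniq -c.)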
>>>
>>> It is tempting, though very dangerous, to do
>>>
>>>    stream -mapper 'sort | uniq -c' -reducer '...add up counts...'
>>>
>>> But this doesn't work right, because the mapper isn't supposed to
>>> produce output after the last input line.  (It also tends not to
>>> work due to quoting issues, but we can ignore that for the moment.)
>>> A similar confusion occurs when the mapper exits, even normally.
>>> Take the following program:
>>>
>>>    stream -mapper 'head -10' -reducer '...whatever...'
>>>
>>> Here the mapper acts like the identity mapper for the first ten
>>> input records and then exits.  According to the implicit contract,
>>> it should instead stick around, accept all subsequent inputs, and
>>> not produce any output.
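>>>
>>> To respect that contract, the mapper has to keep consuming its input
>>> even after it stops emitting.  A rough sketch (in Python, just for
>>> illustration) of a contract-respecting 'head -10':
>>>
>>>    #!/usr/bin/env python
>>>    import sys
>>>    # emit the first ten records, then keep draining stdin to EOF
>>>    # without producing any further output
>>>    for i, line in enumerate(sys.stdin):
>>>        if i < 10:
>>>            sys.stdout.write(line)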
>>>
>>> The need for a fairly deep understanding of Hadoop, and of how
>>> normal shell-processing idioms have to be modified, makes streaming
>>> a pretty tricky thing to use, especially for the map-reduce novice.
>>>
>>> I don't think that this problem can be easily corrected, since it is
>>> due to a fairly fundamental mismatch between the shell programming
>>> tradition and what a mapper or reducer is.
>>>
>>>
>>> On 1/22/08 8:48 AM, "Joydeep Sen Sarma" <jssarma@facebook.com> wrote:
>>>
>>>>> My guess is that this is something to do with caching/buffering,
>>>>> since I presume that when the Stream mapper has real work to do,
>>>>> the associated Java streamer buffers input until the Mapper signals
>>>>> that it can process more data.  If the Mapper is busy, then a lot
>>>>> of data would get cached, causing some internal buffer to overflow.
>>>>
>>>> Unlikely.  The Java buffer would be fixed size; it would write to a
>>>> Unix pipe periodically.  If the streaming mapper is not consuming
>>>> data, the Java side would quickly become blocked writing to this
>>>> pipe.
>>>>
>>>> The broken-pipe case is extremely common and just tells you that the
>>>> mapper died.  The best thing to do is find the stderr log for the
>>>> task (from the jobtracker UI) and see if the mapper left something
>>>> there before dying.
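>>>>
>>>> As a toy illustration of that blocking-then-broken-pipe behavior,
>>>> outside Hadoop entirely (a sketch; 'sleep' stands in for a mapper
>>>> that never consumes its input and then dies):
>>>>
>>>>    import os
>>>>    import subprocess
>>>>    # the child never reads stdin, so writes fill the pipe buffer
>>>>    # (typically ~64KB on Linux) and then block, just as the Java
>>>>    # side blocks when the mapper stops consuming
>>>>    p = subprocess.Popen(['sleep', '5'], stdin=subprocess.PIPE)
>>>>    try:
>>>>        while True:
>>>>            os.write(p.stdin.fileno(), b'x' * 4096)
>>>>    except BrokenPipeError:
>>>>        # once the child exits, the next write fails with EPIPE --
>>>>        # the "Broken pipe" seen in the task logs
>>>>        print('writer got EPIPE: the reader died')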
>>>>
>>>>
>>>> If any streaming gurus are reading this, I am curious about one
>>>> unrelated thing: the Java map task does a flush() on the buffered
>>>> output stream to the streaming mapper after every input line.  That
>>>> seemed like unnecessary overhead to me; I was curious why (there
>>>> must be some rationale).
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: milesosb@gmail.com on behalf of Miles Osborne
>>>> Sent: Tue 1/22/2008 6:26 AM
>>>> To: hadoop-user@lucene.apache.org
>>>> Subject: Hadoop-2438
>>>>
>>>> Has there been any progress / a work-around for this?
>>>>
>>>> Currently I'm experimenting with Streaming, and I've encountered
>>>> what looks like the same problem as described here:
>>>>
>>>> https://issues.apache.org/jira/browse/HADOOP-2438
>>>>
>>>> So, I get much the same errors (see below).
>>>>
>>>> For this particular task, when I replace the mappers and reducers
>>>> with the identity operation (i.e. just pass through the data), all
>>>> is well.  When instead I try to do something more taxing (in this
>>>> case, gathering together all n-grams with the same prefix), I get
>>>> these errors.
>>>>
>>>> My guess is that this is something to do with caching/buffering,
>>>> since I presume that when the Stream mapper has real work to do,
>>>> the associated Java streamer buffers input until the Mapper signals
>>>> that it can process more data.  If the Mapper is busy, then a lot
>>>> of data would get cached, causing some internal buffer to overflow.
>>>>
>>>> Miles
>>>>
>>>>
>>>> Date: Tue Jan 22 14:12:28 GMT 2008
>>>> java.io.IOException: Broken pipe
>>>> at java.io.FileOutputStream.writeBytes(Native Method)
>>>> at java.io.FileOutputStream.write(FileOutputStream.java:260)
>>>> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>>>> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
>>>> at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>>>> at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
>>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>>>
>>>> at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
>>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>>>
>>>> java.io.IOException: MROutput/MRErrThread failed:java.lang.OutOfMemoryError: Java heap space
>>>> at java.util.Arrays.copyOf(Arrays.java:2786)
>>>> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>>>> at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>> at org.apache.hadoop.io.Text.write(Text.java:243)
>>>> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
>>>> at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:344)
>>>>
>>>> at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:76)
>>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>>>
>>>> java.io.IOException: MROutput/MRErrThread failed:java.lang.OutOfMemoryError: Java heap space
>>>> at java.util.Arrays.copyOf(Arrays.java:2786)
>>>> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>>>> at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>> at org.apache.hadoop.io.Text.write(Text.java:243)
>>>> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
>>>> at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:344)
>>>>
>>>> at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:76)
>>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

