hadoop-common-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Hadoop-2438
Date Sun, 27 Jan 2008 07:36:50 GMT
Miles and Vadim - are you aware of the new Lucene sub-project, Mahout?  I think Grant Ingersoll
mentioned it here the other day... http://lucene.apache.org/mahout/

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Vadim Zaliva <krokodil@gmail.com>
To: core-user@hadoop.apache.org
Sent: Tuesday, January 22, 2008 5:48:05 PM
Subject: Re: Hadoop-2438

On Jan 22, 2008, at 14:44, Ted Dunning wrote:

I am also very interested in machine learning applications of MapReduce,
Collaborative Filtering in particular. If there are some
lists/groups/publications related to this subject, I would appreciate any
pointers.

Sincerely,
Vadim

>
> I would love to talk more off-line about our efforts in this regard.
>
> I will send you email.
>
>
> On 1/22/08 2:21 PM, "Miles Osborne" <miles@inf.ed.ac.uk> wrote:
>
>> In my case, I'm using actual mappers and reducers, rather than shell
>> script commands.  I've also used Map-Reduce at Google when I was on
>> sabbatical there in 2006.
>>
>> That aside, I do take your point -- you need to have a good grip on what
>> Map-Reduce does to understand some of the challenges.  Here at Edinburgh
>> I'm leading a little push to start doing some of our core research within
>> this environment.  As a starter, I'm looking at the simple task of
>> estimating large n-gram based language models using M-R (think 5-grams
>> and upwards from lots of web data).  We are also about to look at core
>> machine learning, such as EM etc., within this framework.  So, lots of
>> fun and games ... and for me, it is quite nice doing this kind of thing.
>> A good break from the usual research.
>>
>> Miles
>>
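(As a concrete illustration of the n-gram counting Miles describes, a
minimal Hadoop Streaming pair might look like the Python sketch below; the
file names and whitespace tokenization are assumptions, not anything from
this thread. The shuffle between the two scripts sorts the mapper output,
so identical 5-grams arrive adjacent at the reducer.)

    #!/usr/bin/env python
    # ngram_mapper.py (hypothetical): emit each 5-gram with a count of 1.
    import sys

    N = 5
    for line in sys.stdin:
        tokens = line.split()
        for i in range(len(tokens) - N + 1):
            # key <TAB> value is Streaming's default key/value format
            print("%s\t1" % " ".join(tokens[i:i + N]))

    #!/usr/bin/env python
    # ngram_reducer.py (hypothetical): keys arrive sorted, so identical
    # 5-grams are adjacent; sum the counts for each run of equal keys.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        key, _, count = line.rstrip("\n").rpartition("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = key, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

(Such a pair would be launched roughly as, reusing Ted's alias from further
down the thread: stream -mapper ngram_mapper.py -reducer ngram_reducer.py,
with the scripts shipped to the nodes via -file.)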
>> On 22/01/2008, Ted Dunning <tdunning@veoh.com> wrote:
>>>
>>>
>>>
>>> Streaming has some real conceptual confusions awaiting the unwary.
>>>
>>> For instance, if you implement line counting, a correct implementation
>>> is this:
>>>
>>>    stream -mapper cat -reducer 'uniq -c'
>>>
>>> (stream is an alias I use to avoid typing hadoop -jar ....)
>>>
>>> It is tempting, though very dangerous, to do
>>>
>>>    stream -mapper 'sort | uniq -c' -reducer '...add up counts...'
>>>
>>> But this doesn't work right because the mapper isn't supposed to produce
>>> output after the last input line.  (It also tends not to work due to
>>> quoting issues, but we can ignore that issue for the moment.)  A similar
>>> confusion occurs when the mapper exits, even normally.  Take the
>>> following program:
>>>
>>>    stream -mapper 'head -10' -reducer '...whatever...'
>>>
>>> Here the mapper acts like the identity mapper for the first ten input
>>> records and then exits.  According to the implicit contract, it should
>>> instead stick around, accept all subsequent inputs, and produce no
>>> output.
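(A mapper honoring that contract has to keep draining its input after it
is done emitting. A minimal Python sketch, with a hypothetical file name,
of a contract-friendly 'head -10':)

    #!/usr/bin/env python
    # head10_mapper.py (hypothetical): act like 'head -10', but instead of
    # exiting after ten records, keep consuming stdin while emitting
    # nothing more, so the Java side never writes into a dead process.
    import sys

    emitted = 0
    for line in sys.stdin:
        if emitted < 10:
            sys.stdout.write(line)  # identity-map the first ten records
            emitted += 1
        # else: swallow the record and stay alive until EOF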
>>>
>>> The need for a fairly deep understanding of how Hadoop works, and of how
>>> normal shell processing idioms need to be modified, makes streaming a
>>> pretty tricky thing to use, especially for the map-reduce novice.
>>>
>>> I don't think that this problem can be easily corrected, since it is due
>>> to a fairly fundamental mismatch between shell programming tradition and
>>> what a mapper or reducer is.
>>>
>>>
>>> On 1/22/08 8:48 AM, "Joydeep Sen Sarma" <jssarma@facebook.com> wrote:
>>>
>>>>> My guess is that this is something to do with caching / buffering,
>>>>> since I presume that when the Stream mapper has real work to do, the
>>>>> associated Java streamer buffers input until the Mapper signals that
>>>>> it can process more data.  If the Mapper is busy, then a lot of data
>>>>> would get cached, causing some internal buffer to overflow.
>>>>
>>>> unlikely. the java buffer would be fixed size. it would write to a unix
>>>> pipe periodically. if the streaming mapper is not consuming data - the
>>>> java side would quickly become blocked writing to this pipe.
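(A toy sketch of that blocking behaviour, plain Python with nothing
Hadoop-specific in it; the 64 KiB capacity is the usual Linux default, not
a number from this thread:)

    #!/usr/bin/env python
    # pipe_demo.py (illustrative only): write into a unix pipe that nobody
    # reads. The kernel buffer has a fixed capacity (typically 64 KiB on
    # Linux), so writes stop once it fills; a blocking writer, like the
    # Java side feeding a stalled streaming mapper, would hang right there.
    import os

    r, w = os.pipe()
    os.set_blocking(w, False)  # non-blocking: error out instead of hanging
    written = 0
    try:
        while True:
            written += os.write(w, b"x" * 4096)
    except BlockingIOError:
        print("pipe buffer full after %d bytes" % written)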
>>>>
>>>> the broken pipe case is extremely common and just tells you that the
>>>> mapper died. best thing to do is find the stderr log for the task (from
>>>> the jobtracker ui) and see if the mapper left something there before
>>>> dying.
>>>>
>>>>
>>>> if streaming gurus are reading this - i am curious about one unrelated
>>>> thing - the java map task does a 'flush()' on the buffered output
>>>> stream to the streaming mapper after every input line. seemed like
>>>> unnecessary overhead to me. was curious why (must be some rationale).
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: milesosb@gmail.com on behalf of Miles Osborne
>>>> Sent: Tue 1/22/2008 6:26 AM
>>>> To: hadoop-user@lucene.apache.org
>>>> Subject: Hadoop-2438
>>>>
>>>> Has there been any progress / a work-around for this?
>>>>
>>>> Currently I'm experimenting with Streaming and I've encountered what
>>>> looks like the same problem as described here:
>>>>
>>>> https://issues.apache.org/jira/browse/HADOOP-2438
>>>>
>>>> So, I get much the same errors (see below).
>>>>
>>>> For this particular task, when I replace the mappers and reducers with
>>>> the identity operation (i.e. just pass through the data), all is well.
>>>> When instead I try to do something more taxing (in this case, gathering
>>>> together all ngrams with the same prefix), I get these errors.
>>>>
>>>> My guess is that this is something to do with caching / buffering,
>>>> since I presume that when the Stream mapper has real work to do, the
>>>> associated Java streamer buffers input until the Mapper signals that it
>>>> can process more data.  If the Mapper is busy, then a lot of data would
>>>> get cached, causing some internal buffer to overflow.
>>>>
>>>> Miles
>>>>
>>>>
>>>> Date: Tue Jan 22 14:12:28 GMT 2008
>>>> java.io.IOException: Broken pipe
>>>>     at java.io.FileOutputStream.writeBytes(Native Method)
>>>>     at java.io.FileOutputStream.write(FileOutputStream.java:260)
>>>>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>>>>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
>>>>     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>>>>     at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>>>
>>>>     at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>>>
>>>> java.io.IOException: MROutput/MRErrThread failed:java.lang.OutOfMemoryError: Java heap space
>>>>     at java.util.Arrays.copyOf(Arrays.java:2786)
>>>>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>>>>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>>     at org.apache.hadoop.io.Text.write(Text.java:243)
>>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
>>>>     at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:344)
>>>>
>>>>     at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:76)
>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>>>
>>>> java.io.IOException: MROutput/MRErrThread failed:java.lang.OutOfMemoryError: Java heap space
>>>>     at java.util.Arrays.copyOf(Arrays.java:2786)
>>>>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>>>>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>>     at org.apache.hadoop.io.Text.write(Text.java:243)
>>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
>>>>     at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:344)
>>>>
>>>>     at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:76)
>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>>>>
>>>
>>>
>
