hadoop-common-dev mailing list archives

From Gopal Gandhi <gopal.gandhi2...@yahoo.com>
Subject Re: [Streaming] I figured out a way to do combining using mapper, would anybody check it?
Date Tue, 22 Jul 2008 00:10:12 GMT
Thanks, but regarding memory: according to http://wiki.apache.org/hadoop/HadoopMapReduce,
the Hadoop combiner also works in memory. Quote: "When the map operation outputs its pairs
they are already available in memory. For efficiency reasons, sometimes it makes sense to
take advantage of this fact by supplying a combiner class to perform a reduce-type function.
... A combine operation will start gathering the output in in-memory lists (instead of on
disk), one list per word."
So my code works exactly the same as a Hadoop combiner in terms of memory usage.



----- Original Message ----
From: lohit <lohit_bv@yahoo.com>
To: core-dev@hadoop.apache.org
Sent: Monday, July 21, 2008 3:46:33 PM
Subject: Re: [Streaming] I figured out a way to do combining using mapper, would anybody check
it?

Yes, for this example, it's the same. You might want to consider one more thing, though. Your
code reads all of its input data into memory and only then dumps it. So if your input split is
very big, your hash will be big as well. Also, if reading this data into the hash takes longer
than mapred.task.timeout, I think no status is reported to the JobTracker, which then assumes
that the task is gone and might kill it.

Thanks,
Lohit
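
One way to address both concerns above is to flush the hash whenever it grows past a limit and to print a status line periodically. The following is only a minimal sketch: the limits and the sub names (run_mapper, flush_counts) are hypothetical, and it assumes the Streaming version in use interprets stderr lines of the form "reporter:status:<message>" as status updates, which should be checked against the release being run.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tuning values, not taken from the thread.
my $MAX_KEYS     = 100_000;   # flush once the hash holds this many distinct words
my $STATUS_EVERY = 100_000;   # print a status line every N input lines

# Print every partial count as "word<TAB>count", then empty the hash.
sub flush_counts {
    my ($counts, $out) = @_;
    for my $word (sort keys %$counts) {
        print {$out} "$word\t$counts->{$word}\n";
    }
    %$counts = ();
}

# Read one word per line from $in, combine counts in memory, and flush
# whenever the number of distinct words reaches $max_keys.
sub run_mapper {
    my ($in, $out, $err, $max_keys, $status_every) = @_;
    my %counts;
    my $lines = 0;
    while (my $word = <$in>) {
        chomp $word;
        $counts{$word}++;
        $lines++;
        flush_counts(\%counts, $out) if keys(%counts) >= $max_keys;
        # Assumption: Streaming reads "reporter:status:<message>" from stderr.
        print {$err} "reporter:status:processed $lines lines\n"
            if $lines % $status_every == 0;
    }
    flush_counts(\%counts, $out);   # emit whatever is left at end of input
    return $lines;
}

run_mapper(\*STDIN, \*STDOUT, \*STDERR, $MAX_KEYS, $STATUS_EVERY) unless caller;
```

Because of the intermediate flushes, the same word can be emitted more than once by a single mapper, so the reducer has to sum the counts rather than assume one line per word.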



----- Original Message ----
From: Gopal Gandhi <gopal.gandhi2008@yahoo.com>
To: core-dev@hadoop.apache.org
Cc: core-user@hadoop.apache.org
Sent: Monday, July 21, 2008 2:35:45 PM
Subject: [Streaming] I figured out a way to do combining using mapper, would anybody check
it?

I am using Hadoop Streaming. 
I figured out a way to do combining in the mapper. Is it the same as using a separate combiner?

For example, the input is a list of words, and I want to count the number of occurrences of each word.

The traditional mapper is:

while (<STDIN>) {
  chomp ($_);
  $word = $_;
  print "$word\t1\n";
}
.........

Instead of using an additional combiner, I modify the mapper to use a hash:

%hash = ();
while (<STDIN>) {
  chomp ($_);
  $word = $_;
  $hash{$word} ++;
}

foreach $key (keys %hash){
  print "$key\t$hash{$key}\n";
}

Is it the same as using a separate combiner?
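
For either mapper to give the same final counts, the reducer has to sum the values it receives rather than count lines. Below is a minimal sketch of such a reducer, assuming the input arrives as "word<TAB>count" lines already grouped by word, as Streaming delivers sorted map output to the reducer. The sub name run_reducer is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum the counts for each word. Assumes lines for the same word are adjacent.
sub run_reducer {
    my ($in, $out) = @_;
    my $current;      # word currently being accumulated
    my $total = 0;    # running sum for $current
    while (my $line = <$in>) {
        chomp $line;
        my ($word, $n) = split /\t/, $line, 2;
        if (defined $current && $word ne $current) {
            # A new word starts: emit the finished total and reset.
            print {$out} "$current\t$total\n";
            $total = 0;
        }
        $current = $word;
        $total  += $n;
    }
    print {$out} "$current\t$total\n" if defined $current;
}

run_reducer(\*STDIN, \*STDOUT) unless caller;
```

With this reducer, the traditional mapper (which emits a 1 per word) and the hash-based mapper (which emits partial sums) should produce the same totals. A quick local check, with hypothetical file names, is: cat words.txt | perl mapper.pl | sort | perl reducer.pl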


