hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Strange Hadoop behavior - Different results on equivalent input
Date Mon, 17 Sep 2007 18:08:49 GMT


This isn't a matter of side effects.

The issue is that the combiner only sees the output of a single map task.
That means the partial counts it works on will be (statistically speaking)
smaller than the totals seen by the final reduce.
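
To make that concrete, here is a minimal sketch (an illustration, not code
from the thread): suppose a word occurs 140 times overall, but no single map
task emits more than 100 occurrences of it. A combiner that applies an
"if (sum > 100)" filter drops every partial count, so the word never reaches
the reducer at all.

    // Hypothetical illustration: a thresholding combiner only ever sees
    // per-map partial sums, so a word with a large overall count can vanish
    // if no single map contributes more than the threshold.
    public class CombinerThresholdDemo {
        public static void main(String[] args) {
            int[] partialSums = {60, 50, 30}; // partial counts from three map tasks, total = 140
            int reachingReducer = 0;
            for (int s : partialSums) {
                if (s > 100) {                // the filter applied in the combiner
                    reachingReducer += s;
                }
            }
            // Prints 0: every partial sum was filtered out before the reduce phase.
            System.out.println("count reaching the reducer: " + reachingReducer);
        }
    }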



On 9/17/07 9:34 AM, "Luca Telloli" <lucatelloli@yahoo.it> wrote:

> Hello Devaraj, 
> thanks for your detailed answer. I did indeed rerun the same jobs on the
> same inputs described in my first email and you're right, the output is
> the same. This time I used the original WordCount provided with the
> distribution.
> 
> I realized that my mistake was to modify the Reduce class to post-process
> the output in this way:
> 
>        if (sum > 100)
>          output.collect(key, new IntWritable(sum));
>      }
> 
> without keeping the original Combiner class (in the WordCount example,
> the Reducer and the Combiner are the same class, named Reduce): I guess
> that, because the Combiner works on local data in memory instead of on
> disk files, this can generate unwanted side effects like the one I
> experienced.
> 
> Then I tried to run the post-processing on the reduce values, using the
> original Reduce as the Combiner and a new Reduce class (with a threshold
> condition like the one above) as the Reducer. This worked correctly. I
> assume that doing this kind of operation at reduce time is reliable and
> does not generate side effects.
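> 
> A minimal sketch of that job setup, using the old mapred API (MapClass and
> Reduce are the classes from the stock WordCount example; ThresholdReduce is
> just an illustrative name for the modified reducer):
> 
>      import org.apache.hadoop.io.IntWritable;
>      import org.apache.hadoop.io.Text;
>      import org.apache.hadoop.mapred.JobClient;
>      import org.apache.hadoop.mapred.JobConf;
> 
>      public class WordCountThreshold {
>        public static void main(String[] args) throws Exception {
>          JobConf conf = new JobConf(WordCountThreshold.class);
>          conf.setJobName("wordcount-threshold");
>          conf.setOutputKeyClass(Text.class);
>          conf.setOutputValueClass(IntWritable.class);
>          conf.setMapperClass(MapClass.class);          // stock tokenizing mapper
>          conf.setCombinerClass(Reduce.class);          // plain summing reduce: safe as a combiner
>          conf.setReducerClass(ThresholdReduce.class);  // sums and applies the "sum > 100" filter
>                                                        // once, at reduce time
>          // input/output paths set as in the stock WordCount driver
>          JobClient.runJob(conf);
>        }
>      }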
> 
> Cheers,
> Luca 
> 
> On Sun, 2007-09-16 at 16:48 -0700, Devaraj Das wrote:
>> Hi Luca,
>> You really piqued my curiosity, so I went and tried it myself. I had a
>> bunch of files adding up to 591 MB in one dir, and an equivalent single
>> file in a different dir in HDFS. I ran two MR jobs with #reducers = 2.
>> The outputs were exactly the same.
>> The split sizes will not affect the outcome in the wordcount case. The
>> number of maps is a function of the HDFS block size, the number of maps
>> the user specified, and the length/number of input files. The RecordReader,
>> org.apache.hadoop.mapred.LineRecordReader, has logic for handling files
>> that could be split anywhere (a line can straddle an HDFS block boundary).
>> If you look at org.apache.hadoop.mapred.FileInputFormat.getSplits, you can
>> see how all this information is used.
>> Hadoop doesn't honor mapred.map.tasks beyond treating it as a hint, but it
>> accepts the user-specified mapred.reduce.tasks and doesn't manipulate it.
>> So you cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.
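>> For instance (just a sketch of the relevant calls in a job driver; only
>> the reduce count is binding):
>> 
>>     JobConf conf = new JobConf(WordCount.class);
>>     conf.setNumMapTasks(10);    // a hint only: the InputFormat's getSplits()
>>                                 // makes the final decision
>>     conf.setNumReduceTasks(2);  // honored exactly as specified
>> 
>> or, equivalently, set mapred.reduce.tasks in the job configuration, as you did.
>> 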
>> Thanks,
>> Devaraj.
>> 
>> 
>>> -----Original Message-----
>>> From: Luca Telloli [mailto:lucatelloli@yahoo.it]
>>> Sent: Friday, September 14, 2007 7:17 AM
>>> To: hadoop-user@lucene.apache.org
>>> Subject: Strange Hadoop behavior - Different results on
>>> equivalent input
>>> 
>>> Hello everyone,
>>> I'm new to Hadoop and to this mailing list so: Hello. =)
>>> 
>>> I'm experiencing a problem that I can't understand. I'm
>>> performing a wordcount task (from the examples in the source)
>>> on a single Hadoop node configured as a pseudo-distributed
>>> environment. My input is a set of documents I grabbed from
>>> /usr/share/doc.
>>> 
>>> I have two inputs:
>>> - the first is a set of three files of 189, 45 and 1.9
>>> MB, named input-compact
>>> - the second is the same content concatenated with cat into
>>> a single 236 MB file, named input-single, so I'm talking
>>> about "equivalent" input
>>> 
>>> The logs report 11 map tasks for one job and 10 for the other,
>>> both with a total of 2 reduce tasks. I expect the outcome
>>> to be the same, but it isn't, as shown by the tail of
>>> my outputs:
>>> 
>>> $ tail /tmp/output-*
>>> ==> /tmp/output-compact <==
>>> yet.</td>       164
>>> you     23719
>>> You     4603
>>> your    7097
>>> Zend    111
>>> zero,   101
>>> zero    1637
>>> zero-based      114
>>> zval    140
>>> zval*   191
>>> 
>>> ==> /tmp/output-single <==
>>> Y       289
>>> (Yasuhiro       105
>>> yet.</td>       164
>>> you     23719
>>> You     4622
>>> your    7121
>>> zero,   101
>>> zero    1646
>>> zero-based      114
>>> zval*   191
>>> 
>>> - Does the way Hadoop splits its input into blocks on HDFS
>>> influence the possible outcome of the computation?
>>> 
>>> - Even so, how can the results be so different? I mean, the
>>> word zval, which has 140 occurrences in the first run, doesn't
>>> even appear in the second one!
>>> 
>>> - Third question: I've noticed that, when files are
>>> small, Hadoop tends to create as many maps as there are
>>> files. My initial input was scattered across 13k small
>>> files and was not well suited to the task, as I realized
>>> quite soon when almost 13k maps ran the same task.
>>> At that point, I specified a few parameters in my
>>> configuration file, like mapred.map.tasks = 10 and
>>> mapred.reduce.tasks = 2. I wonder how Hadoop decides on the
>>> number of maps; the help says that mapred.map.tasks is
>>> a _value per job_, but I wonder whether it is instead some
>>> function of <#tasks, #input files> or other parameters.
>>> 
>>> - Finally, is there a way to completely force these
>>> parameters (the number of maps and reduces)?
>>> 
>>> Apologies if any of these questions sound dumb; I'm
>>> really new to the software and eager to learn more.
>>> 
>>> Thanks,
>>> Luca 
>>> 
>>> 
>> 
> 

