hadoop-common-user mailing list archives

From "Lukas Vlcek" <lukas.vl...@gmail.com>
Subject Ideal number of mappers and reducers; any physical limits?
Date Mon, 14 Jul 2008 21:10:42 GMT

I have a couple of *basic* questions about Hadoop internals.

1) If I understood correctly, the ideal number of Reducers is equal to the number
of distinct keys (or distinct values of a custom Partitioner) emitted from all
Mappers at a given Map-Reduce iteration. Is that correct?
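(For reference, the partitioning I have in mind is the default hash partitioning
from the old mapred API. A rough sketch of my understanding, essentially
org.apache.hadoop.mapred.lib.HashPartitioner, not checked against trunk:)

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch of how I assume keys are assigned to Reducers by default.
public class MyHashPartitioner<K, V> implements Partitioner<K, V> {

  public void configure(JobConf job) {
    // nothing to configure for plain hash partitioning
  }

  // Each key lands in one of numReduceTasks partitions, so several
  // distinct keys can end up on the same Reducer.
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

If that sketch is right, then having fewer Reducers than distinct keys just means
several keys share one Reducer, which is what prompts question 2) below.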

2) In the configuration a maximum number of Reducers can be set. How does
Hadoop handle the situation when there are more intermediate keys emitted
from the Mappers than this number? AFAIK the intermediate results are stored in
SequenceFiles. Does it mean that this intermediate persistent storage is
somehow scanned to collect all records with the same key (or the same custom
Partitioner value), that such a chunk of data is sent to one Reducer, and that
if no Reducer is left the process waits until one of them is done and can be
assigned a new chunk of data?
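(By the maximum number of Reducers in the configuration I mean the per-job
setting below, or equivalently the mapred.reduce.tasks property in
hadoop-site.xml, if I am not mistaken; MyJob is just a placeholder name:)

import org.apache.hadoop.mapred.JobConf;

public class MyJob {
  public static JobConf createConf() {
    JobConf conf = new JobConf(MyJob.class);
    // same effect as setting mapred.reduce.tasks = 20 in the configuration
    conf.setNumReduceTasks(20);
    return conf;
  }
}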

3) Is there any recommendation on how to set up a job if the number of
intermediate keys is not known beforehand?

4) Is there any physical limit on the number of Reducers imposed by the internal
Hadoop architecture?

... and finally ...

5) Does anybody know how and for what exactly folks at Yahoo! use Hadoop?
If the biggest reported Hadoop cluster has something like 2000 machines, then
the total number of Mappers/Reducers can be something like 2000*200 (assuming
there are, for example, 200 Reducers running on each machine), which is a big
number but still probably not big enough to handle processing of really large
graph data structures IMHO. As far as I understood, Google is not directly
using the Map-Reduce form of the PageRank calculation for whole-internet graph
processing (see http://www.youtube.com/watch?v=BT-piFBP4fE). So, if Yahoo!
needs a scaling algorithm for really large tasks, what do they use if not
Map-Reduce?


