Rough estimation: since word count requires very little computation, the job
is I/O-bound, so we can estimate from disk throughput alone.
Assume 10 disks per node at 100 MB/s each, which gives about 1 GB/s per
node; at 70% utilization in the mappers, that is 700 MB/s per node. Across
30 nodes the cluster reads roughly 20 GB/s in total, so scanning 10 TB of
data takes about 500 seconds.
Adding some MapReduce overhead and the final merge, say 20%, we can expect
about 10 minutes here.
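The arithmetic above can be sketched as a quick script; all the figures are
the assumptions stated in this email, not measurements:

```python
# Back-of-envelope estimate for the word-count job described above.
disks_per_node = 10
disk_throughput_mb_s = 100   # assumed MB/s per disk
utilization = 0.70           # assumed effective mapper utilization
nodes = 30
data_tb = 10
overhead = 0.20              # assumed MapReduce + final-merge overhead

node_mb_s = disks_per_node * disk_throughput_mb_s * utilization  # 700 MB/s
cluster_mb_s = node_mb_s * nodes                                 # 21,000 MB/s
data_mb = data_tb * 1_000_000

seconds = data_mb / cluster_mb_s * (1 + overhead)
print(f"~{seconds:.0f} s, i.e. about {seconds / 60:.0f} minutes")
```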
On Tuesday, April 15, 2014, Shashidhar Rao <raoshashidhar123@gmail.com>
wrote:
> Hi,
>
> Can somebody provide me a rough estimate of the time taken in hours/mins
> for a cluster of say 30 nodes to run a MapReduce job to perform a word
> count on say 10 TB of data, assuming that the hardware and the MapReduce
> program are tuned optimally.
>
> Just a rough estimate; it could be 5 TB, 10 TB, or 20 TB of data. If not
> word count, it could be any job that analyzes that amount of data.
>
> Regards
> Shashidhar
>

Regards,
Stanley Shi
