hadoop-common-user mailing list archives

From "Marcus Herou" <marcus.he...@tailsweep.com>
Subject NumMapTasks and NumReduceTasks with MapRunnable
Date Sat, 13 Dec 2008 09:30:44 GMT

We are finally in the beta stage with our crawler and have tested it with a
few hundred thousand URLs. However, it performs worse than if we ran it on a
single local machine without connecting to a Hadoop JobTracker.
Each crawl job is much like a Nutch Fetcher job: it spawns X threads which
all read from the same RecordReader and start fetching the current URL.
However, I am not able to utilize all of our 9 machines at the same time,
which would really be preferable since this is an externally IO-bound job
(remote fetches).

How can I, with a crawl list of just 9 URLs (stupidly small, I know), make
sure that every machine is used at least once?
With a crawl list of 900, how can I make sure at least 100 are crawled at the
same time across all machines?
And so on with much bigger crawl lists (which is why we need Hadoop in the
first place).
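One way to guarantee a map task per small batch of URLs is a line-oriented split, where each split holds a fixed number of input lines (Hadoop ships an NLineInputFormat for this in newer releases). The chunking it implies can be sketched in plain Java; the class and helper names here are hypothetical, not Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cut a crawl list into fixed-size chunks so the
// JobTracker could schedule one map task per chunk (the idea behind a
// line-per-split input format).
public class CrawlSplitter {

    // Split 'urls' into chunks of at most 'linesPerSplit' lines each.
    static List<List<String>> split(List<String> urls, int linesPerSplit) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < urls.size(); i += linesPerSplit) {
            splits.add(urls.subList(i, Math.min(i + linesPerSplit, urls.size())));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 900; i++) {
            urls.add("http://example.com/page" + i);
        }
        List<List<String>> splits = split(urls, 100);
        System.out.println(splits.size());        // 9 map tasks
        System.out.println(splits.get(0).size()); // 100 urls per task
    }
}
```

With 900 URLs and 100 lines per split this yields 9 splits, i.e. roughly one map task per machine on a 9-node cluster.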

Just as I write this, I launched a job where I manually set numMapTasks to 9,
and it seems to be fruitful; quite a fast crawl, actually :) However, I
wonder if this is how I should think with all MapRunnables?
The next job we call is PersistOutLinks, and yep, it goes through a massive
list of source->target links and saves them in a DB.
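Part of the answer may be that setNumMapTasks is only a hint: as far as I understand the old-API FileInputFormat, the requested task count merely sets a goal split size, which is then clamped by the minimum split size and the HDFS block size. A simplified, dependency-free sketch of that arithmetic (the real Hadoop code also applies a slop factor, so exact counts can differ):

```java
// Simplified sketch of old-API FileInputFormat split sizing. The
// requested numMapTasks only determines a goal size; the actual split
// size is clamped between a minimum and the block size, which is why a
// tiny input file can yield fewer map tasks than requested.
public class SplitSizing {

    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    static long numSplits(long totalSize, int requestedMaps,
                          long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(1, requestedMaps);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        return (totalSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed 64 MB HDFS block
        // A ~500-byte crawl list of 9 urls with 9 requested map tasks:
        System.out.println(numSplits(500, 9, 1, blockSize));
    }
}
```

The takeaway is that for a crawl list of a few hundred bytes, the split math, not the crawl list, decides how many map tasks you actually get, so forcing the count (or using a line-per-split format) is a reasonable move for IO-bound jobs.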

This list is at least 100 times larger than the Fetcher list. Is it still
smart to hardcode the value 9 into numMapTasks for this MapRunnable job? Or
should I create some form of InputFormat.getSplits implementation based on
the crawl/outlink sizes? Of course, the numMapTasks value is not really
hardcoded; it is injected into the Configuration based on a properties file.


Marcus Herou CTO and co-founder Tailsweep AB
