hadoop-common-dev mailing list archives

From: "Mike Smith" <mike.smith....@gmail.com>
Subject: map/reduce job results are not consistent in hadoop 0.11
Date: Fri, 09 Feb 2007 08:16:27 GMT
Map/reduce job results are not consistent in both the Hadoop 0.11 release and
trunk when you rerun the same job. I have observed this inconsistency in the
map output across different jobs. A simple test to double-check is to use
Hadoop 0.11 with the Nutch trunk, following the steps below (a rough command
sketch follows the list).



1) Make a crawl.

2) Update the crawldb.

3) Use readdb -stats to get the statistics.

4) Update the crawldb again (the crawldb should still be the same, since no
new crawl has happened).

5) Now use readdb -stats to get the statistics again.
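
Roughly, as a command sketch (assuming a stock Nutch trunk checkout with a
seed list in urls/; the directory names here are only illustrative):

    bin/nutch crawl urls -dir crawl -depth 1            # 1) make a crawl
    bin/nutch updatedb crawl/crawldb crawl/segments/*   # 2) update the crawldb
    bin/nutch readdb crawl/crawldb -stats               # 3) first statistics
    bin/nutch updatedb crawl/crawldb crawl/segments/*   # 4) update again, no new crawl
    bin/nutch readdb crawl/crawldb -stats               # 5) second statistics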


You will see that the two sets of statistics differ:


07/02/08 22:13:43 INFO crawl.CrawlDbReader: TOTAL urls: 6782524
07/02/08 22:13:43 INFO crawl.CrawlDbReader: retry 0:    6757921
07/02/08 22:13:43 INFO crawl.CrawlDbReader: retry 1:    24601
07/02/08 22:13:43 INFO crawl.CrawlDbReader: retry 2:    2
07/02/08 22:13:43 INFO crawl.CrawlDbReader: min score:  0.0090
07/02/08 22:13:43 INFO crawl.CrawlDbReader: avg score:  0.436
07/02/08 22:13:43 INFO crawl.CrawlDbReader: max score:  9005.445
07/02/08 22:13:43 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    6102449
07/02/08 22:13:43 INFO crawl.CrawlDbReader: status 2 (db_fetched):      570983
07/02/08 22:13:43 INFO crawl.CrawlDbReader: status 3 (db_gone):         23359
07/02/08 22:13:43 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   41248
07/02/08 22:13:43 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   44485
07/02/08 22:13:50 INFO crawl.CrawlDbReader: CrawlDb statistics: done



07/02/09 02:38:29 INFO crawl.CrawlDbReader: TOTAL urls: 6438347
07/02/09 02:38:29 INFO crawl.CrawlDbReader: retry 0:    6414923
07/02/09 02:38:29 INFO crawl.CrawlDbReader: retry 1:    23422
07/02/09 02:38:29 INFO crawl.CrawlDbReader: retry 2:    2
07/02/09 02:38:29 INFO crawl.CrawlDbReader: min score:  0.0090
07/02/09 02:38:29 INFO crawl.CrawlDbReader: avg score:  0.453
07/02/09 02:38:29 INFO crawl.CrawlDbReader: max score:  10358.287
07/02/09 02:38:29 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    5787233
07/02/09 02:38:29 INFO crawl.CrawlDbReader: status 2 (db_fetched):      547037
07/02/09 02:38:29 INFO crawl.CrawlDbReader: status 3 (db_gone):         22311
07/02/09 02:38:29 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   39315
07/02/09 02:38:29 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   42451
07/02/09 02:38:36 INFO crawl.CrawlDbReader: CrawlDb statistics: done

If you continue doing this, you will see different statistics each time;
between the two runs above, the crawldb lost 344,177 URLs (6,782,524 vs.
6,438,347, about 5%). This is not a Nutch problem, since it happens for
non-Nutch jobs as well. My guess is that somewhere between the mappers and
the reducers some keys are randomly going missing. Has anybody experienced
this?
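
For what it's worth, the same effect can be checked outside of Nutch by
running any job twice over the same input and comparing the sorted outputs.
The jar, class, and paths below are just placeholders, and on 0.11 the
file-system shell is bin/hadoop dfs:

    bin/hadoop jar myjob.jar MyJob input out1           # run the same job twice
    bin/hadoop jar myjob.jar MyJob input out2           #   over the same input
    bin/hadoop dfs -cat out1/part-* | sort | md5sum     # checksum of first output
    bin/hadoop dfs -cat out2/part-* | sort | md5sum     # checksum of second output

For a deterministic job the two checksums should match; if they differ,
records are being dropped or duplicated somewhere in the map/reduce pipeline.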

Thanks,
Mike
