nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Parser hangs
Date Mon, 04 Jul 2011 14:34:37 GMT
I only have very small production crawls running on Hadoop. This large-scale
test is in the process of being migrated to a Hadoop cluster. I'll keep your
comments on the reducer in mind once the migration has completed.

Thanks for explaining.



On Monday 04 July 2011 16:28:40 Julien Nioche wrote:
> > On Monday 04 July 2011 15:52:36 Julien Nioche wrote:
> > > no problem. Like most Hadoop jobs the output of the mapper is written,
> > > then there is the shuffle etc... finally it goes through the reducer
> > > (ParseSegment l137) - mostly IO bound
> > 
> > I see. The log line is written in the mapper so it's the reduce phase
> > that takes ages to complete. I didn't see much IO-wait though. IO was
> > very little when compared to the total run time of the reduce phase.
> 
> the reducer itself does very little but the time could be spent
> deserializing when the objects are read to be sent to the reducer, in which
> case it would be CPU bound
> 
> > Any advice on how to provide log output to show progress there? It seems
> > the parser suffers from the same problem as the fetcher since both
> > reducers take a lot of time.
> 
> That's not something that I've experienced and I'm surprised that the
> reduce step takes that long.
> Again the Hadoop webapps are the best way of monitoring a crawl + they also
> add loads of status info (# docs per Mimetype, errors, etc...). IMHO
> running Nutch in local mode is only useful for testing / debugging /
> running very small crawls
> 
> > > You can check the status of the job on the Hadoop webapps, assuming
> > > that you're running Nutch in (pseudo) distributed mode of course which
> > > is preferable for large crawls
> > 
> > This large-scale test runs locally atm. Hadoop has been set up but hasn't
> > been migrated yet.
> > 
> > > On 4 July 2011 14:29, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > > > Julien, and others,
> > > > 
> > > > This was a wild goose chase! The parser just now finished. In this
> > > > case I rephrase the question: what is it doing after all docs have
> > > > been parsed? The entire parse took less than whatever it was doing
> > > > after it parsed the last document.
> > > > 
> > > > Thanks!
> > > > 
> > > > (Sorry Julien ;)
> > > > 
> > > > On Monday 04 July 2011 14:24:07 Markus Jelsma wrote:
> > > > > Hi,
> > > > > 
> > > > > Another large crawl seems to lead to problems, this time the
> > > > > parser. I've added logging to the parser so I can follow its
> > > > > progress; it outputs the key of the document it's processing.
> > > > > 
> > > > > It now seems to hang. The process continues to use CPU time (it
> > > > > fluctuates normally) and I can confirm that the document in
> > > > > question is parsable, both with ParserChecker and a complete
> > > > > crawl cycle of that one URL.
> > > > > 
> > > > > I don't know if the parse job is finishing up as I can't see it,
> > > > > but this is the last output of the log:
> > > > > 
> > > > > 2011-07-04 11:43:16,328 INFO  parse.ParseSegment - Parsing: http://<HOST>
> > > > > 2011-07-04 11:44:53,197 WARN  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
> > > > > 2011-07-04 11:45:02,877 WARN  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
> > > > > 
> > > > > As you can see it has been doing `nothing` for 45 minutes. What
> > > > > is it doing? Will it ever finish?
> > > > > 
> > > > > Thanks
> > > > 
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> > 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
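
On the question in the thread about getting log output that shows progress during the reduce phase, here is a minimal sketch of a per-record progress logger in Java. It is only an illustration: `ProgressLogger`, its constructor arguments and the message format are hypothetical and are not part of Nutch or Hadoop. The assumption is that something like `record(key)` would be called once per key from inside the reducer. In a real job the sink would more likely be the job's logger or the `Reporter` that Hadoop's old `mapred` API passes to `reduce()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not a Nutch or Hadoop class): emits one line every N records. */
final class ProgressLogger {
    private final long every;             // emit a message on every `every`-th record
    private final Consumer<String> sink;  // where messages go (logger, stdout, a list, ...)
    private long count;                   // records seen so far

    ProgressLogger(long every, Consumer<String> sink) {
        if (every <= 0) {
            throw new IllegalArgumentException("every must be positive");
        }
        this.every = every;
        this.sink = sink;
    }

    /** Call once per record; only every `every`-th call produces output. */
    void record(String key) {
        count++;
        if (count % every == 0) {
            sink.accept("processed " + count + " records, last key: " + key);
        }
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        // Demo: five keys, one progress line per two records.
        ProgressLogger progress = new ProgressLogger(2, System.out::println);
        for (String key : new String[] {"a", "b", "c", "d", "e"}) {
            progress.record(key);
        }
    }
}
```

Note that if, as suggested in the thread, the time is spent deserializing objects before they reach the reducer, a counter like this would stay silent during that stretch. That silence would at least help tell the two cases apart.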
