nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Parser hangs
Date Mon, 04 Jul 2011 13:27:18 GMT
None of these. All these URLs work fine with ParserChecker. I've also tried
several more that are not in the snippet below. They all parse well, and so
does the PDF, except that it's slow.
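
A slow parse matters if the parser runs under a timeout. The ParseCallable and FutureTask frames in the traces quoted below suggest the parse is wrapped in a FutureTask, so here is a minimal sketch, assuming the caller waits with a timeout and then cancels. It only illustrates standard java.util.concurrent behaviour: a timed-out, CPU-bound task keeps running after cancel(true) unless it checks the interrupt flag. TimeoutDemo, SlowTask, and workerSurvivesTimeout are hypothetical names for illustration, not Nutch code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutDemo {

    // Hypothetical stand-in for a slow parse: CPU-bound, never checks the interrupt flag.
    static class SlowTask implements Callable<String> {
        volatile boolean stop = false;

        public String call() {
            StringBuilder sb = new StringBuilder();
            while (!stop) {
                sb.append('x');
                if (sb.length() > 1024) {
                    sb.setLength(0); // keep memory bounded in this demo
                }
            }
            return "done";
        }
    }

    // Returns true if the worker is still alive after the caller timed out and cancelled.
    static boolean workerSurvivesTimeout() {
        SlowTask slow = new SlowTask();
        FutureTask<String> task = new FutureTask<String>(slow);
        Thread worker = new Thread(task, "demo-parse-worker");
        worker.setDaemon(true);
        worker.start();

        boolean timedOut = false;
        boolean stillAlive = false;
        try {
            try {
                task.get(200, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                timedOut = true;
                task.cancel(true); // only interrupts; a loop that ignores interrupts keeps running
            }
            Thread.sleep(200);
            stillAlive = worker.isAlive();
            slow.stop = true;      // let the demo worker finish
            worker.join(2000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e);
        }
        return timedOut && stillAlive;
    }

    public static void main(String[] args) {
        System.out.println("worker still running after timeout: " + workerSurvivesTimeout());
    }
}
```

If something like this applies here, a thread dump could show workers still busy with documents that were logged long before the last `Parsing:` line. That is an assumption to verify, not a diagnosis.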

On Monday 04 July 2011 15:21:53 Julien Nioche wrote:
> Which is the one that loops with the ParserChecker?
> 
> On 4 July 2011 14:18, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > These are the last few lines of the currently running parse job:
> > 
> > 2011-07-04 11:43:15,450 INFO  parse.ParseSegment - Parsing: http://www.elseviergezondheidszorg.nl/1068128/Stappenplan-Zorgvisie-Opleidingwijzer.pdf
> > 2011-07-04 11:43:16,173 INFO  parse.ParseSegment - Parsing: http://www.elseviergezondheidszorg.nl/1128911/Aanmelden-nieuwsbrief.html
> > 2011-07-04 11:43:16,316 INFO  parse.ParseSegment - Parsing: http://www.elsevieropleidingen.nl/applicaties/alfabetische-opleidinglijst.aspx
> > 2011-07-04 11:43:16,324 INFO  parse.ParseSegment - Parsing: http://www.elsgulpen.nl
> > 2011-07-04 11:43:16,328 INFO  parse.ParseSegment - Parsing: http://www.elshaarzaak.nl/
> > 2011-07-04 11:44:53,197 WARN  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
> > 2011-07-04 11:45:02,877 WARN  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
> > 
> > I see no text file, only HTML and one PDF. The elshaarzaak.nl page is
> > confirmed to parse nicely in a small test crawl on another machine
> > using the same Nutch 1.4-dev version and config.
> > 
> > On Monday 04 July 2011 15:13:10 Julien Nioche wrote:
> > > Only the last one is likely to correspond to that document as the first
> > > 2 are for a .txt document.
> > > 
> > > Can you tell me what the URL is so that I can check whether the issue
> > > is reproducible?
> > > 
> > > Thanks
> > > 
> > > > > try calling jstack to see where it is stuck?
> > > > 
> > > > I've obtained a thread dump but need some assistance on how to
> > > > interpret it. It is actually doing something, as some threads'
> > > > traces change between jstack calls.
> > > > 
> > > > 
> > > > These three threads change. Note the calls to Tika. I'm no longer
> > > > sure what it's processing now. I'm only sure the last log line
> > > > `Parsing: URL` is a plain old HTML page.
> > > > 
> > > > Thanks
> > > > 
> > > > 
> > > > "Thread-91065" prio=10 tid=0x00007ff788146000 nid=0x2a30 runnable [0x00007ff77b5f4000]
> > > >   java.lang.Thread.State: RUNNABLE
> > > >        at java.util.Arrays.copyOf(Arrays.java:2882)
> > > >        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
> > > >        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
> > > >        at java.lang.StringBuffer.append(StringBuffer.java:224)
> > > >        - locked <0x00000000dc200000> (a java.lang.StringBuffer)
> > > >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> > > >        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> > > >        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> > > >        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
> > > >        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
> > > >        at org.apache.tika.sax.SafeContentHandler.writeReplacement(SafeContentHandler.java:143)
> > > >        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:105)
> > > >        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
> > > >        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
> > > >        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
> > > >        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> > > >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > > >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > > >        at java.lang.Thread.run(Thread.java:662)
> > > > 
> > > > and
> > > > 
> > > > 
> > > > "Thread-91016" prio=10 tid=0x00007ff788ad2800 nid=0x2952 runnable [0x00007ff77b7f5000]
> > > >   java.lang.Thread.State: RUNNABLE
> > > >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> > > >        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> > > >        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> > > >        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
> > > >        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
> > > >        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:101)
> > > >        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
> > > >        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
> > > >        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
> > > >        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> > > >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > > >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > > >        at java.lang.Thread.run(Thread.java:662)
> > > > 
> > > > and
> > > > 
> > > > 
> > > > 
> > > > "Thread-57923" prio=10 tid=0x00000000422ef000 nid=0x1fbe runnable [0x00007ff780c14000]
> > > >   java.lang.Thread.State: RUNNABLE
> > > >        at java.util.Arrays.copyOf(Arrays.java:2882)
> > > >        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
> > > >        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
> > > >        at java.lang.StringBuffer.append(StringBuffer.java:224)
> > > >        - locked <0x00000000f5e57810> (a java.lang.StringBuffer)
> > > >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> > > >        at org.cyberneko.html.parsers.DOMFragmentParser.characters(DOMFragmentParser.java:463)
> > > >        at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
> > > >        at org.cyberneko.html.HTMLTagBalancer.characters(HTMLTagBalancer.java:821)
> > > >        at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
> > > >        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
> > > >        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
> > > >        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> > > >        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> > > >        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
> > > >        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
> > > >        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
> > > >        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:147)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> > > >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > > >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > > >        at java.lang.Thread.run(Thread.java:662)
> > > > 
> > > > Thanks
> > > > 
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
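
Since some traces change between jstack calls, comparing two dumps by hand gets tedious. A minimal sketch, assuming each dump has been saved as text in the usual jstack layout (a quoted thread name on a header line, then indented `at ...` frames): it records the topmost frame per thread and reports threads whose top frame differs between two dumps. DumpDiff, topFrames, and changed are hypothetical helper names, and the second sample dump in main is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DumpDiff {

    // Maps each thread name in a jstack-style dump to its topmost "at ..." frame.
    static Map<String, String> topFrames(String dump) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        String current = null;
        for (String raw : dump.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // Header line: the thread name is the text between the first pair of quotes.
                int end = line.indexOf('"', 1);
                current = end > 0 ? line.substring(1, end) : null;
            } else if (current != null && line.startsWith("at ")
                    && !result.containsKey(current)) {
                result.put(current, line.substring(3)); // keep only the first frame
            }
        }
        return result;
    }

    // Lists threads present in both dumps whose top frame differs.
    static Map<String, String> changed(String before, String after) {
        Map<String, String> a = topFrames(before);
        Map<String, String> b = topFrames(after);
        Map<String, String> diff = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : a.entrySet()) {
            String now = b.get(e.getKey());
            if (now != null && !now.equals(e.getValue())) {
                diff.put(e.getKey(), e.getValue() + " -> " + now);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        // First sample: abbreviated from the Thread-91016 trace quoted above.
        String first = "\"Thread-91016\" prio=10 runnable\n"
            + "  java.lang.Thread.State: RUNNABLE\n"
            + "       at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)\n"
            + "       at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)\n";
        // Second sample: hypothetical later dump of the same thread, invented for this demo.
        String second = "\"Thread-91016\" prio=10 runnable\n"
            + "  java.lang.Thread.State: RUNNABLE\n"
            + "       at java.util.Arrays.copyOf(Arrays.java:2882)\n"
            + "       at java.lang.StringBuffer.append(StringBuffer.java:224)\n";
        System.out.println(changed(first, second));
    }
}
```

A changing top frame only shows that the thread is executing code. It does not prove the parse will ever finish, so a thread that keeps cycling through the same few frames across many dumps may still be stuck in a loop.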

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
