manifoldcf-user mailing list archives

From Basso Luca <LBa...@Regione.Emilia-Romagna.it>
Subject Internal server error (500) causing a crawl interruption
Date Mon, 06 Oct 2014 11:08:47 GMT
Hi Karl,
we're using the Web repository connector together with the Solr output connector to
crawl a number of web portals (MCF version 1.6.1). Unfortunately, the crawl job often stops with
the following error:
"Repeated service interruptions - failure processing documents: Server at http://vm97lnx:9474/solr/rerweb5
returned non ok status: 500, message: Internal Server Error".
From the MCF and Solr logs (reported below), the problem seems to arise
from Tika and applies to various types of documents (.rtf, .pdf, etc.).
How can we fix it?
Thank you.

Best regards,
Luca

MCF log:

WARN 2014-10-03 17:00:53,982 (Worker thread '37') - Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
(500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal
Server Error
org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned
non ok status:500, message:Internal Server Error
WARN 2014-10-03 17:00:53,985 (Worker thread '37') - Service interruption reported for job
1412340881687 connection 'Webcrawler': Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
(500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal
Server Error
ERROR 2014-10-03 17:00:53,998 (Worker thread '37') - Exception tossed: Repeated service interruptions
- failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok
status:500, message:Internal Server Error
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions
- failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok
status:500, message:Internal Server Error
Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5
returned non ok status:500, message:Internal Server Error

WARN 2014-10-03 18:05:22,636 (Worker thread '0') - Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
(500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal
Server Error
org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned
non ok status:500, message:Internal Server Error
WARN 2014-10-03 18:05:22,638 (Worker thread '0') - Service interruption reported for job 1412252016695
connection 'Webcrawler': Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
(500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal
Server Error
ERROR 2014-10-03 18:05:22,649 (Worker thread '0') - Exception tossed: Repeated service interruptions
- failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok
status:500, message:Internal Server Error
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions
- failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok
status:500, message:Internal Server Error
Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5
returned non ok status:500, message:Internal Server Error
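
To test the affected documents one at a time, their URLs can be pulled out of the MCF log with a short script. This is only a sketch: it assumes the WARN line format shown above, where the URL follows "Solr exception during indexing" and ends at the next whitespace. The names `failing_urls`, `SAMPLE_URL` and `SAMPLE_LOG` are illustrative and not part of ManifoldCF.

```python
import re

# Assumption: MCF logs the document URL right after this fixed phrase,
# and the URL itself contains no whitespace.
PATTERN = re.compile(r"Solr exception during indexing (\S+)")


def failing_urls(log_text):
    """Return the distinct document URLs found in the log, in first-seen order."""
    seen = []
    for match in PATTERN.finditer(log_text):
        url = match.group(1)
        if url not in seen:
            seen.append(url)
    return seen


# Sample input built from the PDF entries in the log excerpt above.
SAMPLE_URL = (
    "http://territorio.regione.emilia-romagna.it/codice-territorio/"
    "semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf"
)
SAMPLE_LOG = (
    "WARN 2014-10-03 18:05:22,636 (Worker thread '0') - "
    f"Solr exception during indexing {SAMPLE_URL}\n"
    "(500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, "
    "message:Internal\nServer Error\n"
    "WARN 2014-10-03 18:05:22,638 (Worker thread '0') - "
    "Service interruption reported for job 1412252016695\n"
    f"connection 'Webcrawler': Solr exception during indexing {SAMPLE_URL}\n"
)

if __name__ == "__main__":
    for url in failing_urls(SAMPLE_LOG):
        print(url)
```

Each URL is reported once even though the log mentions it in both the "Solr exception" and the "Service interruption" entries.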

SOLR log:

8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2)
null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198:
Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
        at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
        ... 20 more
Caused by: org.apache.pdfbox.exceptions.WrappedIOException
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1206)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1171)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 23 more
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
        at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
        at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
        at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
        ... 27 more

17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2)
null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
        at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
        ... 20 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
        at org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
        at org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:566)
        at org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:492)
        at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:459)
        at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:448)
        at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:56)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 23 more
