manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Internal server error (500) causing a crawl interruption
Date Mon, 20 Oct 2014 10:15:52 GMT
Hi Luca,

I am sorry, but we only get back a 500 error from Solr, and that is not
enough information to determine that Tika failed.  Having a general policy
of ignoring 500 errors, which occur when *any* solr exception is thrown,
seems like a bad idea to me.  Indeed, I am concerned that it is not a Tika
failure that you are seeing, but rather something like Solr running out of
memory, which should definitely never be ignored.

You can tell what the underlying cause is by looking at the actual
exception that Solr logs.
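
As a rough illustration of that check, here is a minimal Python sketch that splits a Solr log into ERROR entries and reports the innermost "Caused by" class for each one. It assumes the log layout shown in the traces quoted further down (a timestamp, then "ERROR [logger]", then "Caused by:" lines); the function name classify_errors and the three labels are invented for this example and are not part of Solr or ManifoldCF.

```python
import re

# Assumed layout: "<timestamp> ERROR [<logger>] ..." starts a new log entry.
ENTRY_START = re.compile(r"^\S+ ERROR \[", re.MULTILINE)
# Each nested cause is written as "Caused by: <exception class>[: message]".
CAUSED_BY = re.compile(r"^[ \t]*Caused by: ([\w.$]+)", re.MULTILINE)


def classify_errors(log_text):
    """Return a (root_cause_class, kind) pair for every ERROR entry.

    kind is "memory" when an OutOfMemoryError appears anywhere in the entry,
    "tika" when a Tika class appears, and "other" for everything else.
    """
    starts = [m.start() for m in ENTRY_START.finditer(log_text)]
    results = []
    for begin, end in zip(starts, starts[1:] + [len(log_text)]):
        entry = log_text[begin:end]
        chain = CAUSED_BY.findall(entry)
        root = chain[-1] if chain else None  # innermost cause, if any
        if "java.lang.OutOfMemoryError" in entry:
            kind = "memory"  # never safe to ignore
        elif "org.apache.tika." in entry:
            kind = "tika"    # document parsing failure
        else:
            kind = "other"
        results.append((root, kind))
    return results
```

Run over the text of a Solr log, this returns one pair per ERROR entry. Anything labelled "memory" or "other" is a 500 that has nothing to do with document parsing.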

Thanks,
Karl


On Mon, Oct 20, 2014 at 5:00 AM, Basso Luca <LBasso@regione.emilia-romagna.it> wrote:

> Hi Shinichiro,
> we found the right configuration just before your suggestion.
> Thank you!
>
> Nevertheless, applying "ignoreTikaException" reduces the problem somewhat
> but does not resolve it completely.
> Specifically, the problem still persists for some PDF files (not only for
> scanned PDFs and/or PDFs converted from MS Office documents).
> Given that the Tika project is not resolving this issue, we suggest that
> the problem could be bypassed at the MCF job or output connector level,
> by means of a specific flag telling the MCF web crawler to skip "non ok
> status: 500, message: Internal Server Error" and keep on crawling.
>
> Dear Karl, could you add this option to the next MCF release?
> Thanks a lot, as always.
>
> Luca
>
>
> -----Original Message-----
> From: Shinichiro Abe [mailto:shinichiro.abe.1@gmail.com]
> Sent: Tuesday, 7 October 2014 03:21
> To: user@manifoldcf.apache.org
> Cc: user@manifoldcf.apache.org
> Subject: Re: Internal server error (500) causing a crawl interruption
>
> Hi Luca,
>
> Please try to configure ignoreTikaException=true.
>
>   <requestHandler name="/update/extract"
>                   class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
>                   startup="lazy">
>     <lst name="defaults">
>       <str name="fmap.content">text</str>
>       <str name="lowernames">true</str>
>       <bool name="ignoreTikaException">true</bool>
>       <str name="uprefix">ignored_</str>
>       <str name="captureAttr">true</str>
>     </lst>
>   </requestHandler>
>
> Regards,
> Shinichiro Abe
>
> On 2014/10/06, at 20:15, Karl Wright <daddywri@gmail.com> wrote:
>
> > Hi Luca,
> >
> > There is a Solr setting which configures Solr Cell to ignore Tika
> > errors.  I don't remember what it is offhand, but you will want to set
> > it so that Tika errors are ignored.
> >
> > Thanks,
> > Karl
> >
> >
> > On Mon, Oct 6, 2014 at 7:08 AM, Basso Luca <LBasso@regione.emilia-romagna.it> wrote:
> > Hi Karl,
> >
> > we are using the Web repository connector together with the Solr output
> > connector to crawl a number of web portals (MCF version 1.6.1).
> > Unfortunately, the crawl job often stops with the following error:
> >
> > “Repeated service interruptions – failure processing documents: Server
> > at http://vm97lnx:9474/solr/rerweb5 returned non ok status: 500, message:
> > Internal Server Error”.
> >
> > From the MCF and Solr logs (included below) it seems that the problem
> > arises in Tika and applies to various types of documents (.rtf, .pdf,
> > etc.).
> >
> > How can we fix it?
> >
> > Thank you.
> >
> >
> >
> > Best regards,
> >
> > Luca
> >
> >
> >
> > MCF log:
> >
> >
> >
> > WARN 2014-10-03 17:00:53,982 (Worker thread '37') - Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > WARN 2014-10-03 17:00:53,985 (Worker thread '37') - Service interruption reported for job 1412340881687 connection 'Webcrawler': Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > ERROR 2014-10-03 17:00:53,998 (Worker thread '37') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> >
> >
> >
> > WARN 2014-10-03 18:05:22,636 (Worker thread '0') - Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > WARN 2014-10-03 18:05:22,638 (Worker thread '0') - Service interruption reported for job 1412252016695 connection 'Webcrawler': Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > ERROR 2014-10-03 18:05:22,649 (Worker thread '0') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> >
> >
> >
> > SOLR log:
> >
> >
> >
> > 8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
> >         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> >         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> >         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> >         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> >         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> >         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> >         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> >         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> >         at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> >         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> >         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> >         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> >         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> >         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> >         at java.lang.Thread.run(Thread.java:745)
> > Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
> >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> >         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> >         ... 20 more
> > Caused by: org.apache.pdfbox.exceptions.WrappedIOException
> >         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
> >         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1206)
> >         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1171)
> >         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)
> >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >         ... 23 more
> > Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
> >         at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
> >         at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
> >         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
> >         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
> >         at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
> >         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
> >         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
> >         ... 27 more
> >
> >
> >
> > 17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
> >         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> >         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> >         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> >         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> >         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> >         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> >         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> >         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> >         at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> >         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> >         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> >         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> >         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> >         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> >         at java.lang.Thread.run(Thread.java:745)
> > Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
> >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> >         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> >         ... 20 more
> > Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
> >         at org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
> >         at org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:566)
> >         at org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:492)
> >         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:459)
> >         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:448)
> >         at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:56)
> >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >         ... 23 more
> >
> >
>
>
