manifoldcf-user mailing list archives

From Kamil Żyta <kamil.z...@pwr.edu.pl>
Subject Re: Internal server error (500) causing a crawl interruption
Date Mon, 20 Oct 2014 16:11:10 GMT
Hi,
I have some bad files too and get 500 errors from Solr; I tested on
Solr stable and trunk (Tika 1.5, 1.6). The ManifoldCF job hangs and never ends.
ManifoldCF has 'Transformation Connections', where I added the Tika extractor.
How does this work? Is it only metadata extraction or MIME detection?
If ManifoldCF had complete Tika extraction, it could handle Tika
errors better.

Regards,
KŻ
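
One way to confirm which files trigger the 500 outside of ManifoldCF is to post them straight to Solr's extract handler. The following is a minimal sketch, not a tested tool: the core URL and file name in the usage note are placeholders, and the helper names `build_extract_url` and `probe_file` are made up for illustration. It relies on Solr Cell's `extractOnly` parameter, which asks Solr to run Tika and return the result without indexing the document.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_extract_url(core_url):
    """Build the /update/extract URL for a Solr core (hypothetical helper).

    extractOnly=true asks Solr Cell to run Tika and return the extracted
    content instead of indexing it.
    """
    query = urllib.parse.urlencode({"extractOnly": "true"})
    return core_url.rstrip("/") + "/update/extract?" + query


def probe_file(core_url, path):
    """POST one file to the extract handler and return the HTTP status code."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        build_extract_url(core_url),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
    )
    try:
        with urllib.request.urlopen(request, timeout=60) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # A non-2xx answer (such as the 500 discussed here) arrives as HTTPError.
        return error.code
```

For example, `probe_file("http://localhost:8983/solr/collection1", "suspect.pdf")` would return 500 for a file that makes the handler throw, assuming a Solr instance is listening at that placeholder address.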

On Mon, Oct 20, 2014 at 06:15:52AM -0400, Karl Wright wrote:
>    Hi Luca,
>    I am sorry, but we only get back a 500 error from Solr, and that is not
>    enough information to determine that Tika failed.  Having a general policy
>    of ignoring 500 errors, which occur when *any* solr exception is thrown,
>    seems like a bad idea to me.  Indeed, I am concerned that it is not a Tika
>    failure that you are seeing, but rather something like Solr running out of
>    memory, which should definitely never be ignored.
>    You can tell by looking at the actual exception Solr logs to determine
>    what the underlying cause is.
>    Thanks,
>    Karl
>    On Mon, Oct 20, 2014 at 5:00 AM, Basso Luca
>    <[1]LBasso@regione.emilia-romagna.it> wrote:
> 
>      Hi Shinichiro,
>      we found the right configuration just before your suggestion.
>      Thank you!
> 
>      Nevertheless, applying "ignoreTikaException" reduces the problem
>      somewhat but doesn't resolve it completely.
>      Specifically, the problem still persists for some PDF files (not only for
>      scanned PDFs and/or PDFs converted from MS Office documents).
>      Given that the Tika project is not resolving this issue, we suggest that
>      the problem could be bypassed at the MCF job or output connector level,
>      by means of a specific flag telling the MCF web crawler to skip "non ok
>      status: 500, message: Internal Server Error" and keep on crawling.
> 
>      Dear Karl, could you add this option to the next MCF release?
>      Thanks a lot, as ever.
> 
>      Luca
> 
>      -----Original Message-----
>      From: Shinichiro Abe [mailto:[2]shinichiro.abe.1@gmail.com]
>      Sent: Tuesday, 7 October 2014 03:21
>      To: [3]user@manifoldcf.apache.org
>      Cc: [4]user@manifoldcf.apache.org
>      Subject: Re: Internal server error (500) causing a crawl interruption
>      Hi Luca,
> 
>      Please try to configure ignoreTikaException=true.
> 
>        <requestHandler name="/update/extract"
>                        class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
>                        startup="lazy">
>          <lst name="defaults">
>            <str name="fmap.content">text</str>
>            <str name="lowernames">true</str>
>            <bool name="ignoreTikaException">true</bool>
>            <str name="uprefix">ignored_</str>
>            <str name="captureAttr">true</str>
>          </lst>
>        </requestHandler>
> 
>      Regards,
>      Shinichiro Abe
> 
>      On 2014/10/06, at 20:15, Karl Wright <[5]daddywri@gmail.com> wrote:
> 
>      > Hi Luca,
>      >
>      > There is a Solr setting which configures Solr Cell to ignore Tika
>      errors.  I don't remember what it is offhand, but you will want to set
>      it properly to disable Tika errors.
>      >
>      > Thanks,
>      > Karl
>      >
>      >
>      > On Mon, Oct 6, 2014 at 7:08 AM, Basso Luca
>      <[6]LBasso@regione.emilia-romagna.it> wrote:
>      > Hi Karl,
>      >
>      > we’re using the Web repository connector in conjunction with the Solr
>      output connector to crawl a number of web portals (MCF vers. 1.6.1).
>      Unfortunately the crawl job often stops giving the following error:
>      >
>      > “Repeated service interruptions – failure processing documents: Server
>      at [7]http://vm97lnx:9474/solr/rerweb5 returned non ok status: 500,
>      message: Internal Server Error”.
>      >
>      > From the MCF and Solr logs (which we report hereafter) it seems that
>      the problem arises from Tika and applies to various types of documents
>      (.rtf, .pdf, etc.).
>      >
>      > How can we fix it?
>      >
>      > Thank you.
>      >
>      >
>      >
>      > Best regards,
>      >
>      > Luca
>      >
>      >
>      >
>      > MCF log:
>      >
>      >
>      >
>      > WARN 2014-10-03 17:00:53,982 (Worker thread '37') - Solr exception
>      during indexing
>      [8]http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
>      (500): Server at [9]http://vm97lnx:9474/solr/rerweb5 returned non ok
>      status:500, message:Internal Server Error
>      >
>      > org.apache.solr.common.SolrException: Server at
>      [10]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
>      message:Internal Server Error
>      >
>      > WARN 2014-10-03 17:00:53,985 (Worker thread '37') - Service
>      interruption reported for job 1412340881687 connection 'Webcrawler':
>      Solr exception during indexing
>      [11]http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
>      (500): Server at [12]http://vm97lnx:9474/solr/rerweb5 returned non ok
>      status:500, message:Internal Server Error
>      >
>      > ERROR 2014-10-03 17:00:53,998 (Worker thread '37') - Exception tossed:
>      Repeated service interruptions - failure processing document: Server at
>      [13]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
>      message:Internal Server Error
>      >
>      > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated
>      service interruptions - failure processing document: Server at
>      [14]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
>      message:Internal Server Error
>      >
>      > Caused by: org.apache.solr.common.SolrException: Server at
>      [15]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
>      message:Internal Server Error
>      >
>      >
>      >
>      > WARN 2014-10-03 18:05:22,636 (Worker thread '0') - Solr exception
>      during indexing
>      [16]http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
>      (500): Server at [17]http://vm97lnx:9474/solr/rerweb5 returned non ok
>      status:500, message:Internal Server Error
>      >
>      > org.apache.solr.common.SolrException: Server at
>      [18]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
>      message:Internal Server Error
>      >
>      > WARN 2014-10-03 18:05:22,638 (Worker thread '0') - Service
>      interruption reported for job 1412252016695 connection 'Webcrawler':
>      Solr exception during indexing
>      [19]http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
>      (500): Server at [20]http://vm97lnx:9474/solr/rerweb5 returned non ok
>      status:500, message:Internal Server Error
>      >
>      > ERROR 2014-10-03 18:05:22,649 (Worker thread '0') - Exception tossed:
>      Repeated service interruptions - failure processing document: Server at
>      [21]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
>      message:Internal Server Error
>      >
>      > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated
>      service interruptions - failure processing document: Server at
>      [22]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
>      message:Internal Server Error
>      >
>      > Caused by: org.apache.solr.common.SolrException: Server at
>      [23]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
>      message:Internal Server Error
>      >
>      >
>      >
>      > SOLR log:
>      >
>      >
>      >
>      > 8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
>      >         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>      >         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>      >         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>      >         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
>      >         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
>      >         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
>      >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
>      >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
>      >         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
>      >         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
>      >         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
>      >         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
>      >         at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
>      >         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
>      >         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>      >         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>      >         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
>      >         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
>      >         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
>      >         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
>      >         at java.lang.Thread.run(Thread.java:745)
>      > Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
>      >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
>      >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>      >         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>      >         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>      >         ... 20 more
>      > Caused by: org.apache.pdfbox.exceptions.WrappedIOException
>      >         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
>      >         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1206)
>      >         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1171)
>      >         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)
>      >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>      >         ... 23 more
>      > Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
>      >         at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
>      >         at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
>      >         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
>      >         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
>      >         at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
>      >         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
>      >         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
>      >         ... 27 more
>      >
>      >
>      >
>      > 17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
>      >         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>      >         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>      >         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>      >         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
>      >         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
>      >         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
>      >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
>      >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
>      >         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
>      >         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
>      >         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
>      >         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
>      >         at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
>      >         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
>      >         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>      >         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>      >         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
>      >         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
>      >         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
>      >         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
>      >         at java.lang.Thread.run(Thread.java:745)
>      > Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
>      >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>      >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>      >         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>      >         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>      >         ... 20 more
>      > Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
>      >         at org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
>      >         at org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:566)
>      >         at org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:492)
>      >         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:459)
>      >         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:448)
>      >         at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:56)
>      >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>      >         ... 23 more
>      >
>      >
> 
> References
> 
>    Visible links
>    1. mailto:LBasso@regione.emilia-romagna.it
>    2. mailto:shinichiro.abe.1@gmail.com
>    3. mailto:user@manifoldcf.apache.org
>    4. mailto:user@manifoldcf.apache.org
>    5. mailto:daddywri@gmail.com
>    6. mailto:LBasso@regione.emilia-romagna.it
>    7. http://vm97lnx:9474/solr/rerweb5
>    8. http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
>    9. http://vm97lnx:9474/solr/rerweb5
>   10. http://vm97lnx:9474/solr/rerweb5
>   11. http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
>   12. http://vm97lnx:9474/solr/rerweb5
>   13. http://vm97lnx:9474/solr/rerweb5
>   14. http://vm97lnx:9474/solr/rerweb5
>   15. http://vm97lnx:9474/solr/rerweb5
>   16. http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
>   17. http://vm97lnx:9474/solr/rerweb5
>   18. http://vm97lnx:9474/solr/rerweb5
>   19. http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
>   20. http://vm97lnx:9474/solr/rerweb5
>   21. http://vm97lnx:9474/solr/rerweb5
>   22. http://vm97lnx:9474/solr/rerweb5
>   23. http://vm97lnx:9474/solr/rerweb5
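
Karl's advice in the thread above, to read the actual exception in the Solr log before deciding whether a 500 is safe to skip, can be sketched as a small script. This is only an illustration under stated assumptions: it expects an unwrapped Java stack trace as text, the function names `causes` and `looks_like_tika_failure` are hypothetical, and the rule "a Tika class in the cause chain and no OutOfMemoryError" is a rough heuristic, not something ManifoldCF or Solr implements.

```python
# Shortened sample built from the RTF failure quoted in this thread.
SAMPLE = """org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
    at org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
"""

PREFIX = "Caused by:"


def causes(stack_trace):
    """Return every 'Caused by:' entry in order; the last one is the root cause."""
    found = []
    for line in stack_trace.splitlines():
        text = line.strip()
        if text.startswith(PREFIX):
            found.append(text[len(PREFIX):].strip())
    return found


def looks_like_tika_failure(stack_trace):
    """Heuristic: a Tika exception in the chain and no sign of memory exhaustion."""
    chain = causes(stack_trace)
    if any("OutOfMemoryError" in cause for cause in chain):
        # Memory exhaustion should never be treated as a skippable document error.
        return False
    return any("org.apache.tika." in cause for cause in chain)


if __name__ == "__main__":
    print(causes(SAMPLE)[-1])
    print(looks_like_tika_failure(SAMPLE))
```

Keeping the memory check first reflects the concern raised in the thread: a blanket policy of ignoring 500 responses would also hide failures that have nothing to do with a bad document.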
