manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Internal server error (500) causing a crawl interruption
Date Mon, 20 Oct 2014 16:13:42 GMT
Can you provide the solr exception, from the solr log?
Karl
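
For anyone following the thread, here is a minimal Python sketch of what to look for once the Solr log is in hand. It assumes the raw (unwrapped) stack trace has been copied into a string and relies on the `Caused by:` convention visible in the Solr log quoted further down. The names `cause_chain` and `looks_like_tika_failure` are made up for illustration; they are not part of Solr or ManifoldCF.

```python
import re
from typing import List

# Captures the Java class name on each "Caused by:" line,
# e.g. org.apache.tika.exception.TikaException or java.lang.OutOfMemoryError.
_CAUSE = re.compile(r"Caused by:\s+([\w.$]+(?:Exception|Error))")


def cause_chain(trace: str) -> List[str]:
    """Return the exception class names from the 'Caused by:' lines, outermost first."""
    return _CAUSE.findall(trace)


def looks_like_tika_failure(trace: str) -> bool:
    """Heuristic (an assumption, not an official rule): True if any cause is a Tika class."""
    return any(name.startswith("org.apache.tika.") for name in cause_chain(trace))
```

The last entry of `cause_chain(...)` is the root cause. If the chain contains something like `java.lang.OutOfMemoryError` instead of a Tika class, the 500 is probably not a document-parsing problem and should not be skipped.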

On Mon, Oct 20, 2014 at 12:11 PM, Kamil Żyta <kamil.zyta@pwr.edu.pl> wrote:

> Hi,
> I have some bad files too and get 500 errors from Solr; I tested on
> Solr stable and trunk (Tika 1.5, 1.6). The ManifoldCF job hangs and never ends.
> ManifoldCF has 'Transformation Connections', where I added the Tika extractor.
> How does this work? Is it only metadata extraction or MIME detection?
> If ManifoldCF had complete Tika extraction, it could handle Tika errors
> better.
>
> Regards,
> KŻ
>
> On Mon, Oct 20, 2014 at 06:15:52AM -0400, Karl Wright wrote:
> >    Hi Luca,
> >    I am sorry, but we only get back a 500 error from Solr, and that is not
> >    enough information to determine that Tika failed.  Having a general policy
> >    of ignoring 500 errors, which occur when *any* Solr exception is thrown,
> >    seems like a bad idea to me.  Indeed, I am concerned that it is not a Tika
> >    failure that you are seeing, but rather something like Solr running out of
> >    memory, which should definitely never be ignored.
> >    You can tell what the underlying cause is by looking at the actual
> >    exception in the Solr logs.
> >    Thanks,
> >    Karl
> >    On Mon, Oct 20, 2014 at 5:00 AM, Basso Luca
> >    <[1]LBasso@regione.emilia-romagna.it> wrote:
> >
> >      Hi Shinichiro,
> >      we found the right configuration just before your suggestion.
> >      Thank you!
> >
> >      Nevertheless, applying "ignoreTikaException" reduces the problem
> >      somewhat but doesn't resolve it completely.
> >      Specifically, the problem still persists for some PDF files (not only for
> >      scanned PDFs and/or PDFs converted from MS Office documents).
> >      Given that the Tika project is not resolving this issue, we suggest that
> >      the problem could be bypassed at the MCF job or output connector level,
> >      by means of a specific flag telling the MCF web crawler to skip "non ok
> >      status: 500, message: Internal Server Error" and keep on crawling.
> >
> >      Dear Karl, could you add this option in the next MCF release?
> >      Thanks a lot, as ever.
> >
> >      Luca
> >
> >      -----Original message-----
> >      From: Shinichiro Abe [mailto:[2]shinichiro.abe.1@gmail.com]
> >      Sent: Tuesday, 7 October 2014 03:21
> >      To: [3]user@manifoldcf.apache.org
> >      Cc: [4]user@manifoldcf.apache.org
> >      Subject: Re: Internal server error (500) causing a crawl interruption
> >      Hi Luca,
> >
> >      Please try to configure ignoreTikaException=true.
> >
> >        <requestHandler name="/update/extract"
> >                        class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
> >                        startup="lazy">
> >          <lst name="defaults">
> >            <str name="fmap.content">text</str>
> >            <str name="lowernames">true</str>
> >            <bool name="ignoreTikaException">true</bool>
> >            <str name="uprefix">ignored_</str>
> >            <str name="captureAttr">true</str>
> >          </lst>
> >        </requestHandler>
> >
> >      Regards,
> >      Shinichiro Abe
> >
> >      On 2014/10/06, at 20:15, Karl Wright <[5]daddywri@gmail.com> wrote:
> >
> >      > Hi Luca,
> >      >
> >      > There is a Solr setting which configures Solr Cell to ignore Tika
> >      > errors.  I don't remember what it is offhand, but you will want to set
> >      > it properly to disable Tika errors.
> >      >
> >      > Thanks,
> >      > Karl
> >      >
> >      >
> >      > On Mon, Oct 6, 2014 at 7:08 AM, Basso Luca
> >      <[6]LBasso@regione.emilia-romagna.it> wrote:
> >      > Hi Karl,
> >      >
> >      > we’re using the Web repository connector in conjunction with the Solr
> >      > output connector to crawl a number of web portals (MCF version 1.6.1).
> >      > Unfortunately the crawl job often stops, giving the following error:
> >      >
> >      > “Repeated service interruptions – failure processing documents: Server
> >      > at [7]http://vm97lnx:9474/solr/rerweb5 returned non ok status: 500,
> >      > message: Internal Server Error”.
> >      >
> >      > From the MCF and SOLR logs (which we report hereafter) it seems that
> >      > the problem arises from Tika and applies to various types of documents
> >      > (.rtf, .pdf, etc.).
> >      >
> >      > How can we fix it?
> >      >
> >      > Thank you.
> >      >
> >      >
> >      >
> >      > Best regards,
> >      >
> >      > Luca
> >      >
> >      >
> >      >
> >      > MCF log:
> >      >
> >      >
> >      >
> >      > WARN 2014-10-03 17:00:53,982 (Worker thread '37') - Solr exception
> >      during indexing
> >      [8]
> http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
> >      (500): Server at [9]http://vm97lnx:9474/solr/rerweb5 returned non
> ok
> >      status:500, message:Internal Server Error
> >      >
> >      > org.apache.solr.common.SolrException: Server at
> >      [10]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
> >      message:Internal Server Error
> >      >
> >      > WARN 2014-10-03 17:00:53,985 (Worker thread '37') - Service
> >      interruption reported for job 1412340881687 connection 'Webcrawler':
> >      Solr exception during indexing
> >      [11]
> http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
> >      (500): Server at [12]http://vm97lnx:9474/solr/rerweb5 returned non
> ok
> >      status:500, message:Internal Server Error
> >      >
> >      > ERROR 2014-10-03 17:00:53,998 (Worker thread '37') - Exception
> tossed:
> >      Repeated service interruptions - failure processing document:
> Server at
> >      [13]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
> >      message:Internal Server Error
> >      >
> >      > org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> Repeated
> >      service interruptions - failure processing document: Server at
> >      [14]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
> >      message:Internal Server Error
> >      >
> >      > Caused by: org.apache.solr.common.SolrException: Server at
> >      [15]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
> >      message:Internal Server Error
> >      >
> >      >
> >      >
> >      > WARN 2014-10-03 18:05:22,636 (Worker thread '0') - Solr exception
> >      during indexing
> >      [16]
> http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
> >      (500): Server at [17]http://vm97lnx:9474/solr/rerweb5 returned non
> ok
> >      status:500, message:Internal Server Error
> >      >
> >      > org.apache.solr.common.SolrException: Server at
> >      [18]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
> >      message:Internal Server Error
> >      >
> >      > WARN 2014-10-03 18:05:22,638 (Worker thread '0') - Service
> >      interruption reported for job 1412252016695 connection 'Webcrawler':
> >      Solr exception during indexing
> >      [19]
> http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
> >      (500): Server at [20]http://vm97lnx:9474/solr/rerweb5 returned non
> ok
> >      status:500, message:Internal Server Error
> >      >
> >      > ERROR 2014-10-03 18:05:22,649 (Worker thread '0') - Exception
> tossed:
> >      Repeated service interruptions - failure processing document:
> Server at
> >      [21]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
> >      message:Internal Server Error
> >      >
> >      > org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> Repeated
> >      service interruptions - failure processing document: Server at
> >      [22]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
> >      message:Internal Server Error
> >      >
> >      > Caused by: org.apache.solr.common.SolrException: Server at
> >      [23]http://vm97lnx:9474/solr/rerweb5 returned non ok status:500,
> >      message:Internal Server Error
> >      >
> >      >
> >      >
> >      > SOLR log:
> >      >
> >      >
> >      >
> >      > 8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter]
> >      (http-/10.10.80.97:9474-2)
> null:org.apache.solr.common.SolrException:
> >      org.apache.tika.exception.TikaException: TIKA-198: Illegal
> IOException
> >      from org.apache.tika.parser.pdf.PDFParser@6533a82a
> >      >
> >      >        at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> >      >
> >      >         at
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >      >
> >      >         at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >      >
> >      >         at
> >
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> >      >
> >      >         at
> org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> >      >
> >      >         at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> >      >
> >      >         at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> >      >
> >      >         at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> >      >
> >      >         at
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> >      >
> >      >         at
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> >      >
> >      >         at
> >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> >      >
> >      >         at
> >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> >      >
> >      >         at
> >
> org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> >      >
> >      >         at
> >
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> >      >
> >      >         at
> >
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >      >
> >      >         at
> >
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >      >
> >      >         at
> >
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> >      >
> >      >         at
> >
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> >      >
> >      >         at
> >
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> >      >
> >      >         at
> >
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> >      >
> >      >         at java.lang.Thread.run(Thread.java:745)
> >      >
> >      > Caused by: org.apache.tika.exception.TikaException: TIKA-198:
> Illegal
> >      IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
> >      >
> >      >         at
> >
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> >      >
> >      >         at
> >
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >      >
> >      >         at
> >
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> >      >
> >      >         at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> >      >
> >      >         ... 20 more
> >      >
> >      > Caused by: org.apache.pdfbox.exceptions.WrappedIOException
> >      >
> >      >         at
> >      org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
> >      >
> >      >         at
> >      org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1206)
> >      >
> >      >         at
> >      org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1171)
> >      >
> >      >         at
> >      org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)
> >      >
> >      >         at
> >
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >      >
> >      >         ... 23 more
> >      >
> >      > Caused by: java.lang.StringIndexOutOfBoundsException: String
> index out
> >      of range: 2047
> >      >
> >      >         at
> >
> java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
> >      >
> >      >         at
> >      java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
> >      >
> >      >         at
> >
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
> >      >
> >      >         at
> >
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
> >      >
> >      >         at
> >
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
> >      >
> >      >         at
> >
> org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
> >      >
> >      >         at
> >      org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
> >      >
> >      >         ... 27 more
> >      >
> >      >
> >      >
> >      > 17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter]
> >      (http-/10.10.80.97:9474-2)
> null:org.apache.solr.common.SolrException:
> >      org.apache.tika.exception.TikaException: Unexpected RuntimeException
> >      from org.apache.tika.parser.rtf.RTFParser@73361285
> >      >
> >      >         at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> >      >
> >      >         at
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >      >
> >      >         at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >      >
> >      >         at
> >
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> >      >
> >      >         at
> org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> >      >
> >      >         at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> >      >
> >      >         at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> >      >
> >      >         at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> >      >
> >      >         at
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> >      >
> >      >         at
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> >      >
> >      >         at
> >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> >      >
> >      >         at
> >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> >      >
> >      >         at
> >
> org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> >      >
> >      >         at
> >
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> >      >
> >      >         at
> >
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >      >
> >      >         at
> >
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >      >
> >      >         at
> >
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> >      >
> >      >         at
> >
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> >      >
> >      >         at
> >
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> >      >
> >      >         at
> >
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> >      >
> >      >         at java.lang.Thread.run(Thread.java:745)
> >      >
> >      > Caused by: org.apache.tika.exception.TikaException: Unexpected
> >      RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
> >      >
> >      >         at
> >
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> >      >
> >      >         at
> >
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >      >
> >      >         at
> >
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> >      >
> >      >         at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> >      >
> >      >         ... 20 more
> >      >
> >      > Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
> >      >
> >      >         at
> >
> org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
> >      >
> >      >         at
> >
> org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:566)
> >      >
> >      >         at
> >
> org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:492)
> >      >
> >      >         at
> >
> org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:459)
> >      >
> >      >         at
> >
> org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:448)
> >      >
> >      >         at
> >      org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:56)
> >      >
> >      >         at
> >
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >      >
> >      >         ... 23 more
> >      >
> >      >
> >
> > References
> >
> >    Visible links
> >    1. mailto:LBasso@regione.emilia-romagna.it
> >    2. mailto:shinichiro.abe.1@gmail.com
> >    3. mailto:user@manifoldcf.apache.org
> >    4. mailto:user@manifoldcf.apache.org
> >    5. mailto:daddywri@gmail.com
> >    6. mailto:LBasso@regione.emilia-romagna.it
> >    7. http://vm97lnx:9474/solr/rerweb5
> >    8.
> http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
> >    9. http://vm97lnx:9474/solr/rerweb5
> >   10. http://vm97lnx:9474/solr/rerweb5
> >   11.
> http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
> >   12. http://vm97lnx:9474/solr/rerweb5
> >   13. http://vm97lnx:9474/solr/rerweb5
> >   14. http://vm97lnx:9474/solr/rerweb5
> >   15. http://vm97lnx:9474/solr/rerweb5
> >   16.
> http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
> >   17. http://vm97lnx:9474/solr/rerweb5
> >   18. http://vm97lnx:9474/solr/rerweb5
> >   19.
> http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
> >   20. http://vm97lnx:9474/solr/rerweb5
> >   21. http://vm97lnx:9474/solr/rerweb5
> >   22. http://vm97lnx:9474/solr/rerweb5
> >   23. http://vm97lnx:9474/solr/rerweb5
>
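
The `ignoreTikaException` setting that Shinichiro quoted as a handler default can, as far as I know, also be passed as an ordinary request parameter to the extracting handler, which is handy for testing a single problem file before touching solrconfig.xml. The sketch below only builds such a URL; it sends nothing. The base URL, the document id, and the helper name `extract_url` are placeholders invented for illustration, and whether a given Solr version honors the parameter per request is an assumption worth checking against its Solr Cell documentation.

```python
from urllib.parse import urlencode


def extract_url(core_url: str, doc_id: str, ignore_tika_exception: bool = True) -> str:
    """Build an /update/extract URL (hypothetical helper; no request is sent)."""
    params = {
        # literal.<field> sets a field value directly on the indexed document.
        "literal.id": doc_id,
        # Same name as the <bool> default in the solrconfig.xml snippet from the thread.
        "ignoreTikaException": "true" if ignore_tika_exception else "false",
    }
    return core_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Placeholder core URL and id, not taken from the thread.
print(extract_url("http://localhost:8983/solr/collection1", "doc-1"))
```

Posting one of the failing PDF or RTF files to the resulting URL, and comparing the response with and without the flag, would show whether the remaining 500s really come from Tika.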
