Subject: Re: Internal server error (500) causing a crawl interruption
From: Shinichiro Abe
Date: Tue, 7 Oct 2014 10:20:38 +0900
To: user@manifoldcf.apache.org

Hi Luca,

Please try to configure ignoreTikaException=true.

text true true ignored_ true

Regards,
Shinichiro Abe

On 2014/10/06, at 20:15, Karl Wright wrote:

> Hi Luca,
>
> There is a Solr setting that configures Solr Cell to ignore Tika errors. I don't remember what it is offhand, but you will want to set it so that Tika errors are ignored.
>
> Thanks,
> Karl
>
>
> On Mon, Oct 6, 2014 at 7:08 AM, Basso Luca wrote:
> Hi Karl,
>
> we’re using the Web repository connector in conjunction with the Solr output connector to crawl a number of web portals (MCF vers. 1.6.1). Unfortunately, the crawl job often stops with the following error:
>
> “Repeated service interruptions – failure processing documents: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status: 500, message: Internal Server Error”.
>
> From the MCF and Solr logs (reported below), the problem seems to arise from Tika and applies to various types of documents (.rtf, .pdf, etc.).
>
> How can we fix it?
>
> Thank you.
>
> Best regards,
>
> Luca
>
> MCF log:
>
> WARN 2014-10-03 17:00:53,982 (Worker thread '37') - Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> WARN 2014-10-03 17:00:53,985 (Worker thread '37') - Service interruption reported for job 1412340881687 connection 'Webcrawler': Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> ERROR 2014-10-03 17:00:53,998 (Worker thread '37') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
>
> WARN 2014-10-03 18:05:22,636 (Worker thread '0') - Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> WARN 2014-10-03 18:05:22,638 (Worker thread '0') - Service interruption reported for job 1412252016695 connection 'Webcrawler': Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> ERROR 2014-10-03 18:05:22,649 (Worker thread '0') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
>
> SOLR log:
>
> 8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
>     at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
>     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
>     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>     ... 20 more
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1206)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1171)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 23 more
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
>     at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
>     at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
>     ... 27 more
>
> 17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
>     at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
>     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
>     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>     ... 20 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
>     at org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
>     at org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:566)
>     at org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:492)
>     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:459)
>     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:448)
>     at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:56)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 23 more
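
For reference, here is a rough sketch of where the ignoreTikaException setting suggested in the reply might go in solrconfig.xml. Only the values of the posted snippet are visible in the reply ("text true true ignored_ true"), so the handler path and every parameter name other than ignoreTikaException are assumptions modeled on a typical Solr Cell configuration, not a reconstruction of what was posted. As far as I know, with ignoreTikaException set to true Solr Cell skips extraction for a document that Tika cannot parse and indexes whatever metadata is available, rather than failing the request with a 500.

```xml
<!-- solrconfig.xml (sketch): everything except ignoreTikaException is assumed -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- assumed: map the extracted body into a field named "text" -->
    <str name="fmap.content">text</str>
    <!-- assumed: lowercase the metadata field names coming from Tika -->
    <str name="lowernames">true</str>
    <!-- assumed: prefix unknown fields so a dynamic field can discard them -->
    <str name="uprefix">ignored_</str>
    <!-- from the reply: do not fail the update when Tika throws -->
    <str name="ignoreTikaException">true</str>
  </lst>
</requestHandler>
```

The handler needs to be the one the ManifoldCF Solr output connection posts to, and the core has to be reloaded (or Solr restarted) for the change to take effect. Documents that Tika cannot parse would then be indexed without body text, so the crawl should continue instead of aborting on repeated 500 responses.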