manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rafa Haro <rh...@apache.org>
Subject Re: External Tika Server
Date Tue, 04 Dec 2018 09:12:09 GMT
Hi Furkan,

You seem to be getting a Timeout from Tesseract. This might be happening
with large documents (too many pages). Maybe there is some configuration
parameter for increasing timeouts that you can use at Tika side

Rafa

On Tue, Dec 4, 2018 at 9:58 AM Furkan KAMACI <furkankamaci@gmail.com> wrote:

> Hi,
>
> I try to test external OCR capabilities of Tika Server with ManifoldCF
> 2.11. Documents are parsed when I curl documents into Tika Server directly.
> However, when I try to parse them via Tika Server I get that error at
> *most* of the documents (not all of them):
>
> INFO  meta (application/msword)
> WARN  meta: Text extraction failed
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
> at
> org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
> at
> org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
> at
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
> at
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> at
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
> at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
> at
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> at
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.eclipse.jetty.server.Server.handle(Server.java:531)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
> at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
> at
> org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
> at
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
> at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
> at
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
> at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> at
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> ... 44 more
> Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser
> timeout
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
> at
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
> at
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
> ... 50 more
> Caused by: java.util.concurrent.TimeoutException
> at java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
> ... 55 more
>
> How can I solve it?
>
> Kind Regards,
> Furkan KAMACI
>

Mime
View raw message