On my Tika Server, I added the following startup options:

-spawnChild -taskTimeoutMillis 1000000

to work around the timeout problem.
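For reference, here is a minimal sketch of how the full launch command could be assembled with those options. The jar name `tika-server.jar` and the helper name `tika_server_command` are only assumptions for illustration; adjust them to your installation. As far as I know, `-taskTimeoutMillis` only takes effect together with `-spawnChild`, and 1000000 ms is roughly 16.7 minutes.

```python
import shlex


def tika_server_command(jar_path="tika-server.jar", task_timeout_millis=1_000_000):
    """Build the argument list for starting Tika Server in spawn-child mode.

    jar_path is a hypothetical location; point it at your actual tika-server jar.
    """
    return [
        "java", "-jar", jar_path,
        "-spawnChild",                                   # run parsing in a child process
        "-taskTimeoutMillis", str(task_timeout_millis),  # per-task timeout in milliseconds
    ]


if __name__ == "__main__":
    # Print the command so it can be copied into a shell or a service file.
    print(shlex.join(tika_server_command()))
```

The list form can be passed directly to `subprocess.Popen` if you want to start the server from a script.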

 

Mario

From: Furkan KAMACI <furkankamaci@gmail.com>
Sent: Tuesday, December 4, 2018 10:16
To: user@manifoldcf.apache.org; Rafa Haro <rharo@apache.org>
Subject: Re: External Tika Server

 

Hi Rafa,

I can parse the same document through the HTTP URL of Tika Server directly. I thought there might be a timeout parameter within ManifoldCF that applies when it communicates with Tika Server :)
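To show what I mean by parsing through the HTTP URL, here is a rough sketch of a direct request with an explicit client-side timeout. It assumes Tika Server listens on `http://localhost:9998` and that the `/tika` endpoint accepts a PUT with the document body; the function names and the 600-second timeout are only illustrative, and this is not how ManifoldCF itself talks to Tika.

```python
import urllib.request


def build_tika_request(document: bytes, base_url: str = "http://localhost:9998"):
    """Prepare a PUT request that sends the raw document to the /tika endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain extracted text
    )


def extract_text(document: bytes, base_url: str = "http://localhost:9998",
                 timeout_seconds: float = 600.0) -> str:
    """Send the document and wait up to timeout_seconds for the extracted text."""
    request = build_tika_request(document, base_url)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8")
```

Note that a long client timeout only helps if the server side (Tika and Tesseract) is also allowed to run that long.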

 

Kind Regards,

Furkan KAMACI

 

On Tue, Dec 4, 2018 at 12:13 PM Rafa Haro <rharo@apache.org> wrote:

Hi Furkan,

You seem to be getting a timeout from Tesseract. This can happen with large documents (too many pages). There may be a configuration parameter on the Tika side that you can use to increase the timeout.

 

Rafa

 

On Tue, Dec 4, 2018 at 9:58 AM Furkan KAMACI <furkankamaci@gmail.com> wrote:

Hi,

I am trying to test the external OCR capabilities of Tika Server with ManifoldCF 2.11. Documents are parsed when I curl them into Tika Server directly. However, when I try to parse them through ManifoldCF (using the external Tika Server), I get the following error for most of the documents (not all of them):

 

INFO  meta (application/msword)
WARN  meta: Text extraction failed
org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
    at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
    at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    ... 44 more
Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
    at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
    ... 50 more
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
    ... 55 more

 

How can I solve it?

 

Kind Regards,

Furkan KAMACI