manifoldcf-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject Re: External Tika Server
Date Wed, 05 Dec 2018 13:28:15 GMT
I use Tika Server 1.19.1.

On Wed, Dec 5, 2018 at 4:14 PM Bisonti Mario <Mario.Bisonti@vimar.com>
wrote:

> Hello.
>
> Which version of Tika Server are you using?
>
> You could try downloading the latest stable build from here to check whether
> it works:
>
>
>
> https://builds.apache.org/job/Tika-trunk/lastStableBuild/
>
>
>
>
>
> *From:* Furkan KAMACI <furkankamaci@gmail.com>
> *Sent:* Wednesday, December 5, 2018 13:37
> *To:* user@manifoldcf.apache.org
> *Cc:* Rafa Haro <rharo@apache.org>
> *Subject:* Re: External Tika Server
>
>
>
> Hi Mario,
>
>
>
> Thanks for the answer. I still get an error for a PDF that parses fine over
> HTTP but fails via ManifoldCF. This is the error:
>
>
>
> WARN  meta: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7e76e3f5
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
>     at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
>     at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
>     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>     at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>     at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>     at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.Server.handle(Server.java:531)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
>     at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
>     at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.awt.image.RasterFormatException: (y + height) is outside raster
>     at sun.awt.image.IntegerInterleavedRaster.createWritableChild(IntegerInterleavedRaster.java:470)
>     at sun.awt.image.IntegerInterleavedRaster.createChild(IntegerInterleavedRaster.java:514)
>     at sun.java2d.pipe.GeneralCompositePipe.renderPathTile(GeneralCompositePipe.java:106)
>     at sun.java2d.pipe.AAShapePipe.renderTiles(AAShapePipe.java:201)
>     at sun.java2d.pipe.AAShapePipe.renderPath(AAShapePipe.java:159)
>     at sun.java2d.pipe.AAShapePipe.fill(AAShapePipe.java:68)
>     at sun.java2d.pipe.PixelToParallelogramConverter.fill(PixelToParallelogramConverter.java:164)
>     at sun.java2d.pipe.ValidatePipe.fill(ValidatePipe.java:160)
>     at sun.java2d.SunGraphics2D.fill(SunGraphics2D.java:2527)
>     at org.apache.pdfbox.rendering.GroupGraphics.fill(GroupGraphics.java:418)
>     at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:759)
>     at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroRule.process(FillNonZeroRule.java:36)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>     at org.apache.pdfbox.rendering.PageDrawer.access$1800(PageDrawer.java:112)
>     at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1641)
>     at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1484)
>     at org.apache.pdfbox.rendering.PageDrawer.showTransparencyGroup(PageDrawer.java:1425)
>     at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:66)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:254)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:245)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:329)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
>     at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 42 more
>
> INFO  tika (application/pdf)
>
> WARN  No Unicode mapping for arrowhookright (45) in font LSUPIB+CMMI10
>
>
>
> On Tue, Dec 4, 2018 at 3:36 PM Bisonti Mario <Mario.Bisonti@vimar.com>
> wrote:
>
>
>
> To work around the timeout problem, I start my Tika Server with these
> additional options:
>
> -spawnChild -taskTimeoutMillis 1000000
>
>
>
> Mario
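
For reference, the options Mario mentions can be sketched as a launch command. This is only a minimal sketch: the jar file name `tika-server-1.19.1.jar` and the helper name `build_tika_server_command` are assumptions for illustration and do not come from the thread; the two flags and the timeout value are the ones quoted above.

```python
import subprocess


def build_tika_server_command(jar_path, task_timeout_millis=1_000_000):
    """Return the argument list for starting Tika Server in spawn-child mode.

    jar_path is a hypothetical location of the Tika Server jar.
    task_timeout_millis is the value passed to -taskTimeoutMillis.
    """
    if task_timeout_millis <= 0:
        raise ValueError("task_timeout_millis must be positive")
    return [
        "java", "-jar", jar_path,
        "-spawnChild",
        "-taskTimeoutMillis", str(task_timeout_millis),
    ]


if __name__ == "__main__":
    # Assumed jar name; adjust to the file actually downloaded.
    command = build_tika_server_command("tika-server-1.19.1.jar")
    print(" ".join(command))
    # Uncomment to actually start the server (blocks until it exits):
    # subprocess.run(command, check=True)
```

Building the argument list separately from running it makes it easy to log or inspect the exact command before starting the server.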
>
>
>
>
>
> *From:* Furkan KAMACI <furkankamaci@gmail.com>
> *Sent:* Tuesday, December 4, 2018 10:16
> *To:* user@manifoldcf.apache.org; Rafa Haro <rharo@apache.org>
> *Subject:* Re: External Tika Server
>
>
>
> Hi Rafa,
>
>
>
> I can parse the same document via the HTTP URL of Tika Server. I thought
> there might be a timeout parameter within ManifoldCF for communicating with
> Tika Server :)
>
>
>
> Kind Regards,
>
> Furkan KAMACI
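
The direct HTTP path that works here can be reproduced roughly as follows. This is a hedged sketch, not what ManifoldCF does internally: it assumes Tika Server listens on its default address `http://localhost:9998` and uses the `/tika` text endpoint; the helper names, the file name, and the 600-second client-side timeout are hypothetical.

```python
import urllib.request


def build_tika_put_request(base_url, payload, endpoint="/tika",
                           content_type="application/pdf"):
    """Build (but do not send) a PUT request carrying the document bytes."""
    url = base_url.rstrip("/") + endpoint
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Content-Type": content_type, "Accept": "text/plain"},
    )


def extract_text(base_url, path, timeout_seconds=600):
    """Send a local file to Tika Server and return the extracted text.

    timeout_seconds is the client-side wait; OCR on long PDFs can be slow,
    so a short client timeout may fail even if the server would succeed.
    """
    with open(path, "rb") as handle:
        payload = handle.read()
    request = build_tika_put_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Assumed server address and file name, for illustration only.
    print(extract_text("http://localhost:9998", "sample.pdf")[:500])
```

Running the same document through this path and through ManifoldCF helps separate a server-side OCR timeout from a client-side one.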
>
>
>
> On Tue, Dec 4, 2018 at 12:13 PM Rafa Haro <rharo@apache.org> wrote:
>
> Hi Furkan,
>
>
>
> You seem to be getting a timeout from Tesseract. This can happen with
> large documents (too many pages). Maybe there is a configuration parameter
> for increasing timeouts that you can use on the Tika side.
>
>
>
> Rafa
>
>
>
> On Tue, Dec 4, 2018 at 9:58 AM Furkan KAMACI <furkankamaci@gmail.com>
> wrote:
>
> Hi,
>
>
>
> I am trying to test the external OCR capabilities of Tika Server with
> ManifoldCF 2.11. Documents are parsed when I curl them into Tika Server
> directly. However, when I try to parse them via ManifoldCF, I get this error
> for *most* of the documents (not all of them):
>
>
>
> INFO  meta (application/msword)
> WARN  meta: Text extraction failed
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
>     at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
>     at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
>     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>     at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>     at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>     at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.Server.handle(Server.java:531)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
>     at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
>     at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
>     at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>     ... 44 more
> Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
>     at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
>     ... 50 more
> Caused by: java.util.concurrent.TimeoutException
>     at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
>     ... 55 more
>
>
>
> How can I solve it?
>
>
>
> Kind Regards,
>
> Furkan KAMACI
>
>
