jackrabbit-users mailing list archives

From Paco Avila <monk...@gmail.com>
Subject Re: Async Text Extraction
Date Wed, 31 Mar 2010 22:38:01 GMT
AFAIK you can't, but it would be a nice improvement.

On Thu, Apr 1, 2010 at 12:31 AM, Miguel Prieto <jmpr.py@gmail.com> wrote:
> I'm using Jackrabbit as a repository for PDF documents and I have some
> questions regarding text extraction. I'm using the repository locally, not
> remotely (RMI, DAV): Model 1 as shown in
> http://jackrabbit.apache.org/deployment-models.html
>
> In http://wiki.apache.org/jackrabbit/Search you can read that: "*Text
> extraction is done asynchronously in a background thread. That means
> changed or added text is not available immediately...*". I've also seen the
> configuration parameters, but I'd like to know a little more about how
> this thread is started and what is responsible for starting it. Can I keep
> it from running (for example, during a batch upload of documents)? Can I
> start it myself? Can anyone give me a hint about this?
>
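For readers unfamiliar with the pattern the wiki quote describes, here is a minimal sketch in plain java.util.concurrent. It is not Jackrabbit's code: the class BackgroundExtractionSketch, the Attempt holder and the extractWithTimeout method are made-up names for illustration. The only link to Jackrabbit is that the second trace further down in this message shows the extraction task running on a ScheduledThreadPoolExecutor worker thread. The idea is that the caller waits a short time for the text and, if it is not ready, carries on without it while the task finishes in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BackgroundExtractionSketch {

    /** One extraction attempt: the text if it was ready in time (else null), plus the task itself. */
    static final class Attempt {
        final String textOrNull;
        final Future<String> task;

        Attempt(String textOrNull, Future<String> task) {
            this.textOrNull = textOrNull;
            this.task = task;
        }
    }

    /**
     * Runs the extractor on the pool and waits at most timeoutMillis for its result.
     * On timeout the task is NOT cancelled; it keeps running in the background.
     */
    static Attempt extractWithTimeout(ExecutorService pool,
                                      Callable<String> extractor,
                                      long timeoutMillis) throws Exception {
        Future<String> task = pool.submit(extractor);
        try {
            return new Attempt(task.get(timeoutMillis, TimeUnit.MILLISECONDS), task);
        } catch (TimeoutException e) {
            // Text is not available yet; the caller proceeds without it.
            return new Attempt(null, task);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // A fast extractor finishes within the timeout.
            Attempt fast = extractWithTimeout(pool, () -> "hello", 1000);
            // A slow extractor misses the timeout; its text arrives later.
            Attempt slow = extractWithTimeout(pool, () -> {
                Thread.sleep(300);
                return "late text";
            }, 50);

            System.out.println("fast, immediately: " + fast.textOrNull);
            System.out.println("slow, immediately: " + slow.textOrNull);
            System.out.println("slow, after waiting: " + slow.task.get());
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch, "changed or added text is not available immediately" corresponds to the null returned for the slow extractor. Whether and how Jackrabbit exposes control over its own pool is a separate question that this example does not answer.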
> Also, I've been getting these 2 warnings after uploading some PDFs. How can
> I know which documents (binary properties) were causing them? Is there a
> way I can handle these warnings with some sort of listener class?
>
> *WARN * PDFStreamEngine: java.io.IOException: Error: expected hex character and not  :32 (PDFStreamEngine.java, line 529)
> java.io.IOException: Error: expected hex character and not  :32
>    at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:316)
>    at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:138)
>    at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:488)
>    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:363)
>    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:343)
>    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:50)
>    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:516)
>    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:229)
>    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
>    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
>    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
>    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
>    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
>    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:195)
>    at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
>
>
> *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 165)
> java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
>    at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1108)
>    at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:573)
>    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:235)
>    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
>    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:195)
>    at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
>    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
>    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>    at java.lang.Thread.run(Thread.java:619)
>
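The second trace fails inside PDDocument.decrypt because the class org.bouncycastle.jce.provider.BouncyCastleProvider could not be loaded, which suggests that the BouncyCastle jar was missing from the classpath when an encrypted PDF was parsed. A rough sketch of two checks that might help narrow this down before uploading follows. The class PdfWarningDiagnostics and both methods are made-up names for illustration, not part of Jackrabbit, Tika or PDFBox, and the "/Encrypt" scan is only a heuristic on the raw bytes that can give false positives.

```java
import java.nio.charset.StandardCharsets;

public class PdfWarningDiagnostics {

    /** True if the named class can be loaded from this class's class loader (without initializing it). */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, PdfWarningDiagnostics.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /**
     * Heuristic only: reports whether the raw PDF bytes contain the name "/Encrypt",
     * the key that refers to a PDF's encryption dictionary.
     * ISO-8859-1 is used so that every byte maps to exactly one char.
     */
    static boolean looksEncrypted(byte[] pdfBytes) {
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) {
        // Class name copied from the NoClassDefFoundError in the trace above.
        String bouncyCastle = "org.bouncycastle.jce.provider.BouncyCastleProvider";
        System.out.println("BouncyCastle on classpath: " + isClassAvailable(bouncyCastle));

        // Tiny hand-written trailer fragment, used here instead of a real file.
        byte[] sample = "trailer << /Root 1 0 R /Encrypt 5 0 R >>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("sample looks encrypted: " + looksEncrypted(sample));
    }
}
```

Running looksEncrypted over each file in a batch before upload would give a list of candidates for the second warning. It does nothing for the first warning, which comes from a malformed font CMap rather than from encryption.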
>
> Thanks,
>
> Miguel Prieto
>



-- 
OpenKM
http://www.openkm.com
http://www.guia-ubuntu.org
