jackrabbit-users mailing list archives

From Miguel Prieto <jmpr...@gmail.com>
Subject Async Text Extraction
Date Wed, 31 Mar 2010 22:31:31 GMT
I'm using Jackrabbit as a repository for PDF documents and I have some
questions regarding text extraction. I'm using the repository locally, not
remotely (RMI, DAV), i.e. Model 1 as shown at
http://jackrabbit.apache.org/deployment-models.html

In http://wiki.apache.org/jackrabbit/Search you can read that: "*Text
extraction is done asynchronously in a background thread. That means
changed or added text is not available immediately...*". I've also seen the
configuration parameters, but I'd like to know a little more about how this
thread is started and which component is responsible for starting it. Can I
keep it from running (for example, during a batch upload of documents)? Can
I start it myself? Can anyone give me a hint about this?

Also, I've been getting these two warnings after uploading some PDFs. How can
I find out which documents (binary properties) were causing them? Is there a
way to handle these warnings with some sort of listener class?

*WARN * PDFStreamEngine: java.io.IOException: Error: expected hex character and not  :32 (PDFStreamEngine.java, line 529)
java.io.IOException: Error: expected hex character and not  :32
    at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:316)
    at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:138)
    at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:488)
    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:363)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:343)
    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:50)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:516)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:229)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:195)
    at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)


*WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 165)
java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
    at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1108)
    at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:573)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:235)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:195)
    at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
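
The NoClassDefFoundError in the second trace suggests the BouncyCastle provider class could not be loaded when PDFBox tried to decrypt a protected PDF. A minimal sketch for checking that from the application itself is below. The class name ClasspathProbe and its isPresent helper are hypothetical names made up for this illustration; only the BouncyCastle class name is taken from the trace. It assumes the probe runs under the same class loader as Jackrabbit, otherwise the result may not match what the extraction thread sees.

```java
// Hypothetical helper (not part of Jackrabbit, Tika or PDFBox): reports
// whether a class can be loaded by the class loader that loaded this probe.
final class ClasspathProbe {

    /** Returns true if the named class can be loaded (without initializing it). */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class missing, or present but not loadable (e.g. broken dependency).
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name copied from the stack trace above.
        String name = "org.bouncycastle.jce.provider.BouncyCastleProvider";
        System.out.println(name
                + (isPresent(name) ? " is available" : " is missing from the classpath"));
    }
}
```

If the probe reports the class as missing, adding a BouncyCastle provider JAR that matches the PDFBox version in use to the application's classpath would be the first thing to try.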


Thanks,

Miguel Prieto
