From: Paco Avila
To: users@jackrabbit.apache.org
Subject: Re: Async Text Extraction
Date: Thu, 1 Apr 2010 00:38:01 +0200

AFAIK you can't, but it would be a nice improvement.

On Thu, Apr 1, 2010 at 12:31 AM, Miguel Prieto wrote:
> I'm using Jackrabbit as a repository for PDF documents and I have some
> questions regarding text extraction. I'm using the repository locally, not
> remotely (RMI, DAV): Model 1 as shown in
> http://jackrabbit.apache.org/deployment-models.html
>
> In http://wiki.apache.org/jackrabbit/Search you can read that: "*Text
> extraction is done asynchronously in a background thread. That means
> changed or added text is not available immediately...*". I've also seen the
> configuration parameters, but I'd like to know a little more about how
> this thread is started and who is responsible for starting it. Can I keep
> it from running (for example, when doing a batch upload of documents)? Can
> I start it? Can anyone give me a hint about this?
>
> Also, I've been getting these 2 warnings after uploading some PDFs. How can
> I know which documents (binary properties) were causing them? Is there a
> way I can handle these warnings with some sort of listener class?
>
> *WARN * PDFStreamEngine: java.io.IOException: Error: expected hex character
> and not  :32 (PDFStreamEngine.java, line 529)
> java.io.IOException: Error: expected hex character and not  :32
>     at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:316)
>     at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:138)
>     at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:488)
>     at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:363)
>     at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:343)
>     at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:50)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:516)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:229)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
>     at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:195)
>     at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
>
>
> *WARN * LazyTextExtractorField: Failed to extract text from a binary
> property (LazyTextExtractorField.java, line 165)
> java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
>     at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1108)
>     at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:573)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:235)
>     at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:195)
>     at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
>
>
> Thanks,
>
> Miguel Prieto
>

-- 
OpenKM
http://www.openkm.com
http://www.guia-ubuntu.org
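
PS: Regarding the second warning, the trace shows a NoClassDefFoundError for org.bouncycastle.jce.provider.BouncyCastleProvider raised from PDDocument.decrypt, so the PDF was probably encrypted and that class could not be loaded at runtime. A minimal sketch for checking this, assuming you can run a small class with the same classpath as the repository (ClasspathCheck and isClassAvailable are hypothetical names, not part of Jackrabbit, Tika, or PDFBox):

```java
// Hypothetical helper for illustration; not part of Jackrabbit, Tika, or PDFBox.
final class ClasspathCheck {

    // Class name taken from the NoClassDefFoundError in the second warning.
    static final String BOUNCY_CASTLE =
            "org.bouncycastle.jce.provider.BouncyCastleProvider";

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies cannot be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(BOUNCY_CASTLE + " available: "
                + isClassAvailable(BOUNCY_CASTLE));
    }
}
```

If this reports that the class is not available, adding the Bouncy Castle provider jar to the application's classpath is the likely fix. This does not say anything about which document triggered the warning, since the trace itself does not include a node path.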