Subject: Re: How can I access to the TextExtractor result?
From: Sébastien Launay
To: users@jackrabbit.apache.org
Date: Tue, 24 Nov 2009 19:56:49 +0100

I you to get their hands dirty

2009/11/24 Paco Avila:
> Thanks, this is the expected answer :(
>
> Anyway, is there any way to detect a failed text extraction? I know
> I can check the log, but the failure is not associated with a file or path.
>
> Sometimes when I upload a document (Word, PDF, etc.) to my DMS built
> on Jackrabbit, it is not indexed. Office documents seem to be
> especially problematic due to their proprietary format. And the problem is
> that I don't know which document had problems in its text
> extraction, especially if I use extractorPoolSize > 1.
>
> Perhaps this question should be sent to the development list? I think
> this could be a very useful improvement to Jackrabbit.
>
> On Tue, Nov 24, 2009 at 5:50 PM, Jukka Zitting wrote:
>> Hi,
>>
>> On Tue, Nov 24, 2009 at 5:37 PM, Paco Avila wrote:
>>> I wonder if I can access the text produced by the TextExtractor from a
>>> document file (like a PDF, for example)
>>
>> Jackrabbit doesn't store the extracted text anywhere, it is just used
>> to add the document to the inverted Lucene index.
>>
>> You can always use the text extractor directly to get the text
>> content. Check out http://lucene.apache.org/tika/ for more details
>> about the Tika toolkit that we nowadays use for text extraction.
>>
>> BR,
>>
>> Jukka Zitting
>>
>
> --
> Paco Avila
> OpenKM
> http://www.openkm.com
> http://www.guia-ubuntu.org
>

-- 
Sébastien Launay
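
P.S. A rough sketch of how calling the extractor yourself lets you tie each failure to a path, which the log alone does not. Everything named here (ExtractionAudit, Extractor, Result, extractAll) is made up for illustration and is not Jackrabbit or Tika API; the Extractor interface is only a stand-in for whatever extraction call you end up using.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: runs an extractor per document and records which paths failed. */
final class ExtractionAudit {

    /** Stand-in for the real extraction call (an assumption, not a library interface). */
    interface Extractor {
        String extract(byte[] content) throws Exception;
    }

    /** Outcome of a batch: extracted text by path, plus an error message per failed path. */
    static final class Result {
        final Map<String, String> textByPath = new LinkedHashMap<>();
        final Map<String, String> errorByPath = new LinkedHashMap<>();
    }

    static Result extractAll(Map<String, byte[]> documents, Extractor extractor) {
        Result result = new Result();
        for (Map.Entry<String, byte[]> doc : documents.entrySet()) {
            try {
                result.textByPath.put(doc.getKey(), extractor.extract(doc.getValue()));
            } catch (Exception e) {
                // Keep the path next to the error so the failing document can be identified.
                result.errorByPath.put(doc.getKey(), String.valueOf(e.getMessage()));
            }
        }
        return result;
    }
}
```

Since the path is recorded in the same place the exception is caught, this should keep working if the extractions are later spread over several threads, as long as each thread writes to its own Result or the maps are replaced by concurrent ones.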