From: Paco Avila
Date: Wed, 25 Nov 2009 18:00:04 +0100
Subject: Re: detect a failed text extraction?
To: dev@jackrabbit.apache.org

On Wed, Nov 25, 2009 at 2:26 PM, Jukka Zitting wrote:
> Hi,
>
> On Tue, Nov 24, 2009 at 8:53 PM, Paco Avila wrote:
>> Is there any way to detect a failed text extraction? I know, I can
>> see the log but the failure is not associated to a file or path.
>> [...]
>> I have posted this question in the user list, but I think it is
>> interesting talking about how it can be achieved.
>
> Could we solve this by improving the level of logging in the indexer?
>
> Alternatively, if you don't have easy access to the log files, we
> could possibly inject some special unique term to the index as a
> marker of failed text extraction. That way you could query for all
> nodes for which text extraction failed.

Increasing the log level can be a good approach: the objective is to link a
failed text extraction with a node path. That way, I can see whether a
submitted document has failed in the text extraction process.

The other approach (injecting a special term) is also very neat, because I
can get a list of documents whose indexing failed from an XPath query.
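To make the two ideas concrete, here is a minimal sketch in plain Java. It is not Jackrabbit code: the class name, the marker term and the helper methods are all hypothetical, and the XPath statement assumes the marker would be indexed as ordinary full-text content of an nt:resource node, which the indexer may or may not do.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch only, not part of Jackrabbit. All names here are hypothetical. */
class ExtractionMarkerSketch {
    private static final Logger LOG =
            Logger.getLogger(ExtractionMarkerSketch.class.getName());

    /** Hypothetical marker term; it should never occur in real document text. */
    static final String FAILED_MARKER = "textextractionfailedmarker";

    /**
     * Runs the given extractor. On failure, logs the node path together with
     * the cause and returns the marker term instead of the extracted text.
     */
    static String extractOrMark(String nodePath, Callable<String> extractor) {
        try {
            return extractor.call();
        } catch (Exception e) {
            // Tie the failure to the node path so it can be found in the log.
            LOG.log(Level.WARNING, "Text extraction failed for node " + nodePath, e);
            return FAILED_MARKER;
        }
    }

    /** Builds an XPath statement that searches for the marker term. */
    static String failedNodesQuery() {
        return "//element(*, nt:resource)[jcr:contains(., '" + FAILED_MARKER + "')]";
    }

    public static void main(String[] args) {
        // A successful extraction returns the text unchanged.
        System.out.println(extractOrMark("/docs/ok.txt", () -> "some text"));
        // A failing extraction is logged with its path and replaced by the marker.
        System.out.println(extractOrMark("/docs/broken.pdf", () -> {
            throw new IllegalStateException("corrupt file");
        }));
        System.out.println(failedNodesQuery());
    }
}
```

With something like this in place, the log entry carries the node path, and the statement returned by failedNodesQuery() could be passed to the JCR QueryManager as an XPath query to list the affected nodes.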
Both solutions can be combined to improve the Jackrabbit experience: the
XPath query gives a list of unindexed documents, and the log can help to
find out what failed in the text extraction.

> Finally, as a debugging tool we could add a feature to the Jackrabbit
> webapp that allows you to download the extracted text content of a
> binary instead of the binary itself. We'd simply run a new text
> extraction pass on the stored binary and return the extracted text or
> any encountered errors to the client.

This could also be interesting.

> BR,
>
> Jukka Zitting

-- 
Paco Avila
OpenKM
http://www.openkm.com
http://www.guia-ubuntu.org