Date: Mon, 14 Nov 2011 20:22:09 -0800
From: Marvin Humphrey
To: lucy-user@incubator.apache.org
Subject: Re: [lucy-user] Custom analyzers
Message-ID: <20111115042209.GA27084@rectangular.com>
In-Reply-To: <4EC1C342.7080401@aevum.de>
References: <4EC161D0.1060103@aevum.de> <20111114212215.GA26256@rectangular.com> <4EC1C342.7080401@aevum.de>

On Tue, Nov 15, 2011 at 02:41:22AM +0100, Nick Wellnhofer wrote:
>>> Are there any other caveats?
>>> Is there any documentation on how to write your own analyzer classes?
>>
>> The subclassing API for Analyzer was redacted prior to Lucy 0.1 in
>> anticipation of refactoring; Lucy::Analysis::Inversion and
>> Lucy::Analysis::Token are not public classes. So what you are trying to
>> do is not officially supported.
>>
>> That said, we know that we need to restore this capability. The more
>> people who are hacking on the Lucy core analysis code, the sooner we will
>> be able to do so.
>
> Are there any additional pointers for people who'd like to hack on this?

Let's see. One thing to bear in mind is that Analyzer performance from Perl
is going to be considerably slower than from C. There's a lot of string
copying that has to happen in order to make token data available in
Perl-space. That may drive you to port your Analyzer module to C eventually,
which would be an effective route to gaining familiarity with the Lucy
indexing chain, as all that core code is in C.

Aside from that... the current Analyzer codebase is stable and fine -- I just
want to find ways to speed it up. I tried to optimize it with some memory
allocation tricks a while back (allocating tokens from memory pools), but the
results were disappointing -- the changes failed to produce speedups which
justified the complexity costs.

Lucene uses a completely different Analysis model from Lucy: each Analyzer
mutates a single token at a time, whereas a Lucy Analyzer processes all tokens
at once before handing off to the next Analyzer in the chain. The Lucene
approach is less intuitive, and I'm not actually sure whether it would be
faster, but it's worth a try. That's unlikely to happen until somebody besides
me decides they really care about Analyzer speed, though.

Lastly, it would be nice to have a cookbook entry on subclassing Analyzer,
though it would have to live on the Lucy wiki for now. I'm probably in the
best position to write that up -- I might remember other pointers which would
benefit you while doing so.
However I'm starting to fall behind on dealing with all these user inquiries
and support questions of late...

> Thinking more about it, Unicode normalization would also be a nice feature
> for the Lucy analyzer.

I would think that you would run your text through Unicode normalization
prior to indexing it. But I suppose that Analyzers might produce tokens from
normalized source text which would not be normalized themselves.

> Would it make sense to have all the Unicode functionality in the Lucy
> core using a third party Unicode library? Or should we rely on the
> Unicode support of the host language like we do for case folding?

That hinges on the dependability, portability, licensing terms and
ease-of-integration for this theoretical third party Unicode library.
Dependencies are cool so long as we can bundle them, they don't take a million
years to compile, they don't sabotage all the hard work we've done to make
Lucy portable, etc. (For a longer take on dependencies, see .)

In the absence of a suitable library, we can continue to fall back on the host
language support, which will at least be reliable and portable.

Marvin Humphrey
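
For readers trying to picture the two analysis models contrasted above, here
is a rough sketch in Python (chosen only for brevity; it is neither Lucy's C
code nor Lucene's Java). Every name in it (Token, TokenizerStream,
LowercaseFilter, StopFilter, advance, analyze_batch, analyze_stream) is
hypothetical and invented for illustration; none of them is a Lucy or Lucene
API. The batch functions mimic the "process all tokens, then hand off" style,
while the stream classes mimic the "mutate one shared token at a time" style.

```python
import re

STOPWORDS = {"the", "a", "of"}  # toy stoplist, assumed for the example


# --- Batch style: each stage sees the whole token list before the next runs.
def batch_tokenize(text):
    return [m.group() for m in re.finditer(r"\w+", text)]


def batch_lowercase(tokens):
    return [t.lower() for t in tokens]


def batch_stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def analyze_batch(text):
    tokens = batch_tokenize(text)
    for stage in (batch_lowercase, batch_stop):
        tokens = stage(tokens)  # a full new list is built at every stage
    return tokens


# --- Streaming style: one Token object is shared and mutated in place.
class Token:
    def __init__(self):
        self.text = ""


class TokenizerStream:
    def __init__(self, text):
        self._matches = re.finditer(r"\w+", text)
        self.token = Token()  # the single token reused by the whole chain

    def advance(self):
        m = next(self._matches, None)
        if m is None:
            return False
        self.token.text = m.group()
        return True


class LowercaseFilter:
    def __init__(self, upstream):
        self.upstream = upstream
        self.token = upstream.token

    def advance(self):
        if not self.upstream.advance():
            return False
        self.token.text = self.token.text.lower()  # mutate in place
        return True


class StopFilter:
    def __init__(self, upstream):
        self.upstream = upstream
        self.token = upstream.token

    def advance(self):
        # Skip stopwords by pulling from upstream until a keeper appears.
        while self.upstream.advance():
            if self.token.text not in STOPWORDS:
                return True
        return False


def analyze_stream(text):
    chain = StopFilter(LowercaseFilter(TokenizerStream(text)))
    out = []
    while chain.advance():
        # Copy the text out now; the next advance() overwrites the token.
        out.append(chain.token.text)
    return out
```

Both analyze_batch("The Art of War") and analyze_stream("The Art of War")
yield the same tokens; the difference lies in how many intermediate objects
get created along the way. The sketch says nothing about which model would
actually be faster in C.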