From: Nick Wellnhofer
Date: Thu, 17 Jan 2013 14:48:22 +0100
To: user@lucy.apache.org
CC: Nikola Tulechki
Message-ID: <50F80126.30803@aevum.de>
Subject: Re: [lucy-user] Snowball stemmer stoplists

On 17/01/2013 10:21, Nikola Tulechki wrote:
> Hello
>
> I am using lucy on a technical documentation and I have a bunch of
> acronyms that must not be stemmed.
> Is there a way to add a stoplist to the stemmer so it skips some terms?

It can be done, but it's not trivial and probably not very performant.

First, you have to write your own Analyzer class in Perl. See the
following threads for some guidance:

http://mail-archives.apache.org/mod_mbox/lucy-user/201111.mbox/%3C4EC161D0.1060103@aevum.de%3E
http://mail-archives.apache.org/mod_mbox/lucy-user/201207.mbox/%3C4FF1A2B7.7060207@easyconnect.no%3E

We really need a cookbook entry describing how to write custom
analyzers. But to get started, here is some minimal skeleton code that I
have used in the past:

    package My::Custom::Analyzer;
    use strict;
    use warnings;
    use base qw(Lucy::Analysis::Analyzer);

    sub new {
        my ($class, %args) = @_;
        my $self = $class->SUPER::new(%args);
        # Setup your analyzer here
        return $self;
    }

    # Called with an Inversion (a stream of tokens); rewrites the text
    # of each token in place.
    sub transform {
        my ($self, $inversion) = @_;

        while (my $token = $inversion->next) {
            my $text = $token->get_text;
            # Transform $text here
            $token->set_text($text);
        }

        # Rewind so that the next analyzer starts at the first token.
        $inversion->reset;

        return $inversion;
    }

    # Placeholder: treats every analyzer as equal to this one.
    sub equals { return 1; }

    1;

For a proper implementation, you should also provide "dump" and "load"
methods and a real "equals" method, but they're not really necessary for
a one-off job. Just remember to reindex whenever you change the
parameters of your custom analyzer: without "dump", "load" and "equals",
you won't get an error message in this case.

Your custom analyzer should then stem the words that are not in your
stoplist one by one using the (undocumented) "split" method.
So the "transform text" part of your analyzer would look like:

    if (!$stoplist->{$text}) {
        my $tokens = $stemmer->split($text);
        $text = $tokens->[0];
    }

Also note that you have to store member variables like "stoplist" and
"stemmer" of your analyzer class using the "inside-out" approach (one
global hash per variable). You'll find some example code showing how to
do that in the threads I mentioned above.

Nick
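
The inside-out approach mentioned above could be sketched roughly as
follows. This is only an illustration, not code from the linked threads.
To run without Lucy installed it uses two stand-ins that are assumptions
made here: Stub::Analyzer in place of Lucy::Analysis::Analyzer, and
Toy::Stemmer (which just strips a trailing "s") in place of the Snowball
stemmer. The helper name stem_unless_listed is likewise hypothetical. The
part that matters is the pair of file-scoped hashes keyed on $$self and
the cleanup in DESTROY.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for Lucy::Analysis::Analyzer (assumption for this sketch).
# As with Lucy objects, an instance is a blessed scalar reference, so
# $$self gives a value unique to the object (a simple counter here).
package Stub::Analyzer;
my $next_id = 0;
sub new     { my ($class) = @_; my $id = ++$next_id; return bless \$id, $class; }
sub DESTROY { }

# Toy stand-in for the stemmer (NOT Snowball): strips one trailing "s"
# and, like the "split" call in the mail, returns an array ref.
package Toy::Stemmer;
sub new   { return bless {}, shift }
sub split { my ($self, $text) = @_; (my $stem = $text) =~ s/s\z//; return [$stem]; }

package My::Custom::Analyzer;
our @ISA = ('Stub::Analyzer');   # with Lucy: use base qw(Lucy::Analysis::Analyzer);

# Inside-out storage: one file-scoped hash per member variable.
my %stoplist;
my %stemmer;

sub new {
    my ($class, %args) = @_;
    # Remove our own arguments before calling the parent constructor.
    my $stoplist = delete($args{stoplist}) || {};
    my $stemmer  = delete($args{stemmer});
    my $self     = $class->SUPER::new(%args);
    $stoplist{$$self} = $stoplist;
    $stemmer{$$self}  = $stemmer;
    return $self;
}

# Hypothetical helper holding the per-token logic of "transform".
sub stem_unless_listed {
    my ($self, $text) = @_;
    return $text if $stoplist{$$self}{$text};        # listed: leave as-is
    return $stemmer{$$self}->split($text)->[0];      # otherwise stem
}

# Remove the per-object entries, otherwise the hashes grow forever.
sub DESTROY {
    my $self = shift;
    delete $stoplist{$$self};
    delete $stemmer{$$self};
    $self->SUPER::DESTROY;
}

package main;

my $analyzer = My::Custom::Analyzer->new(
    stoplist => { NASA => 1, DNS => 1 },
    stemmer  => Toy::Stemmer->new,
);
print join(' ', map { $analyzer->stem_unless_listed($_) }
                    qw(DNS servers NASA rockets)), "\n";
# prints "DNS server NASA rocket"
```

In a real analyzer the body of stem_unless_listed would sit inside the
while loop of "transform", between get_text and set_text.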