From: Aleksandar Radovanovic
Date: Sun, 30 Dec 2012 14:21:34 +0300
To: user@lucy.apache.org
Subject: Re: [lucy-user] New feature suggestion
Message-ID: <50E023BE.6030909@Radovanovic.com>
References: <50DF0A99.1060606@Radovanovic.com>

On 12/30/12 4:22 AM, Marvin Humphrey wrote:
> On Sat, Dec 29, 2012 at 7:22 AM, Aleksandar Radovanovic
> wrote:
>> I was wondering, would it be possible to add a new feature to the
>> indexing engine (or somehow simulate it) that will do EXACTLY the opposite
>> of Lucy::Analysis::SnowballStopFilter? In other words, instead of
>> blocking a list of stopwords, the indexing engine would index ONLY phrases
>> supplied in the user list, matched exactly. Or even better, prioritize
>> them for indexing: index the user list first and then use the Lucy analyzer
>> for words that are not in the list.
>>
>> Why would this be useful? In chemistry, for example, it is simply
>> impossible to create a rule that will index chemical names correctly
>> (e.g. NH4+/H+K+/NH4+(H+), [Hg(CN)2], Ca(.-), just to name a few of
>> thousands). Also, in biomedical text some seemingly common words can,
>> for example, represent a gene or protein name which should not be
>> stemmed. To summarize, this feature would allow one to create a correct
>> index(es) of specialized terms.
> I think you could achieve this now by extracting the list of terms yourself
> prior to indexing and using a custom RegexTokenizer.
>
>     my $tokenizer = Lucy::Analysis::RegexTokenizer->new(pattern => '\\S+');
>     my $type = Lucy::Plan::FullTextType->new(analyzer => $tokenizer);
>     $schema->spec_field(name => 'chemicals', type => $type);
>
>     ...
>
>     my @chemical_names = extract_chem_names($content);
>     my $chem_content = join(' ', @chemical_names);
>     $indexer->add_doc({
>         content   => $content,
>         chemicals => $chem_content,
>         ...
>     });
>
> If the chemical names may contain whitespace, I'd suggest using "\x1F", the
> ASCII "unit separator", as a delimiter.
>
>     my $tokenizer = Lucy::Analysis::RegexTokenizer->new(
>         pattern => '[^\\x1F]+'
>     );
>
>     ...
>
>     my $chem_content = join("\x1F", @chemical_names);
>
> At search-time, you'd need to duplicate the transform and feed the content to
> an extra QueryParser.
>
>     my $main_parser = Lucy::Search::QueryParser->new(
>         schema => $searcher->get_schema,
>     );
>     my $chem_parser = Lucy::Search::QueryParser->new(
>         schema => $searcher->get_schema,
>         fields => ['chemicals'],
>     );
>     my $main_query = $main_parser->parse($query_string);
>     my $chem_query = $chem_parser->parse(extract_chem_names($query_string));
>     my $or_query = Lucy::Search::ORQuery->new(
>         children => [$main_query, $chem_query],
>     );
>     my $hits = $searcher->hits(query => $or_query);
>     ...
>
> The tutorial documentation in Lucy::Docs::Tutorial::QueryObjects may give you
> some ideas as well.
>
> Cheers,
>
> Marvin Humphrey
>

Thank you Marvin,

I tried what you suggested. It works fine, but my main problem remains:
how to find and index *predefined* phrases. In your example this boils down
to the implementation of extract_chem_names($content).

I was hoping to use some Lucy functionality for this: index the whole text,
search the index for the predefined phrases, and index them separately. But
this does not work correctly for biomedical documents, in which the text
often looks like a random sequence of weird characters and strange,
non-language words that Lucy simply skips or stems incorrectly.

So the core of my idea is to have something opposite to stopwords: a list of
phrases that will be indexed without the stemmer, exactly as they appear in
the user-supplied list. I was wondering why such a simple and obvious feature
has not been implemented. Or am I missing something?

Alex
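
For the extract_chem_names() step discussed above, one possible approach outside Lucy is a plain lookup against the user-supplied list. The following is only a minimal sketch under stated assumptions: make_phrase_extractor() is a hypothetical helper (not part of Lucy), phrases are matched case-sensitively and literally, longer phrases are preferred over shorter ones, and a match is rejected when it is directly adjacent to an ASCII letter or digit. That boundary rule is a guess and may need adjusting for real chemical or gene nomenclature.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Lucy API): returns a closure that extracts
# only the phrases found in a user-supplied list, exactly as written.
sub make_phrase_extractor {
    my @phrases = grep { defined && length } @_;

    # With an empty list there is nothing to match.
    return sub { () } unless @phrases;

    # Escape every phrase so characters like "(", "+" and "[" are literal.
    # Sort longest first so "[Hg(CN)2]" wins over a shorter entry like "CN".
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b } @phrases;

    # Assumption: a phrase must not touch an ASCII letter or digit on
    # either side, so "CN" does not match inside "CNS".
    my $regex = qr/(?<![A-Za-z0-9])(?:$alternation)(?![A-Za-z0-9])/;

    return sub {
        my ($text) = @_;
        # No capture groups, so list-context /g returns each whole match.
        my @found = $text =~ /$regex/g;
        return @found;
    };
}

my $extract_chem_names = make_phrase_extractor(
    'NH4+/H+K+/NH4+(H+)', '[Hg(CN)2]', 'Ca(.-)', 'CN',
);

my @names = $extract_chem_names->(
    'Binding of [Hg(CN)2] and Ca(.-), but not CNS, was observed.'
);

# Join with the ASCII unit separator, as suggested earlier in the thread,
# so the result can be fed to a RegexTokenizer with pattern '[^\\x1F]+'.
my $chem_content = join("\x1F", @names);
```

Because every phrase is escaped with quotemeta and never passed through a stemmer, the terms reach the 'chemicals' field exactly as they appear in the list. For very large lists, a single regex alternation may become slow, and a trie or Aho-Corasick matcher might be a better fit.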