Mailing-List: contact user-help@lucy.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@lucy.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Message-ID: <50E023BE.6030909@Radovanovic.com>
Date: Sun, 30 Dec 2012 14:21:34 +0300
From: Aleksandar Radovanovic <Aleksandar@Radovanovic.com>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8;
 rv:17.0) Gecko/17.0 Thunderbird/17.0
MIME-Version: 1.0
To: user@lucy.apache.org
References: <50DF0A99.1060606@Radovanovic.com>
 <CAAS6=7i_UhSVCaqPwSjaRCzGmSj=JKH4D4U5HBcLJsGgZ12GKw@mail.gmail.com>
In-Reply-To: 
 <CAAS6=7i_UhSVCaqPwSjaRCzGmSj=JKH4D4U5HBcLJsGgZ12GKw@mail.gmail.com>
Content-Type: multipart/alternative;
 boundary="------------080708080902080306080204"
Subject: Re: [lucy-user] New feature suggestion

--------------080708080902080306080204
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

On 12/30/12 4:22 AM, Marvin Humphrey wrote:
> On Sat, Dec 29, 2012 at 7:22 AM, Aleksandar Radovanovic
> <Aleksandar@radovanovic.com> wrote:
>> I was wondering, would it be possible to add a new feature to the
>> indexing engine (or somehow simulate it) that will do EXACTLY the opposite
>> of Lucy::Analysis::SnowballStopFilter? In other words, instead of
>> blocking a list of stopwords, the indexing engine would index ONLY phrases
>> supplied in the user list, by exact match. Or even better, prioritize
>> them for indexing: index the user list first and then use the Lucy analyzer
>> for words that are not in the list.
>>
>> Why can this be useful? In chemistry, for example, it is simply
>> impossible to create a rule that will index chemical names correctly
>> (e.g. NH4+/H+K+/NH4+(H+), [Hg(CN)2], Ca(.-), just to name a few of
>> thousands). Also, in biomedical text some seemingly common words can,
>> for example, represent a gene or protein name which should not be
>> stemmed. To summarize, this feature would allow one to create a correct
>> index(es) of specialized terms.
> I think you could achieve this now by extracting the list of terms yourself
> prior to indexing and using a custom RegexTokenizer.
>
>     my $tokenizer = Lucy::Analysis::RegexTokenizer->new(pattern => '\\S+');
>     my $type = Lucy::Plan::FullTextType->new(analyzer => $tokenizer);
>     $schema->spec_field(name => 'chemicals', type => $type);
>
>     ...
>
>     my @chemical_names = extract_chem_names($content);
>     my $chem_content = join(' ', @chemical_names);
>     $indexer->add_doc({
>         content   => $content,
>         chemicals => $chem_content,
>         ...
>     });
>
> If the chemical names may contain whitespace, I'd suggest using "\x1F", the
> ASCII "unit separator", as a delimiter.
>
>     my $tokenizer = Lucy::Analysis::RegexTokenizer->new(
>         pattern => '[^\\x1F]+'
>     );
>
>     ...
>
>     my $chem_content = join("\x1F", @chemical_names);
>
> At search-time, you'd need to duplicate the transform and feed the content to
> an extra QueryParser.
>
>     my $main_parser = Lucy::Search::QueryParser->new(
>         schema => $searcher->get_schema,
>     );
>     my $chem_parser = Lucy::Search::QueryParser->new(
>         schema => $searcher->get_schema,
>         fields => ['chemicals'],
>     );
>     my $main_query = $main_parser->parse($query_string);
>     my $chem_query = $chem_parser->parse(
>         join(' ', extract_chem_names($query_string)),
>     );
>     my $or_query = Lucy::Search::ORQuery->new(
>         children => [$main_query, $chem_query],
>     );
>     my $hits = $searcher->hits(query => $or_query);
>     ...
>
> The tutorial documentation in Lucy::Docs::Tutorial::QueryObjects may give you
> some ideas as well.
>
> Cheers,
>
> Marvin Humphrey
>
>
Thank you Marvin, I tried what you suggested! It works fine, but my
main problem still remains: how to find and index *predefined* phrases.
In your example this boils down to the implementation of
extract_chem_names($content).
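
To make the problem concrete, here is a minimal sketch of what
extract_chem_names could look like if it bypasses the analyzer and scans
the raw text with one regex built from the predefined list. This is not
something Lucy provides; the @phrases list and the matching policy (longest
phrase first, no word-boundary check) are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical phrase list; in practice it would be loaded from the
# user-supplied file of chemical, gene, or protein names.
my @phrases = ('NH4+', '[Hg(CN)2]', 'Ca(.-)', 'H+');

# Put longer phrases first so that a long name is preferred over a shorter
# name starting at the same position, and escape regex metacharacters such
# as '+', '(' and '[' with quotemeta so each phrase matches literally.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } @phrases;
my $phrase_re = qr/($alternation)/;

# Return every listed phrase found in the text, in order of appearance.
# Assumption: no word-boundary check, so a listed phrase is also found
# when it is embedded in a longer, unlisted string.
sub extract_chem_names {
    my ($content) = @_;
    my @found = $content =~ /$phrase_re/g;
    return @found;
}

my @names = extract_chem_names('Add [Hg(CN)2] to NH4+ and wait for Ca(.-).');
print "$_\n" for @names;
```

The returned list can be joined with "\x1F" and fed to the 'chemicals'
field as in the example quoted above. The missing boundary check is the
weak point: a short entry would also match inside unrelated words.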

I was hoping to use some Lucy functionality for this: index the
whole text, search the index for the predefined phrases, and index them
separately. But this does not work correctly for biomedical documents, in
which the text often looks like a random sequence of weird characters and
strange, non-language words that Lucy simply skips or stems incorrectly.

So, the core of my idea is to have something opposite to stopwords: a
list of phrases which will be indexed without stemming, exactly as they
appear in the user-supplied list. I was wondering why such a simple and
obvious feature has not been implemented. Or am I missing something?

Alex

--------------080708080902080306080204--