lucy-user mailing list archives

From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-user] Cannot get the exact phrase match
Date Thu, 27 Dec 2012 12:40:39 GMT
On Dec 26, 2012, at 11:00 , Aleksandar Radovanovic <Aleksandar@Radovanovic.com> wrote:

> However, if I, for example, search for the chemistry-related phrase OF(+),
> the search returns no results. On the other hand, the quoted phrase "OF(+)"
> returns every single document containing the preposition "of". The
> highlighter clearly shows that "OF(+)" was still not found, as the
> "(+)" part was not highlighted.
> 
> Is there an easy solution, or must I analyze the user's input and decide
> what to use: IndexSearcher for non-quoted queries and
> TermQuery/PhraseQuery for quoted ones? Or must I create some special regex
> rules for words containing non-letters? There are many of these in
> the biomedical field.

You can use the RegexTokenizer to define how your documents are split into tokens:

http://lucy.apache.org/docs/perl/Lucy/Analysis/RegexTokenizer.html

To handle the use case described above, you could, for example, add parentheses and the
plus sign to the set of word characters, so your pattern would look something like
'[\w()+]+'. But this would treat parentheses as part of a token wherever they appear, which
is probably not what you want. Another approach is to split on parentheses and create
separate tokens for sequences of plus signs, resulting in a pattern like '\w+|\++'.
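
To see how the two patterns differ, here is a rough sketch that applies them with Python's
re module. It only illustrates the regular expressions themselves and is not Lucy code. The
tokenize helper and the sample sentence are made up for this example, and Perl's regex
engine (which RegexTokenizer uses) may differ from Python's in edge cases.

```python
import re

# Hypothetical helper: return every non-overlapping match of the pattern,
# which approximates how a pattern-based tokenizer emits tokens.
def tokenize(pattern, text):
    return re.findall(pattern, text)

# Pattern 1: parentheses and plus signs count as word characters.
WORD_WITH_PARENS = r"[\w()+]+"

# Pattern 2: runs of word characters, or runs of plus signs, are tokens;
# parentheses act as separators.
WORDS_OR_PLUSES = r"\w+|\++"

# Made-up sample text for illustration only.
sample = "OF(+) binds (weakly) to the receptor"

print(tokenize(WORD_WITH_PARENS, sample))
# ['OF(+)', 'binds', '(weakly)', 'to', 'the', 'receptor']

print(tokenize(WORDS_OR_PLUSES, sample))
# ['OF', '+', 'binds', 'weakly', 'to', 'the', 'receptor']
```

With the first pattern, "OF(+)" survives as a single token, but "(weakly)" also keeps its
parentheses and so would no longer match a plain "weakly". With the second, "OF(+)" becomes
the two tokens "OF" and "+", so a phrase query can require them to be adjacent.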

Nick

