lucene-java-user mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: Analyzer for supporting hyphenated words
Date Wed, 22 Jul 2015 09:04:54 GMT
To start with, Diego: how do you currently parse the user query to build the
Lucene queries?
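
The parsing step matters because the quotes can be detected there, before any analyzer sees the text. Below is a minimal plain-Java sketch of that idea, not Lucene code: a quoted query is compared against the whole stored value (as an untokenized field would behave, for instance one analyzed with KeywordAnalyzer via PerFieldAnalyzerWrapper), and an unquoted query falls back to term matching. The class name `QuotedQueryRouterSketch` and its helpers are hypothetical, and splitting on `-` is only an assumed stand-in for the real analysis chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of routing quoted queries to an exact match (not Lucene code). */
final class QuotedQueryRouterSketch {

    // Sample values taken from the example in this thread.
    static final List<String> INDEXED = Arrays.asList(
        "FD-A320-REC-SIM-1",
        "FD-A320-REC-SIM-10",
        "FD-A320-REC-SIM-11",
        "MIA-FD-A320-REC-SIM-1",
        "SIN-FD-A320-REC-SIM-1");

    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // True when the user wrapped the whole query in double quotes.
    static boolean isQuoted(String query) {
        return query.length() >= 2 && query.startsWith("\"") && query.endsWith("\"");
    }

    // Assumed stand-in for the analyzed field: the original value plus its hyphen-separated parts.
    static List<String> terms(String value) {
        String v = normalize(value);
        List<String> terms = new ArrayList<>();
        terms.add(v);
        terms.addAll(Arrays.asList(v.split("-")));
        return terms;
    }

    static List<String> search(String userQuery) {
        String query = userQuery.trim();
        List<String> hits = new ArrayList<>();
        if (isQuoted(query)) {
            // Quoted: compare against the whole value, as an untokenized field would.
            String exact = normalize(query.substring(1, query.length() - 1));
            for (String value : INDEXED) {
                if (normalize(value).equals(exact)) {
                    hits.add(value);
                }
            }
        } else {
            // Unquoted: any shared term counts as a hit (OR semantics).
            List<String> queryTerms = terms(query);
            for (String value : INDEXED) {
                List<String> docTerms = terms(value);
                for (String term : queryTerms) {
                    if (docTerms.contains(term)) {
                        hits.add(value);
                        break;
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("\"FD-A320-REC-SIM-1\""));
        System.out.println(search("mia"));
    }
}
```

In a real index the same split would probably mean two fields holding the same value, one analyzed and one not, with the application choosing the field based on whether the user typed quotes.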

Cheers

2015-07-22 8:35 GMT+01:00 Diego Socaceti <socaceti@gmail.com>:

> Hi Alessandro,
>
> Yes, I want the user to be able to surround the query with double quotes to
> run a phrase query in which the quoted text is NOT tokenized.
>
> What do I have to do?
>
> Thanks and Kind regards
>
> On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti <
> benedetti.alex85@gmail.com> wrote:
>
> > Hey Jack, reading the docs:
> >
> > "Set to true if phrase queries will be automatically generated when the
> > analyzer returns more than one term from whitespace delimited text. NOTE:
> > this behavior may not be suitable for all languages.
> >
> > Set to false if phrase queries should only be generated when surrounded by
> > double quotes."
> >
> >
> > In this user's case, I guess he is likely to use double quotes.
> >
> > The only problem he sees so far is that the phrase query still uses the
> > query-time analyser to split the quoted text into tokens.
> >
> > First we need feedback from him, but I guess he would like the phrase
> > query not to tokenise the text within the double quotes.
> >
> > In that case we should find a way.
> >
> >
> > Cheers
> >
> > 2015-07-21 13:12 GMT+01:00 Jack Krupansky <jack.krupansky@gmail.com>:
> >
> > > If you don't explicitly enable automatic phrase queries, the Lucene query
> > > parser will assume an OR operator on the sub-terms when a
> > > whitespace-delimited term analyzes into a sequence of terms.
> > >
> > > See:
> > >
> > > https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <socaceti@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'm new to Lucene and have tried to write my own analyzer to support
> > > > hyphenated words like wi-fi, jean-pierre, etc.
> > > > For our customer it is important to find the word
> > > > - wi-fi by wi, fi, wifi, wi-fi
> > > > - jean-pierre by jean, pierre, jean-pierre, jean-*
> > > >
> > > >
> > > >
> > > >
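> > > > The analyzer:
> > > > // Imports assume the Lucene 5.x package layout (lucene-analyzers-common).
> > > > import java.io.Reader;
> > > > import org.apache.lucene.analysis.Analyzer;
> > > > import org.apache.lucene.analysis.TokenStream;
> > > > import org.apache.lucene.analysis.Tokenizer;
> > > > import org.apache.lucene.analysis.charfilter.MappingCharFilter;
> > > > import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
> > > > import org.apache.lucene.analysis.core.LowerCaseFilter;
> > > > import org.apache.lucene.analysis.core.StopAnalyzer;
> > > > import org.apache.lucene.analysis.core.StopFilter;
> > > > import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> > > > import org.apache.lucene.analysis.miscellaneous.LengthFilter;
> > > > import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
> > > >
> > > > public class SupportHyphenatedWordsAnalyzer extends Analyzer {
> > > >
> > > >   protected NormalizeCharMap charConvertMap;
> > > >
> > > >   // The constructor name must match the class name.
> > > >   public SupportHyphenatedWordsAnalyzer() {
> > > >     initCharConvertMap();
> > > >   }
> > > >
> > > >   // Maps the double-quote character to the empty string.
> > > >   protected void initCharConvertMap() {
> > > >     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
> > > >     builder.add("\"", "");
> > > >     charConvertMap = builder.build();
> > > >   }
> > > >
> > > >   @Override
> > > >   protected TokenStreamComponents createComponents(final String fieldName) {
> > > >
> > > >     // Split on whitespace only, so hyphenated words reach the filter intact.
> > > >     final Tokenizer src = new WhitespaceTokenizer();
> > > >
> > > >     // Emit the original token, its parts, and the catenated word parts.
> > > >     TokenStream tok = new WordDelimiterFilter(src,
> > > >         WordDelimiterFilter.PRESERVE_ORIGINAL
> > > >             | WordDelimiterFilter.GENERATE_WORD_PARTS
> > > >             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
> > > >             | WordDelimiterFilter.CATENATE_WORDS,
> > > >         null);
> > > >     tok = new LowerCaseFilter(tok);
> > > >     tok = new LengthFilter(tok, 1, 255);
> > > >     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
> > > >
> > > >     return new TokenStreamComponents(src, tok);
> > > >   }
> > > >
> > > >   // Strips double quotes from the input before tokenization.
> > > >   @Override
> > > >   protected Reader initReader(String fieldName, Reader reader) {
> > > >     return new MappingCharFilter(charConvertMap, reader);
> > > >   }
> > > > }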
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > The analyzer seems to work except for exact phrase match queries.
> > > >
> > > > e.g. the following words are indexed
> > > >
> > > > FD-A320-REC-SIM-1
> > > > FD-A320-REC-SIM-10
> > > > FD-A320-REC-SIM-11
> > > > MIA-FD-A320-REC-SIM-1
> > > > SIN-FD-A320-REC-SIM-1
> > > >
> > > >
> > > > The (exact) query "FD-A320-REC-SIM-1" returns
> > > > FD-A320-REC-SIM-1
> > > > MIA-FD-A320-REC-SIM-1
> > > > SIN-FD-A320-REC-SIM-1
> > > >
> > > > For our customer this is wrong, because this exact phrase match
> > > > query should return only the single entry FD-A320-REC-SIM-1.
> > > >
> > > > Do you have any ideas or tips on how we should change our current
> > > > analyzer to support this requirement?
> > > >
> > > >
> > > > Thanks and Kind regards
> > > > Diego
> > > >
> > >
> >
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card - http://about.me/alessandro_benedetti
> > Blog - http://alexbenedetti.blogspot.co.uk
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>
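
To make the behaviour reported in the original question concrete, here is a small plain-Java model of phrase matching, not Lucene code. It assumes the analysis chain can be approximated by lowercasing and splitting on `-`, and it treats a phrase as a match whenever the query tokens occur as a consecutive run inside the document tokens. The class name `PhraseMatchSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Simplified model of phrase matching over hyphen-split tokens (not Lucene code). */
final class PhraseMatchSketch {

    // The five values from the example in the thread.
    static final List<String> INDEXED = Arrays.asList(
        "FD-A320-REC-SIM-1",
        "FD-A320-REC-SIM-10",
        "FD-A320-REC-SIM-11",
        "MIA-FD-A320-REC-SIM-1",
        "SIN-FD-A320-REC-SIM-1");

    // Assumption: the analyzer's effect is approximated by lowercasing and splitting on '-'.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("-"));
    }

    // A phrase matches when the query tokens appear as a consecutive run in the document tokens.
    static boolean phraseMatches(List<String> docTokens, List<String> queryTokens) {
        return Collections.indexOfSubList(docTokens, queryTokens) >= 0;
    }

    // Returns every indexed value whose tokens contain the phrase.
    static List<String> search(String phrase) {
        List<String> queryTokens = tokenize(phrase);
        List<String> hits = new ArrayList<>();
        for (String value : INDEXED) {
            if (phraseMatches(tokenize(value), queryTokens)) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("FD-A320-REC-SIM-1"));
    }
}
```

Under these assumptions the MIA- and SIN- entries match because their token sequences end with the same five tokens as the query, while the -10 and -11 entries do not, since their last token differs from `1`.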



