lucene-java-user mailing list archives

From Diego Socaceti <socac...@gmail.com>
Subject Re: Analyzer for supporting hyphenated words
Date Wed, 22 Jul 2015 07:35:29 GMT
Hi Alessandro,

Yes, I want the user to be able to surround the query with double quotes to
run it as a phrase query in which the phrase is NOT tokenized.

What do I have to do?

Thanks and kind regards

On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti <
benedetti.alex85@gmail.com> wrote:

> Hey Jack, reading the doc:
>
> "Set to true if phrase queries will be automatically generated when the
> analyzer returns more than one term from whitespace delimited text. NOTE:
> this behavior may not be suitable for all languages.
>
> Set to false if phrase queries should only be generated when surrounded by
> double quotes."
>
>
> In this user's case, I guess he's likely to use double quotes.
>
> The only problem he sees so far is that the phrase query uses the
> query-time analyser to split the phrase into tokens.
>
> First we need feedback from him, but I guess he would like the phrase
> query not to tokenise the text within the double quotes.
>
> In that case we should find a way.
>
>
> Cheers
>
> 2015-07-21 13:12 GMT+01:00 Jack Krupansky <jack.krupansky@gmail.com>:
>
> > If you don't explicitly enable automatic phrase queries, the Lucene query
> > parser will assume an OR operator on the sub-terms when a
> > whitespace-delimited term analyzes into a sequence of terms.
> >
> > See:
> >
> >
> > https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
> >
> >
> > -- Jack Krupansky
> >
> > On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <socaceti@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I'm new to Lucene and tried to write my own analyzer to support
> > > hyphenated words like wi-fi, jean-pierre, etc.
> > > For our customer it is important to find the word
> > > - wi-fi by wi, fi, wifi, wi-fi
> > > - jean-pierre by jean, pierre, jean-pierre, jean-*
> > >
> > >
> > >
> > >
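> > > The analyzer:
> > > // Imports use the Lucene 5.x package names.
> > > import java.io.Reader;
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.TokenStream;
> > > import org.apache.lucene.analysis.Tokenizer;
> > > import org.apache.lucene.analysis.charfilter.MappingCharFilter;
> > > import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
> > > import org.apache.lucene.analysis.core.LowerCaseFilter;
> > > import org.apache.lucene.analysis.core.StopAnalyzer;
> > > import org.apache.lucene.analysis.core.StopFilter;
> > > import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> > > import org.apache.lucene.analysis.miscellaneous.LengthFilter;
> > > import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
> > >
> > > public class SupportHyphenatedWordsAnalyzer extends Analyzer {
> > >
> > >   protected NormalizeCharMap charConvertMap;
> > >
> > >   // The constructor name must match the class name.
> > >   public SupportHyphenatedWordsAnalyzer() {
> > >     initCharConvertMap();
> > >   }
> > >
> > >   // Strips double quotes from the input before tokenization.
> > >   protected void initCharConvertMap() {
> > >     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
> > >     builder.add("\"", "");
> > >     charConvertMap = builder.build();
> > >   }
> > >
> > >   @Override
> > >   protected TokenStreamComponents createComponents(final String fieldName) {
> > >
> > >     final Tokenizer src = new WhitespaceTokenizer();
> > >
> > >     // Emit the original token, its parts, and the concatenated parts.
> > >     TokenStream tok = new WordDelimiterFilter(src,
> > >         WordDelimiterFilter.PRESERVE_ORIGINAL
> > >             | WordDelimiterFilter.GENERATE_WORD_PARTS
> > >             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
> > >             | WordDelimiterFilter.CATENATE_WORDS,
> > >         null);
> > >     tok = new LowerCaseFilter(tok);
> > >     tok = new LengthFilter(tok, 1, 255);
> > >     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
> > >
> > >     return new TokenStreamComponents(src, tok);
> > >   }
> > >
> > >   @Override
> > >   protected Reader initReader(String fieldName, Reader reader) {
> > >     return new MappingCharFilter(charConvertMap, reader);
> > >   }
> > > }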
> > >
> > >
> > >
> > >
> > >
> > > The analyzer seems to work except for exact phrase match queries.
> > >
> > > For example, the following values are indexed:
> > >
> > > FD-A320-REC-SIM-1
> > > FD-A320-REC-SIM-10
> > > FD-A320-REC-SIM-11
> > > MIA-FD-A320-REC-SIM-1
> > > SIN-FD-A320-REC-SIM-1
> > >
> > >
> > > The (exact) query "FD-A320-REC-SIM-1" returns:
> > > FD-A320-REC-SIM-1
> > > MIA-FD-A320-REC-SIM-1
> > > SIN-FD-A320-REC-SIM-1
> > >
> > > For our customer this is wrong, because this exact phrase match
> > > query should return only the single entry FD-A320-REC-SIM-1.
> > >
> > > Do you have any ideas or tips on how we should change our current
> > > analyzer to support this requirement?
> > >
> > >
> > > Thanks and Kind regards
> > > Diego
> > >
> >
>
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>
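
To make the behaviour discussed in the quoted thread concrete, here is a minimal sketch in plain Java, with no Lucene dependency. It is a simplified model and not Lucene's actual implementation: the class PhraseMatchSketch and all of its methods are hypothetical names, and tokenize() only lowercases and splits on hyphens, which is a rough stand-in for the WhitespaceTokenizer plus WordDelimiterFilter chain (it ignores the extra tokens produced by PRESERVE_ORIGINAL and CATENATE_WORDS). It shows why a tokenized phrase match on "FD-A320-REC-SIM-1" also hits the MIA- and SIN- entries, while a comparison against the untokenized value hits only one entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.
class PhraseMatchSketch {

    // The five values listed in the quoted message.
    static final List<String> INDEXED = Arrays.asList(
        "FD-A320-REC-SIM-1",
        "FD-A320-REC-SIM-10",
        "FD-A320-REC-SIM-11",
        "MIA-FD-A320-REC-SIM-1",
        "SIN-FD-A320-REC-SIM-1");

    // Assumption: a rough stand-in for the analyzer (lowercase, split on hyphens).
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("-"));
    }

    // Phrase semantics: the query tokens must appear as a contiguous run
    // somewhere inside the value's token list.
    static boolean phraseMatches(String query, String value) {
        return Collections.indexOfSubList(tokenize(value), tokenize(query)) >= 0;
    }

    // Untokenized semantics: the whole value must equal the whole query.
    static boolean exactMatches(String query, String value) {
        return query.equalsIgnoreCase(value);
    }

    static List<String> phraseHits(String query) {
        List<String> hits = new ArrayList<>();
        for (String value : INDEXED) {
            if (phraseMatches(query, value)) {
                hits.add(value);
            }
        }
        return hits;
    }

    static List<String> exactHits(String query) {
        List<String> hits = new ArrayList<>();
        for (String value : INDEXED) {
            if (exactMatches(query, value)) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [FD-A320-REC-SIM-1, MIA-FD-A320-REC-SIM-1, SIN-FD-A320-REC-SIM-1]
        System.out.println(phraseHits("FD-A320-REC-SIM-1"));
        // prints [FD-A320-REC-SIM-1]
        System.out.println(exactHits("FD-A320-REC-SIM-1"));
    }
}
```

In this model the entries ending in -10 and -11 are not phrase hits because their last token differs from "1", which agrees with the result set reported in the quoted message. How to get the untokenized behaviour in Lucene itself is the question this thread leaves open.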
