lucene-java-user mailing list archives

From Diego Socaceti <socac...@gmail.com>
Subject Re: Analyzer for supporting hyphenated words
Date Wed, 22 Jul 2015 09:20:32 GMT
Hi Alessandro,

I guess code says more than words :)

...

public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
public static final String MULTIPLE_CHARACTER_WILDCARD = "*";

...

  if (isExactCriteriaString(userCriteria)) {
    String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
        escape(userCriteria.substring(1, userCriteria.length() - 1)));
    userCriteriaProcessed = userCriteriaEscaped;
  } else {
    userCriteriaProcessed = escape(userCriteria);

    // Check the trimmed, escaped value so trailing whitespace cannot yield "**"
    if (!userCriteriaProcessed.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
      userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
    }
  }

...

public static String escape(String s) {
  String result = s;

  if (s != null && !s.trim().isEmpty()) {
    String toEscape = s.trim();

    if (toEscape.contains("*")) {
      StringBuilder sb = new StringBuilder();

      for (int i = 0; i < toEscape.length(); i++) {
        char curChar = toEscape.charAt(i);
        if (curChar == '*')
          sb.append('*');
        else
          sb.append(QueryParser.escape(toEscape.substring(i, i + 1)));
      }

      result = sb.toString();
    } else {
      result = QueryParser.escape(toEscape);
    }
  }

  return result;
}
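
For anyone who wants to try the two snippets above without a Lucene dependency, here is a minimal self-contained sketch. Everything not shown above is an assumption: the class name CriteriaEscaper, the process() wrapper, the body of isExactCriteriaString (not shown in this mail), and escapeAll, which is only a stand-in for QueryParser.escape using the special characters I believe the classic query parser escapes. It is not the real Lucene implementation.

```java
class CriteriaEscaper {
    static final String EXACT_SEARCH_FORMAT = "\"%s\"";
    static final String MULTIPLE_CHARACTER_WILDCARD = "*";

    // Assumption: the characters treated as special by the classic query parser.
    static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Hypothetical stand-in for QueryParser.escape: backslash-escape every special character.
    static String escapeAll(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Same idea as escape() above: escape everything except '*', which stays a wildcard.
    static String escape(String s) {
        if (s == null || s.trim().isEmpty()) {
            return s;
        }
        String toEscape = s.trim();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < toEscape.length(); i++) {
            char c = toEscape.charAt(i);
            if (c == '*') {
                sb.append(c);
            } else {
                sb.append(escapeAll(String.valueOf(c)));
            }
        }
        return sb.toString();
    }

    // Hypothetical: treat criteria wrapped in double quotes as an exact search.
    static boolean isExactCriteriaString(String s) {
        return s != null && s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
    }

    // Builds the string that would be handed to the query parser.
    static String process(String userCriteria) {
        if (userCriteria == null) {
            return null;
        }
        if (isExactCriteriaString(userCriteria)) {
            String inner = userCriteria.substring(1, userCriteria.length() - 1);
            return String.format(EXACT_SEARCH_FORMAT, escape(inner));
        }
        String processed = escape(userCriteria);
        if (!processed.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
            processed += MULTIPLE_CHARACTER_WILDCARD;
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(process("FD-A320"));
        System.out.println(process("jean-*"));
        System.out.println(process("\"FD-A320-REC-SIM-1\""));
    }
}
```

In this sketch the hyphen is escaped in both branches, a user-typed * is kept as a wildcard, and a trailing * is appended only to non-quoted criteria.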

...

Thanks and Kind regards



On Wed, Jul 22, 2015 at 11:04 AM, Alessandro Benedetti <benedetti.alex85@gmail.com> wrote:

> To start, Diego: how do you currently parse the user query to build the
> Lucene queries?
>
> Cheers
>
> 2015-07-22 8:35 GMT+01:00 Diego Socaceti <socaceti@gmail.com>:
>
> > Hi Alessandro,
> >
> > Yes, I want the user to be able to surround the query with "" to run the
> > phrase query against the phrase without tokenizing it.
> >
> > What do I have to do?
> >
> > Thanks and Kind regards
> >
> > On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti <benedetti.alex85@gmail.com> wrote:
> >
> > > Hey Jack, reading the doc:
> > >
> > > "Set to true if phrase queries will be automatically generated when the
> > > analyzer returns more than one term from whitespace delimited text. NOTE:
> > > this behavior may not be suitable for all languages.
> > >
> > > Set to false if phrase queries should only be generated when surrounded by
> > > double quotes."
> > >
> > >
> > > In this user's case, I guess he is likely to use double quotes.
> > >
> > > The only problem he sees so far is that the phrase query uses the
> > > query-time analyzer to split the tokens.
> > >
> > > First we need feedback from him, but I guess he would like the phrase
> > > query not to tokenize the text within the double quotes.
> > >
> > > In that case we should find a way.
> > >
> > >
> > > Cheers
> > >
> > > 2015-07-21 13:12 GMT+01:00 Jack Krupansky <jack.krupansky@gmail.com>:
> > >
> > > > If you don't explicitly enable automatic phrase queries, the Lucene query
> > > > parser will assume an OR operator on the sub-terms when a
> > > > whitespace-delimited term analyzes into a sequence of terms.
> > > >
> > > > See:
> > > >
> > > > https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
> > > >
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <socaceti@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'm new to Lucene and tried to write my own analyzer to support
> > > > > hyphenated words like wi-fi, jean-pierre, etc.
> > > > > For our customer it is important to find the word
> > > > > - wi-fi by wi, fi, wifi, wi-fi
> > > > > - jean-pierre by jean, pierre, jean-pierre, jean-*
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > The analyzer:
> > > > > public class SupportHyphenatedWordsAnalyzer extends Analyzer {
> > > > >
> > > > >   protected NormalizeCharMap charConvertMap;
> > > > >
> > > > >   // Constructor name must match the class name
> > > > >   public SupportHyphenatedWordsAnalyzer() {
> > > > >     initCharConvertMap();
> > > > >   }
> > > > >
> > > > >   // Strips double quotes from the input before tokenization
> > > > >   protected void initCharConvertMap() {
> > > > >     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
> > > > >     builder.add("\"", "");
> > > > >     charConvertMap = builder.build();
> > > > >   }
> > > > >
> > > > >   @Override
> > > > >   protected TokenStreamComponents createComponents(final String fieldName) {
> > > > >
> > > > >     final Tokenizer src = new WhitespaceTokenizer();
> > > > >
> > > > >     // Keep the original token and also emit its word/number parts
> > > > >     TokenStream tok = new WordDelimiterFilter(src,
> > > > >         WordDelimiterFilter.PRESERVE_ORIGINAL
> > > > >             | WordDelimiterFilter.GENERATE_WORD_PARTS
> > > > >             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
> > > > >             | WordDelimiterFilter.CATENATE_WORDS,
> > > > >         null);
> > > > >     tok = new LowerCaseFilter(tok);
> > > > >     tok = new LengthFilter(tok, 1, 255);
> > > > >     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
> > > > >
> > > > >     return new TokenStreamComponents(src, tok);
> > > > >   }
> > > > >
> > > > >   @Override
> > > > >   protected Reader initReader(String fieldName, Reader reader) {
> > > > >     return new MappingCharFilter(charConvertMap, reader);
> > > > >   }
> > > > > }
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > The analyzer seems to work except for exact phrase match queries.
> > > > >
> > > > > e.g. the following words are indexed
> > > > >
> > > > > FD-A320-REC-SIM-1
> > > > > FD-A320-REC-SIM-10
> > > > > FD-A320-REC-SIM-11
> > > > > MIA-FD-A320-REC-SIM-1
> > > > > SIN-FD-A320-REC-SIM-1
> > > > >
> > > > >
> > > > > The (exact) query "FD-A320-REC-SIM-1" returns
> > > > > FD-A320-REC-SIM-1
> > > > > MIA-FD-A320-REC-SIM-1
> > > > > SIN-FD-A320-REC-SIM-1
> > > > >
> > > > > for our customer this is wrong because this exact phrase match
> > > > > query should only return the single entry FD-A320-REC-SIM-1
> > > > >
> > > > > Do you have any ideas or tips, how we have to change our current
> > > > > analyzer to support this requirement???
> > > > >
> > > > >
> > > > > Thanks and Kind regards
> > > > > Diego
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > --------------------------
> > >
> > > Benedetti Alessandro
> > > Visiting card - http://about.me/alessandro_benedetti
> > > Blog - http://alexbenedetti.blogspot.co.uk
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
>
>
>
>

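To see why the exact query in the original question also returns the MIA- and SIN- entries, here is a rough, Lucene-free model of the behaviour. It assumes the analyzer's word parts can be approximated by lowercasing and splitting on '-', and that a phrase query matches when the query's parts appear consecutively among the indexed parts. The class and method names are hypothetical, and the real WordDelimiterFilter emits more tokens (the preserved original, catenations) than this sketch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PhraseMatchModel {

    // Simplified stand-in for the analyzer: lowercase, then split on hyphens.
    static List<String> wordParts(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("-"));
    }

    // A phrase "matches" when its parts occur as a contiguous run in the indexed parts.
    static boolean phraseMatches(String indexed, String phrase) {
        return Collections.indexOfSubList(wordParts(indexed), wordParts(phrase)) >= 0;
    }

    // Returns every indexed value that the modelled phrase query would hit.
    static List<String> search(List<String> indexedValues, String phrase) {
        List<String> hits = new ArrayList<>();
        for (String value : indexedValues) {
            if (phraseMatches(value, phrase)) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList(
            "FD-A320-REC-SIM-1",
            "FD-A320-REC-SIM-10",
            "FD-A320-REC-SIM-11",
            "MIA-FD-A320-REC-SIM-1",
            "SIN-FD-A320-REC-SIM-1");
        System.out.println(search(indexed, "FD-A320-REC-SIM-1"));
    }
}
```

Under these assumptions the model reproduces the three hits reported in the question: once the value is split into parts, the phrase can no longer tell whether the indexed value began with FD or merely contained it.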