lucene-java-user mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: Analyzer for supporting hyphenated words
Date Tue, 21 Jul 2015 09:10:52 GMT
Hi Diego,
let me try to help.

I find these two requirements a little confusing, because they seem to
contradict each other:

"For our customer it is important to find the word
- *wi-fi* by wi, *fi*, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*"

But:
"
The (exact) query "*FD-A320-REC-SIM-1*" returns
FD-A320-REC-SIM-1
MIA-*FD-A320-REC-SIM-1*
SIN-FD-A320-REC-SIM-1

for our customer this is wrong because this exact phrase match
query should only return the single entry FD-A320-REC-SIM-1
"

Notice that the suffix "fi" in the first example plays the same role as the
suffix "FD-A320-REC-SIM-1" in the second: in one case you want the suffix to
match, in the other you do not.
To clarify your requirement:

Do you want the user to be able to surround the query with "" so that the
phrase is NOT tokenized and is matched as a whole?
By default, a phrase query is tokenized like any other query; the difference
is that term positions also affect the matching.
If I have identified your requirement correctly, we can think about a
solution!
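
To illustrate why the tokenized phrase query also returns the prefixed
entries, here is a minimal sketch in plain Java. It is not Lucene code and not
your analyzer: I am assuming a simplified model in which a value is lowercased
and split on hyphens only, and a phrase matches when its tokens occur at
consecutive positions. The class and method names (PhraseMatchSketch,
tokenize, phraseMatches, exactMatches) are made up for the illustration, and
the preserved-original and catenated tokens your filter also emits are
ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhraseMatchSketch {

    // Simplified stand-in for the analyzer (assumption): lowercase, split on hyphens.
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("-"));
    }

    // Phrase semantics in this model: the query tokens must appear
    // at consecutive positions somewhere in the document tokens.
    public static boolean phraseMatches(String doc, String query) {
        return Collections.indexOfSubList(tokenize(doc), tokenize(query)) >= 0;
    }

    // "NOT tokenized" semantics: the whole value is compared as a single term.
    public static boolean exactMatches(String doc, String query) {
        return doc.equalsIgnoreCase(query);
    }

    // Collects the documents that match under the chosen semantics.
    public static List<String> search(List<String> docs, String query, boolean exact) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            boolean hit = exact ? exactMatches(doc, query) : phraseMatches(doc, query);
            if (hit) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "FD-A320-REC-SIM-1",
                "FD-A320-REC-SIM-10",
                "FD-A320-REC-SIM-11",
                "MIA-FD-A320-REC-SIM-1",
                "SIN-FD-A320-REC-SIM-1");
        String query = "FD-A320-REC-SIM-1";

        // Tokenized phrase: the prefixed entries match as well.
        // prints [FD-A320-REC-SIM-1, MIA-FD-A320-REC-SIM-1, SIN-FD-A320-REC-SIM-1]
        System.out.println(search(docs, query, false));

        // Untokenized comparison: only the single entry matches.
        // prints [FD-A320-REC-SIM-1]
        System.out.println(search(docs, query, true));
    }
}
```

In this model the tokens of the query are a contiguous run inside the tokens
of MIA-FD-A320-REC-SIM-1 and SIN-FD-A320-REC-SIM-1, which is consistent with
the three results you reported. The same model would also let "fi" find
"wi-fi", which is the behaviour you asked for in the first requirement.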


Cheers



2015-07-17 9:41 GMT+01:00 Diego Socaceti <socaceti@gmail.com>:

> Hi all,
>
> I'm new to Lucene and tried to write my own analyzer to support
> hyphenated words like wi-fi, jean-pierre, etc.
> For our customer it is important to find the word
> - wi-fi by wi, fi, wifi, wi-fi
> - jean-pierre by jean, pierre, jean-pierre, jean-*
>
>
>
>
> The analyzer:
> public class SupportHyphenatedWordsAnalyzer extends Analyzer {
>
>   protected NormalizeCharMap charConvertMap;
>
>   public SupportHyphenatedWordsAnalyzer() {
>     initCharConvertMap();
>   }
>
>   protected void initCharConvertMap() {
>     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>     builder.add("\"", "");
>     charConvertMap = builder.build();
>   }
>
>   @Override
>   protected TokenStreamComponents createComponents(final String fieldName) {
>
>     final Tokenizer src = new WhitespaceTokenizer();
>
>     TokenStream tok = new WordDelimiterFilter(src,
>         WordDelimiterFilter.PRESERVE_ORIGINAL
>             | WordDelimiterFilter.GENERATE_WORD_PARTS
>             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
>             | WordDelimiterFilter.CATENATE_WORDS,
>         null);
>     tok = new LowerCaseFilter(tok);
>     tok = new LengthFilter(tok, 1, 255);
>     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>
>     return new TokenStreamComponents(src, tok);
>   }
>
>   @Override
>   protected Reader initReader(String fieldName, Reader reader) {
>     return new MappingCharFilter(charConvertMap, reader);
>   }
> }
>
>
>
>
>
> The analyzer seems to work except for exact phrase match queries.
>
> e.g. the following words are indexed
>
> FD-A320-REC-SIM-1
> FD-A320-REC-SIM-10
> FD-A320-REC-SIM-11
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
>
> The (exact) query "FD-A320-REC-SIM-1" returns
> FD-A320-REC-SIM-1
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
> for our customer this is wrong because this exact phrase match
> query should only return the single entry FD-A320-REC-SIM-1
>
> Do you have any ideas or tips on how we should change our current
> analyzer to support this requirement?
>
>
> Thanks and Kind regards
> Diego
>



-- 
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
