lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: email field - analyzed and not analyzed in single field using custom analyzer
Date Thu, 15 Jun 2017 14:13:25 GMT
Hi Kumaran,

WordDelimiterGraphFilter with PRESERVE_ORIGINAL should do what you want: <http://lucene.apache.org/core/6_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html>.

Here’s a test I added to TestWordDelimiterGraphFilter.java that passed for me:

-----
public void testEmail() throws Exception {
  final int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | SPLIT_ON_CASE_CHANGE
      | SPLIT_ON_NUMERICS | PRESERVE_ORIGINAL;
  Analyzer a = new Analyzer() {
    @Override public TokenStreamComponents createComponents(String field) {
      Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
      return new TokenStreamComponents(tokenizer,
          new WordDelimiterGraphFilter(tokenizer, flags, null));
    }
  };
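  // Expect the preserved original token first, then its parts. The first part
  // ("will") has position increment 0, so it shares a position with the original.
  assertAnalyzesTo(a, "will.smith@yahoo.com",
      new String[] { "will.smith@yahoo.com", "will", "smith", "yahoo", "com" },
      null, null, null,             // start offsets, end offsets, types: not checked
      new int[] { 1, 0, 1, 1, 1 },  // position increments
      null, false);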
  a.close();
}
-----
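
To make the expected output of the test above easier to read, here is a rough plain-Java model of the token stream it asserts. This is only a sketch and not Lucene code: the class EmailTokenSketch and its methods are hypothetical names made up for illustration, and the model only splits on non-alphanumeric characters (it ignores the case-change and letter/digit splitting that the real flags enable). It assumes an exact-match lookup compares the query against indexed terms as-is, without analyzing the query.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the asserted token stream; this is NOT Lucene code.
class EmailTokenSketch {

    /** A term plus its position increment, mirroring the two arrays in the test. */
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    /**
     * Emits the whole input first (the "preserve original" behaviour), then each
     * run of letters/digits. The first part is stacked on the original
     * (increment 0); each later part advances by one position.
     */
    static List<Token> analyze(String input) {
        List<String> parts = new ArrayList<>();
        for (String part : input.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        List<Token> out = new ArrayList<>();
        out.add(new Token(input, 1));
        // A token with nothing to split is emitted once, not duplicated.
        if (parts.size() == 1 && parts.get(0).equals(input)) {
            return out;
        }
        for (int i = 0; i < parts.size(); i++) {
            out.add(new Token(parts.get(i), i == 0 ? 0 : 1));
        }
        return out;
    }

    /** True if the exact term is present in the stream (models an unanalyzed term lookup). */
    static boolean hasTerm(List<Token> tokens, String term) {
        for (Token t : tokens) {
            if (t.term.equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> tokens = analyze("will.smith@yahoo.com");
        // prints [will.smith@yahoo.com(+1), will(+0), smith(+1), yahoo(+1), com(+1)]
        System.out.println(tokens);
        System.out.println(hasTerm(tokens, "smith"));           // true
        System.out.println(hasTerm(tokens, "smith@yahoo.com")); // false
    }
}
```

In this model the part "smith" is present as a term, while "smith@yahoo.com" is not a term of will.smith@yahoo.com. Whether a real query behaves the same way depends on how the query text itself is analyzed.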

--
Steve
www.lucidworks.com

> On Jun 15, 2017, at 8:53 AM, Kumaran Ramasubramanian <kums.134@gmail.com> wrote:
> 
> Hi All,
> 
> I want to index email fields as both analyzed and not analyzed using a custom
> analyzer.
> 
> For example,
> smith@yahoo.com
> will.smith@yahoo.com
> 
> That is, I want to index smith@yahoo.com as a single token as well as its
> analyzed tokens in the same email field.
> 
> 
> My existing custom analyzer,
> 
> public class CustomSearchAnalyzer extends StopwordAnalyzerBase
> {
> 
>    public CustomSearchAnalyzer(Version matchVersion, Reader stopwords)
>            throws Exception
>    {
>        super(matchVersion, loadStopwordSet(stopwords, matchVersion));
>    }
> 
>    @Override
>    protected Analyzer.TokenStreamComponents createComponents(
>            final String fieldName, final Reader reader)
>    {
>        final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
>        src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>        TokenStream tok = new ClassicFilter(src);
>        tok = new LowerCaseFilter(getVersion(), tok);
>        tok = new StopFilter(getVersion(), tok, stopwords);
>        tok = new ASCIIFoldingFilter(tok); // to enable AccentInsensitive search
> 
>        return new Analyzer.TokenStreamComponents(src, tok)
>        {
>            @Override
>            protected void setReader(final Reader reader) throws IOException
>            {
>                src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>                super.setReader(reader);
>            }
>        };
>    }
> }
> 
> 
> What I want to achieve:
> 
> 1. If I search using the query "smith@yahoo.com", records with
> will.smith@yahoo.com should not be returned.
> 2. I should also be able to search using the query "smith" in that field.
> 3. If possible, email values in all other fields should be detected and
> given the same type of tokenization.
> 
> How can I achieve points 1 and 2 using UAX29URLEmailTokenizer? How can I add
> UAX29URLEmailTokenizer to my existing custom analyzer without using a separate
> email analyzer (per-field analyzer) for the email field, so that I can apply
> this tokenizer to email terms in all fields?
> 
> 
> 
> -
> Kumaran R

