Hi Kumaran,
WordDelimiterGraphFilter with PRESERVE_ORIGINAL should do what you want: <http://lucene.apache.org/core/6_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html>.
Here’s a test I added to TestWordDelimiterGraphFilter.java that passed for me:
-----
public void testEmail() throws Exception {
  final int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS
      | PRESERVE_ORIGINAL;
  Analyzer a = new Analyzer() {
    @Override public TokenStreamComponents createComponents(String field) {
      Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
      return new TokenStreamComponents(tokenizer, new WordDelimiterGraphFilter(tokenizer,
          flags, null));
    }
  };
  assertAnalyzesTo(a, "will.smith@yahoo.com",
      new String[] { "will.smith@yahoo.com", "will", "smith", "yahoo", "com" },
      null, null, null,
      new int[] { 1, 0, 1, 1, 1 },
      null, false);
  a.close();
}
-----
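To make the expected output in that test easier to read: the int[] { 1, 0, 1, 1, 1 } array holds position increments, so the preserved original and "will" share a position and the remaining parts each advance by one. Below is a minimal plain-Java sketch of that behavior for this one input. It is not Lucene code and not how WordDelimiterGraphFilter is implemented; the class PreserveOriginalSketch, the Tok type, and the analyze/hasTerm helpers are hypothetical names, and the splitting rule (break on any non-alphanumeric character) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class PreserveOriginalSketch {

    /** Hypothetical helper type: a term plus its position increment. */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    /**
     * Simplified model of "preserve original + word parts":
     * emit the whole input first, then each alphanumeric run.
     * Assumption: any non-alphanumeric character is a delimiter.
     */
    static List<Tok> analyze(String input) {
        List<String> parts = new ArrayList<>();
        for (String part : input.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }

        List<Tok> out = new ArrayList<>();
        out.add(new Tok(input, 1)); // the preserved original
        if (parts.size() == 1 && parts.get(0).equals(input)) {
            return out; // nothing was split, so do not emit a duplicate
        }
        for (int i = 0; i < parts.size(); i++) {
            // The first part shares the original's position (increment 0);
            // every later part advances by one position.
            out.add(new Tok(parts.get(i), i == 0 ? 0 : 1));
        }
        return out;
    }

    /** True if the given string appears as one whole term in the token list. */
    static boolean hasTerm(List<Tok> tokens, String term) {
        for (Tok t : tokens) {
            if (t.term.equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Tok> tokens = analyze("will.smith@yahoo.com");
        // prints [will.smith@yahoo.com(+1), will(+0), smith(+1), yahoo(+1), com(+1)]
        System.out.println(tokens);
        System.out.println(hasTerm(tokens, "smith"));           // prints true
        System.out.println(hasTerm(tokens, "smith@yahoo.com")); // prints false
    }
}
```

In this model "smith@yahoo.com" is not among the terms produced for will.smith@yahoo.com, while "smith" is. That only helps with your point 1 if the query side looks up the address as one unsplit term; if the query is also split into parts, the parts would still match.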
--
Steve
www.lucidworks.com
> On Jun 15, 2017, at 8:53 AM, Kumaran Ramasubramanian <kums.134@gmail.com> wrote:
>
> Hi All,
>
> I want to index email fields as both analyzed and not analyzed using a custom
> analyzer.
>
> For example:
> smith@yahoo.com
> will.smith@yahoo.com
>
> That is, I want to index smith@yahoo.com as a single token as well as its
> analyzed tokens in the same email field.
>
>
> My existing custom analyzer,
>
> public class CustomSearchAnalyzer extends StopwordAnalyzerBase
> {
>
>     public CustomSearchAnalyzer(Version matchVersion, Reader stopwords)
>             throws Exception
>     {
>         super(matchVersion, loadStopwordSet(stopwords, matchVersion));
>     }
>
>     @Override
>     protected Analyzer.TokenStreamComponents createComponents(final String fieldName,
>             final Reader reader)
>     {
>         final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
>         src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>         TokenStream tok = new ClassicFilter(src);
>         tok = new LowerCaseFilter(getVersion(), tok);
>         tok = new StopFilter(getVersion(), tok, stopwords);
>         tok = new ASCIIFoldingFilter(tok); // to enable AccentInsensitive search
>
>         return new Analyzer.TokenStreamComponents(src, tok)
>         {
>             @Override
>             protected void setReader(final Reader reader) throws IOException
>             {
>                 src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>                 super.setReader(reader);
>             }
>         };
>     }
> }
>
>
> What I want to achieve:
>
> 1. If I search using the query "smith@yahoo.com", records with
> will.smith@yahoo.com should not be returned.
> 2. I should also be able to search using the query "smith" in that field.
> 3. If possible, email values in all other fields should be detected
> and given the same type of tokenization.
>
> How can I achieve points 1 and 2 using UAX29URLEmailTokenizer? How can I add
> UAX29URLEmailTokenizer to my existing custom analyzer without using a separate
> email analyzer (per-field analyzer) for the email field, so that I can apply
> this tokenizer to email terms in all fields?
>
>
>
> -
> Kumaran R