lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: EdgeNGramFilterFactory for Chinese characters
Date Mon, 26 Oct 2015 02:19:16 GMT
Hi Tomoko,

Thank you for your recommendation.

I wasn't in favour of using copyField at first to have 2 separate fields
for English and Chinese tokens, as it not only increases the index size
but also slows down both indexing and querying.

I will try to see if there is any way to manage it with only a single field.

Regards.
Edwin


On 25 October 2015 at 22:59, Tomoko Uchida <tomoko.uchida.1111@gmail.com>
wrote:

> Hi, Edwin,
>
> > This means it is better to have 2 separate fields for English and Chinese
> > words?
>
> Yes. I mean,
> 1. Define FIELD_1 that uses HMMChineseTokenizerFactory to extract English
> and Chinese tokens.
> 2. Define FIELD_2 that uses PatternTokenizerFactory to extract English
> tokens and EdgeNGramFilter to break the tokens up into sub-strings.
>     There may be other tokenizer/filter chains that can extract English
> tokens; please try them and find the best way ;)
> 3. Index the original text to FIELD_1 to search tokens as they are (for both
> English and Chinese words).
> 4. Index the original text to FIELD_2 to perform prefix matching (for English
> words).
> 5. Search FIELD_1 and FIELD_2 by using the edismax query parser, etc.
>
> You can use copyField to index the original text data to FIELD_1 and FIELD_2.
> The downside of this method is that it increases the index size, as you know.
>
> If you want to manage that *by one field*, I think you can create a custom
> token filter of your own... but that may be slightly advanced.
>
> Thanks,
> Tomoko
>
> 2015-10-25 22:48 GMT+09:00 Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>:
>
> > Hi Tomoko,
> >
> > Thank you for your reply.
> >
> > > If you need to perform partial (prefix) match for **only English words**,
> > > you can create a separate field that keeps only English words (I've never
> > > tried that, but might be possible by PatternTokenizerFactory or other
> > > tokenizer/filter chains...,) and apply EdgeNGramFilterFactory to the field.
> >
> > This means it is better to have 2 separate fields for English and Chinese
> > words?
> > Not quite sure what you mean by that.
> >
> > Regards,
> > Edwin
> >
> >
> >
> > On 25 October 2015 at 11:42, Tomoko Uchida <tomoko.uchida.1111@gmail.com>
> > wrote:
> >
> > > > I have rich-text documents that are in both English and Chinese, and
> > > > currently I have EdgeNGramFilterFactory enabled during indexing, as I need
> > > > it for partial matching for English words. But this means it will also
> > > > break up each of the Chinese characters into different tokens.
> > >
> > > EdgeNGramFilterFactory creates sub-strings (prefixes) from each token. Its
> > > behavior is independent of language.
> > > If you need to perform partial (prefix) match for **only English words**,
> > > you can create a separate field that keeps only English words (I've never
> > > tried that, but might be possible by PatternTokenizerFactory or other
> > > tokenizer/filter chains...,) and apply EdgeNGramFilterFactory to the field.
> > >
> > > Hope it helps,
> > > Tomoko
> > >
> > > 2015-10-23 13:04 GMT+09:00 Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>:
> > >
> > > > Hi,
> > > >
> > > > Would like to check, is it good to use EdgeNGramFilterFactory for indexes
> > > > that contains Chinese characters?
> > > > Will it affect the accuracy of the search for Chinese words?
> > > >
> > > > I have rich-text documents that are in both English and Chinese, and
> > > > currently I have EdgeNGramFilterFactory enabled during indexing, as I need
> > > > it for partial matching for English words. But this means it will also
> > > > break up each of the Chinese characters into different tokens.
> > > >
> > > > I'm using the HMMChineseTokenizerFactory for my tokenizer.
> > > >
> > > > Thank you.
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > >
> >
>
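
The single-field idea raised in this thread (a custom token filter that
applies edge n-grams only to English tokens) can be illustrated with a
small Python sketch of the logic. This is not Lucene or Solr code, and a
real filter would have to be written in Java against the Lucene
TokenFilter API. The function names, the default gram sizes, and the rule
that "English" means "ASCII letters only" are all assumptions made for
illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token with lengths min_gram..max_gram.

    Like an edge n-gram filter, this works on characters and knows
    nothing about the language of the token.
    """
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def is_english(token):
    """Assumption: a token counts as English if it is ASCII letters only."""
    return token.isascii() and token.isalpha()


def selective_edge_ngrams(tokens, min_gram=1, max_gram=20):
    """Expand English tokens into prefixes; pass other tokens through as-is."""
    out = []
    for token in tokens:
        if is_english(token):
            out.extend(edge_ngrams(token, min_gram, max_gram))
        else:
            out.append(token)
    return out


# Tokens as a Chinese-aware tokenizer might emit them (hypothetical input).
tokens = ["solr", "搜索", "引擎"]

# Plain edge n-grams split the Chinese words into prefixes as well.
print([g for t in tokens for g in edge_ngrams(t)])
# ['s', 'so', 'sol', 'solr', '搜', '搜索', '引', '引擎']

# The selective variant leaves the Chinese tokens intact.
print(selective_edge_ngrams(tokens))
# ['s', 'so', 'sol', 'solr', '搜索', '引擎']
```

The first output shows why the filter's behavior is independent of
language, and the second shows what a one-field approach would need to
produce. Whether this beats the two-field copyField setup on index size
or query speed would have to be measured.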
