lucene-solr-user mailing list archives

From Yasufumi Mizoguchi <yasufumi0...@gmail.com>
Subject Re: Creating CJK bigram tokens with ClassicTokenizer
Date Wed, 03 Oct 2018 08:57:47 GMT
Hi, Shawn

Thank you for your reply.

> CJKBigramFilter shouldn't care what tokenizer you're using.  It should
> work with any tokenizer.  What problem are you seeing that you're trying
> to solve?  What version of Solr, what configuration, and what does it do
> that you're not expecting, and what do you want it to do?

I am sorry for the lack of information. I tried this with Solr 5.5.5 and 7.5.0.
Here is the analyzer configuration from my managed-schema.

<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>


What I want to do, for both CJK and English sentences, is:
1. create CJ bigram tokens
2. keep each word that contains a hyphen and stopwords (e.g. as-is, to-be)
   as a single token

CJKBigramFilter seems to check the token type set by StandardTokenizer
(one of the StandardTokenizer.TOKEN_TYPES values) when deciding whether
to create CJK bigram tokens.
(See
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java#L64
)

ClassicTokenizer also sets a token type, but it uses the obsolete type
"<CJ>" for CJ tokens and "<ALPHANUM>" for Korean (Hangul) text, and
CJKBigramFilter does not treat either of these as a bigram target...
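
For comparison, here is a minimal sketch of a field type in which
CJKBigramFilter does receive the token types it looks for, assuming
StandardTokenizer is acceptable for the field. The name text_cjk_standard
is hypothetical, and the han, hiragana, katakana, hangul and
outputUnigrams attributes are just the filter's defaults written out.
It only addresses item 1 above; StandardTokenizer splits hyphenated
words, so it does not cover item 2.

<!-- Sketch only: "text_cjk_standard" is a hypothetical name. -->
<fieldType name="text_cjk_standard" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer sets the token types that CJKBigramFilter checks -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Attribute values below are the defaults, shown for clarity -->
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
            katakana="true" hangul="true" outputUnigrams="false"/>
  </analyzer>
</fieldType>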

Thanks,
Yasufumi

On Tue, Oct 2, 2018 at 0:05, Shawn Heisey <apache@elyograg.org> wrote:

> On 9/30/2018 10:14 PM, Yasufumi Mizoguchi wrote:
> > I am looking for the way to create CJK bigram tokens with
> > ClassicTokenizer.
> > I tried this by using CJKBigramFilter, but it only supports for
> > StandardTokenizer...
>
> CJKBigramFilter shouldn't care what tokenizer you're using.  It should
> work with any tokenizer.  What problem are you seeing that you're trying
> to solve?  What version of Solr, what configuration, and what does it do
> that you're not expecting, and what do you want it to do?
>
> I don't have access to the systems where I was using that filter, but if
> I recall correctly, I was using the whitespace tokenizer.
>
> Thanks,
> Shawn
>
>
