lucene-solr-user mailing list archives

From Kuntal Ganguly <gangulykuntal1...@gmail.com>
Subject Solr Multilingual Indexing with one field- Guidance
Date Thu, 07 May 2015 18:23:49 GMT
Our current production index size is 1.5 TB with 3 shards. Currently we
have the following field type:

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3"
            maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>

This field type works well for our US and English-language clients.
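
For readers unfamiliar with this kind of chain, here is a rough Python sketch of what the index-time and query-time analyzers above would produce. CustomNGramFilterFactory is our own class, so the sketch assumes it behaves like a plain n-gram filter that emits every substring of length minGramSize to maxGramSize and keeps the original token; the function names index_terms and query_term are made up for illustration.

```python
def index_terms(value, min_gram=3, max_gram=30, preserve_original=True):
    """Approximate the index-time chain: one keyword token, lowercased, then n-grams."""
    # KeywordTokenizer keeps the whole field value as a single token.
    token = value.lower()
    terms = []
    if preserve_original:
        terms.append(token)
    # Assumption: the custom filter emits every substring of each allowed length.
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            gram = token[start:start + size]
            if gram not in terms:
                terms.append(gram)
    return terms


def query_term(value):
    """Approximate the query-time chain: one keyword token, lowercased, no n-grams."""
    return value.lower()


if __name__ == "__main__":
    print(index_terms("Solr"))
    print(query_term("OLR") in index_terms("Solr"))
```

Under these assumptions a query matches when its lowercased form equals one of the indexed grams, so a query shorter than minGramSize matches only when it equals the whole lowercased field value.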

We now have some new Chinese and Japanese clients, so I searched for the
best approach to multilingual indexing and found these:

http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/

https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search

There seem to be pros and cons associated with every approach.

I then experimented with a single-field approach, and here is my new field type:

<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3"
            maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>

I have kept the same tokenizer and changed only the filters. It is working
well for all existing searches and use cases on English documents, as well as
for the new use cases on Chinese/Japanese documents.
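
To make the first two filters in text_multi concrete, here is a small Python sketch of width folding followed by lowercasing. It uses Unicode NFKC normalization as a stand-in for CJKWidthFilterFactory, which is only an approximation: NFKC also rewrites other compatibility characters that the Solr filter leaves alone. The function names fold_width and normalize_token are made up for illustration.

```python
import unicodedata


def fold_width(text):
    """Fold fullwidth ASCII to basic Latin and halfwidth katakana to standard kana."""
    # Assumption: NFKC is used here as an approximation of CJKWidthFilterFactory.
    return unicodedata.normalize("NFKC", text)


def normalize_token(text):
    """Width folding first, then lowercasing, in the same order as the filter chain."""
    return fold_width(text).lower()


if __name__ == "__main__":
    # Fullwidth Latin letters and plain ASCII end up as the same term.
    print(normalize_token("Ｓｏｌｒ") == normalize_token("Solr"))
    # Halfwidth katakana is folded to its standard form.
    print(fold_width("ｶﾀｶﾅ"))
```

The point of the sketch is that a value typed with fullwidth letters and the same value typed in ASCII should end up as the same term before any n-grams are generated.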

I have the following questions for the Solr experts/developers:

1) Is this a correct approach, or am I missing something?

2) Can you give me an example where this new field type would cause
problems? A use case or scenario with an example would be very helpful.

3) Could this approach cause problems in the future as other clients are
added?

Please provide some guidance.
