lucene-solr-user mailing list archives

From Webster Homer <webster.ho...@milliporesigma.com>
Subject Query kills Solr
Date Tue, 11 Dec 2018 19:24:37 GMT
Is there a way to get an approximate measure of the memory used by one or more indexed fields?
I’m looking into a problem with one of our Solr indexes: I have a Japanese query that causes the
replicas to run out of memory while processing it.
Also, is there a way to change or disable the timeout in the Solr Console? When I run this
query there it always times out, which is a real pain because I know that it will complete eventually.

I have this field type:
   <!-- Field type to support Asian languages
         Transforms Traditional Han to Simplified Han
         Transforms Hiragana to Katakana
         tokenizes languages to unigrams and bigrams for analysis and searching
     -->
    <fieldtype name="text_deep_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
     <analyzer type="index">
         <!-- remove spaces between CJK characters -->
       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])" replacement="$1"/>
         <tokenizer class="solr.ICUTokenizerFactory" />
        <!-- normalize width before bigram, as e.g. half-width dakuten combine  -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Transform Traditional Han to Simplified Han -->
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <!-- Transform Hiragana to Katakana just as was done for Endeca -->
        <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
        <filter class="solr.ICUFoldingFilterFactory"/>   <!-- NFKC, case folding, diacritics removed -->
        <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
      </analyzer>

     <analyzer type="query">
         <!-- remove spaces between CJK characters -->
       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])" replacement="$1"/>

       <tokenizer class="solr.ICUTokenizerFactory" />
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.ICUTokenizerFactory" />
        <!-- normalize width before bigram, as e.g. half-width dakuten combine  -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Transform Traditional Han to Simplified Han -->
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <!-- Transform Hiragana to Katakana just as was done for Endeca -->
        <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
        <filter class="solr.ICUFoldingFilterFactory"/>   <!-- NFKC, case folding, diacritics removed -->
        <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
      </analyzer>
    </fieldtype>
I have a number of fields of this type. The CJKBigramFilterFactory can generate a lot of tokens,
and I’m concerned that this combination is what is killing our Solr instances.
This is the query that is causing my problems:
モノクローナル抗ニコチン性アセチルコリンレセプター(??7サブユニット)抗体
マウス宿主抗体
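
To put a rough number on the token concern, here is a minimal Python sketch of what unigram-plus-bigram output looks like for a run of CJK characters. It is only a simplified model, not the Lucene implementation: it assumes the whole run is bigrammed as one unit and ignores the ICU tokenizer's segmentation, the synonym filter, and the ICU transforms. The function name is hypothetical.

```python
from typing import List


def unigrams_and_bigrams(run: str) -> List[str]:
    """Simplified model of outputUnigrams="true": every character,
    plus every pair of adjacent characters, becomes a token."""
    unigrams = list(run)
    bigrams = [run[i:i + 2] for i in range(len(run) - 1)]
    return unigrams + bigrams


# Second line of the query above (assumed to be one unbroken CJK run).
sample = "マウス宿主抗体"
tokens = unigrams_and_bigrams(sample)

# Under this model a run of n characters yields n + (n - 1) tokens.
print(len(sample), "characters ->", len(tokens), "tokens")
print(tokens)
```

Under this model the token count grows as 2n - 1 per run, before any synonym expansion, and the same expansion applies to every field of this type that the query is run against.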

We are using Solr 7.2 in SolrCloud mode.

