lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Fwd: Solr dynamic field blowing up the index size
Date Tue, 21 Feb 2017 15:27:48 GMT
Did you look in the data directories to check what index file extensions
contribute most to the difference? That could give a hint.
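
For illustration, a minimal sketch of that check in Python. It walks an index directory and totals file sizes per extension; the function names and the example path are assumptions, not something from this thread, so point it at the core's actual data/index directory.

```python
import os
from collections import defaultdict


def size_by_extension(index_dir):
    """Return {extension: total_bytes} for every file under index_dir."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            # Files such as "segments_5" have no extension; group them together.
            ext = os.path.splitext(name)[1] or "(none)"
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def report(index_dir):
    """Print extensions sorted by total size, largest first."""
    totals = size_by_extension(index_dir)
    for ext, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{ext:>8}  {size / 1024 ** 2:12.1f} MB")


# Hypothetical usage; replace the path with the real index directory:
# report("server/solr/collection1/data/index")
```

Running it against both the Solr 5 and the Solr 6 index and comparing the two reports should show which file types grew. As a rough guide, .fdt/.fdx hold stored fields, .tim/.tip the term dictionary, .doc/.pos/.pay postings, .dvd/.dvm doc values, and .tvd/.tvx term vectors.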

Regards,
    Alex

On 21 Feb 2017 9:47 AM, "Pratik Patel" <pratik@semandex.net> wrote:

> Here is the same question on Stack Overflow, where the formatting is better:
>
> http://stackoverflow.com/questions/42370231/solr-dynamic-field-blowing-up-the-index-size
>
> I recently upgraded from Solr 5.0 to Solr 6.4.1. My app runs fine, but the
> index is far too large: in Solr 5 it was about 15GB, and in Solr 6, for the
> same data, it is 300GB! I cannot work out what accounts for such a huge
> difference in Solr 6.
>
> I have been able to identify the field that is blowing up the size of the
> index. It is defined as follows.
>
> <dynamicField name="*_note" type="text_general" indexed="true" stored="true" multiValued="true" />
>
> <field name="textproperty" type="text_general" indexed="true" stored="false" multiValued="true" />
> <copyField source="*_note" dest="textproperty"/>
>
> When this field is commented out, the index size drops to less than 10GB.
>
> This field is of type text_general, which is defined as follows.
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <charFilter class="solr.HTMLStripCharFilterFactory" />
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <charFilter class="solr.PatternReplaceCharFilterFactory"
>                     pattern="((?m)[a-z]+)'s" replacement="$1s" />
>         <filter class="solr.WordDelimiterFilterFactory"
>                 protected="protwords.txt" generateWordParts="1"
>                 generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>                 catenateAll="0" splitOnCaseChange="0"/>
>         <filter class="solr.KStemFilterFactory" />
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt" />
>       </analyzer>
>       <analyzer type="query">
>         <charFilter class="solr.HTMLStripCharFilterFactory" />
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <charFilter class="solr.PatternReplaceCharFilterFactory"
>                     pattern="((?m)[a-z]+)'s" replacement="$1s" />
>         <filter class="solr.WordDelimiterFilterFactory"
>                 protected="protwords.txt" generateWordParts="1"
>                 generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>                 catenateAll="0" splitOnCaseChange="0"/>
>         <filter class="solr.KStemFilterFactory" />
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt" />
>       </analyzer>
>   </fieldType>
>
> A few things I did to debug this issue:
>
>    - I have ensured that the field type definition is the same as the one I
>    was using in Solr 5, and that it is also valid in version 6. This field
>    type takes a list of stopwords to be ignored during indexing. I have
>    supplied the same list of stopwords that we were using in Solr 5. I have
>    verified that the path to this file is correct and that the file loads
>    fine in the Solr admin UI. When I analyse these fields using the
>    "Analysis" tab of the Solr admin UI, I can see that stopwords are being
>    filtered out. However, when I query with some of these stopwords, I do
>    get results back, which makes me think that the stopwords are probably
>    being indexed.
>
> Any idea what could increase the size of the index by so much in Solr 6?
>
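
One way to check whether stopwords actually reached the index, rather than inferring it from query results, is to list the indexed terms for the field. The sketch below only builds a request URL for Solr's Terms component. The host, the port, and the availability of a /terms handler on the core are assumptions (it may need to be enabled in solrconfig.xml), the core name is taken from the stopwords path quoted above, and terms_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def terms_url(base_url, core, field, prefix, limit=10):
    """Build a Terms component URL listing indexed terms that start with prefix."""
    params = {
        "terms": "true",          # enable the component explicitly
        "terms.fl": field,        # field whose indexed terms are listed
        "terms.prefix": prefix,   # only terms starting with this string
        "terms.limit": limit,     # maximum number of terms returned
        "wt": "json",
    }
    return f"{base_url}/{core}/terms?{urlencode(params)}"


# Assumed local Solr instance; "the" stands in for any word from stopwords.txt.
print(terms_url("http://localhost:8983/solr", "collection1", "textproperty", "the"))
```

If the stopword itself appears in the returned term list with a non-zero count, it was indexed; if only longer terms sharing the prefix appear, the stop filter did its job at index time.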
