lucene-java-user mailing list archives

From dan sutton <danbsut...@gmail.com>
Subject Re: Large .frq file
Date Tue, 18 Jan 2011 16:10:25 GMT
Hi Shai,

What I really wanted to do was reduce the .frq file size.

Oddly, when tokenizing 3 separate fields, the WhitespaceTokenizer
produces more terms than the CJK analyzer, yet the CJK .frq file is
much larger. Examples below:

with WhitespaceTokenizer:
89M   _0.tis
1.4M  _0.tii
71    _0.fnm
5.8M  _0.fdx
741K  _0.fdt
20    segments.gen
293   segments_2
119M  _0.frq

with CJKTokenizer:
31M   _0.tis
633K  _0.tii
71    _0.fnm
5.8M  _0.fdx
741K  _0.fdt
20    segments.gen
293   segments_2
166M  _0.frq
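
Applying the check Shai suggests below to these numbers gives a quick sanity test. This is only a sketch: the sizes are copied from the two listings above, and reading K and M as powers of 1024 is an assumption about how they were reported.

```python
# Sizes copied from the two listings above, in MiB (assumes M = 2**20 bytes).
whitespace = {"tis": 89.0, "tii": 1.4, "frq": 119.0}
cjk = {"tis": 31.0, "tii": 633 / 1024, "frq": 166.0}

def ratios(a, b):
    """Return the size of each file in b relative to the same file in a."""
    return {name: b[name] / a[name] for name in a}

cjk_vs_whitespace = ratios(whitespace, cjk)
for name, r in sorted(cjk_vs_whitespace.items()):
    print(f"_0.{name}: CJK is {r:.2f}x the WhitespaceTokenizer size")
```

The term dictionary (.tis) is roughly a third of the size with CJK, while the .frq is roughly 1.4 times larger, so the two do not move in the same direction.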

Also, I believe Solr calls addDocument with payloads turned off, so I'm
not sure why the size is so much larger.
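
One possible explanation, sketched rather than measured: with frequencies and positions omitted, the .frq file holds essentially one VInt doc-ID delta per (term, document) pair, so its size should track the total number of postings rather than the number of unique terms. Overlapping bigrams can give fewer unique terms but more postings. Everything below is hypothetical illustration, not Lucene code: the toy corpus uses Latin letters as stand-ins for CJK characters, `bigram_tokens` is only a rough approximation of CJK-style tokenization, and skip data is ignored.

```python
def vint_len(n):
    """Bytes needed to store non-negative n as a VInt (7 payload bits per byte)."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size

def whitespace_tokens(text):
    """Split on whitespace only."""
    return text.split()

def bigram_tokens(text):
    """Overlapping character bigrams within each whitespace-separated run."""
    tokens = []
    for run in text.split():
        if len(run) < 2:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

def postings(docs, tokenize):
    """Map each term to the ascending list of doc IDs that contain it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(tokenize(text)):
            index.setdefault(term, []).append(doc_id)
    return index

def frq_bytes(index):
    """Estimated size: one VInt doc-ID delta per posting, no skip data."""
    total = 0
    for doc_ids in index.values():
        prev = 0
        for doc_id in doc_ids:
            total += vint_len(doc_id - prev)
            prev = doc_id
    return total

# Hypothetical toy corpus: many distinct "words" built from few characters.
docs = ["abab baba", "abba baab", "aabb bbaa", "abaa babb"]
ws = postings(docs, whitespace_tokens)
bg = postings(docs, bigram_tokens)
print("whitespace:", len(ws), "terms,", frq_bytes(ws), "estimated .frq bytes")
print("bigram:    ", len(bg), "terms,", frq_bytes(bg), "estimated .frq bytes")
```

On this toy input the bigram index has fewer unique terms but more postings, which is the same direction as the .tis and .frq sizes above.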

Cheers,
Dan

On Tue, Jan 18, 2011 at 12:41 PM, Shai Erera <serera@gmail.com> wrote:
> If I understand correctly, you compare the size of the .frq when
> WhitespaceTokenizer is used, vs the CJK ones?
>
> I'd bet this is because WhitespaceTokenizer creates far fewer terms than the
> CJK one. Whitespace tokenizes the text by separating on whitespace, while
> CJK does a sort of N-Gram tokenization, which usually leads to many more
> terms being created. This affects the .frq file in that many more posting
> lists are created, and those are stored in the .frq file.
>
> Check whether the .tii and .tis files differ, and whether that difference is
> of the same order as the .frq difference (e.g. if they are 2x larger w/ CJK,
> the .frq should be larger by about the same factor). If so, I believe this
> is the reason.
>
> Shai
>
> On Tue, Jan 18, 2011 at 2:13 PM, dan sutton <danbsutton@gmail.com> wrote:
>
>> Hi,
>>
>> We're trying to create a large index via solr for trends and notice
>> that we have a large '.frq' file after doing the following:
>>
>>
>> make all text fields indexed="true", stored="false",
>> omitTermFreqAndPositions="true", omitNorms="true", termPositions="false",
>> termOffsets="false", termVectors="false"
>>
>> We are using a variation on org.apache.lucene.analysis.cjk and notice
>> that the .frq is about 4 times larger than with, for example, the
>> WhitespaceTokenizer.
>>
>>
>> Considering that omitTermFreqAndPositions="true" is set for the text
>> fields, I'd have thought the .frq file would hold only what the file
>> format documentation describes as: "If omitTf were true it would be this
>> sequence of VInts instead:"
>> (http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies)
>>
>>
>> Can anyone suggest how I can reduce the size of this file?
>>
>>
>> Many thanks,
>> Dan
>>
>> Lucene Specification Version: 2.9.1
>> Solr Specification Version: 1.4.0.2010.09.10.17.10.36
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
