lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Question on the increase in the index space for larger indexes
Date Tue, 06 Sep 2011 17:49:38 GMT
You can try an optimize on the index, but if you haven't deleted (or updated)
any docs I wouldn't actually expect that to help much.

A reasonable question is what's actually in your index. This page will
tell you what sorts of data are in what extensions:
http://lucene.apache.org/java/3_0_2/fileformats.html#file-names

In particular, the *.fdt files contain the stored data. If you are storing
lots of fields and, for some reason, your older data contains more text
than your more recent documents, that could account for it.
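
One way to check this is to total the bytes on disk per file extension and see
which file type dominates. The following is only a rough sketch using the plain
JDK, not anything from Lucene itself; the class and method names are made up
for illustration, and it assumes the index directory is passed as the first
argument. If the index uses the compound file format, most of the data will
show up under "cfs" and this breakdown will not tell you much.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: sums file sizes in an index directory by extension.
public class IndexSizeByExtension {

    // Returns the text after the last dot, or "(none)" for names
    // without an extension, such as "segments_2".
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "(none)";
        }
        return fileName.substring(dot + 1);
    }

    // Walks the top level of the directory (Lucene index directories are flat)
    // and accumulates the size of each regular file under its extension.
    static Map<String, Long> bytesByExtension(Path indexDir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    String ext = extensionOf(file.getFileName().toString());
                    totals.merge(ext, Files.size(file), Long::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the index path is given on the command line.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Long> e : bytesByExtension(indexDir).entrySet()) {
            System.out.printf("%-8s %,15d bytes%n", e.getKey(), e.getValue());
        }
    }
}
```

If "fdt" accounts for most of the total, the stored fields are the place to
look first.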

You can certainly split your data amongst several indexes on separate
filesystems. Solr does something similar with "shards". One thing to be
aware of is that the tf/idf calculations are local to each sub-index rather
than global, so comparing scores across sub-indexes can be tricky.
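
To make the scoring point concrete, here is a small sketch of how the idf for
the same term can differ between two sub-indexes. It assumes the classic
formula used by DefaultSimilarity in the 3.x line, idf = 1 + ln(numDocs /
(docFreq + 1)), as I recall it; the document counts are invented for
illustration, and the class name is hypothetical.

```java
// Hypothetical demo: the same term gets a different idf in each sub-index.
public class ShardIdfDemo {

    // Assumed classic idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Invented numbers: the term is rare in sub-index A, common in B.
        int numDocsA = 1_000_000, docFreqA = 10;
        int numDocsB = 1_000_000, docFreqB = 100_000;

        double idfA = idf(docFreqA, numDocsA);
        double idfB = idf(docFreqB, numDocsB);
        // What a single combined index would have computed instead.
        double idfAll = idf(docFreqA + docFreqB, numDocsA + numDocsB);

        System.out.println("idf in sub-index A:    " + idfA);
        System.out.println("idf in sub-index B:    " + idfB);
        System.out.println("idf in combined index: " + idfAll);
    }
}
```

A hit from A is weighted with a much larger idf than a hit from B for the same
term, so merging results by raw score favors A even when the documents are
equally relevant.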

Hope this helps
Erick

On Tue, Sep 6, 2011 at 11:32 AM, Saurabh Gokhale
<saurabhgokhale@gmail.com> wrote:
> Hi All,
>
> I have a question about index size growing exponentially as the index grows
> larger.
>
> I am indexing the last 2 years' worth of data. Initially the index was growing
> by 5 GB per month. At the end of the 4th month of indexing my index size
> was 20GB (I was watching index size every few minutes).
>
> Then I saw the index size start increasing exponentially, and by the end of 1
> year's worth of data processing, I was expecting the index to be 60 to 70 GB
> but the size grew to more than 120GB.
>
> 1. Is this expected behavior?
> 2. Is there any optimization process that I can perform on the index to
> reclaim size? (Currently I am only adding documents to the index, no
> deletion)
> 3. Currently my searching on the index is running really fast, so can I
> break the index into multiple indexes programmatically and store the index
> on a separate file system which would have more space? Any idea as to how
> much of a performance hit I will get if I break the index into 4 indexes, each
> for 6 months' worth of data?
>
> The reason for asking is that my initial estimate of the total
> space/size required to store the index went for a toss due to the sudden
> increase in the index size, and now I will run short of the file system space
> assigned to me.
>
> Thanks
>
> Saurabh
>
