lucenenet-user mailing list archives

From "George Aroush" <geo...@aroush.net>
Subject RE: 40000 segments for index with 2000 documents
Date Wed, 01 Jul 2009 01:04:13 GMT
Optimization is disk bound -- it will read the whole index and write it
back.  If the 7 minutes it took to optimize your index is not acceptable, get
a faster hard drive (higher RPM, lower seek time, etc.).

Btw, 3000 documents is small, but if they *all* (or most) are being updated
every 3-5 minutes, you will run into fragmentation issues (and many segment
files), as you discovered.

-- George


-----Original Message-----
From: Dean Harding [mailto:dean.harding@dload.com.au] 
Sent: Tuesday, June 30, 2009 7:03 PM
To: lucene-net-user@incubator.apache.org
Subject: RE: 40000 segments for index with 2000 documents

> There are about 3000 documents with one field indexed that are being
> updated 3-5 times per minute.  It looks like a new segment is created
> per transaction, because right now there are about 40000 .cfs/.del
> (coupled) files, which makes 80000 files in the index, and the index size
> is about 25Mb. But after optimization (which took 7 minutes) the index
> size shrank to 350Kb.

So what's the performance like after optimization? Optimization doesn't
happen automatically in Lucene; you must do it manually. Adding a document
simply appends it to the end of the index and removing a document simply
marks it as deleted. Updating a document is a remove-then-add operation.
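
To make the append/mark-deleted behaviour concrete, here is a toy model in
Java. It is only a sketch of the idea described above, not how Lucene stores
anything: the class name ToyIndex and all of its methods are made up for
illustration, and real segments, .cfs and .del files are far more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of an append-only index; not the Lucene API. */
final class ToyIndex {
    private static final class Entry {
        final String id;
        final String value;
        boolean deleted;

        Entry(String id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Adding appends to the end; nothing already stored is rewritten. */
    void add(String id, String value) {
        entries.add(new Entry(id, value));
    }

    /** Deleting only marks matching entries; they still occupy space. */
    void delete(String id) {
        for (Entry e : entries) {
            if (e.id.equals(id)) {
                e.deleted = true;
            }
        }
    }

    /** Updating is a remove-then-add. */
    void update(String id, String value) {
        delete(id);
        add(id, value);
    }

    /** Optimizing rewrites the store, dropping entries marked as deleted. */
    void optimize() {
        entries.removeIf(e -> e.deleted);
    }

    /** Everything physically stored, including deleted entries. */
    int storedEntries() {
        return entries.size();
    }

    /** Only the entries a search would still see. */
    int liveDocuments() {
        int live = 0;
        for (Entry e : entries) {
            if (!e.deleted) {
                live++;
            }
        }
        return live;
    }
}
```

In this model, one document updated three times leaves four stored entries
but only one live document until optimize() is called, which is the same
kind of growth you are seeing between the 25Mb and 350Kb figures.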

It's only when you call Optimize() that it actually rearranges things on
disk for faster access, and that's something you should be doing on a
regular basis. Here, we do an Optimize() after every 1000 "modifications"
(add, delete, update). For a relatively small index like yours, regular
optimization shouldn't take more than a couple of seconds (it's only because
you let things go so out of hand that it took 7 minutes) and you can
continue to query the index while the optimization is happening.
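
As a rough sketch of the "optimize after every N modifications" approach (in
Java, and deliberately not tied to the Lucene API): the class name
OptimizeScheduler and its methods are hypothetical, and the Runnable you pass
in is assumed to be whatever calls Optimize() on your own writer.

```java
import java.util.Objects;

/** Hypothetical helper: runs an optimize action after every N modifications. */
final class OptimizeScheduler {
    private final int threshold;
    private final Runnable optimizeAction;
    private int pending;

    OptimizeScheduler(int threshold, Runnable optimizeAction) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        this.threshold = threshold;
        this.optimizeAction = Objects.requireNonNull(optimizeAction);
    }

    /**
     * Call once after each add, delete, or update.
     * Returns true if this call triggered the optimize action.
     */
    synchronized boolean recordModification() {
        pending++;
        if (pending < threshold) {
            return false;
        }
        pending = 0;           // reset before running so the count starts fresh
        optimizeAction.run();  // assumed to call Optimize() on the real writer
        return true;
    }

    /** Modifications recorded since the last optimize. */
    synchronized int pendingModifications() {
        return pending;
    }
}
```

With a threshold of 1000, every thousandth call to recordModification() runs
the action. The action runs while the lock is held, so other writers wait
until it finishes; whether that is acceptable depends on how long your
optimization takes.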

At least, that's always been my understanding.

Dean.


