lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From findbestopensource <>
Subject Re: Scaling Lucene to 1bln docs
Date Tue, 10 Aug 2010 11:46:39 GMT
Hi Shelly,

You need to reduce your maxMergeDocs. set ramBufferSizeMB to 100,
which will help you to use less RAM in indexing.

>>>search time is 15 secs..
How you are calculating this time. Just taking time difference before
and after the search method or this involves time to parse the
document object and display in the UI?

Is this your first or second search? Usefully first couple of search
takes more time as the index will be warmed. Take benchmark for first
10 -20 search and see if the time has come down.

Is your index optimized? Optimized index may take less time to search.


On Tue, Aug 10, 2010 at 1:31 PM, Shelly_Singh <> wrote:
> Hi Anshum,
> I am already running with the 'setCompoundFile' option off.
> And thanks for pointing out mergeFactor. I had tried a higher mergeFactor couple of days
ago, but got an OOM, so I discarded it. Later I figured that OOM was because maxMergeDocs
was unlimited and I was using MMap. U r rigt, I should try a higher mergeFactor.
> With regards to the multithreaded approach, I was considering creating 10 different threads
each indexing 100mln docs coupled with a Multisearcher to which I will feed these 10 indices.
Do you think this will improve performance.
> And just FYI, I have latest reading for 1 bln docs. Indexing time is 2 hrs and search
time is 15 secs.. I can live with indexing time but the search time is highly unacceptable.
> Help again.
> -----Original Message-----
> From: Anshum []
> Sent: Tuesday, August 10, 2010 12:55 PM
> To:
> Subject: Re: Scaling Lucene to 1bln docs
> Hi Shelly,
> That seems like a reasonable data set size. I'd suggest you increase your
> mergeFactor as a mergeFactor of 10 says, you are only buffering 10 docs in
> memory before writing it to a file (and incurring I/O). You could actually
> flush by RAM usage instead of a Doc count. Turn off using the Compound file
> structure for indexing as it generally takes more time creating a cfs index.
> Plus the time would not grow linearly as the larger the size of segments
> get, the more time it'd take to add more docs and merge those together
> intermittently.
> You may also use a multithreaded approach in case reading the source takes
> time in your case, though, the indexwriter would have to be shared among all
> threads.
> --
> Anshum Gupta
> On Tue, Aug 10, 2010 at 12:24 PM, Shelly_Singh <>wrote:
>> Hi,
>> I am developing an application which uses Lucene for indexing and searching
>> 1 bln documents. (the document size is very small though. Each document has
>> a single field of 5-10 words; so I believe that my data size is within the
>> tested limits).
>> I am using the following configuration:
>> 1.      1.5 gig RAM to the jvm
>> 2.      100GB disk space.
>> 3.      Index creation tuning factors:
>> a.      mergeFactor = 10
>> b.      maxFieldLength = 10
>> c.      maxMergeDocs = 5000000 (if I try with a larger value, I get an
>> out-of-memory)
>> With these settings, I am able to create an index of 100 million docs (10
>> pow 8)  in 15 mins consuming a disk space of 2.5gb. Which is quite
>> satisfactory for me, but nevertheless, I want to know what else can be done
>> to tune it further. Please help.
>> Also, with these settings, can I expect the time and size to grow linearly
>> for 1bln (10 pow 9) documents?
>> Thanks and Regards,
>> Shelly Singh
>> Center For KNowledge Driven Information Systems, Infosys
>> Email:<>
>> Phone: (M) 91 992 369 7200, (VoIP)2022978622
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message