lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Scaling Lucene to 1bln docs
Date Tue, 10 Aug 2010 08:54:51 GMT
Correction: mergeFactor determines how many segments are merged at once.

It's IndexWriter's ramBufferSizeMB and/or maxBufferedDocs that
determine how many docs are buffered in RAM before a new segment is
flushed.

A higher mergeFactor will require more RAM during merging, will cause
longer running but fewer merges, requires more open files, and allows
your index to have more segments given a certain number of indexed
docs.

Mike

On Tue, Aug 10, 2010 at 4:24 AM, anshum.gupta@naukri.com
<anshumg@gmail.com> wrote:
> Would like to know, are you using a particular type of sort? Do you need to sort on relevance?
Can you shard and restrict your search to a limited set of indexes functionally?
>
> --
> Anshum
> http://blog.anshumgupta.net
>
> Sent from BlackBerry®
>
> -----Original Message-----
> From: Shelly_Singh <Shelly_Singh@infosys.com>
> Date: Tue, 10 Aug 2010 13:31:38
> To: java-user@lucene.apache.org<java-user@lucene.apache.org>
> Reply-To: java-user@lucene.apache.org
> Subject: RE: Scaling Lucene to 1bln docs
>
> Hi Anshum,
>
> I am already running with the 'setCompoundFile' option off.
> And thanks for pointing out mergeFactor. I had tried a higher mergeFactor couple of days
ago, but got an OOM, so I discarded it. Later I figured that OOM was because maxMergeDocs
was unlimited and I was using MMap. U r rigt, I should try a higher mergeFactor.
>
> With regards to the multithreaded approach, I was considering creating 10 different threads
each indexing 100mln docs coupled with a Multisearcher to which I will feed these 10 indices.
Do you think this will improve performance.
>
> And just FYI, I have latest reading for 1 bln docs. Indexing time is 2 hrs and search
time is 15 secs.. I can live with indexing time but the search time is highly unacceptable.
>
> Help again.
>
> -----Original Message-----
> From: Anshum [mailto:anshumg@gmail.com]
> Sent: Tuesday, August 10, 2010 12:55 PM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
>
> Hi Shelly,
> That seems like a reasonable data set size. I'd suggest you increase your
> mergeFactor as a mergeFactor of 10 says, you are only buffering 10 docs in
> memory before writing it to a file (and incurring I/O). You could actually
> flush by RAM usage instead of a Doc count. Turn off using the Compound file
> structure for indexing as it generally takes more time creating a cfs index.
>
> Plus the time would not grow linearly as the larger the size of segments
> get, the more time it'd take to add more docs and merge those together
> intermittently.
> You may also use a multithreaded approach in case reading the source takes
> time in your case, though, the indexwriter would have to be shared among all
> threads.
>
> --
> Anshum Gupta
> http://ai-cafe.blogspot.com
>
>
> On Tue, Aug 10, 2010 at 12:24 PM, Shelly_Singh <Shelly_Singh@infosys.com>wrote:
>
>> Hi,
>>
>> I am developing an application which uses Lucene for indexing and searching
>> 1 bln documents. (the document size is very small though. Each document has
>> a single field of 5-10 words; so I believe that my data size is within the
>> tested limits).
>>
>> I am using the following configuration:
>> 1.      1.5 gig RAM to the jvm
>> 2.      100GB disk space.
>> 3.      Index creation tuning factors:
>> a.      mergeFactor = 10
>> b.      maxFieldLength = 10
>> c.      maxMergeDocs = 5000000 (if I try with a larger value, I get an
>> out-of-memory)
>>
>> With these settings, I am able to create an index of 100 million docs (10
>> pow 8)  in 15 mins consuming a disk space of 2.5gb. Which is quite
>> satisfactory for me, but nevertheless, I want to know what else can be done
>> to tune it further. Please help.
>> Also, with these settings, can I expect the time and size to grow linearly
>> for 1bln (10 pow 9) documents?
>>
>> Thanks and Regards,
>>
>> Shelly Singh
>> Center For KNowledge Driven Information Systems, Infosys
>> Email: shelly_singh@infosys.com<mailto:shelly_singh@infosys.com>
>> Phone: (M) 91 992 369 7200, (VoIP)2022978622
>>
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message