lucene-java-user mailing list archives

From: Shelly_Singh <Shelly_Si...@infosys.com>
Subject: RE: Scaling Lucene to 1bln docs
Date: Tue, 10 Aug 2010 08:05:08 GMT
Hi Danil,

I get your point. In fact, the latest readings I have for 1 bln docs assert the
same thing.
Index creation time is 2 hours, which is fine by me, but search time is 15
seconds, which is too high for any application.

I am planning to shard the index and then use a MultiSearcher for searching.
Will that help?
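
Here is a minimal sketch of what I have in mind, assuming Lucene 3.x
(classes from org.apache.lucene.search and org.apache.lucene.store; the
shard paths and shard count are placeholders):

    Directory[] shards = {
        FSDirectory.open(new File("/idx/shard0")),   // placeholder paths
        FSDirectory.open(new File("/idx/shard1")) };
    Searchable[] searchables = new Searchable[shards.length];
    for (int i = 0; i < shards.length; i++)
        searchables[i] = new IndexSearcher(shards[i], true);  // read-only
    Searcher searcher = new MultiSearcher(searchables);
    TopDocs hits = searcher.search(query, 10);  // hits merged across shards

From what I read, ParallelMultiSearcher takes the same constructor arguments
and queries the shards in parallel threads instead of one after another.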

-----Original Message-----
From: Danil ŢORIN [mailto:torindan@gmail.com] 
Sent: Tuesday, August 10, 2010 1:06 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs

The problem actually won't be the indexing part.

Searching such a large dataset will require a LOT of memory.
If you need sorting or faceting on one of the fields, the JVM will explode ;)

Also, GC times on a large JVM heap are pretty disturbing (if you care
about your search performance).

So I'd advise you to split the index into shards, each in its own JVM.
This way you'll improve both indexing and search performance.
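
On the indexing side, hash-routing documents to per-shard writers is enough.
A rough sketch (the field name and routing key are made up for illustration):

    // writers[i] is an IndexWriter opened on shard i's own Directory
    void index(IndexWriter[] writers, String id, String text) throws IOException {
        int shard = (id.hashCode() & Integer.MAX_VALUE) % writers.length;
        Document doc = new Document();
        doc.add(new Field("text", text, Field.Store.NO, Field.Index.ANALYZED));
        writers[shard].addDocument(doc);
    }

Each shard's writer (and later its searcher) can then run in its own process.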

On Tue, Aug 10, 2010 at 10:25, Anshum <anshumg@gmail.com> wrote:
> Hi Shelly,
> That seems like a reasonable data set size. I'd suggest you increase your
> mergeFactor: a mergeFactor of 10 means you are only buffering 10 docs in
> memory before writing them to a file (and incurring I/O). You could instead
> flush by RAM usage rather than by doc count. Also turn off the compound file
> format for indexing, as it generally takes more time to create a .cfs index.
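>
> For example, a rough sketch using the 3.0-era IndexWriter setters (the
> numbers are guesses to tune against your data, not recommendations):
>
>     IndexWriter writer = new IndexWriter(dir, analyzer,
>         new IndexWriter.MaxFieldLength(10));
>     writer.setMergeFactor(30);         // larger => fewer merge cascades
>     writer.setRAMBufferSizeMB(256);    // flush by RAM usage, not doc count
>     writer.setUseCompoundFile(false);  // skip the extra .cfs packing pass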
>
> Also, the time would not grow linearly: the larger the segments get, the
> more time it takes to add more docs and to merge segments together
> intermittently.
> You could also use a multithreaded approach in case reading the source takes
> time in your case, though the IndexWriter would have to be shared among all
> the threads (a sketch follows below).
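>
> A rough sketch with a shared writer (the thread count is a guess, and
> IndexWriter.addDocument is safe to call from multiple threads):
>
>     final ExecutorService pool = Executors.newFixedThreadPool(4);
>     for (final String text : sourceLines) {  // however you read your source
>         pool.submit(new Runnable() {
>             public void run() {
>                 Document doc = new Document();
>                 doc.add(new Field("text", text, Field.Store.NO,
>                     Field.Index.ANALYZED));
>                 try { writer.addDocument(doc); } catch (IOException e) { /* log */ }
>             }
>         });
>     }
>     pool.shutdown();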
>
> --
> Anshum Gupta
> http://ai-cafe.blogspot.com
>
>
> On Tue, Aug 10, 2010 at 12:24 PM, Shelly_Singh <Shelly_Singh@infosys.com>wrote:
>
>> Hi,
>>
>> I am developing an application which uses Lucene for indexing and searching
>> 1 bln documents. (The document size is very small, though: each document has
>> a single field of 5-10 words, so I believe my data size is within the
>> tested limits.)
>>
>> I am using the following configuration (applied in code in the sketch after
>> this list):
>> 1.      1.5 GB RAM for the JVM
>> 2.      100 GB disk space.
>> 3.      Index creation tuning factors:
>> a.      mergeFactor = 10
>> b.      maxFieldLength = 10
>> c.      maxMergeDocs = 5000000 (if I try a larger value, I get an
>> out-of-memory error)
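>>
>> In code these factors are applied roughly like this (a sketch against the
>> 3.0-era IndexWriter setters; dir and analyzer stand for my actual setup):
>>
>>     IndexWriter writer = new IndexWriter(dir, analyzer,
>>         new IndexWriter.MaxFieldLength(10));  // b. maxFieldLength
>>     writer.setMergeFactor(10);                // a. mergeFactor
>>     writer.setMaxMergeDocs(5000000);          // c. maxMergeDocs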
>>
>> With these settings, I am able to create an index of 100 million docs (10^8)
>> in 15 minutes, consuming 2.5 GB of disk space. That is quite satisfactory
>> for me, but nevertheless I want to know what else can be done to tune it
>> further. Please help.
>> Also, with these settings, can I expect the time and size to grow linearly
>> for 1 bln (10^9) documents?
>>
>> Thanks and Regards,
>>
>> Shelly Singh
>> Center for Knowledge Driven Information Systems, Infosys
>> Email: shelly_singh@infosys.com
>> Phone: (M) 91 992 369 7200, (VoIP)2022978622
>>
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
