lucene-java-user mailing list archives

From Jake Mannix <>
Subject Re: Maximum index file size
Date Fri, 23 Oct 2009 06:49:00 GMT
Hi Hrishi,

  The only way you'll know is to try it with some subset of your data - some
queries can be very expensive, some are really easy.  It'll depend on your
document size, the vocabulary (total number and distribution of terms), and
kinds of queries, as well as of course your hardware.  I would start out
indexing the sizes you mention (10-1000GB), and run queries like those you
expect to be running in production against it, and measure your TPS after
it's been running for a while under load.
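
A minimal sketch of that kind of measurement, using only the JDK. The class name `ThroughputProbe` and the `query` callback are hypothetical stand-ins for one production-like search call; this is a single-threaded probe, so a real load test would run several of these concurrently.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical harness: `query` stands in for one production-like search call.
final class ThroughputProbe {

    /** Runs `query` for a warm-up period, then measures; returns queries per second. */
    static double measureQps(Runnable query, long warmupMillis, long measureMillis) {
        runFor(query, warmupMillis); // let the OS cache and the JIT settle first
        long start = System.nanoTime();
        long completed = runFor(query, measureMillis);
        double elapsedSeconds = (System.nanoTime() - start) / 1e9;
        return completed / elapsedSeconds;
    }

    /** Calls `query` repeatedly until `millis` have elapsed; returns the call count. */
    private static long runFor(Runnable query, long millis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis);
        long count = 0;
        while (System.nanoTime() < deadline) {
            query.run();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with a real search against the test index.
        Runnable fakeQuery = () -> Math.sqrt(System.nanoTime());
        System.out.printf("%.0f queries/sec%n", measureQps(fakeQuery, 200, 500));
    }
}
```

The warm-up phase matters here because the numbers only become meaningful once the index has been running under load for a while.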

  To index even 1TB you should probably do this in parallel and then merge
afterwards if you want to build up this test index in any reasonable time,
but that final merge of the last two segments in your 1TB index is gonna be
a killer.

  One of the big problems you'll run into with this index size is that
you'll never have enough RAM to give your OS's IO cache enough room to keep
much of this index in memory, so you're going to be seeking in this monster
file a lot.  I'm not saying that you need to keep your index in RAM for good
performance, but I've always tried to keep the individual indexes I use at
least within a (binary) order of magnitude of the RAM available - if I'm on
a box with 16GB of memory, then an index bigger than 32GB is getting
dangerously big for my preferences.  This may be mitigated by using really
fast disks, possibly, which is yet another reason why you'll need to do some
performance profiling on a variety of sizes with similar-to-production data.
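
The rule of thumb above can be written down as a quick check. This is only a sketch of a personal preference, not a Lucene limit; the names `IndexSizeRule`, `withinComfortZone` and `minShards` are made up for illustration, and the shard count is an extrapolation of the rule, not something measured.

```java
// Sketch of the "within a binary order of magnitude of RAM" preference.
final class IndexSizeRule {
    static final long GB = 1L << 30;

    /** True if the index is at most twice the RAM available on the box. */
    static boolean withinComfortZone(long indexBytes, long ramBytes) {
        return indexBytes <= 2 * ramBytes;
    }

    /** Smallest number of equal shards that keeps each shard within 2x RAM per box. */
    static long minShards(long totalIndexBytes, long ramPerBoxBytes) {
        long perShardCap = 2 * ramPerBoxBytes;
        return (totalIndexBytes + perShardCap - 1) / perShardCap; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(withinComfortZone(32 * GB, 16 * GB)); // at the limit: true
        System.out.println(withinComfortZone(50 * GB, 16 * GB)); // too big: false
        System.out.println(minShards(1024 * GB, 16 * GB));       // 1TB index on 16GB boxes
    }
}
```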

  I wish I could be of more help - but I think at this size, you'll need to
play with it to see what works.  We here on the list would be *very*
interested to hear what you find, because I'll bet that the reason why
you're not getting very many responses to this question is not because
nobody cares, but because most of us don't really know if you can ever
really search multi-TB *single* indexes, or what kind of cluster
configuration works best for searching a 75 TB distributed lucene index!


On Thu, Oct 22, 2009 at 11:29 PM, Hrishikesh Agashe <> wrote:

> Thanks Jake.
> I have around 75 TB of data to be indexed. So even if I do the sharding,
> individual index file size might still be pretty high. And that's why I
> wanted to find out whether there is any limit as such. And obviously whether
> such huge index files can be searched at all.
> From your response it appears that a single 1 TB index file is too much. Is
> there any guideline on what kind of hardware will be required to handle
> index files of a given size (10GB, 50GB, 100GB, 500GB etc) with sensible
> search times?
> --Hrishi
> -----Original Message-----
> From: Jake Mannix []
> Sent: Friday, October 23, 2009 11:09 AM
> To:
> Subject: Re: Maximum index file size
> On Thu, Oct 22, 2009 at 10:29 PM, Hrishikesh Agashe <> wrote:
> > Can I create an index file with a very large size, like 1 TB or so? Is
> > there any limit on how large an index file one can create? Also, will I
> > be able to search on this 1 TB index file at all?
> >
> Leaving aside the question of hardware or JVM limits on monstrous files,
> this question (can you search this file) is easier: if you've got, say, ten
> billion documents in one index, and you have a query which is going to hit
> maybe even just 0.1% of the documents, you'll need to do scoring of 10
> million hits in the course of that query.  To do this in under a second
> means you only have 100 nanoseconds to look at each document.  If your query
> hits 1% of your documents, you're down to 10 ns per document.  I've never
> tried searching a 1TB index, but I'd say that's pushing it.
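
The per-document budget quoted above is plain arithmetic and can be reproduced as a sketch. The class and method names are hypothetical, and the model assumes the whole time budget is spent scoring hits, which ignores seeks and everything else a query does.

```java
// Back-of-the-envelope budget: time available per scored hit.
final class ScoringBudget {

    /** Nanoseconds available per hit if the whole query must finish in the budget. */
    static double nanosPerHit(long totalDocs, double hitFraction, double budgetSeconds) {
        double hits = totalDocs * hitFraction;
        return budgetSeconds * 1e9 / hits;
    }

    public static void main(String[] args) {
        long docs = 10_000_000_000L; // ten billion documents in one index
        System.out.printf("0.1%% hit rate: %.0f ns per hit%n", nanosPerHit(docs, 0.001, 1.0));
        System.out.printf("1%% hit rate:   %.0f ns per hit%n", nanosPerHit(docs, 0.01, 1.0));
    }
}
```
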
> Is there a reason you can't shard your index, and instead put maybe 20
> shards of 50GB (or better - 100 shards of 10GB) each on a variety of
> machines, and just merge results?
>  -jake
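
The "just merge results" step for a sharded setup can be sketched as a top-k merge over per-shard hit lists. This is an illustration in plain Java, not Lucene's API: `ShardMerge` and `Hit` are hypothetical names, and the sketch assumes scores from different shards are directly comparable, which may not hold if each shard computes its own term statistics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class ShardMerge {

    /** Hypothetical hit record: source shard, shard-local doc id, score. */
    static final class Hit {
        final int shard;
        final int doc;
        final float score;

        Hit(int shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Merges per-shard hit lists into the global top k, highest score first. */
    static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap on score: the weakest of the current best k sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (List<Hit> hits : perShard) {
            for (Hit h : hits) {
                heap.offer(h);
                if (heap.size() > k) {
                    heap.poll(); // evict the weakest hit once we hold more than k
                }
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());
        return best;
    }
}
```

Each shard only needs to return its own top k hits, so the merge step handles at most k times the number of shards, regardless of how large the individual indexes are.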
