lucene-java-user mailing list archives

From Hrishikesh Agashe <hrishikesh_aga...@persistent.co.in>
Subject RE: Maximum index file size
Date Fri, 23 Oct 2009 06:29:04 GMT
Thanks Jake. 

I have around 75 TB of data to index. So even if I shard, the individual index
files might still be pretty large. That's why I wanted to find out whether there is
any limit as such, and of course whether such huge index files can be searched at all.

From your response it appears that 1 TB for a single index file is too much. Is there any
guideline on what kind of hardware is required to handle index files of a given size
(10GB, 50GB, 100GB, 500GB etc.) with sensible search times?

--Hrishi

-----Original Message-----
From: Jake Mannix [mailto:jake.mannix@gmail.com] 
Sent: Friday, October 23, 2009 11:09 AM
To: java-user@lucene.apache.org
Subject: Re: Maximum index file size

On Thu, Oct 22, 2009 at 10:29 PM, Hrishikesh Agashe <
hrishikesh_agashe@persistent.co.in> wrote:

> Can I create an index file with very large size, like 1 TB or so? Is there
> any limit on how large index file one can create? Also, will I be able to
> search on this 1 TB index file at all?
>

Leaving aside the question of hardware or JVM limits on monstrous files,
this question (can you search this file) is easier: if you've got, say, ten
billion documents in one index, and you have a query which is going to hit
maybe even just 0.1% of the documents, you'll need to do scoring of 10
million hits in the course of that query.  To do this in under a second
means you only have 100 nanoseconds to look at each document.  If your query
hits 1% of your documents, you're down to 10 ns per document.  I've never
tried searching a 1TB index, but I'd say that's pushing it.
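
The per-document budget above can be checked with a few lines of Java. This is
only a sketch of the arithmetic in the paragraph; the class and method names
(QueryBudget, expectedHits, nanosPerHit) are made up for illustration, and it
assumes a single thread scoring every hit one after another.

```java
public class QueryBudget {
    /** Number of documents a query matches, with the hit rate given in parts per thousand. */
    public static long expectedHits(long totalDocs, long hitsPerThousand) {
        // multiplyExact throws instead of silently overflowing on absurd inputs
        return Math.multiplyExact(totalDocs, hitsPerThousand) / 1000;
    }

    /** Time available per matching document if the whole query must fit in budgetNanos. */
    public static long nanosPerHit(long hits, long budgetNanos) {
        return budgetNanos / hits;
    }

    public static void main(String[] args) {
        long totalDocs = 10_000_000_000L; // ten billion documents, as in the example
        long oneSecond = 1_000_000_000L;  // one second expressed in nanoseconds

        // 1 per thousand = 0.1%, 10 per thousand = 1%
        for (long perThousand : new long[] {1, 10}) {
            long hits = expectedHits(totalDocs, perThousand);
            System.out.println(hits + " hits, " + nanosPerHit(hits, oneSecond) + " ns per hit");
        }
    }
}
```

Splitting the same documents over N shards searched in parallel multiplies the
per-hit budget on each shard by roughly N, which is the motivation for the
sharding suggestion.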

Is there a reason you can't shard your index, and instead put maybe 20
shards of 50GB (or better - 100 shards of 10GB) each on a variety of
machines, and just merge results?
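
To make "just merge results" concrete, here is a minimal sketch of the merge
step in plain Java. It deliberately uses no Lucene classes: ShardMerge, Hit and
mergeTopK are hypothetical names, and it assumes each shard has already
returned its own top hits and that scores are comparable across shards (with
per-shard term statistics they may not be exactly comparable).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ShardMerge {
    /** One scored hit from one shard; the fields are illustrative only. */
    public static final class Hit {
        public final int shard;
        public final int docId;
        public final float score;

        public Hit(int shard, int docId, float score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Merge per-shard hit lists into the global top k, highest score first. */
    public static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap on score: the weakest hit kept so far sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (List<Hit> hits : perShard) {
            for (Hit h : hits) {
                heap.offer(h);
                if (heap.size() > k) {
                    heap.poll(); // drop the current weakest so at most k remain
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(byScore.reversed());
        return merged;
    }
}
```

Each shard only needs to send back its own top k hits, so the merging machine
handles at most k times the number of shards, regardless of how many documents
matched overall.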

  -jake
