lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ariel goldberg <arif...@yahoo.com>
Subject One index per user or one index per day?
Date Mon, 26 Feb 2007 15:25:51 GMT

Greetings,



 



I'm creating an application that 
requires the indexing of millions of documents on behalf of a large group of 
users, and was hoping to get an opinion on whether I should use one index per 
user or one index per day.



 



My application will have to handle 
the following:



 



- the indexing of about 1 million 5K 
documents per day, with each document containing about 5 
fields



- expiration of documents, since 
after a while, my hard drive would run out of 
room



- queries that consist of boolean 
expressions (e.g., the body field contains "a" AND "b", and the title field 
contains "c"), as well as ranges (e.g., the document needs to have been indexed 
between 2/25/07 10:00 am and 2/28/07 9:00 pm)



- permissions; in other words, user 
A might be able to search on documents X and Y, but user B might be able to 
search on documents Y and Z.



- up to 1,000 
users



 



So, I was considering the 
following:



 



1) Using one index per 
user



 



This would entail creating and using 
up to 1,000 indices.  Document Y in the example above would have to be 
duplicated.  Expiration is performed via IndexWriter.deleteDocuments.  The 
advantage here is that querying should be reasonably quick, because each index 
would only contain tens of thousands of documents, instead of millions.  The 
disadvantages: I'm concerned about the "too many open files" error, and I'm also 
concerned about the performance of 
deleteDocuments.



 



2) Using one index per 
day



 



Each day, I create a new index.  
Again, document Y in the example above would have to be duplicated (is there any 
way around this?)  The advantage here is that expiring documents means simply 
deleting the index corresponding to a particular day.  The disadvantage is the 
query performance, since the queries, which are already very complex, would have 
to be performed using MultiSearcher (if expiration is after 10 days, that's 10 
indices to search across).



 



Tough to know for sure which option 
is better without testing, but does anyone have a gut reaction?  Any advice 
would be greatly appreciated!



 



Thanks,



Ariel






 
____________________________________________________________________________________
Need Mail bonding?
Go to the Yahoo! Mail Q&A for great tips from Yahoo! Answers users.
http://answers.yahoo.com/dir/?link=list&sid=396546091

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message