lucene-java-user mailing list archives

From Shane <lucene-u...@my-family.us>
Subject Re: One index per user or one index per day?
Date Mon, 26 Feb 2007 17:03:53 GMT
If you can categorize the documents based on user permissions, that is 
the route I would go. 

For example, users 1, 2, and 3 are allowed to search documents a and b.
In addition, user 1 can search documents c and d, while users 2 and 3
can search documents e and f.  I would create 3 indexes: one for docs a
and b, one for docs c and d, and finally one for docs e and f.  Then,
using your method of choice, you can restrict each search to the indexes
that the user's permissions allow.
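
To make the grouping concrete, here is a rough sketch in plain Java (no
Lucene calls). It assumes you can get a map from each document to the
set of users allowed to search it; the class and method names
(PermissionGroups, groupByAudience, indexesFor) are made up for this
example. Documents that share exactly the same audience land in the same
group, and each group would become one index.

```java
import java.util.*;

public class PermissionGroups {

    // Hypothetical helper: groups documents by the exact set of users
    // allowed to search them. Each resulting group would be one index.
    static Map<Set<String>, Set<String>> groupByAudience(Map<String, Set<String>> docToUsers) {
        Map<Set<String>, Set<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : docToUsers.entrySet()) {
            Set<String> audience = new TreeSet<>(e.getValue()); // defensive copy used as key
            groups.computeIfAbsent(audience, k -> new TreeSet<>()).add(e.getKey());
        }
        return groups;
    }

    // Hypothetical helper: returns the document groups (indexes) that
    // the given user is allowed to search.
    static List<Set<String>> indexesFor(String user, Map<Set<String>, Set<String>> groups) {
        List<Set<String>> allowed = new ArrayList<>();
        for (Map.Entry<Set<String>, Set<String>> e : groups.entrySet()) {
            if (e.getKey().contains(user)) {
                allowed.add(e.getValue());
            }
        }
        return allowed;
    }

    // The permissions from the example above: docs a-f, users 1-3.
    static Map<String, Set<String>> exampleAcl() {
        Map<String, Set<String>> acl = new LinkedHashMap<>();
        acl.put("a", new HashSet<>(Arrays.asList("1", "2", "3")));
        acl.put("b", new HashSet<>(Arrays.asList("1", "2", "3")));
        acl.put("c", new HashSet<>(Arrays.asList("1")));
        acl.put("d", new HashSet<>(Arrays.asList("1")));
        acl.put("e", new HashSet<>(Arrays.asList("2", "3")));
        acl.put("f", new HashSet<>(Arrays.asList("2", "3")));
        return acl;
    }

    public static void main(String[] args) {
        Map<Set<String>, Set<String>> groups = groupByAudience(exampleAcl());
        System.out.println(groups.size());            // prints 3
        System.out.println(indexesFor("1", groups));  // prints [[a, b], [c, d]]
        System.out.println(indexesFor("2", groups));  // prints [[a, b], [e, f]]
    }
}
```

At query time you would open a searcher on each index returned for the
user and combine them, for instance with the MultiSearcher you already
mention.  No document is stored twice, but the number of indexes grows
with the number of distinct audiences, not with the number of users.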

I realize scaling may cause an issue, but this route would allow you to 
normalize your data and reduce duplication in the system.

Shane

ariel goldberg wrote:
> Greetings,
>
> I'm creating an application that requires the indexing of millions of
> documents on behalf of a large group of users, and was hoping to get an
> opinion on whether I should use one index per user or one index per day.
>
> My application will have to handle the following:
>
> - the indexing of about 1 million 5K documents per day, with each
>   document containing about 5 fields
>
> - expiration of documents, since after a while, my hard drive would run
>   out of room
>
> - queries that consist of boolean expressions (e.g., the body field
>   contains "a" AND "b", and the title field contains "c"), as well as
>   ranges (e.g., the document needs to have been indexed between 2/25/07
>   10:00 am and 2/28/07 9:00 pm)
>
> - permissions; in other words, user A might be able to search on
>   documents X and Y, but user B might be able to search on documents Y
>   and Z.
>
> - up to 1,000 users
>
> So, I was considering the following:
>
> 1) Using one index per user
>
> This would entail creating and using up to 1,000 indices.  Document Y in
> the example above would have to be duplicated.  Expiration is performed
> via IndexWriter.deleteDocuments.  The advantage here is that querying
> should be reasonably quick, because each index would only contain tens
> of thousands of documents, instead of millions.  The disadvantages: I'm
> concerned about the "too many open files" error, and I'm also concerned
> about the performance of deleteDocuments.
>
> 2) Using one index per day
>
> Each day, I create a new index.  Again, document Y in the example above
> would have to be duplicated (is there any way around this?)  The
> advantage here is that expiring documents means simply deleting the
> index corresponding to a particular day.  The disadvantage is the query
> performance, since the queries, which are already very complex, would
> have to be performed using MultiSearcher (if expiration is after 10
> days, that's 10 indices to search across).
>
> Tough to know for sure which option is better without testing, but does
> anyone have a gut reaction?  Any advice would be greatly appreciated!
>
> Thanks,
>
> Ariel
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>   
