Message-ID: <45E312F9.4@my-family.us>
Date: Mon, 26 Feb 2007 10:03:53 -0700
From: Shane
To: java-user@lucene.apache.org
Subject: Re: One index per user or one index per day?
In-Reply-To: <20070226152552.26280.qmail@web53909.mail.yahoo.com>

If you can categorize the documents based on user permissions, that is
the route I would go. For example, say users 1, 2, and 3 are all allowed
to search documents a and b. In addition, user 1 can search documents c
and d, while users 2 and 3 can search documents e and f. I would create
3 indexes: one for docs a and b, one for docs c and d, and one for docs
e and f. Then, using your method of choice, you can restrict each search
to the indexes that the user's permissions allow.

I realize scaling may become an issue, but this route would allow you to
normalize your data and reduce duplication in the system.

Shane

ariel goldberg wrote:
> Greetings,
>
> I'm creating an application that requires the indexing of millions of
> documents on behalf of a large group of users, and was hoping to get an
> opinion on whether I should use one index per user or one index per day.
>
> My application will have to handle the following:
>
> - the indexing of about 1 million 5K documents per day, with each
>   document containing about 5 fields
>
> - expiration of documents, since after a while, my hard drive would run
>   out of room
>
> - queries that consist of boolean expressions (e.g., the body field
>   contains "a" AND "b", and the title field contains "c"), as well as
>   ranges (e.g., the document needs to have been indexed between 2/25/07
>   10:00 am and 2/28/07 9:00 pm)
>
> - permissions; in other words, user A might be able to search on
>   documents X and Y, but user B might be able to search on documents Y
>   and Z.
>
> - up to 1,000 users
>
> So, I was considering the following:
>
> 1) Using one index per user
>
> This would entail creating and using up to 1,000 indices. Document Y in
> the example above would have to be duplicated. Expiration is performed
> via IndexWriter.deleteDocuments. The advantage here is that querying
> should be reasonably quick, because each index would only contain tens
> of thousands of documents, instead of millions. The disadvantages: I'm
> concerned about the "too many open files" error, and I'm also concerned
> about the performance of deleteDocuments.
>
> 2) Using one index per day
>
> Each day, I create a new index. Again, document Y in the example above
> would have to be duplicated (is there any way around this?) The
> advantage here is that expiring documents means simply deleting the
> index corresponding to a particular day. The disadvantage is the query
> performance, since the queries, which are already very complex, would
> have to be performed using MultiSearcher (if expiration is after 10
> days, that's 10 indices to search across).
>
> Tough to know for sure which option is better without testing, but does
> anyone have a gut reaction? Any advice would be greatly appreciated!
>
> Thanks,
>
> Ariel
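
The grouping suggested in the reply (one index per distinct set of permitted users, so that no document is stored twice) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (PermissionGroups, groupByAudience, indexesFor) are hypothetical, permissions are assumed to be available as a simple document-to-users map, and no Lucene calls are made. Each resulting group would correspond to one index on disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: decides which documents share an index.
public class PermissionGroups {

    /** Groups documents by the exact set of users allowed to search them. */
    public static Map<Set<Integer>, Set<String>> groupByAudience(
            Map<String, Set<Integer>> docToUsers) {
        Map<Set<Integer>, Set<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Set<Integer>> e : docToUsers.entrySet()) {
            // Copy the user set so the map key cannot be mutated by the caller.
            Set<Integer> audience = new TreeSet<>(e.getValue());
            groups.computeIfAbsent(audience, k -> new TreeSet<>()).add(e.getKey());
        }
        return groups;
    }

    /** Returns the document groups (one per index) that the given user may search. */
    public static List<Set<String>> indexesFor(
            int user, Map<Set<Integer>, Set<String>> groups) {
        List<Set<String>> visible = new ArrayList<>();
        for (Map.Entry<Set<Integer>, Set<String>> e : groups.entrySet()) {
            if (e.getKey().contains(user)) {
                visible.add(e.getValue());
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // The example from the reply: users 1-3 see a and b; user 1 also sees
        // c and d; users 2 and 3 also see e and f.
        Map<String, Set<Integer>> docToUsers = new LinkedHashMap<>();
        docToUsers.put("a", new TreeSet<>(Arrays.asList(1, 2, 3)));
        docToUsers.put("b", new TreeSet<>(Arrays.asList(1, 2, 3)));
        docToUsers.put("c", new TreeSet<>(Arrays.asList(1)));
        docToUsers.put("d", new TreeSet<>(Arrays.asList(1)));
        docToUsers.put("e", new TreeSet<>(Arrays.asList(2, 3)));
        docToUsers.put("f", new TreeSet<>(Arrays.asList(2, 3)));

        Map<Set<Integer>, Set<String>> groups = groupByAudience(docToUsers);
        System.out.println(groups.size() + " indexes needed");
        System.out.println("user 1 searches: " + indexesFor(1, groups));
        System.out.println("user 2 searches: " + indexesFor(2, groups));
    }
}
```

With the permissions from the example this yields three groups, and each user ends up searching two of them. The number of groups grows with the number of distinct permission sets rather than the number of users, which is where the scaling concern mentioned in the reply would show up.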

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org