From: "Monsur Hossain"
Reply-To: java-user@lucene.apache.org
Subject: Sorting: string vs int
Date: Wed, 9 Nov 2005 16:58:04 -0500
Organization: Xanga.com

Hi all. I have a question about sorting. Lucene in Action says:

"For numeric types, each field being sorted for each document in the index requires that four bytes be cached. For String types, each unique term is also cached for each document."

I want to make sure I'm understanding this correctly.

Let's say I have a document with some text and a date; a typical document might look like this:

DOCUMENT #1:
  text = hello world
  date = 20050401

Let's say I index 10,000 of these documents into a single Lucene index. I then create two IndexSearchers on this index and do a search. The first IndexSearcher sorts by date as an int, the other sorts by date as a string:

IndexSearcher #1 = date sort on INT
IndexSearcher #2 = date sort on STRING

If I understand the quoted sentence correctly, IndexSearcher #1 will have an int array storing one date per document, while IndexSearcher #2 will have a string array with only unique dates? If so, is there a particular reason why sorting as an int doesn't cache unique dates?

The reason I ask is this: consider an index with 10,000 documents, where I store year, month, and day as separate fields (for simplicity, let's assume I store only the years 2000 - 2005). When sorting as an int, if each field of each document needs to be cached, that's 10,000 documents * 3 fields = 30,000 cached ints. If terms are uniquely cached, that's just 6 (for each year) + 12 (for each month) + 31 (for each day) = 49 cached values.

Am I interpreting any of this correctly?

Thanks,
Monsur
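
To make the arithmetic above concrete, here is a rough sketch in plain Java of the two cache layouts I'm picturing. Everything in it (the SortCacheSketch class, TermOrdinals, sampleField, and so on) is a made-up name for illustration; none of it is Lucene's actual code, and the per-document ordinal array in the second layout is my own assumption about how a unique-term cache would still know which term belongs to which document.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class SortCacheSketch {

    /** Layout A (numeric sort, as I read the quote): one int per document. */
    public static int[] buildIntCache(String[] fieldValues) {
        int[] cache = new int[fieldValues.length];
        for (int doc = 0; doc < fieldValues.length; doc++) {
            cache[doc] = Integer.parseInt(fieldValues[doc]);
        }
        return cache;
    }

    /** Layout B (hypothetical): unique terms stored once, plus a pointer per document. */
    public static final class TermOrdinals {
        public final String[] lookup; // each unique term, in sorted order
        public final int[] order;     // order[doc] = index of that doc's term in lookup

        TermOrdinals(String[] lookup, int[] order) {
            this.lookup = lookup;
            this.order = order;
        }
    }

    public static TermOrdinals buildTermOrdinals(String[] fieldValues) {
        // TreeSet removes duplicates and sorts, so binarySearch below is valid.
        String[] lookup = new TreeSet<>(Arrays.asList(fieldValues)).toArray(new String[0]);
        int[] order = new int[fieldValues.length];
        for (int doc = 0; doc < fieldValues.length; doc++) {
            order[doc] = Arrays.binarySearch(lookup, fieldValues[doc]);
        }
        return new TermOrdinals(lookup, order);
    }

    /** Fake field data: document i gets the value first + (i % count). */
    public static String[] sampleField(int numDocs, int first, int count) {
        String[] values = new String[numDocs];
        for (int doc = 0; doc < numDocs; doc++) {
            values[doc] = String.valueOf(first + doc % count);
        }
        return values;
    }

    public static void main(String[] args) {
        int numDocs = 10_000;
        String[][] fields = {
            sampleField(numDocs, 2000, 6), // year: 2000..2005
            sampleField(numDocs, 1, 12),   // month: 1..12
            sampleField(numDocs, 1, 31),   // day: 1..31
        };

        int perDocInts = 0;
        int uniqueTerms = 0;
        int perDocOrdinals = 0;
        for (String[] field : fields) {
            perDocInts += buildIntCache(field).length;
            TermOrdinals idx = buildTermOrdinals(field);
            uniqueTerms += idx.lookup.length;
            perDocOrdinals += idx.order.length;
        }

        System.out.println("per-document ints:     " + perDocInts);     // 30000
        System.out.println("unique terms:          " + uniqueTerms);    // 49
        System.out.println("per-document ordinals: " + perDocOrdinals); // 30000
    }
}
```

If the second layout is anywhere near what actually happens, the 49 unique terms would come on top of a per-document array rather than replace it, which may be what the "also" in the quoted sentence is getting at.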