lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Monsur Hossain" <mon...@xanga.net>
Subject Sorting: string vs int
Date Wed, 09 Nov 2005 21:58:04 GMT
Hi all.  I have a question about sorting.  Lucene in Action says: "For
numeric types, each field being sorted for each document in the index
requires that four bytes be cached.  For String types, each unique term is
also cached for each document."

I want to make sure I'm understanding this correctly.  Lets say I have a
document with some text and a date; a typical document might look like this:

DOCUMENT #1:
text = hello world
date = 20050401

Lets say I index 10,000 of these documents into a single Lucene index.  I
then create two IndexSearchers on this index and do a search.  The first
IndexSearcher sorts by date as an int, the other sorts by date as a string:

IndexSearcher #1 = date sort on INT
IndexSearcher #2 = date sort in STRING

If I understand the quoted sentence correctly, IndexSearcher #1 will have an
int array storing one date per document, while IndexSearcher #2 will have a
string array with only unique dates?  If so, is there a particular reason
why sorting as an int doesn't cache unique dates?

The reason I ask this is consider an index with 10,000 documents, where I
store year, month, and day as separte fields (for simplicity lets assume I
only store the years 2000 - 2005 only).  When searching as an int, if each
field of each document needs to be cached, that's 10,000 documents * 3
fields = 30,000 cached ints.  If terms are uniquely cached, that's just 6
(for each year) + 12 (for each month) + 31 (for each day) = 49 cached ints.
Am I interpreting any of this correctly?

Thanks,
Monsur



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message