lucene-java-user mailing list archives

From Yonik Seeley <ysee...@gmail.com>
Subject Re: Sorting: string vs int
Date Thu, 10 Nov 2005 02:23:26 GMT
The FieldCache, which is used for sorting, caches field values in arrays
of size maxDoc().  String sorting involves caching a String[] (or a
StringIndex), and int sorting involves caching an int[].  Each unique
string value is stored only once and shared among the array entries that
refer to it, but the String values plus the String[] will always take up
more room than the int[].
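
To make the two layouts concrete, here is a minimal sketch in plain Java.
It is not the actual FieldCache code: the class and method names are
hypothetical, and the lookup/order pair only loosely models a StringIndex
(it assumes every document has a value for the field).

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Illustrative model of per-document sort caches; NOT Lucene's FieldCache code. */
public class SortCacheSketch {

    /** Int sort: one int per document, indexed by document number. */
    public static int[] buildIntCache(String[] fieldValues) {
        int[] cache = new int[fieldValues.length];
        for (int doc = 0; doc < fieldValues.length; doc++) {
            cache[doc] = Integer.parseInt(fieldValues[doc]);
        }
        return cache;
    }

    /** String sort: unique values stored once, plus one ordinal per document. */
    public static final class StringIndexSketch {
        public final String[] lookup; // unique values, in sorted order
        public final int[] order;     // order[doc] = index of the doc's value in lookup

        StringIndexSketch(String[] lookup, int[] order) {
            this.lookup = lookup;
            this.order = order;
        }
    }

    public static StringIndexSketch buildStringIndex(String[] fieldValues) {
        // TreeSet both deduplicates and sorts the values.
        String[] lookup = new TreeSet<>(Arrays.asList(fieldValues)).toArray(new String[0]);
        int[] order = new int[fieldValues.length];
        for (int doc = 0; doc < fieldValues.length; doc++) {
            order[doc] = Arrays.binarySearch(lookup, fieldValues[doc]);
        }
        return new StringIndexSketch(lookup, order);
    }

    public static void main(String[] args) {
        int maxDoc = 10_000; // assumed index size, taken from the question
        String[] dates = new String[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            dates[doc] = String.valueOf(20050401 + doc % 30); // 30 distinct dates
        }

        int[] intCache = buildIntCache(dates);
        StringIndexSketch stringCache = buildStringIndex(dates);

        System.out.println("int cache entries:     " + intCache.length);
        System.out.println("string order entries:  " + stringCache.order.length);
        System.out.println("unique strings cached: " + stringCache.lookup.length);
    }
}
```

Both caches need one array slot per document; the string variant needs the
unique String objects on top of that, which is why it cannot come out
smaller than the int[].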

-Yonik


On 11/9/05, Monsur Hossain <monsur@xanga.net> wrote:
> Hi all.  I have a question about sorting.  Lucene in Action says: "For
> numeric types, each field being sorted for each document in the index
> requires that four bytes be cached.  For String types, each unique term is
> also cached for each document."
>
> I want to make sure I'm understanding this correctly.  Let's say I have a
> document with some text and a date; a typical document might look like this:
>
> DOCUMENT #1:
> text = hello world
> date = 20050401
>
> Let's say I index 10,000 of these documents into a single Lucene index.  I
> then create two IndexSearchers on this index and do a search.  The first
> IndexSearcher sorts by date as an int, the other sorts by date as a string:
>
> IndexSearcher #1 = date sort on INT
> IndexSearcher #2 = date sort on STRING
>
> If I understand the quoted sentence correctly, IndexSearcher #1 will have an
> int array storing one date per document, while IndexSearcher #2 will have a
> string array with only unique dates?  If so, is there a particular reason
> why sorting as an int doesn't cache unique dates?
>
> The reason I ask: consider an index with 10,000 documents, where I
> store year, month, and day as separate fields (for simplicity, let's assume I
> store only the years 2000 - 2005).  When sorting as an int, if each
> field of each document needs to be cached, that's 10,000 documents * 3
> fields = 30,000 cached ints.  If terms are uniquely cached, that's just 6
> (for each year) + 12 (for each month) + 31 (for each day) = 49 cached ints.
> Am I interpreting any of this correctly?
>
> Thanks,
> Monsur
