lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Monsur Hossain" <mon...@monsur.com>
Subject RE: Sorting: string vs int
Date Thu, 10 Nov 2005 17:43:59 GMT

Thanks Yonik, it makes sense now.  So getStringIndex indexes every sorted
string field in the retArray (one per document), and then each unique string
term in the mterms array.  What is the purpose of the mterms array?

Thanks,
Monsur




> -----Original Message-----
> From: Yonik Seeley [mailto:yseeley@gmail.com] 
> Sent: Wednesday, November 09, 2005 9:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: Sorting: string vs int
> 
> The FieldCache (which is used for sorting), uses arrays of size
> maxDoc() to cache field values.  String sorting will involve caching a
> String[] (or StringIndex) and int sorting will involve caching an
> int[].  Unique string values are shared in the array, but the String
> values plus the String[] will always take up more room than the int[].
> 
> -Yonik
> Now hiring -- http://forms.cnet.com/slink?231706
> 
> 
> On 11/9/05, Monsur Hossain <monsur@xanga.net> wrote:
> > Hi all.  I have a question about sorting.  Lucene in Action 
> says: "For
> > numeric types, each field being sorted for each document in 
> the index
> > requires that four bytes be cached.  For String types, each 
> unique term is
> > also cached for each document."
> >
> > I want to make sure I'm understanding this correctly.  Lets 
> say I have a
> > document with some text and a date; a typical document 
> might look like this:
> >
> > DOCUMENT #1:
> > text = hello world
> > date = 20050401
> >
> > Lets say I index 10,000 of these documents into a single 
> Lucene index.  I
> > then create two IndexSearchers on this index and do a 
> search.  The first
> > IndexSearcher sorts by date as an int, the other sorts by 
> date as a string:
> >
> > IndexSearcher #1 = date sort on INT
> > IndexSearcher #2 = date sort in STRING
> >
> > If I understand the quoted sentence correctly, 
> IndexSearcher #1 will have an
> > int array storing one date per document, while 
> IndexSearcher #2 will have a
> > string array with only unique dates?  If so, is there a 
> particular reason
> > why sorting as an int doesn't cache unique dates?
> >
> > The reason I ask this is consider an index with 10,000 
> documents, where I
> > store year, month, and day as separte fields (for 
> simplicity lets assume I
> > only store the years 2000 - 2005 only).  When searching as 
> an int, if each
> > field of each document needs to be cached, that's 10,000 
> documents * 3
> > fields = 30,000 cached ints.  If terms are uniquely cached, 
> that's just 6
> > (for each year) + 12 (for each month) + 31 (for each day) = 
> 49 cached ints.
> > Am I interpreting any of this correctly?
> >
> > Thanks,
> > Monsur
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message