lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Smith <psm...@aconex.com>
Subject Re: Memory Usage
Date Thu, 03 Jul 2008 23:02:40 GMT
>
>
> (there are around 6,000,000 posts on the message board database)
>
> Date encoded as yyMMdd: appears to be using around 30M
> Date encoded as yyMMddHHmmss:  appears to be using more than 400M!
>
> I guess I would have understood if I was seeing the usage double for  
> sure, or even a little more; no idea how you guys encode the  
> indexes, if at all, but it's gone up over tenfold, which I can't  
> explain.

Sort memory cost is based on the total # of unique terms for the given  
field (multiplied by the number of locale's involved if you have to do  
that too! but in temporal sorting you don't).

This is easier than you think, just use 2 fields (date, time) and sort  
by both.  This means the Date field's unique term count grows only 1  
term per day.  The Time field can be set to minutes (if you can get  
away with that) meaning that you only have fairly insignificant total  
term count for the time field.  We use this at Aconex,  and have  
indexes with millions of records (weekly 'work' searcher refreshed  
every 5 seconds, archive searcher is held in memory, with a  
Multisearcher done over the 2) and it works a treat.  We regularly  
need to return million+ results from a search (don't ask) using this  
sort of sorting and the overall search time is only a few seconds.

On a related note, work hard not to need to use Locale sensitive  
sorting if you can for any other fields, for large results the CPU  
penalty is horrific (even once you get past the synchronization  
bottleneck in the CollationKey stuff).

cheers,

Paul Smith

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message