lucene-dev mailing list archives

From Paul Elschot
Subject Re: One Byte is Seven bits too many? - A Design suggestion
Date Mon, 23 May 2005 06:52:41 GMT
On Monday 23 May 2005 02:04, Arvind Srinivasan wrote:
> One Byte is Seven bits too many? - A Design suggestion
> Hi,
> The norm takes up 1 byte of storage per document per field.  While this may
> seem very small, a simple calculation shows that the IndexSearcher can
> consume lots of memory when it caches the norms. Further, the current
> implementation loads the norms into memory as soon as the segments get
> loaded.  Here are the numbers:
> 	For Medium sized archives
> 	docs=40Million, Fields=2  =>  80MB memory
> 	docs=40Million, Fields=10 => 400MB memory
> 	docs=40Million, Fields=20 => 800MB memory
> 	For larger sized archives 
> 	docs=400Million, Fields=2  =>  800MB memory
> 	docs=400Million, Fields=10 =>  ~4GB memory
> 	docs=400Million, Fields=20 =>  ~8GB memory
> To further compound the issue, we have found that JVM performance drops when
> the memory that it manages increases.
> While the storage itself may not be a concern, the runtime memory requirement
> could use some optimization, especially for a large number of fields.
> The fields themselves may fall into one of 3 categories:
>  (a) Tokenized fields with a huge variance in the number of tokens,
>      example - HTML page, Mail Body etc.
>  (b) Tokenized fields with very little variance in the number of tokens,
>      example - HTML Page Title, Mail Subject etc.
>  (c) Fixed Tokenized Fields,
>      example - Department, City, State etc.
> The one byte usage is very applicable for (a) and not for (b) or (c).  In
> usage, field increases can be attributed to (b) and (c).
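
The memory figures quoted above follow from one norm byte per document per
field. A minimal sketch of that arithmetic, where the class and method names
are hypothetical and nothing here is part of the Lucene API:

```java
// Hypothetical helper, not a Lucene class: reproduces the quoted estimates.
class NormMemory {
    // One norm byte per document per field, as stated in the quoted message.
    static long normBytes(long docs, int fields) {
        return docs * fields;
    }

    public static void main(String[] args) {
        long[] docCounts = {40_000_000L, 400_000_000L};
        int[] fieldCounts = {2, 10, 20};
        for (long docs : docCounts) {
            for (int fields : fieldCounts) {
                long bytes = normBytes(docs, fields);
                // Decimal megabytes, matching the rounding used in the message.
                System.out.printf("docs=%d fields=%d => %d MB%n",
                        docs, fields, bytes / 1_000_000L);
            }
        }
    }
}
```

The totals grow linearly in both the document count and the field count, which
is why adding many short fields is what drives the cost up.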

(c) would also be a nice fit for the recently discussed constant scoring
queries.  For (b) the relative variance and its influence on the score
are still high. Perhaps a mixed form with a minimum field length in a single
bit could be considered there, but addressing that might be costly.

Paul Elschot.
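
To illustrate the point about (b): assuming the default length normalization
of 1/sqrt(number of tokens), even a short field with a small absolute spread in
length can see its norm change by a large factor. The sketch below is only an
illustration; the class name and the token counts are made up, and it does not
call any Lucene code.

```java
// Hypothetical demo class; the token counts are illustrative, not measured.
class LengthNormDemo {
    // Assumed normalization: 1 / sqrt(number of tokens in the field).
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // How many times larger the norm of the shortest value is
    // compared with the norm of the longest value.
    static double normRatio(int minTokens, int maxTokens) {
        return lengthNorm(minTokens) / lengthNorm(maxTokens);
    }

    public static void main(String[] args) {
        System.out.printf("(a) body,  50 vs 5000 tokens: ratio %.2f%n", normRatio(50, 5000));
        System.out.printf("(b) title, 2 vs 8 tokens:     ratio %.2f%n", normRatio(2, 8));
        System.out.printf("(c) city,  1 vs 1 token:      ratio %.2f%n", normRatio(1, 1));
    }
}
```

Under this assumption a title of 2 tokens gets twice the norm of a title of 8
tokens, so dropping the norm for (b) would change ranking noticeably, while for
(c) the ratio is 1 and the per-document byte carries no information.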
