lucene-dev mailing list archives

From ckirkendall <ckirkend...@hobsons-us.com>
Subject Re: Lucene's use of one byte to encode document length
Date Tue, 14 Jan 2003 22:59:27 GMT
If I recall correctly, Google divides up its set of documents, meaning
that a group of boxes handles a set of documents (I am not sure how this
division is made; probably by keyword frequency).  This means that the
entire repository list is never located on one box.
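
As a rough illustration of why partitioning matters for the memory
question raised below, here is a small Python sketch of the arithmetic.
The function names are made up for this example, the even split across
boxes is an assumption (I do not know the actual division scheme), and
the 3 billion document and 10,000 machine figures are just the round
numbers from Jonathan's message.

```python
def norm_bytes(num_docs, num_fields=1, bytes_per_norm=1):
    """Bytes needed to keep one normalization factor per field per document in RAM."""
    return num_docs * num_fields * bytes_per_norm


def norm_bytes_per_box(num_docs, num_boxes, num_fields=1, bytes_per_norm=1):
    """Same figure when documents are split evenly across boxes (assumed even split)."""
    docs_per_box = -(-num_docs // num_boxes)  # ceiling division
    return norm_bytes(docs_per_box, num_fields, bytes_per_norm)


MB = 10**6
GB = 10**9

# Doug's example: 10M documents, five fields, one byte each.
print(norm_bytes(10_000_000, num_fields=5) / MB)                    # 50.0
# The same collection with two bytes per factor.
print(norm_bytes(10_000_000, num_fields=5, bytes_per_norm=2) / MB)  # 100.0
# Jonathan's estimate: roughly 3 billion documents on a single box.
print(norm_bytes(3_000_000_000) / GB)                               # 3.0
# The same collection spread evenly over 10,000 boxes.
print(norm_bytes_per_box(3_000_000_000, 10_000) / MB)               # 0.3
```

Under those assumptions each box would hold a few hundred kilobytes of
length data rather than 3G.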

Creighton 

On Tue, 2003-01-14 at 17:10, Jonathan Baxter wrote:
> I didn't realise document-length-precision was that unimportant for 
> ranking. What does Google do? If they pull 1 byte per document into  
> memory then - at least according to their claim for the number of 
> documents indexed -  that's over 3G. I can't see them equipping their 
> 10,000 linux machines with more than 3G memory each.
> 
> Apologies if this is off-topic for this list.
> 
> Cheers,
> 
> Jonathan 
> 
> 
> On Wednesday 15 January 2003 04:21, Doug Cutting wrote:
> > Jonathan Baxter wrote:
> > > How important is it for I/O performance that Lucene uses only one
> > > byte to represent document length? Or are there reasons other
> > > than performance for using so few bits?
> >
> > To achieve good search performance, field-length normalization
> > factors must be memory-resident.  So not only must the entire
> > contents of these files be read when searching, they must also be
> > kept in memory.  With the one-byte encoding this means that Lucene
> > requires a byte per indexed field per document.  So a 10M document
> > collection with five fields requires 50MB of memory to be searched.
> > Doubling these to two bytes would double this memory requirement.
> > Is that acceptable?  It depends on who you ask.
> >
> > Why do you find this insufficient?  The one byte float format (used
> > in the current, unreleased sources) can actually represent a large
> > range of values.  Its precision is low, but high-precision isn't
> > usually required for length normalization or Google-style boosting.
> >
> > Are you trying to use this for some other purpose in your ranking?
> >
> > Doug
> 
> 
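
For readers wondering how a single byte can cover a large range of
values, here is a minimal Python sketch of one possible one-byte float.
The bit layout is an assumption chosen for illustration, not something
stated in this thread: it keeps the exponent and the top two fraction
bits of an IEEE 754 single-precision value, shifts the range so that it
fits in 1..255, and reserves 0 for zero.  The 1/sqrt(length)
normalization in the usage lines is likewise only an assumed example.

```python
import math
import struct

SHIFT = 21             # low bits of the 32-bit float pattern that are discarded
OFFSET = (63 - 15) << 3  # rebias so the useful range lands in 1..255 (assumed constant)


def float_to_byte(f):
    """Encode a non-negative float into 0..255, truncating precision."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # signed 32-bit pattern
    small = bits >> SHIFT
    if small <= OFFSET:
        # Zero and negatives map to 0; tiny positives map to the smallest code.
        return 0 if bits <= 0 else 1
    if small >= OFFSET + 0x100:
        return 255  # too large: clamp to the biggest code
    return small - OFFSET


def byte_to_float(b):
    """Decode a value produced by float_to_byte."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << SHIFT) + (OFFSET << SHIFT)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# Round-trip a few hypothetical length-normalization factors.
for length in (10, 1000, 100000):
    norm = 1.0 / math.sqrt(length)
    print(length, norm, byte_to_float(float_to_byte(norm)))
```

With this layout 1.0 maps to byte 124 and the largest representable
value is 1.75 * 2**32, so the range is wide even though only four
distinct values exist per power of two.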



--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>