Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Message-ID: <3E244E31.4050600@lucene.com>
Date: Tue, 14 Jan 2003 09:51:45 -0800
From: Doug Cutting <cutting@lucene.com>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2) Gecko/20021202
MIME-Version: 1.0
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: Re: Lucene's use of one byte to encode document length
References: <200301141605.34218.jbaxter@panscient.com>
In-Reply-To: <200301141605.34218.jbaxter@panscient.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

Jonathan Baxter wrote:
> How important is it for I/O performance that Lucene uses only one byte 
> to represent document length? Or are there reasons other than 
> performance for using so few bits?

To achieve good search performance, field-length normalization factors
must be memory-resident.  So the files that store them must not only be
read in full when searching, their entire contents must also be kept in
memory.  With the one-byte encoding this means that Lucene requires one
byte per indexed field per document.  So a 10M-document collection with
five indexed fields requires 50MB of memory to be searched.  Doubling
the encoding to two bytes would double this memory requirement.  Is that
acceptable?  It depends on who you ask.
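
The arithmetic above can be sketched as follows.  This is only an
illustration: the class and method names are made up for this message
and are not part of Lucene.

```java
// Hypothetical helper, not a Lucene API: estimates the memory needed to
// keep every normalization factor resident while searching.
final class NormMemory {
    // One norm is stored per indexed field per document.
    static long bytesRequired(long documents, int indexedFields, int bytesPerNorm) {
        return documents * indexedFields * bytesPerNorm;
    }
}
```

With 10,000,000 documents and five indexed fields this gives 50,000,000
bytes at one byte per norm and 100,000,000 bytes at two.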

Why do you find this insufficient?  The one-byte float format (used in
the current, unreleased sources) can actually represent a large range of
values.  Its precision is low, but high precision isn't usually required
for length normalization or Google-style boosting.
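
To make the range-versus-precision trade-off concrete, here is a minimal
sketch of one way to pack a float into a byte.  It is an assumption for
illustration, not necessarily the exact format in the sources: the class
name, the constants and the bit layout (a narrowed exponent plus the two
leading mantissa bits of an IEEE 754 float) are all chosen for this
example.

```java
// Illustrative one-byte float; the layout and names are assumptions.
final class TinyFloat {
    // Dropping the 21 low bits of a 32-bit float keeps the sign, the
    // 8-bit exponent and the two leading mantissa bits.
    private static final int SHIFT = 21;
    // Truncated bit pattern that maps to code 0 (exponent field 96,
    // i.e. values around 2^-31).
    private static final int BIAS = 96 << 2;

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> SHIFT;
        if (small <= BIAS) {
            // Zero and negatives map to 0; tiny positives map to the
            // smallest nonzero code.
            return (byte) (bits <= 0 ? 0 : 1);
        }
        if (small >= BIAS + 0x100) {
            return (byte) 0xFF; // too large: saturate at the top code
        }
        return (byte) (small - BIAS);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = ((b & 0xFF) << SHIFT) + (BIAS << SHIFT);
        return Float.intBitsToFloat(bits);
    }
}
```

Under these assumptions the 255 nonzero codes span roughly 5.8e-10 to
7.5e9, but only four values are representable per power of two, so
encoding truncates: 1.0f round-trips exactly while 0.3f decodes as
0.25f.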

Are you trying to use this for some other purpose in your ranking?

Doug

