From: Doug Cutting
Date: Tue, 14 Jan 2003 09:51:45 -0800
To: Lucene Developers List
Subject: Re: Lucene's use of one byte to encode document length

Jonathan Baxter wrote:
> How important is it for I/O performance that Lucene uses only one byte
> to represent document length? Or are there reasons other than
> performance for using so few bits?

To achieve good search performance, field-length normalization factors must be memory-resident. So not only must the entire contents of these files be read when searching, they must also be kept in memory. With the one-byte encoding this means that Lucene requires a byte per indexed field per document.
So a 10M-document collection with five fields requires 50 MB of memory to be searched. Doubling the encoding to two bytes would double this memory requirement. Is that acceptable? It depends on who you ask.

Why do you find this insufficient? The one-byte float format (used in the current, unreleased sources) can actually represent a large range of values. Its precision is low, but high precision isn't usually required for length normalization or Google-style boosting. Are you trying to use this for some other purpose in your ranking?

Doug
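
As a rough illustration of the memory arithmetic and the one-byte float idea discussed in this message, here is a minimal Java sketch. The class and method names (NormSketch, normBytes, encode, decode) are hypothetical, and the bit layout (5 exponent bits, 3 mantissa bits, with a fixed exponent offset) is an assumption chosen for illustration; the message does not spell out the format actually used in the sources.

```java
/** Illustrative sketch only; names and bit layout are assumptions, not Lucene's API. */
final class NormSketch {

    // Assumed exponent offset that centers the representable range around 1.0.
    private static final int OFFSET = (63 - 15) << 3;

    /** Memory needed to keep one norm per indexed field per document resident. */
    static long normBytes(long numDocs, int indexedFields, int bytesPerNorm) {
        return numDocs * indexedFields * bytesPerNorm;
    }

    /** Packs a non-negative float into one byte (assumed 5-bit exponent, 3-bit mantissa). */
    static byte encode(float f) {
        if (f <= 0f) {
            return 0;                          // zero and negatives map to 0
        }
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;                // keep exponent plus top 3 mantissa bits
        if (small <= OFFSET) {
            return 1;                          // underflow: smallest positive code
        }
        if (small >= OFFSET + 0x100) {
            return (byte) 0xFF;                // overflow: largest code
        }
        return (byte) (small - OFFSET);
    }

    /** Expands a one-byte code back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0f;
        }
        int bits = (b & 0xFF) << 21;           // restore exponent and mantissa position
        bits += (63 - 15) << 24;               // add the exponent offset back
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // 10M documents, five indexed fields: one byte vs. two bytes per norm.
        System.out.println(normBytes(10_000_000L, 5, 1)); // 50000000
        System.out.println(normBytes(10_000_000L, 5, 2)); // 100000000

        // Low precision: nearby values collapse onto the same code.
        System.out.println(decode(encode(1.0f)));         // 1.0
        System.out.println(decode(encode(0.3f)));         // 0.25
    }
}
```

Under this assumed layout, 1.0 round-trips exactly while 0.3 decodes as 0.25, which is the kind of precision loss that is usually tolerable for length normalization.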