From: Jonathan Baxter
Reply-To: jbaxter@panscient.com
Organization: Panscient
To: "Lucene Developers List"
Subject: Re: Lucene's use of one byte to encode document length
Date: Wed, 15 Jan 2003 08:40:25 +1030

I didn't realise document-length precision was that unimportant for
ranking. What does Google do? If they pull 1 byte per document into
memory then, at least according to their claim for the number of
documents indexed, that's over 3G. I can't see them equipping their
10,000 Linux machines with more than 3G memory each.

Apologies if this is off-topic for this list.

Cheers,
Jonathan

On Wednesday 15 January 2003 04:21, Doug Cutting wrote:
> Jonathan Baxter wrote:
> > How important is it for I/O performance that Lucene uses only one
> > byte to represent document length? Or are there reasons other
> > than performance for using so few bits?
>
> To achieve good search performance, field-length normalization
> factors must be memory-resident. So not only must the entire
> contents of these files be read when searching, it must also be
> kept in memory. With the one byte encoding this means that Lucene
> requires a byte per indexed field per document. So a 10M document
> collection with five fields requires 50 MB of memory to be searched.
> Doubling these to two bytes would double this memory requirement.
> Is that acceptable? It depends on who you ask.
>
> Why do you find this insufficient? The one byte float format (used
> in the current, unreleased sources) can actually represent a large
> range of values. Its precision is low, but high-precision isn't
> usually required for length normalization or Google-style boosting.
>
> Are you trying to use this for some other purpose in your ranking?
>
> Doug
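
The memory figures in this thread are plain multiplication: documents times indexed fields times bytes per normalization factor. The sketch below reproduces that arithmetic. The class `NormMemoryEstimate` and the method `bytesNeeded` are hypothetical names made up for illustration and are not part of Lucene. The three-billion document count is only an assumed reading of the "over 3G" remark, not a figure confirmed anywhere in the thread.

```java
// Hypothetical helper, not part of Lucene: estimates how much memory is needed
// to keep length-normalization factors resident, using the rule from the thread
// (one factor per indexed field per document).
final class NormMemoryEstimate {

    // documents * indexedFields * bytesPerNorm, failing loudly on long overflow.
    static long bytesNeeded(long documents, int indexedFields, int bytesPerNorm) {
        long factors = Math.multiplyExact(documents, (long) indexedFields);
        return Math.multiplyExact(factors, (long) bytesPerNorm);
    }

    public static void main(String[] args) {
        // The example from the thread: 10M documents, five indexed fields.
        long oneByte = bytesNeeded(10_000_000L, 5, 1);
        long twoBytes = bytesNeeded(10_000_000L, 5, 2);
        System.out.println("one byte per factor:  " + oneByte + " bytes");
        System.out.println("two bytes per factor: " + twoBytes + " bytes");

        // Assumption: about three billion documents with a single field each,
        // as a rough reading of the "over 3G" estimate.
        long webScale = bytesNeeded(3_000_000_000L, 1, 1);
        System.out.println("assumed web-scale index: " + webScale + " bytes");
    }
}
```

Note that the estimate covers a single machine holding the factors for every document. It says nothing about how a collection might be split across machines, which the thread does not discuss.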