From: Jonathan Baxter <jbaxter@panscient.com>
Reply-To: jbaxter@panscient.com
Organization: Panscient
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Subject: Re: Lucene's use of one byte to encode document length
Date: Wed, 15 Jan 2003 08:40:25 +1030

I didn't realise document-length precision was that unimportant for
ranking. What does Google do? If they pull 1 byte per document into
memory then, at least according to their claim for the number of
documents indexed, that's over 3 GB. I can't see them equipping their
10,000 Linux machines with more than 3 GB of memory each.
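
To make the arithmetic concrete, here is a minimal sketch of the memory
estimate. The class and method names are hypothetical, and the 3 billion
document count is only an assumed round figure standing in for the claimed
index size; the 10M-document, five-field case is the one from Doug's
message quoted below.

```java
// Hypothetical helper, not part of Lucene: estimates the memory needed to
// keep one normalization factor per indexed field per document resident.
final class NormMemoryEstimate {

    // Bytes required = documents * indexed fields * bytes per factor.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static long normBytes(long documents, int indexedFields, int bytesPerNorm) {
        long perDocument = Math.multiplyExact((long) indexedFields, (long) bytesPerNorm);
        return Math.multiplyExact(documents, perDocument);
    }

    public static void main(String[] args) {
        // 10M documents, five fields, one byte each.
        System.out.println(normBytes(10_000_000L, 5, 1));    // 50000000
        // Same collection with a two-byte encoding.
        System.out.println(normBytes(10_000_000L, 5, 2));    // 100000000
        // Assumed 3 billion documents, one field, one byte each.
        System.out.println(normBytes(3_000_000_000L, 1, 1)); // 3000000000
    }
}
```

The last figure is for a single machine holding the factors for every
document; it says nothing about how an index might be split across
machines.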

Apologies if this is off-topic for this list.

Cheers,

Jonathan


On Wednesday 15 January 2003 04:21, Doug Cutting wrote:
> Jonathan Baxter wrote:
> > How important is it for I/O performance that Lucene uses only one
> > byte to represent document length? Or are there reasons other
> > than performance for using so few bits?
>
> To achieve good search performance, field-length normalization
> factors must be memory-resident.  So not only must the entire
> contents of these files be read when searching, they must also be
> kept in memory.  With the one-byte encoding this means that Lucene
> requires a byte per indexed field per document.  So a 10M document
> collection with five fields requires 50 MB of memory to be searched.
> Doubling the encoding to two bytes would double this memory
> requirement.  Is that acceptable?  It depends on who you ask.
>
> Why do you find this insufficient?  The one-byte float format (used
> in the current, unreleased sources) can actually represent a large
> range of values.  Its precision is low, but high precision isn't
> usually required for length normalization or Google-style boosting.
>
> Are you trying to use this for some other purpose in your ranking?
>
> Doug
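
For illustration, here is a sketch of how a one-byte float can cover a
large range at low precision. This is NOT taken from the Lucene sources:
the class name and the bit layout (keep the top bits of an IEEE 754 float,
leaving 2 mantissa bits and a re-based exponent) are assumptions chosen to
show the idea, and the 1/sqrt(length) factor used in main is likewise only
an assumed example of a length normalization.

```java
// Hypothetical one-byte float, for illustration only (not Lucene's format).
final class OneByteFloat {
    // Dropping the low 21 bits of a float keeps the sign bit, the 8 exponent
    // bits and the top 2 mantissa bits.
    private static final int SHIFT = 21;
    // Re-bases the exponent so the 255 non-zero codes span roughly
    // 5.8e-10 to 7.5e9.
    private static final int OFFSET = 96 << 2;

    // Encodes by truncation. Zero, negative values and NaN map to code 0.
    static byte encode(float f) {
        if (!(f > 0.0f)) return 0;
        int small = Float.floatToIntBits(f) >> SHIFT;
        if (small <= OFFSET) return 1;                 // underflow: smallest positive code
        if (small >= OFFSET + 256) return (byte) 255;  // overflow: largest code
        return (byte) (small - OFFSET);
    }

    // Decodes a code back to the float it stands for.
    static float decode(byte b) {
        if (b == 0) return 0.0f;
        return Float.intBitsToFloat(((b & 0xff) + OFFSET) << SHIFT);
    }

    public static void main(String[] args) {
        // Compare exact and round-tripped factors for a few field lengths.
        for (int length : new int[] {1, 10, 100, 1000, 10000}) {
            float exact = (float) (1.0 / Math.sqrt(length)); // assumed normalization
            float stored = decode(encode(exact));
            System.out.println(length + ": exact=" + exact + " stored=" + stored);
        }
    }
}
```

With this layout an in-range value is rounded down by less than about 20%,
which is coarse but leaves the relative ordering of short and long
documents largely intact.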

