Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Received-SPF: pass (nike.apache.org: local policy)
Date: Wed, 24 Dec 2008 11:03:00 -0800
To: java-dev@lucene.apache.org
Subject: Re: Realtime Search
Message-ID: <20081224190300.GA23787@rectangular.com>
References: <85d3c3b60812231751k60f00283r95b8d65b2b7adf45@mail.gmail.com>
 <20081224022229.GA17788@rectangular.com>
 <23E675E5-06AB-445F-B2E1-3755FCED8CBD@ix.netcom.com>
 <20081224032044.GA18006@rectangular.com>
 <A2F3CF25-F940-4E9F-B5F3-A74261B69A73@ix.netcom.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <A2F3CF25-F940-4E9F-B5F3-A74261B69A73@ix.netcom.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
From: Marvin Humphrey <marvin@rectangular.com>

On Tue, Dec 23, 2008 at 11:02:56PM -0600, robert engels wrote:
> Seems doubtful you will be able to do this without increasing the  
> index size dramatically. Since it will need to be stored  
> "unpacked" (in order to have random access), yet the terms are  
> variable length - leading to using a maximum=minimum size for every  
> term.

Wow.  That's a spectacularly awful design.  Its worst case -- one outlier
term, say, 1000 characters in length, in a field where the average term length
is in the single digits -- would explode the index size and incur wasteful IO
overhead, just as you say.

Good thing we've never considered it.  :)

I'm hoping we can improve on this, but for now, we've ended up at a two-file
design for the term dictionary index.

  1) Stacked 64-bit file pointers.
  2) Variable length character and term info data, interpreted using a 
     pluggable codec.

In the index at least, each entry would contain the full term text, encoded as
UTF-8.  Probably the primary term dictionary would continue to use string
diffs. 
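
A rough sketch of how the two files might fit together, in Java.  The class
and method names are invented for illustration, the trailing end-of-data
pointer is an assumption about how entry lengths would be recovered, and the
term info payload that the codec would interpret is left out entirely.  Each
entry costs 8 bytes of pointer plus its own UTF-8 length, so a single
1000-character outlier only enlarges its own entry.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch only: names and layout details are assumptions,
// not the actual Lucene or Lucy file format.
final class TermIndexSketch {
    final ByteBuffer pointers; // file 1: stacked 64-bit offsets, plus one end offset
    final ByteBuffer data;     // file 2: concatenated UTF-8 term text

    TermIndexSketch(ByteBuffer pointers, ByteBuffer data) {
        this.pointers = pointers;
        this.data = data;
    }

    // Terms must already be sorted by their UTF-8 bytes (for ASCII terms
    // this matches ordinary String order).
    static TermIndexSketch build(List<String> sortedTerms) {
        int n = sortedTerms.size();
        byte[][] encoded = new byte[n][];
        int total = 0;
        for (int i = 0; i < n; i++) {
            encoded[i] = sortedTerms.get(i).getBytes(StandardCharsets.UTF_8);
            total += encoded[i].length;
        }
        ByteBuffer pointers = ByteBuffer.allocate((n + 1) * Long.BYTES);
        ByteBuffer data = ByteBuffer.allocate(total);
        long offset = 0;
        for (byte[] term : encoded) {
            pointers.putLong(offset); // where this entry starts in the data file
            data.put(term);
            offset += term.length;
        }
        pointers.putLong(offset);     // end offset, so the last entry has a length
        return new TermIndexSketch(pointers, data);
    }

    int size() {
        return pointers.capacity() / Long.BYTES - 1;
    }

    // Compare entry `ord` against `target` as unsigned bytes, without copying.
    int compare(int ord, byte[] target) {
        int start = (int) pointers.getLong(ord * Long.BYTES);
        int end = (int) pointers.getLong((ord + 1) * Long.BYTES);
        int len = end - start;
        int n = Math.min(len, target.length);
        for (int i = 0; i < n; i++) {
            int a = data.get(start + i) & 0xFF;
            int b = target[i] & 0xFF;
            if (a != b) return a - b;
        }
        return len - target.length;
    }

    // Binary search over the fixed-width pointers; returns the entry
    // number, or -1 if the term is absent.
    int lookup(String term) {
        byte[] target = term.getBytes(StandardCharsets.UTF_8);
        int lo = 0, hi = size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compare(mid, target);
            if (c == 0) return mid;
            if (c < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }
}
```

The fixed-width pointer file is what provides random access; the variable
length data never needs padding.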

That design offers no significant benefits other than those that flow from
compatibility with mmap: faster IndexReader open/reopen, and lower RAM usage
under multiple processes by way of buffer sharing.  IO bandwidth requirements
and speed are probably a little better, but lookups on the term dictionary
index are not a significant search-time bottleneck.
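
For illustration, here is roughly what opening such a file through mmap could
look like in Java.  This is a minimal sketch: MmapSketch, mapReadOnly, and
roundTrip are made-up names, and the temp file stands in for a real pointer
file.  FileChannel.map is the standard NIO call; a single mapping is limited
to Integer.MAX_VALUE bytes, so a large file would need several mappings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch only: not how any existing IndexReader opens files.
final class MmapSketch {
    // Map a whole file read-only.  Nothing is parsed or copied onto the
    // heap here; pages are faulted in by the OS on first access.
    static ByteBuffer mapReadOnly(Path path) {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write 64-bit values to a temp file, map it, and read them back.
    static long[] roundTrip(long[] values) {
        try {
            Path tmp = Files.createTempFile("pointers", ".bin");
            tmp.toFile().deleteOnExit();
            ByteBuffer out = ByteBuffer.allocate(values.length * Long.BYTES);
            for (long v : values) out.putLong(v);
            Files.write(tmp, out.array());

            ByteBuffer mapped = mapReadOnly(tmp);
            long[] back = new long[values.length];
            for (int i = 0; i < back.length; i++) {
                back[i] = mapped.getLong(i * Long.BYTES); // absolute read
            }
            return back;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Opening is cheap because no data structure is rebuilt, and processes that map
the same file can share the same physical pages.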

Additionally, sort caches would be written at index time in three files, and
memory mapped as laid out in 
<https://issues.apache.org/jira/browse/LUCENE-831?focusedCommentId=12656150#action_12656150>.

  1) Stacked 64-bit file pointers.
  2) Character data.
  3) Doc num to ord mapping.
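
A hedged sketch of how those three pieces could be used, again in Java.
SortCacheSketch and its methods are hypothetical, the 32-bit ord width is an
assumption, and every document is assumed to have a value.  Heap buffers
stand in for the memory mapped files.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch only: layout details are assumptions.
final class SortCacheSketch {
    final ByteBuffer pointers; // 1) stacked 64-bit offsets, one per ord plus an end offset
    final ByteBuffer chars;    // 2) character data for the unique values, in sorted order
    final ByteBuffer docToOrd; // 3) one 32-bit ord per doc num (width is an assumption)

    // valueByDoc[doc] is the sort field value for that document.
    SortCacheSketch(String[] valueByDoc) {
        // Uses String natural order; a real cache would use the field's collation.
        TreeSet<String> unique = new TreeSet<>(Arrays.asList(valueByDoc));
        Map<String, Integer> ordOf = new HashMap<>();
        byte[][] encoded = new byte[unique.size()][];
        int total = 0;
        for (String v : unique) {
            int ord = ordOf.size(); // ords are assigned in sorted order
            ordOf.put(v, ord);
            encoded[ord] = v.getBytes(StandardCharsets.UTF_8);
            total += encoded[ord].length;
        }
        pointers = ByteBuffer.allocate((encoded.length + 1) * Long.BYTES);
        chars = ByteBuffer.allocate(total);
        long offset = 0;
        for (byte[] b : encoded) {
            pointers.putLong(offset);
            chars.put(b);
            offset += b.length;
        }
        pointers.putLong(offset);
        docToOrd = ByteBuffer.allocate(valueByDoc.length * Integer.BYTES);
        for (String v : valueByDoc) docToOrd.putInt(ordOf.get(v));
    }

    int ord(int doc) {
        return docToOrd.getInt(doc * Integer.BYTES);
    }

    // Sorting only needs the ords; the character data is never touched.
    int compareDocs(int a, int b) {
        return Integer.compare(ord(a), ord(b));
    }

    // Recover the actual value, e.g. to display it or merge across segments.
    String value(int doc) {
        int o = ord(doc);
        int start = (int) pointers.getLong(o * Long.BYTES);
        int end = (int) pointers.getLong((o + 1) * Long.BYTES);
        byte[] bytes = new byte[end - start];
        for (int i = 0; i < bytes.length; i++) bytes[i] = chars.get(start + i);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Comparing two documents is a pair of integer reads from the doc num to ord
file; the pointers and character data come into play only when the value
itself is needed.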

Marvin Humphrey

