lucene-dev mailing list archives

From "Marvin Humphrey (JIRA)" <>
Subject [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
Date Wed, 10 Dec 2008 03:10:44 GMT


Marvin Humphrey commented on LUCENE-831:

> Marvin, does KS/Lucy have something like FieldCache? If so, what API do you
> use? Is it iterator-only? 

At present, KS only caches the docID -> ord map as an array.  It builds that
array by iterating over the terms in the sort field's Lexicon and mapping the
docIDs from each term's posting list.
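A minimal sketch of that build step, written in Python rather than KS's C. The `lexicon` structure (sorted `(term, posting_list)` pairs), the function name, and the use of -1 for documents with no value are assumptions made for illustration, not the KS API. It also assumes each document has at most one term in the sort field.

```python
def build_doc_to_ord(lexicon, max_doc):
    """Build a docID -> ord array from a sorted lexicon.

    lexicon: iterable of (term, posting_list) pairs in sorted term order
             (hypothetical stand-in for a SegLexicon plus its posting lists).
    max_doc: number of documents in the segment.
    Returns a list where ords[doc_id] is the ordinal of that document's term,
    or -1 if the document has no value in the sort field.
    """
    ords = [-1] * max_doc
    for ord_num, (term, postings) in enumerate(lexicon):
        # Every document in this term's posting list shares the term's ordinal.
        for doc_id in postings:
            ords[doc_id] = ord_num
    return ords


lexicon = [("apple", [2]), ("banana", [0, 3]), ("cherry", [1])]
doc_to_ord = build_doc_to_ord(lexicon, max_doc=5)

# Sorting hits within the segment only needs integer comparisons on ords.
sorted_hits = sorted([0, 1, 2, 3], key=lambda doc_id: doc_to_ord[doc_id])
```

Here `doc_to_ord` comes out as `[1, 2, 0, 1, -1]`, and the four hits sort to `[2, 0, 3, 1]`.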

Building the docID -> ord array is straightforward for a single-segment
SegLexicon.  The multi-segment case requires that several SegLexicons be
collated using a priority queue.  In KS, there's a MultiLexicon class which
handles this; I don't believe that Lucene has an analogous class.
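The collation can be sketched roughly as follows, again in Python. The segment representation (a document count plus a sorted lexicon with segment-local docIDs) and the function name are hypothetical; this is not how MultiLexicon is actually implemented, only an illustration of merging sorted lexicons through a priority queue so that equal terms in different segments receive the same ordinal.

```python
import heapq


def build_multi_doc_to_ord(segments):
    """segments: list of (doc_count, lexicon) pairs, where each lexicon is a
    sorted list of (term, posting_list) with segment-local docIDs.
    Returns one docID -> ord array covering all segments."""
    # Assign each segment a base offset into the global docID space.
    bases = []
    total = 0
    for doc_count, _ in segments:
        bases.append(total)
        total += doc_count
    ords = [-1] * total

    # Seed the queue with the first entry of each segment's lexicon.
    iters = [iter(lex) for _, lex in segments]
    heap = []
    for seg_num, it in enumerate(iters):
        entry = next(it, None)
        if entry is not None:
            heapq.heappush(heap, (entry[0], seg_num, entry[1]))

    ord_num = -1
    prev_term = None
    while heap:
        term, seg_num, postings = heapq.heappop(heap)
        if term != prev_term:
            # A term not seen before gets the next global ordinal.
            ord_num += 1
            prev_term = term
        for local_id in postings:
            ords[bases[seg_num] + local_id] = ord_num
        # Refill the queue from the segment we just consumed.
        entry = next(iters[seg_num], None)
        if entry is not None:
            heapq.heappush(heap, (entry[0], seg_num, entry[1]))
    return ords


seg_a = (3, [("apple", [0]), ("cherry", [1, 2])])
seg_b = (2, [("banana", [0]), ("cherry", [1])])
multi_ords = build_multi_doc_to_ord([seg_a, seg_b])
```

With these two segments, `multi_ords` is `[0, 2, 2, 1, 2]`: "cherry" gets ordinal 2 in both segments.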

Relying on the docID -> ord array alone works quite well until you get to the
MultiSearcher case.  As you know, at that point you need to be able to
retrieve the actual field values from the ordinal numbers, so that you can
compare across multiple searchers (since ordinals assigned by one searcher
mean nothing to another):

Lex_Seek_By_Num(lexicon, term_num);
field_val = Lex_Get_Term(lexicon);

The problem is that seeking by ordinal value on a MultiLexicon iterator
requires a gnarly implementation and is very expensive.  I got it working, but
I consider it a dead-end design and a failed experiment.
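To illustrate why the field values are needed in the MultiSearcher case, here is a small Python sketch. Each hypothetical searcher is modeled as an ord -> term list plus hits sorted by local ordinal; the list lookup `terms[ord_num]` stands in for the seek-by-number and get-term pair above, and none of these names come from KS or Lucene.

```python
import heapq


def resolve_and_merge(searchers):
    """searchers: list of (terms, hits) pairs, where terms[ord] is the field
    value for a searcher-local ordinal and hits is a list of (ord, doc_id)
    sorted by ord. Returns (field_val, searcher_num, doc_id) tuples in
    global sort order."""
    resolved = []
    for searcher_num, (terms, hits) in enumerate(searchers):
        # Replace each local ordinal with the actual field value.
        resolved.append(
            [(terms[ord_num], searcher_num, doc_id) for ord_num, doc_id in hits]
        )
    # Each list is already sorted by value, so an n-way merge suffices.
    return list(heapq.merge(*resolved))


searcher_a = (["apple", "cherry"], [(0, 7), (1, 3)])
searcher_b = (["banana", "cherry", "date"], [(0, 1), (2, 4)])
merged = resolve_and_merge([searcher_a, searcher_b])
```

Ordinal 1 means "cherry" in the first searcher but "cherry" is ordinal 1 only by coincidence in the second, where ordinal 0 is "banana"; comparing raw ordinals would misplace hits, while comparing resolved values yields apple, banana, cherry, date.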

The planned replacement for these iterator-based quasi-FieldCaches involves
several topics of recent discussion:

  1) A "keyword" field type, implemented using a format similar to what Nate 
     and I came up with for the lexicon index.
  2) Write per-segment docID -> ord maps at index time for sort fields.
  3) Memory mapping.
  4) Segment-centric searching.

We'd mmap the pre-composed docID -> ord map and use it for intra-segment
sorting.  The keyword field type would be implemented in such a way that we'd
be able to mmap a few files and get a per-segment field cache, which we'd then
use to sort hits from multiple segments.
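A rough sketch of the mmap idea for the docID -> ord map, in Python. The file layout (one little-endian 32-bit ordinal per document), the file name, and the class and function names are all assumptions made for this example; the actual on-disk format has not been specified here.

```python
import mmap
import os
import struct
import tempfile

ORD_FORMAT = "<i"  # assumed layout: one little-endian 32-bit ord per doc
ORD_SIZE = struct.calcsize(ORD_FORMAT)


def write_ord_map(path, ords):
    """Index-time step: persist the pre-composed docID -> ord map."""
    with open(path, "wb") as f:
        for ord_num in ords:
            f.write(struct.pack(ORD_FORMAT, ord_num))


class MappedOrdMap:
    """Search-time view: read ords straight from the mapped file
    without first loading the whole array into process memory."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def ord_for(self, doc_id):
        return struct.unpack_from(ORD_FORMAT, self._map, doc_id * ORD_SIZE)[0]

    def close(self):
        self._map.close()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "sort_field.ords")
write_ord_map(path, [1, 2, 0, 1])

ord_map = MappedOrdMap(path)
sorted_hits = sorted([0, 1, 2, 3], key=ord_map.ord_for)
ord_map.close()
```

Intra-segment sorting then costs one fixed-offset read per hit; in this example the hits sort to `[2, 0, 3, 1]`.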

> Complete overhaul of FieldCache API/Implementation
> --------------------------------------------------
>                 Key: LUCENE-831
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Hoss Man
>             Fix For: 3.0
>         Attachments: fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff,
> fieldcache-overhaul.diff, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff,
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
> Motivation:
> 1) Completely overhaul the API/implementation of "FieldCache" type things...
>     a) eliminate global static map keyed on IndexReader (thus
>         eliminating synch block between completely independent IndexReaders)
>     b) allow more customization of cache management (i.e., use
>         expiration/replacement strategies, disk-backed caches, etc.)
>     c) allow people to define custom cache data logic (i.e., custom
>         parsers, complex datatypes, etc... anything tied to a reader)
>     d) allow people to inspect what's in a cache (list of CacheKeys) for
>         an IndexReader so a new IndexReader can be likewise warmed.
>     e) lend support for smarter cache management if/when
>         IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support the existing FieldCache API with
>     the new implementation, so there is no redundant caching as client code
>     migrates to the new API.
