lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "FlexibleIndexing" by MikeMcCandless
Date Sat, 26 Sep 2009 12:51:52 GMT

The "FlexibleIndexing" page has been changed by MikeMcCandless:
http://wiki.apache.org/lucene-java/FlexibleIndexing?action=diff&rev1=7&rev2=8

  
  == Planning ==
  
- Lucene1458 is actively working on adding flexible indexing to Lucene.
- 
  == Related Information ==
  ConversationsBetweenDougMarvinAndGrant
  
+ == Further steps towards flexible indexing ==
+ 
+ This section describes the high-level design of [[https://issues.apache.org/jira/browse/LUCENE-1458|LUCENE-1458]].
+ 
+ The top goal is to make Lucene extensible, even at its lowest levels, in what it records
into the index and how.  Your app should be able to easily store new things in the index,
or alter how existing things (doc IDs, positions, payloads, etc.) are encoded.
+ 
+ While storing new things into the index is possible with this change, it hasn't really been
tested yet.  I've been focusing so far on alternate ways to encode the "normal postings" (terms,
doc, freq, pos, payload) that Lucene stores.
+ 
+ 
+ === Major pieces ===
+ 
+  1. New postings enumeration API
+ 
+ 
+  A new "4d" (four-dimensional) enumeration API for reading postings data (FieldsEnum ->
TermsEnum -> DocsEnum -> PositionsEnum).  A consumer can choose to iterate over only
fields and terms (e.g. a MultiTermQuery), or over everything (e.g. SegmentMerger).  This
replaces today's TermEnum/TermDocs/TermPositions.
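
+  As a rough illustration of the nested enumeration described above, the following sketch
models the four levels with plain Java collections.  The PostingsSketch class and its
nested-map layout are hypothetical and are not the LUCENE-1458 classes; they only show how
one consumer can stop at the terms level while another descends all the way to positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model: field -> term -> docID -> positions.
// Not Lucene's API; it only mirrors the four nested levels.
final class PostingsSketch {
    private final Map<String, Map<String, Map<Integer, List<Integer>>>> fields = new TreeMap<>();

    void add(String field, String term, int doc, int position) {
        fields.computeIfAbsent(field, f -> new TreeMap<>())
              .computeIfAbsent(term, t -> new TreeMap<>())
              .computeIfAbsent(doc, d -> new ArrayList<>())
              .add(position);
    }

    // Shallow consumer: walks only fields and terms (as a multi-term query might).
    List<String> termsOnly() {
        List<String> out = new ArrayList<>();
        fields.forEach((field, terms) ->
            terms.keySet().forEach(term -> out.add(field + ":" + term)));
        return out;
    }

    // Deep consumer: walks all four levels (as a segment merger would).
    int countPositions() {
        int count = 0;
        for (Map<String, Map<Integer, List<Integer>>> terms : fields.values()) {
            for (Map<Integer, List<Integer>> docs : terms.values()) {
                for (List<Integer> positions : docs.values()) {
                    count += positions.size();
                }
            }
        }
        return count;
    }
}
```

+  The shallow consumer never touches the doc or position levels, which is the property the
layered API is meant to offer.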
+ 
+ 
+  These classes extend AttributeSource, so that an app could plug in its own attributes.
 For example, payloads could [in theory] now be implemented externally to Lucene.
+ 
+  This API represents terms in RAM more efficiently, by 1) keeping them in UTF-8 form (byte[]
instead of char[]), which is more efficient for ASCII-only terms data and trie terms, and 2)
allowing reuse of byte[] blocks with the TermRef class.  By contrast, Lucene today uses an
interned String field and a String text for every Term instance.
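
+  To make the two points above concrete, here is a minimal sketch, assuming a term reference
is just a slice (offset and length) of a shared byte[] block.  ByteSlice and TermBytesSketch
are hypothetical names, and their fields are assumptions rather than the actual TermRef layout.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a term reference: a slice of a shared byte block.
// The field names are assumptions, not taken from the TermRef class itself.
final class ByteSlice {
    final byte[] bytes;
    final int offset;
    final int length;

    ByteSlice(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    String utf8ToString() {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }
}

final class TermBytesSketch {
    // Packs all terms into one shared block and returns one slice per term.
    static ByteSlice[] pack(String... terms) {
        byte[][] encoded = new byte[terms.length][];
        int total = 0;
        for (int i = 0; i < terms.length; i++) {
            encoded[i] = terms[i].getBytes(StandardCharsets.UTF_8);
            total += encoded[i].length;
        }
        byte[] block = new byte[total];
        ByteSlice[] slices = new ByteSlice[terms.length];
        int upto = 0;
        for (int i = 0; i < terms.length; i++) {
            System.arraycopy(encoded[i], 0, block, upto, encoded[i].length);
            slices[i] = new ByteSlice(block, upto, encoded[i].length);
            upto += encoded[i].length;
        }
        return slices;
    }
}
```

+  For an ASCII-only term such as "lucene", the UTF-8 form takes 6 bytes, whereas a char[]
takes 2 bytes per character, i.e. 12 bytes, before any per-object overhead.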
+ 
+  One important API is TermsEnum.docs, which returns the DocsEnum for the current term. 
That method now takes an arbitrary "skipDocs", of type Bits, a new interface with just the
method {{{public boolean get(int index)}}}.  IndexReader.getDeletedDocs now also returns a
Bits.  The idea is to allow enumeration of the docs with a custom set of docs to skip.  This will also
make it easier to implement random-access filters (LUCENE-1536).
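
+  A minimal sketch of the skipDocs idea follows.  The Bits interface matches the single method
quoted above; the SkipDocsSketch helper, its int[] postings argument, and the treatment of
a null skipDocs as "skip nothing" are assumptions made only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// The single-method interface described above.
interface Bits {
    boolean get(int index);
}

final class SkipDocsSketch {
    // Hypothetical helper: returns the doc IDs from postings that skipDocs does not mark.
    // A null skipDocs means "skip nothing" (an assumption of this sketch).
    static List<Integer> enumerate(int[] postings, Bits skipDocs) {
        List<Integer> out = new ArrayList<>();
        for (int doc : postings) {
            if (skipDocs == null || !skipDocs.get(doc)) {
                out.add(doc);
            }
        }
        return out;
    }
}
```

+  Because Bits has a single method, anything with random access by doc ID can serve as
skipDocs, for example a java.util.BitSet of deleted docs passed as {{{deleted::get}}}, or
the complement of a filter.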
+ 
+  2. Codec based pluggability for postings
+ 
+  Make the writers and readers of the postings files (terms dict+index, freq/doc/pos/payload) pluggable.
 A new Codec class hides all details of how the 4d data is written.
+ 
+  All index format specifics have been moved out of oal.index.* and under oal.index.codecs.*.
 For example there is no more TermInfo class.  SegmentReader is now given a Codec impl that
knows how to decode the files into the 4d API.
+ 
+  Separately, there is a Codecs class that is responsible for 1) providing the default writer
(when creating a new segment) and 2) looking up a given codec by its name (when reading segments
previously written with different codecs).
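
+  The two responsibilities can be sketched as a small registry.  CodecSketch and
CodecRegistrySketch are hypothetical names, not the classes in the patch, and the sketch
assumes that each segment records the name of the codec that wrote it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal codec: only a name, standing in for the real reader/writer logic.
interface CodecSketch {
    String name();
}

// Hypothetical registry playing the role described for the Codecs class.
final class CodecRegistrySketch {
    private final Map<String, CodecSketch> byName = new HashMap<>();
    private final CodecSketch defaultCodec;

    CodecRegistrySketch(CodecSketch defaultCodec) {
        this.defaultCodec = defaultCodec;
        register(defaultCodec);
    }

    void register(CodecSketch codec) {
        byName.put(codec.name(), codec);
    }

    // Used when creating a new segment.
    CodecSketch getWriter() {
        return defaultCodec;
    }

    // Used when opening a segment written earlier, possibly with a non-default codec.
    CodecSketch lookup(String name) {
        CodecSketch codec = byName.get(name);
        if (codec == null) {
            throw new IllegalArgumentException("unknown codec: " + name);
        }
        return codec;
    }
}
```

+  Keeping lookup by name separate from the default means an index can mix segments written
by different codecs while new segments always use the default.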
+ 
+  3. A new "standard" (default) Codec, with improved terms dict index
+ 
+  The "standard" codec implements Lucene's default Codec for writing new segment files. 
The doc/freq/pos/payload format is nearly identical (except for a new header) to the format
today, but the terms dict/index is quite a bit more efficient in that it requires much less
RAM to load the terms index.
+ 
+  4. Some other interesting codecs
+ 
+  These are largely for testing, but we will want to make some of them available.  The pulsing
codec inlines postings for low-frequency terms directly into the terms dict.  The pfordelta
codec uses the PForDelta impl from [[https://issues.apache.org/jira/browse/LUCENE-1410|LUCENE-1410]]
to encode doc, freq, pos into their own files.
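
+  The pulsing idea can be sketched as follows, assuming a simple cutoff on the number of docs
per term.  PulsingSketch, its Entry class, and the cutoff parameter are hypothetical and do
not reflect the actual codec's file format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PulsingSketch {
    // Hypothetical terms-dict entry: doc IDs are either inlined or stored elsewhere.
    static final class Entry {
        final int[] inlineDocs; // non-null when the postings are inlined
        final int pointer;      // index into the separate postings store, or -1

        Entry(int[] inlineDocs, int pointer) {
            this.inlineDocs = inlineDocs;
            this.pointer = pointer;
        }
    }

    private final int cutoff; // largest doc count that is still inlined (assumed parameter)
    private final Map<String, Entry> termsDict = new TreeMap<>();
    private final List<int[]> postingsStore = new ArrayList<>();

    PulsingSketch(int cutoff) {
        this.cutoff = cutoff;
    }

    void addTerm(String term, int[] docs) {
        if (docs.length <= cutoff) {
            // Low-frequency term: keep the postings inside the terms dict entry.
            termsDict.put(term, new Entry(docs.clone(), -1));
        } else {
            // Higher-frequency term: store postings separately and keep only a pointer.
            postingsStore.add(docs.clone());
            termsDict.put(term, new Entry(null, postingsStore.size() - 1));
        }
    }

    boolean isInlined(String term) {
        return termsDict.get(term).inlineDocs != null;
    }

    int[] docs(String term) {
        Entry e = termsDict.get(term);
        return e.inlineDocs != null ? e.inlineDocs : postingsStore.get(e.pointer);
    }
}
```

+  For an inlined term, reading the postings needs no second lookup beyond the terms dict entry.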
+ 
+ 
+ === Current status ===
+ 
+ All tests pass for all the codecs except pfordelta, which fails because it's unable to encode
negative ints.  But Lucene only writes negative ints due to the deprecated bug from [[https://issues.apache.org/jira/browse/LUCENE-1542|LUCENE-1542]].
+ 
+ There are still many "nocommits" in the code, and more tests are needed.
+ 
