lucene-dev mailing list archives

From Li Li <fancye...@gmail.com>
Subject need some advice for modifying index format which supporting PForDelta compression/decompression
Date Wed, 24 Nov 2010 10:16:51 GMT
hi all
    I want to improve our search engine's throughput without any
hardware upgrades. My current task is optimizing the index format. We
already have some experience modifying the index format, gained by
reading Lucene's code and writing a little code of our own, such as
implementing bitmaps for high-frequency terms, top-position info
alongside the frq files, and modifiable fdt-style fields to meet our
needs.
   I want to integrate PForDelta into the index, but we can't wait for
Lucene 4's release, and we cannot migrate from Lucene 2.9 / Solr 1.4
to the new version on svn trunk.
   The index format in Lucene 2.9.3 is documented at
http://lucene.apache.org/java/2_9_3/fileformats.html#Frequencies

  FreqFile (.frq) --> <TermFreqs, SkipData> TermCount
  TermFreqs --> <TermFreq> DocFreq
  TermFreq --> DocDelta[, Freq?]
  SkipData --> <<SkipLevelLength, SkipLevel> NumSkipLevels-1, SkipLevel> <SkipDatum>
  SkipLevel --> <SkipDatum> DocFreq/(SkipInterval^(Level + 1))
  SkipDatum --> DocSkip, PayloadLength?, FreqSkip, ProxSkip, SkipChildLevelPointer?
  DocDelta, Freq, DocSkip, PayloadLength, FreqSkip, ProxSkip --> VInt
  SkipChildLevelPointer --> VLong
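
  As the fileformats page above explains, the TermFreqs for a term
which occurs once in document seven and three times in document eleven
would be the sequence of VInts: 15, 8, 3 (the doc delta is shifted
left one bit, and the low bit flags freq == 1, in which case Freq is
omitted).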

  The format can be summarized in this image:
http://hi.csdn.net/attachment/201002/2/3634917_1265115903L6FU.jpg

  I just want to change the compression algorithm from VInt to PForDelta.
  But PForDelta is a batch method that compresses/decompresses a whole
integer array at once, while VInt encodes/decodes one integer at a
time.
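
  To make the batching contrast concrete, here is roughly what
IndexInput.readVInt() does in 2.9 (paraphrased from memory, so treat
the details as approximate): one integer per call, with a
data-dependent branch on every byte.

  public int readVInt() throws IOException {
    byte b = readByte();
    int i = b & 0x7F;                        // low 7 bits of the value
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = readByte();                        // high bit set: more bytes follow
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

  A PForDelta codec instead exposes something like a (hypothetical)
decodeGroup(IndexInput in, int[] buf) that fills a whole 128-entry
array with a few tight loops.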
  The main implementation of TermDocs is SegmentTermDocs, and its
important methods are:
  void seek(TermInfo ti, Term term)
  public int read(final int[] docs, final int[] freqs)
  public boolean next()
  public boolean skipTo(int target)

  There are 2 major types of search: disjunction and conjunction.
  For disjunction, we decode the whole docList and the frequencies, so
public int read(final int[] docs, final int[] freqs) is the method
that matters; a rough sketch of what it does follows.
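
  Roughly, SegmentTermDocs.read() in 2.9 is one VInt-decoding loop
(again paraphrased from memory; details approximate):

  public int read(final int[] docs, final int[] freqs) throws IOException {
    final int length = docs.length;
    int i = 0;
    while (i < length && count < df) {
      final int docCode = freqStream.readVInt();
      doc += docCode >>> 1;                  // cumulative doc delta
      if ((docCode & 1) != 0)
        freq = 1;                            // low bit set: freq == 1, omitted
      else
        freq = freqStream.readVInt();        // otherwise freq follows
      count++;
      if (deletedDocs == null || !deletedDocs.get(doc)) {
        docs[i] = doc;
        freqs[i] = freq;
        ++i;
      }
    }
    return i;
  }

  This loop is exactly what a group codec would replace: decode 128
docCodes and 128 freqs in one call instead of one VInt at a time.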
  For conjunction, the doNext() function in ConjunctionScorer does the
major job:
  private int doNext() throws IOException {
    int first = 0;
    int doc = scorers[scorers.length - 1].docID();
    Scorer firstScorer;
    // leapfrog: repeatedly advance the lagging scorer to the current
    // candidate doc until every scorer agrees on the same docID
    while ((firstScorer = scorers[first]).docID() < doc) {
      doc = firstScorer.advance(doc);
      first = first == scorers.length - 1 ? 0 : first + 1;
    }
    return doc;
  }
  which finally calls TermScorer.advance(), which in turn calls
  boolean result = termDocs.skipTo(target);
  E.g., if term1's docList is 1 10 100
        and term2's docList is 1 2 3 ... 100,
  then with the skip list we don't need to decode all the docIDs of term2.
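
  For reference, TermScorer.advance() in 2.9 already follows a
cache-then-skip pattern: it first scans a small array of docs
previously filled by read(), and only calls termDocs.skipTo() on a
miss (sketched from memory; details approximate):

  public int advance(int target) throws IOException {
    for (pointer++; pointer < pointerMax; pointer++) {
      if (docs[pointer] >= target) {          // hit in the cached batch
        freq = freqs[pointer];
        return doc = docs[pointer];
      }
    }
    boolean result = termDocs.skipTo(target); // miss: use the skip list
    if (result) {
      pointerMax = 1;
      pointer = 0;
      docs[pointer] = doc = termDocs.doc();
      freqs[pointer] = freq = termDocs.freq();
    } else {
      doc = NO_MORE_DOCS;
    }
    return doc;
  }

  A decoded 128-entry PForDelta group would slot naturally into this
same cached-batch role.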

  That's my understanding of Lucene 2.9.3's implementation.

  Now I need to integrate PForDelta.
  My plan is as follows (a writer sketch follows the list):
  1 if df < 128, keep VInt encoding unchanged
  2 compress the ints in groups of 128; docList and tf are stored separately,
         e.g.    docList 1--128, tf 1--128  |  docList 129--256, tf 129--256 | ...
  3 make each skip-list leaf point to a group's start position (in
2.9.3 a skip-list leaf points to the exact position of the document)
  4 for a disjunction query, decode the whole group of docIDs and tfs
and cache the group's int arrays;
     for a conjunction query, likewise decode the whole group and
cache the docID and tf results (maybe the tfs are not needed when the
doc doesn't occur in the other terms, but for now we don't want to
modify the scoring algorithm)
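
  A minimal sketch of steps 1 and 2 as a postings writer, assuming a
PForDelta group encoder with the hypothetical signature
pforEncode(out, buf, start, len); the class and method names below are
mine, not Lucene's:

  import java.io.IOException;
  import org.apache.lucene.store.IndexOutput;

  class GroupedPostingsWriter {
    static final int GROUP_SIZE = 128;

    void write(IndexOutput out, int[] docDeltas, int[] freqs, int df)
        throws IOException {
      if (df < GROUP_SIZE) {                 // step 1: short lists stay VInt
        writeVInts(out, docDeltas, freqs, 0, df);
        return;
      }
      int start = 0;
      for (; start + GROUP_SIZE <= df; start += GROUP_SIZE) {
        pforEncode(out, docDeltas, start, GROUP_SIZE); // step 2: docs and tfs
        pforEncode(out, freqs, start, GROUP_SIZE);     // compressed separately
      }
      writeVInts(out, docDeltas, freqs, start, df);    // tail < 128 stays VInt
    }

    private void writeVInts(IndexOutput out, int[] docDeltas, int[] freqs,
                            int from, int to) throws IOException {
      for (int i = from; i < to; i++) {
        out.writeVInt(docDeltas[i]);
        out.writeVInt(freqs[i]);
      }
    }

    // hypothetical: wrap a real PForDelta implementation here
    // (e.g. the Kamikaze / LinkedIn PForDelta code)
    private void pforEncode(IndexOutput out, int[] buf, int start, int len)
        throws IOException { /* ... */ }
  }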

  Comparison with VInt:
  For a disjunction query, both approaches have to decode all the
docIDs and tfs, so PForDelta will be faster.
  For a conjunction query, it's harder to analyze.
  E.g., with skipInterval 1024, if we skipTo 10 and then 1100, VInt
only needs to decode 10 docIDs while PForDelta has to decode a
128-entry group.
  E.g., with skipInterval 1024, if we skipTo 10 and then 100, VInt
needs to decode 100 docIDs while PForDelta still decodes a 128-entry
group.
  I think PForDelta also supports partial decoding, but I don't know
whether it would be fast or slow compared with VInt.
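
  A sketch of what skipTo() might look like over cached groups (all
names here are hypothetical; this is the plan above, not existing
Lucene code):

  // fields of the (hypothetical) grouped SegmentTermDocs:
  // IndexInput in; int[] docBuf = new int[128], freqBuf = new int[128];
  // int pos, lastDocInGroup;

  public boolean skipTo(int target) throws IOException {
    if (target > lastDocInGroup) {
      seekToGroup(target);              // step 3: skip list points at group starts
      pforDecode(in, docBuf);           // step 4: decode the whole group
      pforDecode(in, freqBuf);          //         of docIDs and of tfs
      prefixSum(docBuf);                // turn deltas back into absolute docIDs
      lastDocInGroup = docBuf[docBuf.length - 1];
      pos = 0;
    }
    while (pos < docBuf.length && docBuf[pos] < target)
      pos++;                            // scan within the cached group
    return pos < docBuf.length;
  }

  The full-group decode on every cache miss is exactly the cost weighed
in the two examples above; partial decoding would cut it back toward
the VInt counts, at the price of a more complex decoder.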

  So the group size is critical: if it's too large, conjunction
queries may become slow; if it's too small, disjunction queries may
become slow.
  I took the group size 128 from the paper "Performance of Compressed
Inverted List Caching in Search Engines".
   Is there any other good method that improves both disjunction and
conjunction queries? Or any suggestions for me?
  Thank you!


