lucene-dev mailing list archives

From Li Li <fancye...@gmail.com>
Subject Re: need some advice for modifying index format which supporting PForDelta compression/decompression
Date Thu, 25 Nov 2010 09:07:10 GMT
Because we made some modifications to the index files, migrating to trunk
(4.0) is not an easy task.

I think partial decode is possible. To speed up decoding, PForDelta
first decodes all the slots, including the exception slots, without
checking whether each one is an exception; this lets the CPU predict the
next instruction correctly. It then walks the linked list to patch the
exceptions. Even when decoding non-exceptions, we can stop once we have
the values we need.

  e.g., consider the implementation at
http://code.google.com/p/integer-array-compress-kit/
  To decode 5-bit frames, it reads 5 packed integers per loop iteration,
which hold 5*32/5 = 32 encoded values.
  It first decodes 6 values from val1,
  then 6 values from val2 (together with the remaining bits of val1),
  ...
  So if we only need the value of, say, the 5th integer, we can stop
there and save the decoder's state,
  and if the next value needed is the 12th, we can resume from that state.

  The exceptions form a linked list, so it is also easy to stop and resume there.

  But I don't know how efficient this is; it needs some experiments
comparing it with the VInt decoder.

	static int fastDeCompressFor5Bit( int offSet, int[] encodedValue, int dataNum, int decodeOffset, int[] decode ){
		int maxBlocks	=  dataNum >> 5;
		int rest		=  dataNum % 32;
		// block process: each block reads 5 packed ints holding 32 5-bit values
		for( int block = 0 ; block < maxBlocks ; block++ ){
			int val1 = encodedValue[ offSet ++ ];
			int val2 = encodedValue[ offSet ++ ];
			int val3 = encodedValue[ offSet ++ ];
			int val4 = encodedValue[ offSet ++ ];
			int val5 = encodedValue[ offSet ++ ];

			decode[ decodeOffset ++ ]  =  ( val1 << 27 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val1 << 22 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val1 << 17 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val1 << 12 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val1 << 7  ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val1 << 2  ) >>> 27 ;

			decode[ decodeOffset ++ ]  =  ( ( val2 << 29 ) >>> 27 ) | ( val1 >>> 30 ) ;
			decode[ decodeOffset ++ ]  =  ( val2 << 24 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val2 << 19 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val2 << 14 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val2 << 9  ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val2 << 4  ) >>> 27 ;

			decode[ decodeOffset ++ ]  =  ( ( val3 << 31 ) >>> 27 ) | ( val2 >>> 28 ) ;
			decode[ decodeOffset ++ ]  =  ( val3 << 26 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val3 << 21 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val3 << 16 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val3 << 11 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val3 << 6  ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val3 << 1  ) >>> 27 ;

			decode[ decodeOffset ++ ]  =  ( ( val4 << 28 ) >>> 27 ) | ( val3 >>> 31 ) ;
			decode[ decodeOffset ++ ]  =  ( val4 << 23 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val4 << 18 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val4 << 13 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val4 << 8  ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val4 << 3  ) >>> 27 ;

			decode[ decodeOffset ++ ]  =  ( ( val5 << 30 ) >>> 27 ) | ( val4 >>> 29 ) ;
			decode[ decodeOffset ++ ]  =  ( val5 << 25 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val5 << 20 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val5 << 15 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val5 << 10 ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  ( val5 << 5  ) >>> 27 ;
			decode[ decodeOffset ++ ]  =  val5 >>> 27 ;
		}
		// (decoding of the trailing `rest` values, dataNum % 32, not shown)
		return decodeOffset;
	}

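The stop-and-resume idea could look roughly like the sketch below (names
and structure are my own, not the actual patch; it uses straightforward
bit extraction rather than the unrolled loop above, so it trades speed
for a small, resumable state):

```java
// Minimal sketch of stop-and-resume 5-bit decoding. The decoder keeps
// its position (which packed int, which bit) as its only state, so a
// caller can decode the first k values, stop, and resume later.
final class Resumable5BitDecoder {
    private final int[] packed;  // 5-bit values, 32 packed per 5 ints
    private int word;            // index of the current packed int
    private int bit;             // bit offset within packed[word]

    Resumable5BitDecoder(int[] packed) { this.packed = packed; }

    /** Decode the next value and advance the saved position. */
    int next() {
        int v = (packed[word] >>> bit) & 0x1F;
        if (bit > 27) { // the value spans into the next packed int
            v |= (packed[word + 1] << (32 - bit)) & 0x1F;
        }
        bit += 5;
        if (bit >= 32) { bit -= 32; word++; }
        return v;
    }
}
```

A caller that needs the 5th value decodes five values and stops; asking
later for the 12th just means calling next() seven more times, since the
(word, bit) pair already points at value 6.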
2010/11/25 Michael McCandless <lucene@mikemccandless.com>:
> Are you sure you can't use trunk?  Swapping in PForDelta as a codec is
> much easier and we are wanting to do that, anyway (as a hybrid codec
> w/ Pulsing as well), before releasing 4.0 :)
>
> Partial decode of PForDelta is challenging I think... because the
> exceptions are a linked list, you have to walk the exceptions up
> front?
>
> That's a nice idea to interleave the encoding of 128 docDeltas then
> 128 docFreqs... in trunk's Sep codec we currently put them in separate
> files.
>
> Mike
>
> On Wed, Nov 24, 2010 at 5:16 AM, Li Li <fancyerii@gmail.com> wrote:
>> hi all
>>    I want to improve our search engine's throughput without any
>> hardware improvements. My task now is optimizing the index format. We
>> already have some experience modifying the index format, from reading
>> the Lucene code and writing a little code of our own, such as bitmaps
>> for high-frequency terms, top-position info alongside the frq files,
>> and modifiable fdt-style fields to meet our needs.
>>   I want to integrate PForDelta into the index but can't wait for
>> Lucene 4's release date, and we cannot migrate Lucene 2.9/Solr 1.4 to
>> the new version in svn trunk.
>>   the index format in lucene 2.9.3 is
>> http://lucene.apache.org/java/2_9_3/fileformats.html#Frequencies
>>
>>  FreqFile (.frq) --> <TermFreqs, SkipData> TermCount
>>  TermFreqs --> <TermFreq> DocFreq
>>  TermFreq --> DocDelta[, Freq?]
>>  SkipData --> <<SkipLevelLength, SkipLevel> NumSkipLevels-1,
>> SkipLevel> <SkipDatum>
>>  SkipLevel --> <SkipDatum> DocFreq/(SkipInterval^(Level + 1))
>>  SkipDatum --> DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
>>  DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --> VInt
>>  SkipChildLevelPointer --> VLong
>>
>>  which can be summarized in this image
>> http://hi.csdn.net/attachment/201002/2/3634917_1265115903L6FU.jpg
>>
>>  I just want to change the compression algorithm from VInt to PForDelta.
>>  But PForDelta is a batching method that compresses/decompresses an
>> integer array, while VInt encodes/decodes one value at a time.
>>  The main implementation of TermDocs is SegmentTermDocs.
>>  And some important methods are
>>  void seek(TermInfo ti, Term term)
>>  public int read(final int[] docs, final int[] freqs)
>>  public boolean next()
>>  public boolean skipTo(int target)
>>
>>  There are 2 major types of search: disjunction and conjunction.
>>  For disjunction, we decode the whole docList and frequencies, so public
>> int read(final int[] docs, final int[] freqs) is needed.
>>  For conjunction, ConjunctionScorer's doNext function does the major job:
>>  private int doNext() throws IOException {
>>    int first = 0;
>>    int doc = scorers[scorers.length - 1].docID();
>>    Scorer firstScorer;
>>    while ((firstScorer = scorers[first]).docID() < doc) {
>>      doc = firstScorer.advance(doc);
>>      first = first == scorers.length - 1 ? 0 : first + 1;
>>    }
>>    return doc;
>>  }
>>  which finally calls TermScorer.advance(), which calls
>>  boolean result = termDocs.skipTo(target);
>>  e.g. term1's docList is 1 10 100
>>       term2's docList is 1 2 3 ... 100
>>  then with the skip list, we don't need to decode all the docIDs of term2
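
The leapfrog pattern behind doNext can be sketched with plain arrays (a
hypothetical intersect helper of my own; Lucene's skipTo jumps via the
skip list where this loop advances linearly):

```java
// Sketch: intersect two sorted doc-ID lists by leapfrogging. Each list
// only advances past the other list's current doc; in Lucene, skipTo()
// makes that advance cheap via the skip list instead of this step loop.
static java.util.List<Integer> intersect(int[] a, int[] b) {
    java.util.List<Integer> hits = new java.util.ArrayList<>();
    int i = 0, j = 0;
    while (i < a.length && j < b.length) {
        if (a[i] == b[j]) { hits.add(a[i]); i++; j++; }
        else if (a[i] < b[j]) i++;  // would be i = skipTo(b[j])
        else j++;                   // would be j = skipTo(a[i])
    }
    return hits;
}
```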
>>
>>  That's my understanding of lucene 2.9.3's implementation
>>
>>  Now I need to integrate PForDelta.
>>  My plan is as follows:
>>  1 if df < 128, stay with VInt, unchanged
>>  2 compress 128 ints per group; the docList and tf are stored separately
>>         e.g.  docList 1-128, tf 1-128 | docList 129-256, tf 129-256 | ...
>>  3 the skip list's leaves point to a group's start position (in 2.9.3
>> a leaf of the skip list points to the exact position of the document)
>>  4 for a disjunction query, we decode the whole group of docIDs and tfs
>> and cache the group's int arrays;
>>     for a conjunction query, we also decode the whole group and cache
>> the group's docIDs and tfs (maybe the tf is not needed when a doc
>> doesn't occur in the other terms, but for now we don't want to modify
>> the scoring algorithm)
>>
>>  Comparison with VInt:
>>  for a disjunction query, both need to decode all the docIDs and tfs,
>> so PForDelta will be faster.
>>  for a conjunction query, it's hard to analyze:
>>  e.g. if skipInterval is 1024 and we skipTo 10 and then 1100, VInt
>> only needs to decode 10 docIDs while PForDelta needs to decode 128 docIDs
>>  e.g. if skipInterval is 1024 and we skipTo 10 and then 100, VInt needs
>> to decode 100 docIDs while PForDelta still needs to decode 128 docIDs
>>  I think PForDelta also supports partial decode, but I don't know
>> whether it is fast or slow compared with VInt.
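
For reference, VInt decoding in the 2.9.x file format looks roughly like
this (7 data bits per byte, low bits first, high bit set on every byte
except the last); skipping to the k-th delta costs k such calls, while a
128-value PFor group is decoded as a unit:

```java
// Decode one VInt from buf starting at pos[0], advancing pos[0] past
// the consumed bytes. Each byte contributes 7 bits, least significant
// group first; a set high bit means another byte follows.
static int readVInt(byte[] buf, int[] pos) {
    byte b = buf[pos[0]++];
    int v = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = buf[pos[0]++];
        v |= (b & 0x7F) << shift;
    }
    return v;
}
```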
>>
>>  So the group size is critical: if it's too large, conjunction queries
>> may become slow; if it's too small, disjunction queries may become slow.
>>  I found 128 in the paper "Performance of Compressed Inverted List
>> Caching in Search Engines".
>>   Any other good methods to improve both disjunction and conjunction
>> queries? Or any suggestions for me?
>>  Thank you!
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
