lucene-java-user mailing list archives

From Alex vB
Subject Re: New codecs keep Freq skip/omit Pos
Date Sat, 23 Apr 2011 18:06:01 GMT
Hi Robert,

The adapted codec is running, but it seems to be incredibly slow. It will
take some time ;)
Here are some performance results:

Indexing scheme                Index size                 Avg. query performance  Max. query performance
-----------------------------  -------------------------  ----------------------  ----------------------
PforDelta2 W Freq W Pos        20.6 GB (3.3 GB w/o .pos)  81.97 ms                1295 ms
PforDelta2 W/O Freq W/O Pos    1.6 GB                     63.33 ms                766 ms
Standard 4.0 W Freq W Pos      28.1 GB (8.1 GB w/o .prx)  77.71 ms                978 ms
Standard 4.0 W/O Freq W/O Pos  6.2 GB                     59.93 ms                718 ms
Standard 3.0 W Freq W Pos      28.1 GB (8.1 GB w/o .prx)  71.41 ms                978 ms
Standard 3.0 W/O Freq W/O Pos  6.2 GB                     72.72 ms                845 ms
PforDelta W Freq W Pos         22 GB (5 GB w/o .pos)      67.98 ms                783 ms
PforDelta W/O Freq W/O Pos     3.1 GB                     56.08 ms                596 ms
Huffman BL10 W Freq W/O Pos    2.6 GB                     216.29 ms (Mem 14 ms)   1338 ms

I am a little puzzled by the Lucene 3.0 results, because the larger index
seems to be faster on average (71.41 ms vs. 72.72 ms). I have already run
the test several times. Are my results realistic at all? I thought
PForDelta and PForDelta2 would outperform the standard index
implementations in query processing.

The last result is my own implementation. I am still trying to make it
smaller because I think I can improve compression further. For indexing I
use PForDelta2 in combination with payloads. The payloads are causing the
higher runtimes. In memory it looks nice. The gap between my solution and
PForDelta is already 700 MB. I would say it is an improvement. :D I will
have another look at it once I have an index built with your adapted
implementation.

I still have another question. The basic idea in my implementation is to
create a "Two-Level" index structure. It is specialized for versioned
document collections. On the first level I create a posting list entry for a
document whenever a term occurs in one or more of its versions. The second
level holds the corresponding term frequency information. Is it possible to
build such a structure by creating a codec? For query processing it should
evaluate the boolean query on the first level and only fetch information from
the second level when the document is in the intersection of the first-level
posting lists. At the moment I use payloads to "simulate" a two-level structure.
Normally all payloads corresponding to a query get fetched, right?
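
To make the access pattern I have in mind more concrete, here is a minimal
in-memory sketch in plain Java. It is not a Lucene codec and uses no Lucene
API. The class name TwoLevelIndexSketch and all of its methods are
hypothetical names made up for this illustration, and the hash maps only
stand in for the real compressed posting lists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory sketch of a two-level index; not a Lucene codec. */
class TwoLevelIndexSketch {
    // Level 1: term -> sorted document IDs. A document appears once,
    // no matter how many of its versions contain the term.
    private final Map<String, TreeSet<Integer>> firstLevel = new HashMap<>();

    // Level 2: term -> docId -> (version -> term frequency).
    private final Map<String, Map<Integer, TreeMap<Integer, Integer>>> secondLevel =
            new HashMap<>();

    // Counts second-level lookups so the lazy access pattern can be observed.
    private int secondLevelFetches = 0;

    void addOccurrence(String term, int docId, int version, int freq) {
        firstLevel.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        secondLevel.computeIfAbsent(term, t -> new HashMap<>())
                   .computeIfAbsent(docId, d -> new TreeMap<>())
                   .merge(version, freq, Integer::sum);
    }

    /** Boolean AND evaluated on the first level only. */
    List<Integer> intersect(List<String> terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            TreeSet<Integer> postings = firstLevel.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    /** Fetches per-version frequencies only for documents that survive level 1. */
    Map<Integer, Map<String, Map<Integer, Integer>>> query(List<String> terms) {
        Map<Integer, Map<String, Map<Integer, Integer>>> hits = new TreeMap<>();
        for (int docId : intersect(terms)) {
            Map<String, Map<Integer, Integer>> perTerm = new HashMap<>();
            for (String term : terms) {
                secondLevelFetches++;
                perTerm.put(term, secondLevel.get(term).get(docId));
            }
            hits.put(docId, perTerm);
        }
        return hits;
    }

    int secondLevelFetches() {
        return secondLevelFetches;
    }
}
```

In this sketch a query for two terms touches the second level only for
documents that contain both terms in at least one version. That is the
behaviour I would like to get from a codec instead of from payloads.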

If this structure is possible, there are several more implementations
with promising results (Two-Level Diff/MSA in this paper

Regards Alex
