lucene-java-user mailing list archives

From Alex vB <>
Subject Re: New codecs keep Freq skip/omit Pos
Date Sat, 23 Apr 2011 20:39:44 GMT
> it depends upon the type of query.. what queries are you using for
> this benchmarking and how are you benchmarking?
> FYI: for benchmarking standard query types with wikipedia you might be
> interested in

I have 10000 queries from an AOL data set where the followed link led to
I benchmark by warming up the IndexSearcher with 5000 queries and run the
test with the remaining 5000. I only measure the time needed to execute
the queries. I use QueryParser.
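
A minimal sketch of this warm-up/measure split, not the actual benchmark
code. The class name, the method name and the `runQuery` callback are
hypothetical; in a real run the callback would parse the string with
QueryParser and execute it against the IndexSearcher.

```java
import java.util.List;
import java.util.function.Consumer;

class WarmupBenchmark {
    /**
     * Runs the first warmupCount queries untimed, then times the rest.
     * Returns the elapsed wall-clock time of the timed part in nanoseconds.
     */
    static long timeQueries(List<String> queries, int warmupCount, Consumer<String> runQuery) {
        // Warm-up phase: nothing is measured here.
        for (String q : queries.subList(0, warmupCount)) {
            runQuery.accept(q);
        }
        // Measured phase: only the remaining queries are timed.
        long start = System.nanoTime();
        for (String q : queries.subList(warmupCount, queries.size())) {
            runQuery.accept(q);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        List<String> queries = List.of("lucene codec", "payload query", "skip list", "huffman");
        // Hypothetical stand-in for "parse and search"; replace with real calls.
        Consumer<String> runQuery = q -> q.hashCode();
        long nanos = timeQueries(queries, queries.size() / 2, runQuery);
        System.out.println("timed half took " + nanos + " ns");
    }
}
```

Timing the whole batch rather than single queries keeps the clock overhead
out of the numbers, at the cost of hiding per-query variance.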

> wait, you are indexing payloads for your tests with these other codecs
> when it says "W POS" ?

No, only my last implementation uses payloads; the others do not. Therefore
I use a payload-aware query for Huffman.

> keep in mind that even adding a single payload to your index slows
> down the decompression of the positions tremendously, because payload
> lengths are intertwined with the positions. For block codecs payloads
> really need to be done differently so that blocks of positions are
> really just blocks of positions. This hasn't yet been fixed for the
> sep nor the fixed layouts, so if you add any payloads, and then
> benchmark positional queries then the results are not realistic.

Oh, I know that payloads slow down query processing, but I wasn't aware of
the block codec problem. I assume that by "not realistic" you mean they will
be slower? Some numbers for Huffman:
      20 bytes  segments.gen
    234.6 KB    fdt
      1.8 MB    fdx
      20 bytes  fnm
    626.1 MB    pos
      1.7 GB    pyl
     17.8 MB    skp
     39.8 MB    tib
   2028.5 KB    tiv
     268 bytes  segments_2
    214.6 MB    doc

For query processing I used PayloadQueryParser here and adapted the
similarity to my payloads.

> No they do not, only if you use a payload based query such as
> PayloadTermQuery. Normal non-positional queries like TermQuery and
> even normal positional queries like PhraseQuery don't fetch payloads
> at all...

Sorry, my question was misleading. I was already focusing on a payload-aware
query. When I use one, how exactly is the payload information fetched from
disk? For example, if a query needs to read two posting lists: are all
payloads fetched for them directly, or does Lucene first compute a boolean
intersection and then retrieve the payloads for documents within that

> From the description of what you are doing I don't understand how
> payloads fit in because they are per-position? But, I haven't had the
> time to digest the paper you sent yet.

I will try to summarize it and how I adapted it to Lucene. 

I already mentioned the idea of two levels for versioned document
collections. When I parse Wikipedia, I merge the terms of all versions of an
article into one bag of words. From this bag I extract each distinct term and
index it with Lucene as a single document. Frequency information is now
"lost" on the first level but is stored on the second. This is what I meant
by "The first level contains a posting for a document when a term occurs in
at least one version". For example, if an article has two versions,
version1: "a b b" and version2: "a a a c c", only 'a', 'b' and 'c' are
indexed.
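
The first-level step can be sketched roughly as follows. This is only an
illustration of the idea, not my parser: the class and method names are made
up, and plain whitespace splitting stands in for whatever analyzer is
actually used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class FirstLevelTerms {
    /** Unions the terms of all versions of one article into a set of distinct terms. */
    static SortedSet<String> distinctTerms(List<String> versions) {
        SortedSet<String> terms = new TreeSet<>();
        for (String version : versions) {
            // Assumption: whitespace tokenization; a real analyzer would go here.
            String trimmed = version.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            terms.addAll(Arrays.asList(trimmed.split("\\s+")));
        }
        return terms;
    }

    public static void main(String[] args) {
        // The two versions from the example above.
        System.out.println(distinctTerms(List.of("a b b", "a a a c c"))); // prints [a, b, c]
    }
}
```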

For the second level I collected term frequency information during parsing.
These frequencies are stored as a vector in version order. For the above
example, the frequency vector for 'a' would be [1,3]. I store these vectors
as payloads, which I regard as the "second level". Every distinct term on the
first level receives a single frequency vector at its first position. So I
am somewhat abusing payloads.
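
A rough sketch of how such frequency vectors can be built, assuming
whitespace tokenization and hypothetical names. It only computes the
vectors; encoding them into payload bytes is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FrequencyVectors {
    /** For each distinct term, counts its occurrences in every version, in version order. */
    static Map<String, int[]> build(List<String> versions) {
        Map<String, int[]> vectors = new TreeMap<>();
        for (int v = 0; v < versions.size(); v++) {
            String trimmed = versions.get(v).trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            for (String term : trimmed.split("\\s+")) {
                // A term that is absent from a version keeps the default count 0 in that slot.
                int[] vector = vectors.computeIfAbsent(term, t -> new int[versions.size()]);
                vector[v]++;
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        build(List.of("a b b", "a a a c c"))
            .forEach((term, vector) -> System.out.println(term + " -> " + Arrays.toString(vector)));
        // prints:
        // a -> [1, 3]
        // b -> [2, 0]
        // c -> [0, 2]
    }
}
```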

For query processing I now need to retrieve the docs and the payloads.
Ideally, the posting lists would be processed first, ignoring payloads, and
the payloads (frequency information) would then be fetched only for the
remaining docs. The term frequency is then used for ranking. At the moment I
rank by the highest value in the frequency vector, which corresponds to the
best-matching version.
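
The order of operations I have in mind can be sketched as below. This is not
Lucene's API and says nothing about what Lucene does internally: posting
lists are plain sorted int arrays, and `fetchVector` is a hypothetical
callback standing in for the payload read of a single term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

class TwoPhaseScoring {
    /** Phase 1: intersects two ascending doc-id lists without touching any payload. */
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return hits;
    }

    /** Ranking value: the highest entry of the frequency vector (0 if empty). */
    static int maxFrequency(int[] vector) {
        return Arrays.stream(vector).max().orElse(0);
    }

    /** Phase 2: fetches a frequency vector only for docs that survived phase 1. */
    static Map<Integer, Integer> score(int[] a, int[] b, IntFunction<int[]> fetchVector) {
        Map<Integer, Integer> scores = new LinkedHashMap<>();
        for (int doc : intersect(a, b)) {
            scores.put(doc, maxFrequency(fetchVector.apply(doc)));
        }
        return scores;
    }

    public static void main(String[] args) {
        int[] termA = {1, 3, 5, 7};
        int[] termB = {3, 4, 5, 8};
        // Hypothetical payload store: doc id -> frequency vector in version order.
        Map<Integer, int[]> payloads = Map.of(3, new int[] {1, 3}, 5, new int[] {2, 0});
        System.out.println(score(termA, termB, doc -> payloads.get(doc))); // prints {3=3, 5=2}
    }
}
```

Only docs 3 and 5 trigger a payload read here; docs 1, 4, 7 and 8 are
dropped by the intersection before any frequency vector is touched.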


To unsubscribe, e-mail:
For additional commands, e-mail:
