lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2492) Make PulsingCodec (wrapping StandardCodec) the default codec
Date Mon, 07 Jun 2010 17:19:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876318#action_12876318 ]

Shai Erera commented on LUCENE-2492:
------------------------------------

The thing is, there is a performance penalty to storing too many bytes in the terms dict,
because it may slow down term lookup. docFreq may not be a very good criterion for deciding
what to inline. For example, a term may have a single posting with a huge payload. Or a term
may be associated with a few documents whose IDs are successive, so they compress much better
than a term with one doc whose ID is 1M.

#bytes is also something you can measure. Lucene should behave the same whenever the entries
are 20 bytes total, which is not a collection-specific setting. The point is, if you've measured
terms dict lookup when entries are 20 bytes in length, you know how it performs, and it will
perform like that for every collection. But if you perf test with docFreq=3, it will perform
differently on different collections ...

Also, a #bytes limit makes it easy to compute the size consumed.
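
To make the difference concrete, here is a rough sketch in Java. It is not Lucene's actual postings format or any PulsingCodec API: it assumes plain VInt delta-encoding of doc IDs, and the class name, method names, and the 20-byte / docFreq=3 thresholds are made up for illustration. It only shows how the encoded size can disagree with docFreq.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; none of these names exist in Lucene.
public class InlineBySizeSketch {

    // Variable-length int: 7 data bits per byte, high bit set when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Bytes taken by the delta-encoded (sorted) doc IDs plus any payload bytes.
    // Assumption: freqs and positions are ignored to keep the sketch small.
    static int encodedSize(int[] sortedDocIds, int payloadBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int id : sortedDocIds) {
            writeVInt(out, id - last);
            last = id;
        }
        return out.size() + payloadBytes;
    }

    // The two candidate inlining rules being compared in this thread.
    static boolean inlineByDocFreq(int docFreq, int maxDocFreq) {
        return docFreq <= maxDocFreq;
    }

    static boolean inlineByBytes(int sizeInBytes, int maxBytes) {
        return sizeInBytes <= maxBytes;
    }

    // With a byte limit, the space inlined entries can add is bounded up front.
    static long worstCaseInlinedBytes(long inlinedTerms, int maxBytes) {
        return inlinedTerms * maxBytes;
    }

    public static void main(String[] args) {
        int twoSuccessiveDocs = encodedSize(new int[] {10, 11}, 0);
        int oneDocAtOneMillion = encodedSize(new int[] {1_000_000}, 0);
        int oneDocHugePayload = encodedSize(new int[] {7}, 500);

        System.out.println("docFreq=2, successive IDs: " + twoSuccessiveDocs + " bytes");
        System.out.println("docFreq=1, ID 1M:          " + oneDocAtOneMillion + " bytes");
        System.out.println("docFreq=1, 500B payload:   " + oneDocHugePayload + " bytes");

        // A docFreq rule would inline the huge-payload term; a byte rule would not.
        System.out.println("inline by docFreq<=3: " + inlineByDocFreq(1, 3));
        System.out.println("inline by bytes<=20:  " + inlineByBytes(oneDocHugePayload, 20));
    }
}
```

Under these assumptions the term with two successive docs is smaller than the term with a single doc at ID 1M, and a docFreq cutoff happily inlines the 500-byte payload while a byte cutoff rejects it.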

> Make PulsingCodec (wrapping StandardCodec) the default codec
> ------------------------------------------------------------
>
>                 Key: LUCENE-2492
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2492
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>
> PulsingCodec can provide good gains, by inlining the postings into the terms dict for
rare terms.  This is especially helpful for primary key like fields, since every term is rare
and batch lookups are common (see http://chbits.blogspot.com/2010/06/lucenes-pulsingcodec-on-primary-key.html
for a simple perf test), but it should also be a gain for ordinary fields, thanks to Zipf's
law.
> I think we should make it the default....


