lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2492) Make PulsingCodec (wrapping StandardCodec) the default codec
Date Tue, 08 Jun 2010 17:07:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876732#action_12876732 ]

Michael McCandless commented on LUCENE-2492:
--------------------------------------------

bq. We can encode whether the posting is embedded or not by storing a byte or a negative pointer
for example. There are ways to do it with minimal to no more space.

Remember that vInt/vLong don't handle negative numbers well (they take the max # bytes, I think).
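
To make the size concern concrete, here is a minimal sketch of a 7-bits-per-byte variable-length encoding in the style of vInt/vLong. It is not Lucene's actual code: the class and method names (VIntSketch, writeVInt, writeVLong, withFlagBit) are hypothetical, and the flag-in-low-bit helper is just one assumed way to mark "embedded" without going negative.

```java
import java.io.ByteArrayOutputStream;

public class VIntSketch {
    // 7 payload bits per byte; a set high bit means "more bytes follow".
    public static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so negative values still terminate
        }
        out.write(value);
        return out.toByteArray();
    }

    // Same scheme for 64-bit values.
    public static byte[] writeVLong(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Assumed alternative to a negative pointer: keep the pointer
    // non-negative and carry the "embedded" flag in the lowest bit.
    public static long withFlagBit(long pointer, boolean embedded) {
        return (pointer << 1) | (embedded ? 1L : 0L);
    }

    public static void main(String[] args) {
        System.out.println("vInt(127)  bytes: " + writeVInt(127).length);
        System.out.println("vInt(-1)   bytes: " + writeVInt(-1).length);
        System.out.println("vLong(-1000) bytes: " + writeVLong(-1000L).length);
        System.out.println("vLong(flagged 1000) bytes: "
                + writeVLong(withFlagBit(1000L, true)).length);
    }
}
```

In this sketch a negative int always costs 5 bytes and a negative long 10 bytes, because the sign bit keeps the top group non-zero, while the flagged non-negative pointer stays as short as its magnitude allows.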

bq. The thing is - there is a performance penalty to storing too many bytes in the terms dict
because it may affect terms lookup. docFreq may not be a very good decision.

True, but I'd expect "typically" rare terms (occurring in 1 or 2 docs across the corpus) also
generally tend to have low frequency within that document.  Hmm, or maybe not -- maybe there's
only a single article about Dr. Froobalaz, but in that article Froobalaz is mentioned many
many times.

bq. For example, a term may have one posting element with a huge payload. 

True, though such apps (the exception not the rule) could override the codec.

Fixed #bytes might also allow for faster scanning, i.e. if we always leave a 20-byte slot we
know we can then seek +20 bytes ahead, vs. the pulsing codec, which must decode the postings for
the term when scanning over it.  (Though if we thought this mattered we could also write the
#bytes up front.)
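
A rough sketch of the two skipping strategies mentioned above, under stated assumptions: the 20-byte slot size comes from the example in this comment, while the class name, the method names and the single-byte length prefix are hypothetical and do not reflect how any Lucene codec lays out its terms dict.

```java
import java.nio.ByteBuffer;

public class SlotScanSketch {
    // Slot size taken from the "20 byte slot" example; purely illustrative.
    public static final int SLOT_SIZE = 20;

    // Fixed slots: the offset of the n-th entry is pure arithmetic,
    // so nothing has to be read or decoded while scanning.
    public static int fixedSlotOffset(int entryIndex) {
        return entryIndex * SLOT_SIZE;
    }

    // "#bytes up front": read a one-byte length, then jump over the
    // payload without decoding it. Returns the position after skipping.
    public static int skipLengthPrefixed(ByteBuffer buf, int entriesToSkip) {
        for (int i = 0; i < entriesToSkip; i++) {
            int len = buf.get() & 0xFF;          // unsigned length prefix
            buf.position(buf.position() + len);  // skip the payload bytes
        }
        return buf.position();
    }

    public static void main(String[] args) {
        // Three entries with payload lengths 3, 5 and 2.
        byte[] data = {3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 2, 0, 0};
        ByteBuffer buf = ByteBuffer.wrap(data);
        System.out.println("fixed-slot offset of entry 2: " + fixedSlotOffset(2));
        System.out.println("length-prefixed position after 2 entries: "
                + skipLengthPrefixed(buf, 2));
    }
}
```

The fixed-slot variant trades wasted space in short entries for a constant-time seek; the length-prefixed variant costs one small read per entry but keeps entries compact.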

Net/net I think we should pursue this; we should probably keep both options available and
then we can test.


> Make PulsingCodec (wrapping StandardCodec) the default codec
> ------------------------------------------------------------
>
>                 Key: LUCENE-2492
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2492
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>
> PulsingCodec can provide good gains by inlining the postings into the terms dict for
rare terms.  This is especially helpful for primary-key-like fields, since every term is rare
and batch lookups are common (see http://chbits.blogspot.com/2010/06/lucenes-pulsingcodec-on-primary-key.html
for a simple perf test), but it should also be a gain for ordinary fields, thanks to Zipf's
law.
> I think we should make it the default....


