lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5052) bitset codec for off heap filters
Date Tue, 25 Mar 2014 20:13:16 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13947085#comment-13947085 ]

Michael McCandless commented on LUCENE-5052:
--------------------------------------------

bq. this posting format should wrap the standard one, like pulsing;

I don't think we need to do that (I was convinced, above)?  I think it should just be its
own PF, and the app picks it to store all postings as bitsets.

bq. if IndexOptions.DOCS_ONLY is provided, this codec suppresses the standard posting format and
writes the bitset file (<<should-be>> explore fancy formats then);

I think it should ONLY accept DOCS_ONLY?  I.e., throw an exception if it gets anything else, because
it's misuse.

bq. I wonder what’s the correct behavior if docEnum is requested with FLAG_FREQS: should
it silently return 1 from freq() or throw an exception?

I think it should lie (return 1 from freq()).
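
A minimal sketch of the two behaviors suggested above (reject anything other than DOCS_ONLY, and report 1 from freq()). It is written against stand-in types, not the real Lucene classes: the `IndexOptions` enum, `checkIndexOptions`, and `BitsetDocsEnum` below are hypothetical and only mirror names used in this thread, and `java.util.BitSet` stands in for whatever on-disk bitset the format would actually read.

```java
import java.util.BitSet;

public class BitsetPostingsSketch {

    // Hypothetical stand-in for Lucene's IndexOptions; only the names matter here.
    enum IndexOptions { DOCS_ONLY, DOCS_AND_FREQS, DOCS_AND_FREQS_AND_POSITIONS }

    // Treat any field that is not DOCS_ONLY as misuse and fail loudly.
    static void checkIndexOptions(String field, IndexOptions options) {
        if (options != IndexOptions.DOCS_ONLY) {
            throw new IllegalArgumentException(
                "field \"" + field + "\" must be indexed with DOCS_ONLY, got " + options);
        }
    }

    // Hypothetical docs enum that iterates the set bits of one term's bitset.
    static final class BitsetDocsEnum {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;

        private final BitSet bits;
        private int doc = -1;

        BitsetDocsEnum(BitSet bits) {
            this.bits = bits;
        }

        // Advances to the next set bit, or NO_MORE_DOCS once exhausted.
        int nextDoc() {
            if (doc == NO_MORE_DOCS) {
                return doc;
            }
            int next = bits.nextSetBit(doc + 1);
            doc = (next < 0) ? NO_MORE_DOCS : next;
            return doc;
        }

        // Frequencies are never stored, so "lie" and report 1,
        // even if the caller asked for freqs.
        int freq() {
            return 1;
        }
    }
}
```

In this sketch the misuse check would run once per field at write time, while freq() stays cheap and unconditional at read time.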

> bitset codec for off heap filters
> ---------------------------------
>
>                 Key: LUCENE-5052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5052
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Mikhail Khludnev
>              Labels: features
>             Fix For: 5.0
>
>         Attachments: LUCENE-5052.patch, bitsetcodec.zip, bitsetcodec.zip
>
>
> Colleagues,
> When we filter, we don’t care about any of the scoring factors, i.e. norms, positions, tf, but
it should be fast. The obvious way to handle this is to decode the postings list and cache it
in heap (CachingWrapperFilter, Solr’s DocSet). Both consuming heap and decoding are
expensive. 
> Let’s write a posting list as a bitset, if df is greater than segment's maxdocs/8 
(what about skiplists? and overall performance?). 
> Besides the codec implementation, the trickiest part to me is designing an API for this.
How can we let the app know that a term query doesn’t need to be cached in heap, but can be
held as an mmapped bitset?
> WDYT?  
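
The maxdocs/8 threshold in the description can be read as a rough size break-even: a bitset costs one bit per document in the segment regardless of df, while a postings list costs something per matching document. The sketch below assumes about one byte per posting, which is a simplification (packed or block-encoded postings can use less), and the class and method names are hypothetical.

```java
public class BitsetThresholdSketch {

    // Bytes needed for a fixed bitset over maxDoc documents: one bit per doc, rounded up.
    static long bitsetBytes(int maxDoc) {
        return ((long) maxDoc + 7) / 8;
    }

    // The rule from the issue description: write a bitset when df > maxDoc / 8.
    // Under the one-byte-per-posting assumption, this is where the bitset becomes smaller.
    static boolean shouldWriteBitset(int docFreq, int maxDoc) {
        return docFreq > maxDoc / 8;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        System.out.println(bitsetBytes(maxDoc));                // 125000
        System.out.println(shouldWriteBitset(200_000, maxDoc)); // true
        System.out.println(shouldWriteBitset(50_000, maxDoc));  // false
    }
}
```

For a segment of 1,000,000 documents the bitset is a fixed 125,000 bytes, so under this assumption only terms matching more than 125,000 documents would be written as bitsets.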



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

