lucene-dev mailing list archives

From mark harwood <>
Subject Re: New Lucene features and Solr indexes
Date Wed, 13 Feb 2013 15:03:27 GMT
>>Instead of making other APIs to accommodate BloomFilter's current
>>brokenness: remove its custom per-field logic so it works with
>>PerFieldPostingsFormat, like every other PF.

I've not looked at it in a while, but I'm pretty certain that, as with every other PF, you can
go ahead and use PerFieldPF with the Bloom filter just fine.

What was broken was (is?) that in this configuration PFPF isn't smart enough to avoid creating
twice as many files as are required - see LUCENE-4093.
Until that is resolved (and I have noted my pessimism about that being fixed easily) BloomPF
contains an optimisation for those that want to avoid this inefficiency.
The use of that optimisation is entirely optional for users.
Internally to BloomPF, the implementation of that optimisation is trivial: if a null bloom
set is returned for a given field, it skips the usual bloom filtering logic and delegates
directly to the wrapped codec.
You can choose to implement a BloomFilterFactory that adds this field-choice optimisation
or, more simply, run the default PerFieldPF-managed configuration and live with the increased
number of files.
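
As a rough illustration of that null-set delegation, here is a minimal, self-contained sketch.
It is not the Lucene code and does not use the Lucene API: the names BloomDelegationSketch,
WrappedTerms, ToyBloomSet, FieldTerms and forField are hypothetical stand-ins, and the
single-hash bit set is only a toy substitute for a real bloom filter.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the behaviour described above; none of these types are Lucene classes.
public class BloomDelegationSketch {

    /** Stand-in for the wrapped codec: an exact term lookup that counts how often it is hit. */
    static final class WrappedTerms {
        private final Set<String> terms = new HashSet<>();
        int lookups = 0;

        void add(String term) { terms.add(term); }

        boolean contains(String term) {
            lookups++;
            return terms.contains(term);
        }
    }

    /** Toy single-hash bloom set: false positives are possible, false negatives are not. */
    static final class ToyBloomSet {
        private static final int SIZE = 1024;
        private final BitSet bits = new BitSet(SIZE);

        private static int slot(String term) { return Math.floorMod(term.hashCode(), SIZE); }

        void add(String term) { bits.set(slot(term)); }

        boolean mightContain(String term) { return bits.get(slot(term)); }
    }

    /** Terms for one field; bloom is null when no filter was supplied for that field. */
    static final class FieldTerms {
        final WrappedTerms wrapped = new WrappedTerms();
        final ToyBloomSet bloom;

        FieldTerms(ToyBloomSet bloom) { this.bloom = bloom; }

        void add(String term) {
            wrapped.add(term);
            if (bloom != null) {
                bloom.add(term);
            }
        }

        boolean contains(String term) {
            if (bloom == null) {
                return wrapped.contains(term);   // null bloom set: delegate directly
            }
            if (!bloom.mightContain(term)) {
                return false;                    // definite miss, wrapped lookup skipped
            }
            return wrapped.contains(term);       // possible hit, confirm with wrapped lookup
        }
    }

    /** Plays the role of the factory: returns a null bloom set for fields not chosen. */
    static FieldTerms forField(String field, Set<String> bloomFields) {
        return new FieldTerms(bloomFields.contains(field) ? new ToyBloomSet() : null);
    }

    public static void main(String[] args) {
        Set<String> bloomFields = Collections.singleton("id");

        FieldTerms id = forField("id", bloomFields);
        id.add("doc-1");
        FieldTerms body = forField("body", bloomFields);
        body.add("lucene");

        System.out.println("id contains doc-1: " + id.contains("doc-1"));
        System.out.println("id contains doc-2: " + id.contains("doc-2"));
        System.out.println("wrapped lookups on id: " + id.wrapped.lookups);
        System.out.println("body contains solr: " + body.contains("solr"));
        System.out.println("wrapped lookups on body: " + body.wrapped.lookups);
    }
}
```

In this sketch the field without a bloom set pays for a wrapped lookup on every query, which is
exactly what it would do with no bloom filtering configured at all.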

Arguably, the inefficiencies of the PerFieldPF framework are the real issue to be addressed.

>>I brought this up before it was committed, and I was ignored

You stopped engaging in the debate when I outlined the 3 proposed options for moving BloomPF
forward. Those options were:
1) ignore the inefficiencies in PFPF
2) sort out the issues in PFPF (4093 but probably a more complex solution)
3) work around existing PFPF issues with a simple but entirely optional optimisation to BloomPF

I opted for 3) and gave notice that I'd take it out if anyone objected.
I don't think there's been any movement on 2) so I guess you're still happy with option 1)?
I recall you didn't think the business of extra files was that much of a concern:

(Incidentally, probably best following up on the relevant Jiras rather than here)


 From: Robert Muir <>
Sent: Wednesday, 13 February 2013, 13:01
Subject: Re: New Lucene features and Solr indexes
On Wed, Feb 13, 2013 at 4:42 AM, Adrien Grand <> wrote:
> Hi Shawn,
> On Tue, Feb 12, 2013 at 8:58 PM, Shawn Heisey <> wrote:
>> Some of these, like compressed stored fields and compressed termvectors, are
>> being turned on by default, which is awesome.  I'm already running a 4.2
>> snapshot, so I've got those in place.
> Excellent!
>> One thing that I know I would like to do is use the new BloomFilter for a
>> couple of my fields that contain only unique values.  Last time I checked
>> (which was before the 4.1 release), if you added the lucene-codecs jar, Solr
>> had a BloomFilter postings format, but didn't have any way to specify the
>> underlying format.  See SOLR-3950 and LUCENE-4394.
> BloomFilterPostingsFormat is a little special compared to other
> postings formats because it can wrap any postings format. So maybe it
> should require special support, like an additional attribute in the
> field type definition?


Instead of making other APIs to accommodate BloomFilter's current
brokenness: remove its custom per-field logic so it works with
PerFieldPostingsFormat, like every other PF.

In other words, it should work just like pulsing.

I brought this up before it was committed, and I was ignored. That's
fine, but I'll be damned if I let its incorrect design complicate
other parts of the codebase too. I'd rather it continue to stay
difficult to integrate and continue walking its current path to an
open source death instead.
