lucene-dev mailing list archives

From Alan Woodward <>
Subject Re: Anticipating a benchmark for direct posting format
Date Mon, 07 Apr 2014 21:32:07 GMT
Does FilterDirectoryReader do what you want?

Alan Woodward
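
For readers following the thread, here is a minimal sketch of the shape being discussed: a composite reader that pushes a filter down onto each of its per-segment leaves. It is an illustration only. `LeafReader`, `SegmentLeaf`, `InMemoryLeaf`, and `FilteringCompositeReader` are hypothetical stand-in types written for this sketch, not Lucene classes, and the "in-memory" leaf only tags its delegate instead of actually caching postings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

// Entry point: builds a composite over two fake segments and wraps each leaf.
class LeafWrappingSketch {
    public static void main(String[] args) {
        List<LeafReader> segments =
            Arrays.asList(new SegmentLeaf("_0"), new SegmentLeaf("_1"));
        FilteringCompositeReader reader =
            new FilteringCompositeReader(segments, InMemoryLeaf::new);
        for (LeafReader leaf : reader.leaves()) {
            System.out.println(leaf.describe());
        }
    }
}

// Hypothetical stand-in for a per-segment (leaf) reader.
interface LeafReader {
    String describe();
}

// Hypothetical leaf that reads whatever format is on disk.
final class SegmentLeaf implements LeafReader {
    private final String name;
    SegmentLeaf(String name) { this.name = name; }
    @Override public String describe() { return name; }
}

// Hypothetical filter leaf: delegates to the wrapped leaf. A real version
// would be the place to pre-load data into in-memory structures.
final class InMemoryLeaf implements LeafReader {
    private final LeafReader in;
    InMemoryLeaf(LeafReader in) { this.in = in; }
    @Override public String describe() { return "in-memory(" + in.describe() + ")"; }
}

// Hypothetical composite: applies the wrapper to every leaf once, at open
// time, so the choice is made by the reader and not by the on-disk format.
final class FilteringCompositeReader {
    private final List<LeafReader> leaves;

    FilteringCompositeReader(List<LeafReader> segments,
                             UnaryOperator<LeafReader> wrapper) {
        List<LeafReader> wrapped = new ArrayList<>();
        for (LeafReader leaf : segments) {
            wrapped.add(wrapper.apply(leaf));
        }
        this.leaves = Collections.unmodifiableList(wrapped);
    }

    List<LeafReader> leaves() { return leaves; }
}
```

Because the wrapper is supplied when the reader is constructed, the same index can be opened with or without the in-memory behavior, which is the separation from the on-disk format that Benson asks for further down the thread.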

On 7 Apr 2014, at 22:19, Benson Margulies wrote:

> Typically, an app gets a directory reader, which is a composite
> reader. To get a filter down there into the leaves of the composite
> reader, does anyone have a suggestion about where to enter the
> modularity?
> I sort of want to insert myself at
> org.apache.lucene.index.StandardDirectoryReader#open(,
> org.apache.lucene.index.IndexCommit) wrapping the segment readers, or
> I could make a sort of filtering composite reader that wraps each of
> the segment readers in a filter.
> On Mon, Apr 7, 2014 at 1:02 PM, Shai Erera <> wrote:
>> Given that DPF delegates indexing to another PF anyway (currently Lucene41),
>> I think this might be the case. We would need to test of course. The key
>> point is that this FilterAtomicReader will be able to serve anything as
>> direct, even DV, so it might eliminate DVF too. We need to experiment and
>> benchmark!
>> Shai
>> On Apr 7, 2014 7:32 PM, ""
>> <> wrote:
>>> Aaaah, nice idea to simply use FilterAtomicReader — of course!  So this
>>> would ultimately be a new IndexReaderFactory that creates
>>> FilterAtomicReaders for a subset of the fields you want to do this on.
>>> Cool!  With that, I don’t think there would be a need for
>>> DirectPostingsFormat as a postings format, would there be?
>>> ~ David
>>> On Mon, Apr 7, 2014 at 10:58 AM, Shai Erera <> wrote:
>>>> The only problem is how the Codec makes a dynamic decision on whether to
>>>> use the wrapped Codec for reading vs pre-load data into in-memory
>>>> structures, because Codecs are loaded through reflection by the SPI loading
>>>> mechanism.
>>>> There is also a TODO in DirectPF to allow wrapping arbitrary PFs, just
>>>> mentioning in case you want to tackle DPF.
>>>> I think that if we allowed passing something like a CodecLookupService,
>>>> with an SPILookupService default impl, you could easily pass that to
>>>> DirectoryReader which will use your runtime logic to load the right PF (e.g.
>>>> DPF) instead of the one the index was created with.
>>>> But it sounds like the core problem is that when we load a Codec/PF/DVF
>>>> for reading, we cannot pass it any arguments, and so we must make an
>>>> index-time decision about how we're going to read the data later on. If we
>>>> could somehow support that, I think that will help you to achieve what you
>>>> want too.
>>>> E.g. currently it's an all-or-nothing decision, but if we could pass a
>>>> parameter like "50% available heap", the Codec/PF/DVF could cache the
>>>> frequently accessed postings instead of loading all of them into memory.
>>>> But, that can also be achieved at the IndexReader level, through a custom
>>>> FilterAtomicReader. And if you could reuse DPF's structures (like
>>>> DirectTermsEnum, DirectFields...), it should be easier to do this. So
>>>> perhaps we can think about a DirectAtomicReader which does that? I believe
>>>> it can share some code w/ DPF, as long as we don't make these APIs public,
>>>> or make them @lucene.experimental.
>>>> Just throwing some ideas...
>>>> Shai
>>>> On Mon, Apr 7, 2014 at 5:35 PM,
>>>> <> wrote:
>>>>> Benson, I like your idea.
>>>>> I think your idea can be achieved as a codec, one that wraps another
>>>>> codec that establishes the on-disk format.  By default the wrapped codec
>>>>> would be Lucene’s default codec.  I think, if implemented, this would be
>>>>> a change to DPF instead of an additional DPF-variant codec.
>>>>> ~ David
>>>>> On Mon, Apr 7, 2014 at 9:22 AM, Benson Margulies <>
>>>>> wrote:
>>>>>> On Mon, Apr 7, 2014 at 9:14 AM, Robert Muir <> wrote:
>>>>>>> On Thu, Apr 3, 2014 at 12:27 PM, Benson Margulies
>>>>>>> <> wrote:
>>>>>>>> My takeaway from the prior conversation was that various people didn't
>>>>>>>> entirely believe that I'd seen a dramatic improvement in query
>>>>>>>> performance using D-P-F, and so would not smile upon a patch intended
>>>>>>>> to liberate D-P-F from codecs. It could be that the effect I saw has to
>>>>>>>> do with the fact that our system depends on hitting and scoring 50% of
>>>>>>>> the documents in an index with a lot of documents.
>>>>>>> I don't understand the word "liberate" here. why is it such a
>>>>>>> that this is a codec?
>>>>>> I don't want to have to declare my intentions at the time I create
>>>>>> the index. I don't want to have to use D-P-F for all readers all the
>>>>>> time, because I want to be able to decide to open up an index with an
>>>>>> arbitrary on-disk format and get the in-memory cache behavior of
>>>>>> D-P-F. Thus 'liberate' -- split the question of 'keep a copy in
>>>>>> memory' from the choice of the on-disk format.
>>>>>>> I do not think we should give it any more status than that; it wastes
>>>>>>> too much RAM.
>>>>>> It didn't seem like 'waste' when it solved a big practical problem for
>>>>>> us. We had an application that was too slow, and had plenty of RAM
>>>>>> available, and we were able to trade space for time by applying D-P-F.
>>>>>> Maybe I'm going about this backwards; if I can come up with a small,
>>>>>> inconspicuous proposed change that does what I want, there won't be
>>>>>> any disagreement.
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail:
>>>>>>> For additional commands, e-mail:
