lucene-java-user mailing list archives

From chrisbamford <chrisbamf...@chrisbamford.plus.com>
Subject RE: Filtering question
Date Mon, 16 Mar 2015 12:07:28 GMT
Hi Uwe

I have downloaded the Lucene 5.0.0 source to look at the Filters you
mention.  DocValuesTermsFilter looks promising; however, I cannot find
FieldCacheDocIdSet anywhere in Lucene 4.10.2 or in 5.0.0.  Where should
I be looking?

I take your point about brute-forcing the DocValues search. All I can
do is implement it, test it, and decide whether the performance is
acceptable.  This is the main driver behind getting filtering working
correctly!

Thanks for your continued help.

- Chris

On 12.03.2015 18:45, Uwe Schindler wrote:
> Hi Chris,
>
>
>> Hi Uwe, thanks for your suggestions.  I have tried a couple of
>> things with no luck yet:
>>
>> > Sorry,
>> > I just noticed, you are using TermFilter not TermsFilter: This one
>> > does not support random access (using bits()). Because of this the
>> > filtered docs cannot be passed down using acceptDocs.
>> >
>> TermsFilter made no difference, still no acceptDocs passed to the 
>> filter.
>
> I know, the problem is that a TermsFilter with one term behaves like
> a TermFilter :-) In any case you could use CachingWrapperFilter to
> forcefully create a bitset, but I don't think it's worth the hassle.
> It is still strange, I did not dig into it.
>
>> > In addition, the SHOULD clause means that the ConstantScoreQuery
>> > has to try all documents, because there is nothing else that could
>> > drive the query.
>> >
>> As an experiment I tried MUST, this didn't help either.
>
> I checked your impl: You just create a Bitset, so it won't help.
> Please look at other DocValues filters like DocValuesRangeFilter to
> see how they implement the iterator. Creating a BitSet is just
> overhead and
> while doing so, you have no chance to take other query constraints
> into account (because the bitset is built *before* the query is
> executed).
>
> Instead you should implement a custom DocIdSet (Lucene 4.10 offers
> FieldCacheDocIdSet as base class; in 5.0 it was renamed to
> DocValuesDocIdSet, you can implement the abstract matchDoc() method
> there). This one automatically handles everything correctly, such as
> acceptDocs and advance(). It does not build a bitset, it does
> everything by calling the abstract matchDoc() method on the fly. You
> just have to put the matching logic into matchDoc(int docId).
>
>> > An alternative approach would be (in Lucene 4.10 or 5.0) to add the
>> > TermFilter as ConstantScoreFilter(TermQuery) with boost=0 to the
>> > BooleanQuery. In that case it can drive the query and does not
>> > affect scoring. In later Lucene versions you may use the new
>> > BooleanQuery.Occur type "FILTER" which can add any query as filter.
>> > Filters will be deprecated once this is ready.
>> >
>> This is interesting and I will try it when I get a chance.
>
> I mean ConstantScoreQuery, not ConstantScoreFilter. But you need to
> implement your own DocIdSetIterator with DocIdSetIterator.advance(),
> otherwise it won't help (see above).
>
>> >> My goal is to slowly transform a particular field from StringField
>> >> to BinaryDocValues so that during the transition a doc may hold
>> >> the value either in the old location or the new. Therefore a query
>> >> must be able to say
>> >>     oldField:"foo" OR newField:"foo"
>> >> Where oldField is a StringField and newField is a BinaryDocValues.
>> >
>> > Why do you want to do this?
>> >
>> Good question!  In our architecture we build indexes by pulling data
>> from several sources and it is _expensive_.  Increasingly we are
>> requested to change one or two fields, which currently requires a
>> full re-index of the doc.  When I attended the Dublin Lucene
>> conference I spoke to Shai Erera about this problem and he pointed me
>> at DocValues, which allow you to update fields without incurring the
>> full doc reindex cost.  That is the appeal for us.  As I said before,
>> we want to transform docs only as they are updated, where
>> transformation involves dropping the old TextField and creating a new
>> BinaryDocValuesField containing the same value.  Hence the need for
>> the query to be able to search 'old OR new'.
>>
>> > If you want to query like this on the field, it is a bad idea to
>> > use DocValues.
>> >
>> Why is it a bad idea?
>
> Indeed, DocValues are updatable. But they have the downside that
> they don't provide a way to query the index for a term and have it
> tell you which documents contain the term (our inverted index - the
> reason why we use Lucene!). DocValues are just a large array with
> random access. If you want to query on it, you have to brute force,
> unless there is something else in the query structure that can
> "drive" your query (advance() on the filter's iterator). On a
> BooleanQuery consisting of 2 SHOULD clauses, nothing can drive the
> query, so the only possibility is a full scan of the docvalues,
> doc by doc.
>
> Uwe
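
To make the difference between a full scan and a "driven" query
concrete, here is a minimal sketch in plain Java. It deliberately does
not use the Lucene API: MatchDocIterator, fullScan and drivenBy are
hypothetical names, and the String array only stands in for a
per-document DocValues lookup. It models the idea of evaluating a
matchDoc(int) test on the fly inside advance(), and counts how often
that test runs when nothing drives the iterator versus when another
clause supplies a few candidate documents.

```java
import java.util.function.IntPredicate;

/** Simplified model of a matchDoc()-based iterator; NOT the Lucene API. */
final class LazyMatchSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Evaluates a per-document predicate lazily; never builds a bitset. */
    static final class MatchDocIterator {
        private final int maxDoc;
        private final IntPredicate matchDoc;
        private int doc = -1;
        int matchDocCalls = 0; // how many documents were actually inspected

        MatchDocIterator(int maxDoc, IntPredicate matchDoc) {
            this.maxDoc = maxDoc;
            this.matchDoc = matchDoc;
        }

        int nextDoc() {
            return advance(doc + 1);
        }

        /** Moves to the first matching document >= target. */
        int advance(int target) {
            for (int d = target; d < maxDoc; d++) {
                matchDocCalls++;
                if (matchDoc.test(d)) {
                    return doc = d;
                }
            }
            return doc = NO_MORE_DOCS;
        }
    }

    /** Nothing drives the query: every document has to be tested. */
    static int fullScan(MatchDocIterator it) {
        int hits = 0;
        while (it.nextDoc() != NO_MORE_DOCS) {
            hits++;
        }
        return hits;
    }

    /** Another clause supplies sorted candidates; we only advance() to those. */
    static int drivenBy(int[] leadDocs, MatchDocIterator it) {
        int hits = 0;
        int current = -1;
        for (int lead : leadDocs) {
            if (current < lead) {
                current = it.advance(lead);
            }
            if (current == lead) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for a DocValues field: every 10th doc holds "foo".
        String[] values = new String[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = (i % 10 == 0) ? "foo" : "bar";
        }
        IntPredicate isFoo = d -> "foo".equals(values[d]);

        MatchDocIterator scan = new MatchDocIterator(values.length, isFoo);
        System.out.println(fullScan(scan) + " hits, "
                + scan.matchDocCalls + " matchDoc calls"); // 100 hits, 1000 matchDoc calls

        MatchDocIterator driven = new MatchDocIterator(values.length, isFoo);
        System.out.println(drivenBy(new int[] {10, 25, 500}, driven) + " hits, "
                + driven.matchDocCalls + " matchDoc calls"); // 2 hits, 8 matchDoc calls
    }
}
```

In this toy model the two SHOULD clause case corresponds to fullScan,
where every document is inspected, while a restrictive MUST or filter
clause corresponds to drivenBy, where only a handful are.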

