lucene-java-user mailing list archives

From "Uwe Schindler" <>
Subject RE: Filtering question
Date Thu, 12 Mar 2015 18:45:27 GMT
Hi Chris,

> Hi Uwe, thanks for your suggestions.  I have tried a couple of things with no
> luck yet:
> > Sorry,
> > I just noticed, you are using TermFilter not TermsFilter: This one
> > does not support random access (using bits()). Because of this the
> > filtered docs cannot be passed down using acceptDocs.
> >
> TermsFilter made no difference, still no acceptDocs passed to the filter.

I know; the problem is that a TermsFilter with a single term behaves like a TermFilter :-) In any
case you could use CachingWrapperFilter to force the creation of a bitset, but I don't think it's
worth the hassle. It is still strange, but I did not dig into it.

> > The should
> > clause in addition causes that the ConstantScoreQuery has to try all
> > documents because there is nothing else that could drive the query.
> >
> As an experiment I tried MUST, this didn't help either.

I checked your impl: you just create a BitSet, so it won't help. Please look at how other DocValues
filters, such as DocValuesRangeFilter, implement the iterator. Creating a BitSet is just
overhead, and while doing so you have no chance to take other query constraints into account
(because the bitset is built *before* the query is executed).

Instead you should implement a custom DocIdSet (Lucene 4.10 offers FieldCacheDocIdSet as a base
class; in 5.0 it was renamed to DocValuesDocIdSet; you can implement the abstract matchDoc()
method there). This one automatically handles everything correctly, such as respecting acceptDocs
and supporting advance(). It does not build a bitset; it does everything by calling the abstract
matchDoc() method on the fly. You just have to put the matching logic into matchDoc(int docId).
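
To make the "on the fly" point concrete, here is a minimal sketch in plain Java. It is a simplified model, not the Lucene FieldCacheDocIdSet/DocValuesDocIdSet class: the names LazyDocIdSet, LazyMatchSketch, VALUES and the matchCalls counter are hypothetical, and an IntPredicate stands in for Lucene's acceptDocs. It only illustrates that matchDoc() runs for the documents that are actually visited, and that documents rejected by acceptDocs are never evaluated.

```java
import java.util.function.IntPredicate;

/** Plain-Java model of a lazily evaluated doc-id set; NOT the Lucene class. */
abstract class LazyDocIdSet {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int maxDoc;
    private final IntPredicate acceptDocs; // stands in for Lucene's acceptDocs
    int matchCalls = 0;                     // counts how often matchDoc() ran

    LazyDocIdSet(int maxDoc, IntPredicate acceptDocs) {
        this.maxDoc = maxDoc;
        this.acceptDocs = acceptDocs;
    }

    /** The only thing a subclass supplies: does this doc match? */
    protected abstract boolean matchDoc(int doc);

    /** Random access: check a single doc on demand. */
    boolean get(int doc) {
        if (doc < 0 || doc >= maxDoc || !acceptDocs.test(doc)) {
            return false; // rejected before matchDoc() is ever called
        }
        matchCalls++;
        return matchDoc(doc);
    }

    /** Returns the first matching doc >= target, or NO_MORE_DOCS. */
    int advance(int target) {
        for (int doc = Math.max(target, 0); doc < maxDoc; doc++) {
            if (get(doc)) {
                return doc;
            }
        }
        return NO_MORE_DOCS;
    }
}

class LazyMatchSketch {
    /** Hypothetical per-document values standing in for a BinaryDocValues field. */
    static final String[] VALUES = {"foo", "bar", "foo", "baz", "foo", "bar"};

    static LazyDocIdSet fooSet(IntPredicate acceptDocs) {
        return new LazyDocIdSet(VALUES.length, acceptDocs) {
            @Override
            protected boolean matchDoc(int doc) {
                return "foo".equals(VALUES[doc]);
            }
        };
    }

    public static void main(String[] args) {
        // Doc 0 is excluded by acceptDocs (think: deleted or filtered out).
        LazyDocIdSet set = fooSet(doc -> doc != 0);
        System.out.println(set.advance(0)); // 2
        System.out.println(set.advance(3)); // 4
        System.out.println("matchDoc calls: " + set.matchCalls); // 4
    }
}
```

Nothing is precomputed here: a caller that jumps straight to advance(3) never pays for docs 0 to 2, which is the property a prebuilt bitset cannot offer.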

> > An alternative approach would be (in Lucene 4.10 or 5.0) to add the
> > TermFilter as ConstantScoreFilter(TermQuery) with boost=0 to the
> > BooleanQuery. In that case it can drive the query and does not affect
> > scoring. In later Lucene versions you may use the new
> > BooleanQuery.Occur type "FILTER" which can add any query as filter.
> > Filters will be deprecated once this is ready.
> >
> This is interesting and I will try it when I get a chance.

I meant ConstantScoreQuery, not ConstantScoreFilter. But you need to implement your own DocIdSetIterator
with a proper DocIdSetIterator.advance(), otherwise it won't help (see above).

> >> My goal is to slowly transform a particular field from StringField to
> >> BinaryDocValues so that during the transition a doc may hold the
> >> value either in the old location or the new. Therefore a query must
> >> be able to say
> >>     oldField:"foo" OR newField:"foo"
> >> Where oldField is a StringField and newField is a BinaryDocValues.
> >
> > Why do you want to do this.
> >
> Good question!  In our architecture we build indexes by pulling data from
> several sources and it is _expensive_.  Increasingly we are requested to
> change one or two fields which currently requires a full re-index of the doc.
> When I attended the Dublin Lucene conference I spoke to Shai Erera about
> this problem and he pointed me at DocValues which allow you to update
> fields without incurring the full doc reindex cost.  That is the appeal for us.
> As I said before, we want to transform docs only as they are updated, where
> transformation involves dropping the old TextField and creating a new
> BinaryDocValuesField containing the same value.  Hence the need for the
> query to be able to search 'old OR new'.
> > If you want to query like this on the field, it is a bad idea to use
> > DocValues.
> >
> Why is it a bad idea?

Indeed, DocValues are updatable. But they have a downside: they don't provide a way
to ask the index for a term and get back the documents that contain it (that is what the inverted
index does - the reason why we use Lucene!). DocValues are just a large array with random access.
If you want to query on it, you have to brute force, unless there is something else in the
query structure that can "drive" your query (by calling advance() on the filter's iterator). On a
BooleanQuery consisting of 2 SHOULD clauses, nothing can drive the query, so the only possibility
is a full scan of the docvalues, doc by doc.
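
The difference between a driven query and a full scan can be sketched in plain Java. This is a toy model under stated assumptions, not Lucene code: the six-document index, the class DrivenVersusScanSketch, and the docValuesLookups counter are all hypothetical. oldField is modelled as a postings list (the docs for "foo" are known up front), newField as a per-document column.

```java
import java.util.ArrayList;
import java.util.List;

class DrivenVersusScanSketch {
    // Hypothetical six-document index (docs 0..5).
    // oldField is inverted: for the term "foo" the matching docs are known up front.
    static final int[] OLD_FIELD_FOO_POSTINGS = {1, 4};
    // newField is a per-document column, the way DocValues are laid out.
    static final String[] NEW_FIELD_VALUES = {null, null, "foo", "bar", null, "foo"};

    static int docValuesLookups = 0; // how many per-document checks were needed

    static boolean newFieldIsFoo(int doc) {
        docValuesLookups++;
        return "foo".equals(NEW_FIELD_VALUES[doc]);
    }

    /** oldField:"foo" OR newField:"foo": no clause narrows the candidates. */
    static List<Integer> eitherFieldIsFoo() {
        List<Integer> hits = new ArrayList<>();
        int p = 0; // cursor into the sorted postings
        for (int doc = 0; doc < NEW_FIELD_VALUES.length; doc++) {
            while (p < OLD_FIELD_FOO_POSTINGS.length && OLD_FIELD_FOO_POSTINGS[p] < doc) {
                p++;
            }
            boolean inOld = p < OLD_FIELD_FOO_POSTINGS.length
                    && OLD_FIELD_FOO_POSTINGS[p] == doc;
            // Every doc the postings do not already cover needs a column lookup.
            if (inOld || newFieldIsFoo(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** driver AND newField:"foo": only the docs supplied by the driver are checked. */
    static List<Integer> drivenNewFieldIsFoo(int[] driverDocs) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : driverDocs) {
            if (newFieldIsFoo(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(eitherFieldIsFoo() + " lookups=" + docValuesLookups);
        docValuesLookups = 0;
        System.out.println(drivenNewFieldIsFoo(new int[] {2, 3})
                + " lookups=" + docValuesLookups);
    }
}
```

With six documents the gap is small, but the OR case grows with the total number of documents in the index, while the driven case grows only with the number of documents the driving clause matches.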

