lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <>
Subject [jira] [Commented] (SOLR-3763) Make solr use lucene filters directly
Date Tue, 28 Aug 2012 15:10:07 GMT


Yonik Seeley commented on SOLR-3763:

Interesting work Greg!  A few points:

bq. Another issue here is that filters currently cache sub-optimally given the changes in
lucene towards atomic readers.

This really depends on the problem - sometimes a top-level cache is more optimal, and sometimes
per-segment caches are.  IMO, we shouldn't force either, but add the ability
to cache per-segment.

There are already issues open for caching disjunction clauses separately too - it's a rather
orthogonal issue.

It might be a better idea to start off small: we could make a QParser that creates a CachingWrapperFilter
wrapped in a FilteredQuery, and hence will cache per-segment.  That should be simple and non-invasive
enough to make it into 4.0.
> Make solr use lucene filters directly
> -------------------------------------
>                 Key: SOLR-3763
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 4.0, 4.1, 5.0
>            Reporter: Greg Bowyer
>            Assignee: Greg Bowyer
>         Attachments: SOLR-3763-Make-solr-use-lucene-filters-directly.patch
> Presently Solr uses bitsets, queries and collectors to implement the concept of filters.
This has proven to be very powerful, but comes at the cost of introducing a large body
of code into Solr, making it harder to optimise and maintain.
> Another issue here is that filters currently cache sub-optimally given the changes in
lucene towards atomic readers.
> Rather than patch these issues, this is an attempt to rework the filters in Solr to leverage
the Filter subsystem from Lucene as much as possible.
> In good time the aim is to get this to do the following:
> ∘ Handle setting up filter implementations that are able to correctly cache with reference
to the AtomicReader that they are caching for, rather than for the entire index at large
> ∘ Get the post filters working; I am thinking that this can be done via Lucene's ChainedFilter,
with the "expensive" filters being put towards the end of the chain - this has
different semantics internally to the original implementation but IMHO should have the same
result for end users
> ∘ Learn how to create filters that are potentially more efficient; at present Solr
basically runs a simple query that gathers a DocSet of the documents that we
want filtered. It would be interesting to make use of filter implementations that are in theory
faster than query filters (for instance, there are filters that are able to query the FieldCache)
> ∘ Learn how to decompose filters so that a complex filter query can be cached (potentially)
as its constituent parts; for example the filter below currently needs love, care and feeding
to ensure that the filter cache is not unduly stressed
> {code}
>   'category:(100) OR category:(200) OR category:(300)'
> {code}
> Really there is no reason not to express this in a cached form as 
> {code}
> BooleanFilter(
>     FilterClause(CachedFilter(TermFilter(Term("category", 100))), SHOULD),
>     FilterClause(CachedFilter(TermFilter(Term("category", 200))), SHOULD),
>     FilterClause(CachedFilter(TermFilter(Term("category", 300))), SHOULD)
>   )
> {code}
> This would yield better cache usage, I think, as we can reuse DocSets across multiple
queries, as well as avoid issues when filters are presented in differing orders
> ∘ Instead of end users providing costing we might (and this is a big might, FWIW) be
able to create a sort of execution plan of filters, leveraging a combination of what the index
is able to tell us as well as sampling and "educated guesswork"; in essence this is what
some DBMS software does - PostgreSQL, for example, has a genetic algorithm that attempts to
solve the travelling salesman problem for join ordering - to great effect
> ∘ I am sure I will probably come up with other ambitious ideas to plug in here .....
> Patches obviously forthcoming but the bulk of the work can be followed here
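The FieldCache-backed filters mentioned in the description already exist in Lucene 4.x; as a hedged sketch (this particular composition is my illustration, not part of the attached patch), the category disjunction could be answered straight out of the per-segment FieldCache:

```java
import org.apache.lucene.search.FieldCacheTermsFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;

public class FieldCacheFilterSketch {
  public static void main(String[] args) {
    // FieldCacheTermsFilter resolves the disjunction against the segment's
    // FieldCache entry for "category" instead of running term queries and
    // materialising a DocSet per clause.
    Filter categories = new FieldCacheTermsFilter("category", "100", "200", "300");
    Query q = new FilteredQuery(new MatchAllDocsQuery(), categories);
    System.out.println(q);
  }
}
```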

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
