lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3763) Make solr use lucene filters directly
Date Tue, 28 Aug 2012 15:10:07 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443202#comment-13443202
] 

Yonik Seeley commented on SOLR-3763:
------------------------------------

Interesting work Greg!  A few points:

bq. Another issue here is that filters currently cache sub-optimally given the changes in
lucene towards atomic readers.

This really depends on the problem - sometimes a top-level cache performs better, and sometimes
per-segment caches do.  IMO, we shouldn't force either, but add the ability
to cache per-segment.

There are already issues open for caching disjunction clauses separately too - it's a rather
orthogonal issue.

It might be a better idea to start off small: we could make a QParser that creates a CachingWrapperFilter
wrapped in a FilteredQuery, and hence will cache per-segment.  That should be simple and non-invasive
enough to make it into 4.0.
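The per-segment caching idea above can be modeled outside Lucene. This is a stdlib-only sketch: the class name, the String segment keys, and the `computations` counter are invented for illustration and are not the actual CachingWrapperFilter/FilteredQuery API; the point is only that memoizing per segment means a reopened index recomputes the filter for changed segments alone.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: a filter produces a BitSet of matching docs for one segment.
// The caching wrapper memoizes per segment key, so after a reopen only
// the segments that actually changed need their filter recomputed.
public class PerSegmentCachingFilter {
    private final Function<String, BitSet> inner;  // the wrapped filter
    private final Map<String, BitSet> cache = new HashMap<>();
    public int computations = 0;                   // for illustration only

    public PerSegmentCachingFilter(Function<String, BitSet> inner) {
        this.inner = inner;
    }

    public BitSet getDocIdSet(String segmentKey) {
        return cache.computeIfAbsent(segmentKey, k -> {
            computations++;
            return inner.apply(k);
        });
    }
}
```

A FilteredQuery-style consumer would call getDocIdSet once per segment during scoring; a second search over unchanged segments hits the cache instead of re-running the filter.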
                
> Make solr use lucene filters directly
> -------------------------------------
>
>                 Key: SOLR-3763
>                 URL: https://issues.apache.org/jira/browse/SOLR-3763
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 4.0, 4.1, 5.0
>            Reporter: Greg Bowyer
>            Assignee: Greg Bowyer
>         Attachments: SOLR-3763-Make-solr-use-lucene-filters-directly.patch
>
>
> Presently solr uses bitsets, queries and collectors to implement the concept of filters.
This has proven to be very powerful, but does come at the cost of introducing a large body
of code into solr, making it harder to optimise and maintain.
> Another issue here is that filters currently cache sub-optimally given the changes in
lucene towards atomic readers.
> Rather than patch these issues, this is an attempt to rework the filters in solr to leverage
the Filter subsystem from lucene as much as possible.
> In good time the aim is to get this to do the following:
> ∘ Handle setting up filter implementations that are able to correctly cache with reference
to the AtomicReader that they are caching for, rather than for the entire index at large
> ∘ Get the post filters working; I am thinking that this can be done via Lucene's ChainedFilter,
with the "expensive" filters being put towards the end of the chain - this has
different semantics internally to the original implementation but IMHO should have the same
result for end users
> ∘ Learn how to create filters that are potentially more efficient. At present solr
basically runs a simple query that gathers a DocSet for the documents that we
want filtered; it would be interesting to make use of filter implementations that are in theory
faster than query filters (for instance, there are filters that are able to query the FieldCache)
> ∘ Learn how to decompose filters so that a complex filter query can be cached (potentially)
as its constituent parts; for example the filter below currently needs love, care and feeding
to ensure that the filter cache is not unduly stressed
> {code}
>   'category:(100) OR category:(200) OR category:(300)'
> {code}
> Really there is no reason not to express this in a cached form as 
> {code}
> BooleanFilter(
>     FilterClause(CachedFilter(TermFilter(Term("category", 100))), SHOULD),
>     FilterClause(CachedFilter(TermFilter(Term("category", 200))), SHOULD),
>     FilterClause(CachedFilter(TermFilter(Term("category", 300))), SHOULD)
>   )
> {code}
> This would yield better cache usage, I think, as we can reuse DocSets across multiple
queries as well as avoid issues when filters are presented in differing orders
> ∘ Instead of end users providing costing, we might (and this is a big might FWIW) be
able to create a sort of execution plan of filters, leveraging a combination of what the index
is able to tell us as well as sampling and "educated guesswork"; in essence this is what
some DBMS software does - PostgreSQL, for example, has a genetic algorithm that attacks the
travelling-salesman-like join-ordering problem - to great effect
> ∘ I am sure I will probably come up with other ambitious ideas to plug in here .....
:S 
> Patches obviously forthcoming but the bulk of the work can be followed here https://github.com/GregBowyer/lucene-solr/commits/solr-uses-lucene-filters
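The "expensive filters at the end of the chain" idea in the description above can be sketched without Lucene. The names below are hypothetical stand-ins (Lucene's ChainedFilter operates on DocIdSets; here a segment is modeled as a plain BitSet over a doc range):

```java
import java.util.BitSet;
import java.util.List;

// Sketch: apply filters in order, intersecting as we go. Cheap filters
// run first over the full doc range; an expensive ("post") filter at
// the end of the chain is only consulted for docs that survived.
public class FilterChain {
    public interface Filter {
        boolean accept(int doc);
    }

    public static BitSet apply(int maxDoc, List<Filter> cheapToExpensive) {
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc);  // start with all docs live
        for (Filter f : cheapToExpensive) {
            for (int doc = live.nextSetBit(0); doc >= 0; doc = live.nextSetBit(doc + 1)) {
                if (!f.accept(doc)) live.clear(doc);
            }
        }
        return live;
    }
}
```

This is where the semantics differ internally from Solr's post-filter collectors, as the description notes, but the surviving document set is the same for end users.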
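The BooleanFilter decomposition above can likewise be modeled: each term filter caches its own doc set, and the disjunction just ORs the cached bitsets, so "category:100 OR category:200" and "category:200 OR category:100" hit the same cached parts. Again a stdlib sketch with invented names, not the actual BooleanFilter/TermFilter classes:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: per-term cached filters combined with OR. Because each clause
// is cached independently, clause order does not matter and individual
// clauses are reusable across different filter queries.
public class CachedDisjunction {
    private final Map<String, BitSet> cache = new HashMap<>();
    public int misses = 0;  // for illustration only

    public BitSet term(String term, Supplier<BitSet> compute) {
        return cache.computeIfAbsent(term, t -> {
            misses++;
            return compute.get();
        });
    }

    public BitSet or(BitSet... clauses) {
        BitSet out = new BitSet();  // copy, so cached clauses stay untouched
        for (BitSet c : clauses) out.or(c);
        return out;
    }
}
```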
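The "execution plan" idea above, minus the genetic algorithm, reduces to ordering filters by estimated cost and selectivity before chaining them. This is a toy heuristic with invented names and numbers, nothing like what a real planner (or PostgreSQL's optimizer) actually does; it only illustrates ranking cheap, selective filters first:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy planner sketch: given per-filter cost and selectivity estimates
// (imagined as coming from index stats or sampling), run cheap, highly
// selective filters first so later, expensive filters see fewer docs.
public class FilterPlanner {
    public static class Est {
        final String name;
        final double cost;         // estimated cost per doc examined
        final double selectivity;  // estimated fraction of docs kept
        public Est(String name, double cost, double selectivity) {
            this.name = name;
            this.cost = cost;
            this.selectivity = selectivity;
        }
    }

    public static List<String> plan(List<Est> filters) {
        List<Est> order = new ArrayList<>(filters);
        // Classic predicate-ordering rank: lower cost per doc eliminated
        // first (selectivity == 1 divides to infinity and sorts last).
        order.sort(Comparator.comparingDouble(e -> e.cost / (1.0 - e.selectivity)));
        List<String> names = new ArrayList<>();
        for (Est e : order) names.add(e.name);
        return names;
    }
}
```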

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

