lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Edward Drapkin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2447) Add support for subsets of searchables inside a MultiSearcher/ParallelMultiSearcher instance's methods at runtime
Date Thu, 06 May 2010 01:54:48 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864619#action_12864619
] 

Edward Drapkin commented on LUCENE-2447:
----------------------------------------

I think that's where I have a misunderstanding and disagreement with the function of the multisearcher.
 Because it's mostly stateless, I can't help but thing that it's designed to be instantiated
on demand, every time that it needs to be used.  While with the "traditional" MultiSearcher,
this doesn't seem to be much of a problem (3us is no time at all), it gets progressively heavier
and more cumbersome to keep creating Executors every time with ParallelMultiSearcher.  OTOH,
given the ability to specify an Executor at object creation "solves" this issue, making ParallelMultiSearcher
almost as light as MultiSearcher itself.

Not to digress into architectural theory, but I have a disagreement with the way that these
classes are designed to be used.  In order to make MultiSearcher even functional, the application
needs to maintain its list of Searchables outside MultiSearcher itself; given the ability
to specify an Executor at construction time (for ParallelMultiSearcher), you're now maintaining
an array of Searchables (because we it's expensive and wasteful to create _those_ every time
they're needed) and a thread pool management object in the calling object.  This reeks of
state leakage to me, where the state of [Parallel]MultiSearcher is being maintained by an
external, calling object and is being re-created every time it's needed, violating encapsulation
conventions and practice.  Furthermore, (with the caveat that I'm relatively new to lucene)
MultiSearcher is itself a Searchable and this behavior is inconsistent with the way that I've
seen other Searchables handled.  I'm not sure how much sense it makes to trade maintaining
a reference to one Searchable (that encapsulates several) to maintaining a list/array/collection
of references to other Searchables... especially when you look into multi-threaded apps and
non-final collections of Searchables that may start to modify the state (however transient)
of your MultiSearcher outside of the class itself.

I'm not sure how clear it is from my previous comments and the code itself, but the idea behind
the patch was that the user (in this case, me) wouldn't be maintaining anything for the state
of the [Parallel]MultiSearcher except for the instance of the class itself.  Right now, it's
possible to do this (not keep any permanent references to anything that's fed into the MultiSearcher
constructor) but only if you intend to always search all of your Searchables.  The patch takes
a few steps in the direction of making keeping references to the Searchables outside of the
MultiSearcher unnecessary (although, come to think about it, if this is the direction this
class heads in, getSearchables() needs to return a [deep] clone of multiSearcher.searchables
rather than the array itself), but without any method defined in Searchable that allows you
to identify a Searchable without a reference, I'm not sure that there is much more that can
be done.

> Add support for subsets of searchables inside a MultiSearcher/ParallelMultiSearcher instance's
methods at runtime
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2447
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2447
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 3.0.1
>         Environment: Irrelevant
>            Reporter: Edward Drapkin
>            Priority: Minor
>         Attachments: LUCENE-2447-predicate.patch, LUCENE-2447.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Here's the situation: We have a site with a fair few amount of indexes that we're using
MultiSearcher/ParallelMultiSearcher for, but the users can select an arbitrary permutation
of indexes to search.  For example (contrived, but illustratory): the site has indexes numbered
1 - 10; user A wants to search in all 10; user B wants to search indexes 1, 2 and 3, user
C wants to search even-numbered indexes.  From Lucene 3.0.1, the only way to do this is to
continually instantiate a new MultiSearcher based on every permutation of indexes that a user
wants, which is not ideal at all.
> What I've done is add a new parameter to all methods in MultiSearcher that use the searchables
array (docFreq, search, rewrite and createDocFrequencyMap), a Set<Searchable> which
is checked for isEmpty() and contains() for every iteration over the searchables[].  The actual
logic has been moved into these methods and the old methods have become overloads that pass
a Collections.emptySet() into those methods, so I do not expect there to be a very noticeable
performance impact as a result of this modification, if it's measurable at all.
> I didn't modify the test for MultiSearcher very much, just enough to illustrate the that
subsetting of the search results works, since no other logic has changed.  If I need to do
more for the testing, let me know and I'll do it.
> I've attached the patches for MultiSearcher.java, ParallelMultiSearcher.java and TestMultiSearcher.java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message