lucene-dev mailing list archives

From "Tim Smith (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"
Date Mon, 24 Aug 2009 10:44:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746809#action_12746809
] 

Tim Smith commented on LUCENE-1821:
-----------------------------------

bq. Actually sorting (during collection) already gives you the docBase so shouldn't your app
already have the context needed for this?

Yes, I get the docBase and all during collection, so doing sorting with a top-level cache
will be no problem.
I was mainly using sorting as an example of some of the pain caused by per-segment searching/caches
(the Collector API makes it easy enough to do sorting
on the top level or per segment, so I'm not concerned about integration here).
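To illustrate the docBase bookkeeping, here is a minimal self-contained sketch (the class and method names are illustrative, not Lucene API; in real 2.9 the docBase arrives via Collector.setNextReader(IndexReader, int docBase)) of how segment-relative doc ids get rebased to top-level ids so a top-level cache can be indexed:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of per-segment collection: rebase segment-local doc ids to
// top-level ids using the docBase of the current segment.
public class RebasingCollector {
    private final List<Integer> topLevelDocs = new ArrayList<Integer>();
    private int docBase; // offset of the current segment in the top-level reader

    // analogous to Collector.setNextReader(reader, docBase)
    public void setNextSegment(int docBase) {
        this.docBase = docBase;
    }

    // analogous to Collector.collect(doc): doc is segment-relative
    public void collect(int segmentDoc) {
        // a top-level cache would be indexed with (docBase + segmentDoc)
        topLevelDocs.add(docBase + segmentDoc);
    }

    public List<Integer> getTopLevelDocs() {
        return topLevelDocs;
    }

    public static void main(String[] args) {
        RebasingCollector c = new RebasingCollector();
        c.setNextSegment(0);   // first segment starts at top-level doc 0
        c.collect(3);
        c.setNextSegment(100); // first segment had maxDoc() == 100
        c.collect(3);
        System.out.println(c.getTopLevelDocs()); // [3, 103]
    }
}
```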

For my app, I plan to allow sorting to be either "per-segment" or "top-level" in order to
let people choose their poison: faster commit/less memory vs. faster sorting.
I plan to do the same for faceting.
Certain features will always require a top-level cache (but those are advanced features anyway
and should be expected to have an impact on commit time/first-search time).

bq. Hmm... is advance in fact costly for your DocIdSets?

Think how costly it would be to do advance() for the SortedVInt DocIdSet (a linear scan over
compressed values).
For a bitset, advance is practically instantaneous, but to conserve memory it's better to use a sorted int[]
(or the SortedVInt stuff 2.9 provides).
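The cost difference can be sketched with plain JDK types standing in for the Lucene ones (java.util.BitSet for OpenBitSet, a sorted int[] for SortedVIntList; a VInt-compressed list would be strictly worse than the int[], since it can only be decoded linearly from the front):

```java
import java.util.Arrays;
import java.util.BitSet;

// Why advance() cost depends on the DocIdSet representation.
public class AdvanceCost {

    // advance() over a bitset: jump straight to the first set bit >= target
    static int advanceBitSet(BitSet bits, int target) {
        return bits.nextSetBit(target); // -1 when exhausted
    }

    // advance() over a sorted int[]: binary search; a VInt-compressed list
    // cannot even do this and must scan linearly
    static int advanceSortedInts(int[] docs, int target) {
        int idx = Arrays.binarySearch(docs, target);
        if (idx < 0) idx = -idx - 1; // insertion point of target
        return idx < docs.length ? docs[idx] : -1;
    }

    public static void main(String[] args) {
        int[] docs = {2, 7, 42, 90};
        BitSet bits = new BitSet();
        for (int d : docs) bits.set(d);

        System.out.println(advanceBitSet(bits, 10));     // 42
        System.out.println(advanceSortedInts(docs, 10)); // 42
    }
}
```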

In the end, I plan to bucketize the collected docs per segment, so hopefully this should
be less of an issue.
The nice thing about that approach is that I can have a bitset for one segment (lots of matches
in this segment) and a very small int[] for a different segment, based on the matches per segment.
The biggest difficulty is doing the mapping to the per-segment "DocIdSet" (which will probably
have to be slower).
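The per-segment bucketizing idea can be sketched like so; the density threshold (maxDoc / 32, the rough break-even against one bit per doc) is an illustrative heuristic, not something Lucene prescribes:

```java
import java.util.BitSet;

// Per-segment bucket: keep a segment's matches as a bitset when dense,
// or as a small sorted int[] when sparse.
public class SegmentBucket {
    final BitSet denseBits;  // non-null when matches are dense
    final int[] sparseDocs;  // non-null when matches are sparse

    SegmentBucket(int[] sortedMatches, int maxDoc) {
        // break-even heuristic: a bitset costs maxDoc bits, an int[] costs
        // 32 bits per match, so switch to the bitset past maxDoc/32 matches
        if (sortedMatches.length > maxDoc / 32) {
            BitSet bits = new BitSet(maxDoc);
            for (int d : sortedMatches) bits.set(d);
            denseBits = bits;
            sparseDocs = null;
        } else {
            denseBits = null;
            sparseDocs = sortedMatches;
        }
    }

    boolean isDense() { return denseBits != null; }

    public static void main(String[] args) {
        // segment A: 6 matches out of 10 docs -> bitset
        SegmentBucket a = new SegmentBucket(new int[]{0, 1, 2, 5, 8, 9}, 10);
        // segment B: 2 matches out of 10000 docs -> small int[]
        SegmentBucket b = new SegmentBucket(new int[]{17, 4242}, 10000);
        System.out.println(a.isDense() + " " + b.isDense()); // true false
    }
}
```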

bq. this one method would allow you to not have to subclass IndexSearcher.

I already have to subclass IndexSearcher (I do a lot of extra stuff).
However, IndexSearcher doesn't provide any protected access to its sub readers and doc
starts, so I have to gather these myself in my subclass's constructor (in the same way IndexSearcher
does).

I would really like to see getIndexReaderBase() added to 2.9's IndexSearcher.
I would also like to see the subreaders and docstarts either made protected or given protected
accessor methods (so I don't have to recreate the same set of sub readers, and make sure I
do this the same way for future versions of Lucene).
Would also be nice to see a protected constructor on IndexSearcher like so:
{code}
  protected IndexSearcher(IndexReader reader, IndexReader[] subReaders, int[] docStarts) {
   ...
  }
{code}

This would allow creating "temporary" IndexSearchers much faster (no need to gather sub
readers).
This would allow:
* easily creating an IndexSearcher that is "top-level" (subReaders would have length 1 and just
contain the reader)
* creating a "temporary" IndexSearcher off another IndexSearcher that carries some "short-lived"
context (I have this use case)
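A self-contained toy model of what that proposed constructor would enable (Reader here is a stub standing in for IndexReader; none of this is actual Lucene API):

```java
// Toy model of the proposed three-argument IndexSearcher constructor.
public class SearcherSketch {
    // stub standing in for IndexReader
    static class Reader {
        final int maxDoc;
        Reader(int maxDoc) { this.maxDoc = maxDoc; }
    }

    final Reader reader;
    final Reader[] subReaders;
    final int[] docStarts;

    // the proposed constructor: the caller supplies already-gathered
    // sub readers and doc starts instead of re-deriving them
    SearcherSketch(Reader reader, Reader[] subReaders, int[] docStarts) {
        this.reader = reader;
        this.subReaders = subReaders;
        this.docStarts = docStarts;
    }

    // use case 1: a "top-level" searcher whose only sub reader is the
    // top-level reader itself
    static SearcherSketch topLevel(Reader reader) {
        return new SearcherSketch(reader, new Reader[]{reader}, new int[]{0});
    }

    // use case 2: a cheap "temporary" searcher cloned from another one,
    // reusing its precomputed sub readers and doc starts
    static SearcherSketch temporary(SearcherSketch other) {
        return new SearcherSketch(other.reader, other.subReaders, other.docStarts);
    }

    public static void main(String[] args) {
        SearcherSketch top = topLevel(new Reader(1000));
        System.out.println(top.subReaders.length); // 1
    }
}
```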




> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per-segment basis, there is no way for a Scorer to know
the "actual" doc id for the documents it matches (only the relative doc offset into the segment).
> If you use caches in your Scorer that are based on the "entire" index (all segments), there
is now no way to index into them properly from inside a Scorer, because the Scorer is not passed
the offset needed to calculate the "real" docid.
> Suggest having the Weight.scorer() method also take an integer for the doc offset.
> The abstract Weight class should have a constructor that takes this offset, as well as a method
to get the offset.
> All Weights that have "sub" weights must pass this offset down to created "sub" weights.
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * during Weight creation, the Weight must hold onto a reference to the passed in Searcher
(casted to your sub class)
> * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: a more efficient implementation can cache the result of gatherSubReaders
> // in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   }
>   List<IndexReader> readers = new ArrayList<IndexReader>();
>   gatherSubReaders(readers);
>   int maxDoc = 0;
>   for (IndexReader r : readers) {
>     if (r == reader) {
>       return maxDoc; // doc start of the matching sub reader
>     }
>     maxDoc += r.maxDoc();
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

