lucene-dev mailing list archives

From "Tim Smith (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"
Date Wed, 19 Aug 2009 00:12:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744793#action_12744793 ]

Tim Smith commented on LUCENE-1821:
-----------------------------------

My current plan of attack for this use case is to:
* pull the cache using the MultiReader at createWeight() time (the cache will be indexed by MultiReader docid)
* pull the base offset for the segment's IndexReader at scorer() creation time (I will need to add a getIndexReaderBase() method to my searcher to do so)
* when the scorer needs to hit the cache, add the base to the scorer's segment-relative docid to get the key for the cache lookup
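
The arithmetic behind these steps can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene code: segmentBases() is a hypothetical stand-in for what the proposed getIndexReaderBase() method would return, and the float[] cache is an invented example of a cache keyed by MultiReader docid.

```java
import java.util.Arrays;

// Illustration only: no Lucene classes are used here.
class SegmentBaseSketch {

    // Hypothetical stand-in for getIndexReaderBase(): the base of each
    // segment is the sum of the maxDoc values of the segments before it.
    static int[] segmentBases(int[] segmentMaxDocs) {
        int[] bases = new int[segmentMaxDocs.length];
        int base = 0;
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            bases[i] = base;
            base += segmentMaxDocs[i];
        }
        return bases;
    }

    // What the scorer would do on a cache hit: add the segment base to its
    // segment-relative docid to get the key into the top-level cache.
    static float cachedValue(float[] topLevelCache, int segmentBase, int segmentDocId) {
        return topLevelCache[segmentBase + segmentDocId];
    }

    public static void main(String[] args) {
        int[] maxDocs = {3, 2, 4};            // three segments, 9 docs in total
        int[] bases = segmentBases(maxDocs);

        // Assumed cache built once against the whole index (one entry per doc).
        float[] cache = new float[9];
        for (int i = 0; i < cache.length; i++) {
            cache[i] = i * 10f;
        }

        System.out.println(Arrays.toString(bases));          // [0, 3, 5]
        // Doc 1 of the second segment is doc 3 + 1 = 4 of the whole index.
        System.out.println(cachedValue(cache, bases[1], 1)); // 40.0
    }
}
```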

I should be able to do this easily enough with a customized IndexSearcher (subclass).

There are use cases where documents from one segment need to be aware of documents from other segments:
* sorting is one such use case (this is done purely at the Collector level, so there are more hooks for the needed base offset handling)
* duplicate removal is another (only return the first document among docs sharing a field value)

Both of these use cases can be handled at the Collector level. However, duplicate removal could also be done at the Query level, which would allow it to be applied at any point in query matching.
Also, efficient duplicate removal for a String field requires the int[] ord index in order to keep overall memory requirements down.
With the int[] ord index, a BitSet can serve as the "seen" set that marks whether a document with a given value has already been encountered (otherwise a HashSet<String> would be needed (ugh)).
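
As a rough sketch of the BitSet idea, assuming an int[] that maps each top-level docid to the ordinal of its field value (the names ords, numOrds, and firstDocPerValue are made up for this example and are not Lucene API):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustration only: duplicate removal keyed on value ordinals.
class OrdDedupSketch {

    // Keeps the first matching doc for each distinct field value.
    // ords[doc] is the assumed ordinal of the doc's value, in [0, numOrds).
    static List<Integer> firstDocPerValue(int[] ords, int numOrds, int[] matchingDocs) {
        BitSet seen = new BitSet(numOrds);    // one bit per distinct value
        List<Integer> kept = new ArrayList<>();
        for (int doc : matchingDocs) {
            int ord = ords[doc];
            if (!seen.get(ord)) {             // first doc with this value
                seen.set(ord);
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int[] ords = {0, 1, 0, 2, 1};         // docs 0 and 2 share a value, as do 1 and 4
        int[] matching = {0, 1, 2, 3, 4};
        System.out.println(firstDocPerValue(ords, 3, matching)); // [0, 1, 3]
    }
}
```

The BitSet costs one bit per distinct value, while a HashSet<String> would hold a reference to every value seen plus per-entry overhead. Note that the ords array here is indexed by top-level docid, which is why a scorer working with segment-relative docids would need the base offset.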

My particular use case must be done at the Query level in order to have full boolean query support and the ability to layer multiple queries with all combinations of AND/OR/NOT and any other query operators. Sadly, I have yet to come up with any way to create the cache at a per-segment level (without creating the cache at the MultiReader level).


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per-segment basis, there is no way for a Scorer to know the "actual" doc id for the documents it matches (only the relative doc offset into the segment).
> If using caches in your scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer, because the scorer is not passed the offset needed to calculate the "real" docid.
> Suggest having the Weight.scorer() method also take an integer for the doc offset.
> The abstract Weight class should have a constructor that takes this offset, as well as a method to get the offset.
> All Weights that have "sub" weights must pass this offset down to the "sub" weights they create.
