lucene-java-user mailing list archives

From Jamie Johnson <jej2...@gmail.com>
Subject Filtered docs and positions enum
Date Fri, 14 Aug 2015 10:19:08 GMT
First, apologies for posting to both this list and the Solr list. I wasn't
sure where this is most appropriately asked, and since there has been no
response there I figured I'd try here...

I have what I believe to be a fairly unusual use case (I have not seen it
mentioned before) that I'm looking for some thoughts on.  I currently need
to filter terms based on a user's authorizations; the implementation is
currently based on
https://github.com/jej2003/lucure-core/blob/master/src/main/java/com/lucure/core/codec/AccessFilteredDocsAndPositionsEnum.java

The current implementation wraps a DocsAndPositionsEnum, but I am not sure
whether the way it handles freq() and positions for a particular term is a
problem.  Right now freq() is returned unmodified from the wrapped
DocsAndPositionsEnum.  When a caller calls nextPosition() and encounters a
term with authorizations they don't have, we simply skip it by calling
nextPosition() again on the wrapped DocsAndPositionsEnum.  In this scenario
we may have said, for instance, that freq() was 2 when the caller only had
access to 1 position.  There is no equivalent of the NO_MORE_DOCS constant
for positions, so once the visible positions run out we currently return -1
(though we're considering changing that to Integer.MAX_VALUE).  We've
already seen possible issues with this in the phrase scorer (which is why we
were considering Integer.MAX_VALUE).  The only real remedy I can see in the
current implementation is to get freq() right from the start, but I can't
see how to do that without processing all of the positions up front to
compute freq() for the user's authorizations.
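
To make the current behavior concrete, here is a minimal sketch of the lazy approach described above. It does not use Lucene: PositionSource, ArrayPositionSource, LazyFilteredPositions and NO_MORE_POSITIONS are hypothetical names invented for illustration, and the IntPredicate stands in for whatever authorization check the real implementation performs.

```java
import java.util.function.IntPredicate;

/** Hypothetical stand-in for the per-document view of a positions enum (not a Lucene type). */
interface PositionSource {
    int freq();          // number of positions reported for the current document
    int nextPosition();  // next position in the current document
}

/** In-memory source over a fixed list of positions, used only to exercise the wrapper. */
final class ArrayPositionSource implements PositionSource {
    private final int[] positions;
    private int next = 0;

    ArrayPositionSource(int... positions) { this.positions = positions; }

    public int freq() { return positions.length; }

    public int nextPosition() { return positions[next++]; }
}

/** Skips positions the caller may not see, but leaves freq() unfiltered. */
final class LazyFilteredPositions implements PositionSource {
    // Assumed sentinel: there is no such constant for positions in the scenario described.
    static final int NO_MORE_POSITIONS = Integer.MAX_VALUE;

    private final PositionSource in;
    private final IntPredicate visible; // assumption: authorization modeled as a predicate
    private int consumed = 0;           // how many wrapped positions have been read

    LazyFilteredPositions(PositionSource in, IntPredicate visible) {
        this.in = in;
        this.visible = visible;
    }

    // Unmodified from the wrapped source, so it can overstate what the caller will see.
    public int freq() { return in.freq(); }

    public int nextPosition() {
        while (consumed < in.freq()) {
            consumed++;
            int pos = in.nextPosition();
            if (visible.test(pos)) {
                return pos;
            }
        }
        return NO_MORE_POSITIONS; // visible positions ran out before freq() calls were made
    }
}
```

With wrapped positions 3 and 7 and position 3 hidden, freq() still reports 2, the first nextPosition() call returns 7, and the second returns the sentinel. Any caller that loops freq() times has to cope with that trailing value.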

OK, that was long, so now for the question.  Is returning a huge number (say
Integer.MAX_VALUE) from nextPosition() OK in situations like this?  Are there
specific places we should look at to verify?

I know that ideally we would get the frequencies correct for the given
authorizations, but if there are no negative consequences to the current
approach I would prefer to avoid the upfront processing.
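
For comparison, the upfront alternative might look roughly like the sketch below. It assumes the raw positions for one document are already available as an int array and that the authorization check can be expressed as an IntPredicate. BufferedFilteredPositions is a hypothetical name, not part of Lucene or of the linked implementation.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

/** Filters one document's positions up front so that freq() matches what the caller can see. */
final class BufferedFilteredPositions {
    private final int[] visible; // positions that passed the authorization check
    private int next = 0;

    BufferedFilteredPositions(int[] rawPositions, IntPredicate authorized) {
        // The upfront cost: every position is examined before freq() can be answered.
        this.visible = Arrays.stream(rawPositions).filter(authorized).toArray();
    }

    /** Number of positions the caller is actually allowed to see. */
    int freq() { return visible.length; }

    /** Returns the next visible position; valid for at most freq() calls. */
    int nextPosition() {
        if (next >= visible.length) {
            throw new IllegalStateException("nextPosition() called more than freq() times");
        }
        return visible[next++];
    }
}
```

Here freq() and nextPosition() agree, so no sentinel is needed, at the price of reading and buffering every position for each document. A document whose filtered freq() comes out as 0 would presumably need to be skipped at the document level, which this sketch does not cover.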

As always, any feedback would be appreciated.
