lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
Date Thu, 23 Jun 2011 13:35:54 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053844#comment-13053844 ]

Michael McCandless commented on LUCENE-3225:
--------------------------------------------

bq. Mike this seems like a good improvement but I think letting a user change the behavior
of method X by passing true / false to method Y is no good. I think this is kind of error
prone plus it's cluttering the seek method though. One boolean is enough here. I think we
should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which does
only support exact matches and throws a clear exception if next is called. I know that makes
things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances
but I think that is better here.

Well, it only means the enum is unpositioned if you get back
NOT_FOUND?  Ie, it's just like if you get back null from next(), or
END from seek(): in these cases, the enum is unpositioned and you need
to call seek again.

My worry if we force an up-front decision here ("exact only" enum vs
"non-exact only enum") is we prevent legitimate use cases where the
caller wants to mix & match with one enum.

EG, when AutomatonQuery intersects w/ the terms, when it hits a
region where terms are denser than what the automaton will accept
(such as an "infinite" part), it should use exact seeking, but then
when it's in a region where terms are less dense (eg a "finite" part)
it should use non-exact seeking.... I'll open a separate issue for this.

The TermsEnum impls can be efficient in this case, ie re-using
internal seek state for the exact and non-exact cases (MemoryCodec does
this).

But I agree another boolean to seek isn't great; maybe instead we can
make a separate seekExact method?  Default impl would just call seek
(and get no perf gains).
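
A rough sketch of how a separate seekExact with a default impl might
look. This is not the Lucene API: SketchTermsEnum, SortedSetTermsEnum
and the String-based terms are hypothetical stand-ins, and leaving the
enum unpositioned after a miss is an assumption taken from the
discussion above.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for a terms enum; not the Lucene API.
abstract class SketchTermsEnum {
    enum SeekStatus { FOUND, NOT_FOUND, END }

    // Positions on the smallest term >= target (the "ceiling" term).
    abstract SeekStatus seek(String target);

    // Current term, or null if the enum is unpositioned.
    abstract String term();

    // Default impl: delegate to seek(). Correct, but no perf gain.
    boolean seekExact(String target) {
        return seek(target) == SeekStatus.FOUND;
    }
}

// Hypothetical impl that can answer exact lookups without finding the ceiling term.
class SortedSetTermsEnum extends SketchTermsEnum {
    private final TreeSet<String> terms;
    private String current;

    SortedSetTermsEnum(List<String> terms) {
        this.terms = new TreeSet<>(terms);
    }

    @Override
    SeekStatus seek(String target) {
        current = terms.ceiling(target);
        if (current == null) {
            return SeekStatus.END;
        }
        return current.equals(target) ? SeekStatus.FOUND : SeekStatus.NOT_FOUND;
    }

    @Override
    String term() {
        return current;
    }

    // Exact-only lookup: on a miss the enum is left unpositioned (assumption),
    // so the caller must seek again before using it.
    @Override
    boolean seekExact(String target) {
        current = terms.contains(target) ? target : null;
        return current != null;
    }
}
```

With this shape one enum can mix both styles: seekExact where only a
yes/no answer is needed, seek where the following term matters.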

BTW, similarly, I think we have a missing API in DISI (for
scoring): advance always does a next() if the target doc doesn't
match.  But we could get substantial performance gains in some cases
(see LUCENE-1536) if we had an advanceExact that would not do the
next and simply tell us if this doc matched or not.
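
To make the difference concrete, here is a minimal sketch over a
java.util.BitSet. SketchBitSetIterator is a hypothetical class, not
Lucene's DocIdSetIterator, and treating the iterator as unpositioned
(-1) after a miss is an assumption.

```java
import java.util.BitSet;

// Hypothetical doc-ID iterator backed by a bit set; not Lucene's DocIdSetIterator.
class SketchBitSetIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final BitSet bits;
    private int doc = -1; // -1 means unpositioned

    SketchBitSetIterator(BitSet bits) {
        this.bits = bits;
    }

    // Moves to the first matching doc >= target, scanning forward on a miss.
    int advance(int target) {
        int next = bits.nextSetBit(target);
        doc = next < 0 ? NO_MORE_DOCS : next;
        return doc;
    }

    // Only answers "does target match?"; never scans for the next match.
    // On a miss the iterator is left unpositioned (assumption).
    boolean advanceExact(int target) {
        boolean hit = bits.get(target);
        doc = hit ? target : -1;
        return hit;
    }

    int docID() {
        return doc;
    }
}
```

Here advance may walk an arbitrary number of bits to find the next
match, while advanceExact is a single bit test.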

bq. Can we somehow leave the extra CPU work to the term() call and make this entirely lazy?

Not sure what you meant here?


> Optimize TermsEnum.seek when caller doesn't need next term
> ----------------------------------------------------------
>
>                 Key: LUCENE-3225
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3225
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-3225.patch
>
>
> Some codecs are able to save CPU if the caller is only interested in
> exact matches.  EG, Memory codec and SimpleText can do more efficient
> FSTEnum lookup if they know the caller doesn't need to know the term
> following the seek term.
> We have cases like this in Lucene, eg when IW deletes documents by
> Term, if the term is not found in a given segment then it doesn't need
> to know the ceiling term.  Likewise when TermQuery looks up the term
> in each segment.
> I had done this change as part of LUCENE-3030, which is a new terms
> index that's able to save seeking for exact-only lookups, but now that
> we have Memory codec that can also save CPU I think we should commit
> this today.
> The change adds a "boolean onlyExact" param to seek(BytesRef).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
