lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2829) improve termquery "pk lookup" performance
Date Wed, 22 Dec 2010 14:47:01 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974223#action_12974223 ]

Robert Muir commented on LUCENE-2829:
-------------------------------------

bq. edit: and as robert previously pointed out, if we cached misses as well, then we could
avoid needless seeks on segments that don't contain the term.

True, this is a good idea, just a little trickier:
* In trunk, we have TermsEnum.seek(BytesRef text, boolean useCache), defaulting to true.
* FilteredTermsEnum passes false here, so the multitermqueries don't populate the cache with
  garbage while enumerating (e.g. foo*); they populate it only explicitly at the end, with
  cacheTerm() (per-segment), for the terms that were actually accepted. They sum up their
  docFreq themselves to prevent the first wasted seek in TermQuery.
* So this solution would make MTQ worse: in the second wasted seek (the docsenum) it would
  cause them to trash the caches, which they do not do today, with negative entries for the
  segments where the term doesn't exist. Today they do this wasted seek, but they don't
  trash the cache here. The only solution to prevent that is the PerReaderTermState
  (or something equally complicated).
* We would have to look at other places where negative entries would hurt; for example,
  rebuilding spellcheck indexes uses a 'termExists()' method implemented with docFreq.
  So we would likely have to change spellcheck's code to use a TermsEnum and
  seek(term, false)... Using a termsenum in parallel with the spellcheck dictionary would
  obviously be more efficient for the index-based spellcheck case (forget about caching)
  than docFreq()'ing every term... *but* we cannot assume the spellcheck "Dictionary"
  is actually in term order (imagine the File-based dictionary case), so we can't
  implement this today.
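
To make the trade-off above concrete, here is a rough, self-contained sketch of a per-segment term cache that also stores misses. It is not Lucene code: ToySegment, MISSING, dictionarySeeks and cacheSize() are names made up for illustration, and having seek(term, false) skip the cache entirely is an assumption of this toy, loosely modeled on the useCache flag described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for one segment's term dictionary plus its term cache (hypothetical, not Lucene's API). */
class ToySegment {
    static final int MISSING = -1; // sentinel docFreq used for a cached miss (negative entry)

    private final Map<String, Integer> terms = new HashMap<>(); // term -> docFreq
    private final Map<String, Integer> cache = new HashMap<>(); // term -> docFreq or MISSING
    int dictionarySeeks = 0; // counts lookups that had to go to the "real" dictionary

    ToySegment add(String term, int docFreq) {
        terms.put(term, docFreq);
        return this;
    }

    /** Returns the term's docFreq, or MISSING if the segment does not contain it. */
    int seek(String term, boolean useCache) {
        if (useCache) {
            Integer cached = cache.get(term);
            if (cached != null) {
                return cached; // hit, possibly a negative entry: no dictionary seek
            }
        }
        dictionarySeeks++;
        Integer docFreq = terms.get(term);
        int result = (docFreq == null) ? MISSING : docFreq;
        if (useCache) {
            cache.put(term, result); // misses are cached too
        }
        return result;
    }

    int cacheSize() {
        return cache.size();
    }

    public static void main(String[] args) {
        ToySegment seg = new ToySegment().add("id:42", 1);
        seg.seek("id:7", true); // miss, goes to the dictionary and leaves a negative entry
        seg.seek("id:7", true); // answered from the negative entry
        System.out.println("dictionary seeks: " + seg.dictionarySeeks);
    }
}
```

In this toy, a repeated lookup of an absent primary key costs one dictionary seek per segment instead of two, which is the benefit of caching misses. The cost is that any caller seeking absent terms with useCache=true fills the cache with negative entries, which is why enumerating callers would need to pass false.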

On 3.x I think it's slightly less complicated, as there is already a hack in the cache to
prevent sequential termsenums from trashing it (e.g. foo*), and pretty much all the MTQs
just enumerate sequentially anyway... (except NRQ, which doesn't enumerate many terms
anyway, so it's likely not a problem).

But we would have to at least fix the spellcheck case there too I think.

Not saying I don't like your idea... just saying there's more work to do it.


> improve termquery "pk lookup" performance
> -----------------------------------------
>
>                 Key: LUCENE-2829
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2829
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Robert Muir
>         Attachments: LUCENE-2829.patch
>
>
> For things that are like primary keys and don't exist in some segments (worst case is
> primary/unique key that only exists in 1)
> we do wasted seeks.
> While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned about
> whether we could ever backport that to 3.1, for example.
> This is a simpler solution here just to solve this one problem in termquery... we could
> just revert it in trunk when we resolve LUCENE-2694,
> but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

