lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex
Date Sun, 27 Feb 2011 04:45:04 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999880#comment-12999880 ]

Robert Muir commented on LUCENE-2939:
-------------------------------------

I don't know why you get this NullPointerException (maybe you triggered a bug), but...

just a quick glance:
# Why use offsets for this calculation? This seems a bit dangerous versus other approaches.
# Either way, the reset() method should clear any state, such as counters, in the TokenStream.

As for what I meant above: maxDocCharsToAnalyze as a whole seems like the wrong measure.
Why not specify the limit as a maximum number of tokens and use LimitTokenCountAnalyzer,
which is already implemented?
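
To illustrate the idea of a token budget, here is a minimal plain-Java sketch. It is not the LimitTokenCountAnalyzer implementation and does not use Lucene at all; the class TokenLimitSketch, the method limitByTokens, and the whitespace tokenization are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenLimitSketch {
    // Hypothetical helper: keep at most maxTokens whitespace-separated tokens.
    // Whole tokens are either kept or dropped, so no word is ever cut in half.
    static List<String> limitByTokens(String text, int maxTokens) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (kept.size() >= maxTokens) {
                break; // budget exhausted: stop consuming input
            }
            if (!token.isEmpty()) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String doc = "highlighting huge documents can be slow";
        System.out.println(limitByTokens(doc, 3)); // [highlighting, huge, documents]
    }
}
```

The point of the sketch is only that the cutoff falls on a token boundary, whatever the tokenizer is.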

Cutting at arbitrary chars and offsets is going to create fake tokens (e.g. truncated words) and
other problems.
Besides, it's not Unicode-safe, since a code point might span multiple chars.
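
Both problems can be reproduced in plain Java without Lucene. This is a rough sketch, not highlighter code: the class CharCutSketch and the method cutAtChars are hypothetical names, and the cut is a simple substring at a char budget.

```java
public class CharCutSketch {
    // Hypothetical helper: cut the text at a fixed budget of Java chars (UTF-16 units).
    static String cutAtChars(String text, int maxChars) {
        return text.substring(0, Math.min(maxChars, text.length()));
    }

    public static void main(String[] args) {
        // 1) A char budget can end in the middle of a word, producing a fake token.
        String words = "highlighting huge documents";
        System.out.println(cutAtChars(words, 15)); // highlighting hu

        // 2) U+1D11E (musical G clef) is one code point but two Java chars.
        String clef = "a\uD834\uDD1E";
        System.out.println(clef.length());                         // 3
        System.out.println(clef.codePointCount(0, clef.length())); // 2

        // Cutting at 2 chars leaves a dangling high surrogate at the end.
        String cut = cutAtChars(clef, 2);
        System.out.println(Character.isHighSurrogate(cut.charAt(cut.length() - 1))); // true
    }
}
```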


> Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2939
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2939
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Minor
>         Attachments: LUCENE-2939.patch
>
>
> huge documents can be drastically slower than need be because the entire field is added to the memory index
> this cost can be greatly reduced in many cases if we try and respect maxDocCharsToAnalyze
> the cost is still not fantastic, but is at least improved in many situations and can be influenced with this change


        


