lucene-java-user mailing list archives

From "Michael Stoppelman" <stop...@gmail.com>
Subject Re: StandardTokenizer is slowing down highlighting a lot
Date Thu, 19 Jul 2007 01:14:09 GMT
Might be nice to add a line of documentation to the highlighter about the
possible performance hit when using StandardAnalyzer, which is probably a
common case. Thanks for the speedy response.

-M

On 7/18/07, Mark Miller <markrmiller@gmail.com> wrote:
>
> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
> limited by JavaCC speed. You cannot shave much more performance out of
> the grammar, as it is already about as simple as it gets. You should
> first see if you can get away without it and use a different Analyzer,
> or re-implement just the functionality you need in a custom Analyzer.
> Do you really need the support for abbreviations, companies, email
> addresses, etc.?
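>
> For example, if plain lowercased words are good enough, a minimal custom
> Analyzer is only a few lines (untested sketch against the 2.x analysis
> API; the class name is only illustrative):
>
>   import java.io.Reader;
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.LetterTokenizer;
>   import org.apache.lucene.analysis.LowerCaseFilter;
>   import org.apache.lucene.analysis.TokenStream;
>
>   // Letters-only tokens, lowercased. Much cheaper than StandardTokenizer,
>   // but drops its special handling of acronyms, companies, emails, etc.
>   public class PlainWordAnalyzer extends Analyzer {
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>       return new LowerCaseFilter(new LetterTokenizer(reader));
>     }
>   }
>
> (That particular combination is essentially what SimpleAnalyzer already
> gives you, so an existing Analyzer may do the job.)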
>
> If you really do need StandardAnalyzer's extra token types:
>
> You can use the TokenSources class in the highlighter package to rebuild
> a TokenStream without re-analyzing, provided you store term offsets and
> positions in the index. I have not found this to be hugely beneficial,
> even when the alternative is re-analyzing with StandardAnalyzer, but it
> certainly could be faster if your documents are large enough.
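>
> Roughly, the two pieces look like this (untested sketch against the 2.x
> APIs; "body" is just an example field name, and an open IndexReader, the
> hit's doc id, and the Query are assumed to be in hand):
>
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.document.Field;
>   import org.apache.lucene.index.IndexReader;
>   import org.apache.lucene.search.Query;
>   import org.apache.lucene.search.highlight.Highlighter;
>   import org.apache.lucene.search.highlight.QueryScorer;
>   import org.apache.lucene.search.highlight.TokenSources;
>
>   // Index time: store the text plus term positions and offsets so the
>   // highlighter can rebuild a TokenStream without re-analyzing.
>   Document buildDoc(String text) {
>     Document doc = new Document();
>     doc.add(new Field("body", text, Field.Store.YES,
>         Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>     return doc;
>   }
>
>   // Highlight time: rebuild the TokenStream from the stored term vector
>   // instead of running StandardAnalyzer over the raw text again.
>   String highlight(IndexReader reader, int docId, Query query)
>       throws Exception {
>     String stored = reader.document(docId).get("body");
>     TokenStream ts = TokenSources.getTokenStream(reader, docId, "body");
>     Highlighter highlighter = new Highlighter(new QueryScorer(query));
>     return highlighter.getBestFragments(ts, stored, 3, "...");
>   }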
>
> Your best bet is probably to use
> https://issues.apache.org/jira/browse/LUCENE-644, which is a
> non-positional Highlighter that finds offsets to highlight by looking up
> query term offset information in the index. For larger documents this
> can be much faster than using the standard contrib Highlighter, even if
> you're using TokenSources. LUCENE-644 has a much flatter curve than the
> contrib Highlighter as document size goes up.
>
> - Mark
>
> Michael Stoppelman wrote:
> > Hi all,
> >
> > I was tracking down slowness in the contrib highlighter code, and it
> > seems the seemingly simple tokenStream.next() call is the culprit.
> > I've seen multiple posts about this being a possible cause. Has anyone
> > looked into how to speed up StandardTokenizer? For my documents it's
> > taking about 70ms per document, which is a big ugh! I was thinking I
> > might just cache the TermVectors in memory if that will be faster.
> > Does anyone have another approach to solving this problem?
> >
> > -M
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
