lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <>
Subject Re: StandardTokenizer is slowing down highlighting a lot
Date Thu, 19 Jul 2007 00:57:48 GMT
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really 
limited by JavaCC speed. You cannot shave much more performance out of 
the grammar as it is already about as simple as it gets. You should 
first see if you can get away without it and use a different Analyzer, 
or if you can re-implement just the functionality you need in a custom 
Analyzer. Do you really need the support for abbreviations, companies, 
email address, etc?

If so:

You can use the TokenSources class in the highlighter package to rebuild 
a TokenStream without re-analyzing if you store term offsets and 
positions in the index. I have not found this to be super beneficial, 
even when using the StandardAnalyzer to re-analyze, but it certainly 
could be faster if you have large enough documents.

Your best bet is probably to use, which is a 
non-positional Highlighter that finds offsets to highlight by looking up 
query term offset information in the index. For larger documents this 
can be much faster than using the standard contrib Highlighter, even if 
your using TokenSources. LUCENE-644 has a much flatter curve than the 
contrib Highlighter as document size goes up.

- Mark

Michael Stoppelman wrote:
> Hi all,
> I was tracking down slowness in the contrib highlighter code and it seems
> the seemingly simple is the culprit.
> I've seen multiple posts about this being a possible cause. Has anyone
> looked into how to speed up StandardTokenizer? For my
> documents it's taking about 70ms per document that's a big ugh! I was
> thinking I might just cache the TermVectors in memory if
> that will be faster. Anyone have another approach to solving this 
> problem?
> -M

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message