lucene-java-user mailing list archives

From: Mark Miller <>
Subject: Re: StandardTokenizer is slowing down highlighting a lot
Date: Thu, 19 Jul 2007 15:16:49 GMT
I think it goes without saying that a semi-complex NFA or DFA is going
to be quite a bit slower than, say, breaking on whitespace. Not that I
am against adding such a warning.

To support my point about writing a custom solution tailored more
exactly to your needs:

If you just remove the <NUM> recognizer in StandardTokenizer.jj, you
will gain 20-25% speed in my tests on both small and large documents.

Limiting what is considered a letter to just the languages/encodings
you need might also yield good returns.
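
As a rough, untested sketch of what I mean (assuming the 2.x
CharTokenizer/Analyzer API; the class name is made up, and you would
adjust the character test to whatever your corpus actually contains):

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.CharTokenizer;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.TokenStream;

  public class SimpleLatinAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      // Break on anything that is not a plain ASCII letter or digit,
      // skipping the JavaCC NFA/DFA machinery entirely.
      TokenStream ts = new CharTokenizer(reader) {
        protected boolean isTokenChar(char c) {
          return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
              || (c >= '0' && c <= '9');
        }
      };
      return new LowerCaseFilter(ts);
    }
  }

Of course, this drops abbreviations, e-mail addresses, etc., so it only
works if you do not need them.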

- Mark

Michael Stoppelman wrote:
> It might be nice to add a line of documentation to the highlighter
> about the possible performance hit when one uses StandardAnalyzer,
> which is probably a common case.
> Thanks for the speedy response.
> -M
> On 7/18/07, Mark Miller <> wrote:
>> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
>> limited by JavaCC speed. You cannot shave much more performance out of
>> the grammar, as it is already about as simple as it gets. You should
>> first see if you can get away without it and use a different Analyzer,
>> or if you can re-implement just the functionality you need in a custom
>> Analyzer. Do you really need the support for abbreviations, companies,
>> e-mail addresses, etc.?
>> If so:
>> You can use the TokenSources class in the highlighter package to rebuild
>> a TokenStream without re-analyzing if you store term offsets and
>> positions in the index. I have not found this to be super beneficial,
>> even when using the StandardAnalyzer to re-analyze, but it certainly
>> could be faster if you have large enough documents.
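>>
>> As a rough, untested sketch of that flow (assuming the 2.x-era Field
>> and contrib-highlighter APIs; the field name "body" and the variables
>> reader, docId, query, analyzer, and text are placeholders):
>>
>>   import org.apache.lucene.analysis.TokenStream;
>>   import org.apache.lucene.document.Document;
>>   import org.apache.lucene.document.Field;
>>   import org.apache.lucene.search.highlight.Highlighter;
>>   import org.apache.lucene.search.highlight.QueryScorer;
>>   import org.apache.lucene.search.highlight.TokenSources;
>>
>>   // Index time: store the text and term vectors with positions+offsets.
>>   Document doc = new Document();
>>   doc.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED,
>>       Field.TermVector.WITH_POSITIONS_OFFSETS));
>>
>>   // Highlight time: rebuild the TokenStream from the stored term
>>   // vector instead of re-analyzing; getAnyTokenStream falls back to
>>   // re-analyzing with the supplied Analyzer if no vector is stored.
>>   TokenStream ts =
>>       TokenSources.getAnyTokenStream(reader, docId, "body", analyzer);
>>   Highlighter highlighter = new Highlighter(new QueryScorer(query));
>>   String fragment = highlighter.getBestFragment(ts, text);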
>> Your best bet is probably to use the patch from LUCENE-644, which is
>> a non-positional Highlighter that finds offsets to highlight by
>> looking up query term offset information in the index. For larger
>> documents this can be much faster than using the standard contrib
>> Highlighter, even if you're using TokenSources. LUCENE-644 has a much
>> flatter curve than the contrib Highlighter as document size goes up.
>> - Mark
>> Michael Stoppelman wrote:
>> > Hi all,
>> >
>> > I was tracking down slowness in the contrib highlighter code, and
>> > it seems the seemingly simple StandardTokenizer is the culprit.
>> > I've seen multiple posts about this being a possible cause. Has
>> > anyone looked into how to speed up StandardTokenizer? For my
>> > documents it's taking about 70ms per document, which is a big ugh!
>> > I was thinking I might just cache the TermVectors in memory if that
>> > will be faster. Anyone have another approach to solving this
>> > problem?
>> >
>> > -M
>> >