lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Stoppelman" <stop...@gmail.com>
Subject Re: StandardTokenizer is slowing down highlighting a lot
Date Thu, 19 Jul 2007 18:28:23 GMT
On 7/19/07, Mark Miller <markrmiller@gmail.com> wrote:
>
> I think it goes without saying that a semi-complex NFA or DFA is going
> to be quite a bit slower than say, breaking on whitespace. Not that I am
> against such a warning.


This is true to those very familiar with the code base and the Tokenizer
source code. I think having a comment
about using complex a semi-complex NFA/DFA possibly being a major
performance hit in the highlighting code
would save others time, imho.

To support my point on writing a custom solution that is more exact
> towards your needs:
>
> If you just remove the <NUM> recognizer in StandardTokenizer.jj you will
> gain 20-25% speed in my tests of small and large documents.
>
> Limiting what is considered a letter to just the language/encodings you
> need might also get some good returns.


Both good ideas. I just released that the tokenizer for highlighting doesn't
need
to be the same as the tokenizer for indexing so I can make the highlighting
tokenizer
much simpler. Everything will be fast and happy soon.

-M

- Mark
>
> Michael Stoppelman wrote:
> > Might be nice to add a line of documentation to the highlighter on the
> > possible
> > performance hit if one uses StandardAnalyzer which probably is a common
> > case.
> > Thanks for the speedy response.
> >
> > -M
> >
> > On 7/18/07, Mark Miller <markrmiller@gmail.com> wrote:
> >>
> >> Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
> >> limited by JavaCC speed. You cannot shave much more performance out of
> >> the grammar as it is already about as simple as it gets. You should
> >> first see if you can get away without it and use a different Analyzer,
> >> or if you can re-implement just the functionality you need in a custom
> >> Analyzer. Do you really need the support for abbreviations, companies,
> >> email address, etc?
> >>
> >> If so:
> >>
> >> You can use the TokenSources class in the highlighter package to
> rebuild
> >> a TokenStream without re-analyzing if you store term offsets and
> >> positions in the index. I have not found this to be super beneficial,
> >> even when using the StandardAnalyzer to re-analyze, but it certainly
> >> could be faster if you have large enough documents.
> >>
> >> Your best bet is probably to use
> >> https://issues.apache.org/jira/browse/LUCENE-644, which is a
> >> non-positional Highlighter that finds offsets to highlight by looking
> up
> >> query term offset information in the index. For larger documents this
> >> can be much faster than using the standard contrib Highlighter, even if
> >> your using TokenSources. LUCENE-644 has a much flatter curve than the
> >> contrib Highlighter as document size goes up.
> >>
> >> - Mark
> >>
> >> Michael Stoppelman wrote:
> >> > Hi all,
> >> >
> >> > I was tracking down slowness in the contrib highlighter code and it
> >> seems
> >> > the seemingly simple tokenStream.next() is the culprit.
> >> > I've seen multiple posts about this being a possible cause. Has
> anyone
> >> > looked into how to speed up StandardTokenizer? For my
> >> > documents it's taking about 70ms per document that's a big ugh! I was
> >> > thinking I might just cache the TermVectors in memory if
> >> > that will be faster. Anyone have another approach to solving this
> >> > problem?
> >> >
> >> > -M
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message