lucene-dev mailing list archives

From "Stanislaw Osinski (JIRA)" <>
Subject [jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
Date Thu, 02 Aug 2007 21:09:52 GMT


Stanislaw Osinski commented on LUCENE-966:

OK -- only now did I realize I made a really silly mistake :) When using Mark's examples, I
somehow took the ",type" substring as part of the token image, which made the JavaCC tokenizer
look "buggy"...  Apologies for the confusion; tomorrow morning I'll correct my tests
and see what's happening.

One more important clarification -- the tokenizer from the last patch (jflex-analyzer-r561693-compatibility.txt)
has a completely different definition of the <NUM> token -- it allows digits in any
segment, hence the totally different results. If we want to be compatible with the StandardAnalyzer,
we should forget about that patch.

Mark -- have you tried the jflex-analyzer-r560135-patch.txt patch with your Wikipedia diff
test? That's the early one whose grammar was translated "dot for dot" from the original JavaCC
spec -- in later patches I made some "optimizations", which seem to have broken compatibility...

Incidentally, what was the motivation for requiring the <NUM> token to contain digits
in every second segment rather than in any segment?
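
To make the difference between the two <NUM> definitions concrete, here is a minimal Java sketch. It is not the grammar from any of the patches or from the JavaCC spec: the class NumTokenSketch, its method names, the ASCII-only segments and the separator set (_ - / . ,) are all assumptions made for illustration, and it checks a whole string rather than tokenizing a stream.

```java
import java.util.regex.Pattern;

// Illustrative only: a whole-string check, not a tokenizer.
class NumTokenSketch {
    // Assumed separator characters between segments: _ - / . ,
    private static final Pattern SEPARATOR = Pattern.compile("[_\\-/.,]");
    // Assumed segment shape: one or more ASCII letters or digits.
    private static final Pattern SEGMENT = Pattern.compile("[A-Za-z0-9]+");
    private static final Pattern DIGIT = Pattern.compile("[0-9]");

    // Splits into segments; returns null unless there are at least two
    // well-formed segments.
    private static String[] segments(String s) {
        String[] parts = SEPARATOR.split(s, -1);
        if (parts.length < 2) {
            return null;
        }
        for (String p : parts) {
            if (!SEGMENT.matcher(p).matches()) {
                return null;
            }
        }
        return parts;
    }

    private static boolean hasDigit(String segment) {
        return DIGIT.matcher(segment).find();
    }

    // True if every second segment, counting from index 'start', has a digit.
    private static boolean digitsAtEverySecond(String[] parts, int start) {
        for (int i = start; i < parts.length; i += 2) {
            if (!hasDigit(parts[i])) {
                return false;
            }
        }
        return true;
    }

    // Stricter rule: digits in every second segment, starting from either
    // the first or the second segment.
    static boolean isNumEverySecondSegment(String s) {
        String[] parts = segments(s);
        return parts != null
                && (digitsAtEverySecond(parts, 0) || digitsAtEverySecond(parts, 1));
    }

    // Looser rule: a digit in at least one segment is enough.
    static boolean isNumAnySegment(String s) {
        String[] parts = segments(s);
        if (parts == null) {
            return false;
        }
        for (String p : parts) {
            if (hasDigit(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"192.168.0.1", "a.1.b", "a.b.1", "a.b"}) {
            System.out.println(s + " strict=" + isNumEverySecondSegment(s)
                    + " loose=" + isNumAnySegment(s));
        }
    }
}
```

Under these assumptions "a.1.b" satisfies both rules, while "a.b.1" satisfies only the looser one, which is the kind of input where the two definitions would give different token types.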

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>                 Key: LUCENE-966
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>         Attachments:, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt,
> jflex-analyzer-r561292-patch.txt, jflex-analyzer-r561693-compatibility.txt
>
> JFlex can be used to generate a faster (up to several times) replacement
> for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.
