lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stanislaw Osinski (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
Date Wed, 01 Aug 2007 07:53:53 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516893
] 

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

When digging deeper into the issues of compatibility with the original StandardAnalyzer, I
stumbled upon something strange. Take the following text:

78academyawards/rules/rule02.html,7194,7227,type

which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the
definition of the <NUM> token:

// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
       | <HAS_DIGIT> <P> <ALPHANUM>
       | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
       | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P>
<HAS_DIGIT>)+
       | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P>
<ALPHANUM>)+
        )

you'll see that, as explained in the comment, every other segment must have at least one digit.
But actually, according to my understanding, this rule should not match the above text as
a whole (and with JFlex it doesn't , actually). Below is the text split by punctuation characters,
and it looks like there is no way of splitting this text into alternating segments, every
second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT):

78academyawards   /   rules   /   rule02   .   html   ,   7194   ,   7227   ,   type
                H                  P      A     P       H       P     A     P     H      P
     A     P    H?*     (starting from the beginning)
                                                                              H?*    P   
 A      P      H     P     A       (starting from the end)

* (would have to be H, but no digits in substring "type" or "html")

I have no idea why JavaCC matched the whole text as a <NUM>, JFlex behaved "more correctly"
here. 

Now I can see two solutions:

* try to patch the JFlex grammar to emulate JavaCC quirks (though I may not be aware of most
of them...)
* relax the <NUM> rule a little bit (JFlex notation):

// there must be at least one segment with a digit
NUM = ({P} ({HAS_DIGIT} | {ALPHANUM}))* {HAS_DIGIT} ({P} ({HAS_DIGIT} | {ALPHANUM}))*

With this definition, again, all StandardAnalyzer tests pass, plus all texts along the lines
of:

2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type
78academyawards/rules/rule02.html,7194,7227,type
978-0-94045043-1,86408,86424,type
62.46,37004,37009,type    (this one was parsed as <HOST> by the original analyzer)

get parsed as a whole as one <NUM>, which is equivalent to what JavaCC-based version
would do. I will attach a corresponding patch in a second.



> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt,
jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement
for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message