lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
Date Tue, 31 Jul 2007 17:37:52 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516752 ]

Michael McCandless commented on LUCENE-966:
-------------------------------------------

I tracked down at least some of the differences between the JavaCC
and JFlex versions of StandardAnalyzer.

I think we should resolve these before committing.

I just printed all tokens for the first 20 Wikipedia docs and diff'd
the outputs.

Here are the categories of differences that I saw:

  * Only the type differs on a filename-like token:

      OLD: (2004.jpg,34461,34469,type=<HOST>)
      NEW: (2004.jpg,34461,34469,type=<NUM>)

    In this case the old StandardAnalyzer called "2004.jpg" a HOST and
    the new one calls it a NUM.  Seems like neither one is right!

  * Only the type differs on a number token:

      OLD: (62.46,37004,37009,type=<HOST>)
      NEW: (62.46,37004,37009,type=<NUM>)

    The new tokenizer looks right here.  I guess the decimal point
    confuses the JavaCC (old) one.

  * Different number of tokens produced for a number-like token:

      OLD: (978-0-94045043-1,86408,86424,type=<NUM>)
      NEW: (978-0-94045043,86408,86422,type=<NUM>)
           (1,86423,86424,type=<ALPHANUM>)

    The new one split at the last "-", emitting the final "1" as its
    own token, and called it ALPHANUM, not NUM.  I think the old
    behavior is correct.

  * Different number of tokens produced for filename:

      OLD: (78academyawards/rules/rule02.html,7194,7227,type=<NUM>)
      NEW: (78academyawards/rules/rule02,7194,7222,type=<NUM>)
           (html,7223,7227,type=<ALPHANUM>)

    I think the old one is better, though it should not be called a
    NUM (maybe we need a new "FILENAME" token type?).

  * Same as above, but split on final '_' instead of '.' ('-' also
    shows this behavior):

      OLD: (2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type=<NUM>)
      NEW: (2006-03-11t082958z_01_ban130523_rtridst_0,2076,2117,type=<NUM>)
           (ozabs,2118,2123,type=<ALPHANUM>)
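
A minimal sketch of how the categories above could be sorted
automatically, assuming each tokenizer's output has been dumped one
token per line in the (term,start,end,type=<TYPE>) form shown in the
examples.  The class and method names (TokenDumpDiff, parseAll,
classify) and the category labels are hypothetical; none of this is
part of Lucene or of the attached patches, and it uses only the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or of the LUCENE-966 patch.
final class TokenDumpDiff {

    /** One dumped token, e.g. (62.46,37004,37009,type=<NUM>). */
    static final class Tok {
        final String term;
        final int start;
        final int end;
        final String type;

        Tok(String term, int start, int end, String type) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Greedy (.*) lets the term contain commas; the last three fields anchor the match.
    private static final Pattern LINE =
        Pattern.compile("\\((.*),(\\d+),(\\d+),type=<(\\w+)>\\)");

    static Tok parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a token line: " + line);
        }
        return new Tok(m.group(1), Integer.parseInt(m.group(2)),
                       Integer.parseInt(m.group(3)), m.group(4));
    }

    static List<Tok> parseAll(String... lines) {
        List<Tok> toks = new ArrayList<>();
        for (String line : lines) {
            toks.add(parse(line));
        }
        return toks;
    }

    /**
     * Walks both dumps in offset order and labels each disagreement.
     * Tokens left over on one side after the other is exhausted are ignored.
     */
    static List<String> classify(List<Tok> oldToks, List<Tok> newToks) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < oldToks.size() && j < newToks.size()) {
            Tok o = oldToks.get(i);
            Tok n = newToks.get(j);
            if (o.start == n.start && o.end == n.end) {
                // Same span: only the type can differ.
                if (!o.type.equals(n.type)) {
                    out.add("TYPE_ONLY " + o.term + " " + o.type + "->" + n.type);
                }
                i++;
                j++;
            } else if (o.start == n.start && n.end < o.end) {
                // New side split one old token: consume every piece inside the old span.
                int pieces = 0;
                while (j < newToks.size() && newToks.get(j).end <= o.end) {
                    j++;
                    pieces++;
                }
                out.add("SPLIT_IN_NEW " + o.term + " pieces=" + pieces);
                i++;
            } else if (o.start == n.start && o.end < n.end) {
                // Old side split one new token: the mirror case.
                int pieces = 0;
                while (i < oldToks.size() && oldToks.get(i).end <= n.end) {
                    i++;
                    pieces++;
                }
                out.add("SPLIT_IN_OLD " + n.term + " pieces=" + pieces);
                j++;
            } else {
                // Spans do not line up at all; report and move on.
                out.add("OTHER " + o.term + " vs " + n.term);
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> oldToks = parseAll(
            "(62.46,37004,37009,type=<HOST>)",
            "(978-0-94045043-1,86408,86424,type=<NUM>)");
        List<Tok> newToks = parseAll(
            "(62.46,37004,37009,type=<NUM>)",
            "(978-0-94045043,86408,86422,type=<NUM>)",
            "(1,86423,86424,type=<ALPHANUM>)");
        for (String diff : classify(oldToks, newToks)) {
            System.out.println(diff);
        }
    }
}
```

Fed the 62.46 and 978-0-94045043-1 examples, the sketch reports one
TYPE_ONLY entry (HOST->NUM) and one SPLIT_IN_NEW entry with two
pieces.  It compares spans by character offset only, so it says
nothing about which tokenizer is right.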


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

