lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
Date Tue, 31 Jul 2007 17:37:52 GMT


Michael McCandless commented on LUCENE-966:

I tracked down at least some of the differences between the JavaCC and
JFlex versions of StandardAnalyzer.

I think we should resolve these before committing.

I just printed all tokens for the first 20 Wikipedia docs and diff'd
the outputs.
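
A comparison of this kind could be sketched roughly as below. This is
only an illustration, not the procedure actually used: the token lines
are copied from the examples in this message, while the list names, the
"javacc"/"jflex" labels, and the use of Python's difflib are assumptions.

```python
import difflib

# Hypothetical token dumps, one token per line, in the
# (text,startOffset,endOffset,type=<TYPE>) form shown in this message.
old_tokens = [
    "(2004.jpg,34461,34469,type=<HOST>)",
    "(62.46,37004,37009,type=<HOST>)",
]
new_tokens = [
    "(2004.jpg,34461,34469,type=<NUM>)",
    "(62.46,37004,37009,type=<NUM>)",
]

# Produce a unified diff of the two dumps.
diff_lines = list(
    difflib.unified_diff(
        old_tokens, new_tokens, fromfile="javacc", tofile="jflex", lineterm=""
    )
)

# Keep only the added/removed token lines, skipping the file headers.
changed = [
    line
    for line in diff_lines
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

for line in changed:
    print(line)
```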

Here are the categories of differences that I saw:

  * Only the type differs on a filename-like token:

      OLD: (2004.jpg,34461,34469,type=<HOST>)
      NEW: (2004.jpg,34461,34469,type=<NUM>)

    In this case the old StandardAnalyzer called "2004.jpg" a HOST and
    the new one calls it a NUM.  Seems like neither one is right!

  * Only the type differs on a number token:

      OLD: (62.46,37004,37009,type=<HOST>)
      NEW: (62.46,37004,37009,type=<NUM>)

    The new tokenizer looks right here.  I guess the decimal point
    confuses the JavaCC (old) one.

  * Different number of tokens produced for a number-like token:

      OLD: (978-0-94045043-1,86408,86424,type=<NUM>)
      NEW: (978-0-94045043,86408,86422,type=<NUM>)

    The new one split off the final "-1" as its own token, and called
    it ALPHANUM not NUM.  I think the old behavior is correct.

  * Different number of tokens produced for filename:

      OLD: (78academyawards/rules/rule02.html,7194,7227,type=<NUM>)
      NEW: (78academyawards/rules/rule02,7194,7222,type=<NUM>)

    I think the old one is better, though it should not be called a
    NUM (maybe we need a new "FILENAME" token type?).

  * Same as above, but split on final '_' instead of '.' ('-' also
    shows this behavior):

      OLD: (2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type=<NUM>)
      NEW: (2006-03-11t082958z_01_ban130523_rtridst_0,2076,2117,type=<NUM>)
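
The categories above can be told apart mechanically. The following is a
minimal sketch, assuming the token line format shown in the examples;
the function names and the category labels ("type-only",
"new-splits-earlier") are hypothetical and exist only for illustration.

```python
import re

# Matches lines such as (2004.jpg,34461,34469,type=<HOST>).
# The text group is greedy so token text may itself contain commas.
TOKEN_RE = re.compile(
    r"^\((?P<text>.*),(?P<start>\d+),(?P<end>\d+),type=<(?P<type>\w+)>\)$"
)


def parse_token(line):
    """Split one printed token into (text, start, end, type)."""
    m = TOKEN_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized token line: {line!r}")
    return m["text"], int(m["start"]), int(m["end"]), m["type"]


def classify(old_line, new_line):
    """Label how a pair of OLD/NEW token lines differs (hypothetical labels)."""
    old_text, old_start, old_end, old_type = parse_token(old_line)
    new_text, new_start, new_end, new_type = parse_token(new_line)
    if (old_text, old_start, old_end) == (new_text, new_start, new_end):
        # Same text and offsets: at most the type differs.
        return "same" if old_type == new_type else "type-only"
    if old_start == new_start and old_text.startswith(new_text):
        # The new tokenizer ended the token earlier than the old one did.
        return "new-splits-earlier"
    return "other"


print(classify("(62.46,37004,37009,type=<HOST>)",
               "(62.46,37004,37009,type=<NUM>)"))
print(classify("(978-0-94045043-1,86408,86424,type=<NUM>)",
               "(978-0-94045043,86408,86422,type=<NUM>)"))
```

Grouping the diff by label like this makes it easier to see which
differences are type-only and which change the token boundaries.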

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>                 Key: LUCENE-966
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>         Attachments:, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt,
> JFlex can be used to generate a faster (up to several times) replacement
> for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
