lucene-dev mailing list archives

From "Stanislaw Osinski"
Subject Re: [jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
Date Thu, 02 Aug 2007 21:37:12 GMT
> > Mark -- have you tried the jflex-analyzer-r560135-patch.txt patch with
> > your wikipedia diff test? That's the early one whose grammar was "dot for
> > dot" translated from the original JavaCC spec -- for further patches I did
> > some "optimizations", which seem to have broken the compatibility...
>
> The test is Mike's and I think it is off your latest patch.

Oops again -- I should stop working late at night :)

The latest patch is not very compatible with JavaCC; the
jflex-analyzer-r560135-patch.txt patch should be the best one to test here.
Maybe I should delete the other attachments from JIRA to avoid further
confusion?

> Looks like the optimizations might have to go then?

Definitely, but the base version (jflex-analyzer-r560135-patch.txt) is
still much faster than JavaCC.

> > Incidentally, what was the motivation for requiring the <NUM> token to
> > have numbers only in every second segment and not in any segment?
>
> I don't think the rule is "every second segment" but "at least every
> other segment". Why this rule was made, I am not sure; I am guessing it
> was just a good rule of thumb to catch a lot of serial numbers, model
> numbers, etc but without going too overboard in the matching.

Ok -- seems reasonable.
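
For illustration, here is a rough Python sketch of how the "at least every
other segment" rule could be read. It is not the JavaCC or JFlex grammar
itself: the function name looks_like_num, the separator set, and the reading
of the rule (either all even-positioned or all odd-positioned segments contain
a digit) are assumptions made for this example.

```python
import re

# Separator characters assumed for this sketch; not copied from the grammar.
SEPARATORS = r"[_\-/.,]"


def looks_like_num(token: str) -> bool:
    """Hypothetical check: does a token fit the 'at least every other
    segment contains a digit' reading of the <NUM> rule?"""
    segments = re.split(SEPARATORS, token)
    # A single segment has no separator, so it is not treated as <NUM> here.
    if len(segments) < 2:
        return False
    # Every segment must be non-empty and purely alphanumeric.
    if not all(seg.isalnum() for seg in segments):
        return False
    has_digit = [any(ch.isdigit() for ch in seg) for seg in segments]
    # Digits must appear in all even-positioned or all odd-positioned segments.
    return all(has_digit[0::2]) or all(has_digit[1::2])
```

Under this reading, "ab-123" and "123-ab-456" qualify, while "ab-cd-123" does
not, because a digit in just any one segment is not enough.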

