lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <>
Subject [jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
Date Thu, 03 Dec 2009 17:47:28 GMT


Steven Rowe commented on LUCENE-2074:

bq. Do you see a problem with just requiring JFlex 1.5 for Lucene trunk at the moment?

I think it's fine to do that.

bq. The new parsers (see patch) are pre-generated in SVN, so somebody compiling Lucene from
source does not need to use JFlex. And the parsers for StandardTokenizer are verified to work
correctly and are even identical (DFA-wise) for the old Java 1.4 / Unicode 3.0 case.

Most of the StandardTokenizerImpl.jflex grammar is expressed in absolute terms - the only
JVM-/Unicode-version-sensitive usages are [:letter:] and [:digit:], which under JFlex <1.5
were expanded using the scanner-generation-time JVM's Character.isLetter() and .isDigit()
definitions, but under JFlex 1.5-SNAPSHOT depend on the declared Unicode version definitions
(i.e., [:letter:] = \p{Letter}).
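
To make the JVM dependence concrete, here is a minimal Java sketch (not JFlex's actual expansion code) that collapses whatever the running JVM's Character.isLetter(int) accepts into code point ranges. The class and method names are made up for illustration, and isLetter(int) only exists from Java 5 on, so a Java 1.4 run would have to use the char overload instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JFlex or Lucene.
class LetterRanges {
    /** Collapses the code points in [from, to] that this JVM treats as letters into hex ranges. */
    static List<String> letterRanges(int from, int to) {
        List<String> ranges = new ArrayList<>();
        int start = -1; // start of the range currently being built, or -1 if none
        for (int cp = from; cp <= to + 1; cp++) {
            // cp == to + 1 is a sentinel step that closes a range ending exactly at 'to'
            boolean letter = cp <= to && Character.isLetter(cp);
            if (letter && start < 0) {
                start = cp;
            } else if (!letter && start >= 0) {
                ranges.add(String.format("%04X-%04X", start, cp - 1));
                start = -1;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        // The ASCII block is stable across Unicode versions.
        System.out.println(letterRanges(0x0000, 0x007F));
        // The full count depends on the Unicode version the running JVM implements.
        System.out.println(letterRanges(0x0000, Character.MAX_CODE_POINT).size() + " ranges");
    }
}
```

Running this on two different JVMs and comparing the output is one way to see how a generation-time expansion of [:letter:] could vary.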

I'm actually surprised that the DFAs are identical, since I'm almost certain that the set
of characters matching [:letter:] changed between Unicode 3.0 and Unicode 4.0 (maybe [:digit:]
too).  I'll take a look this weekend.
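
One way such a check could be sketched in Java: capture the letter set as a BitSet on each JVM and diff the two. The class and method names below are hypothetical, and the "older" set in main() is synthetic (one code point cleared just to exercise the diff), not real Unicode 3.0 data.

```java
import java.util.BitSet;

// Hypothetical helper, not part of JFlex or Lucene.
class LetterSetDiff {
    /** Code points in [from, to] that the running JVM's Character.isLetter(int) accepts. */
    static BitSet lettersOnThisJvm(int from, int to) {
        BitSet set = new BitSet(to + 1);
        for (int cp = from; cp <= to; cp++) {
            if (Character.isLetter(cp)) {
                set.set(cp);
            }
        }
        return set;
    }

    /** Code points present in a but absent from b. Neither argument is modified. */
    static BitSet onlyIn(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.andNot(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet current = lettersOnThisJvm(0, 0xFFFF);
        // Synthetic stand-in for a set captured on another JVM; NOT real data.
        BitSet synthetic = (BitSet) current.clone();
        synthetic.clear('A');
        BitSet added = onlyIn(current, synthetic);
        for (int cp = added.nextSetBit(0); cp >= 0; cp = added.nextSetBit(cp + 1)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

In practice the second set would have to be dumped on the other JVM and loaded here; that step is left out of the sketch.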

> Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
> -------------------------------------------------------------------------------
>                 Key: LUCENE-2074
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>         Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, LUCENE-2074-lucene30.patch,
LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch,
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 (according
to the warning). In Lucene 3.0 we switch to Java 1.5, so we should regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. Because of that
we should only use the new TokenizerImpl when Version.LUCENE_30 or LUCENE_31 is used as matchVersion.
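
The matchVersion gating described in the issue could look roughly like the Java sketch below. The MatchVersion enum and chooseImpl method are hypothetical stand-ins written for this example; they are not Lucene's Version class or the code in the attached patches.

```java
// Hypothetical sketch of version-gated scanner selection; not Lucene code.
class TokenizerSelection {
    /** Stand-in for a version constant; declaration order defines "older" and "newer". */
    enum MatchVersion {
        LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31;

        boolean onOrAfter(MatchVersion other) {
            return compareTo(other) >= 0;
        }
    }

    /** Returns a label for the scanner implementation that would be used for the given version. */
    static String chooseImpl(MatchVersion matchVersion) {
        // Only callers that opt in to 3.0 or later get the regenerated scanner,
        // so indexes built with older versions keep their tokenization.
        return matchVersion.onOrAfter(MatchVersion.LUCENE_30)
                ? "regenerated (Java 5 / Unicode 4)"
                : "legacy (Java 1.4 / Unicode 3)";
    }

    public static void main(String[] args) {
        for (MatchVersion v : MatchVersion.values()) {
            System.out.println(v + " -> " + chooseImpl(v));
        }
    }
}
```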
