lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl
Date Thu, 29 Nov 2007 13:37:43 GMT


Shai Erera updated LUCENE-1068:

    Attachment: StandardTokenizerImpl-2.patch

I've found a way to do it (I think):
I've added a new type called ACRONYM_DEP that identifies the old ACRONYMs and fixed the current
ACRONYM to identify proper ones.
I also marked ACRONYM_DEP as deprecated.
I added code to StandardTokenizer to set the type of a token to HOST if the type returned
is ACRONYM_DEP. This behavior can be changed if you think the type should be set to ACRONYM,
in case there are applications that count on the Token type.
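
The remapping described above can be sketched in isolation. This is only an illustration, not the code from the attached patch: the class name, the method name, and the "<ACRONYM_DEP>" label are assumptions, and plain strings stand in for Lucene's token types so the snippet runs without Lucene on the classpath.

```java
import java.util.Arrays;

// Hypothetical stand-in for the type remapping; not the code from the patch.
final class TokenTypeRemap {
    // Labels written the way the tokenizer prints them, e.g. type=<HOST>.
    static final String HOST = "<HOST>";
    static final String ACRONYM = "<ACRONYM>";
    // Assumed label for the deprecated type that matches the old ACRONYM rule.
    static final String ACRONYM_DEP = "<ACRONYM_DEP>";

    // Maps the type reported by the scanner to the type set on the Token.
    // Old-style matches are reported as HOST; every other type passes through.
    static String publicType(String scannerType) {
        return ACRONYM_DEP.equals(scannerType) ? HOST : scannerType;
    }

    public static void main(String[] args) {
        for (String type : Arrays.asList(ACRONYM_DEP, ACRONYM, HOST)) {
            System.out.println(type + " -> " + publicType(type));
        }
    }
}
```

Returning ACRONYM instead of HOST in publicType would give the alternative behavior mentioned above, for applications that depend on the old Token type.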

I wrote these 4 lines of code to verify it works:
	public static void main(String[] args) throws Exception {
		// parse(...) calls elided
	}

	public static void parse(String text) throws Exception {
		Analyzer analyzer = new StandardAnalyzer();
		TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
		Token t;
		// next() returns null once the stream is exhausted
		while ((t = ts.next()) != null) {
			System.out.println(t);
		}
	}
And the output is: 

> Invalid behavior of StandardTokenizerImpl
> -----------------------------------------
>                 Key: LUCENE-1068
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Shai Erera
>         Attachments: StandardTokenizerImpl-2.patch, standardTokenizerImpl.jflex.patch,
> The following code prints the output of StandardAnalyzer:
>         Analyzer analyzer = new StandardAnalyzer();
>         TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
>         Token t;
>         while ((t = ts.next()) != null) {
>             System.out.println(t);
>         }
> If you pass "", the output is (,0,11,type=<HOST>) (which
> is correct in my opinion).
> However, if you pass "" (notice the extra '.' at the end), the output is
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argument there).
> 2. It kind of prevents you from putting URLs at the end of a sentence, which is perfectly
> 3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
> Notice how the comment describes acronyms as U.S.A., I.B.M. and nothing else. I
> changed the definition to
> ACRONYM    =  {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
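
The difference between the two ACRONYM definitions quoted above can be checked with plain java.util.regex. This is a rough sketch under stated assumptions: LETTER is approximated by \p{L}, ALPHA is taken to be one or more LETTERs, and "example.com." is a made-up input chosen only because it is a hostname followed by a period. JFlex picks rules by longest match across the whole grammar, so the snippet shows only what each definition accepts on its own.

```java
import java.util.regex.Pattern;

// Compares the old and new ACRONYM definitions as standalone regular expressions.
final class AcronymRules {
    // Old rule: {ALPHA} "." ({ALPHA} ".")+
    // Assumption: ALPHA is a run of one or more letters.
    static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // New rule: {LETTER} "." ({LETTER} ".")+
    // Assumption: LETTER is approximated by a single \p{L} character.
    static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    static boolean oldRule(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean newRule(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "example.com." is a hypothetical input, not taken from the issue.
        for (String s : new String[] {"U.S.A.", "I.B.M.", "example.com."}) {
            System.out.println(s + " old=" + oldRule(s) + " new=" + newRule(s));
        }
    }
}
```

Under these assumptions both rules accept U.S.A. and I.B.M., while only the old rule accepts a dotted multi-letter name with a trailing period, which is consistent with the behavior reported in the issue.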

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
