Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <10309643.1196343463276.JavaMail.jira@brutus>
Date: Thu, 29 Nov 2007 05:37:43 -0800 (PST)
From: "Shai Erera (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-1068) Invalid behavior of
 StandardTokenizerImpl
In-Reply-To: <29758935.1196164483461.JavaMail.jira@brutus>


     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: StandardTokenizerImpl-2.patch

I've found a way to do it (I think):
I've added a new type called ACRONYM_DEP that matches what the old ACRONYM rule matched, and fixed the current ACRONYM rule so that it identifies only proper acronyms.
I also marked ACRONYM_DEP as deprecated.
I added code to StandardTokenizer that sets the type of a token to HOST if the type returned is ACRONYM_DEP. This can be changed to set the type to ACRONYM instead, if you think there are applications that count on the Token type.
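
To make the remapping concrete, here is a minimal, self-contained sketch of the decision only. It is not the code in the attached patch: the class name, the enum, and the deprecatedAsHost flag are hypothetical stand-ins, and no Lucene classes are used.

```java
final class TokenTypeRemapSketch {
    // Hypothetical stand-ins for the token types the scanner can return.
    enum Type { ALPHANUM, HOST, ACRONYM, ACRONYM_DEP }

    // Maps the deprecated type to the type reported on the Token.
    // deprecatedAsHost is an assumed switch: true reports HOST,
    // false keeps reporting ACRONYM for applications that rely on it.
    static Type resolveType(Type scanned, boolean deprecatedAsHost) {
        if (scanned == Type.ACRONYM_DEP) {
            return deprecatedAsHost ? Type.HOST : Type.ACRONYM;
        }
        // Every other type is passed through unchanged.
        return scanned;
    }

    public static void main(String[] args) {
        System.out.println(resolveType(Type.ACRONYM_DEP, true));
        System.out.println(resolveType(Type.ACRONYM_DEP, false));
        System.out.println(resolveType(Type.ACRONYM, true));
    }
}
```

With the flag set to true, a token scanned as ACRONYM_DEP is reported as HOST; with false it is reported as ACRONYM, which is the alternative mentioned above.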

I wrote this short test to verify it works:
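	import java.io.StringReader;
	import org.apache.lucene.analysis.Analyzer;
	import org.apache.lucene.analysis.Token;
	import org.apache.lucene.analysis.TokenStream;
	import org.apache.lucene.analysis.standard.StandardAnalyzer;

	// The class name is only a placeholder so that the snippet compiles.
	public class StandardTokenizerCheck {
		public static void main(String[] args) throws Exception {
			parse("www.abc.com.");
			parse("www.abc.com");
			parse("I.B.M.");
		}

		// Prints every token StandardAnalyzer produces for the given text.
		public static void parse(String text) throws Exception {
			Analyzer analyzer = new StandardAnalyzer();
			TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
			Token t;
			while ((t = ts.next()) != null) {
				System.out.println(t);
			}
		}
	}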
And the output is: 
(www.abc.com.,0,12,type=<HOST>)
(www.abc.com,0,11,type=<HOST>)
(ibm,0,6,type=<ACRONYM>)

> Invalid behavior of StandardTokenizerImpl
> -----------------------------------------
>
>                 Key: LUCENE-1068
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1068
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Shai Erera
>         Attachments: StandardTokenizerImpl-2.patch, standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
>         Analyzer analyzer = new StandardAnalyzer();
>         TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
>         Token t;
>         while ((t = ts.next()) != null) {
>             System.out.println(t);
>         }
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argument there).
> 2. It effectively prevents you from putting a URL at the end of a sentence, which is perfectly legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
> Notice that the comment describes acronyms as U.S.A., I.B.M., and nothing else. I changed the definition to
> ACRONYM    =  {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
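
The difference between the two ACRONYM definitions quoted above can be illustrated outside of JFlex with plain java.util.regex. This is only a sketch under two assumptions: \p{L} approximates the grammar's LETTER class, and ALPHA stands for one or more LETTERs. The class and method names are hypothetical, and whole-string matching is used instead of the scanner's longest-match behavior.

```java
import java.util.regex.Pattern;

final class AcronymRuleSketch {
    // Old rule: {ALPHA} "." ({ALPHA} ".")+ with ALPHA assumed to be one or more letters.
    static final Pattern OLD_RULE = Pattern.compile("\\p{L}+\\.(?:\\p{L}+\\.)+");
    // New rule: {LETTER} "." ({LETTER} ".")+ with exactly one letter between dots.
    static final Pattern NEW_RULE = Pattern.compile("\\p{L}\\.(?:\\p{L}\\.)+");

    static boolean matchesOldRule(String s) {
        return OLD_RULE.matcher(s).matches();
    }

    static boolean matchesNewRule(String s) {
        return NEW_RULE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The same three inputs discussed in this issue.
        for (String s : new String[] {"www.abc.com.", "www.abc.com", "I.B.M."}) {
            System.out.println(s + " old=" + matchesOldRule(s) + " new=" + matchesNewRule(s));
        }
    }
}
```

Under these assumptions the old rule accepts both "www.abc.com." and "I.B.M.", while the new rule accepts only "I.B.M."; neither accepts "www.abc.com", which has no trailing dot.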
