lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Lassau (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl
Date Wed, 03 Sep 2008 01:01:46 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627876#action_12627876
] 

Mark Lassau commented on LUCENE-1068:
-------------------------------------

Causes JIRA issue [JRA-15484|http://jira.atlassian.com/browse/JRA-15484].

> Invalid behavior of StandardTokenizerImpl
> -----------------------------------------
>
>                 Key: LUCENE-1068
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1068
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Shai Erera
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1068.patch, StandardTokenizer-java-4.patch, StandardTokenizer-test-4.patch,
StandardTokenizerImpl-2.patch, StandardTokenizerImpl-3.patch, StandardTokenizerImpl-5.patch,
standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
>         Analyzer analyzer = new StandardAnalyzer();
>         TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
>         Token t;
>         while ((t = ts.next()) != null) {
>             System.out.println(t);
>         }
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which
is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is
(wwwabccom,0,12,type=<ACRONYM>).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argue on that).
> 2. It kind of prevents you from putting URLs at the end of a sentence, which is perfectly
legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not
ABC.DEF.
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
> Notice how the comment relates to acronym as U.S.A., I.B.M. and not something else. I
changed the definition to
> ACRONYM    =  {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message