lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
Date Fri, 30 Apr 2010 22:18:57 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2167:
--------------------------------

              Summary: Implement StandardTokenizer with the UAX#29 Standard  (was: StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters)
           Issue Type: New Feature  (was: Bug)
             Assignee: Steven Rowe
    Affects Version/s: 4.0.0
                           (was: 2.9)
                           (was: 3.0)
                           (was: 2.4.1)
                           (was: 2.9.1)
          Description: 
It would be really nice for StandardTokenizer to adhere to the standard as closely as
we can with JFlex. Then its name would actually make sense.

Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as
its javadoc claims:

bq. This should be a good tokenizer for most European-language documents

The new StandardTokenizer could then say

bq. This should be a good tokenizer for most languages.

All the English/Euro-centric handling (acronyms, company names, apostrophes) can stay with
that EuropeanTokenizer, and it could be used by the European analyzers.
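
For readers unfamiliar with UAX#29-style word breaking, here is a minimal sketch that uses the JDK's java.text.BreakIterator rather than Lucene or JFlex. It is not the tokenizer proposed in this issue: the class name WordBreakSketch and the helper words() are hypothetical, and the JDK's word rules are only assumed to be broadly similar to UAX#29, so its output may differ from what a UAX#29-based StandardTokenizer would produce.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of Lucene or of the LUCENE-2167 patch.
public class WordBreakSketch {

    /** Returns the segments between word boundaries that contain a letter or digit. */
    static List<String> words(String text) {
        // JDK word-boundary iterator; assumed to be only roughly UAX#29-like.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Drop segments that are pure whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Inputs taken from the original issue description.
        for (String s : new String[] {"video,mp4,test", "video-mp4-test-again"}) {
            System.out.println(s + " -> " + words(s));
        }
    }
}
```

Running main prints how the JDK segments the two strings from the original report, which gives a rough baseline to compare against the old StandardTokenizer behavior described there.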


  was:
The Javadoc for StandardTokenizer states:

{quote}
Splits words at punctuation characters, removing punctuation. 
However, a dot that's not followed by whitespace is considered part of a token.

Splits words at hyphens, unless there's a number in the token, in which case the whole 
token is interpreted as a product number and is not split.
{quote}

This is not accurate. The actual JFlex implementation treats hyphens interchangeably with
punctuation. So, for example "video,mp4,test" results in a *single* token and not three tokens
as the documentation would suggest.

Additionally, the documentation suggests that "video-mp4-test-again" would become a single
token, but in reality it results in two tokens: "video-mp4-test" and "again".

IMHO the parser implementation is fine as is since it is hard to keep everyone happy, but
it is probably worth cleaning up the documentation string.

The patch included here updates the documentation string and adds a few test cases to confirm
the cases described above.

          Component/s: contrib/analyzers

(stole Robert's comment to change the issue description)

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 4.0.0
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere to the standard as closely as
> we can with JFlex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer,
> as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric handling (acronyms, company names, apostrophes) can stay
> with that EuropeanTokenizer, and it could be used by the European analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

