lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
Date Wed, 24 Feb 2010 03:41:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837603#action_12837603
] 

Robert Muir commented on LUCENE-2167:
-------------------------------------

bq. Clearly, much of this is an opinion, so I finally stuck to the one minor change that I believe is arguably an improvement. Previously, comma-separated fields containing digits would be mistaken for numbers and combined into a single token. I believe this is a mistake because part numbers etc. are rarely comma-separated, and regular text that is comma-separated is not uncommon.

I don't think it really has to be; I'm actually of the opinion that StandardTokenizer should follow standard Unicode tokenization. Then we can throw subjective decisions away and stick with a standard.

In this example, I think the change would be bad, as the comma is treated differently depending upon context: it is a decimal separator and a thousands separator in many languages, including English. So the treatment of the comma depends upon the surrounding characters.

This is why, in Unicode, the comma has the MidNum Word_Break property.
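The context-sensitive comma can be sketched with a toy tokenizer. This is a hypothetical illustration of the MidNum idea only, not the real StandardTokenizer and not the full UAX #29 rule set: a comma stays inside a token only when both of its neighbours are digits, and otherwise acts as a break, so "video,mp4,test" splits while "1,000" stays whole.

```java
import java.util.ArrayList;
import java.util.List;

public class MidNumDemo {
    // Toy sketch of the MidNum behaviour: keep a comma inside a token
    // only when it sits between two digits (the numeric context), and
    // treat it as a token break everywhere else.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                cur.append(c);
            } else if (c == ',' && i > 0 && i + 1 < s.length()
                       && Character.isDigit(s.charAt(i - 1))
                       && Character.isDigit(s.charAt(i + 1))) {
                cur.append(c); // numeric context: the comma does not split
            } else {
                if (cur.length() > 0) {
                    tokens.add(cur.toString());
                    cur.setLength(0);
                }
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("video,mp4,test")); // [video, mp4, test]
        System.out.println(tokenize("1,000"));          // [1,000]
    }
}
```

The real rules (WB11/WB12 in UAX #29) also cover MidLetter, MidNumLet, and other classes, but the same principle applies: the break decision for a character like the comma depends on its neighbours, not on the character alone.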


> StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
>            Reporter: Shyamal Prasad
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The Javadoc for StandardTokenizer states:
> {quote}
> Splits words at punctuation characters, removing punctuation. 
> However, a dot that's not followed by whitespace is considered part of a token.
> Splits words at hyphens, unless there's a number in the token, in which case the whole
> token is interpreted as a product number and is not split.
> {quote}
> This is not accurate. The actual JFlex implementation treats hyphens interchangeably with
> punctuation. So, for example, "video,mp4,test" results in a *single* token and not three
> tokens as the documentation would suggest.
> Additionally, the documentation suggests that "video-mp4-test-again" would become a single
> token, but in reality it results in two tokens: "video-mp4-test" and "again".
> IMHO the parser implementation is fine as is, since it is hard to keep everyone happy, but
> it is probably worth cleaning up the documentation string.
> The patch included here updates the documentation string and adds a few test cases to
> confirm the cases described above.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

