lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2167) StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
Date Wed, 24 Feb 2010 23:07:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838068#action_12838068 ]

Robert Muir commented on LUCENE-2167:
-------------------------------------

bq. I'll take a crack at understanding unicode standard tokenization, as you'd suggested originally,
and try and produce something as soon as I get a chance.

I would love it if you could produce a grammar that implemented UAX#29!

If so, in my opinion it should become the StandardAnalyzer for the next Lucene version. If
I thought I could do it correctly, I would have already done it, since the support for the
Unicode properties needed to do this is now in the trunk of JFlex!

Here are some references that might help:
The standard itself: http://unicode.org/reports/tr29/

In particular, the "Testing" portion: http://unicode.org/reports/tr41/tr41-5.html#Tests29

Unicode provides a WordBreakTest.txt file that we could use from JUnit to help verify correctness:
http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakTest.txt
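A JUnit harness would first need to parse that file's data lines. As a minimal sketch (the class and method names are illustrative, not from any patch): each line in WordBreakTest.txt lists hex code points separated by "÷" (break allowed here) and "×" (no break here), followed by a "#" comment, so the expected segments can be recovered like this:

```java
import java.util.ArrayList;
import java.util.List;

public class WordBreakTestLine {
    // Parse one WordBreakTest.txt data line, e.g. "÷ 0061 × 0062 ÷ 0020 ÷ 0061 ÷ # ...",
    // into the list of strings the segmenter is expected to produce.
    static List<String> expectedSegments(String line) {
        String data = line.split("#", 2)[0].trim();   // strip the trailing comment
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String tok : data.split("\\s+")) {
            if (tok.equals("÷")) {                    // break opportunity: flush segment
                if (current.length() > 0) {
                    segments.add(current.toString());
                    current.setLength(0);
                }
            } else if (!tok.equals("×")) {            // "×" = no break; else a hex code point
                current.appendCodePoint(Integer.parseInt(tok, 16));
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // "ab", " ", "a": no break between 0061 and 0062, breaks around the space
        System.out.println(expectedSegments("÷ 0061 × 0062 ÷ 0020 ÷ 0061 ÷"));
    }
}
```

A test would then run each line's input through the tokenizer and compare against these expected segments (filtering out the non-word segments such as spaces).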

I'll warn you: I think it might be hard, but perhaps it's not that bad. In particular, the standard
is defined in terms of "chained" rules, and JFlex doesn't support rule chaining, but I am not
convinced we need rule chaining to implement WordBreak (maybe for LineBreak, but perhaps WordBreak
can be done easily without it?) 
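To give a flavor of what "without chaining" could look like: rules WB5-WB7 (don't break between letters, even across a single mid-letter character like an apostrophe) might collapse into one ordinary longest-match rule. This is only an illustrative sketch, assuming JFlex's \p{WB:...} Unicode-property syntax; the macro and token names are made up:

```
/* Sketch only -- not a real grammar. Assumes \p{WB:...} property support. */
ALetter   = [\p{WB:ALetter}]
MidLetter = [\p{WB:MidLetter}\p{WB:MidNumLet}]
%%
/* WB5-WB7 folded into one non-chained rule: a run of letters,
   optionally joined through single mid-letter characters. */
{ALetter}+ ({MidLetter} {ALetter}+)*    { return WORD_TYPE; }
```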

Steven Rowe is the expert on this stuff, maybe he has some ideas.
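For a rough preview of UAX#29-style word breaking before any grammar exists, the JDK's java.text.BreakIterator (which approximately follows UAX#29) can be used; this sketch keeps only the segments that contain a letter or digit, i.e. the "words":

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class Uax29Preview {
    // Segment text with the JDK word BreakIterator (approximately UAX#29)
    // and keep only segments containing a letter or digit.
    static List<String> words(String text) {
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String seg = text.substring(start, end);
            if (seg.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(seg);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Commas are not mid-letter characters under UAX#29, so they split the text,
        // while a digit run attached to letters (WB9/WB10) stays in the same word.
        System.out.println(words("video,mp4,test"));
    }
}
```

Under those rules "video,mp4,test" comes out as three words, which is exactly the behavior discussed in this issue.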

> StandardTokenizer Javadoc does not correctly describe tokenization around punctuation characters
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4.1, 2.9, 2.9.1, 3.0
>            Reporter: Shyamal Prasad
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The Javadoc for StandardTokenizer states:
> {quote}
> Splits words at punctuation characters, removing punctuation. 
> However, a dot that's not followed by whitespace is considered part of a token.
> Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
> {quote}
> This is not accurate. The actual JFlex implementation treats hyphens interchangeably with punctuation. So, for example, "video,mp4,test" results in a *single* token and not three tokens as the documentation would suggest.
> Additionally, the documentation suggests that "video-mp4-test-again" would become a single token, but in reality it results in two tokens: "video-mp4-test" and "again".
> IMHO the parser implementation is fine as is, since it is hard to keep everyone happy, but it is probably worth cleaning up the documentation string.
> The patch included here updates the documentation string and adds a few test cases to confirm the cases described above.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

