lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 32687] - org.apache.lucene.analysis.cn.ChineseTokenizer missing offset decrement
Date Wed, 15 Dec 2004 03:01:06 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=32687>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=32687





------- Additional Comments From saturnism@gmail.com  2004-12-15 04:01 -------
Created an attachment (id=13758)
 --> (http://issues.apache.org/bugzilla/attachment.cgi?id=13758&action=view)
Testcase that tests ChineseTokenizer and OTHER_LETTER offsets

The problem arises when OTHER_LETTER characters and the rest of the characters
are mixed together.  When given a string "a&#22825;b", tokens and corresponding
offsets should be the following:
a : (0, 1)
&#22825; : (1, 2)
b : (2, 3)
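
To make the expected offsets concrete, here is a minimal, self-contained
sketch in plain Java. It is not the ChineseTokenizer code and does not use any
Lucene classes; the class name OffsetSketch, the Tok holder, and the
tokenize() method are hypothetical names made up for illustration. It assumes
that each OTHER_LETTER character becomes its own token and that runs of other
letters or digits are grouped into one token, with the end offset exclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Lucene ChineseTokenizer.
public class OffsetSketch {

    /** Token text with start (inclusive) and end (exclusive) offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + " : (" + start + ", " + end + ")";
        }
    }

    /**
     * Assumed rules: every OTHER_LETTER char is a single-char token;
     * consecutive other letters/digits form one token; everything else
     * is a separator. Only BMP characters are handled (no surrogate pairs).
     */
    static List<Tok> tokenize(String s) {
        List<Tok> out = new ArrayList<>();
        int runStart = -1; // start of the current non-OTHER_LETTER run, or -1
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean otherLetter = Character.getType(c) == Character.OTHER_LETTER;
            boolean wordChar = !otherLetter && Character.isLetterOrDigit(c);

            // Any char that is not part of the run closes the pending run,
            // and the run ends at i (exclusive), not at i + 1.
            if (!wordChar && runStart >= 0) {
                out.add(new Tok(s.substring(runStart, i), runStart, i));
                runStart = -1;
            }
            if (otherLetter) {
                out.add(new Tok(String.valueOf(c), i, i + 1));
            } else if (wordChar && runStart < 0) {
                runStart = i;
            }
        }
        if (runStart >= 0) {
            out.add(new Tok(s.substring(runStart), runStart, s.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        // \u5929 is the character 天 from the example above.
        for (Tok t : tokenize("a\u5929b")) {
            System.out.println(t);
        }
    }
}
```

The point of the sketch is the boundary handling: when an OTHER_LETTER
character interrupts a run, the pending token has to end at the position of
that character, which is where an off-by-one in the offsets can creep in.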

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

