lucene-java-user mailing list archives

From Paul Taylor <>
Subject Re: Best way to create own version of StandardTokenizer ?
Date Mon, 07 Sep 2009 10:07:44 GMT
Robert Muir wrote:
> Paul, thanks for the examples. In my opinion, only one of these is a
> tokenizer problem :)
> none of these will be affected by a Unicode upgrade.
>> Things like:
> another approach is using the IBM ICU library for this case, as the
> built-in Katakana-Hiragana transform works well.
> you don't need to write the rules, as they're built in, but if you are
> curious they are defined here:
> if CharFilter/the static mappings I described do not meet your
> requirements, and you want a filter that does this via the rules
> above, I can give you some code.
I think we would like to implement the complete Unicode rules, so if you 
could provide us with some code that would be great.
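For a rough feel of what the ICU transform does, here is a minimal, stdlib-only sketch of the core Katakana-to-Hiragana mapping (class and method names are mine, not from any library). It relies on the fact that the Katakana block U+30A1..U+30F6 sits exactly 0x60 above the corresponding Hiragana block U+3041..U+3096; ICU's built-in "Katakana-Hiragana" rules additionally handle iteration marks and the prolonged sound mark, which this sketch leaves untouched.

```java
// Minimal stdlib-only sketch of Katakana -> Hiragana folding.
// Covers only the plain code-point offset between the two blocks;
// ICU's built-in "Katakana-Hiragana" transform is the robust option.
public class KatakanaFolder {
    static String toHiragana(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= '\u30A1' && c <= '\u30F6') {
                // Katakana block is offset 0x60 above Hiragana.
                out.append((char) (c - 0x60));
            } else {
                out.append(c); // everything else passes through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHiragana("カタカナ")); // prints "かたかな"
    }
}
```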
> in this case, it appears you want to do fullwidth-halfwidth conversion
> (hard to tell from the ticket but it claims that solves the issue)
> you could use a similar CharFilter approach as I described above for this one.
If there is a mapping from halfwidth to fullwidth, that would work: 
everything gets converted to fullwidth for both indexing and searching. 
But having read the details, it seems that to convert a halfwidth 
character you have to know you are looking at Chinese (or 
Korean/Japanese etc.). Since the MusicBrainz system supports any 
language and the user doesn't specify the language being used when 
searching, I cannot safely convert these characters, because they may 
just be Latin and so on. However, when the entity is added to the 
database the language is specified, so I could do a conversion like 
this to ensure all Chinese albums are always indexed as fullwidth, and 
then educate users to use fullwidth characters.
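The static-mapping idea Robert describes could look roughly like this toy sketch (the class name and the tiny three-entry map are mine, purely for illustration). One point worth noting: folding halfwidth Katakana is safe regardless of the document's language, because those code points (U+FF61..U+FF9F) are unambiguously Katakana; a real table would cover that whole block, and voiced pairs such as "ｶ" + "ﾞ" -> "ガ" need a two-to-one mapping that this skips.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of a static halfwidth -> fullwidth Katakana mapping,
// the kind of table a CharFilter could apply before tokenization.
public class HalfwidthFolder {
    private static final Map<Character, Character> MAP = new HashMap<>();
    static {
        MAP.put('\uFF76', '\u30AB'); // ｶ -> カ
        MAP.put('\uFF80', '\u30BF'); // ﾀ -> タ
        MAP.put('\uFF85', '\u30CA'); // ﾅ -> ナ
    }

    static String fold(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (char c : in.toCharArray()) {
            out.append(MAP.getOrDefault(c, c)); // unmapped chars pass through
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("ｶﾀｶﾅ")); // prints "カタカナ"
    }
}
```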
> alternatively, you could write java code. this kind of mapping is done
> within the CJKTokenizer in Lucene's contrib, and you could steal some
> code from there.
That's not really going to work for me, because I need to handle all 
scripts; if I add extra Chinese handling to the tokenizer, I expect 
I'll break the handling for other languages.
> but a different way to look at this, is that its just one example of
> Unicode normalization (compatibility decomposition)
> so you could say, implement a tokenfilter that normalizes your text to
> NFKC and solve this problem, as well as a bunch of other issues in a
> bunch of other languages.
> if you want code to do this, there are several open jira tickets in
> lucene with different implementations.
I assume that, once again, you have to know the script being used in 
order to do this.
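As far as I know, no script detection is needed here: NFKC is defined per code point, so the same normalization is safe on any input. A minimal sketch with the JDK's own java.text.Normalizer (the wrapper class name is mine; inside Lucene this logic would typically live in a TokenFilter over the term attribute):

```java
import java.text.Normalizer;

// Sketch of NFKC normalization using the JDK's java.text.Normalizer.
// NFKC is per-code-point: fullwidth Latin folds to ASCII, halfwidth
// Katakana folds to fullwidth Katakana, plain ASCII is unchanged.
public class NfkcDemo {
    static String nfkc(String in) {
        return Normalizer.normalize(in, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(nfkc("Ｌｕｃｅｎｅ")); // prints "Lucene"
        System.out.println(nfkc("ｶﾀｶﾅ"));      // prints "カタカナ"
    }
}
```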
> this is a tokenization issue. it's also not Unicode-standard (as really
> geresh/gershayim etc. should be used).
> in the unicode standard (uax #29 segmentation), this issue is
> specifically mentioned:
> For Hebrew, a tailoring may include a double quotation mark between
> letters, because legacy data may contain that in place of U+05F4 (״)
> gershayim. This can be done by adding double quotation mark to
> MidLetter. U+05F3 (׳) HEBREW PUNCTUATION GERESH may also be included
> in a tailoring.
> So the easiest way for you to get this, would be to modify jflex rules
> for these characters to behave differently, perhaps only when
> surrounded by hebrew context.
I think there are two issues. First, the data needs to be indexed to 
always use gershayim: is this what you are suggesting? I couldn't 
follow how to change the JFlex rules.
Second, it's an issue for the query parser that the user uses a " for 
searching but doesn't escape it, and I cannot automatically escape it 
for them because it may not be Hebrew.
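As an index-time alternative to tailoring the JFlex grammar, the UAX #29 tailoring quoted above could be approximated with a regex that only rewrites a double quote when it sits between two Hebrew letters, so Latin text is never touched. A sketch (class name is mine; Java's regex engine supports Unicode scripts via \p{IsHebrew} since Java 7). This addresses indexing only, not the query-parser escaping problem.

```java
import java.util.regex.Pattern;

// Sketch: fold an ASCII double quote between two Hebrew letters into
// U+05F4 HEBREW PUNCTUATION GERSHAYIM, leaving all other quotes alone.
public class GershayimFolder {
    private static final Pattern QUOTE_IN_HEBREW =
            Pattern.compile("(?<=\\p{IsHebrew})\"(?=\\p{IsHebrew})");

    static String fold(String in) {
        return QUOTE_IN_HEBREW.matcher(in).replaceAll("\u05F4");
    }

    public static void main(String[] args) {
        // An abbreviation like שב"ס gains a real gershayim;
        // ordinary quoted Latin text is unchanged.
        System.out.println(fold("שב\"ס"));
        System.out.println(fold("a \"quoted\" word"));
    }
}
```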

