lucene-java-user mailing list archives

From Paul Taylor <>
Subject Re: Best way to create own version of StandardTokenizer ?
Date Mon, 07 Sep 2009 14:47:11 GMT
Robert Muir wrote:
>> I think we would like to implement the complete unicode rules, so if you
>> could provide us with some code that would be great.
> ok, I will follow up... what version of lucene are you using, 2.9?
> ...
>> but having read the
>> details it would seem that to convert a halfwidth character you would have to
>> know you were looking at Chinese (or Korean/Japanese etc.), but the
>> MusicBrainz system supports any language and the user doesn't specify the
>> language being used when searching
> no, there's no language involved... why would you not simply apply the
> filter all the time?
> if i am looking at Ｔ (fullwidth character T), it should be indexed as T
> every time (or later probably t if you are going to apply
> lowercasefilter)
I'm obviously misunderstanding. I thought that halfwidth was an encoding 
that allowed storing the most common Chinese characters in a single byte, 
so the characters would be read as different characters if you assumed 
they were using the halfwidth encoding rather than a Latin encoding. But 
are you saying halfwidth characters are actually valid Unicode characters 
with their own distinct code points, so I can just use a CharFilter again 
to map them? Please confirm.
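For reference, the fullwidth ASCII variants live at U+FF01..U+FF5E, each exactly 0xFEE0 above its ordinary ASCII counterpart, so the mapping reduces to a fixed offset. A minimal standalone sketch of that fold, outside Lucene's CharFilter API (the class and method names here are made up for illustration):

```java
public class FullwidthFolder {
    /** Map fullwidth ASCII variants (U+FF01..U+FF5E) to their
     *  ordinary ASCII counterparts (U+0021..U+007E), and the
     *  ideographic space U+3000 to a plain space. */
    public static String fold(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0)); // fixed offset between the blocks
            } else if (c == '\u3000') {
                sb.append(' ');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Running fold("Ｔｅｓｔ") yields "Test"; ordinary ASCII passes through untouched.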
>> I assume once again you have to know the script being used in order to do
>> this
> this is ok, because normalization, if you want to do it that way, is
> definitely not language dependent!
> its not like collation, where you have a locale 'parameter', its a
> language-independent process.
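The locale-free nature of this can be seen with java.text.Normalizer in the JDK: NFKC compatibility normalization folds fullwidth forms to their ordinary counterparts without any language parameter at all. A small sketch (the class name is mine):

```java
import java.text.Normalizer;

public class NfkcDemo {
    /** NFKC compatibility normalization maps fullwidth forms
     *  to their ordinary counterparts; note there is no locale
     *  argument anywhere in the API. */
    public static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }
}
```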
>> I think there are two issues: firstly, the data needs to be indexed to always
>> use gershayim; is this what you are suggesting? I couldn't follow how to
>> change the JFlex grammar.
> you are right, for you there are a couple issues.
> first, i do not know what standardtokenizer does with
> geresh/gershayim, forget about single quote/double quote.
> but to fix the tokenization (gershayim example), you want to ensure
> you do not split on these.
> since this is used in hebrew acronym, i would modify the acronym rule to allow
> [hebrew letter]+ (" | ״) [hebrew letter]+
> next, if you want these to be indexed the same so that ארה"ב and ארה״ב
> will match, you will need to create a tokenfilter
> to standardize " to ״ for acronyms.
Oh I see, so we convert one to the other, but only when it matches the 
acronym pattern.
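A sketch of that standardization step, reduced to plain string handling rather than an actual Lucene TokenFilter (the names are mine, and the Hebrew-letter check covers only the basic U+05D0..U+05EA range):

```java
public class GershayimNormalizer {
    private static boolean isHebrewLetter(char c) {
        return c >= '\u05D0' && c <= '\u05EA';
    }

    /** Replace an ASCII double quote with U+05F4 (HEBREW PUNCTUATION
     *  GERSHAYIM) when it sits between two Hebrew letters, i.e. in
     *  acronym position; quotes elsewhere are left untouched. */
    public static String normalize(String in) {
        char[] cs = in.toCharArray();
        for (int i = 1; i < cs.length - 1; i++) {
            if (cs[i] == '"'
                    && isHebrewLetter(cs[i - 1])
                    && isHebrewLetter(cs[i + 1])) {
                cs[i] = '\u05F4';
            }
        }
        return new String(cs);
    }
}
```

With this, normalize("ארה\"ב") returns ארה״ב, so both spellings index to the same term, while phrase quotes around non-Hebrew text are untouched.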
>> Then its an issue for the query parser that the user uses a " for searching
>> but doesn't escape it, but I cannot automatically escape it because it may
>> not be Hebrew.
> yes, you have a queryparser parsing ambiguity because " is also the
> phrase operator.
> I don't know what to recommend here off the top of my head... do you
> allow phrase queries?
Yes we do; we allow full Lucene syntax if the 'Advanced Query' option 
is selected.
> also as an fyi, when i say according to unicode they should be using
> gershayim instead of double-quote, this is a little theoretical.
> its not very user-friendly to expect users to use gershayim for input,
> when its not even on hebrew keyboard layout...!
Understood, so I think users will continue to use the double-quote 
character in their searches.
