lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: deprecating Versions
Date Mon, 29 Nov 2010 19:48:23 GMT
On 11/29/2010 01:43 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 12:51 PM, DM Smith <dmsmith555@gmail.com> wrote:
>>> Instead, you should use a Tokenizer that respects canonical
>>> equivalence (tokenizes text that is canonically equivalent in the same
>>> way), such as UAX29Tokenizer/StandardTokenizer in branch_3x. Ideally
>>> your filters too, will respect this equivalence, and you can finally
>>> normalize a single time at the *end* of processing.
>> Should it be normalized at all before using these? NFKC?
>>
> Sorry, I wanted to answer this one too :)

Thanks!

> NFKC is definitely a case where it's likely what you want for search,
> but you don't want to normalize your documents to this... it removes
> certain distinctions important for display.
I have found that for everything but Hebrew, NFC is really good for 
display. For some reason, Hebrew does better with NFD. But since I can't 
see the nuances of some scripts, e.g. Farsi/Arabic (parochial vision at 
work), that's not saying much. I agree, the K forms are terrible for 
display.
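In case it's useful to anyone else, here's the one-liner we use on the 
display side. It's ICU4J's Normalizer (we can't rely on 
java.text.Normalizer, which only arrived in Java 6); the class name and 
the isHebrew flag are just illustrative:

import com.ibm.icu.text.Normalizer;

public class DisplayNormalizer {
    /** NFC for display; Hebrew seems to render better left in NFD. */
    public static String forDisplay(String rawText, boolean isHebrew) {
        return Normalizer.normalize(rawText,
                isHebrew ? Normalizer.NFD : Normalizer.NFC);
    }
}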

In the context of my app, the document is accepted as is and is not 
stored in the index. As we are not on 3.x yet, and I've not backported 
your tokenizers, I'm stuck with a poor 2.x implementation. At this time 
we do not normalize the stream as it is indexed or searched, and the 
result is terrible. For example, the user can copy displayed Farsi text 
and search with it successfully, but when they compose the same text 
from the keyboard, the search finds nothing. Normalizing the text as it 
is passed to indexing and to searching improves the situation greatly. 
The results do vary by form, but any normalization is far better than 
none at all.
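Concretely, the change amounts to running the same normalization over 
the raw text on both paths before Lucene ever sees it. A rough, 
untested sketch with ICU4J; NormalizingHelper is just an illustrative 
name:

import com.ibm.icu.text.Normalizer;

public class NormalizingHelper {
    /** Apply one canonical form on both the index and the search path,
     *  so copied text and keyboard-composed text produce the same terms. */
    public static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.NFC);
    }
}

Both the document body and the user's query string go through the same 
method before analysis; applying one form consistently matters more 
than which form it is.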

I appreciate your input as I'm working on making the change and the 
upgrade to 3.x/Java 5.

> If you are going to normalize to NFK[CD], that's a good reason to
> deal with normalization in the analysis process, instead of
> normalizing your docs to these destructive, lossy forms. (I do, however,
> think it's ok to normalize the docs to NFC for display; this is
> probably a good thing, because many rendering engines+fonts will
> display it better.)
>
> The ICUTokenizer/UAX29Tokenizer/StandardTokenizer only respects
> canonical equivalence, not compatibility equivalence, but I think this
> is actually good. Have a look at the examples in
> http://unicode.org/reports/tr15/, such as fractions and subscripts.
> It's sort of up to the app to determine how it wants to deal with these,
> so treating 2⁵ the same as "25" by default (that's what NFKC will do!)
> early in the analysis process is dangerous. An app might want to
> normalize this to "32".
I don't know if it is still there, but IBM had a web form where one 
could submit input and see it transformed into the various forms. I 
found it very educational.
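For anyone who can't find that form: on Java 6, java.text.Normalizer 
makes the superscript behavior easy to see for yourself:

import java.text.Normalizer;

public class NfkcDemo {
    public static void main(String[] args) {
        String input = "2\u2075"; // "2" followed by SUPERSCRIPT FIVE
        System.out.println(Normalizer.normalize(input, Normalizer.Form.NFC));  // "2⁵" (unchanged)
        System.out.println(Normalizer.normalize(input, Normalizer.Form.NFKC)); // "25"
    }
}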

> So it can be better to normalize towards the end of your analysis
> process. E.g., have a look at ICUNormalizer2Filter, which supports the
> NFKC_CaseFold normal form (NFKC + CaseFold + removing Ignorables) in
> addition to the standard ones, and ICUFoldingFilter, which is just
> like that, except it does additional folding for search (like removing
> diacritics). These foldings are computed recursively up front, so they
> give a stable result.
Many thanks. This is very helpful.
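For the archives, here is roughly the chain I take you to be 
describing: tokenize first, fold at the very end. An untested sketch 
against branch_3x; the class name is mine, and it assumes the contrib 
ICU jar is on the classpath:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class FoldingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // The tokenizer respects canonical equivalence; fold only at
        // the END of the chain, per the advice above.
        TokenStream stream = new StandardTokenizer(Version.LUCENE_31, reader);
        return new ICUFoldingFilter(stream); // NFKC_CaseFold + extra search folds
    }
}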

-- DM


