lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: deprecating Versions
Date Mon, 29 Nov 2010 17:51:35 GMT
On 11/29/2010 09:40 AM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 9:05 AM, DM Smith <dmsmith555@gmail.com> wrote:
>> In my project, I don't use any of the Analyzers that Lucene provides, but I have
>> variants of them. (Mine take flags indicating whether to filter stop words and whether
>> to do stemming.) The effort recently has been to change these analyzers to follow the new
>> reuse pattern to improve performance.
>>
>> Had there been a declarative mechanism, I wouldn't have needed to make the changes.
> Right, this is I think what we want?
It's what I want. I think non-power users would like it as well.

The other thing I'd like is for the spec to be saved alongside the
index as a manifest. From earlier threads, I can see that there might
need to be one for writing and another for reading. I'm not interested
in using it to construct an analyzer, but in determining whether the
index is invalid with respect to the analyzer currently in use.
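To make that concrete, something like the sketch below is all I'm after,
assuming the commit user-data API in 2.9/3.x works the way I remember.
The class name and the spec-string format here are made up:

    import java.io.IOException;
    import java.util.Collections;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    public final class AnalyzerManifest {
      public static final String KEY = "analyzer.spec";

      // Record a description of the analysis chain with each commit,
      // e.g. "StandardTokenizer(3.1)|ICUFoldingFilter|StopFilter(en)".
      public static void record(IndexWriter writer, String spec) throws IOException {
        writer.commit(Collections.singletonMap(KEY, spec));
      }

      // Compare the stored spec against the one the application is using
      // now; a mismatch means the index should be treated as invalid.
      public static boolean matches(Directory dir, String currentSpec) throws IOException {
        Map<String,String> userData = IndexReader.getCommitUserData(dir);
        return currentSpec.equals(userData == null ? null : userData.get(KEY));
      }
    }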

>   To just provide examples so the
> user can make what they need to suit their application.
>
>> WRT an analyzer, if any of the following changes, all bets are off:
>>     Tokenizer (i.e. which tokenizer is used)
>>     The rules that a tokenizer uses to break text into tokens (e.g. query parser, break iterator, ...)
>>     The type associated with each token (e.g. word, number, url, ...)
>>     Presence/absence of a particular filter
>>     Order of filters
>>     Tables that a filter uses
>>     Rules that a filter encodes
>>     The version and implementation of Unicode being used (whether via ICU, Lucene and/or Java)
>>     Bugs fixed in these components.
>> (This list is adapted from an email I wrote to a user's group explaining why texts need to be re-indexed.)
>>
> Right, I agree, and some of these things (such as JVM Unicode version)
> are completely outside of our control.
> But for the things inside our control, where are the breaks that
> caused you any reindexing?
The JVM version is not entirely out of our control: 3.x requires a Java 5
JVM. So going from 2.9.x to 3.1 (I can skip 3.0) requires a different
Unicode version. I bet most desktop applications using Lucene 2.9.x are
running Java 5 or Java 6, so upgrading to 3.1 won't be an issue for them.
This issue really only affects Mac OS X.

But this is also a problem today, outside of our control. A user of a
desktop application under 2.x can have an index built with Java 1.4.2 and
then upgrade to Java 5 or 6. Unless the desktop application knows to look
for this and "invalidate" the index, tough luck.

I'd have to look to be sure: IIRC, Turkish was one; the treatment of 'i'
was buggy. Russian had its own encoding that was replaced with UTF-8.
The QueryParser had bug fixes. There is some effort to migrate from the
old stemmers to Snowball, but at least the Dutch one is not "identical".
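(For what it's worth, the Turkish 'i' issue is the usual locale-sensitive
casing trap. A tiny illustration, nothing Lucene-specific:)

    import java.util.Locale;

    public class TurkishI {
      public static void main(String[] args) {
        Locale tr = new Locale("tr");
        // In Turkish, lowercasing 'I' yields dotless 'ı' (U+0131) and
        // uppercasing 'i' yields dotted 'İ' (U+0130), so tokens produced
        // with the wrong casing rules won't match at search time.
        System.out.println("QUIT".toLowerCase(tr));             // quıt
        System.out.println("quit".toUpperCase(tr));             // QUİT
        System.out.println("QUIT".toLowerCase(Locale.ENGLISH)); // quit
      }
    }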

Maybe I'm just confused, from lurking, about what is in which release,
and everything is fine.

>> Additionally, it is the user's responsibility to normalize the text, probably to NFC or
>> NFKC, before index and search. (It may need to precede the Tokenizer if it is not Unicode
>> aware. E.g. what does a LetterTokenizer do if input is NFD and it encounters an accent?)
> I would not recommend this approach: NFC doesn't mean it's going to take
> letter+accent combinations and compose them into a 'composed'
> character with the letter property... especially for non-latin
> scripts!
>
> In some cases, NFC will even cause the codepoint to be expanded: the
> NFC form of 0958 (QA) is 0915 + 093C (KA+NUKTA)... of course if you
> use LetterTokenizer with any language in this script, you are screwed
> anyway :)
>
> But even for latin scripts this won't work... not all combinations
> have a composed form and i think composed forms are in general not
> being added anymore.
I knew that NFC does not have a single codepoint for some glyphs.

I'm also seeing the trend you mention.

I'm always fighting my personal, parochial bias toward English. ;)

As an aside, my daughter is a linguist, who in summer 2009, worked on 
the development and completion of alphabets for 3 African languages. 
This was not an academic exercise but an effort to develop literacy 
among those people groups. Some of the letters in these languages are 
composed of multiple glyphs and some of the glyphs have decorations. 
It'd be interesting to see how these would be handled in Unicode (if 
they get added).

> For example, see the lithuanian sequences in
> http://www.unicode.org/Public/6.0.0/ucd/NamedSequences.txt:
>
> LATIN SMALL LETTER A WITH OGONEK AND TILDE;0105 0303
>
> You can normalize this all you want, but there is no single composed
> form, in NFC its gonna be 0105 0303.
>
> Instead, you should use a Tokenizer that respects canonical
> equivalence (tokenizes text that is canonically equivalent in the same
> way), such as UAX29Tokenizer/StandardTokenizer in branch_3x. Ideally
> your filters too, will respect this equivalence, and you can finally
> normalize a single time at the *end* of processing.
Should it be normalized at all before using these? NFKC?

> For example, don't
> use LowerCaseFilter + ASCIIFoldingFilter or something like that to
> lowercase & remove accents, but use ICUFoldingFilter instead, which
> handles all this stuff consistently, even if your text doesn't conform
> to any unicode normalization form...
Sigh. This is my point. The old contrib analyzers had no backward-
compatibility guarantee, except on an individual-contribution basis
(though they were treated with care), and the non-English ones were
weak. Most of my texts are non-Latinate, let alone non-English.

The result is that Lucene sort of works for them. The biggest hurdle has
been my lack of knowledge; second to that, the input to indexing and to
searching doesn't treat canonical equivalences as equivalent. By
normalizing to NFC before indexing and searching, I have found the
results to be far better.
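Concretely, all I do today is something like this, before the text ever
reaches the analyzer (a sketch; it assumes Java 6's java.text.Normalizer,
so on Java 5 I'd have to use ICU4J instead):

    import java.text.Normalizer;

    public final class Nfc {
      private Nfc() {}

      // Apply the same normalization to the text being indexed and to the
      // query text, so canonically equivalent input yields identical tokens.
      public static String nfc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC)
            ? text
            : Normalizer.normalize(text, Normalizer.Form.NFC);
      }
    }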

I have learned a lot from lurking on this list about handling 
non-English/Latinate text with care. And as a result, with each release 
of Lucene, I want to work those fixes/improvements into my application.

My understanding is that indexes built with the old analyzers will have
problems that might not be readily apparent. You have done great work on
alternative tokenizers and filters.
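If I follow your advice, each of my per-language analyzers would boil
down to something like this sketch (against the 3.x StandardTokenizer
and the contrib ICU module as I understand them; my stop-word and
stemming flags are left out, and the class name is mine):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.icu.ICUFoldingFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class FoldingAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // StandardTokenizer in branch_3x tokenizes canonically equivalent
        // text the same way; ICUFoldingFilter then case-folds and strips
        // accents consistently, replacing LowerCaseFilter + ASCIIFoldingFilter.
        TokenStream stream = new StandardTokenizer(Version.LUCENE_31, reader);
        stream = new ICUFoldingFilter(stream);
        return stream;
      }
    }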
>> Recently, we've seen that there is some mistrust here in JVMs at the same version level
>> from different vendors (Sun, Harmony, IBM) in producing the same results. (IIRC: Thai
>> break iterator. Random tests.)
> Right, Sun JDK 7 will be a new unicode version. Harmony uses a
> different unicode version than Sun. There's nothing we can do about
> this except document it?
I don't know if it would make sense to have a JVM/Unicode map, e.g.
(vendor + JVM version => Unicode version), and to note that tuple when an
index is created. Upon opening an index for reading or writing, the
current value could be compared to the stored value. If they don't match,
something could be done (warning? error?).

This could be optional where the default is to do no check and to store 
nothing.
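Roughly what I'm imagining is the sketch below. Where the tuple gets
stored (commit user data, a sidecar file) is an open question, and the
Unicode version would have to come from a table we maintain, since the
JVM doesn't expose one as a system property:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public final class JvmFingerprint {
      // Captured when the index is created; compared when it is reopened.
      public static Map<String,String> capture() {
        Map<String,String> m = new LinkedHashMap<String,String>();
        m.put("java.vendor", System.getProperty("java.vendor"));
        m.put("java.version", System.getProperty("java.version"));
        return m;
      }

      // The caller decides whether a mismatch is a warning or an error.
      public static boolean matches(Map<String,String> stored) {
        return stored != null && stored.equals(capture());
      }
    }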

But documentation is always good.

> Whether or not a special customized break iterator for Thai locale
> exists, and how it works, is just a jvm "feature". There's nothing we
> can do about this except document it?

>> Within a release of Lucene, a small handful of analyzers may have changed sufficiently
>> to warrant re-indexing of indexes built with them.
> which ones changed in a backwards-incompatible way that forced you to reindex?

Other than the change to the JVM, which really is out of my control:
maybe I'm not reading it correctly, but the QueryParser is changing in
3.1. If one wants the old QueryParser, one has to have one's own
Analyzer implementations.


>> So basically, I have given up on Lucene being backward compatible where it matters
>> the most to me: Stable analyzer components. The gain I get from this admission is far
>> better. YMMV.
>>
> which ones changed in a backwards-incompatible way that forced you to reindex?
Basically, the ones in contrib. Because of the lack of a strong bw-compat
guarantee, I am only fairly confident that nothing changed. I know the
tests have been improving, but when I started contributing small changes
to them, I thought they were rudimentary. That didn't give me a lot of
confidence that any contrib analyzer is stable. But ultimately, it's
because I want the best analysis for each language's text that I use the
improvements.

I wish I had more time to help.

If needed, I can do a review of the code to give an exact answer.

I could have and should have made it clearer.

-- DM


