lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ivan Vasilev <ivasi...@sirma.bg>
Subject Re: Is there bug in Range searches?
Date Mon, 22 Oct 2007 15:11:29 GMT
10x Hoss for the answer. It is good news that this topic is very rare 
and clients do not complain about this. I hope our clients will also not 
complain :)

Looking strictly at this I think this leads to a non correct behavior on 
indexing applications, but as there are no unsatisfied clients may be no 
more comments about this are needed.
I will just explain what I mean when speaking about incorrect behaviour. 
Imagine index contains German files content. There are the following 
words indexed: "Alle", "beschlossen", "foto", "für", "Ältere".
This is the order of the words in the index, because this is the order 
imposed by String.compareTo method. But when using German collator it 
always puts the word "Ältere" after "Alle" and before "beschlossen", no 
matter what strength level is used. This means that when sorting we will 
have the words in the following order: "Alle", "Ältere", "beschlossen", 
"foto", "für". And now imagine user makes range query like this - 
content:(Alle TO foto) and expects the items containing "Ältere" and 
"beschlossen" to be returned. But as in the index the word "Ältere" is 
at the last place it's entry will not be returned, but only those of 
"beschlossen" will persist in the results.
So in German the problem comes only with 3 umlaut letters but as we will 
have clients using CJK documents I do not now how big is the difference 
between String.compareTo and Collator.compare methods for those languages.
I hope we will not have problems with those clients, but if any I will 
change a bit source of Term and TermBuffer classes. By the way I have 
experience with changes in those clases where I made values of numeric 
fields to be ordered in numeric order but not in lexicographical one.

Thanks ones again.
Ivan

Chris Hostetter wrote:
> :   1. Are Range queries work correctly with all languages for which
> :      there are analyzers? (for example CJK and Thai);
>
> Terms when indexed are allways ordered lexigraphically (using 
> Term.compareTo which uses String.compareTo) ... regardless of what field 
> or language they are in, so "Range Queries" must do their comparisons 
> lexigraphically as well.
>
> because all Terms are indexed in one continuous TermEnum, it would be 
> fairly imposible to definite different Collators per field at index time.
>
> it's pretty rare that anyone ever talks about Range Queries on words, so 
> this doesn't typiclly pose a problem ... i've never seen anyone comment 
> that they can't do sane Range queries on their text because the language 
> it's in doesn't collate in the same order as the default compareTo.
>
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> __________ NOD32 2605 (20071022) Information __________
>
> This message was checked by NOD32 antivirus system.
> http://www.eset.com
>
>
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message