lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-7052) BytesRefHash.sort should always sort in unicode code point order
Date Sun, 28 Feb 2016 10:34:22 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170991#comment-15170991 ]

Uwe Schindler edited comment on LUCENE-7052 at 2/28/16 10:34 AM:
-----------------------------------------------------------------

Hi Mike,
I know we originally added the different comparators so that the index term dict could be
sorted in a different order. This never proved to be useful, as many Lucene queries rely
on the default order. The only codec that used another byte order internally was the Lucene
3 one (but it used the unicode spaghetti algorithm to reorder its term enums at runtime).
As this is now all gone, I'd suggest also removing the utf8AsUtf16 comparator. Maybe remove
the comparators altogether and just implement BytesRef.compareTo() and use that one for sorting?
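
For illustration only, a minimal sketch of what such a compareTo() could look like. SimpleBytesRef
is a hypothetical stand-in and not the real BytesRef class; the sketch assumes the bytes are valid
UTF-8 and compares them as unsigned values, which for UTF-8 matches Unicode code point order.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for BytesRef (an assumption for this sketch, not Lucene code).
final class SimpleBytesRef implements Comparable<SimpleBytesRef> {
    final byte[] bytes;
    final int offset;
    final int length;

    SimpleBytesRef(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    // Convenience factory: encodes the string as UTF-8.
    static SimpleBytesRef of(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        return new SimpleBytesRef(b, 0, b.length);
    }

    // Unsigned byte-wise comparison; a shorter prefix sorts first.
    @Override
    public int compareTo(SimpleBytesRef other) {
        int n = Math.min(length, other.length);
        for (int i = 0; i < n; i++) {
            int a = bytes[offset + i] & 0xFF;
            int b = other.bytes[other.offset + i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return length - other.length;
    }
}
```

With this ordering a BMP character such as U+FF21 sorts before a supplementary character such
as U+10000, whereas String.compareTo() (UTF-16 code unit order) puts them the other way round.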

I checked the code: utf8SortedAsUTF16SortOrder is only used in TSTLookup and nowhere else anymore
(except some tests that check alternative sorts - those can be removed).

As a first step I changed the BytesRef code to no longer use inner classes and instead use
lambdas to define the comparators. But I'd suggest removing at least the UTF-16 one completely
from BytesRef and moving it to TSTLookup as a private impl detail (as it is only used there).

_FYI: The lambda has no speed impact because it is called only once and internally compiles
to a class file that implements Comparator. It just looks nicer than the horrible comparator
classes_
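
A rough sketch of the idea, not the actual patch: a UTF-16-order comparator defined as a lambda
and kept private to the one class that needs it. Utf16OrderSketch and its compare() helper are
hypothetical names; the byte fix-up is the commonly described trick of shifting the lead bytes
0xEE/0xEF above the supplementary lead bytes, and the sketch assumes valid UTF-8 input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

// Hypothetical holder class standing in for a lookup implementation such as TSTLookup.
final class Utf16OrderSketch {
    // Lambda-defined comparator, private to this class.
    private static final Comparator<byte[]> UTF8_AS_UTF16 = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                // Lead bytes 0xEE/0xEF (U+E000..U+FFFF) must sort after the
                // supplementary lead bytes (0xF0..0xF4) to match UTF-16 order.
                if (x >= 0xEE && y >= 0xEE) {
                    if ((x & 0xFE) == 0xEE) x += 0xE;
                    if ((y & 0xFE) == 0xEE) y += 0xE;
                }
                return x - y;
            }
        }
        return a.length - b.length;
    };

    // Package-private helper so the ordering can be exercised from outside.
    static int compare(String s, String t) {
        return UTF8_AS_UTF16.compare(
            s.getBytes(StandardCharsets.UTF_8),
            t.getBytes(StandardCharsets.UTF_8));
    }
}
```

For the inputs tried here the sign of the result agrees with String.compareTo(), e.g. U+FF21
sorts after U+1F600 in this order.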


> BytesRefHash.sort should always sort in unicode code point order
> ----------------------------------------------------------------
>
>                 Key: LUCENE-7052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7052
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master, 6.0
>
>         Attachments: LUCENE-7052-cleanup1.patch, LUCENE-7052.patch
>
>
> Today {{BytesRefHash.sort}} takes a custom {{Comparator}} but we always pass it {{BytesRef.getUTF8SortedAsUnicodeComparator()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

