lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-2798) Randomize indexed collation key testing
Date Thu, 09 May 2013 23:06:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2798:
----------------------------------

    Fix Version/s:     (was: 4.3)
                   4.4
    
> Randomize indexed collation key testing
> ---------------------------------------
>
>                 Key: LUCENE-2798
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
>             Project: Lucene - Core
>          Issue Type: Test
>          Components: modules/analysis
>    Affects Versions: 3.1, 4.0-ALPHA
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: LUCENE-2798.patch, LUCENE-2798.patch
>
>
> Robert Muir noted on the #lucene IRC channel today that Lucene's indexed collation key tests
are currently fragile (for example, they had to be revisited when Robert upgraded the ICU dependency
in LUCENE-2797, because of Unicode 6.0 collation changes) and that their coverage is trivial (only
5 locales tested, and no collator options exercised).  This affects both the JDK implementation
in {{modules/analysis/common/}} and the ICU implementation under {{modules/icu/}}.
> The key thing to test is that the order of the indexed terms is the same as that provided
by the Collator itself.  Instead of the current set of static tests, this could be achieved
via indexing randomly generated terms' collation keys (and collator options) and then comparing
the index terms' order to the order provided by the Collator over the original terms.
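
The comparison described above can be sketched without Lucene, using only the JDK Collator. This is a minimal sketch under stated assumptions, not the patch attached to this issue: the class name, the term generator, and the pair-wise check are all hypothetical. It assumes the indexed term is the raw java.text.CollationKey byte array and that the index orders terms by unsigned byte comparison.

```java
import java.text.Collator;
import java.util.Locale;
import java.util.Random;

// Hypothetical helper, not part of Lucene or of the attached patch.
final class CollationKeyOrderCheck {

  /** Unsigned, most-significant-byte-first comparison of two binary keys. */
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  /**
   * Assumed generator: 1 to 8 letters or digits with code units below maxCodeUnit.
   * Keep maxCodeUnit above '0' (so a match exists) and at most 0xD800 (no surrogates).
   */
  static String randomTerm(Random random, int maxCodeUnit) {
    int length = 1 + random.nextInt(8);
    StringBuilder sb = new StringBuilder(length);
    while (sb.length() < length) {
      char c = (char) random.nextInt(maxCodeUnit);
      if (Character.isLetterOrDigit(c)) {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  /** Counts random pairs whose collation key order disagrees with Collator.compare. */
  static int countMismatches(Collator collator, long seed, int pairs, int maxCodeUnit) {
    Random random = new Random(seed);
    int mismatches = 0;
    for (int i = 0; i < pairs; i++) {
      String a = randomTerm(random, maxCodeUnit);
      String b = randomTerm(random, maxCodeUnit);
      int expected = Integer.signum(collator.compare(a, b));
      byte[] keyA = collator.getCollationKey(a).toByteArray();
      byte[] keyB = collator.getCollationKey(b).toByteArray();
      int actual = Integer.signum(compareUnsigned(keyA, keyB));
      if (expected != actual) {
        mismatches++;
      }
    }
    return mismatches;
  }

  public static void main(String[] args) {
    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.TERTIARY);
    long seed = System.nanoTime(); // print the seed so a failing run can be reproduced
    int mismatches = countMismatches(collator, seed, 1000, 0x0250);
    System.out.println("seed=" + seed + " mismatches=" + mismatches);
  }
}
```

A randomized test along these lines would also pick the locale, strength, and decomposition mode from the seeded Random, so that collator options get exercised as well.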
> Since different terms may produce the same collation key, however, the order of indexed
terms is inherently unstable.  When performing runtime collation, the Collator addresses the
sort stability issue by adding a secondary sort over the normalized original terms.  In order
to directly compare Collator's sort with Lucene's collation key sort, a secondary sort will
need to be applied to Lucene's indexed terms as well. Robert has suggested indexing the original
terms in addition to their collation keys, then using a Sort over the original terms as the
secondary sort.
> Another complication: Lucene 3.x compares terms in Java's UTF-16 code unit order, while trunk
uses UTF-8 byte order (which is the same as code point order), so the implemented secondary sort
will need to respect that.
> From #lucene:
> {quote}
> rmuir__: so i think we have to on 3.x, sort the 'expected list' with Collator.compare,
if thats equal, then as a tiebreak use String.compareTo
> rmuir__: and in the index sort on the collated field, followed by the original term
> rmuir__: in 4.x we do the same thing, but dont use String.compareTo as the tiebreak for
the expected list
> rmuir__: instead compare codepoints (iterating character.codepointAt, or comparing .getBytes("UTF-8"))
> {quote}
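
The two "expected list" orderings from the IRC discussion could be built as comparators like the following. This is a sketch under assumptions: the class and method names are hypothetical, and it uses Java 8 comparator chaining, which postdates the Lucene versions discussed. For well-formed strings, code point order is the same as unsigned UTF-8 byte order.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or of the attached patch.
final class ExpectedOrder {

  /** Code point order; for well-formed strings this equals unsigned UTF-8 byte order. */
  static final Comparator<String> CODE_POINT_ORDER = (a, b) -> {
    int n = Math.min(a.length(), b.length());
    int i = 0;
    while (i < n) {
      int ca = a.codePointAt(i);
      int cb = b.codePointAt(i);
      if (ca != cb) {
        return Integer.compare(ca, cb);
      }
      i += Character.charCount(ca); // equal code points occupy the same number of chars
    }
    return Integer.compare(a.length(), b.length());
  };

  /** 3.x expectation: Collator.compare, then String.compareTo (UTF-16 code unit order). */
  static Comparator<String> for3x(Collator collator) {
    Comparator<String> primary = collator::compare;
    return primary.thenComparing(Comparator.<String>naturalOrder());
  }

  /** Trunk expectation: Collator.compare, then code point order. */
  static Comparator<String> forTrunk(Collator collator) {
    Comparator<String> primary = collator::compare;
    return primary.thenComparing(CODE_POINT_ORDER);
  }

  public static void main(String[] args) {
    String bmp = "\uFF61";                // U+FF61, a BMP character above the surrogate range
    String supplementary = "\uD800\uDC00"; // U+10000, encoded as a surrogate pair
    // UTF-16 order puts the surrogate pair first (0xD800 < 0xFF61);
    // code point order puts it last (0x10000 > 0xFF61).
    System.out.println(Integer.signum(bmp.compareTo(supplementary)));
    System.out.println(Integer.signum(CODE_POINT_ORDER.compare(bmp, supplementary)));

    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(Collator.PRIMARY); // "a" and "A" now collate as equal
    System.out.println(Integer.signum(forTrunk(collator).compare("A", "a")));
  }
}
```

The two tiebreaks only disagree when a supplementary character is compared against a BMP character above the surrogate range, so randomly generated terms need to include both kinds to exercise the difference.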

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

