Return-Path:
Delivered-To: apmail-lucene-dev-archive@www.apache.org
Received: (qmail 99891 invoked from network); 11 Apr 2011 15:18:45 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 11 Apr 2011 15:18:45 -0000
Received: (qmail 24647 invoked by uid 500); 11 Apr 2011 15:18:44 -0000
Delivered-To: apmail-lucene-dev-archive@lucene.apache.org
Received: (qmail 24599 invoked by uid 500); 11 Apr 2011 15:18:44 -0000
Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@lucene.apache.org
Delivered-To: mailing list dev@lucene.apache.org
Received: (qmail 24592 invoked by uid 99); 11 Apr 2011 15:18:44 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 11 Apr 2011 15:18:43 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,T_RP_MATCHES_RCVD
X-Spam-Check-By: apache.org
Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 11 Apr 2011 15:18:42 +0000
Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id B68E39C307 for ; Mon, 11 Apr 2011 15:18:05 +0000 (UTC)
Date: Mon, 11 Apr 2011 15:18:05 +0000 (UTC)
From: "Steven Rowe (JIRA)"
To: dev@lucene.apache.org
Message-ID: <156783353.49272.1302535085744.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <18723320.111041291481951722.JavaMail.jira@thor>
Subject: [jira] [Commented] (LUCENE-2798) Randomize indexed collation key testing
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018386#comment-13018386 ]

Steven Rowe commented on
LUCENE-2798:
-------------------------------------

bq. also i don't see any check that preflex codec isn't in use for this test?

{{TestCollationKeyAnalyzer.setUp()}} handles it:

{code:java}
  @Override
  public void setUp() throws Exception {
    super.setUp();
    assumeFalse("preflex format only supports UTF-8 encoded bytes",
        "PreFlex".equals(CodecProvider.getDefault().getDefaultFieldCodec()));
  }
{code}

And in practice, the test gets skipped 25% of the time as a result of this.

> Randomize indexed collation key testing
> ---------------------------------------
>
>                 Key: LUCENE-2798
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2798.patch
>
>
> Robert Muir noted on #lucene IRC channel today that Lucene's indexed collation key testing is currently fragile (for example, they had to be revisited when Robert upgraded the ICU dependency in LUCENE-2797 because of Unicode 6.0 collation changes) and coverage is trivial (only 5 locales tested, and no collator options are exercised). This affects both the JDK implementation in {{modules/analysis/common/}} and the ICU implementation under {{modules/icu/}}.
> The key thing to test is that the order of the indexed terms is the same as that provided by the Collator itself. Instead of the current set of static tests, this could be achieved via indexing randomly generated terms' collation keys (and collator options) and then comparing the index terms' order to the order provided by the Collator over the original terms.
> Since different terms may produce the same collation key, however, the order of indexed terms is inherently unstable. When performing runtime collation, the Collator addresses the sort stability issue by adding a secondary sort over the normalized original terms.
> In order to directly compare Collator's sort with Lucene's collation key sort, a secondary sort will need to be applied to Lucene's indexed terms as well. Robert has suggested indexing the original terms in addition to their collation keys, then using a Sort over the original terms as the secondary sort.
> Another complication: Lucene 3.X uses Java's UTF-16 term comparison, and trunk uses UTF-8 order, so the implemented secondary sort will need to respect that.
> From #lucene:
> {quote}
> rmuir__: so i think we have to on 3.x, sort the 'expected list' with Collator.compare, if thats equal, then as a tiebreak use String.compareTo
> rmuir__: and in the index sort on the collated field, followed by the original term
> rmuir__: in 4.x we do the same thing, but dont use String.compareTo as the tiebreak for the expected list
> rmuir__: instead compare codepoints (iterating character.codepointAt, or comparing .getBytes("UTF-8"))
> {quote}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
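
For illustration only, the "expected list" ordering outlined in the IRC excerpt above (Collator.compare first, then a tie-break that depends on the branch) might be sketched roughly as follows. This is not taken from LUCENE-2798.patch: the class name CollationOrderSketch, the helper names, and the utf16TieBreak flag are all hypothetical, and the sketch assumes well-formed UTF-16 input so that code point order agrees with UTF-8 byte order.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or of the attached patch.
final class CollationOrderSketch {

  /** Compares by Unicode code point; for well-formed strings this matches UTF-8 byte order. */
  static int compareCodePoints(String a, String b) {
    int i = 0;
    int j = 0;
    while (i < a.length() && j < b.length()) {
      int cpA = a.codePointAt(i);
      int cpB = b.codePointAt(j);
      if (cpA != cpB) {
        return cpA < cpB ? -1 : 1;
      }
      i += Character.charCount(cpA);
      j += Character.charCount(cpB);
    }
    // One string is a prefix of the other (or they are equal): the shorter one sorts first.
    return (a.length() - i) - (b.length() - j);
  }

  /**
   * Primary order from the Collator; ties are broken by UTF-16 code unit order
   * (the 3.x case) or by code point order (the trunk/4.x case).
   */
  static Comparator<String> expectedOrder(final Collator collator, final boolean utf16TieBreak) {
    return (a, b) -> {
      int cmp = collator.compare(a, b);
      if (cmp != 0) {
        return cmp;
      }
      return utf16TieBreak ? a.compareTo(b) : compareCodePoints(a, b);
    };
  }

  public static void main(String[] args) {
    Collator collator = Collator.getInstance(Locale.ENGLISH);
    // PRIMARY strength ignores case, so "a" and "A" tie and the tie-break decides.
    collator.setStrength(Collator.PRIMARY);
    List<String> terms = new ArrayList<>(Arrays.asList("b", "a", "A"));
    terms.sort(expectedOrder(collator, false));
    System.out.println(terms);
  }
}
```

The two tie-breaks only disagree when supplementary characters are involved: a surrogate pair such as "\uD800\uDC00" (U+10000) sorts before U+FF5E under String.compareTo but after it by code point, which is presumably why the two branches would need different expected lists.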