From: Steven Rowe
Date: Thu, 22 Sep 2005 12:05:43 -0400
To: java-dev@lucene.apache.org
Subject: Re: UTF-8 and unit test failure for org.apache.analysis.ru.RussianStem in build with Kaffe
Message-ID: <4332D657.40601@syr.edu>
In-Reply-To: <43323882.4030205@bytemason.org>

Barry Hawkins wrote:
> Guys,
>     Hello, it's those pesky Debian Lucene package maintainers again :-).
> Lucene currently builds and passes all but one unit test against
> Kaffe[0] 1.1.6.  In debugging the failure of the unit test for
> org.apache.analysis.ru.RussianStem, I enabled a build of the JUnit test
> reports.  A detailed account is listed in Debian Bug Report #272295[1],
> but in brief, the 7-character String of Cyrillic expected is matched for
> the first five characters, then an issue occurs and what appears to be a
> few thousand characters are spewed out and the unit test fails.  I have
> a tarball of the unit test reports temporarily stored on my FTP site[2]
> if anyone would care to take a look.
>     Given the recent thread about UTF-8[3], I thought I would present
> this to you guys to see if you might have any insight on the issue.
> Thanks in advance for your time in reading this message.
>
> [0] - http://www.kaffe.org
> [1] - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=272295
> [2] - ftp://www.bytemason.org/lucene_reports_2005092001.tar.gz

From the HTML failure report for
org.apache.lucene.analysis.ru.TestRussianStem in [2] (spaces added to
align information, and &lt; and &gt; shown as '<' and '>', resp.):

   unicode expected: < б е з д о м н >
            but was: < б е з д о 㰄 㴄 ...

Rewritten in hex notation:

   unicode expected: < 0431 0435 0437 0434 043E 043C 043D >
            but was: < 0431 0435 0437 0434 043E 3C04 3D04 ...

So the last two of the seven characters appear to have their byte order
reversed: 3C04 versus 043C, and 3D04 versus 043D.

The next several characters in the output after the expected seven
characters are:

   〄 ഀ ਀ ㄄ 㔄 ...

Rewritten as hex:

   3004 0D00 0A00 3104 3504 ...

Byte swapped:

   0430 000D 000A 0431 0435 ...

that is, а, carriage return, newline, б, е. Transliterated into the
Latin alphabet, this is "a\r\nbe", where "\r" and "\n" are carriage
return and newline, resp., and the "b" is the Cyrillic character that
sounds like English "b".
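
The byte-swap reading above can be checked mechanically. What follows is only a minimal sketch, not code from Lucene or Kaffe: the class name ByteSwapCheck and its two helper methods are hypothetical, and it assumes a Java 5 or later runtime for Character.reverseBytes. It prints the 16-bit code units of the seven garbled characters quoted above, then the same units with the high and low bytes exchanged.

```java
public class ByteSwapCheck {

    /** Swaps the high and low byte of every 16-bit code unit in s. */
    public static String swapBytes(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            out.append(Character.reverseBytes(s.charAt(i)));
        }
        return out.toString();
    }

    /** Renders each 16-bit code unit as four uppercase hex digits, space separated. */
    public static String toHex(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(String.format("%04X", (int) s.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The seven garbled code units taken from the failure report.
        String garbled = "\u3C04\u3D04\u3004\u0D00\u0A00\u3104\u3504";

        // As reported: 3C04 3D04 3004 0D00 0A00 3104 3504
        System.out.println(toHex(garbled));

        // After swapping: 043C 043D 0430 000D 000A 0431 0435
        System.out.println(toHex(swapBytes(garbled)));
    }
}
```

The swapped values are the Cyrillic letters м, н, а, then CR and LF, then б, е, which matches the hand analysis. This confirms only the arithmetic; it says nothing about where in the code the bytes get reversed.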

So it looks to me as though the data following the expected output is
very likely meaningful text that has simply been byte swapped.

I hope this helps. I haven't had time to investigate the code and
connect this evidence to it.

Steve Rowe