lucene-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: UTF-8 and unit test failure for org.apache.analysis.ru.RussianStem in build with Kaffe
Date Thu, 22 Sep 2005 14:58:21 GMT
Hi Barry,

>     Hello, it's those pesky Debian Lucene package maintainers again :-).
>  Lucene currently builds and passes all but one unit test against
>Kaffe[0] 1.1.6.  In debugging the failure of the unit test for
>org.apache.analysis.ru.RussianStem, I enabled a build of the JUnit test
>reports.  A detailed account is listed in Debian Bug Report #272295[1],
>but in brief, the 7-character String of Cyrillic expected is matched for
>the first five characters, then an issue occurs and what appears to be a
>few thousand characters are spewed out and the unit test fails.  I have
>a tarball of the unit test reports temporarily stored on my FTP site[2]
>if anyone would care to take a look.
>     Given the recent thread about UTF-8[3], I thought I would present
>this to you guys to see if you might have any insight on the issue.
>Thanks in advance for your time in reading this message.

I haven't downloaded the tarball and dug into it, but one bit of
feedback is that Cyrillic has numerous encodings. A common source of
problems is text encoded using ISO-8859-5 (for example) getting
identified as KOI8-R (or vice versa), so the conversion to Unicode
goes wrong for some characters.
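
To make that concrete, here is a minimal sketch of the mix-up. It is not
taken from the Lucene tests; the class name and sample word are made up
for illustration, and it assumes the JVM in use ships the KOI8-R and
ISO-8859-5 charsets (a stock JDK does, Kaffe may differ).

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of Lucene.
public class CyrillicEncodingDemo {
    // Assumes both legacy charsets are available on this JVM.
    static final Charset KOI8_R = Charset.forName("KOI8-R");
    static final Charset ISO_8859_5 = Charset.forName("ISO-8859-5");

    // Decode raw bytes using whatever charset the caller believes they are in.
    static String decode(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        // "voda" in Cyrillic, written with escapes so the source file's
        // own encoding cannot affect the result.
        String original = "\u0432\u043e\u0434\u0430";
        byte[] raw = original.getBytes(KOI8_R);

        String right = decode(raw, KOI8_R);      // correct guess
        String wrong = decode(raw, ISO_8859_5);  // misidentified encoding

        System.out.println("correct guess round-trips: " + original.equals(right));
        System.out.println("wrong guess round-trips:   " + original.equals(wrong));
    }
}
```

Note that the wrong guess does not throw. Both encodings are single-byte,
so you silently get different Cyrillic letters, which is why this kind of
bug tends to show up as a failed string comparison rather than an
exception.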

As to the bug report, the HTML is tagged as UTF-8, but it looks like 
the text coming from the DB is using one of the legacy Cyrillic 
encodings. So my browser isn't very happy :)
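
One rough way to check for that kind of mismatch is to run the bytes
through a strict UTF-8 decoder and see whether it complains. The sketch
below is only an illustration: the class and method names are mine, and
I am assuming the data really is in a legacy single-byte encoding such
as KOI8-R, which I have not verified against the bug report.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene or the Debian tooling.
public class Utf8Check {
    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of
                    // substituting U+FFFD for bad input.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String word = "\u0432\u043e\u0434\u0430"; // sample Cyrillic text
        byte[] asUtf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] asKoi8 = word.getBytes(Charset.forName("KOI8-R"));

        System.out.println("UTF-8 bytes valid:  " + isValidUtf8(asUtf8));
        System.out.println("KOI8-R bytes valid: " + isValidUtf8(asKoi8));
    }
}
```

Passing this check doesn't prove the text is UTF-8 (plain ASCII passes
too), but failing it is a strong hint that the declared charset is wrong.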

-- Ken

>
>[0] - http://www.kaffe.org
>[1] - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=272295
>[2] - ftp://www.bytemason.org/lucene_reports_2005092001.tar.gz
>[3] -
>http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200509.mbox/%3c72676F9F-45EC-4CF7-890F-3F8702564D05@rectangular.com%3e
>

-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

