lucene-dev mailing list archives

From Steven Rowe <sar...@syr.edu>
Subject Re: UTF-8 and unit test failure for org.apache.analysis.ru.RussianStem in build with Kaffe
Date Thu, 22 Sep 2005 16:05:43 GMT
Barry Hawkins wrote:
> Guys,
>     Hello, it's those pesky Debian Lucene package maintainers again :-).
>  Lucene currently builds and passes all but one unit test against
> Kaffe[0] 1.1.6.  In debugging the failure of the unit test for
> org.apache.analysis.ru.RussianStem, I enabled a build of the JUnit test
> reports.  A detailed account is listed in Debian Bug Report #272295[1],
> but in brief, the 7-character String of Cyrillic expected is matched for
> the first five characters, then an issue occurs and what appears to be a
> few thousand characters are spewed out and the unit test fails.  I have
> a tarball of the unit test reports temporarily stored on my FTP site[2]
> if anyone would care to take a look.
>     Given the recent thread about UTF-8[3], I thought I would present
> this to you guys to see if you might have any insight on the issue.
> Thanks in advance for your time in reading this message.
> 
> [0] - http://www.kaffe.org
> [1] - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=272295
> [2] - ftp://www.bytemason.org/lucene_reports_2005092001.tar.gz

From the HTML failure report for
org.apache.lucene.analysis.ru.TestRussianStem in [2] (spaces added to
align information, and &lt; and &gt; shown as '<' and '>', resp.):

unicode expected:
   < &#1073; &#1077; &#1079; &#1076; &#1086; &#1084;  &#1085;
>

but was:
   < &#1073; &#1077; &#1079; &#1076; &#1086; &#15364; &#15620;
...

Rewritten in hex notation:

unicode expected:
   < &#x431; &#x435; &#x437; &#x434; &#x43E; &#x43C;  &#x43D;
>

but was:
   < &#x431; &#x435; &#x437; &#x434; &#x43E; &#x3C04; &#x3D04;
...

So, it appears to be the case that the last two of the seven characters 
have their byte order reversed: 3C04 versus 043C, and 3D04 versus 043D.
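
A minimal sketch of that check in present-day Java. The class and method
names are made up for illustration and are not taken from Lucene or Kaffe.
It swaps the two bytes of each reported UTF-16 code unit and prints the
result next to the expected value:

```java
public class ByteSwapCheck {
    // Swap the high and low bytes of a single UTF-16 code unit.
    public static char swapBytes(char c) {
        return (char) (((c & 0xFF) << 8) | ((c >> 8) & 0xFF));
    }

    public static void main(String[] args) {
        // Values copied from the failure report above.
        char[] expected = {0x043C, 0x043D};
        char[] actual = {0x3C04, 0x3D04};
        for (int i = 0; i < actual.length; i++) {
            System.out.printf("%04X -> %04X (expected %04X)%n",
                    (int) actual[i], (int) swapBytes(actual[i]), (int) expected[i]);
        }
    }
}
```

If the byte-order reading is right, the swapped value on each line should
equal the expected one.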

The next several characters in the output after the expected seven 
characters are:

  &#12292; &#3328;  &#2560;  &#12548; &#13572; ...

Rewritten as hex:

  &#x3004; &#x0D00; &#x0A00; &#x3104; &#x3504; ...

Byte swapped:

  &#x430;  &#xD;    &#xA;    &#x431;  &#x435;  ...

Transliterated into the Latin alphabet, this is "a\r\nbe", where "\r" 
and "\n" are carriage return and line feed, resp., and the "b" is the 
Cyrillic character that sounds like English "b".
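
The same decoding can be sketched by treating the reported values as
UTF-16 code units, writing them out big-endian, and reading the bytes back
little-endian. This is only an illustration of the byte-order reading; the
class and method names are hypothetical, and it uses
java.nio.charset.StandardCharsets from modern Java rather than anything
Kaffe 1.1.6 itself does:

```java
import java.nio.charset.StandardCharsets;

public class TrailingDataDecode {
    // Re-read a string's UTF-16 code units with the opposite byte order.
    public static String reinterpret(String s) {
        byte[] bigEndian = s.getBytes(StandardCharsets.UTF_16BE);
        return new String(bigEndian, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Decimal values of the first five trailing characters in the report.
        int[] reported = {12292, 3328, 2560, 12548, 13572};
        StringBuilder sb = new StringBuilder();
        for (int codeUnit : reported) {
            sb.append((char) codeUnit);
        }
        // Print each re-read code unit in U+XXXX form.
        for (char c : reinterpret(sb.toString()).toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

None of the five values is a surrogate, so the round trip through bytes
does not lose anything.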

So, the data following the expected output is very likely intelligible 
text that has simply been byte swapped.

I hope this helps -- I haven't got the time to investigate the code to 
connect this evidence to it.

Steve Rowe
