lucene-dev mailing list archives

From "Vanlerberghe, Luc"
Subject RE: failure in the Russian Analyzer in contrib
Date Fri, 11 Feb 2005 16:19:37 GMT
There are no Unicode characters in the Java sources, so that didn't make
any difference...

I'm suspecting Subversion now: the stemsUnicode.txt and wordsUnicode.txt
files are encoded in UTF-16 (they have the proper two-byte byte-order
mark) and have the svn:eol-style property set to native.
On my (Windows :( ) system the files are 904424 and 1101164 bytes long
and are full of "0d 0a 00" byte sequences, which in UTF-16 should
probably just be "0a 00" or "0d 00 0a 00".
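
To illustrate the suspicion, here is a minimal Python sketch (not taken from
Lucene or Subversion) of what happens when a line-ending translation that
knows nothing about the text encoding is applied to little-endian UTF-16
data. The function names and the sample words are made up for the example,
and the byte-level LF to CRLF replacement is only an assumption about how
the native eol-style conversion behaves on Windows.

```python
import codecs


def encode_utf16le_with_bom(text: str) -> bytes:
    """Encode text as little-endian UTF-16, prefixed with the byte-order mark."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def naive_lf_to_crlf(data: bytes) -> bytes:
    """Assumed behaviour: replace every 0a byte with 0d 0a, ignoring the encoding."""
    return data.replace(b"\n", b"\r\n")


def count_suspect_eols(data: bytes) -> int:
    """Count '0d 0a 00' sequences.

    A correct UTF-16LE CRLF is '0d 00 0a 00', so '0d 0a 00' is a heuristic
    sign that a single-byte CR was inserted in front of a two-byte LF.
    """
    return data.count(b"\r\n\x00")


if __name__ == "__main__":
    sample = "стем\nслово\n"  # hypothetical content, two lines
    clean = encode_utf16le_with_bom(sample)
    damaged = naive_lf_to_crlf(clean)

    print("clean  :", clean.hex(" "))
    print("damaged:", damaged.hex(" "))
    print("suspect sequences:", count_suspect_eols(damaged))
    # Every inserted 0d byte shifts the two-byte alignment of what follows,
    # so the text after the first line break no longer decodes correctly.
    print(repr(damaged[2:].decode("utf-16-le", errors="replace")))
```

Running count_suspect_eols over the raw bytes of a checked-out file would
show whether it contains the same pattern. The check is only a heuristic,
since "0d 0a" is also a valid UTF-16LE code unit in its own right.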

On Mac, where the native line ending is already LF, the "0a" bytes won't
be touched by svn.

Is there a way to do a svn update --raw or something that I can check

If this is indeed the problem, a possible fix would be to set
svn:eol-style to LF, or else to let svn know that the file is in Unicode
(perhaps by setting the svn:mime-type property to something other than the


-----Original Message-----
From: Erik Hatcher
Sent: vrijdag 11 februari 2005 16:33
To: Lucene Developers List
Subject: Re: TestCase for KeywordAnalyzer split into

On Feb 11, 2005, at 9:04 AM, Vanlerberghe, Luc wrote:

> Here's the diff for the TestCase 'inline'.
> It should be applied in
> contrib\analyzers\src\test\org\apache\lucene\analysis
> The failure in the Russian Analyzer is unrelated (I updated all 
> sources to HEAD i.e. 153399 to be sure) but you probably need the 
> Russian fonts to see the error: unicode expected:<?????????> but 
> was:<???????????>

My guess is it's a file encoding issue on your system.  The files should
be in UTF-8 encoding.  The build file has a parameter you can

	ant -Dbuild.encoding=utf-8

All is well for me running on Mac OS X with a fresh Subversion checkout.


To unsubscribe, e-mail:
For additional commands, e-mail:
