lucene-dev mailing list archives

From Stefan Wachter <>
Subject Re: encoding of german analyzer source files
Date Fri, 26 Nov 2004 13:24:52 GMT
Murray Altheim wrote:

> Stefan Wachter wrote:
>> Hi Daniel,
>> I am using NetBeans 3.6, which is certainly Unicode aware. Yet 
>> NetBeans does not seem to detect automatically that the Lucene 
>> source files are UTF-8 encoded. I guess it uses the platform's 
>> default encoding, which is ISO-8859-1 on my Linux operating 
>> system.
> On Linux you can set the default encoding at the platform level,
> at the user level, and for individual applications. You're not forced
> to stay within ISO-8859-1. Think about it this way: if that were
> the case, how could a machine on a multi-user system like Linux
> support only one encoding? This sounds more like a NetBeans problem
> than an OS problem. I don't use NetBeans, but there must be a way to
> indicate the encoding beyond your particular user settings.
> Otherwise, English programmers couldn't develop non-English programs,
> which is hard to believe.
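Those layered defaults are what the JVM eventually picks up. As a minimal sketch (assuming Java 5 or later, where Charset.defaultCharset() exists), you can inspect what your own JVM has decided:

```java
import java.nio.charset.Charset;

public class DefaultEncoding {
    public static void main(String[] args) {
        // The encoding the JVM derived from the environment
        // (on Linux, typically from LANG/LC_ALL, e.g. en_US.UTF-8):
        System.out.println(System.getProperty("file.encoding"));
        System.out.println(Charset.defaultCharset().name());
    }
}
```

If NetBeans shows ISO-8859-1 here, it is simply inheriting the locale it was launched under.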

I can tell the NetBeans IDE the encoding of every single source file. 
But the problem is that I might not know which encoding is correct. 
In the case of Lucene it is quite clear, because the encoding is 
mentioned in the build.xml file. But what if someone sends you a 
stemmer class, say for Swahili, and you do not know which encoding 
the author used? Then you can try one encoding after another until 
the Java compiler is satisfied, and even then you cannot be sure 
you used the right one.
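To make the ambiguity concrete, here is a small sketch (using java.nio.charset.StandardCharsets, which postdates this thread): the same two bytes are a perfectly valid "ä" under UTF-8 and a perfectly valid "Ã¤" under ISO-8859-1, so neither the compiler nor any tool can tell from the bytes alone which reading the author intended.

```java
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    public static void main(String[] args) {
        // 0xC3 0xA4 is the UTF-8 encoding of 'ä' (U+00E4) ...
        byte[] bytes = {(byte) 0xC3, (byte) 0xA4};
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        // ... but the same bytes are also legal ISO-8859-1: 'Ã' + '¤'.
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(asUtf8);   // one character
        System.out.println(asLatin1); // two characters
    }
}
```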

>> I think what Java lacks is a means to indicate the encoding of source 
>> files (e.g. <?java encoding="ISO-8859-1"?> in an XMLish way). The 
>> encoding has to be fed into the system from the outside. What else 
>> could be the reason for having an encoding switch on the Java 
>> compiler? Therefore I think it is best for Java source files to 
>> be plain ASCII.
> Java has quite a lot of localization features built into the
> language. Yes, the encoding has to be specified, just as one
> would have to tell any processor how to decode any given set
> of bytes. Java itself is Unicode aware for anything dealing
> with characters. For dealing with byte streams the encoding
> has to be specified. Here's a good article on the subject:
> As for crippling files by forcing them into plain ASCII, why
> would we want to step back 20 years in computer science? It's
> been a long-fought battle to get to where we are now, and the
> desires of a few people to be able to look at a file in ASCII
> are far outweighed by the rest of the world, whose languages
> don't fit into that straitjacket. As was mentioned, it would
> make the code a great deal harder to both read and manage.
> I remember looking at a desktop publishing application
> developed at StoneHand in 1996 that had Arabic, Gujarati,
> Japanese, Chinese, English, and Hebrew on the screen at the
> same time and thinking damn! pretty impressive! We now have
> that kind of thing in our browsers and think little of it.
> I'd hate to step back to pre-1996 again.
> We should all be using Unicode-aware tools. It's what the rest
> of the world is doing, even in the Anglocentric US. For an
> international project like Lucene, there's no good reason to
> step back in time to ASCII. There are many programmers using
> the Lucene source code that have no problem with Unicode, and
> it would not be in their interest to be suddenly reading
> numeric character entities rather than normally-readable text.
> Murray

Of course I also like all the Unicode awareness of Java. In fact I 
wrote a Java-XML data binding, including an XML parser (cf. 
), that benefited greatly from this awareness. In XML there is a 
clearly defined mechanism for determining the file encoding (by 
looking at the first four bytes). In Java, however, there is no such 
mechanism. If I get some sources from somewhere and want to compile 
them, I must know their encoding. If different sources in a project 
use different encodings, I have to be careful to call the compiler 
several times with different encoding switches.
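The XML-style detection mentioned above can be sketched for its byte-order-mark part (the full XML algorithm also inspects the <?xml ... encoding=...?> declaration; the class and method names here are my own, not from any library):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Returns the charset implied by a leading byte-order mark,
    // or null if the file has none (as every Java source file today).
    static Charset fromBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the encoding must be supplied from outside
    }
}
```

For a BOM-less file the method returns null, which is precisely the situation javac is in and why it needs its -encoding switch.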

Therefore it would be great if all Java programmers agreed on the 
same encoding for source files (be it UTF-8, ISO-8859-1, or something 
really exotic). This has nothing to do with display; it is just the 
file encoding. Of course this is not realistic. So why not use plain 
ASCII encoding, amended with the \u escape? Of course you objected 
that the sources would then be less readable. But programming books 
teach me to factor the text parts out of the program code anyway.
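The \u amendment works because the compiler translates Unicode escapes before it even tokenizes the file, so a pure-ASCII source can still produce any character. A minimal sketch:

```java
public class AsciiOnly {
    public static void main(String[] args) {
        // Pure-ASCII source: \u00e4 is translated to 'ä' before
        // lexing, so any encoding that agrees with ASCII on the
        // low 128 code points reads this file correctly.
        String cheese = "K\u00e4se";       // the 4-char string Käse
        System.out.println(cheese.length());
    }
}
```

The JDK even ships a tool, native2ascii, that converts a source file in any supported encoding into exactly this escaped form.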


> ......................................................................
> Murray Altheim          
> Knowledge Media Institute
> The Open University, Milton Keynes, Bucks, MK7 6AA, UK

To unsubscribe, e-mail:
For additional commands, e-mail:
