lucene-dev mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: encoding of german analyzer source files
Date Fri, 26 Nov 2004 23:53:41 GMT

Actually, in NetBeans 4.0, although you can tell the editor the encoding
for each file individually, that doesn't solve your problem.  At least
for standard projects, you can only provide the required compiler switch
(-encoding UTF-8) at the project level.  I ran into this same issue
building Lucene and had to set the compiler encoding to be uniformly
UTF-8 for the entire Lucene source tree, which fortunately works.
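
For anyone driving the compiler from code rather than from the IDE, the
same switch can be passed through the standard javax.tools API.  This is
only a minimal sketch, assuming it runs on a JDK (a plain JRE has no
system compiler); the class name CompileWithEncoding and the throwaway
Umlaut.java file are made up for illustration and are not part of Lucene
or NetBeans.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical helper: compiles one source file with an explicit -encoding.
final class CompileWithEncoding {

    /** Returns the compiler's exit code; 0 means the compile succeeded. */
    static int compile(Path source, String encoding) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler; run on a JDK");
        }
        // Same effect as: javac -encoding <encoding> -d <dir> <source>
        return compiler.run(null, null, null,
                "-encoding", encoding,
                "-d", source.getParent().toString(),
                source.toString());
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny source file containing a non-ASCII character as UTF-8.
        Path dir = Files.createTempDirectory("enc-demo");
        Path src = dir.resolve("Umlaut.java");
        String code = "public class Umlaut { String s = \"\u00fc\"; }";
        Files.write(src, code.getBytes(StandardCharsets.UTF_8));

        System.out.println("exit code: " + compile(src, "UTF-8"));
    }
}
```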

There are a number of ways to work around this.  E.g., one could put a
non-UTF-8 analyzer into a separate, dependent project.

These are of course NetBeans issues and not Lucene issues.  Re. Lucene,
if not automated, there should at least be a readme entry or something
that identifies the necessary file encodings (which could just say that
UTF-8 is the required encoding to compile Lucene).  When I first tried to
build Lucene, because of the default ISO-8859-1 encoding, I got errors
about illegal characters.  I didn't know what encoding was required, or
even for sure that my problem was an encoding issue, so I asked a
question on this list.  It would be better if this were more apparent.
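
One way to at least confirm that a problem is an encoding issue is to
decode the file's bytes strictly and see whether the decoder complains.
The following is a rough sketch using only java.nio.charset; the class
name EncodingCheck and the sample string are my own inventions.  Note
that a clean decode only shows the bytes are legal in that charset, not
that it is the charset the author used (ISO-8859-1 accepts any byte
sequence at all).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper, not part of Lucene.
final class EncodingCheck {

    /** True if the bytes decode in the given charset without any errors. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   // Fail instead of silently substituting U+FFFD.
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "M\u00fcller"; // contains u-umlaut
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Latin-1 bytes are not valid UTF-8, so this reports a problem.
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));
        // UTF-8 bytes read as UTF-8 decode without errors.
        System.out.println(decodesCleanly(utf8, StandardCharsets.UTF_8));
        // Any bytes decode as ISO-8859-1, so success here proves little.
        System.out.println(decodesCleanly(utf8, StandardCharsets.ISO_8859_1));
    }
}
```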

Chuck

  > -----Original Message-----
  > From: Murray Altheim [mailto:m.altheim@open.ac.uk]
  > Sent: Friday, November 26, 2004 2:29 PM
  > To: Lucene Developers List
  > Subject: Re: encoding of german analyzer source files
  > 
  > Andi Vajda wrote:
  > >>I can tell the NetBeans-IDE the encoding of every single source
  > >>file. But the problem is that I might not know which the correct
  > >>encoding is. In case of Lucene it is quite clear because it is
  > >>mentioned in the build.xml file. But what is the situation if
  > >>someone sends you a stemmer class for example for Swahili and you
  > >>do not know in which encoding the author wrote the source. Then
  > >>you can try lots of encodings until the java compiler will be
  > >>satisfied with it. And even then you might not be sure that you
  > >>used the right encoding.
  > >
  > >>Therefore it would be great if all Java programmers would agree on
  > >>the same encoding of source files (let it be UTF-8, ISO-8859-1 or
  > >>something really
  > >
  > > Actually, the reason for the change to utf-8 was that for Lucene
  > > to compile on Windows with gcj (mingw), the encoding better be
  > > utf-8 because of the typical absence of iconv facility there.
  > > Therefore, it would be safe to assume the swahili stemmer source
  > > to also be encoded in utf-8.
  > >
  > > Andi..
  > 
  > Andi,
  > 
  > It may seem pretty safe to assume from practice, but from the Java
  > programmer's point of view, it's still not. It's perfectly possible
  > that the Swahili file be in UTF-8 or UTF-16, little-endian or big-
  > endian, or perhaps some other encoding we don't even know about.
  > A minor point I was trying to make is that absent some external
  > mechanism, there's really *no way* to know the encoding of a file.
  > You can sniff the first few bytes (which is what is recommended
  > in the XML 1.0 spec, you can see how they do it there), but making
  > such an assumption may lead to program failure if the assumption
  > is incorrect.
  > 
  >    Extensible Markup Language (XML) 1.0 (Third Edition)
  >    Appendix F Autodetection of Character Encodings
  >    http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing
  > 
  > The suggestions there are pretty usable for files that have nothing
  > to do with XML.
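
The byte-sniffing idea mentioned above can be sketched roughly as
follows.  This covers only the byte-order-mark cases listed in
Appendix F, not the rest of that appendix, and the class name BomSniffer
and its method names are hypothetical.  A null result means "no BOM
found", which says nothing about the actual encoding.

```java
// Hypothetical helper illustrating BOM-based sniffing only.
final class BomSniffer {

    /** Returns an encoding name if the data starts with a known BOM, else null. */
    static String sniff(byte[] data) {
        // The 4-byte marks must be tested before the 2-byte ones,
        // because FF FE is also a prefix of the UTF-32LE mark.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: the encoding is still unknown
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned byte values.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        byte[] plain = {'a', 'b', 'c'};
        System.out.println(sniff(withBom));
        System.out.println(sniff(plain));
    }
}
```
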
  > 
  > I don't know how many people on this list are familiar with
  > O'Reilly's "CJKV Information Processing" (with the puffer fish on
  > the cover), which opened up my eyes to a new world. After reading
  > it I got a terrible fright and couldn't sleep for weeks.
  > 
  >    "CJKV Information Processing: Chinese, Japanese, Korean
  >       & Vietnamese Computing", by Ken Lunde, O'Reilly Publishing.
  >    http://www.oreilly.com/catalog/cjkvinfo/index.html
  >    http://www.amazon.com/exec/obidos/tg/detail/-/1565922247/002-2766986-0676059?v=glance&vi=reviews
  > 
  > Murray
  > 
  > ......................................................................
  > Murray Altheim                    http://kmi.open.ac.uk/people/murray/
  > Knowledge Media Institute
  > The Open University, Milton Keynes, Bucks, MK7 6AA, UK               .
  > 
  >    [International Committee of the Red Cross director] Kraehenbuhl
  >    pointed out that complying with international humanitarian law
  >    was "an obligation, not an option", for all sides of the conflict.
  >    "If these rules or any other applicable rules of international
  >    humanitarian law are violated, the persons responsible must be
  >    held accountable for their actions," he said. -- BBC News
  >    http://news.bbc.co.uk/1/hi/world/middle_east/4027163.stm
  > 
  >   "In my judgment, this new paradigm [the War on Terror] renders
  >    obsolete Geneva's strict limitations on questioning of enemy
  >    prisoners and renders quaint some of its provisions [...]
  >    Your determination [that the Geneva Conventions] does not apply
  >    would create a reasonable basis in law that [the War Crimes Act]
  >    does not apply, which would provide a solid defense to any future
  >    prosecution." -- Alberto Gonzalez, appointed US Attorney General,
  >    and likely Supreme Court nominee, in a memo to George W. Bush
  >    http://www.adamhodges.com/WORLD/docs/gonzales_memo.pdf
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

