commons-user mailing list archives

From "Oliver Meyn (GBIF)" <>
Subject [digester] utf-8 problems
Date Wed, 22 Dec 2010 08:54:59 GMT
Hi all,

I'm trying to read UTF-8 encoded files and pass them to a Digester for parsing.  The parsed
results subtly break the encoding, so that some characters are not represented properly.
My test case so far has been u umlaut (ü).

I'm confident the file is UTF-8 because I created it like this (so, without a BOM):

      Writer out = new OutputStreamWriter(new FileOutputStream(path), "UTF-8");
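
One way to double-check that claim is to look at the raw bytes the writer produces. The sketch below is only an illustration: the class name Utf8Bytes and its helper methods are made up for this example, and it writes to an in-memory stream instead of a file. For ü, UTF-8 without a BOM should yield exactly the two bytes C3 BC, with no leading EF BB BF.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical helper class, not part of the original code.
class Utf8Bytes {

    /** Encodes text through an OutputStreamWriter set to UTF-8 and returns the raw bytes. */
    static byte[] encode(String text) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer out = new OutputStreamWriter(bytes, "UTF-8");
        out.write(text);
        out.close(); // flushes the encoder into the byte array
        return bytes.toByteArray();
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // U+00FC is u umlaut; prints "C3 BC"
        System.out.println(hex(encode("\u00fc")));
    }
}
```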

My Digester code looks like this:

      FileInputStream fis = new FileInputStream(file);

      Digester digester = new Digester();

      // a bunch of digester rules

      InputSource inputSource = new InputSource(fis);

If I dump the contents of the fis before digesting, using an InputStreamReader set to UTF-8,
I see the umlaut.  The result of the digest is ü.  I've tried all of the parse signatures,
including the InputStream version where I use an InputStreamReader set to UTF-8.  I've also
tried the InputSource method (above) without calling setEncoding.  All cases produce the
same result.
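
A possible way to narrow this down is to take Digester out of the picture and compare code points instead of printed characters, since a console or log with a different default encoding can garble output that is actually correct. The sketch below is an assumption-laden illustration, not the code from this message: it uses the SAX parser bundled with the JDK, a made-up class name (Utf8Check), and an in-memory document instead of the real file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic class, not part of the original code.
class Utf8Check {

    /** Parses the stream with the JDK SAX parser and returns all character data. */
    static String parseText(InputStream in, String encoding) throws Exception {
        final StringBuilder text = new StringBuilder();
        InputSource source = new InputSource(in);
        if (encoding != null) {
            source.setEncoding(encoding); // explicit encoding hint for the parser
        }
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    /** Renders each char as U+XXXX so the result does not depend on console encoding. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<name>\u00fc</name>".getBytes(StandardCharsets.UTF_8);

        // Correctly decoded: prints "U+00FC"
        System.out.println(codePoints(parseText(new ByteArrayInputStream(xml), "UTF-8")));

        // The same UTF-8 bytes misread as ISO-8859-1: prints "U+00C3 U+00BC"
        String misdecoded = new String("\u00fc".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(codePoints(misdecoded));
    }
}
```

If the parsed value comes back as U+00FC, the parser is decoding correctly and the damage is more likely happening where the result is printed or written out. If it comes back as U+00C3 U+00BC, the bytes are being decoded as a single-byte charset somewhere before or during parsing.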

I suspect this may be a Xerces problem, but maybe it's Digester, and ideally it's me being
dumb in some way.  Any and all help is appreciated.

Oliver Meyn
Software Developer
Global Biodiversity Information Facility (GBIF)
+45 35 32 15 12
