commons-user mailing list archives

From "Oliver Meyn (GBIF)" <om...@gbif.org>
Subject [digester] utf-8 problems
Date Wed, 22 Dec 2010 08:54:59 GMT
Hi all,

I'm trying to read UTF-8 encoded files and pass them to a Digester for parsing.  The parsed
results subtly break the encoding, so that some characters are not represented properly.
My test case so far has been u-umlaut (ü).

I'm confident the file is UTF-8 because I created it like this (so, without a BOM):

      Writer out = new OutputStreamWriter(new FileOutputStream(path), "UTF-8");
      out.write(longXmlStringContainingUmlaut);
      out.close();
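
For reference, a minimal sketch of how the "UTF-8, no BOM" claim can be checked at the byte level rather than by eye. It assumes Java 7 or later; the class name Utf8WriteCheck, the helper names, and the tiny sample document are hypothetical and only stand in for the real file and XML string.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original code.
class Utf8WriteCheck {

    // Writes the string as UTF-8 and returns the raw bytes that ended up on disk.
    static byte[] writeAndReadBack(Path path, String xml) throws IOException {
        try (Writer out = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            out.write(xml);
        }
        return Files.readAllBytes(path);
    }

    // True if the bytes begin with the UTF-8 byte order mark EF BB BF.
    static boolean startsWithBom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("umlaut", ".xml");
        // Sample document; \u00FC is u-umlaut.
        byte[] bytes = writeAndReadBack(tmp, "<a>\u00FC</a>");

        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        // In UTF-8 the umlaut is the two-byte sequence C3 BC.
        System.out.println(hex.toString().trim()); // 3C 61 3E C3 BC 3C 2F 61 3E
        System.out.println("BOM present: " + startsWithBom(bytes));
        Files.delete(tmp);
    }
}
```

If the hex dump shows C3 BC for the umlaut and no leading EF BB BF, the file on disk is as described and the problem lies further downstream.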

My Digester code looks like this:

      FileInputStream fis = new FileInputStream(file);

      Digester digester = new Digester();
      digester.setNamespaceAware(true);
      digester.setValidating(false);
      digester.push(targetObject);

      // a bunch of digester rules

      InputSource inputSource = new InputSource(fis);
      inputSource.setEncoding("UTF-8");
      digester.parse(inputSource);

If I dump the contents of the fis before digesting, using an InputStreamReader set to UTF-8,
I see the umlaut.  The result of the digest is ü.  I've tried all of the parse signatures,
including the inputStream version where I use an InputStreamReader set to UTF-8.  I've also
tried the InputSource method (above) without calling setEncoding.  All cases produce the
same result.
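
Since printing a string to a console can itself garble it, one way to compare "before" and "after" is to dump code points instead of characters. The sketch below assumes Java 8 or later; MojibakeCheck and its methods are hypothetical names. It also shows what a UTF-8 umlaut looks like if its bytes are wrongly decoded as ISO-8859-1, which is one common failure mode and not necessarily the one at work here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper, not part of the original code.
class MojibakeCheck {

    // Renders a string as a list of code points, e.g. "U+00FC",
    // so the result does not depend on the console's encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Simulates the classic mistake: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecodeAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String umlaut = "\u00FC";
        System.out.println(codePoints(umlaut));                    // U+00FC
        System.out.println(codePoints(misdecodeAsLatin1(umlaut))); // U+00C3 U+00BC
    }
}
```

Applying codePoints to the value that Digester sets on the target object would show whether it holds a single U+00FC (so only the display is wrong) or two characters such as U+00C3 U+00BC (so the bytes were decoded with the wrong charset somewhere).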

I suspect this may be a Xerces problem, but maybe it's Digester, and ideally it's me being
dumb in some way.  Any and all help is appreciated.
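
One way to separate the parser from Digester is to run the same bytes through a plain JAXP SAX parse with no Digester involved. This is only a sketch: PlainSaxCheck and parseText are hypothetical names, the in-memory sample document stands in for the real file, and the parser used is whatever SAXParserFactory picks up on the classpath (which may or may not be the same Xerces that Digester ends up using).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical isolation test, not part of the original code.
class PlainSaxCheck {

    // Parses the stream with a bare SAX parser and returns all character data.
    static String parseText(InputStream in) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);

        StringBuilder text = new StringBuilder();
        InputSource source = new InputSource(in);
        source.setEncoding("UTF-8");

        factory.newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so accumulate.
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample document encoded as UTF-8; \u00FC is u-umlaut.
        byte[] xml = "<name>M\u00FCller</name>".getBytes(StandardCharsets.UTF_8);
        String parsed = parseText(new ByteArrayInputStream(xml));
        System.out.println("round trip intact: " + parsed.equals("M\u00FCller"));
    }
}
```

If the plain parse returns the umlaut intact for the real file, the parser is probably not at fault and the rules or the code that prints the result deserve a closer look; if it does not, the issue is below Digester.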

Thanks,
Oliver
--
Oliver Meyn
Software Developer
Global Biodiversity Information Facility (GBIF)
+45 35 32 15 12
http://www.gbif.org

