commons-user mailing list archives

From "Oliver Meyn (GBIF)" <om...@gbif.org>
Subject Re: [digester] utf-8 problems
Date Wed, 22 Dec 2010 11:52:48 GMT
Bah - please disregard.  Of course the answer was me being dumb.  Turns out there are two passes
through digesters, the second one taking input from the first.  The second pass wasn't setting
the encoding on its InputStream.

Apologies for the distraction,
Oliver
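
For anyone who lands here with the same symptom, here is a minimal sketch of the failure mode described above. It assumes the first pass hands UTF-8 bytes to the second pass, and that the second pass decodes them with some other charset (ISO-8859-1 stands in for a non-UTF-8 platform default). The class name TwoPassEncoding and the helper parseText are hypothetical, and the plain JDK SAX parser is used in place of Digester so the example has no external dependencies. This is an illustration, not the actual code from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TwoPassEncoding {

    /** Decodes the stream with the given charset, parses it, and returns the document's text content. */
    static String parseText(InputStream in, Charset charset) throws Exception {
        final StringBuilder text = new StringBuilder();
        // The parser trusts the characters a Reader gives it, so the charset chosen here decides the result.
        Reader reader = new InputStreamReader(in, charset);
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(reader), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the output of the first pass: UTF-8 bytes containing a u umlaut (U+00FC).
        byte[] firstPassOutput = "<name>M\u00fcller</name>".getBytes(StandardCharsets.UTF_8);

        // Second pass with the wrong charset: the two UTF-8 bytes of the umlaut become two characters.
        String wrong = parseText(new ByteArrayInputStream(firstPassOutput), StandardCharsets.ISO_8859_1);
        // Second pass with the charset set explicitly: the umlaut survives.
        String right = parseText(new ByteArrayInputStream(firstPassOutput), StandardCharsets.UTF_8);

        System.out.println(wrong.equals("M\u00c3\u00bcller")); // true
        System.out.println(right.equals("M\u00fcller"));       // true
    }
}
```

The string literals use \u escapes so that the result does not depend on the encoding of the source file or the console.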

On 2010-12-22, at 9:54 AM, Oliver Meyn (GBIF) wrote:

> Hi all,
> 
> I'm trying to read UTF-8 encoded files and pass those to a digester for parsing.  The
> parsed results break the encoding (subtly) such that some characters are not being represented
> properly.  My test case so far has been u umlaut, so ü.
> 
> I'm confident the file is UTF-8 because I created it like this (so, without BOM):
> 
>      Writer out = new OutputStreamWriter(new FileOutputStream(path), "UTF-8");
>      out.write(longXmlStringContainingUmlaut);
>      out.close();
> 
> My Digester code looks like this:
> 
>      FileInputStream fis = new FileInputStream(file);
> 
>      Digester digester = new Digester();
>      digester.setNamespaceAware(true);
>      digester.setValidating(false);
>      digester.push(targetObject);
> 
>      // a bunch of digester rules
> 
>      InputSource inputSource = new InputSource(fis);
>      inputSource.setEncoding("UTF-8");
>      digester.parse(inputSource);
> 
> If I dump the contents of the fis before digesting using an InputStreamReader set to
> UTF-8 I see the umlaut.  The result of the digest is a garbled ü.  I've tried all of the parse signatures,
> including the inputStream version where I use an InputStreamReader set to UTF-8.  I've also
> tried the InputSource method (above) without using the setEncoding.  All cases produce the
> same result.
> 
> I suspect this may be a Xerces problem, but maybe it's Digester, and ideally it's me
> being dumb in some way.  Any and all help is appreciated.
> 
> Thanks,
> Oliver
> --
> Oliver Meyn
> Software Developer
> Global Biodiversity Information Facility (GBIF)
> +45 35 32 15 12
> http://www.gbif.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
> For additional commands, e-mail: user-help@commons.apache.org
> 
> 