abdera-user mailing list archives

From James M Snell <jasn...@gmail.com>
Subject Re: more fun with character encodings
Date Sat, 08 Sep 2007 18:23:08 GMT
I just wanted to follow up on this.  If you think the title value is
being converted to MacRoman at some point, try forcing it back to UTF-8
just prior to setting the value in the abdera objects, e.g.

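import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;

public class Misc {

  public static void main(String... args) throws Exception {

    // The escape keeps the literal independent of the source file's encoding.
    String t = "\u00EB";   // ë

    // Simulate the bug: UTF-8 bytes misread as MacRoman.
    String s = convert(t, "UTF-8","MacRoman");

    Writer w = new OutputStreamWriter(System.out,"UTF-8");

    w.write(s);     // wrong
    w.write("\n");
    w.write(convert(s,"MacRoman","UTF-8"));   // correct

    w.flush();

  }

  // Encodes the string to bytes using "from", then decodes those
  // bytes using "to".
  public static String convert(
    String string,
    String from,
    String to)
      throws UnsupportedEncodingException {
    return new String(string.getBytes(from),to);
  }
}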

If you're converting from the default charset to UTF-8, and you're not
sure what the default charset is, use
java.nio.charset.Charset.defaultCharset().name() to get the name of the
default charset at runtime.
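
As a rough sketch of that check (not anything from Abdera itself), the
snippet below prints the default charset and then tries the same
round trip against it.  The class name DefaultCharsetCheck and the
redecode helper are made up for illustration.  It assumes the default
charset can map every byte it is handed; with something like US-ASCII
the damaged string loses information and cannot be repaired this way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {

  // Hypothetical helper: undo a decode that used the wrong charset by
  // re-encoding with that charset and decoding with the right one.
  static String redecode(String garbled, Charset wrongCharset, Charset actualCharset) {
    return new String(garbled.getBytes(wrongCharset), actualCharset);
  }

  public static void main(String[] args) {
    // The charset the JVM uses whenever none is given explicitly.
    Charset platform = Charset.defaultCharset();
    System.out.println("default charset: " + platform.name());

    String original = "\u00EB";   // ë

    // Simulate the suspected bug: UTF-8 bytes decoded with the default charset.
    String garbled = new String(original.getBytes(StandardCharsets.UTF_8), platform);

    // Attempt the repair and report whether the round trip was lossless.
    String repaired = redecode(garbled, platform, StandardCharsets.UTF_8);
    System.out.println("repaired matches original: " + repaired.equals(original));
  }
}
```

If the last line reports false, the default charset is dropping bytes
and the value needs to be fixed where it is first read, not afterwards.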

If this doesn't work for you, then we definitely still have a problem :-)

- James

Brian Moseley wrote:
> i'm running into a similar issue as was discussed earlier this week
> with regard to problem data.
> 
> as was mentioned earlier, it turns out that the os x native character
> encoding is MacRoman. well, it appears that even though both my mysql
> database and my jdbc connection are configured to use utf8, at some
> point the data taken from the db and inserted into an atom feed is
> turning up in MacRoman, even though the ResponseContext's content type
> is set to "application/atom+xml; charset=UTF-8".
> 
> from my re-reading of the various recent threads and my examining of
> the code in the 0.3.0 branch, it seems like the value i set for an
> entry's title (for instance) should be converted into utf8 while the
> entry is being serialized. but it's clearly not. when i look at the
> feed as it's fetched from my server by curl, in Terminal.app, the
> non-ascii character in the entry title is rendered using what i like
> to call the "wtf" glyph rather than the one that represents the actual
> character in question. and when i run the feed through the
> validome.org validator, it complains about this character being an
> invalid utf8 character.
> 
> when i run the server and database on linux and get a non-ascii
> character into the database, viewing the corresponding entry document
> in Terminal.app shows me the expected character, not the wtf one.
> 
> i've run through all of my code looking for places where we might be
> instantiating a Reader without specifying an encoding, but i can't
> find any. i'm using the 0.3.0-incubating jars that i deployed earlier
> today into the people.apache.org/m2-incubating-repository which
> contain the recent default encoding fixes. so i'm at a loss as to what
> could be going on. i feel like i'm missing something basic with regard
> to character encodings. any pointers?
> 
> for reference, here's a url for the entry document as served by os x.
> notice the final character of the title and summary are both the wtf
> character.
> 
> http://bcm.osafoundation.org:8080/chandler/atom/item/a05e2870-5cce-11dc-f4b0-84f152603f14?ticket=fnwrt8htw1
> 
> and here is what happens when i plug that url into validome's atom validator:
> 
> http://www.validome.org/rss-atom/validate?lang=en&url=http://bcm.osafoundation.org:8080/chandler/atom/item/a05e2870-5cce-11dc-f4b0-84f152603f14%3fticket=fnwrt8htw1&version=atom_1_0
> 
> thanks!
> 
