cocoon-users mailing list archives

From gelo1234 <gelo1...@gmail.com>
Subject Re: [2.1] Overzealous escaping of high Unicode code points
Date Thu, 08 Jun 2017 18:11:03 GMT
It depends on what type of Serializer you use and what kind of Serializer
config you put into your sitemap.

By default XMLSerializer/HTMLSerializer uses UTF-8 encoding. So instead of
one UTF-16 char you get two chars, UTF-8 encoded.
Of course there might also be an issue with the emoji charset, but I would
first try changing the encoding in the Serializer config (to UTF-16).
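
A quick way to check what each encoding does with this code point is to count
the units directly. The following is only a minimal sketch in plain Java, not
Cocoon code; the class and method names are made up for illustration. It shows
that U+1F60D is one code point, two UTF-16 code units (a surrogate pair) and
four bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of Cocoon).
class EmojiEncodingSizes {

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of UTF-16 code units, i.e. Java chars.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of bytes when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1F60D, the emoji from the original post.
        String emoji = new String(Character.toChars(0x1F60D));
        System.out.println("code points:  " + codePoints(emoji));
        System.out.println("UTF-16 units: " + utf16Units(emoji));
        System.out.println("UTF-8 bytes:  " + utf8Bytes(emoji));
    }
}
```

If the counts look right but the output still contains two numeric entities,
the splitting is probably happening in the escaping step rather than in the
output encoding itself.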

Greetings,
-Greg

2017-06-07 10:43 GMT+02:00 Flynn, Peter <pflynn@ucc.ie>:

> I had a related problem with 3–4 CJK characters being converted to their
> &#hex; format. Very weird, but it turned out to be the old and buggy copy
> of jtidy, and I can't figure out how to replace it.
>
> I haven't had the problem you describe, though, and I have a user who has
> implemented emoji in Cocoon, see http://research.ucc.ie/emojis/
>
> P
>
> --
> Peter Flynn | Academic and Collaborative Technologies | IT Services |
> University College Cork | Ireland | pflynn@ucc.ie |
> http://research.ucc.ie/profiles/H505/pflynn | Sent from Hiri
> <https://www.hiri.com/>
>
>
> On 2017-06-06 17:08:51+01:00 Christopher Schultz wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> All,
>
> I've been testing my application for use with high Unicode code points
> such as emoji like 😍 which is this one: http://www.fileformat.info/info/unicode/char/1F60D/index.htm
>
> My application and database can handle this code point, but Cocoon
> butchers it in a way that I have seen before -- the way that
> commons-lang's StringEscapeUtils.escapeXml/escapeHtml seems to do.
>
> Instead of letting the character through as-is, it tries to convert it
> into these two numbered entities:
>
> &#55357;&#56845;
>
> Oddly enough, those are the two UTF-16 surrogate code units you'd
> get, but they shouldn't be split up like that, I don't think.
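
To illustrate the splitting described above, here is a rough sketch in plain
Java. It is not the actual commons-lang or Cocoon code, and both method names
are invented for the example. It only shows how escaping one Java char at a
time turns a surrogate pair into two entities, while walking the string by
code point yields a single entity for U+1F60D.

```java
// Hypothetical demo class, for illustration only.
class SurrogateEscapeDemo {

    // Escapes each non-ASCII UTF-16 code unit on its own, so a surrogate
    // pair ends up as two separate numeric entities.
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Walks the string by code point, so a surrogate pair is escaped as
    // one numeric entity for the full code point.
    static String escapePerCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F60D));
        System.out.println(escapePerChar("a" + emoji));      // a&#55357;&#56845;
        System.out.println(escapePerCodePoint("a" + emoji)); // a&#128525;
    }
}
```

The two values in the first output are 0xD83D and 0xDE0D in decimal, the high
and low surrogates for U+1F60D.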
>
> I haven't found a version of commons-lang 2.x that doesn't break these
> kinds of characters. commons-lang3 does the right thing, but they are
> incompatible libraries.
>
> Does anyone know the code well enough to know how difficult it would
> be to change the way Cocoon 2.1 escapes its output? For example, by
> using commons-lang3?
>
> I haven't tried Cocoon 2.2, yet, and I can't tell what dependencies it
> has. I also can't exactly tell what to do now that I've downloaded the
> binary package. Can this just be used as a drop-in replacement for
> Cocoon 2.1.x? Cocoon 2.1.x could build a WAR file that I then
> customized for my own application, adding various libraries and
> configuration files to it. I think I'll follow up with a separate post
> about this.
>
> - -chris
>
> -----BEGIN PGP SIGNATURE-----
> Comment: GPGTools - http://gpgtools.org
>
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iQIcBAEBCAAGBQJZNtOBAAoJEBzwKT+lPKRYEuIP/3gSJZDNEbzsHkI5zYjMZbFf
> vKvRRnBSl+6IdrcUasftf+AkXIIYwj6xnUQ7winsLW/n8TdDG6jPqsg4Khsozc6z
> aa23qDly62gmCsqpLohXxt/ZNKdPY4sOTghaaEUFTtTgpeD3M/INF90myT8SwO4K
> WUtqVparSqp/Zf9JMm3OCIguMKbsRNYWVIQuiJxDQJkWYwrw0iVk2v8mc6iz/mDF
> w6np4EvFr9fqdDufKpPw8anEkrp5JEuTx47vMOtz4sixVr2C6ehgP4zs3kVzdVid
> QPeUsrosV1tsRC9bMVLGmjo7UhNseeXCp/AceIT6AQE8Q1clgy9GcoNMf60dgGku
> et0xoGptYgbCfmJL+PuA9y7fJYjgTTQheqzuC721n2/sx+kyBSBWSMIhqia2sd4y
> spcT4kw+uChsWjwoeGOHOm4IimrVgXkfJeHVSXV4m66sHS9t+bDiiErwS1SikvSV
> qF64/L0u8hYFLD1ehURoHBi4foE1Td3eRGOGHgodcYL9C8U+Yv+fWaiYQ5O4CCnW
> pToFvVoQOdZY+VVC8hz1ggbRMSxjT2GQLLJ2mjbGzGUJjlwyQaoZnADSSu0efj88
> O2AlWB2Bf/Ag6E4C9jEjj+cauBfR+1NIK7F1Jo6C02yY1SUOSoOAFDZ7EkO4qYAO
> YhvgSQXNmKps6rusNjNZ
> =q8Eh
> -----END PGP SIGNATURE-----
>
