cocoon-users mailing list archives

From gelo1234 <gelo1...@gmail.com>
Subject Re: [2.1] Overzealous escaping of high Unicode code points
Date Thu, 08 Jun 2017 18:17:11 GMT
Chris,

Even with C3 (Cocoon 3.0 beta), unless you specify the optional encoding in your
Serializer config, you fall back to the default UTF-8:

org.apache.cocoon.optional.servlet.components.sax.serializers.util

public class ConfigurationUtils {

    private ConfigurationUtils() {
    }

    public static String getEncoding(Map<String, ? extends Object> configuration) {
        String encoding = (String) configuration.get("encoding");

        if (encoding == null || "".equals(encoding)) {
            encoding = "UTF-8";
        }

        return encoding;
    }
...
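
As a rough illustration of that fallback, here is a stand-alone sketch. It is
not the actual Cocoon class: the class name EncodingFallbackDemo and the main
method are made up for the example, and only the body of getEncoding mirrors
the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; only getEncoding's body mirrors the quoted snippet.
final class EncodingFallbackDemo {

    // Returns the configured encoding, or "UTF-8" when it is missing or empty.
    static String getEncoding(Map<String, ?> configuration) {
        String encoding = (String) configuration.get("encoding");
        if (encoding == null || "".equals(encoding)) {
            encoding = "UTF-8";
        }
        return encoding;
    }

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        System.out.println(getEncoding(config)); // no "encoding" key: UTF-8

        config.put("encoding", "");
        System.out.println(getEncoding(config)); // empty value: UTF-8

        config.put("encoding", "UTF-16");
        System.out.println(getEncoding(config)); // explicit value: UTF-16
    }
}
```

So the serializer only leaves UTF-8 when the configuration carries a non-empty
"encoding" entry.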

Greetings,
Greg


2017-06-08 20:11 GMT+02:00 gelo1234 <gelo1234@gmail.com>:

>
> It depends on what type of Serializer you use and what kind of Serializer
> config you put into your sitemap.
>
> By default, XMLSerializer/HTMLSerializer uses UTF-8 encoding. So instead of
> 1 UTF-16 char you got 2 chars UTF-8 encoded.
> Of course there might also be an issue with the emoji charset, but I would first
> try to change the encoding in the Serializer config (to UTF-16).
>
> Greetings,
> -Greg
>
> 2017-06-07 10:43 GMT+02:00 Flynn, Peter <pflynn@ucc.ie>:
>
>> I had a related problem with 3–4 CJK characters being converted to their
>> &#hex; format. Very weird, but it turned out to be the old and buggy copy
>> of jtidy, and I can't figure out how to replace it.
>>
>> I haven't had the problem you describe, though, and I have a user who has
>> implemented emoji in Cocoon, see http://research.ucc.ie/emojis/
>>
>> P
>>
>> --
>> Peter Flynn | Academic and Collaborative Technologies | IT Services |
>> University College Cork | Ireland | pflynn@ucc.ie |
>> http://research.ucc.ie/profiles/H505/pflynn | Sent from Hiri
>> <https://www.hiri.com/>
>>
>>
>> On 2017-06-06 17:08:51+01:00 Christopher Schultz wrote:
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> All,
>>
>> I've been testing my application for use with high Unicode code points
>> such as emoji like 😍 which is this one: http://www.fileformat.info/info/unicode/char/1F60D/index.htm
>>
>> My application and database can handle this code point, but Cocoon
>> butchers it in a way that I have seen before -- the way that
>> commons-lang's StringEscapeUtils.escapeXml/escapeHtml seems to do.
>>
>> Instead of letting the character through as-is, it tries to convert it
>> into these two numbered entities:
>>
>> &#55357;&#56845;
>>
>> Oddly enough, those are the two double-byte UTF-16 characters you'd
>> get, but they shouldn't be split up like that, I don't think.
>>
>> I haven't found a version of commons-lang 2.x that doesn't break these
>> kinds of characters. commons-lang3 does the right thing, but they are
>> incompatible libraries.
>>
>> Does anyone know the code well enough to know how difficult it would
>> be to change the way Cocoon 2.1 escapes its output? For example, by
>> using commons-lang3?
>>
>> I haven't tried Cocoon 2.2, yet, and I can't tell what dependencies it
>> has. I also can't exactly tell what to do now that I've downloaded the
>> binary package. Can this just be used as a drop-in replacement for
>> Cocoon 2.1.x? Cocoon 2.1.x could build a WAR file that I then
>> customized for my own application, adding various libraries and
>> configuration files to it. I think I'll follow up with a separate post
>> about this.
>>
>> - -chris
>>
>> -----BEGIN PGP SIGNATURE-----
>> Comment: GPGTools - http://gpgtools.org
>>
>> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>>
>> iQIcBAEBCAAGBQJZNtOBAAoJEBzwKT+lPKRYEuIP/3gSJZDNEbzsHkI5zYjMZbFf
>> vKvRRnBSl+6IdrcUasftf+AkXIIYwj6xnUQ7winsLW/n8TdDG6jPqsg4Khsozc6z
>> aa23qDly62gmCsqpLohXxt/ZNKdPY4sOTghaaEUFTtTgpeD3M/INF90myT8SwO4K
>> WUtqVparSqp/Zf9JMm3OCIguMKbsRNYWVIQuiJxDQJkWYwrw0iVk2v8mc6iz/mDF
>> w6np4EvFr9fqdDufKpPw8anEkrp5JEuTx47vMOtz4sixVr2C6ehgP4zs3kVzdVid
>> QPeUsrosV1tsRC9bMVLGmjo7UhNseeXCp/AceIT6AQE8Q1clgy9GcoNMf60dgGku
>> et0xoGptYgbCfmJL+PuA9y7fJYjgTTQheqzuC721n2/sx+kyBSBWSMIhqia2sd4y
>> spcT4kw+uChsWjwoeGOHOm4IimrVgXkfJeHVSXV4m66sHS9t+bDiiErwS1SikvSV
>> qF64/L0u8hYFLD1ehURoHBi4foE1Td3eRGOGHgodcYL9C8U+Yv+fWaiYQ5O4CCnW
>> pToFvVoQOdZY+VVC8hz1ggbRMSxjT2GQLLJ2mjbGzGUJjlwyQaoZnADSSu0efj88
>> O2AlWB2Bf/Ag6E4C9jEjj+cauBfR+1NIK7F1Jo6C02yY1SUOSoOAFDZ7EkO4qYAO
>> YhvgSQXNmKps6rusNjNZ
>> =q8Eh
>> -----END PGP SIGNATURE-----
>>
>>
>>
>
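
To make the splitting described in the quoted message concrete, here is a
minimal sketch of the difference between escaping per UTF-16 code unit and
escaping per code point. This is not the code from Cocoon or commons-lang; the
class and method names are hypothetical, and the decimal entity format and the
"escape everything above 0x7F" rule are assumptions made for the example.

```java
// Hypothetical demo, not taken from Cocoon or commons-lang.
final class SurrogateEscapeDemo {

    // Escapes one UTF-16 code unit (char) at a time, so a character outside
    // the Basic Multilingual Plane comes out as two separate entities.
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Escapes one Unicode code point at a time, so a surrogate pair is
    // combined first and comes out as a single entity.
    static String escapePerCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F60D));
        System.out.println(escapePerChar(emoji));      // &#55357;&#56845;
        System.out.println(escapePerCodePoint(emoji)); // &#128525;
    }
}
```

The first form emits the high and low surrogate (0xD83D and 0xDE0D) as if they
were characters in their own right; the second emits the single code point
U+1F60D.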
