qpid-users mailing list archives

From Rafael Schloming <...@alum.mit.edu>
Subject Re: UTF8 / binary strings in dynamic languages
Date Wed, 21 Aug 2013 17:49:52 GMT
That's the end point with python 3, but that doesn't work with python 2.x.

According to Guido, the way python 2.5 users should express their intent is
to write data using unadorned quotes, e.g. 'data', and text using unicode
quotes, e.g. u'text'. In python 2.6, we are supposed to express our intent
with b'data' or the bytes constructor, with full knowledge that both of
these are just aliases for 'data' and str. The python 2 to 3 conversion
scripts will convert code correctly given these conventions.

I believe this means that we have no choice but to map the str type to
binary for python 2.x because if we map it to unicode then people using the
recommended practices will get completely the opposite of the behaviour
they expect, and their code won't upgrade to python 3.x without manual
intervention.

In other words, we need to ensure the following lines of code work
correctly and consistently across all python versions given appropriate
application of the 2 to 3 conversion script:

    # the 2 - 3 conversion script will automatically apply the b prefix here
    message.properties[u"string-key"] = "binary value"
    message.properties[u"string-key"] = b"binary value"

--Rafael



On Wed, Aug 21, 2013 at 12:14 PM, Justin Ross <justin.ross@gmail.com> wrote:

> I'm missing something about this.  The python 2-3 migration plan is to
> treat a value expressed with 'str' as unambiguously textual, and a
> value expressed with 'bytes' as unambiguously data.  Doesn't that line
> up with this proposal?
>
> On Wed, Aug 21, 2013 at 11:49 AM, Rafael Schloming <rhs@alum.mit.edu> wrote:
> > I think for python at least if we were to treat ambiguous string values as
> > text rather than data, we would be at odds with the python community's 2->3
> > migration plan. The following thread has a useful discussion of this that
> > is worth a careful read:
> >
> > http://stackoverflow.com/questions/1736228/python-data-vs-text/1736279#1736279
> >
> > --Rafael
> >
> >
> >
> > On Wed, Aug 21, 2013 at 11:31 AM, Justin Ross <justin.ross@gmail.com> wrote:
> >
> >> Jimmy, thanks for getting this started.  I'd love your feedback to
> >> help sort this out.
> >>
> >> I think these are the cases:
> >>
> >> 1. If the language string is unambiguously textual, send it as amqp str16
> >> 2. If the language string is unambiguously arbitrary bytes, send it as
> >> amqp vbin
> >>
> >> These are easy.  We can tell the user's intention, and we can do the
> >> right thing.
> >>
> >> 3. If the language string is an overloaded text/bytes type, as is
> >> regrettably quite common, what do we do then?
> >>
> >> The current answer to this question is "send it as vbin".  That's very
> >> safe, insofar as it won't throw any sort of encoding exception.  It
> >> does not, however, always honor what I think is the user's more
> >> typical intention: produce an ascii string at the other end.
> >>
> >> So for 3, I'd like to consider the possibility of, by default, sending
> >> ambiguous language strings as ascii rendered to amqp str16.  This
> >> requires an encoding step that may produce errors.  And maybe that's
> >> just too obnoxious!  That's what I'd like to know.
> >>
> >> In summary, if we have a way to determine what the user wanted (text
> >> or bytes), we should try to carry that through on the wire.  At the
> >> following URL I've tried to map out what type information we can get
> >> for each language.  Please update it as you please.
> >>
> >>
> >> https://cwiki.apache.org/confluence/display/qpid/Language+support+for+unambiguous+text+string+and+byte+array+types
> >>
> >> On Wed, Aug 21, 2013 at 8:44 AM, Jimmy Jones <jimmyjones2@gmx.co.uk> wrote:
> >> >> > AFAIK in perl, if you include unicode characters in a string it'll
> >> >> > set the utf8 flag. If you don't include any unicode characters (e.g. 7
> >> >> > bit ascii, or raw bytes) the flag won't be set. So given a perl
> >> >> > scalar that doesn't contain any utf8 characters, you don't know if
> >> >> > it's a textual string (str16) or a binary string (vbin). There is an
> >> >> > is_utf8_string function, but that'll only tell you if the string
> >> >> > would be valid utf8, and it could be a binary string that happens to
> >> >> > be valid utf8, so that's not really safe.
> >> >>
> >> >> You can explicitly mark it as utf8 using utf8::upgrade() though, right?
> >> >> Certainly I tried that in a simple test and the property in question was
> >> >> then sent as str16.
> >> >
> >> > Yes, if I as a user had a string that was textual, I could call
> >> > utf8::upgrade() to ensure it got sent as str16. I guess this is similar in
> >> > concept to calling setEncoding in C++, although maybe less natural in a
> >> > dynamically typed language.
> >>
> >> It would be more reasonable to treat perl scalars as textual for our
> >> API if perl offered a good way to explicitly handle byte arrays.  My
> >> (certainly insufficient) web browsing suggested that wasn't really
> >> available, or not in a form recommended for use.  Any candidates for a
> >> serviceable explicitly-arbitrary-bytes-and-not-text-at-all "type" in
> >> perl?
> >>
> >> Thanks!
> >> Justin
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@qpid.apache.org
> >> For additional commands, e-mail: dev-help@qpid.apache.org

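For reference, the two options discussed for case 3 in Justin's quoted message (send an ambiguous string as vbin, or render it as ascii and send it as str16) can be sketched as follows. This is only an illustration under stated assumptions: `classify_ambiguous` and its `text_by_default` flag are hypothetical names, and a `bytes` value stands in for an overloaded text/bytes string such as the python 2.x `str`.

```python
def classify_ambiguous(raw, text_by_default=True):
    """Hypothetical helper for an overloaded text/bytes value.

    Returns a (wire_type, value) pair. With text_by_default=False the
    value is passed through untouched as vbin, which cannot fail. With
    text_by_default=True the value is decoded as ascii and sent as
    str16, which raises UnicodeDecodeError for any byte above 0x7f.
    """
    if not text_by_default:
        return ("vbin", raw)
    # This is the encoding step that may produce errors.
    return ("str16", raw.decode("ascii"))
```

The trade-off is visible in the signature: the vbin path never raises but loses the "this is text" intent, while the str16 path preserves it at the cost of an exception on non-ascii input.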