commons-dev mailing list archives

From Chas Honton <c...@honton.org>
Subject Re: [text] On the value of idempotent string escape methods?
Date Wed, 22 Feb 2017 03:50:24 GMT
Not sufficiently useful to include in commons. 

Chas

> On Feb 21, 2017, at 1:31 PM, Bhowmik, Bindul <bindulbhowmik@gmail.com> wrote:
> 
>> On Tue, Feb 21, 2017 at 7:55 AM, sebb <sebbaz@gmail.com> wrote:
>>> On 21 February 2017 at 12:40, Rob Tompkins <chtompki@apache.org> wrote:
>>> 
>>>> On Feb 21, 2017, at 6:02 AM, sebb <sebbaz@gmail.com> wrote:
>>>> 
>>>> On 21 February 2017 at 04:40, Sampanna Kahu <sampyash@gmail.com> wrote:
>>>>> Hi Guys,
>>>>> Very good points are being made above. Please allow me to add my two cents
>>>>> :-)
>>>>> 
>>>>> What if the string contains syntactically valid HTML characters/tags and
>>>>> our aim is to prevent rendering these tags in the browser when this string
>>>>> is being served via a web application? Or prevent the execution of harmful
>>>>> embedded scripts when serving it? The 'escapeOnce' method could be useful
>>>>> here, right?
>>>> 
>>>> I don't think so.
>>>> 
>>>>> To explain better, let's consider an example of the specific use-case that
>>>>> I had in mind when building the 'escapeOnce' method:
>>>>> Consider the scenario of a simple RESTful web application where users can
>>>>> manipulate their text using simple CRUD operations. Let's assume that we do
>>>>> not have the 'escapeOnce' method yet.
>>>>> 1. A user comes and submits his string. We escape it and store it in our
>>>>> database. If the string had any HTML characters, they would have gotten
>>>>> escaped.
>>>>> 
>>>>> 2. After some time, the same user fetches his string, adds some more HTML
>>>>> characters and submits it. At this point, although the escape method would
>>>>> correctly escape the freshly added HTML characters, it would escape the
>>>>> older escaped HTML characters again! (for example &gt; would become
>>>>> &amp;gt;)
>>>>> And this effect gets magnified if step number 2 above is repeated.
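
To make the double-escaping effect in step 2 concrete, here is a minimal Java sketch. The `escape` helper is a hand-written stand-in that handles only four characters (an assumption for illustration); it is not the Commons Text implementation.

```java
final class DoubleEscapeDemo {
    // Illustrative escaper only: covers & < > " and nothing else.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String stored = escape("a > b");     // step 1: stored as "a &gt; b"
        String edited = stored + " <i>";     // step 2: user appends markup to the stored form
        System.out.println(escape(edited));  // prints "a &amp;gt; b &lt;i&gt;"
    }
}
```

The older `&gt;` is escaped a second time because the escaper cannot tell that it came from an earlier pass.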
>>>> 
>>>> Of course, that is my point.
>>>> 
>>>> Also remember that you want to show the original string to the user.
>>>> That's not possible in general if you use this approach.
>>>> 
>>>> Suppose they originally entered
>>>> 
>>>> "To code ampersand (&) in HTML, use '&amp;'"
>>>> 
>>>> Using escapeOnce, this would become:
>>>> 
>>>> "To code ampersand (&amp;) in HTML, use '&amp;'"
>>>> 
>>>> You can either show that directly to the user, or use an unescapeOnce
>>>> and show them:
>>>> 
>>>> "To code ampersand (&) in HTML, use '&'"
> 
> I have had this use case in a project (enclosing XML/HTML content in an
> XML stream) and the expected output for escapeOnce in this case would
> be:
> "To code ampersand (&amp;) in HTML, use '&amp;amp;'"
> 
> And similarly unescape once would generate back:
> "To code ampersand (&) in HTML, use '&amp;'"
> 
> Just my two cents, as I have had to write this code.
> 
>>>> 
>>>> Neither makes any sense.
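
The ambiguity in the ampersand example can be reproduced with a short Java sketch. The `escapeOnce` below is hypothetical: it assumes "already escaped" means an ampersand that starts one of four known entities, which is one possible reading of the proposal and not the method under discussion.

```java
import java.util.regex.Pattern;

final class EscapeOnceSketch {
    // Hypothetical rule: an ampersand counts as "already escaped" when it
    // starts one of these four entities; anything else is a bare ampersand.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot);)");

    static String escapeOnce(String s) {
        // Escape bare ampersands first so the entities added below are not touched.
        String out = BARE_AMP.matcher(s).replaceAll("&amp;");
        return out.replace("<", "&lt;")
                  .replace(">", "&gt;")
                  .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String a = "To code ampersand (&) in HTML, use '&amp;'";
        String b = "To code ampersand (&) in HTML, use '&'";
        // Two different inputs produce the same output, so the original
        // text cannot be recovered from the escaped form.
        System.out.println(escapeOnce(a));
        System.out.println(escapeOnce(a).equals(escapeOnce(b))); // prints "true"
    }
}
```

Under this rule the function is idempotent, but only because it discards the difference between a literal `&amp;` and an escaped `&`.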
>>>> 
>>>>> How do we solve the above problem without the 'escapeOnce' method?
>>>> 
>>>> Store the raw string in the database and escape it just before display.
>>>> 
>>>> If you are using JavaScript, then use an approach such as this to escape it:
>>>> 
>>>> document.getElementById("whereItGoes").appendChild(document.createTextNode(unsafe_str));
>>>> 
>>>> See:
>>>> 
>>>> http://shebang.brandonmintern.com/foolproof-html-escaping-in-javascript/
>>>> 
>>>> This has a good discussion of some of the problems.
>>>> 
>>>> ==
>>>> 
>>>> Sorry, but it's not possible in general to do what you want, because
>>>> one cannot reliably determine if a string has been escaped just from
>>>> looking at the string.
>>> 
>>> Another thought occurred to me (again despite potential lack of value).
>>> 
>>> We should be able to quickly verify whether there are any escape sequences in the
>>> string in question. A single application of unescape followed by checking string
>>> equality with the original input would yield a predicate on the existence of
>>> escapes present in the input in question.
>> 
>> Again, what does unescape mean in this context?
>> Does it ignore incomplete escape sequences, or throw an error?
>> 
>>> From there we could: (1) escape if no escapes were present in the original, or
>>> (2) throw an exception if there were escapes present in the original string.
>>> Again, this feels contrived, so I’m not really suggesting that we add it. I’m
>>> just playing with ideas here that could accomplish what Sampanna is going for.
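
A rough Java sketch of the predicate and the escape-or-throw idea follows. The `escape` and `unescape` helpers are simplified assumptions covering four entities; incomplete sequences such as `&amp` without a semicolon are simply left alone, which is one possible policy among several.

```java
final class EscapePredicateSketch {
    static String escape(String s) {
        // Ampersand first, so the entities introduced afterwards stay intact.
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String unescape(String s) {
        // One level only: "&amp;" is handled last so "&amp;lt;" becomes "&lt;", not "<".
        return s.replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&amp;", "&");
    }

    // The predicate: does a single unescape pass change the string?
    static boolean containsEscapes(String s) {
        return !unescape(s).equals(s);
    }

    // Options (1) and (2): escape clean input, reject input that has entities.
    static String escapeStrict(String s) {
        if (containsEscapes(s)) {
            throw new IllegalArgumentException("input already contains escape sequences");
        }
        return escape(s);
    }

    public static void main(String[] args) {
        System.out.println(containsEscapes("This & that"));     // prints "false"
        System.out.println(containsEscapes("This &amp; that")); // prints "true"
        System.out.println(escapeStrict("a < b"));              // prints "a &lt; b"
    }
}
```

The predicate only reports that entity-like text is present; it cannot say whether that text was produced by an earlier escape or typed literally by the user.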
>> 
>> The request is impossible to fulfill reliably, and does not deserve to
>> be added to a Commons library.
>> 
>> I don't know why this is still being discussed.
>> 
>>> -Rob
>>> 
>>>> 
>>>> The most one can do is to sanitise the string by escaping anything
>>>> that is unescaped.
>>>> However that process corrupts the input - a browser won't display the
>>>> proper output.
>>>> 
>>>>>> On 20 February 2017 at 21:40, sebb <sebbaz@gmail.com> wrote:
>>>>>> 
>>>>>>> On 20 February 2017 at 15:36, Rob Tompkins <chtompki@apache.org> wrote:
>>>>>>> 
>>>>>>>> On Feb 20, 2017, at 10:30 AM, sebb <sebbaz@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> On 20 February 2017 at 14:55, Rob Tompkins <chtompki@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>>> On Feb 20, 2017, at 4:31 AM, sebb <sebbaz@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> On 19 February 2017 at 14:29, Raymond DeCampo <ray@decampo.org> wrote:
>>>>>>>>>>> I am trying to see how having the proposed unescape() method leads to a
>>>>>>>>>>> useful escape method.
>>>>>>>>>>> 
>>>>>>>>>>> E.g. clearly unescape("&amp;") would evaluate to "&".  So would
>>>>>>>>>>> unescape("&amp;amp;").  That means the proposed escape() method would also
>>>>>>>>>>> have the same output for "&amp;" and "&amp;amp;".
>>>>>>>>>>> 
>>>>>>>>>>> I think a better approach for an idempotent escape would be to just
>>>>>>>>>>> unescape the string once, and then run the traditional escape.
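
The "unescape once, then escape" composition can be sketched in Java as below. The helpers are simplified four-entity stand-ins and the name `normalize` is made up for this illustration.

```java
final class UnescapeThenEscapeSketch {
    static String escape(String s) {
        // Ampersand first, so the entities introduced afterwards stay intact.
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String unescapeOnce(String s) {
        // Removes exactly one level: "&amp;" is handled last.
        return s.replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&amp;", "&");
    }

    // Two consecutive calls to existing methods, as suggested above.
    static String normalize(String s) {
        return escape(unescapeOnce(s));
    }

    public static void main(String[] args) {
        System.out.println(normalize("&"));         // prints "&amp;"
        System.out.println(normalize("&amp;"));     // prints "&amp;"
        System.out.println(normalize("&amp;amp;")); // prints "&amp;amp;"
    }
}
```

With these helpers, applying `normalize` twice gives the same result as applying it once, and `&amp;` and `&amp;amp;` stay distinct. A user who meant the literal text `&amp;` still gets `&amp;` back instead of `&amp;amp;`, which is the case that needs saved state.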
>>>>>>>>>> 
>>>>>>>>>> That does not eliminate the problems, as you state below.
>>>>>>>>>> 
>>>>>>>>>>> You will
>>>>>>>>>>> still have issues if the user intended to escape the string "&amp;" but you
>>>>>>>>>>> are never going to crack that without some kind of state saving.
>>>>>>>>>> 
>>>>>>>>>> That is my exact point.
>>>>>>>>>> 
>>>>>>>>>> Since it's not possible for the function to work reliably, we should
>>>>>>>>>> not mislead users by pretending that there is a magic method that
>>>>>>>>>> works.
>>>>>>>>>> 
>>>>>>>>>>> Then, given that the functionality is available via two consecutive calls to
>>>>>>>>>>> existing methods, I would probably be disinclined to include it in the
>>>>>>>>>>> library.
>>>>>>>>>> 
>>>>>>>>>> +1
>>>>>>>>> 
>>>>>>>>> I’m a (+1) for removal as well.
>>>>>>>>> 
>>>>>>>>> Also, I didn’t mean for my example to sound like a proposal. I merely
>>>>>>>>> was trying to get to a potentially valuable stateless idempotent string
>>>>>>>>> escape function. Its contrivance is quite clear.
>>>>>>>>> 
>>>>>>>>> Any other comments out there?
>>>>>>>>> 
>>>>>>>>> We could provide a stateful escaper (that figures out how many escapes
>>>>>>>>> a string is in), or a method that returns the number of escapes in a
>>>>>>>>> string. Again, I’m not all that sure on the value of such methods.
>>>>>>>> 
>>>>>>>> I don't think it's possible to work out the number of times a string
>>>>>>>> has been escaped.
>>>>>>> 
>>>>>>> That may indeed be true, but it is possible to return the number of
>>>>>>> times unescape need be run before subsequent unescapes yield the same
>>>>>>> result.
>>>>>> 
>>>>>> That in itself is potentially ambiguous.
>>>>>> Does the unescaper keep going until there are no valid escape
>>>>>> sequences left, or does it stop when there is at least one ampersand
>>>>>> which is not part of a valid escape sequence?
>>>>>> 
>>>>>>> Again, I’m not sure if this is a valuable measure to concern ourselves
>>>>>>> with.
>>>>>> 
>>>>>> I don't think it provides anything useful.
>>>>>> 
>>>>>>>> 
>>>>>>>> The most one can do is to determine if a string has not been escaped.
>>>>>>>> That would be the case where a string has one or more unescaped
>>>>>>>> characters in it.
>>>>>>>> For example "This & that" has obviously not been escaped.
>>>>>>>> 
>>>>>>>> However if a string has no un-escaped characters in it, that does not
>>>>>>>> necessarily mean that it has already been escaped.
>>>>>>>> For example: "This &amp; that".
>>>>>>>> This might have been escaped - or it might not.
>>>>>>> 
>>>>>>> Ah, I was using the definition of “having been escaped” to be that the
>>>>>>> string contains escape sequences.
>>>>>>> 
>>>>>>>> For example it could be the answer to: "How does one code 'This &
>>>>>>>> that' in HTML?"
>>>>>>>> 
>>>>>>>> The application has to keep track of the escape-state of the string.
>>>>>>> 
>>>>>>> Definitely agreed with your definition of “having been escaped.”
>>>>>>> 
>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> -Rob
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Feb 18, 2017 at 12:04 PM, Rob Tompkins <chtompki@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> In preparation for the 1.0 release, I think we should address Sebb's
>>>>>>>>>>>> concern in TEXT-40 about the attempt to create "idempotent" string escape
>>>>>>>>>>>> methods. By idempotent I mean someMethod("some string") =
>>>>>>>>>>>> someMethod(someMethod(someMethod(...someMethod("some string")))), a
>>>>>>>>>>>> single application of a method is equal to any number of the applications
>>>>>>>>>>>> of the method on the same input.
>>>>>>>>>>>> 
>>>>>>>>>>>> Below I lay out a mechanism by which it is possible to write such methods,
>>>>>>>>>>>> but I don’t know the value in writing such methods. I'm merely expressing
>>>>>>>>>>>> that idempotency is a possibility.
>>>>>>>>>>>> 
>>>>>>>>>>>> For string "un-escaping", I believe that we can write a method that,
>>>>>>>>>>>> indeed, is idempotent by simply running the un-escape method the finite
>>>>>>>>>>>> number of un-escapings to get to the point at which the string remains
>>>>>>>>>>>> unchanged between applications of the un-escaping method. (I believe that I
>>>>>>>>>>>> can write a proof that all un-escape methods have such a point, if that is
>>>>>>>>>>>> needed for the sake of discussion).
>>>>>>>>>>>> 
>>>>>>>>>>>> If indeed we can create an idempotent un-escape method, then we can simply
>>>>>>>>>>>> take that method run it, and then run the escaping method one time. If we
>>>>>>>>>>>> always completely unescape and then escape once then we do have an
>>>>>>>>>>>> idempotent method.
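
The mechanism just described (unescape to a fixed point, then escape once) might look like this in Java. The helpers cover only four entities and all names are assumptions for this sketch, not the methods in the library.

```java
final class FixedPointSketch {
    static String escape(String s) {
        // Ampersand first, so the entities introduced afterwards stay intact.
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String unescapeOnce(String s) {
        // Removes exactly one level: "&amp;" is handled last.
        return s.replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&amp;", "&");
    }

    // Repeat until a pass changes nothing. Every pass that changes the
    // string also shortens it, so the loop must terminate.
    static String unescapeFully(String s) {
        String current = s;
        String previous;
        do {
            previous = current;
            current = unescapeOnce(previous);
        } while (!current.equals(previous));
        return current;
    }

    static String escapeIdempotent(String s) {
        return escape(unescapeFully(s));
    }

    public static void main(String[] args) {
        System.out.println(escapeIdempotent("&"));         // prints "&amp;"
        System.out.println(escapeIdempotent("&amp;"));     // prints "&amp;"
        System.out.println(escapeIdempotent("&amp;amp;")); // prints "&amp;"
    }
}
```

In this sketch the result is idempotent, but every nesting depth of the same text collapses to one output, so any literal entity the user typed is lost.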
>>>>>>>>>>>> 
>>>>>>>>>>>> Such a method might not be all that valuable to the user though.
>>>>>>>>>>>> Furthermore, this just explains one way to create such an idempotent
>>>>>>>>>>>> method. Whether or not more or more valuable methods exist would take
>>>>>>>>>>>> some more thought.
>>>>>>>>>>>> 
>>>>>>>>>>>> Anyone have any thoughts? My feeling is that it might be more effort than
>>>>>>>>>>>> it's worth to ensure that any string is only “singly encoded.” Further, we
>>>>>>>>>>>> probably should give a look at the “escape_once” methods in
>>>>>>>>>>>> StringEscapeUtils.
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> -Rob
>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>>>>>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 

