commons-dev mailing list archives

From Rob Tompkins <chtom...@apache.org>
Subject Re: [text] On the value of idempotent string escape methods?
Date Tue, 21 Feb 2017 12:40:12 GMT

> On Feb 21, 2017, at 6:02 AM, sebb <sebbaz@gmail.com> wrote:
> 
> On 21 February 2017 at 04:40, Sampanna Kahu <sampyash@gmail.com> wrote:
>> Hi Guys,
>> Very good points are being made above. Please allow me to add my two cents
>> :-)
>> 
>> What if the string contains syntactically valid HTML characters/tags and
>> our aim is to prevent rendering these tags in the browser when this string
>> is being served via a web application? Or prevent the execution of harmful
>> embedded scripts when serving it? The 'escapeOnce' method could be useful
>> here, right?
> 
> I don't think so.
> 
>> To explain better, let's consider an example of the specific use-case that
>> I had in mind when building the 'escapeOnce' method:
>> Consider the scenario of a simple RESTful web application where users can
>> manipulate their text using simple CRUD operations. Let's assume that we do
>> not have the 'escapeOnce' method yet.
>> 1. A user comes and submits his string. We escape it and store it in our
>> database. If the string had any HTML characters, they would have gotten
>> escaped.
>> 
>> 2. After some time, the same user fetches his string, adds some more HTML
>> characters and submits it. At this point, although the escape method would
>> correctly escape the freshly added HTML characters, it would escape the
>> older escaped HTML characters again! (for example &gt; would become
>> &amp;gt;)
>> And this effect gets magnified if step number 2 above is repeated.
> 
> Of course, that is my point.
> 
> Also remember that you want to show the original string to the user.
> That's not possible in general if you use this approach.
> 
> Suppose they originally entered
> 
> "To code ampersand (&) in HTML, use '&amp;'"
> 
> Using escapeOnce, this would become:
> 
> "To code ampersand (&amp;) in HTML, use '&amp;'"
> 
> You can either show that directly to the user, or use an unescapeOnce
> and show them:
> 
> "To code ampersand (&) in HTML, use '&'"
> 
> Neither makes any sense.
> 
>> How do we solve the above problem without the 'escapeOnce' method?
> 
> Store the raw string in the database and escape it just before display.
> 
> If you are using Javascript, then use an approach such as this to escape it:
> 
> document.getElementById("whereItGoes").appendChild(document.createTextNode(unsafe_str));
> 
> See:
> 
> http://shebang.brandonmintern.com/foolproof-html-escaping-in-javascript/
> 
> This has a good discussion of some of the problems.
> 
> ==
> 
> Sorry, but it's not possible in general to do what you want, because
> one cannot reliably determine if a string has been escaped just from
> looking at the string.

Another thought occurred to me (again despite potential lack of value). 

We should be able to quickly verify whether there are any escape sequences in the string in question.
A single application of unescape, followed by a string-equality check against the original input,
would yield a predicate for the presence of escapes in that input. From
there we could: (1) escape if no escapes were present in the original, or (2) throw an exception
if escapes were present in the original string.
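
A minimal sketch of that predicate-then-escape idea, in Java. Everything named here (GuardedEscaper, escapeHtml, unescapeHtml, containsEscapes, escapeIfClean) is hypothetical and made up for illustration; none of it is Commons Text API, and the stand-in escaper only handles &, <, > and the double quote.

```java
// Hypothetical illustration only; not part of Commons Text.
final class GuardedEscaper {

    private GuardedEscaper() { }

    // Stand-in escaper (assumption): covers only & < > and ".
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Single unescape pass over the same four entities.
    // "&amp;" is replaced last so that "&amp;gt;" becomes "&gt;", not ">".
    static String unescapeHtml(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    // The predicate: one unescape pass changes the string
    // exactly when it contains at least one escape sequence.
    static boolean containsEscapes(String s) {
        return !unescapeHtml(s).equals(s);
    }

    // (1) escape if no escapes are present, (2) throw otherwise.
    static String escapeIfClean(String s) {
        if (containsEscapes(s)) {
            throw new IllegalArgumentException(
                    "input already contains escape sequences: " + s);
        }
        return escapeHtml(s);
    }
}
```

With this sketch, escapeIfClean("This & that") returns "This &amp; that", while escapeIfClean("This &amp; that") throws. It also rejects sebb's example, "To code ampersand (&) in HTML, use '&amp;'", because that raw input legitimately contains an escape sequence; so it doesn't get around the ambiguity, it just refuses to guess.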

Again, this feels contrived, so I’m not really suggesting that we add it. I’m just playing
with ideas here that could accomplish what Sampanna is going for.

-Rob

> 
> The most one can do is to sanitise the string by escaping anything
> that is unescaped.
> However that process corrupts the input - a browser won't display the
> proper output.
> 
>> On 20 February 2017 at 21:40, sebb <sebbaz@gmail.com> wrote:
>> 
>>> On 20 February 2017 at 15:36, Rob Tompkins <chtompki@apache.org> wrote:
>>>> 
>>>>> On Feb 20, 2017, at 10:30 AM, sebb <sebbaz@gmail.com> wrote:
>>>>> 
>>>>> On 20 February 2017 at 14:55, Rob Tompkins <chtompki@apache.org> wrote:
>>>>>> 
>>>>>>> On Feb 20, 2017, at 4:31 AM, sebb <sebbaz@gmail.com> wrote:
>>>>>>> 
>>>>>>> On 19 February 2017 at 14:29, Raymond DeCampo <ray@decampo.org> wrote:
>>>>>>>> I am trying to see how having the proposed unescape() method leads to a
>>>>>>>> useful escape method.
>>>>>>>> 
>>>>>>>> E.g. clearly unescape("&amp;") would evaluate to "&". So would
>>>>>>>> unescape("&amp;amp;"). That means the proposed escape() method would also
>>>>>>>> have the same output for "&amp;" and "&amp;amp;".
>>>>>>>> 
>>>>>>>> I think a better approach for an idempotent escape would be to just
>>>>>>>> unescape the string once, and then run the traditional escape.
>>>>>>> 
>>>>>>> That does not eliminate the problems, as you state below.
>>>>>>> 
>>>>>>>> You will
>>>>>>>> still have issues if the user intended to escape the string "&amp;" but you
>>>>>>>> are never going to crack that without some kind of state saving.
>>>>>>> 
>>>>>>> That is my exact point.
>>>>>>> 
>>>>>>> Since it's not possible for the function to work reliably, we should
>>>>>>> not mislead users by pretending that there is a magic method that
>>>>>>> works.
>>>>>>> 
>>>>>>>> Then, given that the functionality is available via two consecutive calls to
>>>>>>>> existing methods, I would probably be disinclined to include it in the
>>>>>>>> library.
>>>>>>> 
>>>>>>> +1
>>>>>> 
>>>>>> I’m a (+1) for removal as well.
>>>>>> 
>>>>>> Also, I didn’t mean for my example to sound like a proposal. I was merely
>>>>>> trying to get to a potentially valuable stateless idempotent string
>>>>>> escape function. Its contrivance is quite clear.
>>>>>> 
>>>>>> Any other comments out there?
>>>>>> 
>>>>>> We could provide a stateful escaper (that figures out how many times
>>>>>> a string has been escaped), or a method that returns the number of escape
>>>>>> levels in a string. Again, I’m not all that sure of the value of such methods.
>>>>> 
>>>>> I don't think it's possible to work out the number of times a string
>>>>> has been escaped.
>>>> 
>>>> That may indeed be true, but it is possible to return the number of
>>>> times unescape need be run before subsequent unescapes yield the same
>>>> result.
>>> 
>>> That in itself is potentially ambiguous.
>>> Does the unescaper keep going until there are no valid escape
>>> sequences left, or does it stop when there is at least one ampersand
>>> which is not part of a valid escape sequence?
>>> 
>>>> Again, I’m not sure if this is a valuable measure to concern ourselves with.
>>> 
>>> I don't think it provides anything useful.
>>> 
>>>>> 
>>>>> The most one can do is to determine if a string has not been escaped.
>>>>> That would be the case where a string has one or more unescaped
>>>>> characters in it.
>>>>> For example "This & that" has obviously not been escaped.
>>>>> 
>>>>> However, if a string has no un-escaped characters in it, that does not
>>>>> necessarily mean that it has already been escaped.
>>>>> For example: "This &amp; that".
>>>>> This might have been escaped - or it might not.
>>>> 
>>>> Ah, I was using the definition of “having been escaped” to be that the
>>>> string contains escape sequences.
>>>> 
>>>>> For example it could be the answer to: "How does one code 'This &
>>>>> that' in HTML?”
>>>>> 
>>>>> The application has to keep track of the escape-state of the string.
>>>> 
>>>> Definitely agreed with your definition of “having been escaped."
>>>> 
>>>>> 
>>>>>> Cheers,
>>>>>> -Rob
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Feb 18, 2017 at 12:04 PM, Rob Tompkins <chtompki@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> In preparation for the 1.0 release, I think we should address Sebb's
>>>>>>>>> concern in TEXT-40 about the attempt to create "idempotent" string escape
>>>>>>>>> methods. By idempotent I mean someMethod("some string") =
>>>>>>>>> someMethod(someMethod(someMethod(...someMethod("some string")))), i.e. a
>>>>>>>>> single application of a method is equal to any number of applications
>>>>>>>>> of the method on the same input.
>>>>>>>>> 
>>>>>>>>> Below I lay out a mechanism by which it is possible to write such methods,
>>>>>>>>> but I don’t know the value in writing such methods. I'm merely expressing
>>>>>>>>> that idempotency is a possibility.
>>>>>>>>> 
>>>>>>>>> For string "un-escaping", I believe that we can write a method that,
>>>>>>>>> indeed, is idempotent by simply running the un-escape method the finite
>>>>>>>>> number of times needed to get to the point at which the string remains
>>>>>>>>> unchanged between applications of the un-escaping method. (I believe that I
>>>>>>>>> can write a proof that all un-escape methods have such a point, if that is
>>>>>>>>> needed for the sake of discussion).
>>>>>>>>> 
>>>>>>>>> If indeed we can create an idempotent un-escape method, then we can simply
>>>>>>>>> take that method, run it, and then run the escaping method one time. If we
>>>>>>>>> always completely unescape and then escape once, then we do have an
>>>>>>>>> idempotent method.
>>>>>>>>> 
>>>>>>>>> Such a method might not be all that valuable to the user though.
>>>>>>>>> Furthermore, this just explains one way to create such an idempotent
>>>>>>>>> method. Whether or not other, more valuable methods exist would take
>>>>>>>>> some more thought.
>>>>>>>>> 
>>>>>>>>> Anyone have any thoughts? My feeling is that it might be more effort than
>>>>>>>>> it's worth to ensure that any string is only "singly encoded." Further, we
>>>>>>>>> probably should give a look at the "escape_once" methods in
>>>>>>>>> StringEscapeUtils.
>>>>>>>>> Cheers
>>>>>>>>> -Rob
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>>>>>> 
>>>>>>>>> 
