axis-c-dev mailing list archives

From "Bill Mitchell (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AXIS2C-859) guththila parser fails to handle escape sequences for ampersand, less than, greater than
Date Tue, 29 Jan 2008 23:49:37 GMT

    [ https://issues.apache.org/jira/browse/AXIS2C-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563745#action_12563745 ]

Bill Mitchell commented on AXIS2C-859:
--------------------------------------

Thanks for picking up this issue, Lahiru.  I was thinking about starting to look at it in
detail myself.    

Examining the patch, I have a few thoughts.

First, you allocate a block of memory, escape_char, to hold the copy of the token to this
point.  But the size of the block is the sizeof a pointer to char, which is 4 on most machines.
So the block will frequently be too small to hold all the characters preceding the escaped
character.
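
A minimal sketch of the sizing problem, not taken from the patch itself: copy_prefix is a
hypothetical helper, and the only assumption is that the copy must hold every character that
precedes the escaped sequence.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of guththila): copy the first `len` bytes of
 * a token into a freshly allocated, NUL-terminated block. */
static char *copy_prefix(const char *start, size_t len)
{
    char *copy;

    assert(start != NULL);

    /* Sizing the block with sizeof(char *) would give 4 or 8 bytes no matter
     * how long the prefix is. Size it by the data plus the terminator. */
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, start, len);
    copy[len] = '\0';
    return copy;
}
```

The caller owns the returned block and must free it, which is exactly the kind of bookkeeping
the next suggestion tries to avoid.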
 

It occurs to me that guththila goes to a lot of effort to avoid allocating memory.
Having worked on some issues recently in the buffer management code, I would propose a different
solution: moving the data in the buffer itself.  Although the obvious solution would be to
replace the escaped sequence with the intended character and slide the remainder of the buffer
down, this could be time-consuming.  A clever idea might be to replace the escaped sequence,
placing the intended character at the end of the sequence, and then copy the characters that
precede the sequence up from the start of the token, moving the token start up and reducing
its size.  In most cases, this would not be a large amount of data to move, and it avoids the
memory allocation entirely.
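
The in-place idea can be sketched roughly as follows.  This is an illustration under
assumptions, not guththila code: token_span and unescape_in_place are hypothetical names, the
token is assumed to be a writable (start, size) span inside the parser buffer, and only the
three sequences named in the issue are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of a token: a writable span inside the parser buffer. */
typedef struct {
    char  *start;
    size_t size;
} token_span;

/* Decode "&amp;", "&lt;" and "&gt;" without allocating. The decoded character
 * is written to the last byte of its sequence, then the characters before the
 * sequence are moved up against it and the token start advances. */
static void unescape_in_place(token_span *tok)
{
    size_t i = 0;

    assert(tok != NULL && (tok->start != NULL || tok->size == 0));

    while (i < tok->size) {
        char  *p    = tok->start + i;
        size_t left = tok->size - i;   /* bytes from p to the token end */
        size_t seq  = 0;
        char   ch   = 0;

        if (*p == '&') {
            if (left >= 5 && memcmp(p, "&amp;", 5) == 0) { seq = 5; ch = '&'; }
            else if (left >= 4 && memcmp(p, "&lt;", 4) == 0) { seq = 4; ch = '<'; }
            else if (left >= 4 && memcmp(p, "&gt;", 4) == 0) { seq = 4; ch = '>'; }
        }

        if (seq != 0) {
            size_t shift = seq - 1;                      /* bytes the token loses */
            p[shift] = ch;                               /* last byte of the sequence */
            memmove(tok->start + shift, tok->start, i);  /* slide the prefix up */
            tok->start += shift;
            tok->size  -= shift;
        }
        i++;  /* step past the plain or decoded character; never rescan it */
    }
}
```

Each decoded sequence moves only the characters before it in the same token, so the cost grows
with the token prefix rather than with the rest of the buffer.  The bytes left behind in front
of the new token start are stale and must not be referenced by any other token.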
 

Second, I like where you chose to insert this code, in the token_close logic.  Although I
can imagine trying to make this part of guththila_next, where it could massage the buffer
contents while it was deciding where the token boundaries are, it seems best to leave that
logic deciding where the edges of the tokens are without changing the characters inside the
tokens.  

Third, looking at the examples of character escaping in various texts, it appears that one
can find escaped character sequences in text and in attribute values.  So this logic either
needs to be duplicated, which is not pretty, or pushed down into a lower level shared routine.

Fourth, you inserted this logic in the _char_data: case.  It appears to me from the XML documentation
that we are supposed to replace sequences in text, but not in comments.  guththila_next()
seems to confuse this issue, as it treats them both as _char_data.  To distinguish the two,
my guess is it would be better to define a new token type, rather than cheat and look at the
m->guththila_event to tell them apart.  A new token type might point the direction to solving
the CDATA problem, whenever that gets approached.  Maybe use _char_data for the raw char data,
without processing, and a new _text_data for char data that undergoes processing of entity
sequences.  
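
As a rough sketch of that split, with stand-in names rather than guththila's actual token enum
(TOK_CHAR_DATA, TOK_TEXT_DATA, TOK_ATTRIBUTE_VALUE and needs_entity_processing are all
hypothetical), the decision about entity processing could hang off the token type alone:

```c
#include <assert.h>

/* Hypothetical token kinds; illustrative stand-ins, not guththila's enum. */
typedef enum {
    TOK_ATTRIBUTE_VALUE,  /* attribute values may contain escaped sequences */
    TOK_CHAR_DATA,        /* raw character data, e.g. comments; left untouched */
    TOK_TEXT_DATA         /* element text that undergoes entity processing */
} token_kind;

/* Returns nonzero when escaped sequences in a token of this kind should be
 * replaced, so callers need not inspect the current parser event. */
static int needs_entity_processing(token_kind kind)
{
    switch (kind) {
    case TOK_TEXT_DATA:
    case TOK_ATTRIBUTE_VALUE:
        return 1;
    case TOK_CHAR_DATA:
        return 0;
    default:
        assert(0 && "unknown token kind");
        return 0;
    }
}
```

A CDATA section could then presumably be reported as raw char data as well, since its contents
must not be converted.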

Fifth, when checking the following characters after the ampersand, it would be best to check
first that enough characters are left in the token, before looking at the characters themselves
and perhaps falling off the end of the buffer.  
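
One way to order those checks, shown as a sketch: entity_at is a hypothetical helper, and
`remaining` is assumed to be the number of bytes from the ampersand to the end of the token.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if one of the three sequences starts at p, store the
 * character it stands for in *out and return the sequence length; otherwise
 * return 0. `remaining` is the number of bytes from p to the token end. */
static size_t entity_at(const char *p, size_t remaining, char *out)
{
    static const struct {
        const char *seq;
        size_t      len;
        char        ch;
    } table[] = {
        { "&amp;", 5, '&' },
        { "&lt;",  4, '<' },
        { "&gt;",  4, '>' }
    };
    size_t k;

    assert(p != NULL && out != NULL);

    for (k = 0; k < sizeof table / sizeof table[0]; k++) {
        /* Length test first: memcmp only runs when the bytes are there. */
        if (remaining >= table[k].len &&
            memcmp(p, table[k].seq, table[k].len) == 0) {
            *out = table[k].ch;
            return table[k].len;
        }
    }
    return 0;
}
```

A token that ends in a truncated sequence such as "&am" simply yields 0 here, and no byte past
the end of the token is read.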

Of course, I'm relatively new to this logic, so these are just my observations.  

> guththila parser fails to handle escape sequences for ampersand, less than, greater than
> ----------------------------------------------------------------------------------------
>
>                 Key: AXIS2C-859
>                 URL: https://issues.apache.org/jira/browse/AXIS2C-859
>             Project: Axis2-C
>          Issue Type: Bug
>          Components: guththila
>    Affects Versions: Current (Nightly)
>         Environment: Windows XP, Visual Studio 2005, guththila parser, libcurl
>            Reporter: Bill Mitchell
>         Attachments: diff.txt
>
>
> When an incoming message contains within text the escaped ampersand sequence, "&amp;",
> this sequence is being passed to the client as raw text without being converted to the single
> ampersand character.  Clearly, this action must take place at the level of the parser, as
> only the parser knows whether it is seeing simple text, and conversion is required, or text
> embedded in a CDATA section, where conversion is not allowed.  I have tested the build with
> the libxml parser, and of course the libxml parser behaves correctly: the text passed to the
> client contains only the single ampersand character, not the escaped sequence.  (See section
> 2.4 of XML 1.0 spec.)
> Looking at the code, I expect the same problem occurs with all escaped sequences, less
> than and greater than as well as ampersand, on both input and output.  I also don't see where
> CDATA sections are handled, but as I am not seeing CDATA in the messages from the service
> I am hitting, I have not tested this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: axis-c-dev-unsubscribe@ws.apache.org
For additional commands, e-mail: axis-c-dev-help@ws.apache.org

