abdera-dev mailing list archives

From "jv ning (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ABDERA-222) Parse failures reading utf-8 xml files that have attribute values that contain non US-ASCII valid utf-8 characters
Date Wed, 25 Mar 2009 22:07:53 GMT

    [ https://issues.apache.org/jira/browse/ABDERA-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689256#action_12689256 ]

jv ning commented on ABDERA-222:
--------------------------------

This appears to trigger when the socket read boundaries fall such that the first byte of a
multi-byte character is the first byte returned by a read from the network socket.

In our failing case, 3 reads are issued against the input stream returned by the httpmethod:
1 for 4 bytes
1 for 196 bytes
1 for 3800 bytes
and then for 4 KB.

In our failing case, the read for 196 bytes returns fewer than 196 bytes, and the first
character read in the next read is the start byte of our multi-byte character.
The multi-byte character is returned in the 3rd READ_ARRAY call and written to position 200
in the input buffer.
When the multi-byte character is not the first byte sequence returned by a read, there is no
exception.
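
For anyone trying to reproduce this without a network, here is a minimal sketch in Java. It is an illustration only, not code from Abdera, Woodstox, or http-client. ShortReadDemo, ChunkedInputStream and readFully are hypothetical names introduced here. The wrapper stream simulates socket-style short reads (like the 196 byte request above that returned 158 bytes), and readFully copies the stream into a byte array, advancing only by the count each read actually returned.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ShortReadDemo {

    /** Hypothetical test stream: returns at most maxChunk bytes per read, like a socket. */
    static final class ChunkedInputStream extends InputStream {
        private final InputStream in;
        private final int maxChunk;

        ChunkedInputStream(InputStream in, int maxChunk) {
            this.in = in;
            this.maxChunk = maxChunk;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            // Deliberately return fewer bytes than requested.
            return in.read(buf, off, Math.min(len, maxChunk));
        }
    }

    /** Drains the stream, trusting only the byte count each read reports. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n); // append exactly n bytes, leaving no gap
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Sample document with a 2-byte UTF-8 character in an attribute value.
        byte[] xml = "<a title=\"caf\u00e9\"/>".getBytes(StandardCharsets.UTF_8);
        // A chunk size of 3 forces read boundaries to fall throughout the document.
        InputStream socketLike = new ChunkedInputStream(new ByteArrayInputStream(xml), 3);
        byte[] copy = readFully(socketLike);
        System.out.println(Arrays.equals(xml, copy));
    }
}
```

The resulting byte array can be wrapped in a ByteArrayInputStream and passed to parser.parse, which is the case that parses correctly for us. Passing the ChunkedInputStream to the parser directly, with varying chunk sizes, may be a way to hit the failing read boundary, although that is untested.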

"TIME"	"method"	"read byte count"	"read byte count after mark resets"	"where read data is
written into the buffer passed to read"	"read request size"	"count read"
1238017735367	" AVAILABLE"	0	0	0	4	4
1238017735367	"READ_ARRAY"	0	0			
1238017735367	" AVAILABLE"	4	4			
1238017735367	"READ_ARRAY"	4	4	4	196	158
1238017735367	" AVAILABLE"	162	162			
1238017735367	"READ_ARRAY"	162	162	200	3800	2890
1238017735370	"     CLOSE"	3052	3052			


> Parse failures reading utf-8 xml files that have attribute values that contain non US-ASCII valid utf-8 characters
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: ABDERA-222
>                 URL: https://issues.apache.org/jira/browse/ABDERA-222
>             Project: Abdera
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>         Environment: Solaris x86_64, Mac OS X Leopard x86_64, linux x86_64
>            Reporter: jv ning
>
> Parsing fails for XML files that are items fetched by http-client 3.1.
> The same items parse correctly if they are written to a byte array and a ByteArrayInputStream on that byte array is passed to parse.
> parser.parse(response.getResponseBodyAsStream());
> Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character (NULL, unicode 0) encountered: not valid in any content
>  at [row,col {unknown-source}]: [3,56]
>         at com.ctc.wstx.sr.StreamScanner.constructNullCharException(StreamScanner.java:615)
>         at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:644)
>         at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4554)
>         at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2886)
>         at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
>         at org.apache.abdera.parser.stax.FOMBuilder.getNextElementToParse(FOMBuilder.java:163)
>         at org.apache.abdera.parser.stax.FOMBuilder.next(FOMBuilder.java:187) 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

