incubator-any23-dev mailing list archives

From "Peter Ansell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ANY23-99) NQuadsWriter should force ASCII in OutputStream constructor
Date Tue, 22 May 2012 23:18:41 GMT

    [ https://issues.apache.org/jira/browse/ANY23-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281300#comment-13281300
] 

Peter Ansell commented on ANY23-99:
-----------------------------------

In the past, the hardest fault I have had to find in a system standardised on UTF-8 has been
the use of the OutputStreamWriter(OutputStream) constructor instead of the
OutputStreamWriter(OutputStream,Charset) constructor. I have no specific example of non-ASCII
output coming out of the NQuadsWriter. Are there any character sets that could produce NQuads
documents that are not ASCII-compatible, if the user's locale were set up with that charset and
OutputStreamWriter(OutputStream) inherited it by default because we didn't specify US-ASCII
explicitly? The escaping seems to make it okay at a semantic level, but in practice the output
bytes would still vary with the JVM environment properties if the charset isn't set explicitly.
Not changing the constructor just seems like we are asking for a bug that could be easily
avoided (based on the current spec saying ASCII-only).
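
As a minimal sketch of the concern above (the class and method names are hypothetical and this
is not Any23 code), the following writes the same already-escaped line through an
OutputStreamWriter with an explicit charset. UTF-16 is used only as a stand-in for a default
charset that is not ASCII-compatible; whether a given JVM would ever pick such a default is an
assumption about its environment.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Any23.
final class ExplicitCharsetDemo {

    // Encodes text through an OutputStreamWriter bound to an explicit charset,
    // so the resulting bytes do not depend on the JVM default.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // The literal is already escaped, so it contains ASCII characters only.
        String line = "<http://example.org/s> <http://example.org/p> \"caf\\u00E9\" .\n";
        byte[] ascii = write(line, StandardCharsets.US_ASCII);
        byte[] utf16 = write(line, StandardCharsets.UTF_16);
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("US-ASCII bytes:  " + ascii.length);
        System.out.println("UTF-16 bytes:    " + utf16.length);
        System.out.println("identical:       " + Arrays.equals(ascii, utf16));
    }
}
```

With an ASCII-compatible default such as UTF-8 or ISO-8859-1 the escaped output would come out
byte-identical anyway, which is probably why this kind of fault is so hard to spot.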

There are examples of non-ASCII data successfully going into the NQuadsParser in NQuadsParserTest,
which is to be expected if we accept liberally and output standardised NQuads. It is a little
strange, though, that the test suite explicitly supports it, given that the current specification
is very clear about the \u encoding rules for all non-ASCII characters.
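
For reference, the escaping rule on the output side can be sketched roughly as follows. This is
my reading of the original N-Triples escaping rules (printable ASCII passes through, everything
else becomes a 4-digit or 8-digit uppercase hex escape); the class name is hypothetical and it
is not the Any23 or Rio implementation.

```java
// Hypothetical helper, not part of Any23 or Sesame Rio.
final class NTriplesEscaper {

    // Escapes a literal value so that the result contains only printable ASCII.
    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (cp >= 0x20 && cp <= 0x7E) {
                        // Printable ASCII is written unchanged.
                        out.append((char) cp);
                    } else if (cp <= 0xFFFF) {
                        // Basic Multilingual Plane: lowercase u plus 4 hex digits.
                        out.append(String.format("\\u%04X", cp));
                    } else {
                        // Supplementary planes: uppercase U plus 8 hex digits.
                        out.append(String.format("\\U%08X", cp));
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("caf\u00E9"));
    }
}
```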

It would be great if both NTriples and NQuads were able to fully support UTF-8 when they
are revised. It is also great that NTriples is getting a specific MIME type this time around.
Hopefully the distinction between the two MIME types for essentially the same format doesn't
confuse people. It seems fairly unusual to have a scenario where a single format has two
legitimate types whose only difference is the encoding rules. It would be ideal to be able to
handle \uNNNN the same as the native UTF-8 bytes. That would make it possible to parse old
documents while all new documents just use UTF-8, without having to check whether the user
wanted text/plain NTriples or application/n-triples NTriples when writing out.
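
A rough sketch of what "handle \uNNNN the same as the native UTF-8 bytes" could look like on the
parsing side: decode the bytes as UTF-8 (of which ASCII is a subset), then resolve any escapes
that remain. The class and method names are hypothetical, and the hex parsing is deliberately
simple rather than strictly validating.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Any23 or Sesame Rio.
final class LenientUnescaper {

    // Decodes a literal value from bytes, accepting both native UTF-8 and escapes.
    static String decode(byte[] bytes) {
        return unescape(new String(bytes, StandardCharsets.UTF_8));
    }

    // Resolves backslash escapes; characters that are not escaped pass through.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
                i++;
                continue;
            }
            if (i + 1 >= s.length()) {
                throw new IllegalArgumentException("Dangling backslash at index " + i);
            }
            char kind = s.charAt(i + 1);
            int hexLen = kind == 'u' ? 4 : kind == 'U' ? 8 : 0;
            if (hexLen > 0) {
                int end = i + 2 + hexLen;
                if (end > s.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                // parseInt is lenient (for example, it accepts a leading sign);
                // a real parser should validate the digits more strictly.
                out.appendCodePoint(Integer.parseInt(s.substring(i + 2, end), 16));
                i = end;
            } else {
                switch (kind) {
                    case 't':  out.append('\t'); break;
                    case 'n':  out.append('\n'); break;
                    case 'r':  out.append('\r'); break;
                    case '"':  out.append('"');  break;
                    case '\\': out.append('\\'); break;
                    default:
                        throw new IllegalArgumentException("Unknown escape at index " + i);
                }
                i += 2;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] oldStyle = "caf\\u00E9".getBytes(StandardCharsets.US_ASCII);
        byte[] newStyle = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(oldStyle).equals(decode(newStyle)));
    }
}
```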

Naively, I would see this possibly requiring two different Rio writers (as each Rio writer is
tied to a single RDFFormat, which has a single charset attached to it) and possibly two
different Rio parsers for the same reason. That doesn't really seem ideal, but if necessary it
may be a workaround.
                
> NQuadsWriter should force ASCII in OutputStream constructor
> -----------------------------------------------------------
>
>                 Key: ANY23-99
>                 URL: https://issues.apache.org/jira/browse/ANY23-99
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.0
>            Reporter: Peter Ansell
>
> The NQuads specification states that all NQuads documents must be ASCII encoded. [1]
> The current NQuadsWriter(OutputStream) constructor does not enforce this when creating the
> OutputStreamWriter that wraps the given OutputStream. If it is not enforced, then the user's
> locale will be used to create the OutputStreamWriter, which may not enforce US-ASCII.
> Patch is to replace the constructor with:
>         this( new OutputStreamWriter(os, Charset.forName("US-ASCII")) );
> [1] http://sw.deri.org/2008/07/n-quads/#mediatype
