commons-issues mailing list archives

From "Stian Soiland-Reyes (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (COMMONSRDF-51) RDF-1.1 specifies that language tags need to be compared using lower-case
Date Wed, 11 Jan 2017 16:25:59 GMT

    [ https://issues.apache.org/jira/browse/COMMONSRDF-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818566#comment-15818566 ]

Stian Soiland-Reyes edited comment on COMMONSRDF-51 at 1/11/17 4:24 PM:
------------------------------------------------------------------------

I think this needs to be clarified on public-rdf-comments@w3.org, as our "character by character"
wording is a [quote from the spec|https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal]:

{quote}

Literal term equality: Two literals are term-equal (the same RDF literal) if and only if the
two lexical forms, the two datatype IRIs, and the two language tags (if any) compare equal,
character by character. Thus, two literals can have the same value without being the same
RDF term. For example:

      "1"^^xs:integer
      "01"^^xs:integer
    
denote the same value, but are not the same literal RDF terms and are not term-equal because
their lexical form differs.
{quote}

Above that passage, the spec also says the value space of language tags is always in lower case,
but then it says equality is determined "character by character", not by value space. (As the
example shows, the lexical forms of datatypes like integers are also compared character by
character rather than by value space.)
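
To make the distinction concrete, here is a minimal Java sketch contrasting the two kinds of comparison. The class and method names are hypothetical and not part of Commons RDF, and treating lower case as the language tag value space is the reading under discussion, not a settled interpretation. Locale.ROOT is an assumption of this sketch; the issue description suggests Locale.ENGLISH, which gives the same result for ASCII tags.

```java
import java.math.BigInteger;
import java.util.Locale;

/** Contrasts character-by-character comparison with value-space comparison. */
final class TermEqualityDemo {

    /** Term equality as quoted from the spec: compare character by character. */
    static boolean sameLexicalForm(String a, String b) {
        return a.equals(b);
    }

    /** Value-space comparison for integer lexical forms. */
    static boolean sameIntegerValue(String a, String b) {
        return new BigInteger(a).equals(new BigInteger(b));
    }

    /** Value-space comparison for language tags, assuming a lower-case value space. */
    static boolean sameLanguageTagValue(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(sameLexicalForm("1", "01"));             // false
        System.out.println(sameIntegerValue("1", "01"));            // true
        System.out.println(sameLexicalForm("en-GB", "en-gb"));      // false
        System.out.println(sameLanguageTagValue("en-GB", "en-gb")); // true
    }
}
```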

I have nevertheless started a branch [COMMONSRDF-51-langtag-lcase|https://github.com/apache/commons-rdf/compare/COMMONSRDF-51-langtag-lcase]
to try this out. It revealed bugs in the bindings for simple (just the Turkish case), jsonld-java
(which does no validation of language tags), rdf4j (fails the Turkish test) and jena (fails the
Turkish test).

As both RDF4J and Jena are vulnerable to the Turkish case, that should be reported upstream
after rdf-comments clarifications.
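
For readers unfamiliar with the Turkish case, the following Java sketch shows the underlying behaviour: under a Turkish locale, the ASCII letter "I" lower-cases to the dotless "ı" (U+0131), so a tag lower-cased with that locale no longer matches one lower-cased with a fixed locale. The class name and the sample tag are illustrative assumptions; this is not the actual test from the branch.

```java
import java.util.Locale;

/** Shows why locale-sensitive lower-casing is unsafe for language tags. */
final class TurkishCaseDemo {

    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    /** Lower-cases a tag with an explicit locale instead of the JVM default. */
    static String lower(String tag, Locale locale) {
        return tag.toLowerCase(locale);
    }

    public static void main(String[] args) {
        String tag = "EN-IN";
        String root = lower(tag, Locale.ROOT);    // "en-in"
        String turkish = lower(tag, TURKISH);     // "en-\u0131n" (dotless i)
        System.out.println(root.equals(turkish)); // false
    }
}
```

Calling toLowerCase() without an argument uses the JVM default locale, so code that works on most machines can break when the default locale is Turkish.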

Would it make sense for Commons RDF to strengthen getLanguageTag() to ALWAYS return the language
tag in lower case for any RDF implementation (e.g. normalize if the implementation does not do
it correctly internally), as a kind of interoperability/RDF 1.1 measure, or should we strive
to keep each implementation's current case representation as-is?
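
One way the first option could look is sketched below in Java. The HasLanguageTag interface is a hypothetical stand-in for whatever a binding exposes, and the helper name and the choice of Locale.ROOT are assumptions of this sketch, not the implementation on the branch.

```java
import java.util.Locale;
import java.util.Optional;

/** Sketch of normalizing language tags returned by an underlying implementation. */
final class LanguageTagNormalizer {

    /** Hypothetical stand-in for a literal that exposes its language tag. */
    interface HasLanguageTag {
        Optional<String> getLanguageTag();
    }

    /** Returns the tag in lower case, using a fixed locale; empty stays empty. */
    static Optional<String> normalized(HasLanguageTag literal) {
        return literal.getLanguageTag().map(tag -> tag.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        HasLanguageTag mixedCase = () -> Optional.of("EN-Gb");
        HasLanguageTag untagged = Optional::empty;
        System.out.println(normalized(mixedCase)); // Optional[en-gb]
        System.out.println(normalized(untagged));  // Optional.empty
    }
}
```

The trade-off is that callers can no longer see the case the underlying implementation stored, which is the second option raised in the question.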


> RDF-1.1 specifies that language tags need to be compared using lower-case
> -------------------------------------------------------------------------
>
>                 Key: COMMONSRDF-51
>                 URL: https://issues.apache.org/jira/browse/COMMONSRDF-51
>             Project: Apache Commons RDF
>          Issue Type: Bug
>          Components: api
>    Affects Versions: 0.3.0
>            Reporter: Peter Ansell
>            Assignee: Stian Soiland-Reyes
>
> The RDF-1.1 specification states that the [value space of Literal language tags is lowercase|https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal],
> which does not conflict with the case-insensitive specification in BCP47. The Literal.equals
> and Literal.hashCode API contracts should specify that language tags must be compared using
> lowercase, even if they are otherwise stored and returned as upper-case by getLanguageTag.
> The API currently has incorrect language by saying "character-by-character" for language tag
> comparisons, as that implies case-sensitive comparisons are used.
> The lowercasing must also be done using a locale that is consistent (a known example where
> lowercase and uppercase do not round-trip as expected for US-ASCII characters is Turkish [1]),
> so I would recommend actually stating that .toLowerCase(Locale.ENGLISH) is used.
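
The contract proposed in the issue description could be sketched in Java roughly as follows. LangLiteral is a hypothetical, simplified class, not the Commons RDF Literal interface or any of its implementations: it keeps the tag as given for getLanguageTag() but lower-cases it with Locale.ENGLISH for equals and hashCode.

```java
import java.util.Locale;
import java.util.Objects;

/** Hypothetical language-tagged literal: case-preserving tag, case-insensitive equality. */
final class LangLiteral {
    private final String lexicalForm;
    private final String languageTag;

    LangLiteral(String lexicalForm, String languageTag) {
        this.lexicalForm = Objects.requireNonNull(lexicalForm);
        this.languageTag = Objects.requireNonNull(languageTag);
    }

    /** Returns the tag exactly as it was supplied. */
    String getLanguageTag() {
        return languageTag;
    }

    /** Tag form used for comparison only, lower-cased with a fixed locale. */
    private String comparableTag() {
        return languageTag.toLowerCase(Locale.ENGLISH);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LangLiteral)) return false;
        LangLiteral other = (LangLiteral) o;
        // The lexical form is still compared character by character.
        return lexicalForm.equals(other.lexicalForm)
                && comparableTag().equals(other.comparableTag());
    }

    @Override
    public int hashCode() {
        // Must use the same lower-cased tag as equals to stay consistent.
        return Objects.hash(lexicalForm, comparableTag());
    }
}
```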



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
