lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character
Date Fri, 30 Oct 2009 22:57:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772151#action_12772151 ]

Steven Rowe commented on LUCENE-2019:
-------------------------------------

bq. process-internal is something that won't be stored or interchanged in any way (internal
to the process)

Right, this is the crux of the disagreement: you think storage (with the exception of in-memory
usage) means interchange.  Yonik and I think that storage does not necessarily mean interchange.

Section 16.7 (_Noncharacters_) of the Unicode 5.0.0 standard (the latest version for which
an electronic version of this chapter is available) says:

{quote}
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal
use. They are forbidden for use in open interchange of Unicode text data. See Section 3.4,
Characters and Encoding, for the formal definition of noncharacters and conformance requirements
related to their use.

The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each
plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and
so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition,
there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF.
For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation
Forms-A block, but those noncharacters are not "Arabic noncharacters" or "right-to-left noncharacters,"
and are not distinguished in any other way from the other noncharacters, except in their code
point values.

Applications are free to use any of these noncharacter code points internally but should never
attempt to exchange them. If a noncharacter is received in open interchange, an application
is not required to interpret it in any way. It is good practice, however, to recognize it
as a noncharacter and to take appropriate action, such as removing it from the text. Note
that Unicode conformance freely allows the removal of these characters. (See conformance clause
C7 in Section 3.2, Conformance Requirements.)

In effect, noncharacters can be thought of as application-internal private-use code points.
Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which
are assigned characters and which are intended for use in open interchange, subject to interpretation
by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation
whatsoever outside of their possible application-internal private uses.

*U+FFFF and U+10FFFF.*  These two noncharacter code points have the attribute of being associated
with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF
is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with
the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two
noncharacter code points useful for internal purposes as sentinels. For example, they might
be used to indicate the end of a list, to represent a value in an index guaranteed to be higher
than any valid character value, and so on.
{quote}

(I left out the last part about U+FFFE.)

Again, the crux of the matter is the definition of "open interchange".
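
To make the arithmetic in the quoted section concrete, here is a minimal Java sketch that tests whether a code point is a noncharacter and counts how many there are. The class name Noncharacters and its methods are hypothetical, made up for illustration; they are not taken from the Unicode standard, from Lucene, or from the attached patch.

```java
// Hypothetical helper for illustration only; not part of Lucene or the LUCENE-2019 patch.
final class Noncharacters {
    private Noncharacters() {}

    /** Returns true if codePoint is one of the Unicode noncharacters. */
    static boolean isNoncharacter(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
            return false; // outside the Unicode code space entirely
        }
        // The contiguous range of 32 noncharacters in the BMP: U+FDD0..U+FDEF.
        if (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) {
            return true;
        }
        // The last two code points of each plane: the low 16 bits are FFFE or FFFF.
        return (codePoint & 0xFFFE) == 0xFFFE;
    }

    /** Counts noncharacters by scanning the whole code space (U+0000..U+10FFFF). */
    static int count() {
        int n = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("noncharacters: " + count());
    }
}
```

Run as written, main should report 66, which matches the 34 + 32 breakdown in the quote (two per plane across 17 planes, plus U+FDD0..U+FDEF).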

> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in Unicode; we should not store these in the index.
> Instead they should be mapped to the replacement character (U+FFFD), so they can be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF process-internally: it can't be in the index or it will cause problems.
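
One way to picture the mapping the issue description proposes is the Java sketch below, which rewrites a string so that every noncharacter becomes U+FFFD. It is only a sketch under assumptions: the class NoncharacterMapper and its methods are hypothetical, it treats all 66 noncharacters alike (the description names U+FFFF only as an example), and it does not claim to reflect what the attached patch actually does.

```java
// Hypothetical sketch for illustration only; not the LUCENE-2019 patch and not a Lucene API.
final class NoncharacterMapper {
    /** U+FFFD REPLACEMENT CHARACTER. */
    static final int REPLACEMENT = 0xFFFD;

    private NoncharacterMapper() {}

    /** True for U+FDD0..U+FDEF and for the last two code points of each plane. */
    static boolean isNoncharacter(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
            return false;
        }
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    /** Returns a copy of input with every noncharacter replaced by U+FFFD. */
    static String mapToReplacement(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Read a whole code point so supplementary noncharacters
            // such as U+1FFFE are seen as one unit, not two surrogates.
            int cp = input.codePointAt(i);
            out.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String mapped = mapToReplacement("a\uFFFFb");
        System.out.println(mapped.equals("a\uFFFDb"));
    }
}
```

Note that a supplementary-plane noncharacter occupies two UTF-16 code units while U+FFFD occupies one, so the mapped string can be shorter than the input.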


