lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character
Date Fri, 30 Oct 2009 22:35:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772135#action_12772135 ]

Robert Muir commented on LUCENE-2019:
-------------------------------------

bq. I strongly disagree with the assumption that interchange and serialization are synonymous.

Actually, I won't argue with you too much about this. I only care about lucene-java.

bq. I actually agree with this argument. What if Lucene needs more process-internal characters?
I don't have any way of gauging the probability that it will in the future (other than the
last eight years of history, during which only one was deemed necessary). But what does Mike
M. say? "Design for now" or something like that?

Right, the point is that in my own processing as a user, I might need delimiters or whatever.
I should not have to worry about Lucene treating them as an *abstract character*, because the
Unicode standard says it should not.
So for example, if I create a MultiTermQuery, I should be able to use both U+FFFE and U+FFFF
internally, perhaps to delimit things for different reasons, without any concern that they
are stored in term text.
Storing them in term text, by definition, treats them as abstract characters.
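
To make the idea concrete, here is a minimal sketch of the mapping, not the attached patch. It assumes
"process-internal" means the Unicode noncharacters (U+FDD0..U+FDEF plus the last two code points
of every plane, such as U+FFFE and U+FFFF). The class and method names (NoncharacterMapper,
isNoncharacter, sanitize) are hypothetical and are not Lucene APIs.

```java
// Hypothetical helper, not part of Lucene: replaces Unicode noncharacters
// with U+FFFD before text would be written as term text.
final class NoncharacterMapper {
    private static final int REPLACEMENT = 0xFFFD;

    private NoncharacterMapper() {}

    // Assumption: "process-internal" == Unicode noncharacter, i.e.
    // U+FDD0..U+FDEF or any code point whose low 16 bits are FFFE or FFFF.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Walks the string by code point so supplementary-plane
    // noncharacters (e.g. U+1FFFE) are handled as well.
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With something like this applied on the way into the index, sanitize("a\uFFFFb") would store
U+FFFD in place of U+FFFF, so user code remains free to use U+FFFE and U+FFFF as in-memory
delimiters without them ever appearing in term text.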

> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store these in
> the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can be used
> process-internally.
> An example of this is how Lucene Java currently uses U+FFFF process-internally, it can't
> be in the index or will cause problems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


