lucene-dev mailing list archives

From "Steven Rowe (JIRA)"
Subject [jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character
Date Fri, 30 Oct 2009 22:33:59 GMT


Steven Rowe commented on LUCENE-2019:

bq. Steven, the only reason I might disagree is that a Lucene Index is supposed to be portable
across different languages other than Lucene Java.

Right, but not all Lucene indexes in the wild are accessed from more than one language.  The
vast majority of Lucene index uses, I'd venture to guess, are single-language, single-process.

bq. in my opinion, if you are to store process-internal codepoints as abstract characters
in terms, then you should not claim that Lucene indexes are in any Unicode format, because
then they violate the standard.

I strongly disagree with the assumption that interchange and serialization are synonymous.

bq. By *not* storing them in terms, then you are free to use them as delimiters, or other
purposes. right now U+FFFF is used as a delimiter, but who knows, maybe someday you might
need more?

I actually agree with this argument.  What if Lucene needs more process-internal characters?
I don't have any way of gauging the probability that it will in the future (other than the
last eight years of history, during which only one was deemed necessary).  But what does Mike
M. say? "Design for now" or something like that?

> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>                 Key: LUCENE-2019
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in Unicode; we should not store these in
> the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can be used
> process-internally.
> An example of this is how Lucene Java currently uses U+FFFF process-internally: it can't
> be in the index or it will cause problems.
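
For illustration only, here is a minimal sketch of the mapping the issue describes. It is
not taken from LUCENE-2019.patch. It assumes that "process-internal codepoints" means the
66 Unicode noncharacters (U+FDD0..U+FDEF, plus the last two code points of every plane,
such as U+FFFE and U+FFFF). The class and method names are hypothetical.

```java
// Hypothetical helper, not part of Lucene: maps Unicode noncharacters to U+FFFD.
final class NoncharacterMapper {
    private static final int REPLACEMENT = 0xFFFD;

    private NoncharacterMapper() {}

    // Assumption: "process-internal" is read as "Unicode noncharacter".
    static boolean isNoncharacter(int cp) {
        // U+FDD0..U+FDEF, or any code point ending in FFFE/FFFF (all 17 planes).
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of s with every noncharacter replaced by U+FFFD.
    static String mapNoncharacters(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);           // handles surrogate pairs
            out.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String term = "foo\uFFFFbar";
        // U+FFFF would collide with an internal delimiter, so it is replaced.
        System.out.println(mapNoncharacters(term).equals("foo\uFFFDbar"));
    }
}
```

Applied to term text before it is written, a mapping of this kind would leave U+FFFF (and
the other noncharacters) free for internal use such as delimiters, at the cost of not
round-tripping those code points.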

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
