lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator
Date Fri, 21 Mar 2008 10:19:34 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581049#action_12581049 ]

Michael McCandless commented on LUCENE-1241:
--------------------------------------------

{quote}
we can't handle a string that really contains \uffff
{quote}
A string containing \uffff is an invalid UTF-16 string for interchange.  The standard explicitly
allows for certain characters (including this one) to be used for internal purposes.

{quote}
However, I agree with the usage for assertion, that "\uffff" is placed after at the end of
a string in a char sequence.
{quote}
I don't think this is necessary for assertion.  The memory cost for this is sizable.  Right
now tracking a string's length consumes 2 bytes (0xffff char) per posting.  By adding length
we're consuming an additional 4 bytes.  While indexing, there are a large number of postings
(one per unique term) so this added RAM usage is not negligible.
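
To make the arithmetic above concrete, here is a rough sketch of the per-posting overhead for
each option.  The class name, the method and the figure of one million unique terms are all
made up for illustration; only the 2-byte char and 4-byte int sizes come from the discussion
above.

```java
public class PostingRamSketch {
    // Sizes taken from the discussion: a Java char is 2 bytes, an int is 4 bytes.
    static final int TERMINATOR_BYTES = 2; // the trailing 0xffff char
    static final int LENGTH_BYTES = 4;     // an explicit int length field

    // Hypothetical helper: bytes spent on marking where terms end,
    // summed over all postings (one posting per unique term).
    static long endTrackingBytes(long uniqueTerms, boolean keepTerminator, boolean storeLength) {
        long perPosting = 0;
        if (keepTerminator) {
            perPosting += TERMINATOR_BYTES;
        }
        if (storeLength) {
            perPosting += LENGTH_BYTES;
        }
        return uniqueTerms * perPosting;
    }

    public static void main(String[] args) {
        long uniqueTerms = 1_000_000L; // assumed figure, not a measurement
        System.out.println("terminator only: " + endTrackingBytes(uniqueTerms, true, false));
        System.out.println("length only:     " + endTrackingBytes(uniqueTerms, false, true));
        System.out.println("both:            " + endTrackingBytes(uniqueTerms, true, true));
    }
}
```

With these assumed numbers, keeping both costs three times what the terminator alone costs,
which is why doing both looks hard to justify.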

I think we should do one or the other, but not both.

Really the tradeoff we are exploring here is whether using up 2 more bytes per term, which
causes us to flush sooner & merge more often for a given RAM buffer size, is offset by
the speedup of not having to check for 0xffff and compute length in certain places.
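
For readers following along, the two ways of finding the end of a term can be sketched roughly
as below.  This is not the DocumentsWriter code; the class, the method names and the flat char[]
"pool" layout are assumptions chosen only to show where the 0xffff check and the length
computation come from.

```java
public class TermEndSketch {
    static final char END_MARKER = '\uffff';

    // Terminator approach: the length is not stored, so it has to be
    // recovered by scanning forward until the 0xffff char is found.
    static int lengthByMarker(char[] pool, int start) {
        int pos = start;
        while (pool[pos] != END_MARKER) {
            pos++;
        }
        return pos - start;
    }

    // Terminator approach: compare a stored term against a token.
    // After the token's chars match, the next pooled char must be 0xffff.
    // Assumes the token itself never contains 0xffff.
    static boolean sameTermByMarker(char[] pool, int start, char[] token, int tokenLength) {
        for (int i = 0; i < tokenLength; i++) {
            if (pool[start + i] != token[i]) {
                return false;
            }
        }
        return pool[start + tokenLength] == END_MARKER;
    }

    // Explicit-length approach: the length check is a single int compare
    // and no marker is consulted, at the price of storing the int.
    static boolean sameTermByLength(char[] pool, int start, int length,
                                    char[] token, int tokenLength) {
        if (length != tokenLength) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (pool[start + i] != token[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Whether skipping the scan and the marker check pays for the extra bytes is exactly what a
benchmark would need to show.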

One problem with the patch is that you forgot to add the extra int (4 bytes) to POSTING_NUM_BYTE
in DocumentsWriter.  This is important because the tradeoff we are exploring here is whether
increasing the RAM usage of a Posting, which causes more frequent flushing, while saving
some of the cost of comparing to 0xffff in certain places, is net/net a performance "win".
Can you fix this?  Thanks.

Have you run any performance tests to assess the impact of this change?  I think that's critical
here since if this is net/net a performance loss we shouldn't make the change.

> 0xffff char is not a string terminator
> --------------------------------------
>
>                 Key: LUCENE-1241
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1241
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Hiroaki Kawai
>         Attachments: LUCENE-1241.patch
>
>
> Current trunk index.DocumentWriter uses "\uffff" as a string terminator, but it should
> not to be for some reasons. \uffff is not a terminator char itself and we can't handle a string
> that really contains \uffff. And also, we can calculate the end char position in a character
> sequence from the string length that we already know.
> However, I agree with the usage for assertion, that "\uffff" is placed after at the end
> of a string in a char sequence.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

