lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator
Date Tue, 25 Mar 2008 09:23:24 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581843#action_12581843 ]

Michael McCandless commented on LUCENE-1241:
--------------------------------------------

{quote}
I think we should not use \uffff as a terminator in Lucene library regardless of the fact
that it is allowed in Unicode standard, because it is unnecessary.
{quote}

I'm not yet convinced it's unnecessary.  We need to run performance
tests to understand the time/space tradeoff here.  If this change
speeds up indexing we should do it.  RAM is cheap.

By far, the Posting instances consume the most RAM in DocumentsWriter.
Right now each Posting is 66 bytes; this patch, once finished,
increases that to 68 bytes.

I don't like increasing the byte usage of Posting unless there's a
good counterbalance, which I think this change *may* have if we see
that it improves indexing speed.

I just checked: when indexing Wikipedia with a 64 MB buffer, each
segment flushed has ~430,000 Posting instances.  So the Posting
instances alone account for 27 MB of the buffer.

That means the added 2 bytes from this change will consume ~840 KB
of additional RAM, which is a not-insignificant loss of RAM efficiency.
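
To make the arithmetic above easy to re-check, here is a minimal sketch in Java. The class and method names are hypothetical and are not part of Lucene; the only inputs are the figures quoted in this comment (~430,000 Posting instances, 66 bytes now, 68 bytes with the patch).

```java
public class PostingRamEstimate {
    // Figures quoted in this comment; nothing here is measured by the code.
    static final long POSTINGS_PER_SEGMENT = 430_000L;
    static final int BYTES_PER_POSTING_NOW = 66;
    static final int BYTES_PER_POSTING_PATCHED = 68;

    /** Total bytes used by `count` Posting instances of the given size. */
    public static long totalBytes(long count, int bytesPerPosting) {
        return count * bytesPerPosting;
    }

    public static void main(String[] args) {
        long before = totalBytes(POSTINGS_PER_SEGMENT, BYTES_PER_POSTING_NOW);
        long after = totalBytes(POSTINGS_PER_SEGMENT, BYTES_PER_POSTING_PATCHED);
        // Convert to binary MB / KB for comparison with the 64 MB buffer.
        System.out.printf("Posting RAM now: %.1f MB%n", before / (1024.0 * 1024.0));
        System.out.printf("Extra RAM with patch: %.0f KB%n", (after - before) / 1024.0);
    }
}
```

With these inputs the totals are 28,380,000 bytes (about 27 MB) before the patch and 860,000 extra bytes (about 840 KB) after it, i.e. roughly 3% more RAM spent on Posting instances.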

[Aside: by Zipf's law, the vast majority of these terms should occur
rarely.  E.g., roughly half will occur only once.  If we could find some
way to represent these rare terms with a much more compact structure
(Posting has a lot of "overhead" to efficiently manage a long posting
list) then we would greatly increase DW's RAM efficiency.]




> 0xffff char is not a string terminator
> --------------------------------------
>
>                 Key: LUCENE-1241
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1241
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Hiroaki Kawai
>         Attachments: ComparableCharSequence.java, LUCENE-1241.patch
>
>
> Current trunk index.DocumentWriter uses "\uffff" as a string terminator, but it should
> not, for several reasons. \uffff is not itself a terminator char, and we can't handle a string
> that really contains \uffff. Also, we can calculate the end char position in a character
> sequence from the string length, which we already know.
> However, I agree with its use for assertions, where "\uffff" is placed at the end
> of a string in a char sequence.
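
The two approaches described in the issue can be contrasted with a small sketch in Java. This is an illustration only, not Lucene's actual code: the class, the method names, and the shared `char[]` pool layout are all assumptions made for the example.

```java
public class TermCompareSketch {
    // Hypothetical end marker, mirroring the "\uffff" terminator under discussion.
    static final char END = '\uffff';

    /** Sentinel style: a stored term ends wherever END appears in the pool. */
    public static boolean equalsSentinel(char[] pool, int start, char[] token, int tokenLen) {
        int pos = start;
        for (int i = 0; i < tokenLen; i++, pos++) {
            if (pos >= pool.length || pool[pos] != token[i]) return false;
        }
        // Match only if the stored term ends exactly here.
        return pos < pool.length && pool[pos] == END;
    }

    /** Length style: the stored term's length is kept separately, so no marker is needed. */
    public static boolean equalsByLength(char[] pool, int start, int termLen,
                                         char[] token, int tokenLen) {
        if (termLen != tokenLen) return false;
        for (int i = 0; i < tokenLen; i++) {
            if (pool[start + i] != token[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Two stored terms, "ab" and "cd", each followed by the END marker.
        char[] pool = "ab\uffffcd\uffff".toCharArray();
        char[] tricky = "ab\uffffcd".toCharArray(); // a token that itself contains END

        // Sentinel comparison runs across the term boundary and reports a match.
        System.out.println(equalsSentinel(pool, 0, tricky, tricky.length));
        // Length comparison knows the stored term has length 2 and rejects it.
        System.out.println(equalsByLength(pool, 0, 2, tricky, tricky.length));
    }
}
```

In this sketch the sentinel version wrongly matches a token containing \uffff against the stored term "ab", while the length version rejects it. The price is that a length has to be stored somewhere for each term, which is the kind of time/space tradeoff the comment above wants measured.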

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

