lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2942) toString() methods on term/queries/etc are wrong: assume utf-8 encoded bytes.
Date Mon, 28 Feb 2011 18:00:38 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000435#comment-13000435
] 

Robert Muir commented on LUCENE-2942:
-------------------------------------

Uwe: my plan is to actually fix toString itself (toString should be human readable, that's
its purpose!)

The existing code should be bytesToString() or hexToString() or something of that nature;
this way, if you explicitly want bytes, you can get that.

{quote}
Internally it could quickly be implemented as calling utf8ToString() and falling back on Exception.
Or is there a faster way to detect if it's valid UTF-8?
{quote}

Not really, you cannot trust the JRE to do this correctly, e.g. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6982052

Additionally the behavior of malformed bytes is undefined, e.g. IBM JREs use IGNORE but Sun
JREs use REPLACE... even if they actually detected correctly :)
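
To illustrate why the result depends on the malformed-input policy, here is a minimal sketch using the JDK's CharsetDecoder with each CodingErrorAction set explicitly. The class and method names (MalformedInputDemo, decodeOrNull) are made up for this example, and it does not reproduce any vendor's default behavior; it only shows how the three actions differ on the same bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
final class MalformedInputDemo {

    // Decodes bytes as UTF-8 under the given policy.
    // Returns null if the decoder reports an error (only possible with REPORT).
    static String decodeOrNull(byte[] bytes, CodingErrorAction action) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {'a', (byte) 0xFF, 'b'};

        // IGNORE silently drops the bad byte.
        System.out.println("ab".equals(decodeOrNull(bad, CodingErrorAction.IGNORE)));
        // REPLACE substitutes U+FFFD for the bad byte.
        System.out.println("a\uFFFDb".equals(decodeOrNull(bad, CodingErrorAction.REPLACE)));
        // REPORT refuses to decode at all.
        System.out.println(decodeOrNull(bad, CodingErrorAction.REPORT) == null);
    }
}
```

With IGNORE or REPLACE the caller gets a string back and cannot tell from it alone that the input was malformed, which is why neither is a usable validity check.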

Don't worry, I will take care of this part.


> toString() methods on term/queries/etc are wrong: assume utf-8 encoded bytes.
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-2942
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2942
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>
> In Lucene's trunk, a Term is just a BytesRef.
> In a lot of cases this is a UTF-8 encoded string, but in some cases it's not (e.g. collation fields).
> The problem is that the toString methods all currently call utf8ToString().
> This is wrong, though from a practical point of view I think just printing the bytes won't be very helpful for debugging most cases where the bytes really are UTF-8 encoded.
> So I think in these cases we should use the following technique: if the bytes are a valid UTF-8 sequence, use BytesRef.utf8ToString(); otherwise just print the bytes with BytesRef.toString().
> It's no problem for performance because toString is only for debugging anyway.
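
The technique proposed in the description could be sketched roughly as follows. This is not the patch for this issue: the class DebugBytes and its methods are hypothetical, it works on a plain byte[] with offset and length rather than on a real BytesRef, and the bracketed hex format is an assumption. Validity is checked by hand against the well-formed UTF-8 byte ranges from the Unicode Standard instead of relying on the JRE's decoder to detect malformed input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
final class DebugBytes {

    // Strict check: rejects overlong forms, surrogates, values above U+10FFFF,
    // stray continuation bytes, and truncated sequences.
    static boolean isValidUtf8(byte[] b, int offset, int length) {
        int i = offset;
        final int end = offset + length;
        while (i < end) {
            int b0 = b[i++] & 0xFF;
            if (b0 < 0x80) continue;               // ASCII
            int need;                               // continuation bytes expected
            int lo = 0x80, hi = 0xBF;               // allowed range for the next byte
            if (b0 >= 0xC2 && b0 <= 0xDF) {
                need = 1;
            } else if (b0 >= 0xE0 && b0 <= 0xEF) {
                need = 2;
                if (b0 == 0xE0) lo = 0xA0;          // no overlong 3-byte forms
                else if (b0 == 0xED) hi = 0x9F;     // no surrogates U+D800..U+DFFF
            } else if (b0 >= 0xF0 && b0 <= 0xF4) {
                need = 3;
                if (b0 == 0xF0) lo = 0x90;          // no overlong 4-byte forms
                else if (b0 == 0xF4) hi = 0x8F;     // nothing above U+10FFFF
            } else {
                return false;                       // 0x80..0xC1 or 0xF5..0xFF
            }
            if (end - i < need) return false;       // truncated sequence
            for (int k = 0; k < need; k++) {
                int c = b[i++] & 0xFF;
                if (c < lo || c > hi) return false;
                lo = 0x80;                          // only the first continuation
                hi = 0xBF;                          // byte has a restricted range
            }
        }
        return true;
    }

    // Renders the bytes as space-separated hex inside brackets.
    static String toHex(byte[] b, int offset, int length) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = offset; i < offset + length; i++) {
            if (i > offset) sb.append(' ');
            sb.append(Integer.toHexString(b[i] & 0xFF));
        }
        return sb.append(']').toString();
    }

    // Human-readable text when the bytes are valid UTF-8, hex otherwise.
    static String toDebugString(byte[] b, int offset, int length) {
        if (isValidUtf8(b, offset, length)) {
            return new String(b, offset, length, StandardCharsets.UTF_8);
        }
        return toHex(b, offset, length);
    }

    public static void main(String[] args) {
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {(byte) 0xC0, (byte) 0x80};  // overlong encoding, not valid UTF-8
        System.out.println(toDebugString(text, 0, text.length));     // hello
        System.out.println(toDebugString(binary, 0, binary.length)); // [c0 80]
    }
}
```

Because the bytes are validated before decoding, the decode step never hits a malformed sequence, so the result does not depend on how a given JRE handles one.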


        
