couchdb-dev mailing list archives

From "He Shiming (JIRA)" <j...@apache.org>
Subject [jira] Commented: (COUCHDB-760) Put attachments with cyrillic names is fail.
Date Sat, 05 Feb 2011 12:54:31 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990974#comment-12990974
] 

He Shiming commented on COUCHDB-760:
------------------------------------

Hi again. After debugging for half a day, I've got some insights regarding the failure of
the test.

couch_util.validate_utf8_fast is mostly correct, with minor problems. According to Wikipedia,
the multi-byte clauses should look like this:

    <<_:O/binary, C1, C2, _/binary>> when
            C1 >= 192, C1 =< 223,
            C2 >= 128, C2 =< 191 ->
        validate_utf8_fast(B, 2 + O);
    <<_:O/binary, C1, C2, C3, _/binary>> when
            C1 >= 224, C1 =< 239,
            C2 >= 128, C2 =< 191,
            C3 >= 128, C3 =< 191 ->
        validate_utf8_fast(B, 3 + O);
    <<_:O/binary, C1, C2, C3, C4, _/binary>> when
            C1 >= 240, C1 =< 247,
            C2 >= 128, C2 =< 191,
            C3 >= 128, C3 =< 191,
            C4 >= 128, C4 =< 191 ->
        validate_utf8_fast(B, 4 + O);
    _ ->

With this change the routine should be correct in theory. I extracted it and tested it against
some strings, and it returned correct results.
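
For readers who want to try the same byte ranges outside Erlang, here is a minimal sketch of
the check in JavaScript. The function name validateUtf8Fast and the single-byte (below 128)
branch are assumptions for illustration; the excerpt above does not show that clause. Like the
ranges quoted above, it does not reject overlong encodings or surrogate code points.

```javascript
// Hypothetical JavaScript port of the byte ranges quoted above.
// `bytes` is an array (or Uint8Array) of integers in 0..255.
function validateUtf8Fast(bytes) {
  // True when b exists and lies in [lo, hi]; a missing byte (undefined) fails.
  const inRange = (b, lo, hi) => b !== undefined && b >= lo && b <= hi;
  const isContinuation = (b) => inRange(b, 128, 191);

  let o = 0; // current offset, like O in the Erlang clauses
  while (o < bytes.length) {
    const c1 = bytes[o];
    if (c1 < 128) {
      // Assumed single-byte (ASCII) clause, not shown in the excerpt.
      o += 1;
    } else if (inRange(c1, 192, 223) && isContinuation(bytes[o + 1])) {
      o += 2;
    } else if (inRange(c1, 224, 239) &&
               isContinuation(bytes[o + 1]) &&
               isContinuation(bytes[o + 2])) {
      o += 3;
    } else if (inRange(c1, 240, 247) &&
               isContinuation(bytes[o + 1]) &&
               isContinuation(bytes[o + 2]) &&
               isContinuation(bytes[o + 3])) {
      o += 4;
    } else {
      return false; // stray continuation byte, bad lead byte, or truncated sequence
    }
  }
  return true;
}
```

Under these ranges the byte list 102,111,111,128,116,120,116 is rejected, while
102,111,111,194,128,116,120,116 is accepted.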

Regarding the tests, the first one is easy to fix. Since you are saving "Колян.txt", you
should retrieve the attachment by that name: var xhr = CouchDB.request("GET", "/test_suite_db/good_doc/Колян.txt");
With that change, this test actually passes.

I'm not able to fix the rest of the tests, and their failures seem to be related. After
debugging, I discovered that a JavaScript string "foo\x80txt", which is not valid UTF-8, is
altered before Erlang gets to see it.

couch_util.validate_utf8_fast is supposed to see <<102, 111, 111, 128, 116, 120, 116>>.
But it saw <<102,111,111,194,128,116,120,116>> instead.

Either the browser or the Erlang httpd has attempted to fix the incorrect UTF-8 encoding,
making it impossible for CouchDB to see it. Since the original code ruled out anything beyond
128, the test passes with the original code.
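
As a quick check of where the extra byte could come from, the sketch below encodes the same
JavaScript string as UTF-8. It assumes a runtime that provides TextEncoder (modern browsers and
recent Node.js). In JavaScript, "\x80" is the code point U+0080 rather than a raw byte, so any
UTF-8 encoding step turns it into the two bytes 194, 128. This suggests, but does not prove,
that the change happens on the client side.

```javascript
// "\x80" in a JavaScript string is the character U+0080, not the byte 0x80.
const name = "foo\x80txt";

// UTF-8 encode the string the way a client typically would before sending it.
const utf8Bytes = Array.from(new TextEncoder().encode(name));
console.log(utf8Bytes); // [102, 111, 111, 194, 128, 116, 120, 116]

// The same thing happens when the name is escaped for use in a URL.
const escaped = encodeURIComponent(name);
console.log(escaped); // "foo%C2%80txt"
```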

So in order for UTF-8 attachment names to work, this test will need to be rewritten. I tried
other combinations of the string, but I was unable to get past the "encoding fix". CouchDB
always sees a correct encoding.
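
One possible direction for a rewritten test, offered only as an untested sketch: build the
attachment name from raw byte values and percent-encode each byte, so that no UTF-8 encoding
step is applied to the invalid byte. The helper percentEncodeBytes is hypothetical, and whether
the test harness and the Erlang httpd pass "%80" through unchanged has not been verified here.

```javascript
// Hypothetical helper: percent-encode raw bytes for use in a URL path segment.
// Unreserved ASCII characters are kept as-is; every other byte becomes %XX.
function percentEncodeBytes(bytes) {
  return Array.from(bytes, (b) => {
    const ch = String.fromCharCode(b);
    return /^[A-Za-z0-9\-._~]$/.test(ch)
      ? ch
      : "%" + b.toString(16).toUpperCase().padStart(2, "0");
  }).join("");
}

// The byte sequence the validator was supposed to see for "foo\x80txt".
const rawName = percentEncodeBytes([102, 111, 111, 128, 116, 120, 116]);
console.log(rawName); // "foo%80txt"
```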

> Put attachments with cyrillic names is fail.
> --------------------------------------------
>
>                 Key: COUCHDB-760
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-760
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.11
>         Environment: Distributor ID:	Ubuntu
> Description:	Ubuntu 9.10
> Release:	9.10
> Codename:	karmic
>            Reporter: Antonio
>              Labels: attachments
>         Attachments: COUCHDB-760.patch, couchdb_760.patch
>
>
> I try upload any file with cyrillic name(like Колян.txt) and its fail 
> i try with futon.
> And create test http://friendpaste.com/WrVoFIOZb3T5r70Fz8XWB (see line 22):
> this test is fail with # Exception raised: {"error":"bad_request","reason":"Attachment name is not UTF-8 encoded"}

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
