Mailing-List: contact dev-help@stdcxx.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@stdcxx.apache.org
Received-SPF: pass (athena.apache.org: domain of msebor@gmail.com designates
 209.85.146.183 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=message-id:date:from:organization:user-agent:mime-version:to:cc:subject:references:in-reply-to:content-type:sender;
        b=W6sU5yFhWI73GSknOFrQsFIuN0+KJ2eQxLdICaH0rNwrx+N/GRbRKCckzFpwJdwuYjkZjuUWAzTclTNXLD+bbg2gfvAfq/1CRwtGQAWIc5ye6R41Rx1kGOtkObg2djTz0o8qvgpy0VoKi4u/+DNwZ+Itb5k98m9p3cAAO2PZnRU=
Message-ID: <47D718D8.6010901@roguewave.com>
Date: Tue, 11 Mar 2008 17:42:16 -0600
From: Martin Sebor <sebor@roguewave.com>
Organization: Rogue Wave Software, Inc.
User-Agent: Thunderbird 2.0.0.12 (X11/20080226)
MIME-Version: 1.0
To: dev@stdcxx.apache.org
CC: Eric Lemings <lemings@roguewave.com>,
 Martin Sebor <sebor@roguewave.com>
Subject: Re: STDCXX-435
References: 
 <CFFDD219128FD94FB4F92B99F52D0A49010A3640@exchmail01.Blue.Roguewave.Com>
In-Reply-To: 
 <CFFDD219128FD94FB4F92B99F52D0A49010A3640@exchmail01.Blue.Roguewave.Com>
Content-Type: multipart/mixed;
 boundary="------------030004060705060201050902"
Sender: Martin Sebor <msebor@gmail.com>

--------------030004060705060201050902
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

I don't think it's that simple. IIRC, the problem is that we're using
mbsrtowcs() on sequences that aren't guaranteed to be NUL-terminated.
It seems that it should be straightforward to either specify a limit
to mbsrtowcs() that's small enough so as to prevent the function from
ever reaching the end of the (non-NUL-terminated) source sequence, or
make sure there is a NUL at the end (by copying the short subsequence
at the end of the source sequence into a small local buffer and NUL
terminating it there. I recall trying to get that approach to work
at first and failing. I don't remember why it didn't work anymore,
if it was because the code became too complex and inefficient or
if it simply wasn't possible to guarantee that it would be correct
in all cases.

In any event, I came to the conclusion that we can't call mbsrtowcs()
to convert whole ranges of characters at once but that we must do the
conversion one character at a time instead. That's what the attached
patch does (I think). Since I wrote it months ago I don't remember
how extensively I tested it, or if it even applies cleanly after so
much time has passed. It does appear to fix the bug in the originally
submitted test case though :)

Martin

Eric Lemings wrote:
>  
>>>From Martin's comments in the Jira issue:
>  
> "...The function attempts to convert the source sequence up until the
> terminating NUL (or an invalid byte) or until it has produced the
> requested number of destitation characters. When the destination buffer
> is large enough for more the number of characters in the source sequence
> the function just keeps converting past the end."
>  
> More than that, I think the call to mbsrtowcs() is using the wrong
> length.  It's using dst_len when it should be using src_len which is
> smaller in the test case listed in the issue.
>  
> Some debugging excerpts:
>  
>     __rw_libc_do_in (state=@0x7fff5d880800
> <mailto:state=@0x7fff5d880800> , from=0x7fff5d880820 "abc",
>     from_end=0x7fff5d880821 "bc", from_next=@0x7fff5d8807f8
> <mailto:from_next=@0x7fff5d8807f8> , to=0x7fff5d880810,
>     to_limit=0x7fff5d880818, to_next=@0x7fff5d8807f0
> <mailto:to_next=@0x7fff5d8807f0> )
>     at /work/stdcxx/trunk.1/src/wcodecvt.cpp:357
> 
>     372         const _RWSTD_SIZE_T ret = mbsrtowcs (pdst, &psrc,
> dst_len, &state);
>  
>     (gdb) print src_len
>     $6 = 1
>     (gdb) print dst_len
>     $7 = 2
> 
> I think it should actually use min(src_len, dst_len).
>  
> Thoughts?
>  
> Brad.
>  
> 


--------------030004060705060201050902
Content-Type: text/x-patch;
 name="stdcxx-435.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="stdcxx-435.patch"

Index: src/wcodecvt.cpp
===================================================================
--- src/wcodecvt.cpp	(revision 636063)
+++ src/wcodecvt.cpp	(working copy)
@@ -342,9 +342,6 @@
 }
 
 
-//  This returns two result codes:  error and ok. The partial error result
-//  is not  returned because there  is no way  to know whether or  not the
-//  input sequence contains any more valid characters.
 static _STD::codecvt_base::result
 __rw_libc_do_in (_RWSTD_MBSTATE_T &state,
                  const char       *from, 
@@ -359,95 +356,59 @@
 
     _STD::codecvt_base::result res = _STD::codecvt_base::ok;
 
-    _RWSTD_MBSTATE_T save_state = state;   // saved state before conversion
+    // compute the length of the source sequence in bytes and
+    // the size of the destination buffer in wide characters
+    _RWSTD_SIZE_T src_len  = from_end - from;
+    _RWSTD_SIZE_T dst_size = to_limit - to;
 
-    _RWSTD_SIZE_T src_len = from_end - from;   // source length
-    _RWSTD_SIZE_T dst_len = to_limit - to;       // destination length
+    // set the initial values to the source and destination pointers
+    const char* psrc = from;
+    wchar_t*    pdst = to;
 
-    const char*   psrc = from_next ? from_next : "";   // source
-    wchar_t*      pdst = to_next;                      // destination
+    while (dst_size && src_len) {
 
-#ifndef _RWSTD_NO_MBSRTOWCS
-    // mbsrtowcs() requires the input to be a NULL-terminated string
-    const _RWSTD_SIZE_T ret = mbsrtowcs (pdst, &psrc, dst_len, &state);
-#else   // if defined (_RWSTD_NO_MBSRTOWCS)
-    const _RWSTD_SIZE_T ret = _RWSTD_SIZE_MAX;
-#endif    // _RWSTD_NO_MBSRTOWCS
+        // the number of bytes that form the next multibyte character
+        _RWSTD_SIZE_T nbytes;
 
-    // if an error occurred during the restartable function
-    // or if that function is not available
-    if (_RWSTD_SIZE_MAX == ret) {
-        // the conversion here (besides the previous failure) is done 
-        // one character at a time because the non-reentrant/restartable 
-        // counterpart of mbsrtowcs() does not provide any information
-        // about the size of the input that has been processed.
-        _RWSTD_UNUSED (state);
-
-        // restore `psrc' value
-        psrc = from_next ? from_next : "";
-
-        while (dst_len && src_len) {
-
-            _RWSTD_SIZE_T tmp;
-
 #ifndef _RWSTD_NO_MBRTOWC
-            tmp = mbrtowc (pdst, psrc, src_len, &state);
+        nbytes = mbrtowc (pdst, psrc, src_len, &state);
 #elif !defined (_RWSTD_NO_MBTOWC)
-            tmp = mbtowc (pdst, psrc, src_len);
+        nbytes = mbtowc (pdst, psrc, src_len);
 #else
-            tmp = _RWSTD_SIZE_MAX;
+        nbytes = _RWSTD_SIZE_MAX;
 #endif
 
-            // error; -1 result comes only from an illegal sequence
-            if (_RWSTD_SIZE_MAX == tmp) {
-                res = _STD::codecvt_base::error;
-                break;
-            }
+        // -1 indicates an invalid sequence (i.e., error)
+        if (nbytes == (_RWSTD_SIZE_T)(-1)) {
+        res = _STD::codecvt_base::error;
+            break;
+        }
  
-            // not enough bytes in input to form a valid 
-            // character - translates to an ok result
-            if (tmp == (_RWSTD_SIZE_T)(-2))
-                break;
+        // -2 indicates an ambiguous but valid subsequence
+        // (i.e., ok)
+        if (nbytes == (_RWSTD_SIZE_T)(-2))
+            break;
 
-            // the multibyte sequence is the NULL character
-            if (tmp == 0) 
-                tmp++;
+        // 0 indicates the NUL character (skip over it)
+        if (nbytes == 0) 
+            ++nbytes;
 
-            // adjust the pointers
-            psrc    += tmp;
-            src_len -= tmp;
-            ++pdst;
-            --dst_len;
-        }
-
-        // adjust "next" pointers
-        from_next = psrc;
-        to_next   = pdst;
-
+        // > 0 indicates the number of bytes in the successfully
+        // converted multibyte character
+        psrc    += nbytes;
+        src_len -= nbytes;
+        ++pdst;
+        --dst_size;
     }
-    else {
-        // the conversion succeeded on the first attempt
 
-        if (psrc)
-            from_next = psrc;
-        else {
+    // adjust "next" pointers
+    from_next = psrc;
+    to_next   = pdst;
 
-            // mbsrtowcs() sets `psrc' to 0 if the conversions
-            // stops due to the terminating NUL character
-
-            const _RWSTD_SIZE_T tmp =
-                __rw_libc_mbrlen (save_state, from_next, ret);
-            
-            from_next += tmp;
-        }
-
-        to_next += ret;
-    }
-
     // if the conversion has exhausted all space in the destination
     // range AND there are more COMPLETE characters in the source
     // range then we have a "partial" conversion
-    if (res == _STD::codecvt_base::ok && src_len && !dst_len) {
+    if (res == _STD::codecvt_base::ok && src_len && !dst_size) {
         _RWSTD_MBSTATE_T tmp_state = state;
         _RWSTD_SIZE_T tmp = __rw_libc_mbrlen (tmp_state, psrc, src_len);
         if (tmp < (_RWSTD_SIZE_T)(-2))

--------------030004060705060201050902--