Return-Path: Delivered-To: apmail-stdcxx-dev-archive@www.apache.org Received: (qmail 66079 invoked from network); 11 Mar 2008 23:42:48 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 11 Mar 2008 23:42:48 -0000 Received: (qmail 46750 invoked by uid 500); 11 Mar 2008 23:42:45 -0000 Delivered-To: apmail-stdcxx-dev-archive@stdcxx.apache.org Received: (qmail 46732 invoked by uid 500); 11 Mar 2008 23:42:45 -0000 Mailing-List: contact dev-help@stdcxx.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@stdcxx.apache.org Delivered-To: mailing list dev@stdcxx.apache.org Received: (qmail 46723 invoked by uid 99); 11 Mar 2008 23:42:45 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 11 Mar 2008 16:42:45 -0700 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of msebor@gmail.com designates 209.85.146.183 as permitted sender) Received: from [209.85.146.183] (HELO wa-out-1112.google.com) (209.85.146.183) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 11 Mar 2008 23:42:07 +0000 Received: by wa-out-1112.google.com with SMTP id j4so3118712wah.21 for ; Tue, 11 Mar 2008 16:42:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:organization:user-agent:mime-version:to:cc:subject:references:in-reply-to:content-type:sender; bh=Cj3v+yfUVruhvL40uqMmCEIxuia44hdA1tKI04Y5aSg=; b=vqDf/BmFew4mITqlYjJal1ukyK5sY2lBH6pV3+8nS/UaH14sf9+H7F1LusjJSwrlwq4HYPQHdqXFmzjkvIn6FNGUdBy8NuGrQefkeSYLZGcQYRdPNqNRfW+sd4nISb9eyuawYJHa9WU0l70TuTu5UYQMPQbuUHnB0eFyIJWdZu8= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:organization:user-agent:mime-version:to:cc:subject:references:in-reply-to:content-type:sender; b=W6sU5yFhWI73GSknOFrQsFIuN0+KJ2eQxLdICaH0rNwrx+N/GRbRKCckzFpwJdwuYjkZjuUWAzTclTNXLD+bbg2gfvAfq/1CRwtGQAWIc5ye6R41Rx1kGOtkObg2djTz0o8qvgpy0VoKi4u/+DNwZ+Itb5k98m9p3cAAO2PZnRU= Received: by 10.114.126.1 with SMTP id y1mr5755205wac.108.1205278938625; Tue, 11 Mar 2008 16:42:18 -0700 (PDT) Received: from localhost.localdomain ( [71.229.200.170]) by mx.google.com with ESMTPS id a8sm26839422poa.2.2008.03.11.16.42.17 (version=TLSv1/SSLv3 cipher=RC4-MD5); Tue, 11 Mar 2008 16:42:17 -0700 (PDT) Message-ID: <47D718D8.6010901@roguewave.com> Date: Tue, 11 Mar 2008 17:42:16 -0600 From: Martin Sebor Organization: Rogue Wave Software, Inc. User-Agent: Thunderbird 2.0.0.12 (X11/20080226) MIME-Version: 1.0 To: dev@stdcxx.apache.org CC: Eric Lemings , Martin Sebor Subject: Re: STDCXX-435 References: In-Reply-To: Content-Type: multipart/mixed; boundary="------------030004060705060201050902" Sender: Martin Sebor X-Virus-Checked: Checked by ClamAV on apache.org --------------030004060705060201050902 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit I don't think it's that simple. IIRC, the problem is that we're using mbsrtowcs() on sequences that aren't guaranteed to be NUL-terminated. It seems that it should be straightforward to either specify a limit to mbsrtowcs() that's small enough so as to prevent the function from ever reaching the end of the (non-NUL-terminated) source sequence, or make sure there is a NUL at the end (by copying the short subsequence at the end of the source sequence into a small local buffer and NUL terminating it there. I recall trying to get that approach to work at first and failing. I don't remember why it didn't work anymore, if it was because the code became too complex and inefficient or if it simply wasn't possible to guarantee that it would be correct in all cases. In any event, I came to the conclusion that we can't call mbsrtowcs() to convert whole ranges of characters at once but that we must do the conversion one character at a time instead. That's what the attached patch does (I think). Since I wrote it months ago I don't remember how extensively I tested it, or if it even applies cleanly after so much time has passed. It does appear to fix the bug in the originally submitted test case though :) Martin Eric Lemings wrote: > >>>From Martin's comments in the Jira issue: > > "...The function attempts to convert the source sequence up until the > terminating NUL (or an invalid byte) or until it has produced the > requested number of destitation characters. When the destination buffer > is large enough for more the number of characters in the source sequence > the function just keeps converting past the end." > > More than that, I think the call to mbsrtowcs() is using the wrong > length. It's using dst_len when it should be using src_len which is > smaller in the test case listed in the issue. > > Some debugging excerpts: > > __rw_libc_do_in (state=@0x7fff5d880800 > , from=0x7fff5d880820 "abc", > from_end=0x7fff5d880821 "bc", from_next=@0x7fff5d8807f8 > , to=0x7fff5d880810, > to_limit=0x7fff5d880818, to_next=@0x7fff5d8807f0 > ) > at /work/stdcxx/trunk.1/src/wcodecvt.cpp:357 > > 372 const _RWSTD_SIZE_T ret = mbsrtowcs (pdst, &psrc, > dst_len, &state); > > (gdb) print src_len > $6 = 1 > (gdb) print dst_len > $7 = 2 > > I think it should actually use min(src_len, dst_len). > > Thoughts? > > Brad. > > --------------030004060705060201050902 Content-Type: text/x-patch; name="stdcxx-435.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="stdcxx-435.patch" Index: src/wcodecvt.cpp =================================================================== --- src/wcodecvt.cpp (revision 636063) +++ src/wcodecvt.cpp (working copy) @@ -342,9 +342,6 @@ } -// This returns two result codes: error and ok. The partial error result -// is not returned because there is no way to know whether or not the -// input sequence contains any more valid characters. static _STD::codecvt_base::result __rw_libc_do_in (_RWSTD_MBSTATE_T &state, const char *from, @@ -359,95 +356,59 @@ _STD::codecvt_base::result res = _STD::codecvt_base::ok; - _RWSTD_MBSTATE_T save_state = state; // saved state before conversion + // compute the length of the source sequence in bytes and + // the size of the destination buffer in wide characters + _RWSTD_SIZE_T src_len = from_end - from; + _RWSTD_SIZE_T dst_size = to_limit - to; - _RWSTD_SIZE_T src_len = from_end - from; // source length - _RWSTD_SIZE_T dst_len = to_limit - to; // destination length + // set the initial values to the source and destination pointers + const char* psrc = from; + wchar_t* pdst = to; - const char* psrc = from_next ? from_next : ""; // source - wchar_t* pdst = to_next; // destination + while (dst_size && src_len) { -#ifndef _RWSTD_NO_MBSRTOWCS - // mbsrtowcs() requires the input to be a NULL-terminated string - const _RWSTD_SIZE_T ret = mbsrtowcs (pdst, &psrc, dst_len, &state); -#else // if defined (_RWSTD_NO_MBSRTOWCS) - const _RWSTD_SIZE_T ret = _RWSTD_SIZE_MAX; -#endif // _RWSTD_NO_MBSRTOWCS + // the number of bytes that form the next multibyte character + _RWSTD_SIZE_T nbytes; - // if an error occurred during the restartable function - // or if that function is not available - if (_RWSTD_SIZE_MAX == ret) { - // the conversion here (besides the previous failure) is done - // one character at a time because the non-reentrant/restartable - // counterpart of mbsrtowcs() does not provide any information - // about the size of the input that has been processed. - _RWSTD_UNUSED (state); - - // restore `psrc' value - psrc = from_next ? from_next : ""; - - while (dst_len && src_len) { - - _RWSTD_SIZE_T tmp; - #ifndef _RWSTD_NO_MBRTOWC - tmp = mbrtowc (pdst, psrc, src_len, &state); + nbytes = mbrtowc (pdst, psrc, src_len, &state); #elif !defined (_RWSTD_NO_MBTOWC) - tmp = mbtowc (pdst, psrc, src_len); + nbytes = mbtowc (pdst, psrc, src_len); #else - tmp = _RWSTD_SIZE_MAX; + nbytes = _RWSTD_SIZE_MAX; #endif - // error; -1 result comes only from an illegal sequence - if (_RWSTD_SIZE_MAX == tmp) { - res = _STD::codecvt_base::error; - break; - } + // -1 indicates an invalid sequence (i.e., error) + if (nbytes == (_RWSTD_SIZE_T)(-1)) { + res = _STD::codecvt_base::error; + break; + } - // not enough bytes in input to form a valid - // character - translates to an ok result - if (tmp == (_RWSTD_SIZE_T)(-2)) - break; + // -2 indicates an ambiguous but valid subsequence + // (i.e., ok) + if (nbytes == (_RWSTD_SIZE_T)(-2)) + break; - // the multibyte sequence is the NULL character - if (tmp == 0) - tmp++; + // 0 indicates the NUL character (skip over it) + if (nbytes == 0) + ++nbytes; - // adjust the pointers - psrc += tmp; - src_len -= tmp; - ++pdst; - --dst_len; - } - - // adjust "next" pointers - from_next = psrc; - to_next = pdst; - + // > 0 indicates the number of bytes in the successfully + // converted multibyte character + psrc += nbytes; + src_len -= nbytes; + ++pdst; + --dst_size; } - else { - // the conversion succeeded on the first attempt - if (psrc) - from_next = psrc; - else { + // adjust "next" pointers + from_next = psrc; + to_next = pdst; - // mbsrtowcs() sets `psrc' to 0 if the conversions - // stops due to the terminating NUL character - - const _RWSTD_SIZE_T tmp = - __rw_libc_mbrlen (save_state, from_next, ret); - - from_next += tmp; - } - - to_next += ret; - } - // if the conversion has exhausted all space in the destination // range AND there are more COMPLETE characters in the source // range then we have a "partial" conversion - if (res == _STD::codecvt_base::ok && src_len && !dst_len) { + if (res == _STD::codecvt_base::ok && src_len && !dst_size) { _RWSTD_MBSTATE_T tmp_state = state; _RWSTD_SIZE_T tmp = __rw_libc_mbrlen (tmp_state, psrc, src_len); if (tmp < (_RWSTD_SIZE_T)(-2)) --------------030004060705060201050902--