Subject: Re: LDAP protocol implementation and data containing accents
From: Emmanuel Lecharny
To: Apache Directory Developers List
Date: Wed, 31 Aug 2005 11:34:03 +0200
Message-Id: <1125480843.7925.15.camel@portable>
In-Reply-To: <32609e770508310200938c875@mail.gmail.com>

On Wed, 2005-08-31 at 11:00 +0200, Jérôme Baumgarten wrote:
> On 8/30/05, Emmanuel Lecharny wrote:
> > > Also, accents in a filter are incorrectly received (decoded ?) in
> > > SearchHandler, for example the filter (sn=*é*) is retrieved as
> > > (sn=*Ã©*).
> >
> > Are you using UTF-8 to encode your string? Data are stored in UTF-8
> > format in Ldap.
>
> I did some other tests and I get the following (clients and server
> running on a Windows box) :
>
>  * JXplorer (build JXv3.2b2 2005-08-18 13:46 EST) : filter is
>    incorrect w.r.t accents
>
>  * Softerra LDAP Browser 2.6 : filter is incorrect w.r.t accents
>
>  * JNDI test code : filter is incorrect w.r.t accents
>
>  * JLDAP test code : filter is incorrect w.r.t accents
>
>  * OpenLDAP ldapsearch (but running on a Linux box) : filter is
>    correct w.r.t accents
>
> I can fix these problems if I do the following :
>
>   String filter = LdapProxyUtils.filterToString(request.getFilter());
>   try {
>       filter = new String(filter.getBytes(), "UTF-8");
>   } catch (UnsupportedEncodingException ueEx) {
>       throw new RuntimeException(ueEx);
>   }
>
> But I don't really understand why I must do so since "RFC 2254 - The
> String Representation of LDAP Search Filters" says that it is
> represented as an UTF-8 string. Thus I would expect the filter value
> to be correct, no matter the platform my LDAP proxy is running on.

It's not a question of tool or platform. Values are stored in UTF-8 in
LDAP if they are Strings (from RFC 2251) :

"
4.1.2. String Types

   The LDAPString is a notational convenience to indicate that, although
   strings of LDAPString type encode as OCTET STRING types, the ISO
   10646 [13] character set (a superset of Unicode) is used, encoded
   following the UTF-8 algorithm [14]. Note that in the UTF-8 algorithm
   characters which are the same as ASCII (0x0000 through 0x007F) are
   represented as that same ASCII character in a single byte. The other
   byte values are used to form a variable-length encoding of an
   arbitrary character.
"

So you must send String values encoded in UTF-8 when querying an LDAP
server. If you use a tool, there is a good chance that a conversion is
done from your locale to UTF-8 (i.e. ISO-8859-1 to UTF-8 in your case).
If you write a piece of code to send requests to LDAP, you *MUST* do
this conversion yourself.
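To make the UTF-8 requirement concrete, here is a minimal sketch in Java. It assumes Java 7 or later (for StandardCharsets), and the names FilterEncodingDemo, toLdapBytes and misdecodeAsLatin1 are made up for illustration; they are not part of JNDI, JLDAP or ApacheDS. It encodes the filter from this thread as UTF-8 octets and shows that reading those octets back as ISO-8859-1 gives the kind of garbled value reported above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from any LDAP library.
class FilterEncodingDemo {

    /** Encodes a filter string as the UTF-8 octets an LDAPString requires. */
    public static byte[] toLdapBytes(String filter) {
        return filter.getBytes(StandardCharsets.UTF_8);
    }

    /** Simulates a receiver that wrongly decodes UTF-8 octets as ISO-8859-1. */
    public static String misdecodeAsLatin1(byte[] utf8Octets) {
        return new String(utf8Octets, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // (sn=*é*), written with an escape so the source file encoding does not matter
        String filter = "(sn=*\u00e9*)";
        byte[] wire = toLdapBytes(filter);

        // é takes two octets in UTF-8 (0xC3 0xA9), so 8 characters become 9 octets
        System.out.println(wire.length); // prints 9

        // Decoding those octets as ISO-8859-1 yields (sn=*Ã©*)
        System.out.println(misdecodeAsLatin1(wire).equals("(sn=*\u00c3\u00a9*)")); // prints true
    }
}
```

Passing the charset explicitly in both directions is what keeps the result independent of the locale of the machine the code runs on.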
Holding "Jérôme" in a plain Java String is not enough: Java stores it
internally in UTF-16, and "Jérôme".getBytes() without an argument
encodes it with the platform default charset. So you should always use
"Jérôme".getBytes("UTF-8") before sending data to LDAP. This applies to
search filters, too.

> Also, has anyone tested search on ApacheDS with filter containing
> accents ? The problems I'm facing right now may also be present with
> ApacheDS.

Sure, we have a problem with accents !!! Strings are created in ApacheDS
using new String(byte[] data) without specifying UTF-8 as the charset,
so the platform default is used. So this is a bug. It would be cool to
add a JIRA issue with a simple test case. However, we are currently
tracking down a bug related to encoding and binary values; fixing it may
solve your problem.

Emmanuel Lécharny
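A test case for such an issue could start from something like the sketch below. It is only an illustration under stated assumptions: Utf8DecodeDemo and decodeLdapString are hypothetical names, not ApacheDS code, and StandardCharsets assumes Java 7 or later. It decodes the UTF-8 octets of "Jérôme" with an explicit charset and compares the result with new String(byte[]), which depends on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not ApacheDS code.
class Utf8DecodeDemo {

    /** Decodes LDAPString octets as UTF-8, independent of the platform default. */
    public static String decodeLdapString(byte[] octets) {
        return new String(octets, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 octets of "Jérôme": é is 0xC3 0xA9, ô is 0xC3 0xB4
        byte[] octets = { 'J', (byte) 0xC3, (byte) 0xA9, 'r', (byte) 0xC3, (byte) 0xB4, 'm', 'e' };

        String explicit = decodeLdapString(octets);
        String platform = new String(octets); // uses Charset.defaultCharset()

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("explicit decoding is correct: " + explicit.equals("J\u00e9r\u00f4me"));
        System.out.println("platform decoding matches: " + platform.equals(explicit));
    }
}
```

On a JVM whose default charset is UTF-8 the two decodings agree; where the default is a single-byte charset such as windows-1252 they differ, which would explain why the problem shows up on some machines and not on others.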