Subject: Re: Error while decoding %DC (Ü) from URL - results in ?
From: François Schiettecatte
Date: Mon, 29 Aug 2011 08:34:53 -0400
To: solr-user@lucene.apache.org

Merlin

Just to make sure I understand what is going on here: you are getting
searches from external crawlers. These are coming in the form of an HTTP
request, I assume?

Have you checked the encoding specified in these requests (in the
Content-Type header)? If the encoding is not specified, then iso-8859-1
is usually assumed. Also, have you checked the default encoding of your
container? If you are using Tomcat, that is set using URIEncoding, for
example:

François

On Aug 28, 2011, at 3:10 PM, Merlin Morgenstern wrote:

> I double checked all code on that page and it looks like everything is in
> utf-8 and works just perfect. The problematic URLs are always called by
> bots like google bot. Looks like they are operating with a different
> encoding. The page itself has a utf-8 meta tag.
>
> So it looks like I have to find a way that checks for the encoding and
> encodes appropriately. This should be a common solr problem if all search
> engines treat utf-8 that way, right?
>
> Any ideas how to fix that? Is there maybe a special solr functionality for
> this?
>
> 2011/8/27 François Schiettecatte
>
>> Merlin
>>
>> Ü encodes to two bytes in utf-8 (C3 9C), and one in iso-8859-1 (%DC), so
>> it looks like there is a charset mismatch somewhere.
>>
>> Cheers
>>
>> François
>>
>> On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote:
>>
>>> Hello,
>>>
>>> I am having problems with searches that are issued from spiders that
>>> contain the ASCII encoded character "ü"
>>>
>>> For example in: "Übersetzung"
>>>
>>> The solr log shows the following query request: /suche/%DCbersetzung
>>> which has been translated into the solr query: q=?ersetzung
>>>
>>> If you enter the search term directly as a user into the search box it
>>> will result in:
>>> /suche/Übersetzung which returns perfect results.
>>>
>>> I am decoding the URL within PHP: $term = trim(urldecode($q));
>>>
>>> Somehow urldecode() translates the character Ü (%DC) into a ? which is
>>> an illegal first character in Solr.
>>>
>>> I tried it without urldecode(), with rawurldecode() and with
>>> utf8_decode() but all of those did not help.
>>>
>>> Thank you for any help or hint on how to solve that problem.
>>>
>>> Regards, Merlin
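
The charset mismatch discussed in this thread can be sketched as follows.
This is a minimal Python illustration, not the PHP code from the original
site: it percent-decodes the path segment to raw bytes, tries UTF-8 first,
and falls back to ISO-8859-1 when the bytes are not valid UTF-8. The
function name decode_path_segment and the fallback order are assumptions
made for illustration only.

```python
from urllib.parse import unquote_to_bytes


def decode_path_segment(raw: str) -> str:
    """Percent-decode a URL segment into text.

    Hypothetical helper: tries UTF-8 first, then assumes ISO-8859-1.
    """
    # Work on raw bytes so no charset is guessed during percent-decoding.
    data = unquote_to_bytes(raw)
    try:
        # %C3%9C is the two-byte UTF-8 form of "Ü".
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # A lone 0xDC byte (%DC) is not valid UTF-8; assume the client
        # sent ISO-8859-1, where 0xDC is "Ü".
        return data.decode("iso-8859-1")
```

With this sketch, both "%DCbersetzung" and "%C3%9Cbersetzung" decode to
"Übersetzung". The fallback is a heuristic: a byte sequence that happens to
be valid in both encodings would be read as UTF-8.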