lucene-solr-user mailing list archives

From François Schiettecatte <fschietteca...@gmail.com>
Subject Re: Error while decoding %DC (Ü) from URL - results in ?
Date Mon, 29 Aug 2011 12:34:53 GMT
Merlin

Just to make sure I understand what is going on here, you are getting searches from external
crawlers. These are coming in the form of an HTTP request I assume?

Have you checked the encoding specified in these requests (in the Content-Type header)? If
the encoding is not specified then ISO-8859-1 is usually assumed. Also, have you checked the
default encoding of your container? If you are using Tomcat, that is set with the URIEncoding
attribute on the Connector, for example:

    <Connector address="localhost" port="8000" protocol="HTTP/1.1"
               connectionTimeout="20000" URIEncoding="UTF-8" />
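
To make the mismatch concrete, here is a minimal sketch of what happens to the bytes. It is written in Python rather than your PHP, purely as an illustration, and decode_term is a hypothetical helper name, not something Solr or Tomcat provides. It assumes the only two encodings you will see from crawlers are UTF-8 and ISO-8859-1.

```python
from urllib.parse import unquote, unquote_to_bytes


def decode_term(raw: str) -> str:
    """Percent-decode a search term, preferring UTF-8.

    Hypothetical helper: falls back to ISO-8859-1 when the decoded
    bytes are not valid UTF-8 (assumes no other encodings occur).
    """
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # A lone 0xDC byte is not valid UTF-8, so we end up here.
        return data.decode("iso-8859-1")


# Decoding the ISO-8859-1 byte as UTF-8 yields the replacement character.
print(repr(unquote("%DCbersetzung")))           # '\ufffdbersetzung'
# The same bytes decoded with the right charset give the expected word.
print(unquote("%DCbersetzung", encoding="iso-8859-1"))
# Both spellings of the URL normalise to the same term with the fallback.
print(decode_term("%DCbersetzung") == decode_term("%C3%9Cbersetzung"))
```

The fallback is a heuristic: a byte sequence that is valid in both encodings will always be read as UTF-8, so fixing the encoding at the source (or in the container) is still the cleaner option.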

François

On Aug 28, 2011, at 3:10 PM, Merlin Morgenstern wrote:

> I double-checked all the code on that page and it looks like everything is in
> utf-8 and works just fine. The problematic URLs are always requested by bots
> such as Googlebot. It looks like they are operating with a different encoding.
> The page itself has a utf-8 meta tag.
> 
> So it looks like I have to find a way to check the encoding and
> decode appropriately. This should be a common Solr problem if all search
> engines treat utf-8 that way, right?
> 
> Any ideas how to fix that? Is there maybe a special solr functionality for
> this?
> 
> 2011/8/27 François Schiettecatte <fschiettecatte@gmail.com>
> 
>> Merlin
>> 
>> Ü encodes to two bytes in UTF-8 (%C3%9C), and one in ISO-8859-1 (%DC), so
>> it looks like there is a charset mismatch somewhere.
>> 
>> 
>> Cheers
>> 
>> François
>> 
>> 
>> 
>> On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote:
>> 
>>> Hello,
>>> 
>>> I am having problems with searches issued by spiders that contain
>>> the URL-encoded character "Ü".
>>> 
>>> For example in : "Übersetzung"
>>> 
>>> The Solr log shows the following query request: /suche/%DCbersetzung
>>> which has been translated into the Solr query: q=?ersetzung
>>> 
>>> If you enter the search term directly as a user into the search box it will
>>> result in:
>>> /suche/Übersetzung which returns perfect results.
>>> 
>>> I am decoding the URL within PHP: $term     = trim(urldecode($q));
>>> 
>>> Somehow urldecode() translates the character Ü (%DC) into a ?, which is an
>>> illegal first character in Solr.
>>> 
>>> I tried it without urldecode(), with rawurldecode() and with utf8_decode(),
>>> but none of those helped.
>>> 
>>> Thank you for any help or hint on how to solve that problem.
>>> 
>>> Regards, Merlin
>> 
>> 

