lucene-solr-user mailing list archives

From Merlin Morgenstern <merlin.morgenst...@googlemail.com>
Subject Re: Error while decoding %DC (Ü) from URL - results in ?
Date Sun, 28 Aug 2011 19:10:20 GMT
I double-checked all the code on that page, and everything appears to be in
UTF-8 and works perfectly. The problematic URLs are always requested by bots
such as Googlebot, which seem to be using a different encoding.
The page itself has a UTF-8 meta tag.

So it looks like I have to find a way to detect the encoding of the incoming
term and convert it appropriately. This should be a common Solr problem if all
search engines treat UTF-8 that way, right?

Any ideas on how to fix that? Is there perhaps some special Solr functionality
for this?
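
One possible approach on the PHP side, as a minimal sketch: decode the URL
segment, keep it if it is already valid UTF-8, and otherwise assume it is
ISO-8859-1 and convert it. The function name decode_search_term() is
hypothetical, the ISO-8859-1 fallback is an assumption about what the bots
send, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper (not part of Solr or of the existing code):
// decode a URL-encoded search term and return it as UTF-8.
function decode_search_term(string $raw): string
{
    $decoded = trim(urldecode($raw));

    // Valid UTF-8 (which includes plain ASCII) is passed through unchanged.
    if (mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded;
    }

    // Assumption: anything that is not valid UTF-8 was sent as ISO-8859-1.
    return mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
}
```

With this, both /suche/%DCbersetzung and /suche/%C3%9Cbersetzung should yield
the same UTF-8 term. The check is a heuristic: a rare ISO-8859-1 byte sequence
that also happens to be valid UTF-8 would be left unconverted.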

2011/8/27 François Schiettecatte <fschiettecatte@gmail.com>

> Merlin
>
> Ü encodes to two bytes in UTF-8 (%C3%9C) and one in ISO-8859-1 (%DC), so
> it looks like there is a charset mismatch somewhere.
>
>
> Cheers
>
> François
>
>
>
> On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote:
>
> > Hello,
> >
> > I am having problems with searches issued by spiders that contain the
> > percent-encoded character "Ü".
> >
> > For example, in "Übersetzung".
> >
> > The Solr log shows the following query request: /suche/%DCbersetzung
> > which has been translated into the Solr query: q=?ersetzung
> >
> > If you enter the search term directly as a user in the search box, it
> > results in:
> > /suche/Übersetzung which returns perfect results.
> >
> > I am decoding the URL within PHP: $term     = trim(urldecode($q));
> >
> > Somehow urldecode() translates the character Ü (%DC) into a ?, which is an
> > illegal first character in Solr.
> >
> > I tried it without urldecode(), with rawurldecode(), and with
> > utf8_decode(), but none of those helped.
> >
> > Thank you for any help or hint on how to solve that problem.
> >
> > Regards, Merlin
>
>
