lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: character encoding issue...
Date Mon, 04 Nov 2013 12:57:24 GMT
The problem is that there are about a dozen places where the character
encoding can be misconfigured. What you're seeing above actually looks
like a problem with the character set configured in your browser; it may
have nothing to do with what's actually in Solr.

You might write a small SolrJ program, dump the contents in binary,
and examine them to see...
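
A minimal sketch of the "dump and examine" step, using only the JDK. It does not talk to Solr: the class name EncodingDump and the hard-coded sample string are illustrative assumptions, and in practice the value would be a field read back through SolrJ. If the dump shows U+FFFD (bytes EF BF BD), the text was already damaged before or during indexing. If it shows the expected code points, the stored data is fine and the problem is on the display side.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDump {
    // Space-separated hex of the string's UTF-8 bytes.
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Space-separated Unicode code points, e.g. "U+2019".
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // True if the string contains U+FFFD, the Unicode replacement character.
    public static boolean hasReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // Assumption: stands in for a field value fetched from the index.
        String value = args.length > 0 ? args[0] : "won\u2019t";
        System.out.println("code points: " + codePoints(value));
        System.out.println("UTF-8 bytes: " + utf8Hex(value));
        System.out.println("damaged:     " + hasReplacementChar(value));
    }
}
```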

Best
Erick


On Sun, Nov 3, 2013 at 6:39 AM, Rajani Maski <rajinimaski@gmail.com> wrote:

> How are you extracting the text from the website [1] you are
> referring to? Apache Nutch or another crawler? If so, first check
> whether that crawler is giving you data in the correct format before you
> invoke the Solr index method.
>
> [1]http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
>
> URI encoding should resolve this problem.
>
>
>
>
> On Fri, Nov 1, 2013 at 10:50 AM, Chris <christudas@gmail.com> wrote:
>
> > Hi Rajani,
> >
> > I followed the steps exactly as in
> > http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
> >
> > However, when I send a query to this new instance in Tomcat, I again get
> > the error -
> >
> >   <str name="fulltxt">Scheduled Groups Maintenance
> > In preparation for the new release roll-out,���� Diigo groups won’t be
> > accessible on Sept 28 (Mon) around midnight 0:00 PST for several
> > hours.
> > Stay tuned to say hello to Diigo V4 soon!
> >
> > location of the text  -
> > http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
> >
> > same problem at - http://cn.nytimes.com/business/20130926/c26alibaba/
> >
> > All text in title comes like -
> >
> > ������������������������������������ - ���������������������
> > ������������</str>
> >     <arr name="text">
> >       <str>������������������������������������ -
> > ��������������������� ������������</str>
> >     </arr>
> >
> >
> > Can you please advise?
> >
> > Chris
> >
> >
> >
> >
> > On Tue, Oct 29, 2013 at 11:33 PM, Rajani Maski <rajinimaski@gmail.com> wrote:
> >
> > > Hi,
> > >
> > >    If you are using Apache Tomcat, check that you are not missing the
> > > configuration below:
> > >
> > >  <Connector port="port Number" protocol="HTTP/1.1"
> > > connectionTimeout="20000"
> > > redirectPort="8443" URIEncoding="UTF-8"/>
> > >
> > > I faced a similar issue with Chinese characters and resolved it with
> > > the above config.
> > >
> > > Links for reference:
> > >
> > > http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
> > >
> > > http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8
> > >
> > >
> > > Thanks
> > >
> > >
> > >
> > > On Tue, Oct 29, 2013 at 9:20 PM, Chris <christudas@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I get characters like -
> > > >
> > > > ������������������ - CTA������������ -
> > > >
> > > > in the Solr index. I am adding Java beans to Solr with the addBean()
> > > > method.
> > > >
> > > > This seems to be a character encoding issue. Any pointers on how to
> > > > resolve this one?
> > > >
> > > > I have seen that this occurs mostly for Japanese and Chinese characters.
> > > >
> > >
> >
>
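
To illustrate what the URIEncoding attribute in the quoted Tomcat Connector controls, here is a small JDK-only sketch (the class name UriDecodingDemo and the sample text are assumptions for illustration). The attribute selects the charset used to decode percent-encoded bytes in the request URI. Decoding UTF-8 bytes as ISO-8859-1, the default in Tomcat 7, turns each multi-byte character into several unrelated ones. Note that this setting affects URI and query-string parameters only, so it may not explain corruption in documents sent in a request body.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UriDecodingDemo {
    // Decode a percent-encoded string with the named charset.
    public static String decode(String percentEncoded, String charset) {
        try {
            return URLDecoder.decode(percentEncoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Two Chinese characters, percent-encoded as UTF-8 the way a client would send them.
        String encoded = URLEncoder.encode("\u4E2D\u6587", "UTF-8");
        System.out.println("on the wire:        " + encoded);

        // Decoded with the same charset, the text round-trips.
        System.out.println("decoded as UTF-8:   " + decode(encoded, "UTF-8"));

        // Decoded as ISO-8859-1, each of the six bytes becomes its own character.
        String wrong = decode(encoded, "ISO-8859-1");
        System.out.println("decoded as Latin-1: " + wrong.length() + " chars instead of 2");
    }
}
```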
