lucene-solr-user mailing list archives

From "Van Tassell, Kristian" <kristian.vantass...@siemens.com>
Subject RE: UTF-8 support during indexing content
Date Wed, 01 Feb 2012 16:38:05 GMT
Travis and all,

This is solved and was not directly a Solr issue. I'll note the solution here in case anyone
makes the same mistake. The documents are UTF-8 and the source documents are converted via
XSLT. They look good up to that point. 

First, based on some other recommendations I found, I changed the Tomcat <Connector>
element to include the URIEncoding="UTF-8" setting.

The primary problem, however, was that the data (mydata below) was read in without an
encoding designation.

DirectXmlRequest up = new DirectXmlRequest( "/update", mydata );

The stream was previously read with FileReader, which uses the platform default encoding:

BufferedReader reader = new BufferedReader(new FileReader(filePath));

I've since changed this and am now getting the intended result.

InputStreamReader reader = new InputStreamReader(new FileInputStream(filePath), "UTF-8");
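
For anyone wanting a fuller picture of the fix, here is a minimal sketch of reading the whole file into a String with an explicit UTF-8 decoder. The class and method names (Utf8FileRead, readUtf8) are made up for illustration and are not from the original code, and it assumes Java 7 or later for StandardCharsets and try-with-resources.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative, not from the original post.
final class Utf8FileRead {

    // Reads the entire file as UTF-8, regardless of the JVM's default charset.
    static String readUtf8(String filePath) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath), StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            // Copy decoded characters until end of stream.
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

The returned String could then be passed as the second argument to DirectXmlRequest, in place of mydata above.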

Thanks,
Kristian


-----Original Message-----
From: Travis Low [mailto:tlow@4centurion.com] 
Sent: Wednesday, February 01, 2012 8:27 AM
To: solr-user@lucene.apache.org
Subject: Re: UTF-8 support during indexing content

Are you sure the input document is in UTF-8?  That looks like classic
ISO-8859-1-treated-as-UTF-8.
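
As a rough illustration of that kind of mismatch (not a reconstruction of what happened in this particular setup), the sketch below encodes a string as ISO-8859-1 and then decodes the bytes as UTF-8. The class name and the sample text "café au lait" are assumptions chosen for the demo; the lone 0xE9 byte is not valid UTF-8, so Java substitutes the replacement character U+FFFD, which typically renders as a black diamond with a question mark.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names and sample text are illustrative only.
final class MisdecodeDemo {

    // Encodes the text as ISO-8859-1 (one byte per character).
    static byte[] latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Decodes bytes as UTF-8; malformed sequences are replaced with U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = latin1Bytes("caf\u00E9 au lait");
        String wrong = decodeAsUtf8(bytes);
        // The byte 0xE9 is not a complete UTF-8 sequence, so it becomes U+FFFD.
        System.out.println(wrong.equals("caf\uFFFD au lait")); // prints "true"
    }
}
```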

How did you confirm the document contains the right quote marks immediately
prior to uploading?  If you just visually inspected it, then use whatever
tool you viewed it in to see what the character set is.

cheers,
Travis

On Wed, Feb 1, 2012 at 9:17 AM, Van Tassell, Kristian <
kristian.vantassell@siemens.com> wrote:

> Hello everyone,
>
> I have a question that I imagine has been asked many times before, so I
> apologize for the repeat.
>
> I have a basic text field with the following text:
>        the word ”stemming” in quotes
>
> Uploading the data yields no errors; however, when it is indexed, the text
> looks like this:
>
> the word �stemming� in quotes
>
>
> Searching for the word stemming, without quotes or otherwise, does not
> return any hits.
>
> Just some basic facts:
>
> - I included the solr.CollationKeyFilterFactory filter on the fieldType.
> - Updating the index is done via a "solr xml" document. I've confirmed
> that the document contains the right quote marks immediately prior to
> uploading.
> - Updating the index is done via solrj, essentially:
>        DirectXmlRequest up = new DirectXmlRequest( "/update", xml );
>        solrServer.request( up );
>        solrServer.commit();
> - In solr admin, the characters look like garbage, surrounding the word
> stemming (as shown above)
>
>
> Thanks in advance for any details you can provide!
> -Kristian
>