lucene-solr-user mailing list archives

From Travis Low <t...@4centurion.com>
Subject Re: UTF-8 support during indexing content
Date Wed, 01 Feb 2012 14:26:46 GMT
Are you sure the input document is in UTF-8?  That looks like classic
ISO-8859-1-treated-as-UTF-8.
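
To illustrate the symptom, here is a minimal sketch of how that kind of
mix-up produces replacement characters. It assumes the document was actually
saved as windows-1252 (the usual superset of ISO-8859-1 that contains curly
quotes) and then read as UTF-8; the class and method names are made up for
this example and are not part of Solr or solrj.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text with one charset, then decode the same bytes with another.
    static String reDecode(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // Assumption: the file was written as windows-1252, not UTF-8.
        Charset cp1252 = Charset.forName("windows-1252");
        String original = "the word \u201Dstemming\u201D in quotes";

        // The single-byte curly quote is not valid UTF-8, so the decoder
        // substitutes U+FFFD (the replacement character) for it.
        String garbled = reDecode(original, cp1252, StandardCharsets.UTF_8);

        // Print code points so the result does not depend on the console encoding.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

If the quote marks come out as U+FFFD in that listing, the bytes were not
UTF-8 to begin with.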

How did you confirm the document contains the right quote marks immediately
prior to uploading?  If you just visually inspected it, then use whatever
tool you viewed it in to see what the character set is.
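
If your viewer does not report the character set, one way to check is to
decode the file's bytes strictly as UTF-8 and see whether that fails. The
following is only a sketch: the class name is hypothetical and the file path
is taken from the command line. Note that a successful decode shows the bytes
are well-formed UTF-8, not that they say what you intended.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 without any errors.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of silently
                    // substituting U+FFFD for bad input.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes)
                ? "well-formed UTF-8"
                : "not valid UTF-8 (probably a single-byte encoding)");
    }
}
```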

cheers,
Travis

On Wed, Feb 1, 2012 at 9:17 AM, Van Tassell, Kristian <
kristian.vantassell@siemens.com> wrote:

> Hello everyone,
>
> I have a question that I imagine has been asked many times before, so I
> apologize for the repeat.
>
> I have a basic text field with the following text:
>        the word ”stemming” in quotes
>
> Uploading the data yields no errors; however, once it is indexed, the text
> looks like this:
>
> the word �stemming� in quotes
>
>
> Searching for the word stemming, without quotes or otherwise, does not
> return any hits.
>
> Just some basic facts:
>
> - I included the solr.CollationKeyFilterFactory filter on the fieldType.
> - Updating the index is done via a "solr xml" document. I've confirmed
> that the document contains the right quote marks immediately prior to
> uploading.
> - Updating the index is done via solrj, essentially:
>        DirectXmlRequest up = new DirectXmlRequest( "/update", xml );
>        solrServer.request( up );
>        solrServer.commit();
> - In solr admin, the characters look like garbage, surrounding the word
> stemming (as shown above)
>
>
> Thanks in advance for any details you can provide!
> -Kristian
>