lucene-solr-user mailing list archives

From Michael Ludwig <...@as-guides.com>
Subject Re: Problem adding unicoded docs to Solr through SolrJ
Date Wed, 29 Apr 2009 13:15:24 GMT
ahmed baseet wrote:

> public void postToSolrUsingSolrj(String rawText, String pageId) {

>             doc.addField("features", rawText );

> In the above the param rawText is just the html stripped off of all
> its tags, js, css etc and pageId is the Url for that page. When I'm
> using this for English pages it's working perfectly fine but the
> problem comes up when I'm trying to index some non-English pages.

Maybe you're constructing a string without specifying the encoding, so
Java uses your default platform encoding?

String(byte[] bytes)
   Constructs a new String by decoding the specified array of
   bytes using the platform's default charset.

String(byte[] bytes, Charset charset)
   Constructs a new String by decoding the specified array of bytes using
   the specified charset.
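
The same applies one step earlier, when the page bytes are first turned into text. Here is a minimal sketch, assuming the page really is UTF-8 encoded. The class ReadWithCharset and the helper readAll are hypothetical names for illustration, and the byte array stands in for a fetched page:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ReadWithCharset {
    // Hypothetical helper: read the whole stream as text in the given charset.
    static String readAll(InputStream in, Charset charset) {
        try {
            // Passing the charset explicitly avoids the platform default.
            Reader reader = new InputStreamReader(in, charset);
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a fetched page: "K\u00e4se" encoded as UTF-8 (5 bytes).
        byte[] page = { 'K', (byte) 0xC3, (byte) 0xA4, 's', 'e' };
        String text = readAll(new ByteArrayInputStream(page), Charset.forName("UTF-8"));
        System.out.println(text.length());            // 4
        System.out.println(text.equals("K\u00e4se")); // true
    }
}
```

If the page is in some other encoding, the charset passed in has to match it; UTF-8 is only an assumption here.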

> Now what I did is just extracted the raw text from that html page and
> manually created an xml page like this
>
> <?xml version="1.0" encoding="UTF-8"?>
> <add>
>   <doc>
>     <field name="id">UTF2TEST</field>
>     <field name="name">Test with some UTF-8 encoded characters</field>
>     <field name="features">*some tamil unicode text here*</field>
>    </doc>
> </add>
>
> and posted this from command line using the post.jar file. Now searching
> gives me the result but unlike last time browser shows the indexed text in
> tamil itself and not the raw unicode.

Now that's perfect, isn't it?

> I tried doing something like this also,

> // Encode in Unicode UTF-8
>  utfEncodedText = new String(rawText.getBytes("UTF-8"));
>
> but even this didn't help either.

The String constructor is given no charset here, so the bytes are decoded
with the default platform encoding, which is likely not what you want.
Consider the following example:

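package milu;

import java.nio.charset.Charset;

public class StringAndCharset {
   public static void main(String[] args) {
     // "Käse" encoded as UTF-8: the "ä" is the two bytes 195, 164.
     byte[] bytes = { 'K', (byte) 195, (byte) 164, 's', 'e' };
     // Show which charset the platform uses by default.
     System.out.println(Charset.defaultCharset().displayName());
     // Decoded with the platform default: wrong unless that is UTF-8.
     System.out.println(new String(bytes));
     // Decoded with an explicit charset: the same result on every platform.
     System.out.println(new String(bytes, Charset.forName("UTF-8")));
   }
}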

Output:

windows-1252
KÃ¤se (bad)
Käse (good)
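
Even with the charset given on both sides, encoding a String to UTF-8 and decoding it straight back only returns the same String, so a line like the one you tried cannot repair text that was already decoded wrongly. A small sketch to illustrate; the class RoundTrip and its method are hypothetical names:

```java
import java.nio.charset.Charset;

class RoundTrip {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Encode to UTF-8 bytes, then decode them with the same explicit charset.
    static String roundTrip(String s) {
        return new String(s.getBytes(UTF8), UTF8);
    }

    public static void main(String[] args) {
        String text = "K\u00e4se"; // "Käse", written with an escape
        // The round trip changes nothing when both sides use UTF-8.
        System.out.println(roundTrip(text).equals(text)); // true
        // The a-umlaut takes two bytes in UTF-8, so 4 chars become 5 bytes.
        System.out.println(text.getBytes(UTF8).length);   // 5
    }
}
```

So the charset has to be right at the point where the bytes of the page are first decoded into a String.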

Michael Ludwig
