lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benson Margulies" <bim2...@basistech.com>
Subject RE: encoding question.
Date Wed, 14 Feb 2007 13:35:33 GMT
The usual source of this problem is HTML forms. If you want to get UTF-8
back from a form, you have to send \the form itself/ to the browser in
UTF-8.

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Wednesday, February 14, 2007 3:50 AM
To: java-user@lucene.apache.org
Subject: Re: encoding question.


Internally Lucene deals with pure Java Strings; when writing those
strings
to and reading those strings back from disk, Lucene allways uses the
stock
Java "modified UTF-8"  format, regardless of what your file.encoding
system property may be.

typcially when people have encoding problems in their lucene
applications,
the origin of hte problem is in the way they fetch the data before
indexing it ... if you can make a String object, and System.out.println
that string and see what you expect, then handing that string to Lucene
as
a field value should work fine.

what exactly is the "value" object you are calling getBytes on? ... if
it's another String, then you've already got serious problems -- i can't
imagine any situation where fetching the bytes from a String in one
charset and using those bytes to construct another string (either in a
different charset, or in the system default charset) would make any
sense
at all.

wherever your original binary data is coming from (files on disk,
network
socket, etcc...) that's when you should be converting those bytes into
chars using whatever charset you know those bytes represent.



: Date: Wed, 14 Feb 2007 09:16:58 +0330
: From: Mohammad Norouzi <mnrz57@gmail.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: encoding question.
:
: Hi
: I want to index data with utf-8 encoding, so when adding field to a
document
: I am using the code new String(value.getBytes("utf-8"))
: in the other hand, when I am going to search I was using the same
snippet
: code to convert to utf-8 but it did not work so finally I found
somewhere
: that had been said to use new
String(valueToSearch.getBytes("cp1252"),"UTF8")
: and it worked fine but I still has some problem.
: first, some characters are weird when I get result from lucene, It
seems it
: is in cp1252 encoding.
: second, if the java environment property "file.encoding" not been
cp1252 the
: result is completely in incorrect encoding. so I must change this
property
: using System.setProperty("file.encoding","cp1252")
:
: is lucene neglect my utf-8 encoding and proceed indexing data using
cp1252?
: how can I correct weird characters I received by searching?
:
: Thank you very much in advance.
: --
: Regards,
: Mohammad
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message