lucene-java-user mailing list archives

From Karl Øie <k...@gan.no>
Subject Re: problems with search on Russian content
Date Fri, 22 Nov 2002 08:37:54 GMT
Hi, I took a look at Andrey Grishin's Russian character problem and found 
something strange while we tried to debug it. It seems that he has 
avoided the usual "querying with a different encoding than the one used 
for indexing" problem, since he can dump out correctly encoded Russian 
at all points in his application.

Are the strings for terms treated differently from the text stored in 
text fields? The reason I ask is that his Russian words are correct in 
the stored text fields but show up faulty in a terms() dump. If he 
had a character encoding problem in his application, the fields should 
show up faulty as well, I think. Even stranger, I use Lucene 1.2 
successfully with utf-8, iso-8859-1, iso-8859-5 and iso-8859-7. Why does 
this problem show up with Russian (Cp1251) and not the other encodings?
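
For reference, the idiom used in the forwarded message, new String(text.getBytes("Cp1251")), encodes with Cp1251 and then decodes with the platform default charset, so its result depends on the JVM it runs on. Below is a minimal JDK-only sketch of that dependence. The class name, the helper roundTrip and the sample word are made up for illustration, and this is not a diagnosis of Andrey's setup.

```java
import java.nio.charset.Charset;

public class RoundTripDemo {
    // Encode s with one charset and decode the bytes with another.
    // new String(s.getBytes("Cp1251")) does the same thing, with the
    // platform default charset as the implicit decoding charset.
    static String roundTrip(String s, String encodeWith, String decodeWith) {
        byte[] bytes = s.getBytes(Charset.forName(encodeWith));
        return new String(bytes, Charset.forName(decodeWith));
    }

    public static void main(String[] args) {
        // Hypothetical sample word in Cyrillic, written with Unicode escapes
        String word = "\u0442\u0435\u0441\u0442";

        // Same charset on both sides: the word survives unchanged
        System.out.println(roundTrip(word, "windows-1251", "windows-1251").equals(word));

        // Decoding with a different charset: the word is altered
        System.out.println(roundTrip(word, "windows-1251", "ISO-8859-1").equals(word));
    }
}
```

If the default charset of the JVM happens to be Cp1251, the idiom is a no-op. Otherwise it changes the string before Lucene ever sees it.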

Strangeness number two: suppose the Russian word ",!,_,U" was skewed to, 
say, "0d66539qw" upon indexing, and the problem was just a consistent 
encoding problem. Wouldn't a query for ",!,_,U" then be skewed to 
"0d66539qw" as well, and be found anyway?
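
The reasoning here can be sketched with a toy example. It assumes the skew is a deterministic function applied identically at index and query time. The class, the skew function and the in-memory "index" are hypothetical stand-ins, not Lucene code.

```java
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;

public class ConsistentSkewDemo {
    static final Charset CP1251 = Charset.forName("windows-1251");
    static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    // Hypothetical "consistent encoding mistake": deterministic, so the
    // same input always produces the same (wrong) output.
    static String skew(String s) {
        return new String(s.getBytes(CP1251), LATIN1);
    }

    // Store the skewed word in a toy index, then look up the query word,
    // skewed or not depending on skewQuery.
    static boolean found(String indexedWord, String queryWord, boolean skewQuery) {
        Set<String> index = new HashSet<>();
        index.add(skew(indexedWord));
        return index.contains(skewQuery ? skew(queryWord) : queryWord);
    }

    public static void main(String[] args) {
        String word = "\u0442\u0435\u0441\u0442"; // hypothetical sample word

        // Same skew on both sides: the lookup still succeeds
        System.out.println(found(word, word, true));

        // Skew on the index side only: the lookup fails
        System.out.println(found(word, word, false));
    }
}
```

So a miss suggests the index path and the query path are not transforming the text in the same way.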

Regards, Karl Øie


Begin forwarded message:

> From: "Andrey Grishin" <grishin@softline.kiev.ua>
> Date: Thu Nov 21, 2002  15:13:33 Europe/Oslo
> To: "Karl Oie" <karl@gan.no>
> Subject: Re: How to include strange characters??
>
> Yes, you are right - there are no Russian words in the returned terms :(((
> I've just executed the following:
> --------------
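> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermEnum;
>
> IndexReader r =
>     IndexReader.open("C:\\j\\jakarta-tomcat-4.1.12\\index\\ukrenergo");
> TermEnum e = r.terms();
> while (e.next()) {
>   Term term = e.term();
>   System.out.println("term : " + term.text());
> }
> e.close();
> r.close();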
> --------------
> and got no Russian words in the result.
> Some "strange" terms are returned instead of Russian:
> term : 0d4xvp70w
> term : 0d66539qw
> term : 0d67les2o
> term : 0d6eqgic0
> etc.....
>
> So, I think we have a problem. This is great :)), thank you...
> But how do we fix it?
>
>
>
>
> ----- Original Message -----
> From: "Karl Øie" <karl@gan.no>
> To: "Andrey Grishin" <grishin@softline.kiev.ua>
> Sent: Thursday, November 21, 2002 3:56 PM
> Subject: Re: How to include strange characters??
>
>
> Another thing to check is whether IndexReader.terms() actually
> contains your term.
>
> Regards, Karl Oie
>
> On Thursday, Nov 21, 2002, at 14:31 Europe/Oslo, Andrey Grishin wrote:
>
>> Karl,
>> I have the same problem with Lucene search within Russian content.
>> I tried all your advice, but Lucene still can't find anything :((((
>> I indexed the content using the Cp1251 charset
>> ------------
>> // encodes with Cp1251, then decodes with the platform default charset
>> text = new String(text.getBytes("Cp1251"));
>> doc.add(Field.Text(CONTENT_FIELD,text));
>> ------------
>> and I am searching using the same charset
>> String txt = ",!,_,U";
>> txt = new String(txt.getBytes("Cp1251"));
>> PrefixQuery query = new PrefixQuery(
>>     new Term(PortalHTMLDocument.CONTENT_FIELD, txt));
>> hits = searcher.search(query);
>>
>> and Lucene can't find anything.
>> I also checked for the DecodeInterceptor in my server.xml - there
>> isn't any.
>> I tried UTF-8/16 and got the same result.
>> If I list all of the index's content by iterating over the IndexReader,
>> I can see that my Russian content is stored in the index...
>> Can you please help me? Do you have any more ideas about what else can
>> be done here to fix this problem?
>>
>> I will appreciate any help.
>> Thanks, Andrey.
>>
>> P.S.
>> I am using Lucene 1.2, Tomcat 4.1.12, JDK 1.4.1 on Win2000 AS
>

