lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Che Dong <ched...@chedong.com>
Subject Re: searching using the CJKAnalyzer
Date Tue, 12 Oct 2004 11:22:04 GMT
CJKAnalyser not support single byte-stream, front end interface and 
backend indexing process need to transform source into double byte 
charactor-stream properly before search/index.

Please tell me know the output of
http://www.chedong.com/tech/HelloUnicode.java
with javac -encoding=gb2312 and javac -encoding=iso-8859-1

Regards

Che Dong


Daan Hoogland wrote:
> Jon Schuster wrote:
> 
> 
>>I didn't need to make any changes to Entities to get Japanese searches working. Are
you using the CJKAnalyzer when you perform the search, not only when building the index?
>> 
>>
> 
> Yes, I use CJKAnalyzer all around. When searching I translate 
> character-entities in order to find anything. When displaying search 
> results, I don't see anything that looks as being part of an eastern 
> character set. instead I see accented latin - and mathematical symbols.
> 
> When I don't pass entities by the way things get really nasty:
> query passed: >Θ??µ░╕<
>  char(Θ, LATIN_1_SUPPLEMENT)  char(?, LATIN_1_SUPPLEMENT) token found : 
>  >Θ< length: 1
>  char(?, LATIN_1_SUPPLEMENT)  char(µ, LATIN_1_SUPPLEMENT)  char(░, 
> LATIN_1_SUPPLEMENT) token found : >µ< length: 1
>  char(╕, LATIN_1_SUPPLEMENT) searching contents:"Θ µ"
> 
> This was a query for two japanese characters.
> 
> 
>>-----Original Message-----
>>From: Daan Hoogland [mailto:daan.hoogland@asml.com] 
>>Sent: Sunday, October 10, 2004 10:48 PM
>>To: Lucene Users List
>>Subject: Re: searching using the CJKAnalyzer
>>Importance: Low
>>
>>
>>Che Dong wrote:
>>
>> 
>>
>>
>>>Seem not Analyser problem but html parser charset detecting error.
>>>
>>>Could you show me the detail of the problem?
>>>   
>>>
>>
>>Thank Che,
>>I got it working by making the decode() from the Entities in demo 
>>public. I wrote a scanner to tranlate any entities in the query.
>>I want to translate back to entities in the results, but I'm not sure 
>>what the criteria should be. It seems to be just binary data.
>>How to conclude that 0Š4?0†3¨¦?0„4 means ÓÐÒ°?
>>
>> 
>>
>>
>>>Thanks
>>>
>>>Che Dong
>>>
>>>Daan Hoogland wrote:
>>>
>>>   
>>>
>>>
>>>>LS,
>>>>in
>>>>http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
>>>>Jon Schuster explains how to get a Japanese search system working. I 
>>>>followed his advice and got a index that "luke" shows as what I 
>>>>expected it to be.
>>>>I don't know how to enter a search so that it gets passed to the 
>>>>engine properly. It works in luke but not in weblucene or in my own app.
>>>>
>>>>
>>>>     
>>>>
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>
>>>
>>>   
>>>
>>
>>
>>
>> 
>>
> 
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message