lucene-java-user mailing list archives

From Daan Hoogland <daan.hoogl...@asml.com>
Subject Re: searching using the CJKAnalyzer
Date Tue, 12 Oct 2004 12:56:52 GMT
Che Dong wrote:

> CJKAnalyzer does not support a single-byte stream. The front-end interface and 
> the back-end indexing process need to transform the source into a proper 
> double-byte character stream before searching or indexing.
>
> Please let me know the output of
> http://www.chedong.com/tech/HelloUnicode.java
> with javac -encoding gb2312 and javac -encoding iso-8859-1
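The point above, that bytes must be decoded into characters before the analyzer sees them, can be illustrated with a minimal sketch. It is not taken from HelloUnicode.java; the class and method names are hypothetical, and UTF-8 is used only because every JVM supports it (a GB2312 source would pass Charset.forName("GB2312") instead, assuming that charset is installed).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class DecodeBeforeAnalysis {

    /** Decodes raw bytes through a Reader that uses an explicitly named charset. */
    static String decode(byte[] raw, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(raw), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Decoding the three UTF-8 bytes of one CJK character with the UTF-8 charset yields a single char, while decoding the same bytes as ISO-8859-1 yields three unrelated chars. An analyzer fed the second form cannot produce meaningful CJK tokens.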

Here's the output. I can see what's wrong in it. Should I put an extra 
field in the index containing encoding?

>
> Regards
>
> Che Dong
>
>
> Daan Hoogland wrote:
>
>> Jon Schuster wrote:
>>
>>
>>> I didn't need to make any changes to Entities to get Japanese 
>>> searches working. Are you using the CJKAnalyzer when you perform the 
>>> search, not only when building the index?
>>>
>>>
>>
>> Yes, I use CJKAnalyzer all around. When searching I translate 
>> character entities in order to find anything. When displaying search 
>> results, I don't see anything that looks like part of an East Asian 
>> character set. Instead I see accented Latin characters and mathematical symbols.
>>
>> When I don't pass entities, by the way, things get really nasty:
>> query passed: >Θ??µ░╕<
>>  char(Θ, LATIN_1_SUPPLEMENT)  char(?, LATIN_1_SUPPLEMENT) token found 
>> :  >Θ< length: 1
>>  char(?, LATIN_1_SUPPLEMENT)  char(µ, LATIN_1_SUPPLEMENT)  char(░, 
>> LATIN_1_SUPPLEMENT) token found : >µ< length: 1
>>  char(╕, LATIN_1_SUPPLEMENT) searching contents:"Θ µ"
>>
>> This was a query for two Japanese characters.
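One plausible reading of the log above is that the query's UTF-8 bytes were decoded one byte per char as ISO-8859-1, which would explain why every char is reported as LATIN_1_SUPPLEMENT. The sketch below reproduces that failure and reverses it. It is an assumption about the cause, not a confirmed diagnosis, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class QueryCharsetRepair {

    /** Simulates the suspected failure: UTF-8 bytes decoded as ISO-8859-1. */
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /**
     * Reverses the failure: recovers the original bytes, then decodes them as UTF-8.
     * Only meaningful for strings garbled in exactly this way.
     */
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** True if the string is non-empty and every char is in the Latin-1 Supplement block. */
    static boolean allLatin1Supplement(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(s.charAt(i));
            if (block != Character.UnicodeBlock.LATIN_1_SUPPLEMENT) {
                return false;
            }
        }
        return true;
    }
}
```

A two-character Japanese query becomes six Latin-1 Supplement chars after misdecode(), and repair() restores the original two. Repairing after the fact is a workaround; where the query arrives through a layer that lets the request encoding be declared, setting it there is usually the cleaner fix.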
>>
>>
>>> -----Original Message-----
>>> From: Daan Hoogland [mailto:daan.hoogland@asml.com]
>>> Sent: Sunday, October 10, 2004 10:48 PM
>>> To: Lucene Users List
>>> Subject: Re: searching using the CJKAnalyzer
>>> Importance: Low
>>>
>>>
>>> Che Dong wrote:
>>>
>>>
>>>
>>>
>>>> This seems to be not an analyzer problem but a charset detection error in the HTML parser.
>>>>
>>>> Could you show me the details of the problem?
>>>>  
>>>
>>>
>>> Thanks Che,
>>> I got it working by making the decode() method of the Entities class in the demo 
>>> public. I wrote a scanner to translate any entities in the query.
>>> I want to translate back to entities in the results, but I'm not 
>>> sure what the criteria should be. It seems to be just binary data.
>>> How to conclude that 0Š4?0†3¨¦?0„4 means ÓÐÒ°?
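The entity translation described above could look roughly like the following sketch. It is not the Entities class from the Lucene demo; the class name and the rule "encode every code point above 127" are assumptions made for illustration, and only decimal numeric entities are handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class NumericEntities {

    // Decimal numeric entities; at most 7 digits so parseInt cannot overflow.
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    /** Replaces each valid decimal entity with the character it names. */
    static String decode(String text) {
        Matcher m = DECIMAL_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Entities that name no valid code point are left untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Assumed criterion: every code point above 127 becomes a decimal entity. */
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint > 127) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

Encoding back this way only gives sensible results when the stored text was decoded with the right charset in the first place. If the result strings are already garbled, no criterion applied at display time can recover the intended characters.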
>>>
>>>
>>>
>>>
>>>> Thanks
>>>>
>>>> Che Dong
>>>>
>>>> Daan Hoogland wrote:
>>>>
>>>>  
>>>>
>>>>> LS,
>>>>> in
>>>>> http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
>>>>> Jon Schuster explains how to get a Japanese search system working. 
>>>>> I followed his advice and got an index that "luke" shows as what I 
>>>>> expected it to be.
>>>>> I don't know how to enter a search so that it gets passed to the 
>>>>> engine properly. It works in luke but not in weblucene or in my 
>>>>> own app.
>>>>>
>>>>>
>>>>>    
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>>
>>>>
>>>>  
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>


