lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Indexing multiple languages
Date Tue, 31 May 2005 23:08:37 GMT
Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It  
will keep English as-is (removing stop words, lowercasing, and such)  
and separate CJK characters into separate tokens also.

     Erik


On May 31, 2005, at 5:49 PM, jian chen wrote:

> Hi,
>
> Interesting topic. I thought about this as well. I wanted to index
> Chinese text with English, i.e., I want to treat the English text
> inside Chinese text as English tokens rather than Chinese text tokens.
>
> Right now I think maybe I have to write a special analyzer that takes
> the text input, and detect if the character is an ASCII char, if it
> is, assembly them together and make it as a token, if not, then, make
> it as a Chinese word token.
>
> So, bottom line is, just one analyzer for all the text and do the
> if/else statement inside the analyzer.
>
> I would like to learn more thoughts about this!
>
> Thanks,
>
> Jian
>
> On 5/31/05, Tansley, Robert <robert.tansley@hp.com> wrote:
>
>> Hi all,
>>
>> The DSpace (www.dspace.org) currently uses Lucene to index metadata
>> (Dublin Core standard) and extracted full-text content of documents
>> stored in it.  Now the system is being used globally, it needs to
>> support multi-language indexing.
>>
>> I've looked through the mailing list archives etc. and it seems it's
>> easy to plug in analyzers for different languages.
>>
>> What if we're trying to index multiple languages in the same  
>> site?  Is
>> it best to have:
>>
>> 1/ one index for all languages
>> 2/ one index for all languages, with an extra language field so  
>> searches
>> can be constrained to a particular language
>> 3/ separate indices for each language?
>>
>> I don't fully understand the consequences in terms of performance for
>> 1/, but I can see that false hits could turn up where one word  
>> appears
>> in different languages (stemming could increase the changes of this).
>> Also some languages' analyzers are quite dramatically different (e.g.
>> the Chinese one which just treats every character as a separate
>> token/word).
>>
>> On the other hand, if people are searching for proper nouns in  
>> metadata
>> (e.g. "DSpace") it may be advantageous to search all languages at  
>> once.
>>
>>
>> I'm also not sure of the storage and performance consequences of 2/.
>>
>> Approach 3/ seems like it might be the most complex from an
>> implementation/code point of view.
>>
>> Does anyone have any thoughts or recommendations on this?
>>
>> Many thanks,
>>
>>  Robert Tansley / Digital Media Systems Programme / HP Labs
>>   http://www.hpl.hp.com/personal/Robert_Tansley/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message