lucene-java-user mailing list archives

From "Bob Cheung" <>
Subject RE: Indexing multiple languages
Date Fri, 03 Jun 2005 01:06:51 GMT
Hi Erik,

I am a newcomer to this list, so please allow me to ask a dumb question.

Does the StandardAnalyzer have to be modified to accept different
character encodings?

We have customers in China, Taiwan, and Hong Kong.  Chinese data may
come in three different encodings: Big5, GB, and UTF-8.

What is the default encoding for the StandardAnalyzer?
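For what it's worth, Lucene's analyzers operate on Java Strings and Readers, which are already Unicode internally; the encoding question is settled at the moment the raw bytes are decoded, before any analyzer sees the text. A minimal stdlib sketch of that decoding step (the class and method names here are illustrative, not Lucene API):

```java
public class DecodeDemo {
    // Decode raw bytes into a Unicode String using the charset the
    // source actually used; an analyzer then sees correct characters.
    static String decode(byte[] raw, String charsetName) throws Exception {
        return new String(raw, charsetName);
    }

    public static void main(String[] args) throws Exception {
        String original = "\u4e2d\u6587 text";  // "中文 text"
        // The same characters arrive as different bytes depending on
        // the source encoding (Big5, GB2312, UTF-8, ...).
        byte[] asUtf8 = original.getBytes("UTF-8");
        byte[] asBig5 = original.getBytes("Big5");
        // Decoding with the matching charset recovers identical Unicode.
        System.out.println(decode(asUtf8, "UTF-8").equals(original)); // true
        System.out.println(decode(asBig5, "Big5").equals(original));  // true
    }
}
```

So the practical answer may be less about changing the analyzer and more about reading each file with the right charset before indexing.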

Btw, I did try running the Lucene demo (web template) to index the HTML
files after I added one containing both English and Chinese characters.
I was not able to search for any Chinese in that HTML file (the search
returned no hits).  I wonder whether I need to change some of the Java
programs to index Chinese and/or accept Chinese as a search term.  I was
able to find the HTML file when I searched for an English word that
appeared in it.



On May 31, 2005, Erik wrote:

Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It
keeps English as-is (removing stop words, lowercasing, and such)
and splits CJK text into individual single-character tokens.


On May 31, 2005, at 5:49 PM, jian chen wrote:

> Hi,
> Interesting topic. I thought about this as well. I wanted to index
> Chinese text with English, i.e., I want to treat the English text
> inside Chinese text as English tokens rather than Chinese text tokens.
> Right now I think I may have to write a special analyzer that takes
> the text input and detects whether each character is ASCII: if it
> is, assemble the run of ASCII characters into one token; if not,
> make the character its own Chinese word token.
> So the bottom line is: one analyzer for all the text, with the
> if/else logic inside the analyzer.
> I would like to learn more thoughts about this!
> Thanks,
> Jian
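As a rough illustration of the if/else idea Jian describes, here is a minimal, Lucene-free sketch (the class name and all details are made up for illustration, not a real Lucene tokenizer):

```java
import java.util.ArrayList;
import java.util.List;

public class MixedTokenizer {
    // Runs of ASCII letters/digits become one lowercased English
    // token; every other non-whitespace character (e.g. a CJK
    // character) becomes its own single-character token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder ascii = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128 && Character.isLetterOrDigit(c)) {
                ascii.append(Character.toLowerCase(c)); // assemble ASCII run
            } else {
                if (ascii.length() > 0) {               // flush English token
                    tokens.add(ascii.toString());
                    ascii.setLength(0);
                }
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c));      // one token per char
                }
            }
        }
        if (ascii.length() > 0) tokens.add(ascii.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lucene\u7d22\u5f15test"));
        // -> [lucene, 索, 引, test]
    }
}
```

A real analyzer would wrap this logic in Lucene's Tokenizer/TokenStream machinery, but the branching is the same.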
> On 5/31/05, Tansley, Robert <> wrote:
>> Hi all,
>> DSpace currently uses Lucene to index metadata (Dublin Core
>> standard) and the extracted full-text content of documents stored
>> in it.  Now that the system is being used globally, it needs to
>> support multi-language indexing.
>> I've looked through the mailing list archives etc. and it seems it's
>> easy to plug in analyzers for different languages.
>> What if we're trying to index multiple languages in the same
>> site?  Is
>> it best to have:
>> 1/ one index for all languages
>> 2/ one index for all languages, with an extra language field so
>> searches
>> can be constrained to a particular language
>> 3/ separate indices for each language?
>> I don't fully understand the consequences in terms of performance for
>> 1/, but I can see that false hits could turn up where one word
>> appears
>> in different languages (stemming could increase the chances of this).
>> Also some languages' analyzers are quite dramatically different (e.g.
>> the Chinese one which just treats every character as a separate
>> token/word).
>> On the other hand, if people are searching for proper nouns in
>> metadata
>> (e.g. "DSpace") it may be advantageous to search all languages at
>> once.
>> I'm also not sure of the storage and performance consequences of 2/.
>> Approach 3/ seems like it might be the most complex from an
>> implementation/code point of view.
>> Does anyone have any thoughts or recommendations on this?
>> Many thanks,
>>  Robert Tansley / Digital Media Systems Programme / HP Labs
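A toy illustration of Robert's approach 2/ (one index plus a language field), with plain Java standing in for the index — none of the names below are actual Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

public class LanguageFieldDemo {
    // Stand-in for an indexed document carrying a language tag.
    static class Doc {
        final String lang, text;
        Doc(String lang, String text) { this.lang = lang; this.text = text; }
    }

    // Search every language, or constrain to one by passing a
    // non-null lang — the key property of approach 2/.
    static List<String> search(List<Doc> index, String term, String lang) {
        List<String> hits = new ArrayList<String>();
        for (Doc d : index) {
            if (lang != null && !lang.equals(d.lang)) continue; // filter
            if (d.text.contains(term)) hits.add(d.lang + ":" + d.text);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> index = new ArrayList<Doc>();
        index.add(new Doc("en", "DSpace stores documents"));
        index.add(new Doc("zh", "DSpace \u50a8\u5b58\u6587\u4ef6"));
        // A proper noun is found across languages when unconstrained:
        System.out.println(search(index, "DSpace", null).size()); // 2
        // Constrained to English only:
        System.out.println(search(index, "DSpace", "en").size()); // 1
    }
}
```

This shows why 2/ is attractive for the "DSpace" proper-noun case: the same index answers both the constrained and the cross-language query.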
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
