lucene-java-user mailing list archives

From jian chen <>
Subject Re: Indexing multiple languages
Date Tue, 31 May 2005 21:49:23 GMT

Interesting topic. I have thought about this as well. I want to index
Chinese text mixed with English, i.e., treat the English words inside
Chinese text as English tokens rather than as Chinese character tokens.

Right now I think I may have to write a special analyzer that takes
the text input and checks whether each character is an ASCII char. If
it is, the analyzer assembles consecutive ASCII chars into one token;
if not, it emits the character as a Chinese word token.

So the bottom line is: use just one analyzer for all the text and do
the if/else check inside the analyzer.
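
Here is a rough sketch of that if/else logic in plain Java, just to
make the idea concrete. It is not a real Lucene Analyzer; the class
name MixedTokenizer and the method tokenize are made up for
illustration, and it assumes that ASCII letters and digits form word
tokens, that every non-ASCII letter is its own token, and that
whitespace and punctuation are dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: shows only the splitting rule.
public class MixedTokenizer {

    /** Groups ASCII letters/digits into words; emits each non-ASCII letter alone. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder ascii = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 128 && Character.isLetterOrDigit(cp)) {
                // Still inside an ASCII word: keep accumulating.
                ascii.appendCodePoint(cp);
                continue;
            }
            // Any other character ends the current ASCII word, if there is one.
            if (ascii.length() > 0) {
                tokens.add(ascii.toString());
                ascii.setLength(0);
            }
            // Non-ASCII letters (e.g. Chinese characters) become single tokens;
            // whitespace and punctuation are simply skipped.
            if (cp >= 128 && Character.isLetter(cp)) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        // Flush a trailing ASCII word.
        if (ascii.length() > 0) {
            tokens.add(ascii.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u4e2d and \u6587 are two Chinese characters, written as escapes
        // so the source compiles regardless of file encoding.
        System.out.println(tokenize("\u4e2d\u6587Lucene index\u4e2d\u6587"));
    }
}
```

In a real analyzer the same branch would sit inside the tokenizer's
loop, and the ASCII tokens would probably also be lowercased (and
maybe stemmed) before indexing.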

I would like to hear more thoughts on this!



On 5/31/05, Tansley, Robert <> wrote:
> Hi all,
> DSpace currently uses Lucene to index metadata
> (Dublin Core standard) and extracted full-text content of documents
> stored in it.  Now that the system is being used globally, it needs to
> support multi-language indexing.
> I've looked through the mailing list archives etc. and it seems it's
> easy to plug in analyzers for different languages.
> What if we're trying to index multiple languages in the same site?  Is
> it best to have:
> 1/ one index for all languages
> 2/ one index for all languages, with an extra language field so searches
> can be constrained to a particular language
> 3/ separate indices for each language?
> I don't fully understand the consequences in terms of performance for
> 1/, but I can see that false hits could turn up where one word appears
> in different languages (stemming could increase the chances of this).
> Also some languages' analyzers are quite dramatically different (e.g.
> the Chinese one which just treats every character as a separate
> token/word).
> On the other hand, if people are searching for proper nouns in metadata
> (e.g. "DSpace") it may be advantageous to search all languages at once.
> I'm also not sure of the storage and performance consequences of 2/.
> Approach 3/ seems like it might be the most complex from an
> implementation/code point of view.
> Does anyone have any thoughts or recommendations on this?
> Many thanks,
>  Robert Tansley / Digital Media Systems Programme / HP Labs
