lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sergiu gordea <gser...@ifit.uni-klu.ac.at>
Subject Re: Indexing multiple languages
Date Tue, 07 Jun 2005 07:02:32 GMT
Tansley, Robert wrote:

>Hi all,
>
>The DSpace (www.dspace.org) currently uses Lucene to index metadata
>(Dublin Core standard) and extracted full-text content of documents
>stored in it.  Now the system is being used globally, it needs to
>support multi-language indexing.
>
>I've looked through the mailing list archives etc. and it seems it's
>easy to plug in analyzers for different languages.
>
>What if we're trying to index multiple languages in the same site?  Is
>it best to have:
>
>1/ one index for all languages
>2/ one index for all languages, with an extra language field so searches
>can be constrained to a particular language
>3/ separate indices for each language?
>
>I don't fully understand the consequences in terms of performance for
>1/, but I can see that false hits could turn up where one word appears
>in different languages (stemming could increase the changes of this).
>Also some languages' analyzers are quite dramatically different (e.g.
>the Chinese one which just treats every character as a separate
>token/word).
>  
>
>On the other hand, if people are searching for proper nouns in metadata
>(e.g. "DSpace") it may be advantageous to search all languages at once.
>
>
>I'm also not sure of the storage and performance consequences of 2/.
>
>Approach 3/ seems like it might be the most complex from an
>implementation/code point of view.  
>  
>
But this will be the most robust solution. You have to differentiate 
between languages anyway,
and as you pointed here, you can differentiate by adding a Keyword field 
for language, or you can create different
indexes.

If you need to use complex search strings over multiple fields and 
indexes then I recommend you to use the QueryParser
to compute the search string. When you instantiate a QueryPArser you 
will need to provide an analyzer, that will be different
for different languages.

I think that the differences in performance won't be noticable between  
2nd and 3rd solutions, but from maintenance point of
view, I would choose the third solution.

Of course there are other factors that must be take in account when 
designing such an application:
number of documents to be indexed, number of document fields, index 
change frequency, server load (number of concurrent sessions), etc.

 Hope this hints help you a little,

 Best,

 Sergiu



>Does anyone have any thoughts or recommendations on this?
>
>Many thanks,
>
> Robert Tansley / Digital Media Systems Programme / HP Labs
>  http://www.hpl.hp.com/personal/Robert_Tansley/
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>For additional commands, e-mail: java-user-help@lucene.apache.org
>
>  
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message