lucene-java-user mailing list archives

From Erik Hatcher
Subject Re: Indexing multiple languages
Date Tue, 31 May 2005 23:12:50 GMT

I'm very likely going to be using DSpace and some related  
technologies from the SIMILE project very soon :)

On May 31, 2005, at 5:08 PM, Tansley, Robert wrote:
> Hi all,
> DSpace currently uses Lucene to index metadata
> (Dublin Core standard) and extracted full-text content of documents
> stored in it.  Now the system is being used globally, it needs to
> support multi-language indexing.
> I've looked through the mailing list archives etc. and it seems it's
> easy to plug in analyzers for different languages.
> What if we're trying to index multiple languages in the same site?  Is
> it best to have:
> 1/ one index for all languages
> 2/ one index for all languages, with an extra language field so
> searches can be constrained to a particular language
> 3/ separate indices for each language?

I would vote for option #2 as it gives the most flexibility - you can  
query with or without concern for language.
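
As a rough illustration of what option #2 buys you, here is a toy
in-memory model in plain Java. It is not Lucene code: the class
LanguageFieldSketch and its methods are made up for this sketch, and
matching is a simple substring test rather than real analysis. The
point is only that one index with a language field answers both
kinds of query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of option 2 (hypothetical names, NOT the Lucene API):
// a single index where every document carries a language field.
public class LanguageFieldSketch {
    // One entry per document: { id, language, lower-cased text }.
    private final List<String[]> docs = new ArrayList<>();

    public void add(String id, String language, String text) {
        docs.add(new String[] {id, language, text.toLowerCase(Locale.ROOT)});
    }

    // Pass null as the language to search across all languages.
    public List<String> search(String term, String language) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String[] doc : docs) {
            boolean languageOk = language == null || language.equals(doc[1]);
            if (languageOk && doc[2].contains(needle)) {
                hits.add(doc[0]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        LanguageFieldSketch index = new LanguageFieldSketch();
        index.add("1", "en", "Digital library metadata");
        index.add("2", "fr", "Archive numerique metadata");
        System.out.println(index.search("metadata", null)); // [1, 2]
        System.out.println(index.search("metadata", "fr")); // [2]
    }
}
```

In a real Lucene index the language would be an untokenized field on
each document, and the constraint would be an extra required clause
or a filter on that field.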

> I'm also not sure of the storage and performance consequences of 2/.

Adding an additional field will be of little consequence for either
storage or performance.

> Approach 3/ seems like it might be the most complex from an
> implementation/code point of view.

I don't think #3 is all that complex to implement beyond the other  
options, except if you want to search across all languages - but the  
MultiSearcher can handle that.
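
To make the shape of option #3 concrete, here is a toy model in
plain Java. Again, this is not Lucene code: PerLanguageIndexSketch
and its methods are hypothetical, and searchAll simply concatenates
per-index hits, whereas MultiSearcher merges results from the
underlying searchers. It only shows where the extra work sits when
you search across languages.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of option 3 (hypothetical names, NOT the Lucene API):
// one separate index per language, plus a search that spans them all.
public class PerLanguageIndexSketch {
    // language -> documents, each document as { id, lower-cased text }.
    private final Map<String, List<String[]>> indexes = new LinkedHashMap<>();

    public void add(String id, String language, String text) {
        indexes.computeIfAbsent(language, k -> new ArrayList<>())
               .add(new String[] {id, text.toLowerCase(Locale.ROOT)});
    }

    // Search a single language's index.
    public List<String> searchOne(String language, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String[] doc : indexes.getOrDefault(language, new ArrayList<>())) {
            if (doc[1].contains(needle)) {
                hits.add(doc[0]);
            }
        }
        return hits;
    }

    // Search every index in turn and combine the hits.
    public List<String> searchAll(String term) {
        List<String> hits = new ArrayList<>();
        for (String language : indexes.keySet()) {
            hits.addAll(searchOne(language, term));
        }
        return hits;
    }
}
```

With this layout each language's index can be built with its own
analyzer, at the cost of the extra step in searchAll.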

> Does anyone have any thoughts or recommendations on this?

It's tough to give a general recommendation - it really depends on  
how each of these solutions fits into the architecture and what needs  
you have in terms of querying across multiple languages and such.

