lucene-java-user mailing list archives

From Olivier Jaquemet <olivier.jaque...@jalios.com>
Subject Multiple Language Indexing and Searching
Date Mon, 05 Sep 2005 08:44:32 GMT
Hi,

I'd like to go into detail about the issues that occur when you want to 
index and search content in multiple languages.

I have read the Lucene in Action book and many threads on this mailing 
list, the most interesting so far being this one:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3c19ADCC0B9D4CAD4582BB9900BBCE35740194503C@tayexc13.americas.cpqcorp.net%3e

The solution chosen/recommended by Doug Cutting in this message:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200506.mbox/%3c42A0841C.2090202@apache.org%3e
is number 2:
Having one index for all languages, with one Document per language 
version of each content item, a field specifying that Document's 
language, and a query filter on that field when searching.
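
To make option 2 concrete, here is a minimal sketch in plain Java. It is 
not the Lucene API: SharedIndexSketch, Doc, add and search are 
hypothetical names, and a substring match stands in for real analysis 
and scoring. It only illustrates the shape of the idea: one entry per 
(content, language) pair, and a language restriction applied at search 
time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical in-memory stand-in for a single shared index (not Lucene).
final class SharedIndexSketch {

    // One entry per (content, language) pair.
    static final class Doc {
        final String contentId;
        final String lang;
        final String text;

        Doc(String contentId, String lang, String text) {
            this.contentId = contentId;
            this.lang = lang;
            this.text = text;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    void add(String contentId, String lang, String text) {
        docs.add(new Doc(contentId, lang, text));
    }

    // Returns the ids of content whose text contains the term,
    // restricted to entries tagged with the requested language.
    List<String> search(String term, String lang) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            boolean languageMatches = d.lang.equals(lang);
            boolean textMatches = d.text.toLowerCase(Locale.ROOT).contains(needle);
            if (languageMatches && textMatches) {
                hits.add(d.contentId);
            }
        }
        return hits;
    }
}
```

Note that the same content id appears once per language, which is the 
N-to-1 Document/content relation discussed further down.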

While I think it is a good solution, I have some concerns:
- If you have N languages and you search for something in one language, 
you are going to search an index N times larger than needed.
Wouldn't it be better to have N indices for N languages? That way, each 
index could benefit from its own specialized analyzer, and if you need 
to search in multiple languages, you just need to merge the results 
from those different indices.
- If you have content in multiple languages like we do (and by that I 
don't mean multiple content items each having its own language, but 
multiple content items each available in many languages), you are going 
to have an N-to-1 Document/content relation in the index.
As far as update, delete, and search in multiple languages are 
concerned, wouldn't it be simpler to always keep a 1-to-1 
Document/content relation in an index?

As you may have guessed, my original thought, even before I read those 
threads, was that solution number 3 might be more flexible/modular 
than the others. Of course, it also has its drawbacks:
- performance issues when doing multi-language searches, especially 
when merging results from different indices
- more complex to code
- other?
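
The merging step for option 3 could be sketched as follows, again in 
plain Java rather than with the Lucene API. MultiIndexMergeSketch, Hit 
and mergeTopK are hypothetical names. The sketch assumes each 
per-language index returns hits with a numeric score, and it treats 
those scores as directly comparable. That is a simplifying assumption: 
scores computed against separate indices are based on different term 
statistics, so a real merge would probably need some normalization.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of merging hits from N per-language indices (not Lucene).
final class MultiIndexMergeSketch {

    static final class Hit {
        final String contentId;
        final String lang;
        final double score;

        Hit(String contentId, String lang, double score) {
            this.contentId = contentId;
            this.lang = lang;
            this.score = score;
        }
    }

    // Concatenates the per-language hit lists, orders them by descending
    // score, and keeps at most k hits. Assumes scores are comparable
    // across indices, which is not guaranteed in practice.
    static List<Hit> mergeTopK(List<List<Hit>> perLanguageHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perLanguageHits) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

If one content item exists in several languages, it can appear several 
times in the merged list (once per language), so collapsing hits by 
content id would be an extra step on top of this.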

Can you clarify this?
Which solutions have you all chosen so far for indexing and searching 
multiple content items in multiple languages?

Thanks!

Olivier




