lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Vergnaud <dvergn...@yahoo.com>
Subject Re: Designing a multilingual index
Date Thu, 01 Apr 2010 14:41:50 GMT
Hi,

thx for sharing your experience with us. I'm happy to see that both methods I've thought of
are apparently sensible ;-)

However, it might be due to my lack of experience in that domain, but some of your arguments
in favor of a multi-index solution seem to me to be also compatible with a single index, or
to apply only in a particular situation:

> Maintenance was easier this
> way; when refining/updating the settings (say 
> adding synonyms or stemmers
> for instance), you may need to reindex and smaller 
> indices allow faster deployments. 
If I understand this correctly, that is true if you're indexing distinct collections of documents
that are all in the same language, which you know in advance. In that case, you would of course
only need to reindex one of these collections into the corresponding index to update it. In
my case though, there is no such separation and only upon processing a document can I decide
what language it is in. I therefore cannot decide to update only documents in one language,
it's a "all or nothing". 

> It's also "dead-easy" to add a new language (esp. compared to
> the one index solution).
I'm not really sure why adding new languages should be complicated in the one-index solution:
the languages I can process (i.e. languages for which I have recognition data and the proper
analyzer) are listed in a configuration file. Upon recognizing one of these languages, the
module stores it in the internal data structure, which is then eventually passed to the indexing
module. The latter retrieves the language from the structure and uses it to create the corresponding
fields in the index dynamically (e.g. "content-de", "title-en" etc). It's all really automatic.


> It also makes replication or partitioning easier.
Not sure what you mean by that... 

> Users were able to set in which language they were "fluent" (default being
> browser locale) so queries would only be performed in those and results
> "clustered" per locale (no need to return results that can not be
> understood...). Besides, IMO, scoring / ordering documents in different
> languages is a bit like comparing apples and oranges.

So if I understand correctly you'd always only search in one of the indices? In that case,
I can understand why the multi-index solution works better for you. However, assuming most
of my users speak at least English *and* one, two or three of the other languages, all searches
have to be conducted in all languages. Besides, the various languages are not localized versions
of the same content, they may really be anything in any language, depending on the person
who wrote them and their mood at the time of writing. Some documents also contains parts in
English and parts in German... I'm definitely thinking about making it possible to enable/disable
some languages, but parallel searching will definitely be required. 

> Besides, IMO, scoring / ordering documents in different
> languages is a bit like comparing apples and oranges.
Not too sure about that. If for instance you were to search for "firewall" in a German/English
index, the word may appear in documents in both languages. Lucene's ranking algorithm is based
on the number of tokens and the number of occurrences in fields, and while of course the ratio
may vary (e.g. German tends to collate several words into one single entity, resulting in
one single token, while English rather uses phrases, resulting in several tokens), I still
think, assuming the user can understand both, that it kind of makes sense to rank a short
German document where "firewall" occurs 10 times higher than an English document where the
same word occurs only 5 times. 

> Finally, query expansion can also be used in the multiple indices case and
> might even use automated/guided translation.
I do agree about that, and I guess query expansion is much easier when dealing with a single,
consistent index structure. I'm thinking writing an algorithm to automatically expand a query
to all languages might be ok, but I'm worried about the performance of performing such a query
if the number of languages grows larger....

One other problem I'm seeing with the multi-index solution is the ranking of documents that
are spanned across several indices (multilingual documents). If say my search term matches
a document in the English index and the same document in the French index (which would quite
often be the case for e.g. proper names), then how do I get about mixing the two rankings?
(as I don't want to display the same result twice) I think using a single index would solve
that problem, since the ranking would already take all fields (hence all languages into account).


Cheers,

David

----- Original Message ----
From: henrib <henrib@apache.org>
To: java-user@lucene.apache.org
Sent: Thu, April 1, 2010 2:19:07 PM
Subject: Re: Designing a multilingual index


Hi,
I worked some time ago on a similar system (using Solr) and used the
multiple indices route (the multicore feature in Solr). In our case, the
"same" document could exist in different languages; different localized
versions of the same information (same Solr unique id for each l10n
version). 

This allowed to have the same index structure across locales but different
settings for each (synonyms, stemmers, etc). Maintenance was easier this
way; when refining/updating the settings (say adding synonyms or stemmers
for instance), you may need to reindex and smaller indices allow faster
deployments. It's also "dead-easy" to add a new language (esp. compared to
the one index solution). It also makes replication or partitioning easier.
Overall, IMO, this is a more scalable architecture than the single-index
one.

Users were able to set in which language they were "fluent" (default being
browser locale) so queries would only be performed in those and results
"clustered" per locale (no need to return results that can not be
understood...). Besides, IMO, scoring / ordering documents in different
languages is a bit like comparing apples and oranges.

Finally, query expansion can also be used in the multiple indices case and
might even use automated/guided translation.

In my experience, multiple indices had many advantages over the single index
solution, be them functional or operational. YMMV.
Hope this helps,
Henrib

-- 
View this message in context: http://n3.nabble.com/Designing-a-multilingual-index-tp688766p690625.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message