lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Vergnaud <>
Subject Designing a multilingual index
Date Wed, 31 Mar 2010 16:20:25 GMT
Hi everyone!

I'm about to build a search engine that will handle documents in several languages (4 for
now but the number will increase in the near future). In order to index them properly and
offer the best user experience, I'm automatically recognizing the language of each document
in order to be able to use appropriate language-dependent analysis, so that e.g. "the" is
recognized as a stopword in an English document, but is interpreted as potentially meaning
"thé" in a French document. I also want stemming rules to be applied depending on the document's
language, so that a search for "bank" matches "Banken" in a German document and "banks" in
an English document -- and not the other way round. 

Now I've thought of two possible architectures to achieve this, perhaps some experienced Lucene
user might give me some advice on which approach would be better -- or is there yet another
one which would be even more appropriate? 

The first method I'm thinking of, and I've already partially implemented, is to have several
indices -- one per language. Upon recognizing the language of a document, a flag is set in
my internal data structure, and this is used by the indexing module to decide which index
the document should be added to. There is an additional index for documents where no language
could be recognized -- in that case, a simple tokenizer is used. 
As I see it, one advantage of this approach is that all indices share the same structure,
which makes it easier to build queries. Also, they can all be searched parallely, but I'm
not sure that this is a great advantage. However, one drawback might be that searching several
smaller indices might not be as efficient as searching one big index containing all documents.
Besides, I'm planning on improving the processing to allow a single document to be assigned
several languages (e.g. paragraph-based recognition), and this architecture would mean that,
in order to properly analyze the parts, a multilingual document would have to be split between
several indices, therefore either repeating common information (like e.g. document name) or
having yet another index containing language-independent information. This might quickly become
rather cumbersome to manage. 

The second method I've thought of is to have all languages in the same index and use different
analyzers on fields that require analysis. In order to do that, I was thinking of extending
the names of the fields with the names of the languages -- like e.g. "content-en" vs "content-fr"
vs "content-xx" (for "no language recognized"). Then using a customized analyzer, the name
of the field would be parsed in method tokenStream and the proper language-dependent analyzer
would be selected. 
The drawback of this method, as I see it, is that the number of fields in the index increases
drastically, which in turn means that building queries becomes rather cumbersome -- but still
doable, assuming (which also is the case) that I know the exact list of languages I'm dealing
with. Also, it means that Lucene would be searching in non-existing fields in most documents,
as I doubt many of them would contain *all* languages. But it keeps the complete information
about one document gathered in one place and requires searching only one index. 

As I said, I've already implemented the first method some time ago and it works fine. I've
only just thought about the second one when I read about this PerFieldAnalyzerWrapper, which
allows to do just what I want in the second method. Since my index won't be that big at first,
I doubt either architecture would prove to be much more efficient than the other, however
I want to use a scaleable design right from the start, so I was wondering whether some Lucene
gurus might give me some insights as to what in their eyes would be the better approach --
or whether there might be a different, much better technique I haven't thought of. 

Thanks a lot in advance for your support and ideas!



To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message