lucene-general mailing list archives

From Ilya Zavorin
Subject multilingual word/term lookup -- which side should I index?
Date Tue, 01 Nov 2011 14:38:32 GMT
Hello everyone,

I am writing an application that is supposed to do multilingual word/term lookup in text documents.
Basically, I have a wordlist (WL) that has terms in different languages. Actually, the WL
consists of a set of English terms and then corresponding translations of these terms into
other languages. We receive documents that are most likely not in English, although I don't
have any automatic tools that would detect the language or languages of an incoming document
and/or segment the doc into language blocks if several languages are present. I have to
search for all terms from the WL in each doc. If there is a hit and it's not in English, I need
to provide the user with an English translation of it, which I can do by using the links between
terms in different languages provided in the WL.

The question is which side to index, the docs or the WL. The 1st option that comes to mind
is to index the docs. That is, for each incoming doc that I would need to search, I would
create N indices where N is the number of languages appearing in the WL. When creating these
indices, I would use the different language analyzers that are available with Lucene. Then
I would use each block of terms from WL in a given language to search against the corresponding
index (i.e. French terms against the French index, Chinese terms against the Chinese index
etc). I would have to index each incoming document N times, once per language, but I would not
have to apply any language analysis tools to the WL entries to segment them, since they are already
given as separate terms. At the same time, I'd have to make sure I delete the indices when
a document is processed and removed from the system.
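
To make the first option concrete, here is a minimal sketch in plain Java. It does not use Lucene: the two analyze methods are crude, hypothetical stand-ins for Lucene's language analyzers, each "index" is just a set of tokens, and the language codes and sample terms are made up for illustration. It also assumes every WL term is a single lowercase token; multi-word terms would need phrase matching.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

final class PerDocumentIndexSketch {

    // Hypothetical generic analyzer: lowercase, then split on anything
    // that is not a letter or an apostrophe.
    static List<String> analyzeDefault(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Hypothetical French analyzer: additionally drops everything up to the
    // first apostrophe, so "l'ordinateur" becomes "ordinateur". Very crude.
    static List<String> analyzeFrench(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : analyzeDefault(text)) {
            String stripped = t.substring(t.indexOf('\'') + 1);
            if (!stripped.isEmpty()) {
                tokens.add(stripped);
            }
        }
        return tokens;
    }

    // Builds one token set ("index") per language for a single document.
    static Map<String, Set<String>> indexDocument(
            String doc, Map<String, Function<String, List<String>>> analyzers) {
        Map<String, Set<String>> indices = new HashMap<>();
        for (Map.Entry<String, Function<String, List<String>>> e : analyzers.entrySet()) {
            indices.put(e.getKey(), new HashSet<>(e.getValue().apply(doc)));
        }
        return indices;
    }

    // Looks up each language block of the WL in the index for that language.
    // Returns "lang:term" -> English translation, sorted by key.
    static Map<String, String> lookup(
            Map<String, Set<String>> indices, Map<String, Map<String, String>> wordlist) {
        Map<String, String> hits = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> block : wordlist.entrySet()) {
            Set<String> index = indices.getOrDefault(block.getKey(), Collections.emptySet());
            for (Map.Entry<String, String> term : block.getValue().entrySet()) {
                if (index.contains(term.getKey())) {
                    hits.put(block.getKey() + ":" + term.getKey(), term.getValue());
                }
            }
        }
        return hits;
    }

    // Made-up sample data: a two-language WL and one mixed-language document.
    static Map<String, String> demoHits() {
        Map<String, Function<String, List<String>>> analyzers = new HashMap<>();
        analyzers.put("fr", PerDocumentIndexSketch::analyzeFrench);
        analyzers.put("de", PerDocumentIndexSketch::analyzeDefault);

        Map<String, Map<String, String>> wordlist = new HashMap<>();
        wordlist.put("fr", new HashMap<>());
        wordlist.get("fr").put("ordinateur", "computer");
        wordlist.get("fr").put("maison", "house");
        wordlist.put("de", new HashMap<>());
        wordlist.get("de").put("haus", "house");
        wordlist.get("de").put("rechner", "computer");

        String doc = "L'ordinateur est dans la maison. Das Haus ist alt.";
        // The indices are local objects here, so they vanish with the document.
        Map<String, Set<String>> indices = indexDocument(doc, analyzers);
        return lookup(indices, wordlist);
    }

    public static void main(String[] args) {
        System.out.println(demoHits());
    }
}
```

In this sketch the French term "ordinateur" is found only through the French token set, because the generic analyzer keeps "l'ordinateur" as one token. That is the kind of difference the per-language indices are meant to capture.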

But would it make sense to do it the other way around and index the WL and then search each
doc against this index (well, these indices, to be precise)? Would I need to perform some
sort of language-dependent segmentation of a document before I can do a meaningful search
of it against the WL indices? Any other caveats? What do I have to take into account when
deciding which direction to go?
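
For comparison, the second option might look roughly like the sketch below, again in plain Java rather than Lucene. The WL is loaded once into a term-to-entries map, and each incoming document is tokenized and looked up token by token. The tokenize method is a hypothetical, language-neutral splitter that only suits scripts with explicit word boundaries, which is exactly where the segmentation question comes in; the class name, language codes and sample terms are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class WordlistIndexSketch {

    // One WL entry: the language of the term and its English translation.
    static final class Entry {
        final String language;
        final String english;

        Entry(String language, String english) {
            this.language = language;
            this.english = english;
        }
    }

    // The "index": term -> all WL entries spelled that way. The same
    // spelling can belong to more than one language.
    private final Map<String, List<Entry>> index = new HashMap<>();

    void add(String language, String term, String english) {
        index.computeIfAbsent(term.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
             .add(new Entry(language, english));
    }

    // Hypothetical language-neutral tokenizer: lowercase, split on
    // non-letters. It would not segment text written without spaces.
    static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String t : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Looks every document token up in the WL index, in document order.
    List<String> search(String doc) {
        List<String> hits = new ArrayList<>();
        for (String token : tokenize(doc)) {
            for (Entry e : index.getOrDefault(token, Collections.emptyList())) {
                hits.add(token + " [" + e.language + "] -> " + e.english);
            }
        }
        return hits;
    }

    // Made-up sample data for illustration only.
    static List<String> demoHits() {
        WordlistIndexSketch wl = new WordlistIndexSketch();
        wl.add("fr", "maison", "house");
        wl.add("fr", "ordinateur", "computer");
        wl.add("de", "haus", "house");
        return wl.search("La maison et das Haus");
    }

    public static void main(String[] args) {
        for (String hit : demoHits()) {
            System.out.println(hit);
        }
    }
}
```

Here the WL index is built once and reused for every document, so nothing has to be deleted per document, but the quality of the hits depends entirely on how well the document tokens line up with the WL terms.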

Thank you in advance for any info

Ilya Zavorin
